large, then this family of functions may actually contain many functions which all fit the
training data well. Without sufficient data, we would therefore not be able to distinguish
which one was most appropriate, and the learner would make an arbitrary choice among
these apparently good solutions. Yet, these functions are all different from each other,
yielding different answers in at least some potential cases. Even if one of these functions
was the correct one, we might pick the wrong one, and the expected generalization
error would thus be at least proportional to the average discrepancy between these
functions, which is related to the variance of the estimator, discussed below (Section 5.5).
Occam’s razor suggests picking the family of functions just large enough to leave only one
choice that fits the data well. Statistical learning theory has quantified this notion more
precisely, and shown a mathematical relationship between capacity (e.g. measured by the
VC-dimension) and the difference between generalization error and training error (also
known as optimism). Although generalization error cannot generally be characterized
exactly because it depends on the specifics of the training distribution, error bounds can
be proven, showing that the discrepancy between training error and generalization error
is bounded by a quantity that grows with the ratio of capacity to number of training
examples (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik,
1995).
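One classical bound of this flavor (stated here informally for binary classifiers with 0–1 loss, as a sketch of the general form rather than the tightest available result) says that with probability at least $1-\delta$ over the draw of $n$ training examples, every function $f$ in a family of VC-dimension $h$ satisfies
$$
R(f) \;\le\; \hat{R}(f) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h}+1\right) + \ln\frac{4}{\delta}}{n}},
$$
where $R(f)$ is the generalization error and $\hat{R}(f)$ the training error. The optimism term on the right indeed grows with the ratio of capacity $h$ to the number of training examples $n$.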
Since training error decreases as a function of capacity, while the difference between
generalization error and training error increases with capacity, their sum, the generalization
error, is typically a U-shaped function of capacity: first decreasing, then increasing. This is illustrated
in Figure 5.3, showing the underfitting regime (left) and the overfitting regime (right).
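To make this concrete, the following small numerical sketch (an illustration added here, using synthetic one-dimensional regression data and NumPy's polynomial least-squares fit as the capacity knob; none of these specifics come from the text) fits polynomials of increasing degree to a fixed training set and reports both training and held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: noisy samples of a smooth target function.
def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test = 20, 1000
x_train = rng.uniform(0, 1, n_train)
y_train = target(x_train) + 0.1 * rng.normal(size=n_train)
x_test = rng.uniform(0, 1, n_test)
y_test = target(x_test) + 0.1 * rng.normal(size=n_test)

# Capacity increases with the polynomial degree.
for degree in [0, 1, 3, 5, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

In runs of this kind, training error shrinks steadily with degree, while held-out error first improves and then degrades, tracing out the U shape described above.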
How do we know if we are underfitting or overfitting? If increasing capacity decreases
generalization error, then we are underfitting; otherwise, we are overfitting.
The capacity associated with the transition from underfitting to overfitting is the optimal
capacity.
An interesting question is how one can select the optimal capacity, and how we expect it
to vary. We can use a validation set to monitor generalization error as a function of ca-
pacity and thus estimate the optimal capacity empirically. Methods such as cross-validation
(TODO) allow one to estimate generalization error even when the dataset is small, by
considering many possible splits of the data into training and validation examples. What
does theory tell us about the optimal capacity? Because theory predicts that optimism
roughly increases with the ratio of capacity to the number of training examples (Vapnik,
1995), the optimal capacity should increase with the number of training ex-
amples, as illustrated in Figure ??. This also agrees with intuition: with more training
examples, we can “afford” a more complex model. If one only sees a single example
$(x_1, y_1)$, a very reasonable prediction for a function $y = f(x)$ is the constant function
$f(x) = y_1$. If one sees two examples $(x_1, y_1)$ and $(x_2, y_2)$, a very reasonable prediction is
the straight line that passes through these two points, i.e., a linear function of x. As we
see more examples, it makes sense to consider higher order polynomials, but certainly
not of higher order than the number of training examples.
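As a rough illustration of both points, validation-based capacity selection and the growth of optimal capacity with the amount of data, here is a small sketch (again on synthetic data, with polynomial degree standing in for capacity; the setup is hypothetical and not taken from the text) that picks, for several training set sizes, the degree minimizing error on a held-out validation set.

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    return np.sin(2 * np.pi * x)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, target(x) + 0.1 * rng.normal(size=n)

# For several training set sizes, select the polynomial degree (capacity)
# that minimizes mean squared error on a held-out validation set.
x_val, y_val = make_data(500)
for n_train in [2, 5, 20, 100]:
    x_tr, y_tr = make_data(n_train)
    best_degree, best_val_mse = None, np.inf
    for degree in range(0, min(n_train, 15)):  # keep degree below the number of examples
        coeffs = np.polyfit(x_tr, y_tr, degree)
        val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
        if val_mse < best_val_mse:
            best_degree, best_val_mse = degree, val_mse
    print(f"n_train = {n_train:3d}: selected degree {best_degree}")
```

In runs of this kind, the validation set typically favors a constant or linear fit when only a couple of examples are available, and higher-degree polynomials as the training set grows, consistent with the intuition above.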
With this notion in mind, let us consider an example. We can fit a polynomial of
degree n to a dataset by using linear regression on a transformed space. Instead of fitting