Machine learning algorithms can fail in two ways: underfitting and overfitting. Un-
derfitting is simple to understand: it happens when the learner cannot find a solution
that fits the observed data well enough. If we try to fit with a linear regression (x, y)
pairs for which y is actually a quadratic function of x, then we are bound to have
underfitting. There will be important aspects of the data that our learner cannot cap-
ture. If the learner needs a richer model in order to capture the data, it is underfitting.
Sometimes, underfitting is due not just to the limitations of the family of functions but
also to limitations on computational resources. For example, deep neural
networks or recurrent neural networks can be very difficult to optimize (and in theory
finding the global optimum of the training objective can be NP-hard). With a limited
computational budget (e.g., a limited number of training iterations of an iterative opti-
mization procedure), the learner may not be able to fit the training data well enough.
It could also be that the optimization procedure easily gets stuck in local minima of
the objective function, and that vastly more expensive optimization procedures would be
needed to fully exploit the data and the theoretical richness of the family of functions
in which the learner searches for a solution.
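As a concrete illustration of the first failure mode, the short sketch below (hypothetical code
using NumPy; the sample size, noise level and polynomial degrees are arbitrary choices, not
taken from this chapter) fits a straight line to data whose targets are quadratic in the input:
even on the training set the error stays large, while a quadratic fit of the same data succeeds.

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(-3, 3, size=50)
    y = x ** 2 + 0.1 * rng.randn(50)        # targets are quadratic in x, plus noise

    # Least-squares fit of a degree-1 polynomial (a line): the family is too poor and underfits.
    line = np.polyfit(x, y, deg=1)
    train_mse_line = np.mean((np.polyval(line, x) - y) ** 2)

    # Least-squares fit of a degree-2 polynomial: matches the data-generating family.
    quad = np.polyfit(x, y, deg=2)
    train_mse_quad = np.mean((np.polyval(quad, x) - y) ** 2)

    print(train_mse_line, train_mse_quad)   # the linear fit has much larger training error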
Overfitting is more subtle and not so easy to wrap one’s mind around. What would
be wrong with picking a very rich family of functions that is large enough to always fit
our training data? We could certainly fit the training data all right, but we might lose
the ability to generalize well. Why? The fundamental reason is that if the family of
functions accessible to the learner (with its limitations on computational resources) is
too large, then this family of functions may actually contain many functions which all
fit the training data well. Without sufficient data, we would therefore not be able to
distinguish which one was most appropriate, and the learner would make an arbitrary
choice among these apparently good solutions. Yet, these functions are all different from
each other, yielding different answers in at least some cases. Even if one of these functions
were the correct one, we might pick the wrong one, and the expected generalization
error would thus be at least proportional to the average discrepancy between these
functions, which is related to the variance of the estimator, discussed below (Section 5.5).
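To make this concrete, the sketch below (hypothetical code using NumPy; the degree-9
polynomial, training sets of ten points and sine-shaped target are arbitrary illustrative
choices) repeatedly fits a very rich model to small, freshly sampled training sets: each fit
reaches near-zero training error, yet the fitted functions disagree strongly away from their
training points, and their test error is far larger than their training error.

    import numpy as np

    rng = np.random.RandomState(0)
    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(2 * np.pi * x_test)

    predictions = []
    for _ in range(5):
        # A fresh, small training set drawn from the same distribution.
        x_train = rng.uniform(-1, 1, size=10)
        y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.randn(10)

        # Degree-9 polynomial: enough capacity to fit ten points almost exactly.
        coeffs = np.polyfit(x_train, y_train, deg=9)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        predictions.append(np.polyval(coeffs, x_test))
        print(train_mse, test_mse)          # near-zero training error, much larger test error

    # The fitted functions differ from run to run: their spread (variance) is what
    # drives the gap between training and generalization error.
    print(np.mean(np.var(np.stack(predictions), axis=0)))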
Occam’s razor suggests picking the family of functions just large enough to leave only one
choice that fits the data well. Statistical learning theory has quantified this notion more
precisely, and shown a mathematical relationship between capacity (e.g. measured by the
VC-dimension) and the difference between generalization error and training error (also
known as optimism). Although generalization error cannot generally be characterized
exactly because it depends on the specifics of the training distribution, error bounds can
be proven, showing that the discrepancy between training error and generalization error
is bounded by a quantity that grows with the ratio of capacity to number of training
examples (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik,
1995).
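One representative bound of this kind, in the form popularized by Vapnik (the exact constants
vary across the results cited above), states that with probability at least 1 - η over the draw
of N training examples, every function f in a family of VC-dimension h satisfies

    generalization error(f)  <=  training error(f) + sqrt( (h (ln(2N/h) + 1) - ln(η/4)) / N ),

so the gap between the two errors shrinks as the number of examples N grows relative to the
capacity h.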
Since training error decreases as a function of capacity while the difference between
generalization error and training error increases with capacity, their sum, the generalization
error, typically first decreases and then increases: it is a U-shaped function of capacity.
This is illustrated
in Figure 5.3, showing the underfitting regime (left) and overfitting regime (right).
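The same U-shape can be reproduced numerically with a sketch along the following lines
(hypothetical code using NumPy; the sine-shaped target, sample sizes and range of degrees
are arbitrary choices): as the polynomial degree, standing in for capacity, increases, training
error decreases while test error first decreases and then increases.

    import numpy as np

    rng = np.random.RandomState(0)
    x_train = rng.uniform(-1, 1, size=30)
    y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.randn(30)
    x_test = rng.uniform(-1, 1, size=1000)
    y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.randn(1000)

    for degree in range(1, 16):             # degree plays the role of capacity
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, train_mse, test_mse)  # training error falls; test error traces a U-shape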