
Chapter 7. Regularization of Deep or Distributed Models
In the context of deep learning, most regularization strategies are based on
regularizing estimators. Regularization of an estimator works by trading increased
bias for reduced variance. An effective regularizer is one that makes a profitable
trade, that is, it reduces variance significantly while not overly increasing the
bias. When we discussed generalization and overfitting in Chapter 5, we focused
on three situations, in which the model family being trained either (1) excluded the
true data generating process, corresponding to underfitting and inducing bias,
(2) matched the true data generating process, the "just right" model space,
or (3) included the generating process but also many other possible generating
processes, the regime where variance dominates the estimation error (as
measured, for example, by the MSE; see Section 5.5).
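To make this trade concrete, here is a minimal sketch (not code from this chapter; the sinusoidal data generating process, polynomial model family, sample sizes, and grid of regularization strengths are all illustrative assumptions) that fits L2-regularized linear regression to many independent training sets and estimates the squared bias and the variance of the resulting predictions. Stronger regularization increases the bias but reduces the variance.

```python
# Illustrative sketch: how an L2 penalty trades increased bias for reduced variance.
# We repeatedly fit ridge regression on fresh noisy training sets drawn from the
# same (assumed) data generating process, then measure bias^2 and variance of the
# predictions at fixed test inputs. All constants below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)        # assumed "true" data generating process

def features(x, degree=9):
    return np.vander(x, degree + 1)     # over-complete polynomial model family

def fit_ridge(X, y, lam):
    d = X.shape[1]
    # closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_test = np.linspace(0, 1, 50)
Phi_test = features(x_test)
f_test = true_fn(x_test)

for lam in [1e-8, 1e-3, 1e-1, 1.0]:      # from nearly unregularized to heavily regularized
    preds = []
    for _ in range(200):                 # 200 independent training sets
        x_tr = rng.uniform(0, 1, size=25)
        y_tr = true_fn(x_tr) + rng.normal(0, 0.3, size=25)
        w = fit_ridge(features(x_tr), y_tr, lam)
        preds.append(Phi_test @ w)
    preds = np.array(preds)              # shape (200, 50)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"lam={lam:g}  bias^2={bias2:.3f}  variance={var:.3f}  sum={bias2 + var:.3f}")
```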
Note that, in practice, an overly complex model family does not necessarily
include (or even come close to) the target function or the true data generating
process. We almost never have access to the true data generating process, so
we can never know whether the model family being estimated includes it or not.
But since, in deep learning, we often work with data such as images, audio
sequences and text, we can probably safely assume that our model family does
not include the data generating process. We can assume that, to some extent,
we are always trying to fit a square peg (the data generating process) into a
round hole (our model family), using the data to do so as best we can.
What this means is that controlling the complexity of the model is not simply
a question of finding the model of the right size, i.e., with the right number
of parameters. Instead, we might find (and in practical deep learning scenarios,
we almost always do find) that the best fitting model, in the sense of minimizing
generalization error, is a large model whose parameters are not entirely free to
span their domain.
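As a minimal sketch of what "not entirely free to span their domain" can mean in practice, the following illustrative code (the quadratic data loss, learning rate, and decay coefficient are assumptions, not anything from the text) applies gradient descent with an L2 weight decay term, which pulls every parameter toward zero rather than letting it settle wherever the data loss alone would place it.

```python
# Illustrative sketch: a gradient step on loss(w) + (weight_decay / 2) * ||w||^2.
# The penalty term shrinks every weight toward zero on each update.
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, lr=0.1, weight_decay=1e-2):
    """One gradient step on the penalized objective."""
    return w - lr * (grad_loss(w) + weight_decay * w)

# Example: quadratic data loss 0.5 * ||w - w_star||^2, whose unregularized
# minimizer is w_star; weight decay pulls the solution toward the origin.
w_star = np.array([3.0, -2.0, 0.5])
grad_loss = lambda w: w - w_star

w = np.zeros_like(w_star)
for _ in range(1000):
    w = sgd_step_with_weight_decay(w, grad_loss)
print(w)  # converges to w_star / (1 + weight_decay), not to w_star itself
```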
As we will see, there are a great many forms of regularization available to the
deep learning practitioner. In fact, developing more effective regularizers has
been one of the major research efforts in the field.
Most machine learning tasks can be viewed in terms of learning to represent
a function $\hat{f}(\bm{x})$ parametrized by a vector of parameters $\bm{\theta}$.
The data consists of inputs $\bm{x}^{(t)}$ and (for some tasks) targets $y^{(t)}$
for $t \in \{1, \dots, m\}$. In the case of classification, each $y^{(t)}$ is an
integer class label in $\{1, \dots, k\}$. For regression tasks, each $y^{(t)}$
is a real number or a real-valued vector (and we then denote the target in bold,
as $\bm{y}^{(t)}$), while in the case of a density estimation task, there are
simply no targets (or we consider both $\bm{x}^{(t)}$ and $\bm{y}^{(t)}$ as the
values of observed variables $(\bm{x}, \bm{y})$ whose joint distribution is to be
captured, and we may simply lump both of them into the name $\bm{x}$). We may
group these examples into a design matrix $\bm{X}$ and a vector of targets
$\bm{y}$ (when $y^{(t)}$ is a scalar), or a matrix of targets $\bm{Y}$ (when
$\bm{y}^{(t)}$ is a vector).
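As a concrete (and purely illustrative) reading of this notation, the sketch below builds arrays with the shapes just described; the particular dimensions are arbitrary assumptions.

```python
# Illustrative shapes for the notation above: m examples x^(t) stacked as rows of a
# design matrix X, a target vector y when each y^(t) is a scalar, and a target
# matrix Y when each y^(t) is a vector. All dimensions are arbitrary assumptions.
import numpy as np

m, n_features, k_classes, n_outputs = 100, 5, 3, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(m, n_features))          # design matrix: row t is x^(t)
y_class = rng.integers(1, k_classes + 1, m)   # classification: y^(t) in {1, ..., k}
y_reg = rng.normal(size=m)                    # regression, scalar targets: vector y
Y_reg = rng.normal(size=(m, n_outputs))       # regression, vector targets: matrix Y
# density estimation: no separate targets; the model captures the distribution of X alone
```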