Chapter 7
Regularization
A central problem in machine learning is how to make an algorithm that will perform
well not just on the training data, but also on new inputs. The main strategy for
achieving good generalization is known as regularization. Regularization consists of
putting extra constraints on a machine learning model, such as restrictions of parameter
values or extra terms in the cost function, that are not designed to help fit the training
set. If chosen carefully, these extra constraints can lead to improved performance on the
test set, either by encoding prior knowledge into the model, or by forcing the model to
consider multiple hypotheses that explain the training data. Sometimes regularization
also helps to make an underdetermined problem determined.
This chapter builds on the concepts of generalization, overfitting, underfitting, bias
and variance introduced in Chapter 5. If you are not already familiar with these notions,
please refer to that chapter before continuing with the more advanced material presented
here.
Simply put, regularizers work by trading increased bias for reduced variance. An
effective regularizer is one that makes a profitable trade, that is, it reduces variance
significantly without overly increasing the bias.
When we discussed generalization and overfitting in Chapter 5, we focused on three situations: the model family being trained either (1) excluded the true data generating process, corresponding to underfitting and inducing bias, (2) matched the true data generating process (the "Goldilocks" model space), or (3) was more complex than the generating process, the regime where variance dominates the estimation error (as measured by the MSE; see Section 5.5).
Note that an overly complex model family does not necessarily include (or even
come close to) the target function or the true data generating process. In practice, we
almost never have access to the true data generating process so we can never know if
our model family being estimated includes the generating process or not. But since, in
deep learning, we are often trying to work with data such as images, audio sequences
and text, we can probably safely assume that the model family we are training does not
include the data generating process. We can assume that to some extent we are
always trying to fit a square peg (the data generating process) into a round hole (our
model family), using the data to do so as best we can.
What this means is that controlling the complexity of the model is not going to
be a simple question of finding the model of the right size, i.e. the right number of
parameters. Instead, we might find (and in practical deep learning scenarios, we almost
always do find) that the best fitting model, in the sense of minimizing generalization
error, is one that possesses a large number of parameters that are not entirely free to
span their domain.
Regularization is a method of limiting the domain of these parameters in such a way
as to limit the capacity of the model. With respect to minimizing the empirical risk,
regularization induces bias in an attempt to limit variance that results from using a
finite dataset.
As we will see, there are a great many forms of regularization available to the deep
learning practitioner. In fact, developing more effective regularizers has been one of the
major research efforts in the field.
Most machine learning tasks can be viewed in terms of learning to represent a function f̂(x) parametrized by a vector of parameters θ. The data consists of inputs x^(i) and (for some tasks) targets y^(i) for i ∈ {1, . . . , n}. In the case of classification, each y^(i) is an integer class label in {1, . . . , k}. For regression tasks, each y^(i) is a real number. In the case of a density estimation task, there are no targets. We may group these examples into a design matrix X and a vector of targets y.
In deep learning, we are mainly interested in the case where f̂(x) has a large number of parameters and as a result possesses a high capacity to fit relatively complicated functions. This means that deep learning algorithms either require very large datasets so that the data can fully specify such complicated models, or they require careful regularization.
7.1 Classical Regularization: Parameter Norm Penalty
Regularization has been in use for decades, long before the advent of deep learning. Traditional statistical and machine learning models typically represented simpler functions. Because the functions themselves had less capacity, the regularization did not need to be as sophisticated. We use the term classical regularization to refer to the techniques used in the general machine learning and statistics literature.
Most classical regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the loss function J. We denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)     (7.1)

where α is a hyperparameter that weighs the contribution of the norm penalty term Ω relative to the standard loss function J(θ; X, y). The hyperparameter α should be a non-negative real number, with α = 0 corresponding to no regularization, and larger values of α corresponding to more regularization.
When our training algorithm minimizes the regularized loss function J̃, it will decrease both the original loss J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm Ω can result in different solutions being preferred.
For models such as linear or logistic regression, where θ = [w^⊤, b]^⊤, we typically choose Ω(θ) = (1/2)||w||_2^2 or Ω(θ) = ||w||_1. That is, we leave the biases unregularized¹, and penalize half² the squared L^2 norm or the L^1 norm.
In the following sections, we discuss the effects of the various norms when used as
penalties on the weights.
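As a concrete illustration of Eq. 7.1 (not part of the original text), the following minimal NumPy sketch forms the regularized objective J̃ = J + αΩ for a linear model with squared error loss and Ω(w) = (1/2)||w||_2^2, then takes a single gradient step; the data, model, and step size are arbitrary placeholders.

import numpy as np

def regularized_loss_and_grad(w, X, y, alpha):
    # J(w; X, y): mean squared error of a linear model (bias omitted for brevity)
    residual = X @ w - y
    J = 0.5 * np.mean(residual ** 2)
    grad_J = X.T @ residual / len(y)
    # Omega(w) = 0.5 * ||w||_2^2, weighted by the hyperparameter alpha
    J_tilde = J + alpha * 0.5 * np.dot(w, w)
    grad_tilde = grad_J + alpha * w
    return J_tilde, grad_tilde

# One gradient descent step on the regularized objective
rng = np.random.RandomState(0)
X, y = rng.randn(20, 5), rng.randn(20)
w = np.zeros(5)
loss, grad = regularized_loss_and_grad(w, X, y, alpha=0.1)
w -= 0.01 * grad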
7.1.1 L^2 Parameter Regularization
One of the simplest and most common kinds of classical regularization is the L^2 parameter norm penalty³, Ω(θ) = (1/2)||w||_2^2. This form of regularization is also known as ridge regression. It is equally applicable to neural networks, where the penalty is equal to the sum of the squared L^2 norms of all of the weight vectors. In the context of neural networks, this is known as weight decay. Typically, for neural networks, we use a different coefficient α for the weights at each layer of the network. This coefficient should be tuned using a validation set.
We can gain some insight into the behaviour of weight decay regularization by considering the gradient of the regularized loss function. To simplify the presentation, we assume a linear model with no bias term, so θ is just w. Such a model has the following gradient:

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)     (7.2)
We will further simplify the analysis by considering a quadratic approximation to the loss function in the neighborhood of the empirically optimal value of the weights w*. (If the loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.)

Ĵ(w) = J(w*) + (1/2)(w − w*)^⊤ H (w − w*)     (7.3)

where H is the Hessian matrix of J with respect to w evaluated at w*. There is no first-order term in this quadratic approximation, because w* is defined to be a minimum, where the gradient vanishes.
¹ The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact, and requires observing both variables in a variety of conditions to fit well. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Regularizing the biases can introduce a significant amount of underfitting. For example, making sparse features usually requires being able to set the biases to significantly negative values.
² The 1/2 in the L^2 penalty may seem arbitrary. Conceptually, it is not necessary and could be folded into the α hyperparameter. However, the 1/2 results in a simpler gradient (w instead of 2w) and simplifies the interpretation of the penalty as a Gaussian prior on w.
³ More generally, we could consider regularizing the parameters toward a parameter value w^(o) that is perhaps not zero. In that case the L^2 penalty term would be Ω(θ) = (1/2)||w − w^(o)||_2^2 = (1/2) Σ_i (w_i − w_i^(o))^2. Since it is far more common to consider regularizing the model parameters toward zero, we will focus on this special case in our exposition.
Likewise, because w* is a minimum, we can conclude that H is positive semi-definite.

∇_w Ĵ(w) = H(w − w*).     (7.4)
If we replace the exact gradient in equation 7.2 with the approximate gradient in equation 7.4, we can write an equation for the location of the minimum of the regularized loss function:

αw + H(w − w*) = 0     (7.5)
(H + αI)w = Hw*     (7.6)
w̃ = (H + αI)^{−1} Hw*     (7.7)
The presence of the regularization term moves the optimum from w* to w̃. As α approaches 0, w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQ^⊤. Applying the decomposition to equation 7.7, we obtain:
w̃ = (QΛQ^⊤ + αI)^{−1} QΛQ^⊤ w*
   = [Q(Λ + αI)Q^⊤]^{−1} QΛQ^⊤ w*
   = Q(Λ + αI)^{−1} Λ Q^⊤ w*,
Q^⊤ w̃ = (Λ + αI)^{−1} Λ Q^⊤ w*.     (7.8)
If we interpret Q^⊤ w̃ as rotating our parameters w into the basis defined by the eigenvectors Q of H, then we see that the effect of weight decay is to rescale the coefficients of the eigenvectors. Specifically, the ith component is rescaled by a factor of λ_i / (λ_i + α). (You may wish to review how this kind of scaling works, first explained in Fig. 2.3.)

Along the directions where the eigenvalues of H are relatively large, for example, where λ_i ≫ α, the effect of regularization is relatively small. However, components with λ_i ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in Fig. 7.1.
Only directions along which the parameters contribute significantly to reducing the loss are preserved relatively intact. In directions that do not contribute to reducing the loss, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training. This effect of suppressing contributions to the parameter vector along these principal directions of the Hessian H is captured in the concept of the effective number of parameters, defined to be

γ = Σ_i λ_i / (λ_i + α).     (7.9)
Figure 7.1: An illustration of the effect of L^2 (or weight decay) regularization on the value of the optimal w.
As α is increased, the effective number of parameters decreases.
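To make this analysis concrete, here is a small numerical check (not from the original text) that the regularized optimum w̃ = (H + αI)^{−1}Hw* rescales each eigen-component of w* by λ_i/(λ_i + α), and that computes the effective number of parameters γ of Eq. 7.9; the Hessian is an arbitrary synthetic positive definite matrix chosen only for illustration.

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(4, 4)
H = A @ A.T + 0.1 * np.eye(4)   # an arbitrary positive definite "Hessian"
w_star = rng.randn(4)
alpha = 0.5

# Minimum of the regularized quadratic: (H + alpha*I) w = H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same answer obtained by rescaling eigen-components by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
assert np.allclose(w_tilde, w_tilde_eig)

# Effective number of parameters (Eq. 7.9)
gamma = np.sum(lam / (lam + alpha))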
Another way to gain some intuition for the effect of L^2 regularization is to consider its effect on linear regression. The unregularized objective function for linear regression is the sum of squared errors:

(Xw − y)^⊤ (Xw − y).

When we add L^2 regularization, the objective function changes to

(Xw − y)^⊤ (Xw − y) + (1/2) α w^⊤ w.
This changes the normal equations for the solution from

w = (X^⊤ X)^{−1} X^⊤ y

to

w = (X^⊤ X + αI)^{−1} X^⊤ y.
We can see that L^2 regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
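A minimal sketch of this effect, with synthetic data chosen only for illustration: the ordinary least squares solution and the L^2-regularized solution differ exactly by the αI term added to X^⊤X, and the ridge weights are shrunk relative to the unregularized ones.

import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)
alpha = 0.3

w_ols   = np.linalg.solve(X.T @ X, X.T @ y)                      # unregularized
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)  # L2-regularized
# The ridge weights are shrunk relative to the OLS weights.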
7.1.2 L^1 Regularization
While L^2 regularization is the most common form of regularization for model parameters such as the weights of a neural network, it is not the only form of regularization in common usage. L^1 regularization is another kind of penalty on model parameters that behaves differently from L^2 regularization.
Formally, L^1 regularization on the model parameters w is defined as

Ω(θ) = ||w||_1 = Σ_i |w_i|,     (7.10)

that is, as the sum of absolute values of the individual parameters.⁴ We will now consider the effect of L^1 regularization on the simple linear model, with no bias term, that we considered in our analysis of L^2 regularization. In particular, we are interested in delineating the differences between the L^1 and L^2 forms of regularization. If we consider the gradient (actually, the sub-gradient) of the regularized objective function J̃(w; X, y), we have:

∇_w J̃(w; X, y) = β sign(w) + ∇_w J(w; X, y)     (7.11)

where sign(w) is simply the sign of w applied element-wise, and β plays the role of the penalty coefficient (analogous to α above).
By inspecting Eqn. 7.11, we can see immediately that the effect of L^1 regularization is quite different from that of L^2 regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with w; instead it is a constant factor with a sign equal to sign(w). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(w; X, y) as we did for L^2 regularization. Instead, the solutions tend to be much more aligned to the basis in which the problem is embedded.
For the sake of comparison with L^2 regularization, we will again consider a simplified setting of a quadratic approximation to the loss function in the neighborhood of the empirical optimum w*. (Once again, if the loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.) The gradient of this approximation is given by

∇_w Ĵ(w) = H(w − w*),     (7.12)

where, again, H is the Hessian matrix of J with respect to w evaluated at w*. We will also make the further simplifying assumption that the Hessian is diagonal, H = diag([γ_1, . . . , γ_N]), where each γ_i > 0. With this rather restrictive assumption, the minimum of the L^1 regularized loss function decomposes into a system of independent one-dimensional problems, one for each dimension i, each of the form

(1/2) γ_i (w_i − w*_i)^2 + β |w_i|,

which admits an optimal solution (for each dimension i) of the following form:

w_i = sign(w*_i) max(|w*_i| − β/γ_i, 0).
⁴ As with L^2 regularization, we could consider regularizing the parameters toward a value that is not zero, but instead some parameter value w^(o). In that case the L^1 regularization would introduce the term Ω(θ) = ||w − w^(o)||_1 = Σ_i |w_i − w_i^(o)|.
Figure 7.2: An illustration of the effect of L^1 regularization (RIGHT) on the value of the optimal w, in comparison to the effect of L^2 regularization (LEFT).
Let’s consider the situation where w*_i > 0 for all i; there are two possible outcomes. Case 1: w*_i ≤ β/γ_i. Here the optimal value of w_i under the regularized objective is simply w_i = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by the L^1 regularization, which pushes the value of w_i to zero. Case 2: w*_i > β/γ_i. Here the regularization does not move the optimal value of w_i to zero, but instead shifts it toward zero by a distance equal to β/γ_i. This is illustrated in Fig. 7.2.
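The contrast between the two penalties can be seen in a few lines of NumPy. Assuming the diagonal-Hessian setting above, with illustrative values for w*, γ, α, and β, the L^1 solution is a soft-thresholding operation that zeroes out small coordinates exactly, while the L^2 solution only rescales them.

import numpy as np

w_star = np.array([3.0, 0.4, -1.5, 0.05])   # unregularized optima per coordinate
gamma  = np.array([2.0, 1.0, 0.5, 4.0])     # diagonal Hessian entries
beta, alpha = 1.0, 1.0

# L1: soft thresholding, exactly zeroing coordinates with |w*_i| <= beta / gamma_i
w_l1 = np.sign(w_star) * np.maximum(np.abs(w_star) - beta / gamma, 0.0)

# L2: multiplicative shrinkage, nonzero whenever w*_i is nonzero
w_l2 = (gamma / (gamma + alpha)) * w_star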
In comparison to L^2 regularization, L^1 regularization results in a solution that is more sparse. Sparsity in this context means that some parameters of the model attain an optimal value (under the regularized objective) of exactly zero. As we discussed, for each element i of the parameter vector, this happens when |w*_i| ≤ β/γ_i. Compare this to the situation for L^2 regularization, where (under the same assumption of a diagonal Hessian H) we get w̃_i = (γ_i / (γ_i + α)) w*_i, which is nonzero as long as w*_i is nonzero.
In Fig. 7.2, we see that even when the optimal value of a parameter is nonzero, L^1 regularization punishes small parameter values just as harshly as large ones, leading to optimal solutions in which more parameters take the value zero while the remaining parameters take larger values.
The sparsity property induced by L^1 regularization has been used extensively as a feature selection mechanism. In particular, the well-known LASSO (least absolute shrinkage and selection operator) model of Tibshirani (1995) integrates an L^1 penalty with a linear model and a least squares cost function. Finally, L^1 is known as the only norm that is both sparsifying and convex for non-degenerate problems⁵.

⁵ For degenerate problems, where more than one solution exists, L^2 regularization can find the “sparse” solution in the sense that redundant parameters shrink to zero.
7.1.3 L^∞ Regularization
7.2 Classical Regularization as Constrained Optimization
Classical regularization adds a penalty term to the training objective:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush-Kuhn-Tucker (KKT) multiplier⁶, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct the generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).
The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).
Solving this problem requires modifying both θ and α. Specifically, α must increase whenever ||θ||_p > k and decrease whenever ||θ||_p < k. However, after we have solved the problem, we can fix α* and view the problem as just a function of θ:

θ* = arg min_θ L(θ, α*) = arg min_θ J(θ; X, y) + α* Ω(θ).
This is exactly the same as the regularized training problem of minimizing J̃. Note that the value of α* does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α* depends on the form of J. We can thus think of classical regularization as imposing a constraint on the weights, but with an unknown size of the constraint region. Larger α will result in a smaller constraint region, and smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As de-
scribed in Sec. 4.4, we can modify algorithms such as stochastic gradient descent to
take a step downhill on J(θ) and then project θ back to the nearest point that satisfies
Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and
do not want to spend time searching for the value of α that corresponds to this k.
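A minimal sketch of such a projected step, assuming for concreteness that the constraint region is the L^2 ball ||w||_2 ≤ k; the gradient, learning rate, and k below are placeholders.

import numpy as np

def project_l2_ball(w, k):
    # Project w back to the nearest point satisfying ||w||_2 <= k
    norm = np.linalg.norm(w)
    return w if norm <= k else (k / norm) * w

def constrained_step(w, grad, lr, k):
    # Take a step downhill on J, then reproject onto the constraint set
    return project_l2_ball(w - lr * grad, k)

w = np.array([2.0, -3.0])
w = constrained_step(w, grad=np.array([0.5, -0.1]), lr=0.1, k=1.0)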
Another reason to use explicit constraints and reprojection rather than enforcing con-
straints with penalties is that penalties can cause non-convex optimization procedures
to get stuck in local minima corresponding to small θ. When training neural networks,
this usually manifests as neural networks that train with several “dead units”. These are
units that do not contribute much to the behavior of the function learned by the network
because the weights going into or out of them are all very small. When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights larger. (This concern about local minima obviously does not apply when J̃ is convex.)

⁶ KKT multipliers generalize Lagrange multipliers to allow for inequality constraints.
Finally, explicit constraints with reprojection can be useful because they impose some
stability on the optimization procedure. When using high learning rates, it is possible to
encounter a positive feedback loop in which large weights induce large gradients which
then induce a large update to the weights. If these updates consistently increase the size
of the weights, then θ rapidly moves away from the origin until numerical overflow occurs.
Explicit constraints with reprojection allow us to terminate this feedback loop after
the weights have reached a certain magnitude. Hinton et al. (2012) recommend using
constraints combined with a high learning rate to allow rapid exploration of parameter
space while maintaining some stability.
7.3 Regularization from a Bayesian Perspective
In Section 5.7, we briefly reviewed Bayesian inference and the maximum a posteriori (MAP) approximation to it. From this perspective, many classical regularizers correspond to particular prior distributions over the parameters: MAP estimation adds the log of the prior to the maximum likelihood objective, so choosing a specific prior induces a specific penalty term. In particular, a Gaussian prior on the weights corresponds to an L^2 penalty, and a Laplace prior corresponds to an L^1 penalty.
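As a brief sketch of this correspondence (a standard derivation, not specific to this text), MAP estimation selects

θ_MAP = arg max_θ log p(θ | X, y) = arg max_θ [ log p(y | X, θ) + log p(θ) ].

If J is the negative log likelihood and the prior on the weights is Gaussian, p(w) = N(w; 0, (1/α)I), then log p(w) = −(α/2) w^⊤ w + const, so maximizing the posterior is the same as minimizing J(θ; X, y) + (α/2)||w||_2^2. A Laplace prior with scale 1/β contributes −β||w||_1 + const, recovering the L^1 penalty.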
7.4 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly
defined.
Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X^⊤X. This is not possible whenever X^⊤X is singular. This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting X^⊤X + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix is invert-
ible. It is also possible for a problem with no closed form solution to be underdetermined.
For example, consider logistic regression applied to a problem where the classes are lin-
early separable. If a weight vector w is able to achieve perfect classification, then 2w
will also achieve perfect classification and higher likelihood. An iterative optimization
procedure like stochastic gradient descent will continually increase the magnitude of w
and, in theory, will never halt. In practice, a numerical implementation of gradient
descent will eventually reach sufficiently large weights to cause numerical overflow, at
which point its behavior will depend on how the programmer has decided to handle
values that are not real numbers.
Most forms of regularization are able to guarantee the convergence of iterative meth-
ods applied to underdetermined problems. For example, weight decay will cause gradient
descent to quit increasing the magnitude of the weights when the slope of the likelihood
is equal to the weight decay coefficient. Likewise, early stopping based on the valida-
tion set classification rate will cause the training algorithm to terminate soon after the
validation set classification accuracy has stopped increasing. Even if the problem is lin-
early separable and there is no overfitting, the validation set classification accuracy will
eventually saturate to 100%, resulting in termination of the early stopping procedure.
The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems. One way of generalizing the concept of matrix inversion to non-square matrices, the Moore-Penrose pseudoinverse, can be viewed as a form of L^2 regularization.
The Moore-Penrose pseudoinverse is defined as

X^+ = lim_{α→0} (X^⊤ X + αI)^{−1} X^⊤.
Using the Moore-Penrose pseudoinverse, we can generalize linear regression to underconstrained problems using minimal regularization by multiplying both sides of

Xw = y

by X^+ to yield

w = X^+ y.

When a true inverse for X exists, this returns the exact solution. When X is not invertible because no exact solution exists, this returns the w corresponding to the least possible mean squared error. When X is not invertible because many solutions exist, this returns the w with the minimum possible L^2 norm.
The Moore-Penrose pseudoinverse is also closely related to the singular value decomposition. Specifically, if the SVD is given by X = UΣW^⊤, then X^+ = WΣ^+U^⊤. To compute the pseudoinverse of the diagonal matrix of singular values Σ, we simply replace each non-zero element of the diagonal with its reciprocal, and leave all zero elements equal to zero.
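A small numerical check of this relationship (illustrative only): for a wide matrix X, the pseudoinverse computed via the SVD agrees with the ridge-regularized inverse for a very small α, and yields the minimum-norm solution of the underdetermined system.

import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(3, 5)          # more features than examples: X^T X is singular
y = rng.randn(3)

X_pinv = np.linalg.pinv(X)   # computed via the SVD
alpha = 1e-8
X_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T)  # (X^T X + alpha I)^-1 X^T
assert np.allclose(X_pinv, X_ridge, atol=1e-4)

w = X_pinv @ y               # minimum-norm solution of the underdetermined problem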
Because the SVD is robust to underdetermined problems resulting from too few ob-
servations or too little underlying variance, it is useful for implementing stable variants
of many closed-form linear machine learning algorithms. The stability of these algo-
rithms can be viewed as a result of applying the minimum amount of regularization
necessary to make the problem become determined.
7.5 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more
data. Of course, in practice, the amount of data we have is limited. One way to get
around this problem is to create more fake data. For some machine learning tasks, it is
reasonably straightforward to create new fake data.
This approach is easiest for classification. A classifier needs to take a complicated,
high dimensional input x and summarize it with a single category identity y. This
means that the main task facing a classifier is to be invariant to a wide variety of
transformations. We can generate new (x, y) pairs easily just by transforming the x
inputs in our training set.
This approach is not as readily applicable to many other tasks. For example, it is
difficult to generate new fake data for a density estimation task unless we have already
solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a specific clas-
sification problem: object recognition. Images are high dimensional and include an
enormous variety of factors of variation, many of which can be easily simulated. Opera-
tions like translating the training images a few pixels in each direction can often greatly
improve generalization, even if the model has already been designed to be partially
translation invariant by using convolution and pooling. Many other operations such as
rotating the image or scaling the image have also proven quite effective. One must be
careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between ’b’ and ’d’ and the difference between ’6’ and ’9’, so horizontal flips and 180° rotations
are not appropriate ways of augmenting datasets for these tasks. There are also trans-
formations that we would like our classifiers to be invariant to, but which are not easy
to perform. For example, out-of-plane rotation can not be implemented as a simple
geometric operation on the input pixels.
For many classification and even some regression tasks, the task should still be
possible to solve even if random noise is added to the input. Neural networks prove
not to be very robust to noise, however. One way to improve the robustness of neural
networks is simply to train them with random noise applied to their inputs. Dropout,
a powerful regularization strategy that will be described in Sec. 7.11, can be seen as a
process of constructing new inputs by multiplying by noise.
In a multilayer network, it can often be beneficial to apply transformations such as
noise to the hidden units, as well as the inputs. This can be viewed as augmenting the
dataset as seen by the deeper layers.
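A minimal sketch of input-level augmentation, assuming images are stored as 2-D NumPy arrays; the shift range and noise scale are arbitrary placeholders, and each call produces a new training input with the original label left unchanged.

import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.05):
    # Randomly translate by a few pixels and add Gaussian noise;
    # the label is unchanged, so each call yields a new (x, y) pair.
    dx, dy = rng.randint(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(image, dx, axis=0), dy, axis=1)
    return shifted + noise_std * rng.randn(*image.shape)

rng = np.random.RandomState(0)
image = rng.rand(28, 28)          # a stand-in for one training image
new_example = augment(image, rng)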
When reading machine learning research papers, it is important to take the effect of
dataset augmentation into account. Often, hand-designed dataset augmentation schemes
can dramatically reduce the generalization error of a machine learning technique. It is
important to look for controlled experiments. When comparing machine learning al-
gorithm A and machine learning algorithm B, it is necessary to make sure that both
algorithms were evaluated using the same hand-designed dataset augmentation schemes.
If algorithm A performs poorly with no dataset augmentation and algorithm B performs
well when combined with numerous synthetic transformations of the input, then it is
likely the synthetic transformations and not algorithm B itself that cause the improved
performance. Sometimes the line is blurry, such as when a new machine learning al-
gorithm involves injecting noise into the inputs. In these cases, it’s best to consider how generally applicable the new algorithm is, and to make sure that pre-existing algorithms are re-run in conditions as similar as possible.
7.6 Classical Regularization as Noise Robustness
Some classical regularization techniques can be derived in terms of training on noisy inputs. For example, Bishop (1995) showed that, for a linear model trained with mean squared error, adding Gaussian noise to the inputs is equivalent to Tikhonov regularization, that is, an L^2 penalty on the weights.
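As a brief sketch of why this holds for a linear model ŷ = w^⊤x with additive noise ε satisfying E[ε] = 0 and E[εε^⊤] = σ^2 I:

E_ε[(w^⊤(x + ε) − y)^2] = (w^⊤x − y)^2 + 2 E_ε[w^⊤ε](w^⊤x − y) + E_ε[(w^⊤ε)^2]
                        = (w^⊤x − y)^2 + σ^2 w^⊤w,

so minimizing the expected squared error under input noise is the same as minimizing the noise-free squared error plus an L^2 weight penalty with coefficient σ^2.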
7.7 Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error
by combining several models (Breiman, 1994). The idea is to train several different
models separately, then have all of the models vote on the output for test examples.
This is an example of a general strategy in machine learning called model averaging.
Techniques employing this strategy are known as ensemble methods.
The reason that model averaging works is that different models will usually make
different errors on the test set to some extent.
Consider for example a set of k regression models. Suppose that each model makes an error ε_i on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[ε_i^2] = v and covariances E[ε_i ε_j] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σ_i ε_i. The expected squared error is

E[((1/k) Σ_i ε_i)^2] = (1/k^2) E[Σ_i ε_i^2 + Σ_i Σ_{j≠i} ε_i ε_j]
                     = (1/k) v + ((k − 1)/k) c.
In the case where the errors are perfectly correlated and c = v, this reduces to v, and the model averaging does not help at all. But in the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only v/k. This means that the expected squared error of the ensemble decreases inversely with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than any of its members.
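A quick Monte Carlo check of this formula, with illustrative values for k, v, and c:

import numpy as np

k, v, c = 5, 1.0, 0.3            # ensemble size, error variance, error covariance
cov = c * np.ones((k, k)) + (v - c) * np.eye(k)

rng = np.random.RandomState(0)
errors = rng.multivariate_normal(np.zeros(k), cov, size=200000)
ensemble_error = errors.mean(axis=1)          # average prediction error per example

empirical = np.mean(ensemble_error ** 2)
predicted = v / k + (k - 1) / k * c
# empirical and predicted agree up to Monte Carlo noise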
Different ensemble methods construct the ensemble of models in different ways. For
example, each member of the ensemble could be formed by training a completely different
kind of model using a different algorithm or cost function. Bagging is a method that
allows the same kind of model and same kind of training algorithm and cost function to
be re-used several times.
Specifically, bagging involves constructing k different datasets. Each dataset has
the same number of examples as the original dataset, but each dataset is constructed
by sampling with replacement from the original dataset. This means that, with high
probability, each dataset is missing some of the examples from the original dataset and
also contains several duplicate examples. Model i is then trained on dataset i. The
differences between which examples are included in each dataset result in differences
between the trained models. See Fig. 7.3 for an example.
Neural networks reach a wide enough variety of solution points that they can often
benefit from model averaging even if all of the models are trained on the same dataset.
Differences in random initialization, random selection of minibatches, differences in hy-
perparameters, or different outcomes of non-deterministic implementations of neural
networks are often enough to cause different members of the ensemble to make partially
independent errors.
Model averaging is an extremely powerful and reliable method for reducing general-
ization error. Its use is usually discouraged when benchmarking algorithms for scientific
papers, because any machine learning algorithm can benefit substantially from model av-
eraging at the price of increased computation and memory. For this reason, benchmark
comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averaging over
dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the ensemble
more regularized than the individual models. For example, a technique called boosting
constructs an ensemble with higher capacity than the individual models.
7.8 Early Stopping as a Form of Regularization
When training large models with high capacity, we often observe that training error
decreases steadily over time, but validation set error begins to rise again. See Fig. 7.4
for an example of this behavior. This behavior occurs very reliably.
This means we can obtain a model with better validation set error (and thus, hope-
fully better test set error) by returning to the parameter setting at the point in time
with the lowest validation set error. Instead of running our optimization algorithm until
we reach a (local) minimum, we run it until the error on the validation set has not
improved for some amount of time. Every time the error on the validation set improves,
we store a copy of the model parameters. When the training algorithm terminates, we
return these parameters, rather than the latest parameters. This procedure is specified more formally in Alg. 7.1.
Figure 7.3: A cartoon depiction of how bagging works. Suppose we train an ’8’ detector on a dataset containing an ’8’, a ’6’, and a ’9’, and we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the ’9’ and repeats the ’8’. On this dataset, the detector learns that a loop on top of the digit corresponds to an ’8’. On the second dataset, we repeat the ’9’ and omit the ’6’. In this case, the detector learns that a loop on the bottom of the digit corresponds to an ’8’. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the ’8’ are present.
Figure 7.4: Learning curves showing how the negative log likelihood loss changes over
time. In this example, we train a maxout network on MNIST, regularized with dropout.
Observe that the training loss decreases consistently over time, but the validation set
loss eventually begins to increase again.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θ_o be the initial parameters.
θ ← θ_o
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
  Update θ by running the training algorithm for n steps.
  i ← i + n
  v′ ← ValidationSetError(θ)
  if v′ < v then
    j ← 0
    θ* ← θ
    i* ← i
    v ← v′
  else
    j ← j + 1
  end if
end while
Best parameters are θ*, best number of training steps is i*.
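A Python sketch of this meta-algorithm; the callables train_n_steps and validation_error are assumed to be supplied by the surrounding training code and are not defined here.

import copy

def early_stopping(theta0, train_n_steps, validation_error, n=100, p=5):
    """Return the best parameters and the best number of training steps.

    train_n_steps(theta, n) runs n optimization steps and returns updated theta;
    validation_error(theta) evaluates the error on the validation set.
    Both are assumed to be provided by the caller.
    """
    theta, i, j = theta0, 0, 0
    best_theta, best_i, best_v = copy.deepcopy(theta0), 0, float("inf")
    while j < p:                       # stop after p consecutive non-improvements
        theta = train_n_steps(theta, n)
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_theta, best_i, best_v = copy.deepcopy(theta), i, v
            j = 0
        else:
            j += 1
    return best_theta, best_i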
This strategy is known as early stopping. It is probably the most commonly used
form of regularization in deep learning. Its popularity is due both to its effectiveness
and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter selection
algorithm. In this view, the number of training steps is just another hyperparameter. We
can see in Fig. 7.4 that this hyperparameter has a U-shaped validation set performance
curve, just like most other model capacity control parameters. In this case, we are
controlling the effective capacity of the model by determining how many steps it can
take to fit the training set precisely. Most of the time, setting hyperparameters requires
an expensive guess and check process, where we must set a hyperparameter at the start
of training, then run training for several steps to see its effect. The “training time”
hyperparameter is unique in that by definition a single run of training tries out many
values of the hyperparameter. The only significant cost to choosing this hyperparameter
automatically via early stopping is running the validation set evaluation periodically
during training.
An additional cost to early stopping is the need to maintain a copy of the best
parameters. This cost is generally negligible, because it is acceptable to store these pa-
rameters in a slower and larger form of memory (for example, training in GPU memory,
but storing the optimal parameters in host memory or on a disk drive). Since the best
parameters are written to infrequently and never read during training, these occasional
slow writes have little effect on the total training time.
Early stopping is a very unobtrusive form of regularization, in that it requires no change to the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.
Early stopping may be used either alone or in conjunction with other regularization
strategies. Even when using regularization strategies that modify the objective function
to encourage better generalization, it is rare for the best generalization to occur at a
local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not fed
to the model. To best exploit this extra data, one can perform extra training after the
initial training with early stopping has completed. In the second, extra training step,
all of the training data is included. There are two basic strategies one can use for this
second training procedure.
One strategy is to initialize the model again and retrain on all of the data. In
this second training pass, we train for the same number of steps as the early stopping
procedure determined was optimal in the first pass. There are some subtleties associated
with this procedure. For example, there is not a good way of knowing whether to retrain
for the same number of parameter updates or the same number of passes through the
dataset. On the second round of training, each pass through the dataset will require
more parameter updates because the training set is bigger. Usually, if overfitting is a
serious concern, you will want to retrain for the same number of epochs, rather than
the same number of parameter updates. If the primary difficulty is optimization rather
than generalization, then retraining for the same number of parameter updates makes
more sense (but it’s also less likely that you need to use a regularization method like
early stopping in the first place). This algorithm is described more formally in Alg. 7.2.
Another strategy for using all of the data is to keep the parameters obtained from
the first round of training and then continue training but now using all of the data.
At this stage, we now no longer have a guide for when to stop in terms of a number
of steps. Instead, we can monitor the loss function on the validation set, and continue
training until it falls below the value of the training set objective at which the early
stopping procedure halted. This strategy avoids the high cost of retraining the model
from scratch, but is not as well-behaved. For example, there is not any guarantee that
the objective on the validation set will ever reach the target value, so this strategy is not
even guaranteed to terminate. This procedure is presented more formally in Alg. 7.3.
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
Algorithm 7.3 A meta-algorithm for using early stopping to determine at what objective value we start to overfit, then continuing training.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
  Train on X^(train) and y^(train) for n steps.
end while
Early stopping and the use of surrogate loss functions: A useful property of
early stopping is that it can help to mitigate the problems caused by a mismatch between
the surrogate loss function whose gradient we follow downhill and the underlying per-
formance measure that we actually care about. For example, 0-1 classification loss has
a derivative that is zero or undefined everywhere, so it is not appropriate for gradient-
based optimization. We therefore train with a surrogate such as the log likelihood of
the correct class label. However, 0-1 loss is inexpensive to compute, so it can easily
be used as an early stopping criterion. Often the 0-1 loss continues to decrease long after the log likelihood has begun to worsen on the validation set.
Early stopping is also useful because it reduces the computational cost of the training
procedure. It is a form of regularization that does not require adding additional terms
to the surrogate loss function, so we get the benefit of regularization without the cost
of any additional gradient computations. It also means that we do not spend time
approaching the exact local minimum of the surrogate loss.
How early stopping acts as a regularizer: So far we have stated that early stopping is a regularizer, but we have only backed up this claim by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model?⁷
Figure 7.5: An illustration of the effect of early stopping (Right) as a form of regularization on the value of the optimal w, as compared to L^2 regularization (Left) discussed in Sec. 7.1.1.
⁷ Material for this section was taken from Bishop (1995) and Sjöberg and Ljung (1995); for further details regarding the interpretation of early stopping as a regularizer, please consult these works.
Early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_o. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate η. We can view the product ητ as the reciprocal of a regularization parameter. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o.
Indeed, we can show how, in the case of a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to the L^2 regularization seen in Section 7.1.1.

In order to compare with classical L^2 regularization, we again consider the simple setting where the parameters to be optimized are θ = w, and we take a quadratic approximation to the objective function J in the neighborhood of the empirically optimal value of the weights w*:

Ĵ(w) = J(w*) + (1/2)(w − w*)^⊤ H (w − w*)     (7.13)

where, as before, H is the Hessian matrix of J with respect to w evaluated at w*. Given the assumption that w* is a minimum of J(w), H is positive semi-definite and the gradient of the approximation is given by:

∇_w Ĵ(w) = H(w − w*).     (7.14)
Let us consider the initial parameter vector chosen at the origin, i.e. w^(0) = 0, and consider updating the parameters via gradient descent:

w^(τ) = w^(τ−1) − η ∇_w J(w^(τ−1))     (7.15)
       = w^(τ−1) − ηH(w^(τ−1) − w*)     (7.16)
w^(τ) − w* = (I − ηH)(w^(τ−1) − w*)     (7.17)

Let us now consider this expression in the space of the eigenvectors of H, i.e. we will again consider the eigendecomposition of H: H = QΛQ^⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

w^(τ) − w* = (I − ηQΛQ^⊤)(w^(τ−1) − w*)
Q^⊤(w^(τ) − w*) = (I − ηΛ) Q^⊤(w^(τ−1) − w*)

Assuming w^(0) = 0 and that |1 − ηλ_i| < 1, after τ training updates we have

Q^⊤ w^(τ) = [I − (I − ηΛ)^τ] Q^⊤ w*.     (7.18)
Now, the expression for Q^⊤ w̃ in Eqn. 7.8 for L^2 regularization can be rearranged as:

Q^⊤ w̃ = (Λ + αI)^{−1} Λ Q^⊤ w*
Q^⊤ w̃ = [I − (Λ + αI)^{−1} α] Q^⊤ w*     (7.19)

Comparing Eqns. 7.18 and 7.19, we see that if

(I − ηΛ)^τ = (Λ + αI)^{−1} α,
then L^2 regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λ_i are small (i.e. ηλ_i ≪ 1 and λ_i/α ≪ 1) then

τ ≈ 1/(ηα).     (7.20)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L^2 regularization parameter α.
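A small numerical check of this analysis, with a synthetic Hessian and w* chosen only for illustration: running τ gradient descent steps on the quadratic from w^(0) = 0 matches the closed form of Eq. 7.18, and α ≈ 1/(ητ) gives the corresponding L^2-regularized solution for comparison.

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(3, 3)
H = A @ A.T + 0.5 * np.eye(3)        # Hessian of the quadratic approximation
w_star = rng.randn(3)
eta, tau = 0.01, 200

# Run tau gradient descent steps on the quadratic, starting from w = 0
w = np.zeros(3)
for _ in range(tau):
    w = w - eta * H @ (w - w_star)

# Closed form of Eq. 7.18 in the eigenbasis of H
lam, Q = np.linalg.eigh(H)
w_closed = Q @ ((1 - (1 - eta * lam) ** tau) * (Q.T @ w_star))
assert np.allclose(w, w_closed)

# For small eta*lam_i and lam_i/alpha, alpha ~ 1/(eta*tau) gives a similar L2 solution
alpha = 1.0 / (eta * tau)
w_l2 = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))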
Parameter values corresponding to directions of significant curvature (of the loss) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that parameters corresponding to directions of significant curvature tend to be learned early relative to parameters corresponding to directions of less curvature.
7.9 Parameter Sharing
From a Bayesian perspective, parameters that perform related tasks should be close to one another; imposing this preference as a practical constraint leads to parameter sharing.
7.10 Sparse Representations
Most deep learning models have some concept of a learned representation, and regularization can also be applied to this representation, for example by encouraging it to be sparse.
7.11 Dropout
Because deep models have a high degree of expressive power, they are capable of over-
fitting significantly. While this problem can be solved by using a very large dataset,
large datasets are not always available. Dropout (Srivastava et al., 2014) provides
a computationally inexpensive but powerful method of regularizing a broad family of
models.
Dropout can be thought of as a method of making bagging practical for neural net-
works. Bagging involves training multiple models, and evaluating multiple models on
each test example. This seems impractical when each model is a neural network, since
training and evaluating a neural network is costly in terms of runtime and storing a
neural network is costly in terms of memory. Dropout provides an inexpensive approx-
imation to training and evaluating a bagged ensemble of exponentially many neural
networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be
formed by removing units from an underlying base network. In most modern neural
networks, based on a series of affine transformations and nonlinearities, we can effectively
remove a unit from a network by multiplying its state by zero. This procedure requires
some slight modification for models such as radial basis function networks, which take
the difference between the unit’s state and some reference value. Here, we will present
the dropout algorithm in terms of multiplication by zero for simplicity, but it can be
trivially modified to work with other operations that remove a unit from the network.
To train with dropout, we use a minibatch-based learning algorithm such as stochastic gradient descent. Each time we load an example into a minibatch, we randomly sample a binary mask to apply to all of the input and hidden units in the network, with each unit included with some fixed probability (typically 0.5 for hidden units and 0.8 for input units), and we run forward and back-propagation through the masked network as usual. In this way, training with dropout resembles bagging an exponentially large ensemble of sub-networks, except that the sub-networks share parameters and each is trained for only a single step. At test time, rather than explicitly averaging the predictions of exponentially many sub-networks, we typically use the weight scaling inference rule: we evaluate the full network once, with the weights going out of each unit multiplied by the probability of including that unit.
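A minimal sketch of these two procedures for a one-hidden-layer network, with arbitrary weights and a placeholder inclusion probability; it is meant only to illustrate the masking and the weight scaling rule, not a full training loop.

import numpy as np

def forward_train(x, W1, W2, rng, keep_prob=0.5):
    # Sample a binary mask and drop hidden units by multiplying their state by zero
    h = np.maximum(0.0, W1 @ x)                       # hidden layer (ReLU)
    mask = rng.binomial(1, keep_prob, size=h.shape)
    return W2 @ (h * mask)

def forward_test(x, W1, W2, keep_prob=0.5):
    # Weight scaling inference rule: use all units, scaled by their inclusion probability
    h = np.maximum(0.0, W1 @ x)
    return W2 @ (h * keep_prob)

rng = np.random.RandomState(0)
W1, W2, x = rng.randn(8, 4), rng.randn(3, 8), rng.randn(4)
train_out = forward_train(x, W1, W2, rng)
test_out = forward_test(x, W1, W2)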
For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with n input variables represented by the vector v:

P(Y = y | v) = softmax(W^⊤ v + b)_y.

We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

P(Y = y | v; d) = softmax(W^⊤(d ⊙ v) + b)_y.
The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members’ predictions:

P_ensemble(Y = y | v) = P̃_ensemble(Y = y | v) / Σ_{y′} P̃_ensemble(Y = y′ | v)     (7.21)

where

P̃_ensemble(Y = y | v) = ( Π_{d∈{0,1}^n} P(Y = y | v; d) )^{1/2^n}.
To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(Y = y | v) = ( Π_{d∈{0,1}^n} P(Y = y | v; d) )^{1/2^n}
 = ( Π_{d∈{0,1}^n} softmax(W^⊤(d ⊙ v) + b)_y )^{1/2^n}
 = ( Π_{d∈{0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) / Σ_{y′} exp(W_{y′,:}^⊤(d ⊙ v) + b_{y′}) )^{1/2^n}
 = ( Π_{d∈{0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) )^{1/2^n} / ( Π_{d∈{0,1}^n} Σ_{y′} exp(W_{y′,:}^⊤(d ⊙ v) + b_{y′}) )^{1/2^n}
Because P̃ will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(Y = y | v) ∝ ( Π_{d∈{0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) )^{1/2^n}
 = exp( (1/2^n) Σ_{d∈{0,1}^n} (W_{y,:}^⊤(d ⊙ v) + b_y) )
 = exp( (1/2) W_{y,:}^⊤ v + b_y )
Substituting this back into equation 7.21, we obtain a softmax classifier with weights (1/2)W.
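This conclusion can be verified numerically by brute force for a small input dimension. The following sketch (illustrative only, with random weights) enumerates all 2^n masks, forms the renormalized geometric mean of the sub-model predictions, and compares it with a single softmax using weights W/2.

import numpy as np
from itertools import product

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.RandomState(0)
n, n_classes = 4, 3
W, b, v = rng.randn(n, n_classes), rng.randn(n_classes), rng.randn(n)

# Renormalized geometric mean over all 2^n binary masks d
log_probs = np.array([np.log(softmax(W.T @ (np.array(d) * v) + b))
                      for d in product([0, 1], repeat=n)])
geo_mean = np.exp(log_probs.mean(axis=0))
ensemble = geo_mean / geo_mean.sum()

# Weight scaling rule: a single softmax with weights W / 2
scaled = softmax((W / 2).T @ v + b)
assert np.allclose(ensemble, scaled)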
The weight scaling rule is also exact in other settings, including regression networks
with conditionally normal outputs, and deep networks that have hidden layers with-
out nonlinearities. However, the weight scaling rule is only an approximation for deep
models that have non-linearities, and this approximation has not been theoretically
characterized. Fortunately, it works well, empirically. Goodfellow et al. (2013a) found
empirically that for deep networks with nonlinearities, the weight scaling rule can work
better (in terms of classification accuracy) than Monte Carlo approximations to the
ensemble predictor, even if the Monte Carlo approximation is allowed to sample up to
1,000 sub-networks.
Srivastava et al. (2014) showed that dropout is more effective than other standard
computationally inexpensive regularizers, such as weight decay, filter norm constraints,
and sparse activity regularization. Dropout may also be combined with more expensive
forms of regularization such as unsupervised pretraining to yield an improvement. As
of this writing, the state of the art classification error rate on the permutation invariant
MNIST dataset (not using any prior knowledge about images) is attained by a clas-
sifier that uses both dropout regularization and deep Boltzmann machine pretraining.
However, combining dropout with unsupervised pretraining has not become a popular
strategy for larger models and more challenging datasets.
One advantage of dropout is that it is very computationally cheap. Using dropout
during training requires only O(n) computation per example per update, to generate
n random binary numbers and multiply them by the state. Depending on the imple-
mentation, it may also require O(n) memory to store these binary numbers until the
backpropagation stage. Running inference in the trained model has the same cost per-
example as if dropout were not used, though we must pay the cost of dividing the weights
by 2 once before beginning to run inference on examples.
One significant advantage of dropout is that it does not significantly limit the type
of model or training procedure that can be used. It works well with nearly any model
that uses a distributed representation and can be trained with stochastic gradient de-
scent. This includes feedforward neural networks, probabilistic models such as restricted
Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Pascanu
et al., 2014a). This is very different from many other neural network regularization
strategies, such as those based on unsupervised pretraining or semi-supervised learning.
Such regularization strategies often impose restrictions such as not being able to use
rectified linear units or max pooling. Often these restrictions incur enough harm to
outweigh the benefit provided by the regularization strategy.
Though the cost per-step of applying dropout to a specific model is negligible, the
cost of using dropout in a complete system can be significant. This is because the
size of the optimal model (in terms of validation set error) is usually much larger, and
because the number of steps required to reach convergence increases. This is of course
to be expected from a regularization method, but it does mean that for very large
datasets (as a rough rule of thumb, dropout is unlikely to be beneficial when more than
15 million training examples are available, though the exact boundary may be highly
problem dependent) it is often preferable not to use dropout at all, just to speed training
and reduce the computational cost of the final model.
When extremely few labeled training examples are available, dropout is less effective.
Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing
Dataset (Xiong et al., 2011) where fewer than 5,000 examples are available (Srivastava
et al., 2014). When additional unlabeled data is available, unsupervised feature learning
can gain an advantage over dropout.
The stochasticity used while training with dropout is not a necessary part of the
model’s success. It is just a means of approximating the sum over all sub-models.
Wang and Manning (2013) derived analytical approximations to this marginalization.
Their approximation, known as fast dropout, resulted in faster convergence due to
the reduced stochasticity in the computation of the gradient. This method can also be
applied at test time, as a more principled (but also more computationally expensive)
approximation to the average over all sub-networks than the weight scaling approxima-
tion. Fast dropout has been used to match the performance of standard dropout on
small neural network problems, but has not yet yielded a significant improvement or
been applied to a large problem.
Dropout has inspired other stochastic approaches to training exponentially large
ensembles of models that share weights. DropConnect is a special case of dropout
where each product between a single scalar weight and a single hidden unit state is
considered a unit that can be dropped (Wan et al., 2013). Stochastic pooling is a form
of randomized pooling (see chapter 9.3) for building ensembles of convolutional networks
with each convolutional network attending to different spatial locations of each feature
map. So far, dropout remains the most widely used implicit ensemble method.
7.12 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the
examples (i.e., constraints) arising out of several tasks.
Figure 7.6 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting Y_i given X) share the same input X, as well as some intermediate-level representation capturing a common pool of factors. The model can generally be divided into two kinds of parts and associated parameters:
1. Task-specific parameters (which only benefit from the examples of their task to
achieve good generalization). Example: upper layers of a neural network, in Fig-
ure 7.6.
2. Generic parameters, shared across all the tasks (which benefit from the pooled
data of all the tasks). Example: lower layers of a neural network, in Figure 7.6.
Improved generalization and generalization error bounds (Baxter, 1995) can be achieved
because of the shared parameters, for which statistical strength can be greatly improved
(in proportion with the increased number of examples for the shared parameters, com-
pared to the scenario of single-task models). Of course this will happen only if some
assumptions about the statistical relationship between the different tasks are valid, i.e.,
that there is something shared across some of the tasks.
From the point of view of deep learning, the underlying prior regarding the data is the
following: among the factors that explain the variations observed in the data associated
with the different tasks, some are shared across two or more tasks.
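A minimal sketch of the architecture in Figure 7.6, with arbitrary sizes: a shared lower layer feeds two task-specific heads, so the shared parameters see data from both tasks while each head is specific to its own task. Only the forward pass is shown.

import numpy as np

def multi_task_forward(x, W_shared, W_task1, W_task2):
    # Generic parameters: shared lower layer used by all tasks
    h_shared = np.tanh(W_shared @ x)
    # Task-specific parameters: each head produces the prediction for one task
    y1 = W_task1 @ h_shared
    y2 = W_task2 @ h_shared
    return y1, y2

rng = np.random.RandomState(0)
W_shared, W_task1, W_task2 = rng.randn(16, 8), rng.randn(1, 16), rng.randn(1, 16)
y1, y2 = multi_task_forward(rng.randn(8), W_shared, W_task1, W_task2)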
Figure 7.6: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters can be learned on top of a shared representation (associated respectively with h_1 and h_2 in the figure). The underlying assumption is that there exists a common pool of factors that explain the variations in the input X, while each task is associated with a subset of these factors. In the figure, it is additionally assumed that top-level hidden units are specialized to each task, while some intermediate-level representation is shared across all tasks. Note that in the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h_3): these are the factors that explain some of the input variations but are not relevant for these tasks.