Chapter 7
Regularization
TODO - generalize our definition of Regularization
A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. The main
strategy for achieving good generalization is known as regularization. Regular-
ization is any component of the model, training process or prediction procedure
which is included to account for limitations of the training data, including its
finiteness. There are many regularization strategies. Some put extra constraints
on a machine learning model, such as adding restrictions on the parameter val-
ues. Some add extra terms in the cost function that one might consider a soft
constraint on the parameter values. If chosen carefully, these extra constraints
and penalties can lead to improved performance on the test set, either by encod-
ing prior knowledge into the model, or by forcing the optimization process into
a simpler model class that promotes generalization. Other forms of regulariza-
tion, known as ensemble methods, combine multiple hypotheses that explain the
training data. Sometimes regularization also helps to make an underdetermined
problem determined.
This chapter builds on the concepts of generalization, overfitting, underfitting,
bias and variance introduced in Chapter 5. If you are not already familiar with
these notions, please refer to that chapter before continuing with the more advanced
material presented here.
Regularizers work by trading increased bias for reduced variance. An effec-
tive regularizer is one that makes a profitable trade, that is, it reduces variance
significantly while not overly increasing the bias. When we discussed general-
ization and overfitting in Chapter 5, we focused on three situations, where the
model family being trained either (1) excluded the true data generating process,
corresponding to underfitting and inducing bias, (2) matched the true data
generating process, the “just right” model space, or (3) included the generating
process but also many other possible generating processes—the regime where vari-
ance dominates the estimation error (e.g., as measured by the MSE; see Section 5.7).
Note that, in practice, an overly complex model family does not necessarily
include (or even come close to) the target function or the true data generating
process. We almost never have access to the true data generating process so
we can never know if the model family being estimated includes the generating
process or not. But since, in deep learning, we are often trying to work with
data such as images, audio sequences and text, we can probably safely assume
that our model family does not include the data generating process. We can
assume that, to some extent, we are always trying to fit a square peg (the data
generating process) into a round hole (our model family) and using the data to
do that as best we can.
What this means is that controlling the complexity of the model is not going
to be a simple question of finding the model of the right size, i.e. the right
number of parameters. Instead, we might find (and indeed in practical deep
learning scenarios, we almost always do find) that the best fitting model (in the
sense of minimizing generalization error) is one that possesses a large number of
parameters that are not entirely free to span their domain.
As we will see, there are a great many forms of regularization available to the
deep learning practitioner. In fact, developing more effective regularizers has been
one of the major research efforts in the field.
Most machine learning tasks can be viewed in terms of learning to represent
a function $\hat{f}(x)$ parametrized by a vector of parameters $\theta$. The data consists of
inputs $x^{(i)}$ and (for some tasks) targets $y^{(i)}$ for $i \in \{1, \dots, n\}$. In the case of
classification, each $y^{(i)}$ is an integer class label in $\{1, \dots, k\}$. For regression tasks,
each $y^{(i)}$ is a real number. In the case of a density estimation task, there are no
targets. We may group these examples into a design matrix $X$ and a vector of
targets $y$.
In deep learning, we are mainly interested in the case where $\hat{f}(x)$ has a large
number of parameters and as a result possesses a high capacity to fit relatively
complicated functions. This means that deep learning algorithms either require
very large datasets so that the data can fully specify such complicated models, or
they require careful regularization. In practice, most models and tasks exist on a
spectrum between these two extremes.
7.1 Regularization from a Bayesian Perspective
The Bayesian perspective on statistical inference offers a useful framework in
which to consider many common methods of regularization. As we discussed in
Sec. 5.9, Bayesian estimation theory takes a fundamentally different approach
to model estimation than the frequentist view by considering that the model
parameters themselves are uncertain and therefore should be considered random
variables.
There are a number of immediate consequences of assuming a Bayesian world
view. The first is that if we are using probability distributions to assess uncer-
tainty in the model parameters then we should be able to express our uncertainty
about the model parameters before we see any data. This is the role of the prior
distribution. The second consequence is that, when using the model to make pre-
dictions about outcomes, one should ideally integrate over the uncertainty over
the parameter values.
There is a deep connection between the Bayesian perspective on estimation
and the process of regularization. This is not surprising since at the root both
are concerned with making predictions relative to the true data generating dis-
tribution while taking into account the finiteness of the data. What this means
is that both are open to combining information sources; that is, both are in-
terested in combining the information that can be extracted from the training
data with other, or “prior” sources of information. As we will see, many forms of
regularization can be given a Bayesian interpretation.
If we consider a dataset $\{x^{(1)}, \dots, x^{(m)}\}$, we recover the posterior distribution
on the model parameter $\theta$ by combining the data likelihood $p(x^{(1)}, \dots, x^{(m)} \mid \theta)$
with the prior:
$$\log p(\theta \mid x^{(1)}, \dots, x^{(m)}) \propto \log p(\theta) + \sum_i \log p(x^{(i)} \mid \theta) \quad (7.1)$$
In the context of maximum likelihood learning, the introduction of the prior distribution plays the same role as a regularizer in that it can be seen as a term
added to the objective function in the hope of achieving better generalization,
despite its detrimental effect on the likelihood of the training data (the optimum
of which would be achieved by considering only the last term above).
In the following section, we will detail how the addition of a prior is equiv-
alent to certain regularization strategies. However, we must be a bit careful in
establishing the relationship between the prior and a regularizer. Regularizers
are more general than priors. Priors are distributions and as such are subject
to constraints: they must be non-negative and must sum (or integrate) to one over
their domain. Regularizers have no such explicit constraints. Another problem
in interpreting all regularizers as priors is that the equivalence implies the overly
restrictive constraint that all unregularized objective functions be interpretable
as log-likelihood functions. Nevertheless, it remains true that many of the most
popular forms of regularization can be equated to a Bayesian prior.
7.2 Classical Regularization: Parameter Norm Penalty
Regularization has been used for decades prior to the advent of deep learning.
Statistical and machine learning models traditionally represented simpler func-
tions. Because the functions themselves had less capacity, the regularization did
not need to be as sophisticated. We use the term classical regularization to refer
to the techniques used in the general machine learning and statistics literature.
Most classical regularization approaches are based on limiting the capacity
of models, such as neural networks, linear regression, or logistic regression, by
adding a parameter norm penalty $\Omega(\theta)$ to the loss function $J$. We denote the
regularized loss function by $\tilde{J}$:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta) \quad (7.2)$$
where $\alpha$ is a hyperparameter that weighs the relative contribution of the norm
penalty term $\Omega$ relative to the standard loss function $J(\theta; X, y)$. The hyperparameter $\alpha$ should be a non-negative real number, with $\alpha = 0$ corresponding to no
regularization, and larger values of $\alpha$ corresponding to more regularization.
When our training algorithm minimizes the regularized loss function $\tilde{J}$ it will
decrease both the original loss $J$ on the training data and some measure of the
size of the parameters θ (or some subset of the parameters). Different choices for
the parameter norm can result in different solutions being preferred. In this
section, we discuss the effects of the various norms when used as penalties on the
model parameters.
Before delving into the regularization behavior of different norms, we note that
for neural networks, we typically choose to use a parameter norm penalty $\Omega$
that only penalizes the interaction weights, i.e., we leave the offsets unregularized.
The offsets typically require less data to fit accurately than the weights. Each
weight specifies how two variables interact, and requires observing both variables
in a variety of conditions to fit well. Each offset controls only a single variable.
This means that we do not induce too much variance by leaving the offsets un-
regularized. Also, regularizing the offsets can introduce a significant amount of
underfitting.
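As a concrete illustration of this convention, here is a minimal sketch (in NumPy, for a hypothetical two-layer regression network; the names, shapes, and ReLU choice are illustrative assumptions, not taken from the text) of a regularized loss in which only the weight matrices, and not the offsets, enter the penalty term:

```python
import numpy as np

def penalized_loss(params, X, y, alpha):
    """J~(theta) = J(theta) + alpha * Omega(theta), penalizing weights but not offsets."""
    W1, b1, W2, b2 = params
    h = np.maximum(0, X @ W1 + b1)                      # hidden layer (ReLU)
    y_hat = (h @ W2 + b2).ravel()                       # one scalar prediction per example
    J = np.mean((y_hat - y) ** 2)                       # unregularized least-squares loss
    omega = 0.5 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # L2 penalty on weights only
    return J + alpha * omega
```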
7.2.1 $L^2$ Parameter Regularization
One of the simplest and most common kinds of classical regularization is the $L^2$
parameter norm penalty,¹ $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$. This form of regularization is also
¹ More generally, we could consider regularizing the parameters toward a parameter value $w^{(o)}$
that is perhaps not zero. In that case the $L^2$ penalty term would be $\Omega(\theta) = \frac{1}{2}\|\theta - \theta^{(o)}\|_2^2 = \frac{1}{2}\sum_i (\theta_i - \theta_i^{(o)})^2$. Since it is far more common to consider regularizing the model parameters toward
zero, we will focus on this special case in our exposition.
known as ridge regression. It is readily applicable to neural networks, where it is
known as weight decay. In the context of neural networks, the penalty is equal
to the sum of the squared $L^2$ norms of all of the weight vectors. Typically, we use a
different coefficient α for the weights at each layer of the network. This coefficient
should be tuned using a validation set.
We can gain some insight into the behavior of weight decay regularization
by considering the gradient of the regularized loss function. To simplify the
presentation, we assume no offset term, so θ is just w. Such a model has the
following gradient of the loss:
$$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y) \quad (7.3)$$
We will further simplify the analysis by considering a quadratic approximation
to the loss function in the neighborhood of the empirically optimal value of the
weights $w^*$. (If the loss is truly quadratic, as in the case of fitting a linear
regression model with mean squared error, then the approximation is perfect.)
$$\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*) \quad (7.4)$$
where $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. There is
no first order term in this quadratic approximation, because $w^*$ is defined to be a
minimum, where the gradient vanishes. Likewise, because $w^*$ is a minimum, we
can conclude that $H$ is positive semi-definite. The gradient of this approximation is
$$\nabla_w \hat{J}(w) = H(w - w^*). \quad (7.5)$$
If we replace the exact gradient in equation 7.3 with the approximate gradient
in equation 7.5, we can write an equation for the location of the minimum of the
regularized loss function:
$$\alpha w + H(w - w^*) = 0 \quad (7.6)$$
$$(H + \alpha I) w = H w^* \quad (7.7)$$
$$\tilde{w} = (H + \alpha I)^{-1} H w^* \quad (7.8)$$
The presence of the regularization term moves the optimum from $w^*$ to $\tilde{w}$.
As $\alpha$ approaches 0, $\tilde{w}$ approaches $w^*$. But what happens as $\alpha$ grows? Because
$H$ is real and symmetric, we can decompose it into a diagonal matrix $\Lambda$ and
an orthonormal basis of eigenvectors, $Q$, such that $H = Q \Lambda Q^\top$. Applying the
Figure 7.1: An illustration of the effect of $L^2$ (or weight decay) regularization on the value
of the optimal $w$. The solid ellipses represent contours of equal value of the unregularized
objective. The dotted circles represent contours of equal value of the $L^2$ regularizer. At
the point $\tilde{w}$, these competing objectives reach an equilibrium. (Axes: $w_1$, $w_2$.)
decomposition to equation 7.8, we obtain:
$$\tilde{w} = (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^* = \left[ Q(\Lambda + \alpha I) Q^\top \right]^{-1} Q \Lambda Q^\top w^* = Q(\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*,$$
$$Q^\top \tilde{w} = (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*. \quad (7.9)$$
If we interpret $Q^\top w$ as rotating our parameters $w$ into the basis defined
by the eigenvectors $Q$ of $H$, then we see that the effect of weight decay is to
rescale the coefficients of the eigenvectors. Specifically, the $i$th component is rescaled
by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$. (You may wish to review how this kind of scaling works, first
explained in Fig. 2.3.)
Along the directions where the eigenvalues of $H$ are relatively large, for example, where $\lambda_i \gg \alpha$, the effect of regularization is relatively small. However,
components with $\lambda_i \ll \alpha$ will be shrunk to have nearly zero magnitude. This
effect is illustrated in Fig. 7.1.
Only directions along which the parameters contribute significantly to reducing the loss are preserved relatively intact. In directions that do not contribute to
reducing the loss, a small eigenvalue of the Hessian tells us that movement in this
direction will not significantly increase the gradient. Components of the weight
vector corresponding to such unimportant directions are decayed away through
the use of the regularization throughout training. This effect of suppressing contributions to the parameter vector along these principal directions of the Hessian
$H$ is captured in the concept of the effective number of parameters, defined to be
$$\gamma = \sum_i \frac{\lambda_i}{\lambda_i + \alpha}. \quad (7.10)$$
As α is increased, the effective number of parameters decreases.
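The rescaling in Eq. 7.9 and the effective number of parameters in Eq. 7.10 are easy to verify numerically. The following sketch (NumPy, with a small random positive semi-definite Hessian; all values are illustrative) checks that the regularized optimum of Eq. 7.8 matches the eigenbasis rescaling and computes $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A.T @ A                        # a symmetric positive semi-definite Hessian
w_star = rng.normal(size=4)        # unregularized optimum w*
alpha = 0.5                        # weight decay coefficient

# Regularized optimum, Eq. 7.8: w~ = (H + alpha I)^{-1} H w*.
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same result in the eigenbasis of H, Eq. 7.9: component i is scaled by lambda_i / (lambda_i + alpha).
lam, Q = np.linalg.eigh(H)
assert np.allclose(w_tilde, Q @ ((lam / (lam + alpha)) * (Q.T @ w_star)))

# Effective number of parameters, Eq. 7.10: shrinks toward 0 as alpha grows.
gamma = np.sum(lam / (lam + alpha))
print(gamma)
```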
Another way to gain some intuition for the effect of $L^2$ regularization is to
consider its effect on linear regression. The unregularized objective function for
linear regression is the sum of squared errors:
$$(Xw - y)^\top (Xw - y).$$
When we add $L^2$ regularization, the objective function changes to
$$(Xw - y)^\top (Xw - y) + \frac{1}{2}\alpha w^\top w.$$
This changes the normal equations for the solution from
$$w = (X^\top X)^{-1} X^\top y$$
to
$$w = (X^\top X + \alpha I)^{-1} X^\top y.$$
We can see that $L^2$ regularization causes the learning algorithm to “perceive” the
input $X$ as having higher variance, which makes it shrink the weights on features
whose covariance with the output target is low compared to this added variance.
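As a concrete illustration (NumPy, on synthetic data; the design and the value of α are arbitrary), the sketch below solves both sets of normal equations and shows the shrinking effect of the regularized version:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
alpha = 10.0

w_ols   = np.linalg.solve(X.T @ X, X.T @ y)                      # unregularized solution
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)  # L2-regularized solution

print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))           # True: the weights are shrunk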
TODO: make sure the chapter includes maybe a table showing relationships
between early stopping, priors, constraints, penalties, and adding noise (e.g., look
up L1 penalty and it tells you what prior it corresponds to). Scratchwork thinking
about how to do it: L2 penalty, L2 constraint, add noise, early stopping, Gaussian prior;
L1 penalty, L1 constraint, Laplace prior; max-norm penalty.
7.2.2 $L^1$ Regularization
While $L^2$ weight decay is the most common form of weight decay, there are other
ways to penalize the size of the model parameters. Another option is to use $L^1$
regularization. Formally, $L^1$ regularization on the model parameters $w$ is defined as:
$$\Omega(\theta) = \|w\|_1 = \sum_i |w_i|. \quad (7.11)$$
That is, the penalty is the sum of the absolute values of the individual parameters.² We will now
consider the effect of $L^1$ regularization on the simple linear model, with no offset
term, that we considered in our analysis of $L^2$ regularization. In particular, we are
interested in delineating the differences between the $L^1$ and $L^2$ forms of regularization.
If we consider the gradient (actually the sub-gradient) of the regularized
objective function $\tilde{J}(w; X, y)$, we have:
$$\nabla_w \tilde{J}(w; X, y) = \beta\, \mathrm{sign}(w) + \nabla_w J(w; X, y) \quad (7.12)$$
where $\beta$ here plays the role of the regularization coefficient $\alpha$ above, and $\mathrm{sign}(w)$ is simply the sign of $w$ applied element-wise.
By inspecting Eqn. 7.12, we can see immediately that the effect of $L^1$ regularization is quite different from that of $L^2$ regularization. Specifically, we can see
that the regularization contribution to the gradient no longer scales linearly with
$w$; instead it is a constant factor with a sign equal to $\mathrm{sign}(w)$. One consequence
of this form of the gradient is that we will not necessarily see clean solutions
to quadratic forms of $\nabla_w J(w; X, y)$ as we did for $L^2$ regularization. Instead,
the solutions are going to be much more aligned to the basis in which the
problem is embedded.
For the sake of comparison with $L^2$ regularization, we will again consider a
simplified setting of a quadratic approximation to the loss function in the neighborhood of the empirical optimum $w^*$. (Once again, if the loss is truly quadratic,
as in the case of fitting a linear regression model with mean squared error, then
the approximation is perfect.) The gradient of this approximation is given by
$$\nabla_w \hat{J}(w) = H(w - w^*), \quad (7.13)$$
where, again, $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$.
We will also make the further simplifying assumption that the Hessian is diagonal,
$H = \mathrm{diag}([\gamma_1, \dots, \gamma_N])$, where each $\gamma_i > 0$. With this rather restrictive assumption, the minimization of the $L^1$-regularized loss function decomposes
into a collection of one-dimensional problems of the form:
$$\tilde{J}(w_i; X, y) = \frac{1}{2}\gamma_i (w_i - w_i^*)^2 + \beta |w_i|,$$
which admits an optimal solution (for each dimension $i$) of the following form:
$$w_i = \mathrm{sign}(w_i^*)\, \max\!\left(|w_i^*| - \frac{\beta}{\gamma_i},\ 0\right).$$
² As with $L^2$ regularization, we could consider regularizing the parameters toward a value that is
not zero, but instead toward some parameter value $w^{(o)}$. In that case the $L^1$ regularization would
introduce the term $\Omega(\theta) = \|w - w^{(o)}\|_1 = \sum_i |w_i - w^{(o)}_i|$.
Figure 7.2: An illustration of the effect of $L^1$ regularization (right) on the value of the
optimal $w$, in comparison to the effect of $L^2$ regularization (left). (Axes in both panels: $w_1$, $w_2$.)
Let us consider the situation where $w_i^* > 0$ for all $i$; there are two possible outcomes. Case 1: $w_i^* \leq \frac{\beta}{\gamma_i}$. Here the optimal value of $w_i$ under the regularized
objective is simply $w_i = 0$. This occurs because the contribution of $J(w; X, y)$ to
the regularized objective $\tilde{J}(w; X, y)$ is overwhelmed, in direction $i$, by the $L^1$
regularization, which pushes the value of $w_i$ to zero. Case 2: $w_i^* > \frac{\beta}{\gamma_i}$. Here the
regularization does not move the optimal value of $w_i$ to zero but instead just
shifts it toward zero by a distance equal to $\frac{\beta}{\gamma_i}$. This is illustrated in Fig. 7.2.
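The contrast between the two regularizers is easy to see numerically. The sketch below (NumPy; the values of $w^*$, $\gamma$, $\beta$, and $\alpha$ are arbitrary illustrative choices) implements the per-dimension $L^1$ soft-thresholding solution derived above next to the corresponding $L^2$ rescaling $\frac{\gamma_i}{\gamma_i+\alpha} w_i^*$; the $L^1$ solution sets small components exactly to zero, while the $L^2$ solution merely shrinks them:

```python
import numpy as np

def l1_solution(w_star, gamma, beta):
    """Minimizer of 0.5*gamma_i*(w_i - w*_i)^2 + beta*|w_i| for each i (soft thresholding)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - beta / gamma, 0.0)

def l2_solution(w_star, gamma, alpha):
    """Minimizer of 0.5*gamma_i*(w_i - w*_i)^2 + 0.5*alpha*w_i^2 for each i (rescaling)."""
    return (gamma / (gamma + alpha)) * w_star

w_star = np.array([0.05, 0.3, 2.0])
gamma = np.ones(3)

print(l1_solution(w_star, gamma, beta=0.1))    # [0.    0.2   1.9 ]  -> exact zeros appear
print(l2_solution(w_star, gamma, alpha=0.1))   # [0.045 0.273 1.818] -> shrunk but nonzero
```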
In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that
is more sparse. Sparsity in this context means that some free parameters
of the model have, through $L^1$ regularization, an optimal value (under the
regularized objective) of zero. As we discussed, for each element $i$ of the parameter
vector, this happens when $w_i^* \leq \frac{\beta}{\gamma_i}$. Compare this to the situation for $L^2$
regularization, where (under the same assumption of a diagonal Hessian $H$) we
get $w_i^{L^2} = \frac{\gamma_i}{\gamma_i + \alpha} w_i^*$, which is nonzero as long as $w_i^*$ is nonzero.
In Fig. 7.2, we see that even when the optimal value of $w_i$ is nonzero, $L^1$
regularization acts to punish small values of parameters just as harshly as larger
values, leading to optimal solutions with more parameters equal to zero and
more large-valued parameters.
The sparsity property induced by $L^1$ regularization has been used extensively
as a feature selection mechanism. In particular, the well-known LASSO (Tibshirani,
1995) (least absolute shrinkage and selection operator) model integrates an $L^1$
penalty with a linear model and a least squares cost function. Finally, $L^1$ is
known as the only norm that is both sparsifying and convex for non-degenerate
problems.³

³ For degenerate problems, where more than one solution exists, $L^2$ regularization can find
the “sparse” solution in the sense that redundant parameters shrink to zero.
7.2.3 Bayesian Interpretation of the Parameter Norm Penalty
Parameter norm penalties are often amenable to being interpreted as a Bayesian
prior. Recall that parameter norm penalties are implemented by adding a term $\Omega(w)$
to the unregularized loss function $J$:
$$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \Omega(w) \quad (7.14)$$
where $\alpha$ is a hyperparameter that weighs the relative contribution of the norm
penalty term.
We can view the minimization of the regularized loss function above as equivalent to finding the maximum a posteriori (MAP) estimate of the parameters:
$\log p(w \mid X, y) \propto \log p(y \mid X, w) + \log p(w)$, where the unregularized $J(w; X, y)$
is taken as the (negative) log likelihood and the regularization term $\alpha \Omega(w)$ plays the role of
the (negative log) parameter prior distribution. Different choices of regularizer correspond to
different priors.
In the case of $L^2$ regularization, minimizing with $\alpha\Omega(w) = \frac{\alpha}{2}\|w\|_2^2$ is functionally equivalent to maximizing the log of the posterior distribution (or minimizing
the negative log posterior) where the prior is given by a Gaussian distribution:
$$\log p(w; \mu, \Sigma) = -\frac{1}{2}(w - \mu)^\top \Sigma^{-1} (w - \mu) - \frac{1}{2}\log|\Sigma| - \frac{d}{2}\log(2\pi)$$
where $d$ is the dimension of $w$. Ignoring terms that are not functions of $w$ (and
therefore do not affect the MAP value), we can see that by choosing $\mu = 0$ and
$\Sigma^{-1} = \alpha I$, we recover the functional form of $L^2$ regularization: $\log p(w; \mu, \Sigma) = -\frac{\alpha}{2}\|w\|_2^2 + \mathrm{const}$. Thus $L^2$ regularization can be seen as assuming independent Gaussian
prior distributions over all the model parameters, each with precision (i.e., the
inverse of the variance) $\alpha$.
For $L^1$ regularization, minimizing with $\alpha\Omega(w) = \alpha\sum_i |w_i|$ is equivalent to
maximizing the log of the posterior distribution with independent Laplace distributions (also known as double-sided exponential distributions) as priors over the
individual elements of $w$:
$$\log p(w; \mu, \eta) = \sum_i \log \mathrm{Laplace}(w_i; \mu_i, \eta_i) = -\sum_i \left( \frac{|w_i - \mu_i|}{\eta_i} + \log(2\eta_i) \right).$$
Once again we can ignore the second term because it does not depend on the
elements of $w$, so $L^1$ regularization is equivalent to the MAP estimate with a prior
over each element $w_i$ given by $\mathrm{Laplace}(0, \alpha^{-1})$.
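To make the MAP correspondence concrete, the following sketch (NumPy, synthetic data; the noise model, precision, and step size are illustrative assumptions) checks that minimizing a Gaussian negative log likelihood plus the negative log of a zero-mean Gaussian prior with precision $\alpha$ recovers the closed-form ridge solution of Sec. 7.2.1:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + 0.1 * rng.normal(size=40)
alpha = 2.0

# Closed-form L2-regularized (ridge) solution.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# MAP estimate by gradient descent on the negative log posterior:
#   0.5*||Xw - y||^2   (Gaussian likelihood, unit noise variance)
# + 0.5*alpha*||w||^2  (isotropic Gaussian prior with precision alpha)
w = np.zeros(3)
for _ in range(20000):
    grad = X.T @ (X @ w - y) + alpha * w
    w -= 1e-3 * grad

print(np.allclose(w, w_ridge, atol=1e-5))  # True: MAP with a Gaussian prior is ridge regression
```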
7.3 Classical Regularization as Constrained Optimization
Classical regularization adds a penalty term to the training objective:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta).$$
Recall from Sec. 4.4 that we can minimize a function subject to constraints by
constructing a generalized Lagrange function (see 4.4), consisting of the original
objective function plus a set of penalties. Each penalty is a product between
a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier,⁴ and a function
representing whether the constraint is satisfied. If we wanted to constrain $\Omega(\theta)$ to
be less than some constant $k$, we could construct a generalized Lagrange function
$$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha\,(\Omega(\theta) - k).$$
The solution to the constrained problem is given by
$$\theta^* = \arg\min_\theta \max_{\alpha,\, \alpha \geq 0} \mathcal{L}(\theta, \alpha).$$
Solving this problem requires modifying both $\theta$ and $\alpha$. Specifically, $\alpha$ must
increase whenever $\|\theta\|_p > k$ and decrease whenever $\|\theta\|_p < k$. However, after we
have solved the problem, we can fix $\alpha^*$ and view the problem as just a function
of $\theta$:
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \alpha^*) = \arg\min_\theta\ J(\theta; X, y) + \alpha^* \Omega(\theta).$$
This is exactly the same as the regularized training problem of minimizing $\tilde{J}$.
Note that the value of $\alpha^*$ does not directly tell us the value of $k$. In principle, one
can solve for $k$, but the relationship between $k$ and $\alpha^*$ depends on the form of
J. We can thus think of classical regularization as imposing a constraint on the
weights, but with an unknown size of the constraint region. Larger α will result in
a smaller constraint region, and smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As
described in Sec. 4.4, we can modify algorithms such as stochastic gradient descent
to take a step downhill on J(θ) and then project θ back to the nearest point that
satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k
is appropriate and do not want to spend time searching for the value of α that
corresponds to this k.
Another reason to use explicit constraints and reprojection rather than enforc-
ing constraints with penalties is that penalties can cause non-convex optimization
⁴ KKT multipliers generalize Lagrange multipliers to allow for inequality constraints.
procedures to get stuck in local minima corresponding to small θ. When training
neural networks, this usually manifests as neural networks that train with several
“dead units”. These are units that do not contribute much to the behavior of the
function learned by the network because the weights going into or out of them are
all very small. When training with a penalty on the norm of the weights, these
configurations can be locally optimal, even if it is possible to significantly reduce
J by making the weights larger. (This concern about local minima obviously does
not apply when $\tilde{J}$ is convex.)
Finally, explicit constraints with reprojection can be useful because they im-
pose some stability on the optimization procedure. When using high learning
rates, it is possible to encounter a positive feedback loop in which large weights
induce large gradients which then induce a large update to the weights. If these
updates consistently increase the size of the weights, then θ rapidly moves away
from the origin until numerical overflow occurs. Explicit constraints with repro-
jection allow us to terminate this feedback loop after the weights have reached a
certain magnitude. Hinton et al. (2012c) recommend using constraints combined
with a high learning rate to allow rapid exploration of parameter space while
maintaining some stability.
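A minimal sketch of this reprojection idea follows (NumPy; constraining the norm of each column of a weight matrix is one common choice, but the column-wise convention and radius $k$ here are assumptions, not prescribed by the text):

```python
import numpy as np

def project_max_norm(W, k):
    """Project each column of W back onto the L2 ball of radius k."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, k / np.maximum(norms, 1e-12))

def constrained_sgd_step(W, grad_W, lr, k):
    """One gradient step on J followed by reprojection onto the constraint region."""
    W = W - lr * grad_W            # unconstrained downhill step
    return project_max_norm(W, k)  # enforce the explicit constraint on the weight norms
```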
TODO: show how the L2 penalty is equivalent to an L2 constraint (with unknown value),
and the L1 penalty to an L1 constraint; maybe move the earlier L2 regularization
figure to here, now that the sublevel sets will make more sense; show the shapes
induced by the different norms; a separate L2 penalty on each hidden unit vector
is different from an L2 penalty on all of θ, and is equivalent to a penalty on the max
across columns of the column norms.
7.4 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be
properly defined. Many linear models in machine learning, including linear re-
gression and PCA, depend on inverting the matrix $X^\top X$. This is not possible
whenever $X^\top X$ is singular. This matrix can be singular whenever the data truly
has no variance in some direction, or when there are fewer examples (rows of $X$)
than input features (columns of $X$). In this case, many forms of regularization
correspond to inverting $X^\top X + \alpha I$ instead. This regularized matrix is guaranteed
to be invertible.
These linear problems have closed form solutions when the relevant matrix is
invertible. It is also possible for a problem with no closed form solution to be
underdetermined. For example, consider logistic regression applied to a problem
where the classes are linearly separable. If a weight vector w is able to achieve
perfect classification, then 2w will also achieve perfect classification and higher
likelihood. An iterative optimization procedure like stochastic gradient descent
will continually increase the magnitude of w and, in theory, will never halt. In
practice, a numerical implementation of gradient descent will eventually reach
sufficiently large weights to cause numerical overflow, at which point its behavior
will depend on how the programmer has decided to handle values that are not
real numbers.
Most forms of regularization are able to guarantee the convergence of iterative
methods applied to underdetermined problems. For example, weight decay will
cause gradient descent to quit increasing the magnitude of the weights when the
slope of the likelihood is equal to the weight decay coefficient. Likewise, early
stopping based on the validation set classification rate will cause the training
algorithm to terminate soon after the validation set classification accuracy has
stopped increasing. Even if the problem is linearly separable and there is no
overfitting, the validation set classification accuracy will eventually saturate to
100%, resulting in termination of the early stopping procedure.
The idea of using regularization to solve underdetermined problems extends
beyond machine learning. The same idea is useful for several basic linear algebra
problems.
As we saw in Chapter 2.9, we can solve underdetermined linear equations
using the Moore-Penrose pseudoinverse.
One definition of the pseudoinverse $X^+$ of a matrix $X$ is as the limit of performing linear
regression with an infinitesimal amount of $L^2$ regularization:
$$X^+ = \lim_{\alpha \searrow 0} (X^\top X + \alpha I)^{-1} X^\top.$$
When a true inverse for $X$ exists, then $w = X^+ y$ returns the weights that
exactly solve the regression problem. When $X$ is not invertible because no exact solution exists, this returns the $w$ corresponding to the least possible mean
squared error. When $X$ is not invertible because many solutions exactly solve
the regression problem, this returns the $w$ with the minimum possible $L^2$ norm.
Recall that the Moore-Penrose pseudoinverse can be computed easily using
the singular value decomposition. Because the SVD is robust to underdetermined
problems resulting from too few observations or too little underlying variance,
it is useful for implementing stable variants of many closed-form linear machine
learning algorithms. The stability of these algorithms can be viewed as a result of
applying the minimum amount of regularization necessary to make the problem
become determined.
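A quick numerical check of this limiting relationship on an underdetermined problem (NumPy; the tiny α and the tolerance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 5))    # more features than examples, so X^T X is singular
y = rng.normal(size=3)

# Ridge solution with a very small amount of regularization...
alpha = 1e-8
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

# ...approaches the minimum-norm solution given by the Moore-Penrose pseudoinverse.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_ridge, w_pinv, atol=1e-6))  # True
```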
7.5 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it
on more data. Of course, in practice, the amount of data we have is limited. One
way to get around this problem is to create more fake data. For some machine
learning tasks, it is reasonably straightforward to create new fake data.
This approach is easiest for classification. A classifier needs to take a compli-
cated, high dimensional input x and summarize it with a single category identity
y. This means that the main task facing a classifier is to be invariant to a wide
variety of transformations. We can generate new (x, y) pairs easily just by trans-
forming the x inputs in our training set.
This approach is not as readily applicable to many other tasks. For example,
it is difficult to generate new fake data for a density estimation task unless we
have already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a spe-
cific classification problem: object recognition. Images are high dimensional and
include an enormous variety of factors of variation, many of which can be easily
simulated. Operations like translating the training images a few pixels in each
direction can often greatly improve generalization, even if the model has already
been designed to be partially translation invariant by using convolution and pool-
ing. Many other operations such as rotating the image or scaling the image have
also proven quite effective. One must be careful not to apply transformations that
would change the correct class. For example, optical character recognition tasks
require recognizing the difference between ’b’ and ’d’ and the difference
between ’6’ and ’9’, so horizontal flips and 180° rotations are not appropriate
ways of augmenting datasets for these tasks. There are also transformations that
we would like our classifiers to be invariant to, but which are not easy to perform.
For example, out-of-plane rotation can not be implemented as a simple geometric
operation on the input pixels.
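As a minimal illustration, the sketch below (NumPy; a hypothetical 28×28 image and a small random pixel shift, not tied to any particular dataset) produces a new training example whose class label is unchanged:

```python
import numpy as np

def random_shift(image, rng, max_shift=2):
    """Return a randomly translated copy of a 2-D image, zero-padded at the borders."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

rng = np.random.default_rng(0)
x = rng.random((28, 28))       # stand-in for one training image
x_augmented = random_shift(x, rng)
```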
For many classification and even some regression tasks, the task should still be
possible to solve even if random noise is added to the input. Neural networks prove
not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to
improve the robustness of neural networks is simply to train them with random
noise applied to their inputs. This same approach also works when the noise is
applied to the hidden units, which can be seen as doing dataset augmentation
at multiple levels of abstraction. Poole et al. (2014) recently showed that this
approach can be highly effective provided that the magnitude of the noise is
carefully tuned. Dropout, a powerful regularization strategy that will be described
in Sec. 7.11, can be seen as a process of constructing new inputs by multiplying
by noise.
In a multilayer network, it can often be beneficial to apply transformations
such as noise to the hidden units, as well as the inputs. This can be viewed as
augmenting the dataset as seen by the deeper layers.
When reading machine learning research papers, it is important to take the
effect of dataset augmentation into account. Often, hand-designed dataset aug-
mentation schemes can dramatically reduce the generalization error of a machine
learning technique. It is important to look for controlled experiments. When
comparing machine learning algorithm A and machine learning algorithm B, it
is necessary to make sure that both algorithms were evaluated using the same
hand-designed dataset augmentation schemes. If algorithm A performs poorly
with no dataset augmentation and algorithm B performs well when combined with
numerous synthetic transformations of the input, then it is likely the synthetic
transformations and not algorithm B itself that cause the improved performance.
Sometimes the line is blurry, such as when a new machine learning algorithm
involves injecting noise into the inputs. In these cases, it’s best to consider how
generally applicable the new algorithm is, and to make sure that pre-existing
algorithms are re-run under conditions as similar as possible.
TODO– tangent propagation NOTE: there is already some coverage of Tanget-
Prop in manifold.tex, it may not be necessary to exhaustively describe all known
forms of regularization in this chapter
7.6 Classical Regularization as Noise Robustness
In the machine learning literature, there have been two ways that noise has been
used as part of a regularization strategy. The first and most popular way is by
adding noise to the input. While this can be interpreted simply as a form of dataset
augmentation (as described above in Sec. 7.5), we can also interpret it as being
equivalent to more traditional forms of regularization.
The second way that noise has been used in the service of regularizing models
is by adding it to the weights. This technique has been used primarily in the
context of recurrent neural networks (Jim et al., 1996; Graves, 2011a). It can
be interpreted as a stochastic implementation of Bayesian inference over the
weights. The Bayesian treatment of learning considers the model
weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way
to reflect this uncertainty (Graves, 2011a).
In this section, we review these two strategies and provide some insight into
how noise can act to regularize the model.
7.6.1 Injecting Noise at the Input
Some classical regularization techniques can be derived in terms of training on
noisy inputs.⁵ Let us consider a regression setting, where we are interested in
learning a model $\hat{y}(x)$ that maps a set of features $x$ to a scalar. The cost function
we will use is the least-squares error between the model prediction $\hat{y}(x)$ and the
true value $y$:
$$J = \mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)^2 \right], \quad (7.15)$$
where we are given a dataset of $m$ input/output pairs $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$.
Our training objective is to minimize this loss, estimated in practice from the $m$ training pairs.
Now consider that with each input presentation to the model we also include
a random perturbation $\epsilon \sim \mathcal{N}(0, \nu I)$, so that the error function becomes
$$\tilde{J}_x = \mathbb{E}_{p(x,y,\epsilon)}\left[ (\hat{y}(x + \epsilon) - y)^2 \right] = \mathbb{E}_{p(x,y,\epsilon)}\left[ \hat{y}^2(x + \epsilon) - 2y\hat{y}(x + \epsilon) + y^2 \right]$$
$$= \mathbb{E}_{p(x,y,\epsilon)}\left[ \hat{y}^2(x + \epsilon) \right] - 2\mathbb{E}_{p(x,y,\epsilon)}\left[ y\hat{y}(x + \epsilon) \right] + \mathbb{E}_{p(x,y,\epsilon)}\left[ y^2 \right] \quad (7.16)$$
Assuming small noise, we can consider the Taylor series expansion of $\hat{y}(x + \epsilon)$
around $\hat{y}(x)$:
$$\hat{y}(x + \epsilon) = \hat{y}(x) + \epsilon^\top \nabla_x \hat{y}(x) + \frac{1}{2}\epsilon^\top \nabla_x^2 \hat{y}(x)\,\epsilon + O(\epsilon^3) \quad (7.17)$$
Substituting this approximation for $\hat{y}(x+\epsilon)$ into the objective function (Eq. 7.16)
and using the fact that $\mathbb{E}_{p(\epsilon)}[\epsilon] = 0$ and that $\mathbb{E}_{p(\epsilon)}[\epsilon\epsilon^\top] = \nu I$ to simplify,⁶ we get:
$$\tilde{J}_x \approx \mathbb{E}_{p(x,y,\epsilon)}\left[ \left( \hat{y}(x) + \epsilon^\top \nabla_x \hat{y}(x) + \tfrac{1}{2}\epsilon^\top \nabla_x^2 \hat{y}(x)\,\epsilon \right)^2 \right] - 2\mathbb{E}_{p(x,y,\epsilon)}\left[ y\hat{y}(x) + y\,\epsilon^\top \nabla_x \hat{y}(x) + \tfrac{1}{2}y\,\epsilon^\top \nabla_x^2 \hat{y}(x)\,\epsilon \right] + \mathbb{E}_{p(x,y,\epsilon)}\left[ y^2 \right]$$
$$= \mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)^2 \right] + \mathbb{E}_{p(x,y,\epsilon)}\left[ \hat{y}(x)\,\epsilon^\top \nabla_x^2 \hat{y}(x)\,\epsilon + \left( \epsilon^\top \nabla_x \hat{y}(x) \right)^2 + O(\epsilon^3) \right] - 2\mathbb{E}_{p(x,y,\epsilon)}\left[ \tfrac{1}{2}y\,\epsilon^\top \nabla_x^2 \hat{y}(x)\,\epsilon \right]$$
$$= J + \nu\,\mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)\,\nabla_x^2 \hat{y}(x) \right] + \nu\,\mathbb{E}_{p(x,y)}\left[ \|\nabla_x \hat{y}(x)\|^2 \right] \quad (7.18)$$
⁵ The analysis in this section is mainly based on that in Bishop (1995a,b).
⁶ In this derivation we have used two properties of the trace operator: (1) that a scalar is
equal to its trace; (2) that, for square matrices $A$ and $B$, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$. These are discussed in
Sec. 2.10.
If we consider minimizing this objective function by taking the functional gradient
with respect to $\hat{y}(x)$ and setting the result to zero, we can see that
$$\hat{y}(x) = \mathbb{E}_{p(y|x)}[y] + O(\nu).$$
This implies that the expectation in the second-to-last term in Eq. 7.18, $\mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)\,\nabla_x^2 \hat{y}(x) \right]$,
reduces to $O(\nu)$ because the expectation of the difference $(\hat{y}(x) - y)$ reduces to
$O(\nu)$.
This leaves us with an objective function of the form
$$\tilde{J}_x = \mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)^2 \right] + \nu\,\mathbb{E}_{p(x,y)}\left[ \|\nabla_x \hat{y}(x)\|^2 \right] + O(\nu^2).$$
For small $\nu$, the minimization of $J$ with added noise on the input (with covariance
$\nu I$) is equivalent to minimization of $J$ with an additional regularization term given
by $\nu\,\mathbb{E}_{p(x,y)}\left[ \|\nabla_x \hat{y}(x)\|^2 \right]$.
Considering the behavior of this regularization term, we note that it has the
effect of penalizing large gradients of the function ˆy(x). That is, it has the effect
of reducing the sensitivity of the output of the network with respect to small
variations in its input x. We can interpret this as attempting to build in some
local robustness into the model and thereby promote generalization. We note also
that for linear networks, this regularization term reduces to simple weight decay
(as discussed in Sec. 7.2.1).
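A small sketch of this equivalence for a linear model (NumPy, synthetic data; the noise level $\nu$ and step size are arbitrary): in expectation, the gradient computed on noise-corrupted inputs equals the clean gradient plus a weight decay term.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
lr, nu = 0.01, 0.1                 # learning rate and input-noise variance

for step in range(1000):
    eps = rng.normal(scale=np.sqrt(nu), size=X.shape)   # fresh noise for each presentation
    X_noisy = X + eps
    grad = 2 * X_noisy.T @ (X_noisy @ w - y) / len(y)   # least-squares gradient on noisy inputs
    w -= lr * grad

# For this linear model, the expected gradient equals the clean gradient plus 2*nu*w,
# i.e. training on noisy inputs acts like an L2 penalty nu*||w||^2 (simple weight decay).
```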
7.6.2 Injecting Noise at the Weights
Rather than injecting noise as part of the input, one could also consider adding
noise directly to the model parameters. As we shall see, this can also be in-
terpreted as equivalent (under some assumptions) to a more traditional form of
regularization. Adding noise to the weights has been shown to be an effective regularization strategy in the context of recurrent neural networks⁷ (Jim et al., 1996;
Graves, 2011b). In the following, we will present an analysis of the effect of weight
noise on a standard feedforward neural network (as introduced in Chapter 6).
As we did in the last section, we again consider the regression setting, where
we wish to train a function ˆy(x) that maps a set of features x to a scalar using
the least-squares cost function between the model predictions ˆy(x) and the true
values y:
$$J = \mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)^2 \right]. \quad (7.19)$$
We again assume we are given a dataset of $m$ input/output pairs $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$.
We now assume that with each input presentation we also include a random
perturbation $\epsilon_W \sim \mathcal{N}(0, \eta I)$ of the network weights. Let us imagine that we have

⁷ Recurrent neural networks will be discussed in detail in Chapter 10.
a standard $L$-layer MLP; we denote the perturbed model as $\hat{y}_{\epsilon_W}(x)$. Despite the
injection of noise, we are still interested in minimizing the squared error of the
output of the network. The objective function thus becomes:
$$\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}\left[ (\hat{y}_{\epsilon_W}(x) - y)^2 \right] = \mathbb{E}_{p(x,y,\epsilon_W)}\left[ \hat{y}_{\epsilon_W}^2(x) - 2y\hat{y}_{\epsilon_W}(x) + y^2 \right] \quad (7.20)$$
Assuming small noise, we can consider the Taylor series expansion of $\hat{y}_{\epsilon_W}(x)$
around the unperturbed function $\hat{y}(x)$:
$$\hat{y}_{\epsilon_W}(x) = \hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) + \frac{1}{2}\epsilon_W^\top \nabla_W^2 \hat{y}(x)\,\epsilon_W + O(\epsilon_W^3) \quad (7.21)$$
From here, we follow the same basic strategy that was laid out in the previous
section in analyzing the effect of adding noise to the input. That is, we substitute
the Taylor series expansion of $\hat{y}_{\epsilon_W}(x)$ into the objective function in Eq. 7.20:
$$\tilde{J}_W \approx \mathbb{E}_{p(x,y,\epsilon_W)}\left[ \left( \hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) + \tfrac{1}{2}\epsilon_W^\top \nabla_W^2 \hat{y}(x)\,\epsilon_W \right)^2 \right] - \mathbb{E}_{p(x,y,\epsilon_W)}\left[ 2y\left( \hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) + \tfrac{1}{2}\epsilon_W^\top \nabla_W^2 \hat{y}(x)\,\epsilon_W \right) \right] + \mathbb{E}_{p(x,y,\epsilon_W)}\left[ y^2 \right] \quad (7.22)$$
$$= \mathbb{E}_{p(x,y,\epsilon_W)}\left[ (\hat{y}(x) - y)^2 \right] - 2\mathbb{E}_{p(x,y,\epsilon_W)}\left[ \tfrac{1}{2}y\,\epsilon_W^\top \nabla_W^2 \hat{y}(x)\,\epsilon_W \right] + \mathbb{E}_{p(x,y,\epsilon_W)}\left[ \hat{y}(x)\,\epsilon_W^\top \nabla_W^2 \hat{y}(x)\,\epsilon_W + \left( \epsilon_W^\top \nabla_W \hat{y}(x) \right)^2 + O(\epsilon_W^3) \right], \quad (7.23)$$
where we have used the fact that $\mathbb{E}_{p(\epsilon_W)}[\epsilon_W] = 0$ to drop terms that are linear in
$\epsilon_W$. Incorporating the assumption that $\mathbb{E}_{p(\epsilon_W)}[\epsilon_W \epsilon_W^\top] = \eta I$, we have:
$$\tilde{J}_W \approx J + \eta\,\mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)\,\nabla_W^2 \hat{y}(x) \right] + \eta\,\mathbb{E}_{p(x,y)}\left[ \|\nabla_W \hat{y}(x)\|^2 \right] \quad (7.24)$$
Again, if we consider minimizing this objective function, we can see that the
optimal value of $\hat{y}(x)$ is:
$$\hat{y}(x) = \mathbb{E}_{p(y|x)}[y] + O(\eta),$$
implying that the expectation in the middle term in Eq. 7.24, $\mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)\,\nabla_W^2 \hat{y}(x) \right]$,
reduces to $O(\eta)$ because the expectation of the difference $(\hat{y}(x) - y)$ reduces to
$O(\eta)$.
This leaves us with an objective function of the form
$$\tilde{J}_W = \mathbb{E}_{p(x,y)}\left[ (\hat{y}(x) - y)^2 \right] + \eta\,\mathbb{E}_{p(x,y)}\left[ \|\nabla_W \hat{y}(x)\|^2 \right] + O(\eta^2).$$
For small $\eta$, the minimization of $J$ with added weight noise (with covariance
$\eta I$) is equivalent to minimization of $J$ with an additional regularization term:
$\eta\,\mathbb{E}_{p(x,y)}\left[ \|\nabla_W \hat{y}(x)\|^2 \right]$. This form of regularization encourages the parameters to
go to regions of parameter space that have relatively small gradients. In other
words, it pushes the model into regions where the model is relatively insensitive
to small variations in the weights. Regularization strategies with this kind of
behaviour have been considered before (TODO: restore this broken citation when
it is fixed). In the simplified case of linear regression (where, for instance,
$\hat{y}(x) = w^\top x + b$), this regularization term collapses into $\eta\,\mathbb{E}_{p(x)}\left[ \|x\|^2 \right]$, which is not a
function of the parameters and therefore does not contribute to the gradient of
$\tilde{J}_W$ with respect to the model parameters.
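For completeness, here is how the weight perturbation itself is typically injected during training (a NumPy sketch on a linear least-squares model; names and constants are illustrative). As noted above, for a purely linear model the induced penalty $\eta\,\mathbb{E}_{p(x)}[\|x\|^2]$ does not depend on $w$, so the regularizing effect appears only for nonlinear models; the sketch only shows the mechanics of sampling $\epsilon_W$ at each presentation:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10)

w = np.zeros(10)
lr, eta = 0.01, 0.01               # learning rate and weight-noise variance

for step in range(2000):
    eps_w = rng.normal(scale=np.sqrt(eta), size=w.shape)  # epsilon_W ~ N(0, eta*I)
    grad = 2 * X.T @ (X @ (w + eps_w) - y) / len(y)       # gradient at the perturbed weights
    w -= lr * grad
```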
7.7 Early Stopping as a Form of Regularization
When training large models with high capacity, we often observe that training
error decreases steadily over time, but validation set error begins to rise again.
See Fig. 7.3 for an example of this behavior. This behavior occurs very reliably.
This means we can obtain a model with better validation set error (and thus,
hopefully better test set error) by returning to the parameter setting at the point
in time with the lowest validation set error. Instead of running our optimization
algorithm until we reach a (local) minimum, we run it until the error on the
validation set has not improved for some amount of time. Every time the error
on the validation set improves, we store a copy of the model parameters. When
the training algorithm terminates, we return these parameters, rather than the
latest parameters. This procedure is specified more formally in Alg. 7.1.
This strategy is known as early stopping. It is probably the most commonly
used form of regularization in deep learning. Its popularity is due both to its
effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter se-
lection algorithm. In this view, the number of training steps is just another
hyperparameter. We can see in Fig. 7.3 that this hyperparameter has a U-shaped
validation set performance curve, just like most other model capacity control pa-
rameters. In this case, we are controlling the effective capacity of the model by
determining how many steps it can take to fit the training set precisely. Most of
the time, setting hyperparameters requires an expensive guess and check process,
where we must set a hyperparameter at the start of training, then run training
for several steps to see its effect. The “training time” hyperparameter is unique
in that by definition a single run of training tries out many values of the hyperpa-
rameter. The only significant cost to choosing this hyperparameter automatically
via early stopping is running the validation set evaluation periodically during training.
Figure 7.3: Learning curves showing how the negative log likelihood loss changes over
time. In this example, we train a maxout network on MNIST, regularized with dropout.
Observe that the training loss decreases consistently over time, but the validation set loss
eventually begins to increase again.
Algorithm 7.1 The early stopping meta-algorithm for determining the best
amount of time to train. This meta-algorithm is a general strategy that works
well with a variety of training algorithms and ways of quantifying error on the
validation set.
Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θ_o be the initial parameters.
θ ← θ_o
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ* ← θ
        i* ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ*, best number of training steps is i*.
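A direct transcription of Alg. 7.1 into Python follows; the `train_n_steps` and `validation_error` callables are placeholders for whatever training loop and error measure are in use:

```python
import numpy as np

def early_stopping_train(theta0, train_n_steps, validation_error, n=100, patience=5):
    """Return the parameters (and step count) with the lowest observed validation error."""
    theta, i, j = theta0, 0, 0
    best_theta, best_i, best_v = theta0, 0, np.inf
    while j < patience:
        theta = train_n_steps(theta, n)      # run the training algorithm for n more steps
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_theta, best_i, best_v = theta, i, v
            j = 0                            # reset the patience counter on improvement
        else:
            j += 1
    return best_theta, best_i
```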
An additional cost to early stopping is the need to maintain a copy of the best
parameters. This cost is generally negligible, because it is acceptable to store
these parameters in a slower and larger form of memory (for example, training
in GPU memory, but storing the optimal parameters in host memory or on a
disk drive). Since the best parameters are written to infrequently and never read
during training, these occasional slow writes have little effect on the total
training time.
Early stopping is a very unobtrusive form of regularization, in that it requires
no change to the underlying training procedure, the objective function, or the set of
allowable parameter values. This means that it is easy to use early stopping with-
out damaging the learning dynamics. This is in contrast to weight decay, where
one must be careful not to use too much weight decay and trap the network in a
bad local minimum corresponding to a solution with pathologically small weights.
Early stopping may be used either alone or in conjunction with other regu-
larization strategies. Even when using regularization strategies that modify the
objective function to encourage better generalization, it is rare for the best gen-
eralization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra
training step, all of the training data is included. There are two basic strategies
one can use for this second training procedure.
One strategy is to initialize the model again and retrain on all of the data.
In this second training pass, we train for the same number of steps as the early
stopping procedure determined was optimal in the first pass. There are some
subtleties associated with this procedure. For example, there is not a good way
of knowing whether to retrain for the same number of parameter updates or the
same number of passes through the dataset. On the second round of training,
each pass through the dataset will require more parameter updates because the
training set is bigger. Usually, if overfitting is a serious concern, you will want to
retrain for the same number of epochs, rather than the same number of parameter
updates. If the primary difficulty is optimization rather than generalization, then
retraining for the same number of parameter updates makes more sense (but it’s
also less likely that you need to use a regularization method like early stopping
in the first place). This algorithm is described more formally in Alg. 7.2.
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long
to train, then retraining on all the data.
Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
Another strategy for using all of the data is to keep the parameters obtained
from the first round of training and then continue training but now using all of the
data. At this stage, we no longer have a guide for when to stop in terms of a
number of steps. Instead, we can monitor the loss function on the validation set,
Figure 7.4: An illustration of the effect of early stopping (right) as a form of regularization
on the value of the optimal $w$, as compared to $L^2$ regularization (left) discussed in
Sec. 7.2.1. (Axes in both panels: $w_1$, $w_2$.)
and continue training until it falls below the value of the training set objective at
which the early stopping procedure halted. This strategy avoids the high cost of
retraining the model from scratch, but is not as well-behaved. For example, there
is not any guarantee that the objective on the validation set will ever reach the
target value, so this strategy is not even guaranteed to terminate. This procedure
is presented more formally in Alg. 7.3.
Algorithm 7.3 A meta-algorithm for using early stopping to determine at
what objective value we start to overfit, then continuing training.
Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
Early stopping and the use of surrogate loss functions: A useful property
of early stopping is that it can help to mitigate the problems caused by a mismatch
between the surrogate loss function whose gradient we follow downhill and the
underlying performance measure that we actually care about. For example, 0-
1 classification loss has a derivative that is zero or undefined everywhere, so it
is not appropriate for gradient-based optimization. We therefore train with a
surrogate such as the log likelihood of the correct class label. However, 0-1 loss
is inexpensive to compute, so it can easily be used as an early stopping criterion.
Often the 0-1 loss continues to decrease for long after the log likelihood has begun
to worsen on the validation set. TODO: figures. in figures/regularization, I have
extracted the 0-1 loss but only used the nll for the regularization chapter’s figures.
Early stopping is also useful because it reduces the computational cost of the
training procedure. It is a form of regularization that does not require adding
additional terms to the surrogate loss function, so we get the benefit of regular-
ization without the cost of any additional gradient computations. It also means
that we do not spend time approaching the exact local minimum of the surrogate
loss.
How early stopping acts as a regularizer: So far we have stated that early
stopping is a regularization strategy, but we have only backed up this claim by
showing learning curves where the validation set error has a U-shaped curve.
What is the actual mechanism by which early stopping regularizes the model?⁸
Early stopping has the effect of restricting the optimization procedure to a
relatively small volume of parameter space in the neighborhood of the initial
parameter value $\theta_o$. More specifically, imagine taking $\tau$ optimization steps (corresponding to $\tau$ training iterations) and taking $\eta$ as the learning rate. We can
view the product $\eta\tau$ as the reciprocal of a regularization parameter. Assuming the
gradient is bounded, restricting both the number of iterations and the learning
rate limits the volume of parameter space reachable from $\theta_o$.
Indeed, we can show that, in the case of a simple linear model with a
quadratic error function and simple gradient descent, early stopping is equivalent
to $L^2$ regularization as seen in Section 7.2.1.
In order to compare with classical $L^2$ regularization, we again consider the
simple setting where the parameters to be optimized are $\theta = w$, and
we take a quadratic approximation to the objective function $J$ in the neighborhood
of the empirically optimal value of the weights $w^*$:
$$\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*) \quad (7.25)$$
where, as before, $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at
$w^*$. Given the assumption that $w^*$ is a minimum of $J(w)$, we can conclude that
$H$ is positive semi-definite and that the gradient is given by:
$$\nabla_w \hat{J}(w) = H(w - w^*). \quad (7.26)$$
⁸ Material for this section was taken from Bishop (1995a) and Sjöberg and Ljung (1995); for
further details regarding the interpretation of early stopping as a regularizer, please consult
these works.
Let us consider the initial parameter vector chosen at the origin, i.e., $w^{(0)} = 0$.
We will consider updating the parameters via gradient descent:
$$w^{(\tau)} = w^{(\tau-1)} - \eta\,\nabla_w J(w^{(\tau-1)}) \quad (7.27)$$
$$= w^{(\tau-1)} - \eta H (w^{(\tau-1)} - w^*) \quad (7.28)$$
$$w^{(\tau)} - w^* = (I - \eta H)(w^{(\tau-1)} - w^*) \quad (7.29)$$
Let us now consider this expression in the space of the eigenvectors of $H$, i.e.,
we will again consider the eigendecomposition of $H$: $H = Q \Lambda Q^\top$, where $\Lambda$ is a
diagonal matrix and $Q$ is an orthonormal basis of eigenvectors.
$$w^{(\tau)} - w^* = (I - \eta Q \Lambda Q^\top)(w^{(\tau-1)} - w^*)$$
$$Q^\top (w^{(\tau)} - w^*) = (I - \eta\Lambda)\,Q^\top (w^{(\tau-1)} - w^*)$$
Assuming $w^{(0)} = 0$ and that $|1 - \eta\lambda_i| < 1$, unrolling this recursion over $\tau$ training updates gives $Q^\top(w^{(\tau)} - w^*) = (I - \eta\Lambda)^\tau Q^\top(w^{(0)} - w^*) = -(I - \eta\Lambda)^\tau Q^\top w^*$, so that
$$Q^\top w^{(\tau)} = \left[ I - (I - \eta\Lambda)^\tau \right] Q^\top w^*. \quad (7.30)$$
Now, the expression for $Q^\top \tilde{w}$ in Eqn. 7.9 for $L^2$ regularization can be rearranged as:
$$Q^\top \tilde{w} = (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*$$
$$Q^\top \tilde{w} = \left[ I - (\Lambda + \alpha I)^{-1} \alpha \right] Q^\top w^* \quad (7.31)$$
Comparing Eqns. 7.30 and 7.31, we see that if
$$(I - \eta\Lambda)^\tau = (\Lambda + \alpha I)^{-1} \alpha,$$
then $L^2$ regularization and early stopping can be seen to be equivalent (at least
under the quadratic approximation of the objective function). Going even further,
by taking logarithms and using the series expansion for $\log(1 + x)$, we can conclude that
if all $\lambda_i$ are small (i.e., $\eta\lambda_i \ll 1$ and $\lambda_i/\alpha \ll 1$) then
$$\tau \approx \frac{1}{\eta\alpha}. \quad (7.32)$$
That is, under these assumptions, the number of training iterations τ plays a role
inversely proportional to the L
2
regularization parameter.
Parameter values corresponding to directions of significant curvature (of the
loss) are regularized less than directions of less curvature. Of course, in the context
of early stopping, this really means that parameters that correspond to directions
of significant curvature tend to learn early relative to parameters corresponding
to directions of less curvature.
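Under the quadratic approximation this correspondence can be checked numerically. The sketch below (NumPy, on a synthetic quadratic objective; the learning rate and step count are arbitrary) compares $\tau$ gradient descent steps from the origin with the $L^2$-regularized solution using $\alpha = 1/(\eta\tau)$; the two are close when $\eta\lambda_i$ and $\lambda_i/\alpha$ are small:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(20, 4))
H = A.T @ A / 20                   # Hessian of the quadratic loss 0.5*(w-w*)^T H (w-w*)
w_star = rng.normal(size=4)

eta, tau = 0.01, 250               # learning rate and number of early-stopped steps
alpha = 1.0 / (eta * tau)          # the L2 coefficient suggested by Eq. 7.32

w = np.zeros(4)                    # gradient descent from the origin, stopped after tau steps
for _ in range(tau):
    w -= eta * H @ (w - w_star)

w_l2 = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)   # L2-regularized optimum, Eq. 7.8
print(np.round(w, 3), np.round(w_l2, 3))                    # similar, though not identical
```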
7.8 Parameter Tying and Parameter Sharing
TODO(Aaron): start with bayesian perspective (parameters should be close), add
practical constraints to get parameter sharing.
Thus far, in this chapter, when we have discussed adding constraints or penal-
ties to the parameters, we have always done so with respect to a fixed region or
point. For example, $L^2$ regularization (or weight decay) penalizes model param-
eters for deviating from the fixed value of zero.
Sometimes, rather than applying a penalty for deviation from a fixed point in
parameter space, we wish to express our prior knowledge about dependencies between parameters, for example that certain parameters should be close to one another.
Convolutional Neural Networks By far the most popular and extensive use
of parameter sharing occurs in the convolutional neural networks (CNNs). CNNs
will be discussed in detail in Chapter 9; here we note only how they take advantage
of parameter sharing.
CNNs, as we know them today, were originally developed for application to
computer vision (LeCun et al., 1989). Natural images have a particular statistical
property: their statistics are largely invariant under 2-dimensional translation (in the image
plane). This property is a natural consequence of the image generation process:
the same scene can be photographed twice with the center of one image being a
translation of the center of the other image (i.e. there is no natural origin in the
image plane). CNNs were designed to take this property into account by sharing
parameters across the image plane. If the image shares its statistical structure
across the image plane, then so too should the model. Feature detectors found to
be useful in one region should be generalized across all regions.
Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture.
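As a rough illustration of the savings (the layer sizes below are hypothetical and chosen only to make the point), compare the parameter count of a fully connected mapping from an image to a same-sized feature map with that of a single shared convolution kernel:

\begin{verbatim}
# Illustrative parameter-count comparison; sizes are arbitrary examples.
height, width = 256, 256

# Fully connected layer: every output unit owns a weight for every input pixel.
dense_params = (height * width) ** 2        # 65,536^2, roughly 4.3 billion weights

# Convolution with one shared 3x3 kernel: the same 9 weights are reused at every
# spatial position, which is the parameter sharing described above.
conv_params = 3 * 3

print(f"fully connected: {dense_params:,} parameters")
print(f"convolutional  : {conv_params} parameters (shared across positions)")
\end{verbatim}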
7.9 Sparse Representations
TODO(Aaron) Most deep learning models have some concept of representations.
7.10 Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization
error by combining several models (Breiman, 1994). The idea is to train several
different models separately, then have all of the models vote on the output for
test examples. This is an example of a general strategy in machine learning
called model averaging. Techniques employing this strategy are known as ensemble
methods.
The reason that model averaging works is that different models will usually
make different errors on the test set to some extent.
Consider for example a set of $k$ regression models. Suppose that each model makes an error $\epsilon_i$ on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances $E[\epsilon_i^2] = v$ and covariances $E[\epsilon_i \epsilon_j] = c$. Then the error made by the average prediction of all the ensemble models is $\frac{1}{k}\sum_i \epsilon_i$. The expected squared error is
\[
E\left[\left(\frac{1}{k}\sum_i \epsilon_i\right)^{2}\right]
= \frac{1}{k^{2}}\, E\left[\sum_i \left(\epsilon_i^{2} + \sum_{j \neq i} \epsilon_i \epsilon_j\right)\right]
= \frac{1}{k} v + \frac{k-1}{k} c.
\]
In the case where the errors are perfectly correlated and $c = v$, this reduces to $v$, and the model averaging does not help at all. But in the case where the errors are perfectly uncorrelated and $c = 0$, the expected squared error of the ensemble is only $\frac{1}{k} v$. This means that the expected squared error of the ensemble is inversely proportional to the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than any of its members.
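The formula above is easy to check with a quick Monte Carlo simulation; the values of $k$, $v$ and $c$ below are arbitrary and not taken from any experiment in the text:

\begin{verbatim}
# Monte Carlo check of E[(1/k sum_i eps_i)^2] = v/k + (k-1)/k * c.
import numpy as np

rng = np.random.default_rng(0)
k, v, c = 10, 1.0, 0.25                     # ensemble size, variance, covariance

# Covariance matrix with v on the diagonal and c everywhere else.
cov = np.full((k, k), c) + (v - c) * np.eye(k)
errors = rng.multivariate_normal(np.zeros(k), cov, size=200_000)

empirical = np.mean(errors.mean(axis=1) ** 2)
predicted = v / k + (k - 1) / k * c
print(f"empirical {empirical:.4f}  vs  predicted {predicted:.4f}")   # both near 0.325
\end{verbatim}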
Different ensemble methods construct the ensemble of models in different
ways. For example, each member of the ensemble could be formed by training a
completely different kind of model using a different algorithm or cost function.
Bagging is a method that allows the same kind of model and same kind of training
algorithm and cost function to be reused several times.
Specifically, bagging involves constructing k different datasets. Each dataset
has the same number of examples as the original dataset, but each dataset is
constructed by sampling with replacement from the original dataset. This means
that, with high probability, each dataset is missing some of the examples from
the original dataset and also contains several duplicate examples. Model i is then
trained on dataset i. The differences between which examples are included in
each dataset result in differences between the trained models. See Fig. 7.5 for an
example.
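A minimal sketch of this resampling loop is shown below; train_model and the models' predict method are hypothetical stand-ins for whatever learner and prediction rule are being bagged:

\begin{verbatim}
# Sketch of bagging: k bootstrap resamples, k trained models, averaged predictions.
import numpy as np

def bag(x, y, train_model, k, seed=0):
    """Train k models, each on a dataset sampled with replacement from (x, y)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)    # some examples repeat, others are omitted
        models.append(train_model(x[idx], y[idx]))
    return models

def bagged_predict(models, x_test):
    """Average the ensemble members' predictions on a test input."""
    return np.mean([m.predict(x_test) for m in models], axis=0)
\end{verbatim}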
[Figure 7.5 panels: the original dataset, a first and a second resampled dataset, and the ’8’ detectors learned by the first and second ensemble members.]
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an ’8’ detector
on the dataset depicted above, containing an ’8’, a ’6’, and a ’9’. Suppose we make
two different resampled datasets. The bagging training procedure is to construct each of
these datasets by sampling with replacement. The first dataset omits the ’9’ and repeats
the ’8’. On this dataset, the detector learns that a loop on top of the digit corresponds
to an ’8’. On the second dataset, we repeat the ’9’ and omit the ’6’. In this case, the
detector learns that a loop on the bottom of the digit corresponds to an ’8’. Each of these
individual classification rules is brittle, but if we average their output then the detector
is robust, achieving maximal confidence only when both loops of the ’8’ are present.
Neural networks reach a wide enough variety of solution points that they can
often benefit from model averaging even if all of the models are trained on the same
dataset. Differences in random initialization, random selection of minibatches,
differences in hyperparameters, or different outcomes of non-deterministic imple-
mentations of neural networks are often enough to cause different members of the
ensemble to make partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing
generalization error. Its use is usually discouraged when benchmarking algorithms
for scientific papers, because any machine learning algorithm can benefit substan-
tially from model averaging at the price of increased computation and memory.
For this reason, benchmark comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averag-
ing over dozens of models. A recent prominent example is the Netflix Grand
Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the en-
semble more regularized than the individual models. For example, a technique
called boosting constructs an ensemble with higher capacity than the individual
models.
7.11 Dropout
Because deep models have a high degree of expressive power, they are capable of
overfitting significantly. While this problem can be solved by using a very large
dataset, large datasets are not always available. Dropout (Srivastava et al., 2014)
provides a computationally inexpensive but powerful method of regularizing a
broad family of models.
Dropout can be thought of as a method of making bagging practical for neu-
ral networks. Bagging involves training multiple models, and evaluating multiple
models on each test example. This seems impractical when each model is a neural
network, since training and evaluating a neural network is costly in terms of run-
time and storing a neural network is costly in terms of memory. Dropout provides
an inexpensive approximation to training and evaluating a bagged ensemble of
exponentially many neural networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that
can be formed by removing units from an underlying base network. In most mod-
ern neural networks, based on a series of affine transformations and nonlinearities,
we can effectively remove a unit from a network by multiplying its state by zero.
This procedure requires some slight modification for models such as radial basis
function networks, which take the difference between the unit’s state and some
reference value. Here, we will present the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially modified to work with other operations that remove a unit from the network.
TODO– describe training algorithm, with reference to bagging
TODO– include figures from IG's job talk
TODO– training doesn't rely on the model being probabilistic
TODO– describe inference algorithm, with reference to bagging
TODO– inference does rely on the model being probabilistic, and specifically, exponential family?
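Although the training and inference procedures are left as TODOs above, a minimal sketch of the two standard ingredients, sampling a binary mask during training and weight scaling at inference time, might look as follows. The one-hidden-layer network and the keep probability of 0.5 are illustrative assumptions, not the book's pseudocode; scaling the hidden activations by the keep probability is equivalent to halving the outgoing weights.

\begin{verbatim}
# Sketch of dropout training and weight-scaling inference for a tiny network.
import numpy as np

rng = np.random.default_rng(0)

def forward_train(x, W1, b1, W2, b2, keep_prob=0.5):
    h = np.maximum(0, W1 @ x + b1)            # hidden layer (ReLU)
    mask = rng.random(h.shape) < keep_prob    # sample one random sub-network
    h = h * mask                              # "remove" dropped units: multiply state by zero
    return W2 @ h + b2                        # output of this ensemble member

def forward_inference(x, W1, b1, W2, b2, keep_prob=0.5):
    h = np.maximum(0, W1 @ x + b1)
    return W2 @ (h * keep_prob) + b2          # weight scaling: all units, scaled down
\end{verbatim}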
For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with $n$ input variables represented by the vector $v$:
\[
P(y = y \mid v) = \mathrm{softmax}\left(W^{\top} v + b\right)_{y}.
\]
We can index into the family of sub-models by element-wise multiplication of the input with a binary vector $d$:
\[
P(y = y \mid v; d) = \mathrm{softmax}\left(W^{\top} (d \odot v) + b\right)_{y}.
\]
The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members' predictions:
\[
P_{\mathrm{ensemble}}(y = y \mid v) = \frac{\tilde{P}_{\mathrm{ensemble}}(y = y \mid v)}{\sum_{y'} \tilde{P}_{\mathrm{ensemble}}(y = y' \mid v)} \tag{7.33}
\]
where
\[
\tilde{P}_{\mathrm{ensemble}}(y = y \mid v) = \sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} P(y = y \mid v; d)}.
\]
To see that the weight scaling rule is exact, we can simplify $\tilde{P}_{\mathrm{ensemble}}$:
\begin{align*}
\tilde{P}_{\mathrm{ensemble}}(y = y \mid v) &= \sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} P(y = y \mid v; d)} \\
&= \sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} \mathrm{softmax}\left(W^{\top}(d \odot v) + b\right)_{y}} \\
&= \sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} \frac{\exp\left(W_{y,:}^{\top}(d \odot v) + b_{y}\right)}{\sum_{y'} \exp\left(W_{y',:}^{\top}(d \odot v) + b_{y'}\right)}} \\
&= \frac{\sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} \exp\left(W_{y,:}^{\top}(d \odot v) + b_{y}\right)}}{\sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} \sum_{y'} \exp\left(W_{y',:}^{\top}(d \odot v) + b_{y'}\right)}}
\end{align*}
Because $\tilde{P}$ will be normalized, we can safely ignore multiplication by factors that are constant with respect to $y$:
\begin{align*}
\tilde{P}_{\mathrm{ensemble}}(y = y \mid v) &\propto \sqrt[2^{n}]{\prod_{d \in \{0,1\}^{n}} \exp\left(W_{y,:}^{\top}(d \odot v) + b_{y}\right)} \\
&= \exp\left(\frac{1}{2^{n}} \sum_{d \in \{0,1\}^{n}} W_{y,:}^{\top}(d \odot v) + b_{y}\right) \\
&= \exp\left(\frac{1}{2} W_{y,:}^{\top} v + b_{y}\right)
\end{align*}
Substituting this back into equation 7.33, we obtain a softmax classifier with weights $\frac{1}{2}W$.
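This exactness is easy to verify numerically for a model small enough that all $2^{n}$ masks can be enumerated; the sizes and random parameters below are arbitrary:

\begin{verbatim}
# Numeric check: renormalized geometric mean over all dropout masks equals
# a single softmax pass with weights W / 2 (softmax regression only).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 4, 3
W = rng.normal(size=(n, n_classes))
b = rng.normal(size=n_classes)
v = rng.normal(size=n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Geometric mean of the sub-models' predictive distributions, then renormalize.
log_probs = np.array([np.log(softmax(W.T @ (np.array(d) * v) + b))
                      for d in itertools.product([0, 1], repeat=n)])
geo = np.exp(log_probs.mean(axis=0))
ensemble = geo / geo.sum()

# Weight scaling rule: one forward pass with the weights divided by two.
scaled = softmax((W / 2).T @ v + b)

print(np.allclose(ensemble, scaled))          # True
\end{verbatim}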
The weight scaling rule is also exact in other settings, including regression
networks with conditionally normal outputs, and deep networks that have hidden
layers without nonlinearities. However, the weight scaling rule is only an approxi-
mation for deep models that have non-linearities, and this approximation has not
been theoretically characterized. Fortunately, it works well, empirically. Good-
fellow et al. (2013a) found empirically that for deep networks with nonlinearities,
the weight scaling rule can work better (in terms of classification accuracy) than
Monte Carlo approximations to the ensemble predictor, even if the Monte Carlo
approximation is allowed to sample up to 1,000 sub-networks.
Srivastava et al. (2014) showed that dropout is more effective than other stan-
dard computationally inexpensive regularizers, such as weight decay, filter norm
constraints, and sparse activity regularization. Dropout may also be combined
with more expensive forms of regularization such as unsupervised pretraining to
yield an improvement. As of this writing, the state of the art classification error
rate on the permutation invariant MNIST dataset (not using any prior knowledge
about images) is attained by a classifier that uses both dropout regularization and
deep Boltzmann machine pretraining. However, combining dropout with unsu-
pervised pretraining has not become a popular strategy for larger models and
more challenging datasets.
One advantage of dropout is that it is very computationally cheap. Using
dropout during training requires only O(n) computation per example per update,
to generate n random binary numbers and multiply them by the state. Depending
on the implementation, it may also require O(n) memory to store these binary
numbers until the backpropagation stage. Running inference in the trained model
has the same cost per-example as if dropout were not used, though we must pay
the cost of dividing the weights by 2 once before beginning to run inference on
examples.
One significant advantage of dropout is that it does not significantly limit
the type of model or training procedure that can be used. It works well with
nearly any model that uses a distributed representation and can be trained with
stochastic gradient descent. This includes feedforward neural networks, proba-
bilistic models such as restricted Boltzmann machines (Srivastava et al., 2014),
and recurrent neural networks (Pascanu et al., 2014a). This is very different from
many other neural network regularization strategies, such as those based on un-
supervised pretraining or semi-supervised learning. Such regularization strategies
often impose restrictions such as not being able to use rectified linear units or
max pooling. Often these restrictions incur enough harm to outweigh the benefit
provided by the regularization strategy.
Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. This is because
the size of the optimal model (in terms of validation set error) is usually much
larger, and because the number of steps required to reach convergence increases.
This is of course to be expected from a regularization method, but it does mean
that for very large datasets (as a rough rule of thumb, dropout is unlikely to be
beneficial when more than 15 million training examples are available, though the
exact boundary may be highly problem dependent) it is often preferable not to
use dropout at all, just to speed training and reduce the computational cost of
the final model.
When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alter-
native Splicing Dataset (Xiong et al., 2011) where fewer than 5,000 examples are
available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.
TODO– ”Dropout Training as Adaptive Regularization”? (Wager et al., 2013)
TODO– perspective as L2 regularization
TODO– connection to adagrad?
TODO– semi-supervised variant
TODO– Baldi paper (Baldi and Sadowski, 2013)
TODO– DWF paper (Warde-Farley et al., 2014)
TODO– using geometric mean is not a problem
TODO– dropout boosting, it's not just noise robustness
TODO– what was the conclusion about mixability (DWF)?
The stochasticity used while training with dropout is not a necessary part
of the model’s success. It is just a means of approximating the sum over all
sub-models. Wang and Manning (2013) derived analytical approximations to
this marginalization. Their approximation, known as fast dropout, resulted in faster convergence due to the reduced stochasticity in the computation of
the gradient. This method can also be applied at test time, as a more principled
(but also more computationally expensive) approximation to the average over
all sub-networks than the weight scaling approximation. Fast dropout has been
used to match the performance of standard dropout on small neural network
problems, but has not yet yielded a significant improvement or been applied to a
large problem.
Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see Section 9.3) for building ensembles
of convolutional networks with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.
TODO–improved performance with maxout units and probably ReLUs
7.12 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling
the examples (i.e., constraints) arising out of several tasks.
Figure 7.6 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting $Y_i$ given $X$) share the same input $X$, as well as some intermediate-level representation capturing a common pool of factors. The model can generally be divided into two kinds of parts and associated parameters:
1. Task-specific parameters (which only benefit from the examples of their task
to achieve good generalization). Example: upper layers of a neural network,
in Figure 7.6.
2. Generic parameters, shared across all the tasks (which benefit from the
pooled data of all the tasks). Example: lower layers of a neural network, in
Figure 7.6.
Improved generalization and generalization error bounds (Baxter, 1995) can be
achieved because of the shared parameters, for which statistical strength can be
greatly improved (in proportion with the increased number of examples for the
shared parameters, compared to the scenario of single-task models). Of course this
will happen only if some assumptions about the statistical relationship between
the different tasks are valid, i.e., that there is something shared across some of
the tasks.
From the point of view of deep learning, the underlying prior regarding the
data is the following: among the factors that explain the variations observed in
the data associated with the different tasks, some are shared across two or more
tasks.
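A minimal sketch of this division into generic and task-specific parameters, with hypothetical layer sizes and no training loop, might look like:

\begin{verbatim}
# Sketch of a shared lower layer feeding two task-specific heads (Figure 7.6 style).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_shared, n_y1, n_y2 = 20, 50, 3, 1     # illustrative sizes only

# Generic parameters: benefit from the pooled examples of all tasks.
W_shared = 0.1 * rng.normal(size=(n_shared, n_in))
# Task-specific parameters: each head only sees its own task's examples.
W_task1 = 0.1 * rng.normal(size=(n_y1, n_shared))
W_task2 = 0.1 * rng.normal(size=(n_y2, n_shared))

def predict(x, task):
    h = np.maximum(0, W_shared @ x)           # shared intermediate representation
    return (W_task1 if task == 1 else W_task2) @ h
\end{verbatim}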
7.13 Adversarial Training
In many cases, neural networks have begun to reach human performance when evaluated on an i.i.d. test set. It is natural, therefore, to wonder whether these models have obtained a true human-level understanding of these tasks. In order to probe the level of understanding a network has of the underlying task, we can search for examples that the model misclassifies. Szegedy et al. (2014b) found that even neural networks that perform at human-level accuracy have a nearly 100% error rate on examples that are intentionally constructed by using an optimization procedure to search for an input $x'$ near a data point $x$ such that the model output is very different at $x'$. In many cases, $x'$ can be so similar to $x$ that a human observer cannot tell the difference between the original example and the adversarial example, but the network can make highly different predictions.
See Fig. 7.7 for an example.
Adversarial examples have many implications, for example, in computer secu-
rity, that are beyond the scope of this chapter. However, they are interesting in
the context of regularization because one can reduce the error rate on the original
i.i.d. test set by training on adversarially perturbed examples from the training
set (Szegedy et al., 2014b).
Goodfellow et al. (2014b) showed that one of the primary causes of these ad-
versarial examples is excessive linearity. Neural networks are built out of primarily
linear building blocks, and in some empirical experiments the overall function they
implement proves to be highly linear as a result. These linear functions are easy
to optimize. Unfortunately, the value of a linear function can change very rapidly
if it has numerous inputs. If we change each input by $\epsilon$, then a linear function with weights $w$ can change by as much as $\epsilon \|w\|_1$, which can be a very large amount if $w$ is high-dimensional. Adversarial training discourages this highly sensitive
locally linear behavior by encouraging the network to be locally constant in the
neighborhood of the training data. This can be seen as a way of introducing the
local smoothness prior into supervised neural nets.
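A minimal sketch of constructing such a perturbed example, using the fast gradient sign construction shown in Fig. 7.7, is given below; grad_wrt_input is a hypothetical stand-in for whatever routine computes the gradient of the cost $J(\theta, x, y)$ with respect to the input:

\begin{verbatim}
# Sketch of a fast-gradient-sign adversarial perturbation.
import numpy as np

def adversarial_example(x, y, grad_wrt_input, epsilon=0.007):
    """Move x a small step in the direction that locally increases the cost most."""
    return x + epsilon * np.sign(grad_wrt_input(x, y))

# Adversarial training mixes such perturbed examples into the training set,
# encouraging the network to be locally constant around each training point.
\end{verbatim}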
This phenomenon helps to illustrate the power of using a large function family
in combination with aggressive regularization. Purely linear models, like logistic
regression, are not able to resist adversarial examples because they are forced to
be linear. Neural networks are able to represent functions that can range from
nearly linear to nearly locally constant and thus have the flexibility to capture
linear trends in the training data while still learning to resist local perturbation.
[Figure 7.6 diagram: input $X$, hidden units $h_1$, $h_2$, $h_3$, and task outputs $Y_1$ and $Y_2$.]
Figure 7.6: Multi-task learning can be cast in several ways in deep learning frameworks
and this figure illustrates the common situation where the tasks share a common input but
involve different target random variables. The lower layers of a deep network (whether
it is supervised and feedforward or includes a generative component with downward
arrows) can be shared across such tasks, while task-specific parameters can be learned
on top of a shared representation (associated respectively with h
1
and h
2
in the figure).
The underlying assumption is that there exist a common pool of factors that explain the
variations in the input X, while each task is associated with a subset of these factors. In
the figure, it is additionally assumed that top-level hidden units are specialized to each
task, while some intermediate-level representation is shared across all tasks. Note that
in the unsupervised learning context, it makes sense for some of the top-level factors to
be associated with none of the output tasks (h
3
): these are the factors that explain some
of the input variations but are not relevant for these tasks.
[Figure 7.7 panels: $x$, classified as “panda” with 57.7% confidence; $\mathrm{sign}(\nabla_x J(\theta, x, y))$, classified as “nematode” with 8.2% confidence; and $x + .007 \times \mathrm{sign}(\nabla_x J(\theta, x, y))$, classified as “gibbon” with 99.3% confidence.]
Figure 7.7: A demonstration of adversarial example generation applied to GoogLeNet
(Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose
elements are equal to the sign of the elements of the gradient of the cost function with
respect to the input, we can change GoogLeNet’s classification of the image. Reproduced
with permission from Goodfellow et al. (2014b).