Chapter 7
Regularization
A central problem in machine learning is how to make an algorithm that will perform
well not just on the training data, but also on new inputs. The main strategy for
achieving good generalization is known as regularization. Regularization consists of
putting extra constraints on a machine learning model, such as restrictions of parameter
values or extra terms in the cost function, that are not designed to help fit the training
set. If chosen carefully, these extra constraints can lead to improved performance on the
test set, either by encoding prior knowledge into the model, or by forcing the model to
consider multiple hypotheses that explain the training data. Sometimes regularization
also helps to make an underdetermined problem determined.
This chapter builds on the concepts of generalization, overfitting, underfitting, bias
and variance introduced in Chapter 5. If you are not already familiar with these notions,
please refer to that chapter before continuing with the more advanced material presented
here.
Simply put, regularizers work by trading increased bias for reduced variance. An
effective regularizer is one that makes a profitable trade, that is, it reduces variance
significantly without overly increasing the bias.
When we discussed generalization and overfitting in Chapter 5, we focused on three situations: the model family being trained either (1) excluded the true data generating process, corresponding to underfitting and inducing bias, (2) matched the true data generating process (the "Goldilocks" model space), or (3) was more complex than the generating process, the regime where variance dominates the estimation error (as measured by the MSE; see Section 5.5).
Note that an overly complex model family does not necessarily include (or even
come close to) the target function or the true data generating process. In practice, we
almost never have access to the true data generating process so we can never know if
our model family being estimated includes the generating process or not. But since, in
deep learning, we are often trying to work with data such as images, audio sequences
and text, we can probably safely assume that the model family we are training does not
include the data generating process. We can assume that to some extent we are
always trying to fit a square peg (the data generating process) into a round hole (our
model family), using the data to do so as best we can.
What this means is that controlling the complexity of the model is not going to
be a simple question of finding the model of the right size, i.e. the right number of
parameters. Instead, we might find (and in practical deep learning scenarios, we almost
always do find) that the best fitting model, in the sense of minimizing generalization
error, is one that possesses a large number of parameters that are not entirely free to
span their domain.
Regularization is a method of limiting the domain of these parameters in such a way
as to limit the capacity of the model. With respect to minimizing the empirical risk,
regularization induces bias in an attempt to limit variance that results from using a
finite dataset.
As we will see, there are a great many forms of regularization available to the deep
learning practitioner. In fact, developing more effective regularizers has been one of the
major research efforts in the field.
Most machine learning tasks can be viewed in terms of learning to represent a function f̂(x) parametrized by a vector of parameters θ. The data consists of inputs x^(i) and (for some tasks) targets y^(i) for i ∈ {1, . . . , n}. In the case of classification, each y^(i) is an integer class label in {1, . . . , k}. For regression tasks, each y^(i) is a real number. In the case of a density estimation task, there are no targets. We may group these examples into a design matrix X and a vector of targets y.
In deep learning, we are mainly interested in the case where f̂(x) has a large number of parameters and as a result possesses a high capacity to fit relatively complicated functions. This means that deep learning algorithms either require very large datasets so that the data can fully specify such complicated models, or they require careful regularization.
7.1 Classical Regularization: Parameter Norm Penalty
Regularization has been in use for decades, long before the advent of deep learning. Traditional statistical and machine learning models typically represented simpler functions. Because the functions themselves had less capacity, the regularization did not need to be as sophisticated. We use the term classical regularization to refer to the techniques used in the general machine learning and statistics literature.
Most classical regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the loss function J. We denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)     (7.1)

where α is a hyperparameter that weighs the contribution of the norm penalty term Ω relative to the standard loss function J(θ; X, y). The hyperparameter α should be a non-negative real number, with α = 0 corresponding to no regularization, and larger values of α corresponding to more regularization.
When our training algorithm minimizes the regularized loss function J̃, it will decrease both the original loss J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm Ω can result in different solutions being preferred.
For models such as linear or logistic regression, where θ = [w^⊤, b]^⊤, we typically choose Ω(θ) = (1/2)||w||_2^2 or Ω(θ) = ||w||_1. That is, we leave the biases unregularized¹, and penalize half² the squared L^2 norm or the L^1 norm.
In the following sections, we discuss the effects of the various norms when used as
penalties on the weights.
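As a concrete illustration of Eq. 7.1 (not part of the original text), the following minimal NumPy sketch forms the regularized objective J̃ = J + αΩ for a linear model with squared error loss and Ω(w) = (1/2)||w||_2^2, then takes a single gradient step; the data, model, and step size are arbitrary placeholders.

import numpy as np

def regularized_loss_and_grad(w, X, y, alpha):
    # J(w; X, y): mean squared error of a linear model (bias omitted for brevity)
    residual = X @ w - y
    J = 0.5 * np.mean(residual ** 2)
    grad_J = X.T @ residual / len(y)
    # Omega(w) = 0.5 * ||w||_2^2, weighted by the hyperparameter alpha
    J_tilde = J + alpha * 0.5 * np.dot(w, w)
    grad_tilde = grad_J + alpha * w
    return J_tilde, grad_tilde

# One gradient descent step on the regularized objective
rng = np.random.RandomState(0)
X, y = rng.randn(20, 5), rng.randn(20)
w = np.zeros(5)
loss, grad = regularized_loss_and_grad(w, X, y, alpha=0.1)
w -= 0.01 * grad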
7.1.1 L^2 Parameter Regularization
One of the simplest and most common kinds of classical regularization is the L^2 parameter norm penalty³, Ω(θ) = (1/2)||w||_2^2. This form of regularization is also known as ridge regression. It is equally applicable to neural networks, where the penalty is equal to the sum of the squared L^2 norms of all of the weight vectors. In the context of neural networks, this is known as weight decay. Typically, for neural networks, we use a different coefficient α for the weights at each layer of the network. This coefficient should be tuned using a validation set.
We can gain some insight into the behaviour of weight decay regularization by considering the gradient of the regularized loss function. To simplify the presentation, we assume a linear model with no bias term, so θ is just w. Such a model has the following gradient:

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)     (7.2)
We will further simplify the analysis by considering a quadratic approximation to the loss function in the neighborhood of the empirically optimal value of the weights w*. (If the loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.)

Ĵ(w) = J(w*) + (1/2)(w − w*)^⊤ H (w − w*)     (7.3)

where H is the Hessian matrix of J with respect to w evaluated at w*. There is no first-order term in this quadratic approximation, because w* is defined to be a minimum, where the gradient vanishes.
¹ The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact, and requires observing both variables in a variety of conditions to fit well. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Regularizing the biases can introduce a significant amount of underfitting. For example, making sparse features usually requires being able to set the biases to significantly negative values.
² The 1/2 in the L^2 penalty may seem arbitrary. Conceptually, it is not necessary and could be folded into the α hyperparameter. However, the 1/2 results in a simpler gradient (w instead of 2w) and simplifies the interpretation of the penalty as a Gaussian prior on w.
³ More generally, we could consider regularizing the parameters toward a parameter value w^(o) that is perhaps not zero. In that case the L^2 penalty term would be Ω(θ) = (1/2)||w − w^(o)||_2^2 = (1/2) Σ_i (w_i − w_i^(o))^2. Since it is far more common to consider regularizing the model parameters toward zero, we will focus on this special case in our exposition.
Likewise, because w* is a minimum, we can conclude that H is positive semi-definite.

∇_w Ĵ(w) = H(w − w*).     (7.4)
If we replace the exact gradient in equation 7.2 with the approximate gradient in equation 7.4, we can write an equation for the location of the minimum of the regularized loss function:

αw + H(w − w*) = 0     (7.5)
(H + αI)w = Hw*     (7.6)
w̃ = (H + αI)^{−1} Hw*     (7.7)
The presence of the regularization term moves the optimum from w* to w̃. As α approaches 0, w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQ^⊤. Applying the decomposition to equation 7.7, we obtain:
w̃ = (QΛQ^⊤ + αI)^{−1} QΛQ^⊤ w*
   = [Q(Λ + αI)Q^⊤]^{−1} QΛQ^⊤ w*
   = Q(Λ + αI)^{−1} Λ Q^⊤ w*,
Q^⊤ w̃ = (Λ + αI)^{−1} Λ Q^⊤ w*.     (7.8)
If we interpret Q^⊤ w̃ as rotating our parameters w into the basis defined by the eigenvectors Q of H, then we see that the effect of weight decay is to rescale the coefficients of the eigenvectors. Specifically, the ith component is rescaled by a factor of λ_i / (λ_i + α). (You may wish to review how this kind of scaling works, first explained in Fig. 2.3.)

Along the directions where the eigenvalues of H are relatively large, for example, where λ_i ≫ α, the effect of regularization is relatively small. However, components with λ_i ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in Fig. 7.1.
Only directions along which the parameters contribute significantly to reducing the loss are preserved relatively intact. In directions that do not contribute to reducing the loss, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training. This effect of suppressing contributions to the parameter vector along these principal directions of the Hessian H is captured in the concept of the effective number of parameters, defined to be

γ = Σ_i λ_i / (λ_i + α).     (7.9)
Figure 7.1: An illustration of the effect of L^2 (or weight decay) regularization on the value of the optimal w.
As α is increased, the effective number of parameters decreases.
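To make this analysis concrete, here is a small numerical check (not from the original text) that the regularized optimum w̃ = (H + αI)^{−1}Hw* rescales each eigen-component of w* by λ_i/(λ_i + α), and that computes the effective number of parameters γ of Eq. 7.9; the Hessian is an arbitrary synthetic positive definite matrix chosen only for illustration.

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(4, 4)
H = A @ A.T + 0.1 * np.eye(4)   # an arbitrary positive definite "Hessian"
w_star = rng.randn(4)
alpha = 0.5

# Minimum of the regularized quadratic: (H + alpha*I) w = H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same answer obtained by rescaling eigen-components by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))
assert np.allclose(w_tilde, w_tilde_eig)

# Effective number of parameters (Eq. 7.9)
gamma = np.sum(lam / (lam + alpha))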
Another way to gain some intuition for the effect of L^2 regularization is to consider its effect on linear regression. The unregularized objective function for linear regression is the sum of squared errors:

(Xw − y)^⊤ (Xw − y).

When we add L^2 regularization, the objective function changes to

(Xw − y)^⊤ (Xw − y) + (1/2) α w^⊤ w.
This changes the normal equations for the solution from

w = (X^⊤ X)^{−1} X^⊤ y

to

w = (X^⊤ X + αI)^{−1} X^⊤ y.
We can see that L^2 regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
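A minimal sketch of this effect, with synthetic data chosen only for illustration: the ordinary least squares solution and the L^2-regularized solution differ exactly by the αI term added to X^⊤X, and the ridge weights are shrunk relative to the unregularized ones.

import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)
alpha = 0.3

w_ols   = np.linalg.solve(X.T @ X, X.T @ y)                      # unregularized
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)  # L2-regularized
# The ridge weights are shrunk relative to the OLS weights.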
7.1.2 L^1 Regularization
While L^2 regularization is the most common form of regularization for model parameters such as the weights of a neural network, it is not the only form of regularization in common usage. L^1 regularization is another kind of penalty on model parameters that behaves differently from L^2 regularization.
Formally, L^1 regularization on the model parameters w is defined as

Ω(θ) = ||w||_1 = Σ_i |w_i|,     (7.10)

that is, as the sum of absolute values of the individual parameters.⁴ We will now consider the effect of L^1 regularization on the simple linear model, with no bias term, that we considered in our analysis of L^2 regularization. In particular, we are interested in delineating the differences between the L^1 and L^2 forms of regularization. If we consider the gradient (actually, the sub-gradient) of the regularized objective function J̃(w; X, y), we have:

∇_w J̃(w; X, y) = β sign(w) + ∇_w J(w; X, y)     (7.11)

where sign(w) is simply the sign of w applied element-wise, and β plays the role of the penalty coefficient (analogous to α above).
By inspecting Eqn. 7.11, we can see immediately that the effect of L^1 regularization is quite different from that of L^2 regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with w; instead it is a constant factor with a sign equal to sign(w). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(w; X, y) as we did for L^2 regularization. Instead, the solutions tend to be much more aligned to the basis in which the problem is embedded.
For the sake of comparison with L^2 regularization, we will again consider a simplified setting of a quadratic approximation to the loss function in the neighborhood of the empirical optimum w*. (Once again, if the loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.) The gradient of this approximation is given by

∇_w Ĵ(w) = H(w − w*),     (7.12)

where, again, H is the Hessian matrix of J with respect to w evaluated at w*. We will also make the further simplifying assumption that the Hessian is diagonal, H = diag([γ_1, . . . , γ_N]), where each γ_i > 0. With this rather restrictive assumption, the minimum of the L^1 regularized loss function decomposes into a system of independent one-dimensional problems, one for each dimension i, each of the form

(1/2) γ_i (w_i − w*_i)^2 + β |w_i|,

which admits an optimal solution (for each dimension i) of the following form:

w_i = sign(w*_i) max(|w*_i| − β/γ_i, 0).
⁴ As with L^2 regularization, we could consider regularizing the parameters toward a value that is not zero, but instead some parameter value w^(o). In that case the L^1 regularization would introduce the term Ω(θ) = ||w − w^(o)||_1 = Σ_i |w_i − w_i^(o)|.
Figure 7.2: An illustration of the effect of L^1 regularization (RIGHT) on the value of the optimal w, in comparison to the effect of L^2 regularization (LEFT).
Let’s consider the situation where w*_i > 0 for all i; there are two possible outcomes. Case 1: w*_i ≤ β/γ_i. Here the optimal value of w_i under the regularized objective is simply w_i = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by the L^1 regularization, which pushes the value of w_i to zero. Case 2: w*_i > β/γ_i. Here the regularization does not move the optimal value of w_i to zero, but instead shifts it toward zero by a distance equal to β/γ_i. This is illustrated in Fig. 7.2.
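The contrast between the two penalties can be seen in a few lines of NumPy. Assuming the diagonal-Hessian setting above, with illustrative values for w*, γ, α, and β, the L^1 solution is a soft-thresholding operation that zeroes out small coordinates exactly, while the L^2 solution only rescales them.

import numpy as np

w_star = np.array([3.0, 0.4, -1.5, 0.05])   # unregularized optima per coordinate
gamma  = np.array([2.0, 1.0, 0.5, 4.0])     # diagonal Hessian entries
beta, alpha = 1.0, 1.0

# L1: soft thresholding, exactly zeroing coordinates with |w*_i| <= beta / gamma_i
w_l1 = np.sign(w_star) * np.maximum(np.abs(w_star) - beta / gamma, 0.0)

# L2: multiplicative shrinkage, nonzero whenever w*_i is nonzero
w_l2 = (gamma / (gamma + alpha)) * w_star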
In comparison to L^2 regularization, L^1 regularization results in a solution that is more sparse. Sparsity in this context means that some parameters of the model attain an optimal value (under the regularized objective) of exactly zero. As we discussed, for each element i of the parameter vector, this happens when |w*_i| ≤ β/γ_i. Compare this to the situation for L^2 regularization, where (under the same assumption of a diagonal Hessian H) we get w̃_i = (γ_i / (γ_i + α)) w*_i, which is nonzero as long as w*_i is nonzero.
In Fig. 7.2, we see that even when the optimal value of a parameter is nonzero, L^1 regularization punishes small parameter values just as harshly as large ones, leading to optimal solutions in which more parameters take the value zero while the remaining parameters take larger values.
The sparsity property induced by L^1 regularization has been used extensively as a feature selection mechanism. In particular, the well-known LASSO (least absolute shrinkage and selection operator) model of Tibshirani (1995) integrates an L^1 penalty with a linear model and a least squares cost function. Finally, L^1 is known as the only norm that is both sparsifying and convex for non-degenerate problems⁵.

⁵ For degenerate problems, where more than one solution exists, L^2 regularization can find the “sparse” solution in the sense that redundant parameters shrink to zero.
7.1.3 L^∞ Regularization
7.2 Classical Regularization as Constrained Optimization
Classical regularization adds a penalty term to the training objective:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush-Kuhn-Tucker (KKT) multiplier⁶, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct the generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).
The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).
Solving this problem requires modifying both θ and α. Specifically, α must increase whenever ||θ||_p > k and decrease whenever ||θ||_p < k. However, after we have solved the problem, we can fix α* and view the problem as just a function of θ:

θ* = arg min_θ L(θ, α*) = arg min_θ J(θ; X, y) + α* Ω(θ).
This is exactly the same as the regularized training problem of minimizing J̃. Note that the value of α* does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α* depends on the form of J. We can thus think of classical regularization as imposing a constraint on the weights, but with an unknown size of the constraint region. Larger α will result in a smaller constraint region, and smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As de-
scribed in Sec. 4.4, we can modify algorithms such as stochastic gradient descent to
take a step downhill on J(θ) and then project θ back to the nearest point that satisfies
Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and
do not want to spend time searching for the value of α that corresponds to this k.
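A minimal sketch of such a projected step, assuming for concreteness that the constraint region is the L^2 ball ||w||_2 ≤ k; the gradient, learning rate, and k below are placeholders.

import numpy as np

def project_l2_ball(w, k):
    # Project w back to the nearest point satisfying ||w||_2 <= k
    norm = np.linalg.norm(w)
    return w if norm <= k else (k / norm) * w

def constrained_step(w, grad, lr, k):
    # Take a step downhill on J, then reproject onto the constraint set
    return project_l2_ball(w - lr * grad, k)

w = np.array([2.0, -3.0])
w = constrained_step(w, grad=np.array([0.5, -0.1]), lr=0.1, k=1.0)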
Another reason to use explicit constraints and reprojection rather than enforcing con-
straints with penalties is that penalties can cause non-convex optimization procedures
to get stuck in local minima corresponding to small θ. When training neural networks,
this usually manifests as neural networks that train with several “dead units”. These are
units that do not contribute much to the behavior of the function learned by the network
because the weights going into or out of them are all very small. When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights larger. (This concern about local minima obviously does not apply when J̃ is convex.)

⁶ KKT multipliers generalize Lagrange multipliers to allow for inequality constraints.
Finally, explicit constraints with reprojection can be useful because they impose some
stability on the optimization procedure. When using high learning rates, it is possible to
encounter a positive feedback loop in which large weights induce large gradients which
then induce a large update to the weights. If these updates consistently increase the size
of the weights, then θ rapidly moves away from the origin until numerical overflow occurs.
Explicit constraints with reprojection allow us to terminate this feedback loop after
the weights have reached a certain magnitude. Hinton et al. (2012) recommend using
constraints combined with a high learning rate to allow rapid exploration of parameter
space while maintaining some stability.
7.3 Regularization from a Bayesian Perspective
In Section 5.7, we briefly reviewed Bayesian inference and the maximum a posteriori (MAP) approximation to it. From this perspective, many classical regularizers correspond to particular prior distributions over the parameters: MAP estimation adds the log of the prior to the maximum likelihood objective, so choosing a specific prior induces a specific penalty term. In particular, a Gaussian prior on the weights corresponds to an L^2 penalty, and a Laplace prior corresponds to an L^1 penalty.
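As a brief sketch of this correspondence (a standard derivation, not specific to this text), MAP estimation selects

θ_MAP = arg max_θ log p(θ | X, y) = arg max_θ [ log p(y | X, θ) + log p(θ) ].

If J is the negative log likelihood and the prior on the weights is Gaussian, p(w) = N(w; 0, (1/α)I), then log p(w) = −(α/2) w^⊤ w + const, so maximizing the posterior is the same as minimizing J(θ; X, y) + (α/2)||w||_2^2. A Laplace prior with scale 1/β contributes −β||w||_1 + const, recovering the L^1 penalty.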
7.4 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly
defined.
Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X^⊤X. This is not possible whenever X^⊤X is singular. This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting X^⊤X + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix is invert-
ible. It is also possible for a problem with no closed form solution to be underdetermined.
For example, consider logistic regression applied to a problem where the classes are lin-
early separable. If a weight vector w is able to achieve perfect classification, then 2w
will also achieve perfect classification and higher likelihood. An iterative optimization
procedure like stochastic gradient descent will continually increase the magnitude of w
and, in theory, will never halt. In practice, a numerical implementation of gradient
descent will eventually reach sufficiently large weights to cause numerical overflow, at
which point its behavior will depend on how the programmer has decided to handle
values that are not real numbers.
Most forms of regularization are able to guarantee the convergence of iterative meth-
ods applied to underdetermined problems. For example, weight decay will cause gradient
descent to quit increasing the magnitude of the weights when the slope of the likelihood
is equal to the weight decay coefficient. Likewise, early stopping based on the valida-
tion set classification rate will cause the training algorithm to terminate soon after the
validation set classification accuracy has stopped increasing. Even if the problem is lin-
early separable and there is no overfitting, the validation set classification accuracy will
eventually saturate to 100%, resulting in termination of the early stopping procedure.
The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems. One way of generalizing the concept of matrix inversion to non-square matrices, the Moore-Penrose pseudoinverse, can be viewed as a form of L^2 regularization.
The Moore-Penrose pseudoinverse is defined as

X^+ = lim_{α→0} (X^⊤ X + αI)^{−1} X^⊤.
Using the Moore-Penrose pseudoinverse, we can generalize linear regression to underconstrained problems using minimal regularization by multiplying both sides of

Xw = y

by X^+ to yield

w = X^+ y.

When a true inverse for X exists, this returns the exact solution. When X is not invertible because no exact solution exists, this returns the w corresponding to the least possible mean squared error. When X is not invertible because many solutions exist, this returns the w with the minimum possible L^2 norm.
The Moore-Penrose pseudoinverse is also closely related to the singular value decomposition. Specifically, if the SVD is given by X = UΣW^⊤, then X^+ = WΣ^+U^⊤. To compute the pseudoinverse of the diagonal matrix of singular values Σ, we simply replace each non-zero element of the diagonal with its reciprocal, and leave all zero elements equal to zero.
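A small numerical check of this relationship (illustrative only): for a wide matrix X, the pseudoinverse computed via the SVD agrees with the ridge-regularized inverse for a very small α, and yields the minimum-norm solution of the underdetermined system.

import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(3, 5)          # more features than examples: X^T X is singular
y = rng.randn(3)

X_pinv = np.linalg.pinv(X)   # computed via the SVD
alpha = 1e-8
X_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T)  # (X^T X + alpha I)^-1 X^T
assert np.allclose(X_pinv, X_ridge, atol=1e-4)

w = X_pinv @ y               # minimum-norm solution of the underdetermined problem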
Because the SVD is robust to underdetermined problems resulting from too few ob-
servations or too little underlying variance, it is useful for implementing stable variants
of many closed-form linear machine learning algorithms. The stability of these algo-
rithms can be viewed as a result of applying the minimum amount of regularization
necessary to make the problem become determined.
7.5 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more
data. Of course, in practice, the amount of data we have is limited. One way to get
around this problem is to create more fake data. For some machine learning tasks, it is
reasonably straightforward to create new fake data.
This approach is easiest for classification. A classifier needs to take a complicated,
high dimensional input x and summarize it with a single category identity y. This
means that the main task facing a classifier is to be invariant to a wide variety of
transformations. We can generate new (x, y) pairs easily just by transforming the x
inputs in our training set.
This approach is not as readily applicable to many other tasks. For example, it is
difficult to generate new fake data for a density estimation task unless we have already
solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a specific clas-
sification problem: object recognition. Images are high dimensional and include an
enormous variety of factors of variation, many of which can be easily simulated. Opera-
tions like translating the training images a few pixels in each direction can often greatly
improve generalization, even if the model has already been designed to be partially
translation invariant by using convolution and pooling. Many other operations such as
rotating the image or scaling the image have also proven quite effective. One must be
careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between ’b’ and ’d’ and the difference between ’6’ and ’9’, so horizontal flips and 180° rotations
are not appropriate ways of augmenting datasets for these tasks. There are also trans-
formations that we would like our classifiers to be invariant to, but which are not easy
to perform. For example, out-of-plane rotation can not be implemented as a simple
geometric operation on the input pixels.
For many classification and even some regression tasks, the task should still be
possible to solve even if random noise is added to the input. Neural networks prove
not to be very robust to noise, however. One way to improve the robustness of neural
networks is simply to train them with random noise applied to their inputs. Dropout,
a powerful regularization strategy that will be described in Sec. 7.11, can be seen as a
process of constructing new inputs by multiplying by noise.
In a multilayer network, it can often be beneficial to apply transformations such as
noise to the hidden units, as well as the inputs. This can be viewed as augmenting the
dataset as seen by the deeper layers.
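A minimal sketch of input-level augmentation, assuming images are stored as 2-D NumPy arrays; the shift range and noise scale are arbitrary placeholders, and each call produces a new training input with the original label left unchanged.

import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.05):
    # Randomly translate by a few pixels and add Gaussian noise;
    # the label is unchanged, so each call yields a new (x, y) pair.
    dx, dy = rng.randint(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(image, dx, axis=0), dy, axis=1)
    return shifted + noise_std * rng.randn(*image.shape)

rng = np.random.RandomState(0)
image = rng.rand(28, 28)          # a stand-in for one training image
new_example = augment(image, rng)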
When reading machine learning research papers, it is important to take the effect of
dataset augmentation into account. Often, hand-designed dataset augmentation schemes
can dramatically reduce the generalization error of a machine learning technique. It is
important to look for controlled experiments. When comparing machine learning al-
gorithm A and machine learning algorithm B, it is necessary to make sure that both
algorithms were evaluated using the same hand-designed dataset augmentation schemes.
If algorithm A performs poorly with no dataset augmentation and algorithm B performs
well when combined with numerous synthetic transformations of the input, then it is
likely the synthetic transformations and not algorithm B itself that cause the improved
performance. Sometimes the line is blurry, such as when a new machine learning al-
gorithm involves injecting noise into the inputs. In these cases, it’s best to consider how generally applicable the new algorithm is, and to make sure that pre-existing algorithms are re-run in conditions as similar as possible.
7.6 Classical Regularization as Noise Robustness
Some classical regularization techniques can be derived in terms of training on noisy inputs. For example, Bishop (1995) showed that, for a linear model trained with mean squared error, adding Gaussian noise to the inputs is equivalent to Tikhonov regularization, that is, an L^2 penalty on the weights.
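As a brief sketch of why this holds for a linear model ŷ = w^⊤x with additive noise ε satisfying E[ε] = 0 and E[εε^⊤] = σ^2 I:

E_ε[(w^⊤(x + ε) − y)^2] = (w^⊤x − y)^2 + 2 E_ε[w^⊤ε](w^⊤x − y) + E_ε[(w^⊤ε)^2]
                        = (w^⊤x − y)^2 + σ^2 w^⊤w,

so minimizing the expected squared error under input noise is the same as minimizing the noise-free squared error plus an L^2 weight penalty with coefficient σ^2.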
7.7 Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error
by combining several models (Breiman, 1994). The idea is to train several different
models separately, then have all of the models vote on the output for test examples.
This is an example of a general strategy in machine learning called model averaging.
Techniques employing this strategy are known as ensemble methods.
The reason that model averaging works is that different models will usually make
different errors on the test set to some extent.
Consider for example a set of k regression models. Suppose that each model makes an error ε_i on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[ε_i^2] = v and covariances E[ε_i ε_j] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σ_i ε_i. The expected squared error is

E[((1/k) Σ_i ε_i)^2] = (1/k^2) E[Σ_i ε_i^2 + Σ_i Σ_{j≠i} ε_i ε_j]
                     = (1/k) v + ((k − 1)/k) c.
In the case where the errors are perfectly correlated and c = v, this reduces to v, and the model averaging does not help at all. But in the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only v/k. This means that the expected squared error of the ensemble decreases inversely with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than any of its members.
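A quick Monte Carlo check of this formula, with illustrative values for k, v, and c:

import numpy as np

k, v, c = 5, 1.0, 0.3            # ensemble size, error variance, error covariance
cov = c * np.ones((k, k)) + (v - c) * np.eye(k)

rng = np.random.RandomState(0)
errors = rng.multivariate_normal(np.zeros(k), cov, size=200000)
ensemble_error = errors.mean(axis=1)          # average prediction error per example

empirical = np.mean(ensemble_error ** 2)
predicted = v / k + (k - 1) / k * c
# empirical and predicted agree up to Monte Carlo noise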
Different ensemble methods construct the ensemble of models in different ways. For
example, each member of the ensemble could be formed by training a completely different
kind of model using a different algorithm or cost function. Bagging is a method that
allows the same kind of model and same kind of training algorithm and cost function to
be re-used several times.
Specifically, bagging involves constructing k different datasets. Each dataset has
the same number of examples as the original dataset, but each dataset is constructed
by sampling with replacement from the original dataset. This means that, with high
probability, each dataset is missing some of the examples from the original dataset and
also contains several duplicate examples. Model i is then trained on dataset i. The
differences between which examples are included in each dataset result in differences
between the trained models. See Fig. 7.3 for an example.
Neural networks reach a wide enough variety of solution points that they can often
benefit from model averaging even if all of the models are trained on the same dataset.
Differences in random initialization, random selection of minibatches, differences in hy-
perparameters, or different outcomes of non-deterministic implementations of neural
networks are often enough to cause different members of the ensemble to make partially
independent errors.
Model averaging is an extremely powerful and reliable method for reducing general-
ization error. Its use is usually discouraged when benchmarking algorithms for scientific
papers, because any machine learning algorithm can benefit substantially from model av-
eraging at the price of increased computation and memory. For this reason, benchmark
comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averaging over
dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the ensemble
more regularized than the individual models. For example, a technique called boosting
constructs an ensemble with higher capacity than the individual models.
7.8 Early Stopping as a Form of Regularization
When training large models with high capacity, we often observe that training error
decreases steadily over time, but validation set error begins to rise again. See Fig. 7.4
for an example of this behavior. This behavior occurs very reliably.
This means we can obtain a model with better validation set error (and thus, hope-
fully better test set error) by returning to the parameter setting at the point in time
with the lowest validation set error. Instead of running our optimization algorithm until
we reach a (local) minimum, we run it until the error on the validation set has not
improved for some amount of time. Every time the error on the validation set improves,
we store a copy of the model parameters. When the training algorithm terminates, we
return these parameters, rather than the latest parameters. This procedure is specified more formally in Alg. 7.1.
Figure 7.3: A cartoon depiction of how bagging works. Suppose we train an ’8’ detector on a dataset containing an ’8’, a ’6’, and a ’9’, and we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the ’9’ and repeats the ’8’. On this dataset, the detector learns that a loop on top of the digit corresponds to an ’8’. On the second dataset, we repeat the ’9’ and omit the ’6’. In this case, the detector learns that a loop on the bottom of the digit corresponds to an ’8’. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the ’8’ are present.
Figure 7.4: Learning curves showing how the negative log likelihood loss changes over
time. In this example, we train a maxout network on MNIST, regularized with dropout.
Observe that the training loss decreases consistently over time, but the validation set
loss eventually begins to increase again.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θ_o be the initial parameters.
θ ← θ_o
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
  Update θ by running the training algorithm for n steps.
  i ← i + n
  v′ ← ValidationSetError(θ)
  if v′ < v then
    j ← 0
    θ* ← θ
    i* ← i
    v ← v′
  else
    j ← j + 1
  end if
end while
Best parameters are θ*, best number of training steps is i*.
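A Python sketch of this meta-algorithm; the callables train_n_steps and validation_error are assumed to be supplied by the surrounding training code and are not defined here.

import copy

def early_stopping(theta0, train_n_steps, validation_error, n=100, p=5):
    """Return the best parameters and the best number of training steps.

    train_n_steps(theta, n) runs n optimization steps and returns updated theta;
    validation_error(theta) evaluates the error on the validation set.
    Both are assumed to be provided by the caller.
    """
    theta, i, j = theta0, 0, 0
    best_theta, best_i, best_v = copy.deepcopy(theta0), 0, float("inf")
    while j < p:                       # stop after p consecutive non-improvements
        theta = train_n_steps(theta, n)
        i += n
        v = validation_error(theta)
        if v < best_v:
            best_theta, best_i, best_v = copy.deepcopy(theta), i, v
            j = 0
        else:
            j += 1
    return best_theta, best_i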
This strategy is known as early stopping. It is probably the most commonly used
form of regularization in deep learning. Its popularity is due both to its effectiveness
and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter selection
algorithm. In this view, the number of training steps is just another hyperparameter. We
can see in Fig. 7.4 that this hyperparameter has a U-shaped validation set performance
curve, just like most other model capacity control parameters. In this case, we are
controlling the effective capacity of the model by determining how many steps it can
take to fit the training set precisely. Most of the time, setting hyperparameters requires
an expensive guess and check process, where we must set a hyperparameter at the start
of training, then run training for several steps to see its effect. The “training time”
hyperparameter is unique in that by definition a single run of training tries out many
values of the hyperparameter. The only significant cost to choosing this hyperparameter
automatically via early stopping is running the validation set evaluation periodically
during training.
An additional cost to early stopping is the need to maintain a copy of the best
parameters. This cost is generally negligible, because it is acceptable to store these pa-
rameters in a slower and larger form of memory (for example, training in GPU memory,
but storing the optimal parameters in host memory or on a disk drive). Since the best
parameters are written to infrequently and never read during training, these occasional
slow writes have little effect on the total training time.
Early stopping is a very unobtrusive form of regularization, in that it requires no change to the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.
Early stopping may be used either alone or in conjunction with other regularization
strategies. Even when using regularization strategies that modify the objective function
to encourage better generalization, it is rare for the best generalization to occur at a
local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not fed
to the model. To best exploit this extra data, one can perform extra training after the
initial training with early stopping has completed. In the second, extra training step,
all of the training data is included. There are two basic strategies one can use for this
second training procedure.
One strategy is to initialize the model again and retrain on all of the data. In
this second training pass, we train for the same number of steps as the early stopping
procedure determined was optimal in the first pass. There are some subtleties associated
with this procedure. For example, there is not a good way of knowing whether to retrain
for the same number of parameter updates or the same number of passes through the
dataset. On the second round of training, each pass through the dataset will require
more parameter updates because the training set is bigger. Usually, if overfitting is a
serious concern, you will want to retrain for the same number of epochs, rather than
the same number of parameter updates. If the primary difficulty is optimization rather
than generalization, then retraining for the same number of parameter updates makes
more sense (but it’s also less likely that you need to use a regularization method like
early stopping in the first place). This algorithm is described more formally in Alg. 7.2.
Another strategy for using all of the data is to keep the parameters obtained from
the first round of training and then continue training but now using all of the data.
At this stage, we now no longer have a guide for when to stop in terms of a number
of steps. Instead, we can monitor the loss function on the validation set, and continue
training until it falls below the value of the training set objective at which the early
stopping procedure halted. This strategy avoids the high cost of retraining the model
from scratch, but is not as well-behaved. For example, there is not any guarantee that
the objective on the validation set will ever reach the target value, so this strategy is not
even guaranteed to terminate. This procedure is presented more formally in Alg. 7.3.
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
Algorithm 7.3 A meta-algorithm for using early stopping to determine at what objective value we start to overfit, then continuing training.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
  Train on X^(train) and y^(train) for n steps.
end while
Early stopping and the use of surrogate loss functions: A useful property of
early stopping is that it can help to mitigate the problems caused by a mismatch between
the surrogate loss function whose gradient we follow downhill and the underlying per-
formance measure that we actually care about. For example, 0-1 classification loss has
a derivative that is zero or undefined everywhere, so it is not appropriate for gradient-
based optimization. We therefore train with a surrogate such as the log likelihood of
the correct class label. However, 0-1 loss is inexpensive to compute, so it can easily
be used as an early stopping criterion. Often the 0-1 loss continues to decrease long after the log likelihood has begun to worsen on the validation set.
Early stopping is also useful because it reduces the computational cost of the training
procedure. It is a form of regularization that does not require adding additional terms
to the surrogate loss function, so we get the benefit of regularization without the cost
of any additional gradient computations. It also means that we do not spend time
approaching the exact local minimum of the surrogate loss.
How early stopping acts as a regularizer: So far we have stated that early stopping is a regularizer, but we have only backed up this claim by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model?⁷
Figure 7.5: An illustration of the effect of early stopping (Right) as a form of regularization on the value of the optimal w, as compared to L^2 regularization (Left) discussed in Sec. 7.1.1.
⁷ Material for this section was taken from Bishop (1995) and Sjöberg and Ljung (1995); for further details regarding the interpretation of early stopping as a regularizer, please consult these works.
Early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_o. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate η. We can view the product ητ as the reciprocal of a regularization parameter. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o.
Indeed, we can show how, in the case of a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to the L^2 regularization seen in Section 7.1.1.

In order to compare with classical L^2 regularization, we again consider the simple setting where the parameters to be optimized are θ = w, and we take a quadratic approximation to the objective function J in the neighborhood of the empirically optimal value of the weights w*:

Ĵ(w) = J(w*) + (1/2)(w − w*)^⊤ H (w − w*)     (7.13)

where, as before, H is the Hessian matrix of J with respect to w evaluated at w*. Given the assumption that w* is a minimum of J(w), H is positive semi-definite and the gradient of the approximation is given by:

∇_w Ĵ(w) = H(w − w*).     (7.14)
Let us consider the initial parameter vector chosen at the origin, i.e. w^(0) = 0, and consider updating the parameters via gradient descent:

w^(τ) = w^(τ−1) − η ∇_w J(w^(τ−1))     (7.15)
       = w^(τ−1) − ηH(w^(τ−1) − w*)     (7.16)
w^(τ) − w* = (I − ηH)(w^(τ−1) − w*)     (7.17)

Let us now consider this expression in the space of the eigenvectors of H, i.e. we will again consider the eigendecomposition of H: H = QΛQ^⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

w^(τ) − w* = (I − ηQΛQ^⊤)(w^(τ−1) − w*)
Q^⊤(w^(τ) − w*) = (I − ηΛ) Q^⊤(w^(τ−1) − w*)

Assuming w^(0) = 0 and that |1 − ηλ_i| < 1, after τ training updates we have

Q^⊤ w^(τ) = [I − (I − ηΛ)^τ] Q^⊤ w*.     (7.18)
Now, the expression for Q^⊤ w̃ in Eqn. 7.8 for L^2 regularization can be rearranged as:

Q^⊤ w̃ = (Λ + αI)^{−1} Λ Q^⊤ w*
Q^⊤ w̃ = [I − (Λ + αI)^{−1} α] Q^⊤ w*     (7.19)

Comparing Eqns. 7.18 and 7.19, we see that if

(I − ηΛ)^τ = (Λ + αI)^{−1} α,
then L^2 regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λ_i are small (i.e. ηλ_i ≪ 1 and λ_i/α ≪ 1) then

τ ≈ 1/(ηα).     (7.20)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L^2 regularization parameter α.
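A small numerical check of this analysis, with a synthetic Hessian and w* chosen only for illustration: running τ gradient descent steps on the quadratic from w^(0) = 0 matches the closed form of Eq. 7.18, and α ≈ 1/(ητ) gives the corresponding L^2-regularized solution for comparison.

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(3, 3)
H = A @ A.T + 0.5 * np.eye(3)        # Hessian of the quadratic approximation
w_star = rng.randn(3)
eta, tau = 0.01, 200

# Run tau gradient descent steps on the quadratic, starting from w = 0
w = np.zeros(3)
for _ in range(tau):
    w = w - eta * H @ (w - w_star)

# Closed form of Eq. 7.18 in the eigenbasis of H
lam, Q = np.linalg.eigh(H)
w_closed = Q @ ((1 - (1 - eta * lam) ** tau) * (Q.T @ w_star))
assert np.allclose(w, w_closed)

# For small eta*lam_i and lam_i/alpha, alpha ~ 1/(eta*tau) gives a similar L2 solution
alpha = 1.0 / (eta * tau)
w_l2 = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))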
Parameter values corresponding to directions of significant curvature (of the loss) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that parameters corresponding to directions of significant curvature tend to be learned early relative to parameters corresponding to directions of less curvature.
7.9 Parameter Sharing
From a Bayesian perspective, parameters that perform related tasks should be close to one another; imposing this preference as a practical constraint leads to parameter sharing.
7.10 Sparse Representations
Most deep learning models have some concept of a learned representation, and regularization can also be applied to this representation, for example by encouraging it to be sparse.
7.11 Dropout
Because deep models have a high degree of expressive power, they are capable of over-
fitting significantly. While this problem can be solved by using a very large dataset,
large datasets are not always available. Dropout (Srivastava et al., 2014) provides
a computationally inexpensive but powerful method of regularizing a broad family of
models.
Dropout can be thought of as a method of making bagging practical for neural net-
works. Bagging involves training multiple models, and evaluating multiple models on
each test example. This seems impractical when each model is a neural network, since
training and evaluating a neural network is costly in terms of runtime and storing a
neural network is costly in terms of memory. Dropout provides an inexpensive approx-
imation to training and evaluating a bagged ensemble of exponentially many neural
networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be
formed by removing units from an underlying base network. In most modern neural
networks, based on a series of affine transformations and nonlinearities, we can effectively
remove a unit from a network by multiplying its state by zero. This procedure requires
some slight modification for models such as radial basis function networks, which take
the difference between the unit’s state and some reference value. Here, we will present
the dropout algorithm in terms of multiplication by zero for simplicity, but it can be
trivially modified to work with other operations that remove a unit from the network.
To train with dropout, we use a minibatch-based learning algorithm such as stochastic gradient descent. Each time we load an example into a minibatch, we randomly sample a binary mask to apply to all of the input and hidden units in the network, with each unit included with some fixed probability (typically 0.5 for hidden units and 0.8 for input units), and we run forward and back-propagation through the masked network as usual. In this way, training with dropout resembles bagging an exponentially large ensemble of sub-networks, except that the sub-networks share parameters and each is trained for only a single step. At test time, rather than explicitly averaging the predictions of exponentially many sub-networks, we typically use the weight scaling inference rule: we evaluate the full network once, with the weights going out of each unit multiplied by the probability of including that unit.
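A minimal sketch of these two procedures for a one-hidden-layer network, with arbitrary weights and a placeholder inclusion probability; it is meant only to illustrate the masking and the weight scaling rule, not a full training loop.

import numpy as np

def forward_train(x, W1, W2, rng, keep_prob=0.5):
    # Sample a binary mask and drop hidden units by multiplying their state by zero
    h = np.maximum(0.0, W1 @ x)                       # hidden layer (ReLU)
    mask = rng.binomial(1, keep_prob, size=h.shape)
    return W2 @ (h * mask)

def forward_test(x, W1, W2, keep_prob=0.5):
    # Weight scaling inference rule: use all units, scaled by their inclusion probability
    h = np.maximum(0.0, W1 @ x)
    return W2 @ (h * keep_prob)

rng = np.random.RandomState(0)
W1, W2, x = rng.randn(8, 4), rng.randn(3, 8), rng.randn(4)
train_out = forward_train(x, W1, W2, rng)
test_out = forward_test(x, W1, W2)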
For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with n input variables represented by the vector v:

P(Y = y | v) = softmax(W^⊤ v + b)_y.

We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

P(Y = y | v; d) = softmax(W^⊤(d ⊙ v) + b)_y.
The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members’ predictions:

P_ensemble(Y = y | v) = P̃_ensemble(Y = y | v) / Σ_{y′} P̃_ensemble(Y = y′ | v)     (7.21)

where

P̃_ensemble(Y = y | v) = ( Π_{d∈{0,1}^n} P(Y = y | v; d) )^{1/2^n}.
To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(Y = y | v) = ( Π_{d∈{0,1}^n} P(Y = y | v; d) )^{1/2^n}
 = ( Π_{d∈{0,1}^n} softmax(W^⊤(d ⊙ v) + b)_y )^{1/2^n}
 = ( Π_{d∈{0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) / Σ_{y′} exp(W_{y′,:}^⊤(d ⊙ v) + b_{y′}) )^{1/2^n}
 = ( Π_{d∈{0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) )^{1/2^n} / ( Π_{d∈{0,1}^n} Σ_{y′} exp(W_{y′,:}^⊤(d ⊙ v) + b_{y′}) )^{1/2^n}
Because P̃ will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(Y = y | v) ∝ ( Π_{d∈{0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) )^{1/2^n}
 = exp( (1/2^n) Σ_{d∈{0,1}^n} (W_{y,:}^⊤(d ⊙ v) + b_y) )
 = exp( (1/2) W_{y,:}^⊤ v + b_y )
Substituting this back into equation 7.21, we obtain a softmax classifier with weights (1/2)W.
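This conclusion can be verified numerically by brute force for a small input dimension. The following sketch (illustrative only, with random weights) enumerates all 2^n masks, forms the renormalized geometric mean of the sub-model predictions, and compares it with a single softmax using weights W/2.

import numpy as np
from itertools import product

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.RandomState(0)
n, n_classes = 4, 3
W, b, v = rng.randn(n, n_classes), rng.randn(n_classes), rng.randn(n)

# Renormalized geometric mean over all 2^n binary masks d
log_probs = np.array([np.log(softmax(W.T @ (np.array(d) * v) + b))
                      for d in product([0, 1], repeat=n)])
geo_mean = np.exp(log_probs.mean(axis=0))
ensemble = geo_mean / geo_mean.sum()

# Weight scaling rule: a single softmax with weights W / 2
scaled = softmax((W / 2).T @ v + b)
assert np.allclose(ensemble, scaled)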
The weight scaling rule is also exact in other settings, including regression networks
with conditionally normal outputs, and deep networks that have hidden layers with-
out nonlinearities. However, the weight scaling rule is only an approximation for deep
models that have non-linearities, and this approximation has not been theoretically
characterized. Fortunately, it works well, empirically. Goodfellow et al. (2013a) found
empirically that for deep networks with nonlinearities, the weight scaling rule can work
better (in terms of classification accuracy) than Monte Carlo approximations to the
ensemble predictor, even if the Monte Carlo approximation is allowed to sample up to
1,000 sub-networks.
Srivastava et al. (2014) showed that dropout is more effective than other standard
computationally inexpensive regularizers, such as weight decay, filter norm constraints,
and sparse activity regularization. Dropout may also be combined with more expensive
forms of regularization such as unsupervised pretraining to yield an improvement. As
of this writing, the state of the art classification error rate on the permutation invariant
MNIST dataset (not using any prior knowledge about images) is attained by a clas-
sifier that uses both dropout regularization and deep Boltzmann machine pretraining.
However, combining dropout with unsupervised pretraining has not become a popular
strategy for larger models and more challenging datasets.
One advantage of dropout is that it is very computationally cheap. Using dropout
during training requires only O(n) computation per example per update, to generate
n random binary numbers and multiply them by the state. Depending on the imple-
mentation, it may also require O(n) memory to store these binary numbers until the
backpropagation stage. Running inference in the trained model has the same cost per-
example as if dropout were not used, though we must pay the cost of dividing the weights
by 2 once before beginning to run inference on examples.
One significant advantage of dropout is that it does not significantly limit the type
of model or training procedure that can be used. It works well with nearly any model
that uses a distributed representation and can be trained with stochastic gradient de-
scent. This includes feedforward neural networks, probabilistic models such as restricted
Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Pascanu
et al., 2014a). This is very different from many other neural network regularization
strategies, such as those based on unsupervised pretraining or semi-supervised learning.
Such regularization strategies often impose restrictions such as not being able to use
rectified linear units or max pooling. Often these restrictions incur enough harm to
outweigh the benefit provided by the regularization strategy.
Though the cost per-step of applying dropout to a specific model is negligible, the
cost of using dropout in a complete system can be significant. This is because the
size of the optimal model (in terms of validation set error) is usually much larger, and
because the number of steps required to reach convergence increases. This is of course
to be expected from a regularization method, but it does mean that for very large
datasets (as a rough rule of thumb, dropout is unlikely to be beneficial when more than
15 million training examples are available, though the exact boundary may be highly
problem dependent) it is often preferable not to use dropout at all, just to speed training
and reduce the computational cost of the final model.
When extremely few labeled training examples are available, dropout is less effective.
Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing
Dataset (Xiong et al., 2011) where fewer than 5,000 examples are available (Srivastava
et al., 2014). When additional unlabeled data is available, unsupervised feature learning
can gain an advantage over dropout.
The stochasticity used while training with dropout is not a necessary part of the
model’s success. It is just a means of approximating the sum over all sub-models.
Wang and Manning (2013) derived analytical approximations to this marginalization.
Their approximation, known as fast dropout, resulted in faster convergence due to
the reduced stochasticity in the computation of the gradient. This method can also be
applied at test time, as a more principled (but also more computationally expensive)
approximation to the average over all sub-networks than the weight scaling approxima-
tion. Fast dropout has been used to match the performance of standard dropout on
small neural network problems, but has not yet yielded a significant improvement or
been applied to a large problem.
Dropout has inspired other stochastic approaches to training exponentially large
ensembles of models that share weights. DropConnect is a special case of dropout
where each product between a single scalar weight and a single hidden unit state is
considered a unit that can be dropped (Wan et al., 2013). Stochastic pooling is a form
of randomized pooling (see chapter 9.3) for building ensembles of convolutional networks
with each convolutional network attending to different spatial locations of each feature
map. So far, dropout remains the most widely used implicit ensemble method.
7.12 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the
examples (i.e., constraints) arising out of several tasks.
Figure 7.6 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting Y_i given X) share the same input X, as well as some intermediate-level representation capturing a common pool of factors. The model can generally be divided into two kinds of parts and associated parameters:
1. Task-specific parameters (which only benefit from the examples of their task to
achieve good generalization). Example: upper layers of a neural network, in Fig-
ure 7.6.
2. Generic parameters, shared across all the tasks (which benefit from the pooled
data of all the tasks). Example: lower layers of a neural network, in Figure 7.6.
Improved generalization and generalization error bounds (Baxter, 1995) can be achieved
because of the shared parameters, for which statistical strength can be greatly improved
(in proportion with the increased number of examples for the shared parameters, com-
pared to the scenario of single-task models). Of course this will happen only if some
assumptions about the statistical relationship between the different tasks are valid, i.e.,
that there is something shared across some of the tasks.
From the point of view of deep learning, the underlying prior regarding the data is the
following: among the factors that explain the variations observed in the data associated
with the different tasks, some are shared across two or more tasks.
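A minimal sketch of the architecture in Figure 7.6, with arbitrary sizes: a shared lower layer feeds two task-specific heads, so the shared parameters see data from both tasks while each head is specific to its own task. Only the forward pass is shown.

import numpy as np

def multi_task_forward(x, W_shared, W_task1, W_task2):
    # Generic parameters: shared lower layer used by all tasks
    h_shared = np.tanh(W_shared @ x)
    # Task-specific parameters: each head produces the prediction for one task
    y1 = W_task1 @ h_shared
    y2 = W_task2 @ h_shared
    return y1, y2

rng = np.random.RandomState(0)
W_shared, W_task1, W_task2 = rng.randn(16, 8), rng.randn(1, 16), rng.randn(1, 16)
y1, y2 = multi_task_forward(rng.randn(8), W_shared, W_task1, W_task2)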
Figure 7.6: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters can be learned on top of a shared representation (associated respectively with h_1 and h_2 in the figure). The underlying assumption is that there exists a common pool of factors that explain the variations in the input X, while each task is associated with a subset of these factors. In the figure, it is additionally assumed that top-level hidden units are specialized to each task, while some intermediate-level representation is shared across all tasks. Note that in the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h_3): these are the factors that explain some of the input variations but are not relevant for these tasks.