Chapter 7
Regularization
A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. The main
strategy for achieving good generalization is known as regularization. Regulariza-
tion consists of putting extra constraints on a machine learning model, such as
restrictions on parameter values or extra terms in the cost function, that are not
designed to help fit the training set. If chosen carefully, these extra constraints
can lead to improved performance on the test set, either by encoding prior knowl-
edge into the model, or by forcing the model to consider multiple hypotheses
that explain the training data. Sometimes regularization also helps to make an
underdetermined problem determined.
This chapter builds on the concepts of generalization, overfitting, underfitting,
bias and variance introduced in Chapter 5. If you are not already familiar with
these notions, please refer to that chapter before continuing with the more advanced material presented here.
Simply put, regularizers work by trading increased bias for reduced variance.
An effective regularizer is one that makes a profitable trade, that is, one that reduces variance significantly while not overly increasing the bias.
When we discussed generalization and overfitting in Chapter 5, we focused on the situation where the model family being trained either (1) excluded the true data generating process, corresponding to underfitting and the induction of bias, (2) matched the true data generating process, or (3) was more complex than the generating process, the regime where variance dominates the estimation error (as measured by the MSE; see Section 5.6).
Note that an overly complex model family does not necessarily include (or even
come close to) the target function or the true data generating process. In practice,
we almost never have access to the true data generating process so we can never
know if our model family being estimated includes the generating process or not.
But since, in deep learning, we are often trying to work with data such as images,
audio sequences and text, we can probably safely assume that the model family
we are training does not include the data generating process. We can assume that
to some extent we are always trying to fit a square peg (the data generating
process) into a round hole (our model family) and using the data to do that as
best we can.
What this means is that controlling the complexity of the model is not simply a question of finding the model of the right size, i.e., with the right number of parameters. Instead, we might find (and in practical deep learning scenarios, we almost always do find) that the best fitting model, in the sense of minimizing generalization error, is one that possesses a large number of parameters that are not entirely free to span their domain.
Regularization is a method of limiting the domain of these parameters in such
a way as to limit the capacity of the model. With respect to minimizing the
empirical risk, regularization induces bias in an attempt to limit variance that
results from using a finite dataset.
As we will see there are a great many forms of regularization available to the
deep learning practitioner. In fact, developing more effective regularizers has been
one of the major research efforts in the field.
Most machine learning tasks can be viewed in terms of learning to represent a function f̂(x) parametrized by a vector of parameters θ. The data consists of inputs x^(i) and (for some tasks) targets y^(i) for i ∈ {1, . . . , n}. In the case of classification, each y^(i) is an integer class label in {1, . . . , k}. For regression tasks, each y^(i) is a real number. In the case of a density estimation task, there are no targets. We may group these examples into a design matrix X and a vector of targets y.
In deep learning, we are mainly interested in the case where f̂(x) has a large number of parameters and as a result possesses a high capacity to fit relatively complicated functions. This means that deep learning algorithms either require very large datasets, so that the data can fully specify such complicated models, or they require careful regularization.
7.1 Classical Regularization: Parameter Norm Penalty
Regularization has been used for decades prior to the advent of deep learning. Traditional statistical and machine learning models typically represented simpler functions. Because the functions themselves had less capacity, the regularization
did not need to be as sophisticated. We use the term classical regularization to
refer to the techniques used in the general machine learning and statistics litera-
ture.
Most classical regularization approaches are based on limiting the capacity
of models, such as neural networks, linear regression, or logistic regression, by
adding a parameter norm penalty Ω(θ) to the loss function J. We denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (7.1)

where α is a hyperparameter that weighs the contribution of the norm penalty term Ω relative to the standard loss function J(θ; X, y). The hyperparameter α should be a non-negative real number, with α = 0 corresponding to no regularization, and larger values of α corresponding to more regularization.
When our training algorithm minimizes the regularized loss function J̃, it will decrease both the original loss J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm can result in different solutions being preferred.
For models such as linear or logistic regression, where θ = [w⊤, b]⊤, we typically choose Ω(θ) = ½||w||₂² or Ω(θ) = ||w||₁. That is, we leave the biases unregularized¹ and penalize half² of either the squared L² norm or the L¹ norm of the weights. As discussed in Chapter 5, such penalties can also be interpreted as expressing a prior belief that the weights should be small; we return to this Bayesian view in Section 7.3.

¹ The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact, and requires observing both variables in a variety of conditions to fit well. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Regularizing the biases can introduce a significant amount of underfitting. For example, making sparse features usually requires being able to set the biases to significantly negative values.

² The ½ in the L² penalty may seem arbitrary. Conceptually, it is not necessary and could be folded into the α hyperparameter. However, the ½ results in a simpler gradient (w instead of 2w) and simplifies the interpretation of the penalty as a Gaussian prior on w.
In the following sections, we discuss the effects of the various norms when
used as penalties on the weights.
7.1.1 L² Parameter Regularization

One of the simplest and most common kinds of classical regularization is the L² parameter norm penalty,³ Ω(θ) = ½||w||₂². This form of regularization is also known as ridge regression. It is equally applicable to neural networks, where the penalty is equal to the sum of the squared L² norms of all of the weight vectors. In the context of neural networks, this is known as weight decay. Typically, for neural networks, we use a different coefficient α for the weights at each layer of the network. This coefficient should be tuned using a validation set.

³ More generally, we could consider regularizing the parameters toward a value w^(o) that is perhaps not zero. In that case the L² penalty term would be Ω(θ) = ½||w − w^(o)||₂² = ½ Σᵢ (wᵢ − wᵢ^(o))². Since it is far more common to consider regularizing the model parameters to zero, we will focus on this special case in our exposition.
We can gain some insight into the behavior of weight decay regularization
by considering the gradient of the regularized loss function. To simplify the
presentation, we assume a linear model with no bias term, so θ is just w. Such a
model has the following gradient:
∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)    (7.2)
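To make this gradient concrete, here is a minimal sketch of a single weight decay update in NumPy; the loss gradient grad_J and all numeric values are illustrative placeholders rather than part of the analysis above.

import numpy as np

def weight_decay_step(w, grad_J, alpha, lr):
    # Gradient of the regularized loss, equation 7.2: alpha * w plus the gradient of J
    grad_total = alpha * w + grad_J(w)
    return w - lr * grad_total

# Tiny usage example with J(w) = 0.5 * ||w - 1||^2, whose gradient is (w - 1).
w = np.zeros(3)
for _ in range(1000):
    w = weight_decay_step(w, grad_J=lambda w: w - 1.0, alpha=0.5, lr=0.1)
print(w)  # close to 2/3: shrunk toward zero relative to the unregularized optimum at 1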
We will further simplify the analysis by considering a quadratic approximation to the loss function in the neighborhood of the empirically optimal value of the weights, w*. (If the loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.)

Ĵ(w) = J(w*) + ½ (w − w*)⊤ H (w − w*)    (7.3)

where H is the Hessian matrix of J with respect to w evaluated at w*. There is no first-order term in this quadratic approximation, because w* is defined to be a minimum, where the gradient vanishes. Likewise, because w* is a minimum, we can conclude that H is positive semi-definite. The gradient of the approximation is

∇_w Ĵ(w) = H(w − w*).    (7.4)
If we replace the exact gradient in equation 7.2 with the approximate gradient
in equation 7.4, we can write an equation for the location of the minimum of the
regularized loss function:
αw̃ + H(w̃ − w*) = 0    (7.5)

(H + αI)w̃ = Hw*    (7.6)

w̃ = (H + αI)⁻¹ H w*    (7.7)
The presence of the regularization term moves the optimum from w* to w̃. As α approaches 0, w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQ⊤.
[Figure 7.1: An illustration of the effect of L² (or weight decay) regularization on the value of the optimal w.]

Applying this decomposition to equation 7.7, we obtain:
w̃ = (QΛQ⊤ + αI)⁻¹ QΛQ⊤ w*
  = [Q(Λ + αI)Q⊤]⁻¹ QΛQ⊤ w*
  = Q(Λ + αI)⁻¹ Λ Q⊤ w*,

so that

Q⊤ w̃ = (Λ + αI)⁻¹ Λ Q⊤ w*.    (7.8)
If we interpret Q⊤w̃ as our parameters rotated into the basis defined by the eigenvectors Q of H, then we see that the effect of weight decay is to rescale the coefficients along each eigenvector. Specifically, the i-th component is rescaled by a factor of λᵢ/(λᵢ + α). (You may wish to review how this kind of scaling works, first explained in Fig. 2.3.)
Along the directions where the eigenvalues of H are relatively large, for example where λᵢ ≫ α, the effect of regularization is relatively small. However, components with λᵢ ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in Fig. 7.1.
Only directions along which the parameters contribute significantly to reducing the loss are preserved relatively intact. In directions that do not contribute to reducing the loss, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training. This effect of suppressing contributions to the parameter vector along these principal directions of the Hessian H is captured in the concept of the effective number of parameters, defined to be

γ = Σᵢ λᵢ / (λᵢ + α).    (7.9)
As α is increased, the effective number of parameters decreases.
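The following minimal NumPy sketch illustrates this analysis numerically; the Hessian and the unregularized optimum are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 0.1 * np.eye(4)        # a positive definite stand-in for the Hessian
w_star = rng.standard_normal(4)      # stand-in for the unregularized optimum
alpha = 1.0

# Regularized optimum, equation 7.7
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# The same quantity computed in the eigenbasis of H, equation 7.8
lam, Q = np.linalg.eigh(H)
shrinkage = lam / (lam + alpha)                 # per-direction rescaling factors
w_tilde_eig = Q @ (shrinkage * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))        # True
print(shrinkage.sum())                          # effective number of parameters, eq. 7.9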
Another way to gain some intuition for the effect of L² regularization is to consider its effect on linear regression. The unregularized objective function for linear regression is the sum of squared errors:

(Xw − y)⊤(Xw − y).

When we add L² regularization, the objective function changes to

(Xw − y)⊤(Xw − y) + ½ α w⊤w.

This changes the normal equations for the solution from

w = (X⊤X)⁻¹ X⊤ y

to

w = (X⊤X + αI)⁻¹ X⊤ y.
We can see that L² regularization causes the learning algorithm to “perceive” the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
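A minimal sketch of the two closed-form solutions above, using synthetic data (the data, the true weights, and the value of α are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.standard_normal(50)
alpha = 2.0

# Ordinary least squares: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# L2-regularized (ridge) solution: w = (X^T X + alpha I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # the ridge weights have smaller norm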
As we will see in the remainder of this chapter, several regularization strategies are closely related to one another: an L² penalty corresponds to an L² norm constraint on the weights (Sec. 7.2) and to a Gaussian prior on the weights (Sec. 7.3), and is also closely connected to early stopping (Sec. 7.8) and to adding noise to the inputs (Sec. 7.6), while an L¹ penalty corresponds to an L¹ norm constraint and to a Laplace prior on the weights.
7.1.2 L¹ Regularization

While L² regularization is the most common form of regularization for model parameters such as the weights of a neural network, it is not the only form of regularization in common usage. L¹ regularization is another kind of penalty on model parameters that behaves differently from L² regularization.
Formally, L¹ regularization on the model parameters w is defined as

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|,    (7.10)

that is, the sum of the absolute values of the individual parameters.⁴ We will now consider the effect of L¹ regularization on the simple linear model, with no bias
⁴ As with L² regularization, we could consider regularizing the parameters toward a value that is not zero, but instead toward some parameter value w^(o). In that case the L¹ regularization would introduce the term Ω(θ) = ||w − w^(o)||₁ = Σᵢ |wᵢ − wᵢ^(o)|.
term, which we considered in our analysis of L² regularization. In particular, we are interested in delineating the differences between the L¹ and L² forms of regularization.
Thus, if we consider the gradient (actually the sub-gradient) of the regularized objective function J̃(w; X, y), we have

∇_w J̃(w; X, y) = β sign(w) + ∇_w J(w; X, y)    (7.11)

where sign(w) is simply the sign of w applied element-wise, and β plays the role of the regularization coefficient α above.
By inspecting Eqn. 7.11, we can see immediately that the effect of L¹ regularization is quite different from that of L² regularization. Specifically, the regularization contribution to the gradient no longer scales linearly with each wᵢ; instead it is a constant whose sign is sign(wᵢ). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(w; X, y) as we did for L² regularization. Instead, the solutions tend to be aligned with the coordinate axes of the parameter space.
For the sake of comparison with L² regularization, we will again consider a simplified setting: a quadratic approximation to the loss function in the neighborhood of the empirical optimum w*. (Once again, if the loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.) The gradient of this approximation is given by

∇_w Ĵ(w) = H(w − w*)    (7.12)

where, again, H is the Hessian matrix of J with respect to w evaluated at w*.
We will also make the further simplifying assumption that the Hessian is diagonal, H = diag([γ₁, . . . , γ_N]), where each γᵢ > 0. With this rather restrictive assumption, the minimization of the L¹ regularized loss function decomposes into a sum of independent one-dimensional problems, one per parameter, in which the contribution of dimension i is

½ γᵢ (wᵢ − wᵢ*)² + β|wᵢ|.

Each of these one-dimensional problems admits an optimal solution of the form

wᵢ = sign(wᵢ*) max(|wᵢ*| − β/γᵢ, 0).
Let us consider the situation where wᵢ* > 0 for all i. There are two possible outcomes. Case 1: wᵢ* ≤ β/γᵢ. Here the optimal value of wᵢ under the regularized objective is simply wᵢ = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by the L¹ regularization, which pushes the value of wᵢ to zero. Case 2: wᵢ* > β/γᵢ. Here the
regularization does not move the optimal value of wᵢ to zero, but instead just shifts it toward zero by a distance equal to β/γᵢ. This is illustrated in Fig. 7.2.

[Figure 7.2: An illustration of the effect of L¹ regularization (right) on the value of the optimal w, in comparison to the effect of L² regularization (left).]
In comparison to L² regularization, L¹ regularization results in a solution that is more sparse. Sparsity in this context means that some of the model's parameters have an optimal value of zero under the regularized objective. As we discussed, for each element i of the parameter vector, this happens when wᵢ* ≤ β/γᵢ. Compare this to the situation for L² regularization, where (under the same assumption of a diagonal Hessian H) we get wᵢ^(L²) = γᵢ/(γᵢ + α) wᵢ*, which is nonzero as long as wᵢ* is nonzero.
In Fig. 7.2, we see that even when the optimal value of w is nonzero, L¹ regularization penalizes small parameter values just as harshly as large ones, leading to optimal solutions in which more parameters take the value zero exactly, while the remaining nonzero parameters can be larger.
The sparsity property induced by L¹ regularization has been used extensively as a feature selection mechanism. In particular, the well-known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an L¹ penalty with a linear model and a least squares cost function. Finally, the L¹ norm is known as the only norm that is both sparsifying and convex for non-degenerate problems.⁵

⁵ For degenerate problems, where more than one solution exists, L² regularization can find the “sparse” solution in the sense that redundant parameters shrink to zero.
7.1.3 L∞ Regularization
7.2 Classical Regularization as Constrained Optimization

Classical regularization adds a penalty term to the training objective:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).

Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier⁶, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct the generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).
The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).

Solving this problem requires modifying both θ and α. Specifically, α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. However, after we have solved the problem, we can fix α* and view the problem as just a function of θ:

θ* = arg min_θ L(θ, α*) = arg min_θ J(θ; X, y) + α* Ω(θ).

This is exactly the same as the regularized training problem of minimizing J̃.
Note that the value of α* does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α* depends on the form of J. We can thus think of classical regularization as imposing a constraint on the weights, but with an unknown size of the constraint region. Larger α will result in a smaller constraint region, and smaller α will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As
described in Sec. 4.4, we can modify algorithms such as stochastic gradient descent
to take a step downhill on J(θ) and then project θ back to the nearest point that
satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k
is appropriate and do not want to spend time searching for the value of α that
corresponds to this k.
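A minimal sketch of this project-after-each-step strategy for an L² norm constraint ||w||₂ ≤ k; the quadratic loss used here is only a stand-in, and any loss gradient could be substituted.

import numpy as np

def project_l2_ball(w, k):
    # Project w back onto the constraint set {w : ||w||_2 <= k}
    norm = np.linalg.norm(w)
    return w if norm <= k else w * (k / norm)

def constrained_sgd_step(w, grad_J, k, lr=0.1):
    w = w - lr * grad_J(w)          # ordinary descent step on J
    return project_l2_ball(w, k)    # reprojection enforces the explicit constraint

# Usage: a quadratic loss whose unconstrained optimum lies outside the constraint region.
w = np.zeros(2)
for _ in range(200):
    w = constrained_sgd_step(w, grad_J=lambda w: w - np.array([3.0, 4.0]), k=1.0)
print(w, np.linalg.norm(w))         # ends on the boundary of the unit L2 ball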
⁶ KKT multipliers generalize Lagrange multipliers to allow for inequality constraints.
Another reason to use explicit constraints and reprojection rather than enforc-
ing constraints with penalties is that penalties can cause non-convex optimization
procedures to get stuck in local minima corresponding to small θ. When training
neural networks, this usually manifests as neural networks that train with several
“dead units”. These are units that do not contribute much to the behavior of the
function learned by the network because the weights going into or out of them are
all very small. When training with a penalty on the norm of the weights, these
configurations can be locally optimal, even if it is possible to significantly reduce
J by making the weights larger. (This concern about local minima obviously does not apply when J̃ is convex.)
Finally, explicit constraints with reprojection can be useful because they im-
pose some stability on the optimization procedure. When using high learning
rates, it is possible to encounter a positive feedback loop in which large weights
induce large gradients which then induce a large update to the weights. If these
updates consistently increase the size of the weights, then θ rapidly moves away
from the origin until numerical overflow occurs. Explicit constraints with repro-
jection allow us to terminate this feedback loop after the weights have reached a
certain magnitude. Hinton et al. (2012c) recommend using constraints combined
with a high learning rate to allow rapid exploration of parameter space while
maintaining some stability.
The penalty and constraint formulations correspond norm by norm: an L² penalty corresponds to an L² norm constraint of some (unknown) size, and an L¹ penalty to an L¹ norm constraint. Note also that in practice explicit constraints are often imposed separately on the weight vector of each hidden unit (that is, on each column of a weight matrix) rather than on the entire parameter vector θ; constraining the norm of each column separately, as in the strategy of Hinton et al. (2012c), prevents any single hidden unit from acquiring very large weights, which a single constraint on all of θ would not.
7.3 Regularization from a Bayesian Perspective

In Section 5.8, we briefly reviewed Bayesian inference, in which we maintain a full posterior distribution over the model parameters rather than committing to a single point estimate. Maximum a posteriori (MAP) estimation approximates this full posterior with a single point, the mode of the posterior, which makes learning tractable at the cost of discarding the uncertainty information that the full posterior carries. From this point of view, many classical regularization penalties correspond to particular prior distributions over the parameters: maximizing the log posterior under a Gaussian prior on the weights yields an L² penalty, while a Laplace prior yields an L¹ penalty.
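To make the correspondence concrete, here is a sketch of the standard MAP calculation under the assumption of a Gaussian prior on the weights with precision α (the prior and its precision are modeling assumptions, playing the role of the regularization coefficient):

θ_MAP = arg max_θ log p(θ | X, y) = arg max_θ [ log p(y | X, θ) + log p(θ) ].

If p(θ) = N(θ; 0, (1/α) I), then log p(θ) = −(α/2) θ⊤θ + const, so

θ_MAP = arg min_θ [ −log p(y | X, θ) + (α/2) ||θ||₂² ],

which is the penalized objective of equation 7.1 with J taken to be the negative log likelihood and Ω(θ) = ½||θ||₂². The analogous calculation with a Laplace prior, p(θᵢ) ∝ exp(−α|θᵢ|), yields the L¹ penalty α||θ||₁.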
7.4 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be
properly defined.
Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix X⊤X. This is not possible whenever X⊤X is singular. This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of X) than input features (columns of X). In this case, many forms of regularization correspond to inverting X⊤X + αI instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix is
invertible. It is also possible for a problem with no closed form solution to be
underdetermined. For example, consider logistic regression applied to a problem
where the classes are linearly separable. If a weight vector w is able to achieve
perfect classification, then 2w will also achieve perfect classification and higher
likelihood. An iterative optimization procedure like stochastic gradient descent
will continually increase the magnitude of w and, in theory, will never halt. In
practice, a numerical implementation of gradient descent will eventually reach
sufficiently large weights to cause numerical overflow, at which point its behavior
will depend on how the programmer has decided to handle values that are not
real numbers.
Most forms of regularization are able to guarantee the convergence of iterative
methods applied to underdetermined problems. For example, weight decay will
cause gradient descent to quit increasing the magnitude of the weights when the
slope of the likelihood is equal to the weight decay coefficient. Likewise, early
stopping based on the validation set classification rate will cause the training
algorithm to terminate soon after the validation set classification accuracy has
stopped increasing. Even if the problem is linearly separable and there is no
overfitting, the validation set classification accuracy will eventually saturate to
100%, resulting in termination of the early stopping procedure.
The idea of using regularization to solve underdetermined problems extends
beyond machine learning. The same idea is useful for several basic linear algebra
problems.
As we saw in Sec. 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. One definition of the pseudoinverse X⁺ of a matrix X is as the limit of linear regression with an infinitesimal amount of L² regularization:

X⁺ = lim_{α → 0⁺} (X⊤X + αI)⁻¹ X⊤.
When a true inverse for X exists, then w = X⁺y returns the weights that exactly solve the regression problem. When X is not invertible because no exact solution exists, this returns the w corresponding to the least possible mean squared error. When X is not invertible because many solutions exactly solve the regression problem, this returns the w with the minimum possible L² norm.
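A minimal numerical sketch of this limiting relationship; the synthetic data and the particular small value of α are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))     # fewer examples than features, so X^T X is singular
y = rng.standard_normal(3)

w_pinv = np.linalg.pinv(X) @ y      # minimum-norm solution via the pseudoinverse

alpha = 1e-8                        # a tiny amount of L2 regularization
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

print(np.allclose(w_pinv, w_ridge, atol=1e-4))   # True: ridge approaches the pseudoinverse solution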
Recall that the Moore-Penrose pseudoinverse can be computed easily using
the singular value decomposition. Because the SVD is robust to underdetermined
problems resulting from too few observations or too little underlying variance,
it is useful for implementing stable variants of many closed-form linear machine
learning algorithms. The stability of these algorithms can be viewed as a result of
applying the minimum amount of regularization necessary to make the problem
become determined.
7.5 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it
on more data. Of course, in practice, the amount of data we have is limited. One
way to get around this problem is to create more fake data. For some machine
learning tasks, it is reasonably straightforward to create new fake data.
This approach is easiest for classification. A classifier needs to take a compli-
cated, high dimensional input x and summarize it with a single category identity
y. This means that the main task facing a classifier is to be invariant to a wide
variety of transformations. We can generate new (x, y) pairs easily just by trans-
forming the x inputs in our training set.
This approach is not as readily applicable to many other tasks. For example,
it is difficult to generate new fake data for a density estimation task unless we
have already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a spe-
cific classification problem: object recognition. Images are high dimensional and
include an enormous variety of factors of variation, many of which can be easily
simulated. Operations like translating the training images a few pixels in each
direction can often greatly improve generalization, even if the model has already
been designed to be partially translation invariant by using convolution and pool-
ing. Many other operations such as rotating the image or scaling the image have
also proven quite effective. One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between ’b’ and ’d’ and the difference between ’6’ and ’9’, so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks. There are also transformations that we would like our classifiers to be invariant to, but which are not easy to perform. For example, out-of-plane rotation cannot be implemented as a simple geometric operation on the input pixels.
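A minimal sketch of label-preserving augmentation for image classification, here just small random translations and an optional horizontal flip of an image array; whether a flip is appropriate depends on the task, as discussed above.

import numpy as np

def augment(image, rng, max_shift=2, allow_hflip=True):
    # Randomly translate by a few pixels in each direction.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    # np.roll wraps pixels around the border; real pipelines usually pad or crop instead.
    if allow_hflip and rng.random() < 0.5:
        out = out[:, ::-1]          # horizontal flip (not appropriate for, e.g., character recognition)
    return out

# Each epoch can present a slightly different version of every training image.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
augmented = augment(image, rng)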
For many classification and even some regression tasks, the task should still be
possible to solve even if random noise is added to the input. Neural networks prove
not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to
improve the robustness of neural networks is simply to train them with random
noise applied to their inputs. This same approach also works when the noise is
applied to the hidden units, which can be seen as doing dataset augmentation
at multiple levels of abstraction. Poole et al. (2014) recently showed that this
approach can be highly effective provided that the magnitude of the noise is
carefully tuned. Dropout, a powerful regularization strategy that will be described
in Sec. 7.11, can be seen as a process of constructing new inputs by multiplying
by noise.
In a multilayer network, it can often be beneficial to apply transformations
such as noise to the hidden units, as well as the inputs. This can be viewed as
augmenting the dataset as seen by the deeper layers.
When reading machine learning research papers, it is important to take the
effect of dataset augmentation into account. Often, hand-designed dataset aug-
mentation schemes can dramatically reduce the generalization error of a machine
learning technique. It is important to look for controlled experiments. When
comparing machine learning algorithm A and machine learning algorithm B, it
is necessary to make sure that both algorithms were evaluated using the same
hand-designed dataset augmentation schemes. If algorithm A performs poorly
with no dataset augmentation and algorithm B performs well when combined with
numerous synthetic transformations of the input, then it is likely the synthetic
transformations and not algorithm B itself that cause the improved performance.
Sometimes the line is blurry, such as when a new machine learning algorithm involves injecting noise into the inputs. In these cases, it is best to consider how generally applicable the new algorithm is, and to make sure that pre-existing algorithms are re-run under conditions that are as similar as possible.
A related strategy, tangent propagation, trains the model to be locally invariant to known transformations of its input by penalizing the derivative of the output along the tangent directions of those transformations, rather than by explicitly generating transformed examples.
7.6 Classical Regularization as Noise Robustness

Some classical regularization techniques can be derived in terms of training on noisy inputs. For example, for a linear model trained with mean squared error, adding Gaussian noise of small variance to the inputs is equivalent, in expectation, to an L² (Tikhonov) penalty on the weights; see “Training with noise is equivalent to Tikhonov regularization” (Bishop, 1995). Related derivations consider noise applied to the weights rather than to the inputs, and analogous results have been explored empirically for deep networks.
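As a sketch of the simplest case, consider linear regression with mean squared error and suppose that noise ε ~ N(0, σ²I) is added to the input x of each example. Taking the expectation over the noise,

E_ε[(w⊤(x + ε) − y)²] = E_ε[(w⊤x − y + w⊤ε)²] = (w⊤x − y)² + σ² w⊤w,

since ε has zero mean (so the cross term vanishes) and covariance σ²I (so E[(w⊤ε)²] = σ²||w||₂²). Training on noisy inputs therefore minimizes, in expectation, the original squared error plus an L² penalty with coefficient σ².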
7.7 Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization
error by combining several models (Breiman, 1994). The idea is to train several
different models separately, then have all of the models vote on the output for
test examples. This is an example of a general strategy in machine learning
called model averaging. Techniques employing this strategy are known as ensemble
methods.
The reason that model averaging works is that different models will usually
make different errors on the test set to some extent.
Consider for example a set of k regression models. Suppose that each model makes an error εᵢ on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[εᵢ²] = v and covariances E[εᵢ εⱼ] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σᵢ εᵢ. The expected squared error is

E[ ( (1/k) Σᵢ εᵢ )² ] = (1/k²) E[ Σᵢ ( εᵢ² + Σ_{j≠i} εᵢ εⱼ ) ]
                      = (1/k) v + ((k − 1)/k) c.
In the case where the errors are perfectly correlated and c = v, this reduces to v, and the model averaging does not help at all. But in the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only v/k. This means that the expected squared error of the ensemble is inversely proportional to the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than any of its members.
Different ensemble methods construct the ensemble of models in different
ways. For example, each member of the ensemble could be formed by training a
completely different kind of model using a different algorithm or cost function.
Bagging is a method that allows the same kind of model and same kind of training
algorithm and cost function to be reused several times.
Specifically, bagging involves constructing k different datasets. Each dataset
has the same number of examples as the original dataset, but each dataset is
constructed by sampling with replacement from the original dataset. This means
that, with high probability, each dataset is missing some of the examples from
the original dataset and also contains several duplicate examples. Model i is then
trained on dataset i. The differences between which examples are included in
each dataset result in differences between the trained models. See Fig. 7.3 for an
example.
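A minimal sketch of the resampling and averaging steps of bagging; the least-squares base learner here is only a placeholder, and any model and training algorithm could be substituted.

import numpy as np

def bagging_fit(X, y, fit, k=10, seed=0):
    # Train k models, each on a bootstrap resample of (X, y).
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)    # sample n examples with replacement
        models.append(fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Average the ensemble members' predictions.
    return np.mean([m(X) for m in models], axis=0)

def fit_linear(X, y):
    # A trivial base learner: least-squares linear regression.
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return lambda X_new: X_new @ w

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(100)
predictions = bagging_predict(bagging_fit(X, y, fit_linear), X)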
[Figure 7.3: A cartoon depiction of how bagging works. Suppose we train an ’8’ detector on a dataset containing an ’8’, a ’6’, and a ’9’, and make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the ’9’ and repeats the ’8’. On this dataset, the detector learns that a loop on top of the digit corresponds to an ’8’. On the second dataset, we repeat the ’9’ and omit the ’6’. In this case, the detector learns that a loop on the bottom of the digit corresponds to an ’8’. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the ’8’ are present.]
Neural networks reach a wide enough variety of solution points that they can
often benefit from model averaging even if all of the models are trained on the same
dataset. Differences in random initialization, random selection of minibatches,
differences in hyperparameters, or different outcomes of non-deterministic imple-
mentations of neural networks are often enough to cause different members of the
ensemble to make partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing
generalization error. Its use is usually discouraged when benchmarking algorithms
for scientific papers, because any machine learning algorithm can benefit substan-
tially from model averaging at the price of increased computation and memory.
For this reason, benchmark comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averag-
ing over dozens of models. A recent prominent example is the Netflix Grand
Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the en-
semble more regularized than the individual models. For example, a technique
called boosting constructs an ensemble with higher capacity than the individual
models.
7.8 Early Stopping as a Form of Regularization
When training large models with high capacity, we often observe that training
error decreases steadily over time, but validation set error begins to rise again.
See Fig. 7.4 for an example of this behavior. This behavior occurs very reliably.
This means we can obtain a model with better validation set error (and thus,
hopefully better test set error) by returning to the parameter setting at the point
in time with the lowest validation set error. Instead of running our optimization
algorithm until we reach a (local) minimum, we run it until the error on the
validation set has not improved for some amount of time. Every time the error
on the validation set improves, we store a copy of the model parameters. When
the training algorithm terminates, we return these parameters, rather than the
latest parameters. This procedure is specified more formally in Alg. 7.1.
This strategy is known as early stopping. It is probably the most commonly
used form of regularization in deep learning. Its popularity is due both to its
effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter se-
lection algorithm. In this view, the number of training steps is just another
hyperparameter. We can see in Fig. 7.4 that this hyperparameter has a U-shaped
validation set performance curve, just like most other model capacity control pa-
rameters. In this case, we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set precisely.

[Figure 7.4: Learning curves showing how the negative log likelihood loss changes over time. In this example, we train a maxout network on MNIST, regularized with dropout. Observe that the training loss decreases consistently over time, but the validation set loss eventually begins to increase again.]

Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

  Let n be the number of steps between evaluations.
  Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
  Let θ_o be the initial parameters.
  θ ← θ_o
  i ← 0
  j ← 0
  v ← ∞
  θ* ← θ
  i* ← i
  while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
      j ← 0
      θ* ← θ
      i* ← i
      v ← v′
    else
      j ← j + 1
    end if
  end while
  Best parameters are θ*, best number of training steps is i*.

Most of
the time, setting hyperparameters requires an expensive guess and check process,
where we must set a hyperparameter at the start of training, then run training
for several steps to see its effect. The “training time” hyperparameter is unique
in that by definition a single run of training tries out many values of the hyperpa-
rameter. The only significant cost to choosing this hyperparameter automatically
via early stopping is running the validation set evaluation periodically during
training.
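A minimal sketch of the bookkeeping in Alg. 7.1; train_n_steps and validation_error are placeholders for whatever training procedure and validation metric are actually in use.

import copy

def early_stopping(theta, train_n_steps, validation_error, n=100, patience=5):
    # Returns the best parameters seen and the number of steps at which they occurred.
    best_theta, best_step, best_err = copy.deepcopy(theta), 0, float("inf")
    step, fails = 0, 0
    while fails < patience:
        theta = train_n_steps(theta, n)      # run the training algorithm for n more steps
        step += n
        err = validation_error(theta)
        if err < best_err:                   # improvement: checkpoint the parameters
            best_theta, best_step, best_err = copy.deepcopy(theta), step, err
            fails = 0
        else:                                # no improvement: use up one unit of patience
            fails += 1
    return best_theta, best_step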
An additional cost to early stopping is the need to maintain a copy of the best
parameters. This cost is generally negligible, because it is acceptable to store
these parameters in a slower and larger form of memory (for example, training
in GPU memory, but storing the optimal parameters in host memory or on a
disk drive). Since the best parameters are written to infrequently and never read
during training, these occasional slow writes have little effect on the total training time.
Early stopping is a very unobtrusive form of regularization, in that it requires no change to the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.
Early stopping may be used either alone or in conjunction with other regu-
larization strategies. Even when using regularization strategies that modify the
objective function to encourage better generalization, it is rare for the best gen-
eralization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra
training step, all of the training data is included. There are two basic strategies
one can use for this second training procedure.
One strategy is to initialize the model again and retrain on all of the data.
In this second training pass, we train for the same number of steps as the early
stopping procedure determined was optimal in the first pass. There are some
subtleties associated with this procedure. For example, there is not a good way
of knowing whether to retrain for the same number of parameter updates or the
same number of passes through the dataset. On the second round of training,
each pass through the dataset will require more parameter updates because the
training set is bigger. Usually, if overfitting is a serious concern, you will want to
retrain for the same number of epochs, rather than the same number of parameter
updates. If the primary difficulty is optimization rather than generalization, then
retraining for the same number of parameter updates makes more sense (but it’s
also less likely that you need to use a regularization method like early stopping
in the first place). This algorithm is described more formally in Alg. 7.2.
Another strategy for using all of the data is to keep the parameters obtained
from the first round of training and then continue training but now using all of the
data. At this stage, we now no longer have a guide for when to stop in terms of a
number of steps. Instead, we can monitor the loss function on the validation set,
and continue training until it falls below the value of the training set objective at
which the early stopping procedure halted. This strategy avoids the high cost of
retraining the model from scratch, but is not as well-behaved. For example, there
is not any guarantee that the objective on the validation set will ever reach the
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

  Let X^(train) and y^(train) be the training set.
  Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
  Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
  Set θ to random values again.
  Train on X^(train) and y^(train) for i* steps.
target value, so this strategy is not even guaranteed to terminate. This procedure
is presented more formally in Alg. 7.3.
Algorithm 7.3 A meta-algorithm for using early stopping to determine at what objective value we start to overfit, then continuing training.

  Let X^(train) and y^(train) be the training set.
  Split X^(train) and y^(train) into X^(subtrain), y^(subtrain), X^(valid), y^(valid).
  Run early stopping (Alg. 7.1) starting from random θ, using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
  ε ← J(θ, X^(subtrain), y^(subtrain))
  while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
  end while
Early stopping and the use of surrogate loss functions: A useful property
of early stopping is that it can help to mitigate the problems caused by a mismatch
between the surrogate loss function whose gradient we follow downhill and the
underlying performance measure that we actually care about. For example, 0-
1 classification loss has a derivative that is zero or undefined everywhere, so it
is not appropriate for gradient-based optimization. We therefore train with a
surrogate such as the log likelihood of the correct class label. However, 0-1 loss
is inexpensive to compute, so it can easily be used as an early stopping criterion.
Often the 0-1 loss continues to decrease long after the log likelihood has begun to worsen on the validation set.
Early stopping is also useful because it reduces the computational cost of the
training procedure. It is a form of regularization that does not require adding
additional terms to the surrogate loss function, so we get the benefit of regularization without the cost of any additional gradient computations. It also means that we do not spend time approaching the exact local minimum of the surrogate loss.

[Figure 7.5: An illustration of the effect of early stopping (right) as a form of regularization on the value of the optimal w, as compared to L² regularization (left) discussed in Sec. 7.1.1.]
How early stopping acts as a regularizer: So far we have stated that early stopping is a regularizer, but we have only backed up this claim by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model?⁷

Early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_o. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) and taking η as the learning rate. We can view the product ητ as the reciprocal of a regularization parameter. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o.
Indeed, we can show that, in the case of a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to L² regularization as seen in Section 7.1.1.⁷

⁷ Material for this section was taken from Bishop (1995) and Sjöberg and Ljung (1995); for further details regarding the interpretation of early stopping as a regularizer, please consult these works.

In order to compare with classical L² regularization, we again consider the simple setting where the parameters to be optimized are θ = w, and we take a quadratic approximation to the objective function J in the neighborhood of the empirically optimal value of the weights w*:
Ĵ(w) = J(w*) + ½ (w − w*)⊤ H (w − w*)    (7.13)

where, as before, H is the Hessian matrix of J with respect to w evaluated at w*. Given the assumption that w* is a minimum of J(w), we know that H is positive semi-definite and that the gradient of the approximation is given by

∇_w Ĵ(w) = H(w − w*).    (7.14)
Let us consider an initial parameter vector chosen at the origin, i.e., w^(0) = 0. We will consider updating the parameters via gradient descent:

w^(τ) = w^(τ−1) − η ∇_w Ĵ(w^(τ−1))    (7.15)
      = w^(τ−1) − η H (w^(τ−1) − w*)    (7.16)

w^(τ) − w* = (I − ηH)(w^(τ−1) − w*)    (7.17)
Let us now consider this expression in the space of the eigenvectors of H, i.e., we will again use the eigendecomposition of H: H = QΛQ⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

w^(τ) − w* = (I − ηQΛQ⊤)(w^(τ−1) − w*)

Q⊤(w^(τ) − w*) = (I − ηΛ) Q⊤(w^(τ−1) − w*)
Assuming w^(0) = 0 and that |1 − ηλᵢ| < 1, we can unroll this recurrence to obtain Q⊤(w^(τ) − w*) = (I − ηΛ)^τ Q⊤(w^(0) − w*) = −(I − ηΛ)^τ Q⊤w*, so that after τ training updates,

Q⊤ w^(τ) = [I − (I − ηΛ)^τ] Q⊤ w*.    (7.18)
Now, the expression for Q⊤w̃ in Eqn. 7.8 for L² regularization can be rearranged as

Q⊤ w̃ = (Λ + αI)⁻¹ Λ Q⊤ w*

Q⊤ w̃ = [I − (Λ + αI)⁻¹ α] Q⊤ w*.    (7.19)
Comparing Eqns. 7.18 and 7.19, we see that if

(I − ηΛ)^τ = (Λ + αI)⁻¹ α,

then L² regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all the λᵢ are small (i.e., ηλᵢ ≪ 1 and λᵢ/α ≪ 1) then

τ ≈ 1/(ηα).    (7.20)
That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L² regularization parameter.
Parameter values corresponding to directions of significant curvature (of the
loss) are regularized less than directions of less curvature. Of course, in the context
of early stopping, this really means that parameters that correspond to directions
of significant curvature tend to learn early relative to parameters corresponding
to directions of less curvature.
7.9 Parameter Sharing
The regularization strategies discussed so far penalize the model parameters for deviating from a fixed value, typically zero. A different way to express prior knowledge is to require sets of parameters to be close to one another: from a Bayesian perspective, if we believe that two components of a model (or two models solving related tasks) should compute similar functions, we can penalize the distance between their parameters. Taking this idea further and imposing it as a hard practical constraint, we can force sets of parameters to be exactly equal. This form of regularization is known as parameter sharing; its most extensive use is in convolutional neural networks, where the same weights are reused at every spatial location of the input.
7.10 Sparse Representations
Most deep learning models have some concept of a learned representation of their input. Whereas the penalties discussed so far act directly on the model parameters, a related strategy is to place a penalty, such as an L¹ norm, on the elements of the representation (the activations of the hidden units), encouraging the representation itself, rather than the parameters, to be sparse.
7.11 Dropout
Because deep models have a high degree of expressive power, they are capable of
overfitting significantly. While this problem can be solved by using a very large
dataset, large datasets are not always available. Dropout (Srivastava et al., 2014)
provides a computationally inexpensive but powerful method of regularizing a
broad family of models.
Dropout can be thought of as a method of making bagging practical for neu-
ral networks. Bagging involves training multiple models, and evaluating multiple
models on each test example. This seems impractical when each model is a neural
network, since training and evaluating a neural network is costly in terms of run-
time and storing a neural network is costly in terms of memory. Dropout provides
an inexpensive approximation to training and evaluating a bagged ensemble of
exponentially many neural networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that
can be formed by removing units from an underlying base network. In most mod-
ern neural networks, based on a series of affine transformations and nonlinearities,
we can effectively remove a unit from a network by multiplying its state by zero.
This procedure requires some slight modification for models such as radial basis
function networks, which take the difference between the unit’s state and some
reference value. Here, we will present the dropout algorithm in terms of multipli-
cation by zero for simplicity, but it can be trivially modified to work with other
operations that remove a unit from the network.
To train with dropout, we use a minibatch-based learning algorithm such as stochastic gradient descent. Each time we load an example into a minibatch, we randomly sample a binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently, with a fixed probability of including the unit (a hyperparameter; commonly 0.5 for hidden units and 0.8 for input units), and training proceeds as usual on the resulting masked sub-network. This is analogous to bagging, except that the exponentially many sub-networks share parameters rather than being trained independently; note that this training procedure does not rely on the model being probabilistic. At test time, we would ideally average the predictions of all the sub-networks, as in bagging, but doing so explicitly is intractable. A simple approximation, called the weight scaling inference rule, is to run the full network with all units present but with each unit's output (equivalently, its outgoing weights) multiplied by the probability with which that unit was included during training, so that the expected input to each unit matches its expectation during training. Justifying this rule does rely on viewing the model probabilistically.
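A minimal sketch of one forward pass of a single-hidden-layer network under this scheme; the inclusion probabilities are the commonly used values mentioned above, and the example is illustrative rather than a complete training procedure.

import numpy as np

def dropout_forward(x, W1, b1, W2, b2, rng, p_in=0.8, p_hid=0.5, train=True):
    if train:
        x = x * rng.binomial(1, p_in, size=x.shape)       # randomly drop input units
        h = np.maximum(0, W1 @ x + b1)
        h = h * rng.binomial(1, p_hid, size=h.shape)      # randomly drop hidden units
    else:
        # Weight scaling inference rule: keep every unit, but scale its output
        # by its inclusion probability so expected inputs match training.
        h = np.maximum(0, W1 @ (p_in * x) + b1)
        h = p_hid * h
    return W2 @ h + b2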
For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with n input variables represented by the vector v:

P(Y = y | v) = softmax(W⊤v + b)_y.

We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

P(Y = y | v; d) = softmax(W⊤(d ⊙ v) + b)_y.
The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members’ predictions:

P_ensemble(Y = y | v) = P̃_ensemble(Y = y | v) / Σ_{y′} P̃_ensemble(Y = y′ | v)    (7.21)

where

P̃_ensemble(Y = y | v) = [ Π_{d∈{0,1}^n} P(Y = y | v; d) ]^(1/2^n).
To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(Y = y | v) = [ Π_{d∈{0,1}^n} P(Y = y | v; d) ]^(1/2^n)

 = [ Π_{d∈{0,1}^n} softmax(W⊤(d ⊙ v) + b)_y ]^(1/2^n)

 = [ Π_{d∈{0,1}^n} exp(W_{y,:}⊤(d ⊙ v) + b_y) / Σ_{y′} exp(W_{y′,:}⊤(d ⊙ v) + b_{y′}) ]^(1/2^n)

 = [ Π_{d∈{0,1}^n} exp(W_{y,:}⊤(d ⊙ v) + b_y) ]^(1/2^n) / [ Π_{d∈{0,1}^n} Σ_{y′} exp(W_{y′,:}⊤(d ⊙ v) + b_{y′}) ]^(1/2^n).

Because P̃_ensemble will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(Y = y | v) ∝ [ Π_{d∈{0,1}^n} exp(W_{y,:}⊤(d ⊙ v) + b_y) ]^(1/2^n)

 = exp( (1/2^n) Σ_{d∈{0,1}^n} ( W_{y,:}⊤(d ⊙ v) + b_y ) )

 = exp( ½ W_{y,:}⊤ v + b_y ).

Substituting this back into equation 7.21, we obtain a softmax classifier with weights ½W.
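The following sketch checks this result numerically for a tiny softmax regression model; the sizes and parameter values are arbitrary, and n is kept small because the ensemble has 2^n members.

import numpy as np
from itertools import product

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, k = 4, 3                                   # 4 input units, 3 classes
W = rng.standard_normal((n, k))
b = rng.standard_normal(k)
v = rng.standard_normal(n)

# Renormalized geometric mean over all 2^n dropout masks d, equation 7.21
log_probs = [np.log(softmax(W.T @ (np.array(d) * v) + b)) for d in product([0, 1], repeat=n)]
geo_mean = np.exp(np.mean(log_probs, axis=0))
ensemble = geo_mean / geo_mean.sum()

# Weight scaling inference rule: a single forward pass with weights W / 2
scaled = softmax((W / 2).T @ v + b)

print(np.allclose(ensemble, scaled))          # True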
The weight scaling rule is also exact in other settings, including regression
networks with conditionally normal outputs, and deep networks that have hidden
layers without nonlinearities. However, the weight scaling rule is only an approxi-
mation for deep models that have non-linearities, and this approximation has not
been theoretically characterized. Fortunately, it works well, empirically. Good-
fellow et al. (2013a) found empirically that for deep networks with nonlinearities,
the weight scaling rule can work better (in terms of classification accuracy) than
Monte Carlo approximations to the ensemble predictor, even if the Monte Carlo
approximation is allowed to sample up to 1,000 sub-networks.
Srivastava et al. (2014) showed that dropout is more effective than other stan-
dard computationally inexpensive regularizers, such as weight decay, filter norm
constraints, and sparse activity regularization. Dropout may also be combined
with more expensive forms of regularization such as unsupervised pretraining to
yield an improvement. As of this writing, the state of the art classification error
rate on the permutation invariant MNIST dataset (not using any prior knowledge
about images) is attained by a classifier that uses both dropout regularization and
deep Boltzmann machine pretraining. However, combining dropout with unsu-
pervised pretraining has not become a popular strategy for larger models and
more challenging datasets.
One advantage of dropout is that it is very computationally cheap. Using
dropout during training requires only O(n) computation per example per update,
to generate n random binary numbers and multiply them by the state. Depending
on the implementation, it may also require O(n) memory to store these binary
numbers until the backpropagation stage. Running inference in the trained model
has the same cost per-example as if dropout were not used, though we must pay
the cost of dividing the weights by 2 once before beginning to run inference on
examples.
One significant advantage of dropout is that it does not significantly limit
the type of model or training procedure that can be used. It works well with
nearly any model that uses a distributed representation and can be trained with
stochastic gradient descent. This includes feedforward neural networks, proba-
bilistic models such as restricted Boltzmann machines (Srivastava et al., 2014),
and recurrent neural networks (Pascanu et al., 2014b). This is very different from
many other neural network regularization strategies, such as those based on un-
supervised pretraining or semi-supervised learning. Such regularization strategies
often impose restrictions such as not being able to use rectified linear units or
max pooling. Often these restrictions incur enough harm to outweigh the benefit
provided by the regularization strategy.
Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. This is because
the size of the optimal model (in terms of validation set error) is usually much
larger, and because the number of steps required to reach convergence increases.
This is of course to be expected from a regularization method, but it does mean
that for very large datasets (as a rough rule of thumb, dropout is unlikely to be
beneficial when more than 15 million training examples are available, though the
exact boundary may be highly problem dependent) it is often preferable not to
use dropout at all, just to speed training and reduce the computational cost of
the final model.
When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alter-
native Splicing Dataset (Xiong et al., 2011) where fewer than 5,000 examples are
available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.
Several analyses have sought to explain why dropout works. Wager et al. (2013) showed that, for linear models, dropout training is approximately equivalent to an adaptive form of L² regularization in which the penalty on each weight is rescaled according to the corresponding feature, an analysis that also connects dropout to AdaGrad-style preconditioning, and they proposed a semi-supervised variant. Baldi and Sadowski (2013) studied the averaging behavior of dropout, including the use of the geometric rather than the arithmetic mean. Warde-Farley et al. (2014) compared dropout to “dropout boosting,” an otherwise identical procedure whose ensemble members are trained to work well jointly rather than independently, and found that dropout boosting provides little regularization benefit, suggesting that dropout’s effectiveness comes from its bagging-like training of an implicit ensemble rather than from noise robustness alone.
The stochasticity used while training with dropout is not a necessary part
of the model’s success. It is just a means of approximating the sum over all
sub-models. Wang and Manning (2013) derived analytical approximations to
this marginalization. Their approximation, known as fast dropout, resulted in faster convergence due to the reduced stochasticity in the computation of
the gradient. This method can also be applied at test time, as a more principled
(but also more computationally expensive) approximation to the average over
all sub-networks than the weight scaling approximation. Fast dropout has been
used to match the performance of standard dropout on small neural network
problems, but has not yet yielded a significant improvement or been applied to a
large problem.
Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see chapter 9.3) for building ensembles
of convolutional networks with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.
Dropout has also been observed to work particularly well in combination with piecewise linear hidden units such as maxout units and rectified linear units.
7.12 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling
the examples (i.e., constraints) arising out of several tasks.
Figure 7.6 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting Yᵢ given X) share the same input X, as well as some intermediate-level representation capturing a common pool of factors. The
model can generally be divided into two kinds of parts and associated parameters:
1. Task-specific parameters (which only benefit from the examples of their task
to achieve good generalization). Example: upper layers of a neural network,
in Figure 7.6.
2. Generic parameters, shared across all the tasks (which benefit from the
pooled data of all the tasks). Example: lower layers of a neural network, in
Figure 7.6.
Improved generalization and generalization error bounds (Baxter, 1995) can be
achieved because of the shared parameters, for which statistical strength can be
greatly improved (in proportion with the increased number of examples for the
shared parameters, compared to the scenario of single-task models). Of course, this will happen only if some assumptions about the statistical relationship between the different tasks are valid, i.e., that there is something shared across some of the tasks.
From the point of view of deep learning, the underlying prior regarding the
data is the following: among the factors that explain the variations observed in
the data associated with the different tasks, some are shared across two or more
tasks.
7.13 Adversarial Training
An intriguing failure mode of even highly accurate neural networks is their vulnerability to adversarial examples: inputs formed by applying a small, carefully chosen perturbation to a correctly classified example, so that the perturbed input is nearly indistinguishable from the original to a human observer yet is confidently misclassified by the model. Fig. 7.7 shows an example, in which an imperceptibly small perturbation of an image of a panda causes it to be classified as a gibbon with high confidence. Training the model on adversarial examples, a procedure known as adversarial training, encourages the model to be less sensitive to small perturbations of its inputs and can act as a regularizer that improves generalization. A related phenomenon is that of “rubbish class” examples: inputs that a human would consider to belong to no natural class at all, but that the model nonetheless assigns to one of its classes with high confidence.
[Figure 7.6: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters can be learned on top of a shared representation (associated respectively with h₁ and h₂ in the figure). The underlying assumption is that there exists a common pool of factors that explain the variations in the input X, while each task is associated with a subset of these factors. In the figure, it is additionally assumed that top-level hidden units are specialized to each task, while some intermediate-level representation is shared across all tasks. Note that in the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h₃): these are the factors that explain some of the input variations but are not relevant for these tasks.]
[Figure 7.7: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014) on ImageNet. Left: the original image x, classified as “panda” with 57.7% confidence. Center: the perturbation sign(∇ₓJ(θ, x, y)), classified on its own as “nematode” with 8.2% confidence. Right: the adversarial example x + 0.007 · sign(∇ₓJ(θ, x, y)), classified as “gibbon” with 99.3% confidence. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet’s classification of the image. Here the ε of 0.007 corresponds to the magnitude of the smallest bit of an 8-bit image encoding after GoogLeNet’s conversion to real numbers.]