Chapter 7
Regularization of Deep or
Distributed Models
A central problem in machine learning is how to make an algorithm that will
perform well not just on the training data, but also on new inputs. Many strategies
used in machine learning are explicitly designed to reduce the test error, possibly
at the expense of increased training error. These strategies are known collectively
as regularization.
Chapter 5 introduced the basic concepts of generalization, underfitting, over-
fitting, bias, variance and regularization. If you are not already familiar with
these notions, please refer to that chapter before continuing with this one.
In this chapter, we describe regularization in more detail, focusing on general
strategies for regularizing deep or distributed models. Most deep learning algo-
rithms can be seen as different ways of developing the specifics of these general
strategies.
Regularization is any component of the model, training process or prediction
procedure which is included to account for limitations of the training data, in-
cluding its finiteness. There are many regularization strategies. Some put extra
constraints on a machine learning model, such as adding restrictions on the pa-
rameter values. Some add extra terms in the objective function that one might
consider a soft constraint on the parameter values. If chosen carefully, these extra
constraints and penalties can lead to improved performance on the test set. Some-
times these constraints and penalties are designed to encode specific kinds of prior
knowledge. Other times, these constraints and penalties are designed to express
a generic preference for a simpler model class in order to promote generalization.
Sometimes penalties and constraints are necessary to make an underdetermined
problem determined. Other forms of regularization, known as ensemble methods,
combine multiple hypotheses that explain the training data.
In the context of deep learning, most regularization strategies are based on
regularizing estimators. Regularization of an estimator works by trading increased
bias for reduced variance. An effective regularizer is one that makes a profitable trade, that is, one that reduces variance significantly while not overly increasing the bias. When we discussed generalization and overfitting in Chapter 5, we focused on three situations, where the model family being trained either (1) excluded the true data generating process, corresponding to underfitting and inducing bias, (2) matched the true data generating process, the "just right" model space, or (3) included the generating process but also many other possible generating processes, the regime where variance dominates the estimation error (as measured, e.g., by the MSE; see Section 5.5).
Note that, in practice, an overly complex model family does not necessarily
include (or even come close to) the target function or the true data generating
process. We almost never have access to the true data generating process so
we can never know if the model family being estimated includes the generating
process or not. But since, in deep learning, we are often trying to work with
data such as images, audio sequences and text, we can probably safely assume
that our model family does not include the data generating process. We can assume that, to some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family), and using the data to do that as best we can.
What this means is that controlling the complexity of the model is not going to be a simple question of finding the model of the right size, i.e., the right number of parameters. Instead, we might find (and in practical deep learning scenarios we almost always do find) that the best-fitting model, in the sense of minimizing generalization error, is one that possesses a large number of parameters that are not entirely free to span their domain.
As we will see, there are a great many forms of regularization available to the deep learning practitioner. In fact, developing more effective regularizers has been one of the major research efforts in the field.
Most machine learning tasks can be viewed in terms of learning to represent a function $\hat{f}(x)$ parametrized by a vector of parameters $\theta$. The data consists of inputs $x^{(t)}$ and (for some tasks) targets $y^{(t)}$ for $t \in \{1, \dots, m\}$. In the case of classification, each $y^{(t)}$ is an integer class label in $\{1, \dots, k\}$. For regression tasks, each $y^{(t)}$ is a real number or a real-valued vector (and we then denote the target in bold, as $\mathbf{y}^{(t)}$), while in the case of a density estimation task there are simply no targets (or we consider both $x^{(t)}$ and $y^{(t)}$ as the values of observed variables $(x, y)$ whose joint distribution is to be captured, and we may simply lump both of them into the name $x$). We may group these examples into a design matrix $X$ and a vector of targets $y$ (when $y^{(t)}$ is a scalar), or a matrix of targets $Y$ (when $y^{(t)}$ is a vector).
In deep learning, we are mainly interested in the case where the function $\hat{f}(x)$ has a large number of parameters and as a result possesses a high capacity to fit relatively complicated functions. This means that deep learning algorithms usually require either a large training set or careful regularization (intended either to reduce the effective capacity of the model or to guide the model toward a specific solution using prior information), or both.
7.1 Regularization from a Bayesian Perspective
The Bayesian perspective on statistical inference offers a useful framework in
which to consider many common methods of regularization. As we discussed in
Sec. 5.7, Bayesian estimation theory takes a fundamentally different approach
to model estimation than the frequentist view by considering that the model
parameters themselves are uncertain and therefore should be considered random
variables.
There are a number of immediate consequences of assuming a Bayesian world view. The first is that if we are using probability distributions to assess uncertainty in the model parameters, then we should be able to express our uncertainty about the model parameters before we see any data.¹ This is the role of the prior distribution. The second consequence comes from marginalizing over the parameters: when using the model to make predictions about outcomes, one should ideally average over the probable parameter values, or more precisely sum (or integrate) over all the possible values, weighted by their posterior distribution.
There is a deep connection between the Bayesian perspective on estimation and the process of regularization. This is not surprising, since at root both are concerned with making predictions relative to the true data generating distribution while taking into account the finiteness of the data. What this means is that both are open to combining information sources: that is, both are interested in combining the information that can be extracted from the training data with other, or "prior," sources of information. As we will see, many forms of regularization can be given a Bayesian interpretation.
If we consider a dataset $\{x^{(1)}, \dots, x^{(m)}\}$, we recover the posterior distribution $p(\theta \mid x^{(1)}, \dots, x^{(m)})$ over the model parameters $\theta$ by combining the data likelihood $p(x^{(1)}, \dots, x^{(m)} \mid \theta)$ with the prior $p(\theta)$:
$$\log p(\theta \mid x^{(1)}, \dots, x^{(m)}) = \log p(\theta) + \sum_i \log p(x^{(i)} \mid \theta) + \text{constant} \quad (7.1)$$

¹ We should have a lot of uncertainty about the parameters before we see the data, but some configurations may a priori be impossible, while others could seem more plausible, a priori.
where the constant is $-\log Z$, with $Z$ the normalization constant, which does not depend on $\theta$ but does depend on the data. When maximizing over $\theta$, this constant does not matter. In the context of maximum likelihood learning, the introduction of the prior distribution plays the same role as a regularizer, in that it can be seen as a term (the first one above) added to the objective function (the second term, the log-likelihood) in hopes of achieving better generalization, despite its detrimental effect on the likelihood of the training data (whose optimum would be achieved by considering only the second term above).
In the following section, we will detail how the addition of a prior is equivalent to certain regularization strategies. However, we must be a bit careful in establishing the relationship between the prior and a regularizer. Regularizers are more general than priors. Priors are distributions, and as such are subject to constraints: they must be non-negative and must sum (or integrate) to one over their domain. Regularizers have no such explicit constraints and can depend on the data, not just on the parameters. Another problem in interpreting all regularizers as priors is that the equivalence implies the overly restrictive constraint that all unregularized objective functions be interpretable as log-likelihood functions. Nevertheless, it remains true that many of the most popular forms of regularization can be equated to a Bayesian prior.
7.2 Classical Regularization: Parameter Norm Penalty
Regularization has been used for decades prior to the advent of deep learning.
Statistical and machine learning models traditionally represented simpler func-
tions. Because the functions themselves had less capacity, the regularization did
not need to be as sophisticated. We use the term classical regularization to refer
to the techniques used in the general machine learning and statistics literature.
Most classical regularization approaches are based on limiting the capacity
of models, such as neural networks, linear regression, or logistic regression, by
adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$. We denote the regularized objective function by $\tilde{J}$:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta) \quad (7.2)$$
where $\alpha$ is a hyperparameter that weights the contribution of the norm penalty term $\Omega$ relative to the standard objective function $J(\theta; X, y)$. The hyperparameter $\alpha$ should be a non-negative real number, with $\alpha = 0$ corresponding to no regularization, and larger values of $\alpha$ corresponding to more regularization.

When our training algorithm minimizes the regularized objective function $\tilde{J}$, it will decrease both the original objective $J$ on the training data and some measure of the size of the parameters $\theta$ (or some subset of the parameters). Different choices for the parameter norm $\Omega$ can result in different solutions being preferred. In this section, we discuss the effects of the various norms when used as penalties on the model parameters.
Before delving into the regularization behavior of different norms, we note
that for neural networks, we typically choose to use a parameter norm penalty Ω
that only penalizes the interaction weights, i.e., we leave the offsets unregularized.
The offsets typically require less data to fit accurately than the weights. Each
weight specifies how two variables interact, and requires observing both variables
in a variety of conditions to fit well. Each offset controls only a single variable.
This means that we do not induce too much variance by leaving the offsets un-
regularized. Also, regularizing the offsets can introduce a significant amount of
underfitting.
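To make Eq. 7.2 and this convention concrete, here is a minimal sketch (in Python with NumPy; the function names and the choice of a linear model are illustrative assumptions, not part of the text) of an objective with a parameter norm penalty applied only to the weights:

```python
import numpy as np

def unregularized_loss(W, b, X, y):
    # Stand-in for J(theta; X, y): mean squared error of a linear model.
    return np.mean((X @ W + b - y) ** 2)

def regularized_loss(W, b, X, y, alpha):
    # J-tilde = J + alpha * Omega(theta), with Omega(theta) = (1/2) * ||W||_2^2.
    # Only the interaction weights W are penalized; the offsets b are left alone.
    return unregularized_loss(W, b, X, y) + alpha * 0.5 * np.sum(W ** 2)
```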
7.2.1 $L^2$ Parameter Regularization

We have already seen one of the simplest and most common kinds of classical regularization: the $L^2$ parameter norm penalty, commonly known as weight decay. This regularization strategy drives the parameters closer to the origin² by adding a regularization term $\Omega(\theta) = \frac{1}{2}\|w\|_2^2$ to the objective function, where $w$ is the subset of parameters considered to be weights (i.e., parameters of a linear transformation, rather than, e.g., biases parameterizing the offset of an affine transformation), and $\alpha$ is a coefficient determining the strength of the regularization penalty relative to the other terms in the objective function, such as the negative log-likelihood.
In various contexts, $L^2$ regularization is also known as ridge regression or Tikhonov regularization.
In the context of neural networks, it is sometimes desirable to use a separate
weight decay penalty with a different α coefficient for each layer of the network.
Because α is a hyperparameter and it can be expensive to search for the correct
value of multiple hyperparameters, it is still reasonable to use the same weight
decay at all layers just to reduce the search space.
We can gain some insight into the behavior of weight decay regularization by considering the gradient of the regularized objective function. To simplify the presentation, we assume no offset term, so $\theta$ is just $w$. Such a model has the following gradient of the total objective function:
$$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y). \quad (7.3)$$
To take a single gradient step with learning rate $\epsilon$, we perform this update:
$$w \leftarrow w - \epsilon\left(\alpha w + \nabla_w J(w; X, y)\right).$$
Written another way, the update is:
$$w \leftarrow (1 - \epsilon\alpha)\,w - \epsilon\, \nabla_w J(w; X, y).$$

² More generally, we could regularize the parameters to be near any specific point in space and, surprisingly, still get a regularization effect, but better results will be obtained for a value closer to the true one, with zero being a default value that makes sense when we do not know if the correct value should be positive or negative. Since it is far more common to consider regularizing the model parameters to zero, we will focus on this special case in our exposition.
We can see that the addition of the weight decay term has modified the learn-
ing rule to multiplicatively shrink the weight vector by a constant factor on each
step, towards zero, just before performing the usual gradient update. This de-
scribes what happens in a single step. But what happens over the entire course
of training?
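A minimal sketch of this single-step update (grad_J here is a hypothetical function returning the gradient of the unregularized objective):

```python
def weight_decay_step(w, grad_J, alpha, epsilon, X, y):
    # w <- (1 - epsilon * alpha) * w - epsilon * grad_J(w, X, y):
    # the weights are first shrunk multiplicatively toward zero,
    # then the usual gradient step on J is applied.
    return (1.0 - epsilon * alpha) * w - epsilon * grad_J(w, X, y)
```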
We will further simplify the analysis by considering a quadratic approximation to the objective function in the neighborhood of the empirically optimal value of the weights, $w^*$. (If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.)
$$\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*) \quad (7.4)$$
where $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. There is no first-order term in this quadratic approximation, because $w^*$ is defined to be a minimum, where the gradient vanishes. Likewise, because $w^*$ is a minimum, we can conclude that $H$ is positive semi-definite. The gradient of the approximation is
$$\nabla_w \hat{J}(w) = H(w - w^*). \quad (7.5)$$
If we replace the exact gradient in equation 7.3 with the approximate gradient in equation 7.5, we can write an equation for the location $\tilde{w}$ of the minimum of the regularized objective function:
$$\alpha \tilde{w} + H(\tilde{w} - w^*) = 0 \quad (7.6)$$
$$(H + \alpha I)\tilde{w} = H w^* \quad (7.7)$$
$$\tilde{w} = (H + \alpha I)^{-1} H w^*. \quad (7.8)$$
From this, we see that the presence of the regularization term moves the optimum from $w^*$ to $\tilde{w}$. As $\alpha$ approaches $0$, $\tilde{w}$ approaches $w^*$. But what happens as $\alpha$ grows?

Figure 7.1: An illustration of the effect of $L^2$ (or weight decay) regularization on the value of the optimal $w$. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the $L^2$ regularizer. At the point $\tilde{w}$, these competing objectives reach an equilibrium. Note how in the first dimension, where the eigenvalue of the Hessian of $J$ is small (i.e., the objective function does not increase much when moving horizontally away from $w^*$), the regularizer has a strong effect, pulling the parameter value close to zero, whereas in the second dimension, the objective function is very sensitive to movements away from $w^*$, the eigenvalue is large (high curvature), and weight decay does not affect the solution much.

Because $H$ is real and symmetric, we can decompose it into a diagonal matrix $\Lambda$ and an orthonormal basis of eigenvectors $Q$, such that $H = Q \Lambda Q^\top$.
Applying the decomposition to equation 7.8, we obtain:
$$\tilde{w} = (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^*$$
$$= \left[Q(\Lambda + \alpha I)Q^\top\right]^{-1} Q \Lambda Q^\top w^*$$
$$= Q(\Lambda + \alpha I)^{-1} \Lambda\, Q^\top w^*,$$
$$Q^\top \tilde{w} = (\Lambda + \alpha I)^{-1} \Lambda\, Q^\top w^*. \quad (7.9)$$
If we interpret $Q^\top \tilde{w}$ as rotating our solution parameters $\tilde{w}$ into the basis defined by the eigenvectors $Q$ of $H$, then we see that the effect of weight decay is to rescale the coefficients along each eigenvector. Specifically, the $i$-th component is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$. (You may wish to review how this kind of scaling works, first explained in Fig. 2.3.)
Along the directions where the eigenvalues of $H$ are relatively large, for example where $\lambda_i \gg \alpha$, the effect of regularization is relatively small. However, components with $\lambda_i \ll \alpha$ will be shrunk to have nearly zero magnitude. This effect is illustrated in Fig. 7.1.
Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training. This effect of suppressing contributions to the parameter vector along these principal directions of the Hessian $H$ is captured in the concept of the effective number of parameters, defined to be
$$\gamma = \sum_i \frac{\lambda_i}{\lambda_i + \alpha}. \quad (7.10)$$
As α is increased, the effective number of parameters decreases.
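As a small numeric illustration (with made-up eigenvalues), the following sketch computes the rescaling factors $\lambda_i/(\lambda_i + \alpha)$ and the effective number of parameters $\gamma$ from Eq. 7.10:

```python
import numpy as np

eigenvalues = np.array([10.0, 1.0, 0.01])  # hypothetical eigenvalues of H
alpha = 0.1

rescaling = eigenvalues / (eigenvalues + alpha)
gamma = rescaling.sum()  # effective number of parameters (Eq. 7.10)

print(rescaling)  # approx [0.99, 0.91, 0.09]: the low-curvature direction is suppressed
print(gamma)      # approx 1.99: roughly two of the three parameters remain effective
```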
Another way to gain some intuition for the effect of $L^2$ regularization is to consider its effect on linear regression. The unregularized objective function for linear regression is the sum of squared errors:
$$(Xw - y)^\top (Xw - y).$$
When we add $L^2$ regularization, the objective function changes to
$$(Xw - y)^\top (Xw - y) + \frac{1}{2}\alpha\, w^\top w.$$
This changes the normal equations for the solution from
$$w = (X^\top X)^{-1} X^\top y$$
to
$$w = (X^\top X + \alpha I)^{-1} X^\top y.$$
We can see that $L^2$ regularization causes the learning algorithm to "perceive" the input $X$ as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
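A minimal numeric sketch of the two normal-equation solutions (the data here is randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                       # 20 examples, 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=20)
alpha = 1.0

w_ols = np.linalg.solve(X.T @ X, X.T @ y)                        # (X^T X)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)  # (X^T X + alpha I)^{-1} X^T y

# The regularized solution is shrunk toward zero relative to the unregularized one.
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```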
To summarize correspondences that recur throughout this chapter: the $L^2$ penalty corresponds to an $L^2$ constraint on the parameters, to a Gaussian prior on the weights, and (under assumptions developed later in this chapter) to adding input noise and to early stopping, while the $L^1$ penalty corresponds to an $L^1$ constraint and to a Laplace prior.
7.2.2 $L^1$ Regularization

While $L^2$ weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use $L^1$ regularization.

Formally, $L^1$ regularization on the model parameters $w$ is defined as:
$$\Omega(\theta) = \|w\|_1 = \sum_i |w_i|, \quad (7.11)$$
that is, as the sum of absolute values of the individual parameters.³ We will now consider the effect of $L^1$ regularization on the simple linear model, with no offset term, that we considered in our analysis of $L^2$ regularization. In particular, we are interested in delineating the differences between the $L^1$ and $L^2$ forms of regularization. If we consider the gradient (actually the sub-gradient) of the regularized objective function $\tilde{J}(w; X, y)$, we have:
$$\nabla_w \tilde{J}(w; X, y) = \beta\, \mathrm{sign}(w) + \nabla_w J(w; X, y) \quad (7.12)$$
where $\mathrm{sign}(w)$ is simply the sign of $w$ applied element-wise, and $\beta$ is the coefficient weighting the $L^1$ penalty (playing the role that $\alpha$ played for $L^2$ regularization).

By inspecting Eqn. 7.12, we can see immediately that the effect of $L^1$ regularization is quite different from that of $L^2$ regularization. Specifically, the regularization contribution to the gradient no longer scales linearly with $w$; instead it is a constant factor with a sign equal to $\mathrm{sign}(w)$. One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of $J(w; X, y)$ as we did for $L^2$ regularization. Instead, the solutions are going to be much more aligned to the basis in which the problem is embedded.
To see this, and for the sake of comparison with $L^2$ regularization, we will again consider a simplified setting of a quadratic approximation to the objective function in the neighborhood of the empirical optimum $w^*$. (Once again, if the per-example loss is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.) The gradient of this approximation is given by
$$\nabla_w \hat{J}(w) = H(w - w^*), \quad (7.13)$$
where, again, $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. We will also make the further simplifying assumption that the Hessian is diagonal, $H = \mathrm{diag}([\gamma_1, \dots, \gamma_N])$, where each $\gamma_i > 0$. With this rather restrictive assumption, the minimization of the $L^1$-regularized objective function decomposes into a set of one-dimensional problems, each of the form
$$\frac{1}{2}\gamma_i (w_i - w_i^*)^2 + \beta |w_i|.$$
Each of these problems admits an optimal solution of the following form:
$$w_i = \mathrm{sign}(w_i^*)\, \max\!\left(|w_i^*| - \frac{\beta}{\gamma_i},\; 0\right).$$

³ As with $L^2$ regularization, we could consider regularizing the parameters to a value that is not zero, but instead to some parameter value $w^{(o)}$. In that case the $L^1$ regularization would introduce the term $\Omega(\theta) = \|w - w^{(o)}\|_1 = \sum_i |w_i - w_i^{(o)}|$.
Figure 7.2: An illustration of the effect of $L^1$ regularization (right) on the value of the optimal $w$, in comparison to the effect of $L^2$ regularization (left).
Let us consider the situation in which $w_i^* > 0$ for all $i$; there are two possible outcomes. Case 1: $w_i^* \le \frac{\beta}{\gamma_i}$. Here the optimal value of $w_i$ under the regularized objective is simply $w_i = 0$. This occurs because the contribution of $J(w; X, y)$ to the regularized objective $\tilde{J}(w; X, y)$ is overwhelmed, in direction $i$, by the $L^1$ regularization, which pushes the value of $w_i$ to zero. Case 2: $w_i^* > \frac{\beta}{\gamma_i}$. Here the regularization does not move the optimal value of $w_i$ to zero, but instead just shifts it toward zero by a distance equal to $\frac{\beta}{\gamma_i}$. This is illustrated in Fig. 7.2. A similar argument can be made when $w_i^* < 0$, but with the $L^1$ penalty making $w_i$ less negative by $\frac{\beta}{\gamma_i}$, or setting it to 0.
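A minimal sketch of this elementwise solution, under the same diagonal-Hessian assumption (the particular numbers are illustrative):

```python
import numpy as np

def l1_regularized_optimum(w_star, gamma, beta):
    # Solves min_w  0.5 * gamma_i * (w_i - w*_i)^2 + beta * |w_i|  elementwise:
    # w_i = sign(w*_i) * max(|w*_i| - beta / gamma_i, 0)
    return np.sign(w_star) * np.maximum(np.abs(w_star) - beta / gamma, 0.0)

w_star = np.array([0.05, -0.3, 2.0])   # unregularized optima
gamma = np.array([1.0, 1.0, 1.0])      # diagonal Hessian entries
print(l1_regularized_optimum(w_star, gamma, beta=0.1))  # [ 0.  -0.2  1.9]: the small component is zeroed
```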
In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that is more sparse. Sparsity in this context means that some of the model's parameters have an optimal value of zero under the regularized objective. As we discussed, for each element $i$ of the parameter vector, this happens when $|w_i^*| \le \frac{\beta}{\gamma_i}$. Compare this to the situation for $L^2$ regularization, where (under the same assumption of a diagonal Hessian $H$) we get $w_i^{L^2} = \frac{\gamma_i}{\gamma_i + \alpha} w_i^*$, which is nonzero as long as $w_i^*$ is nonzero.

In Fig. 7.2, we see that even when the optimal value of $w$ is nonzero, $L^1$ regularization punishes small parameter values just as harshly as larger values, leading to optimal solutions in which more parameters take the value zero and the remaining parameters tend to be larger.
The sparsity property induced by $L^1$ regularization has been used extensively as a feature selection mechanism. In particular, the well-known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an $L^1$ penalty with a linear model and a least-squares cost function.
7.2.3 Bayesian Interpretation of the Parameter Norm Penalty

Parameter norm penalties are often amenable to being interpreted as a Bayesian prior. Recall that parameter norm penalties are effected by adding a term $\Omega(w)$ to the unregularized objective function $J$:
$$\tilde{J}(w; X, y) = J(w; X, y) + \alpha\Omega(w) \quad (7.14)$$
where $\alpha$ is a hyperparameter that weights the relative contribution of the norm penalty term.

We can view the minimization of the regularized objective function above as equivalent to finding the maximum a posteriori (MAP) estimate of the parameters: $p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$, where the unregularized objective $J(w; X, y)$ is taken as the negative log-likelihood and the regularization term $\alpha\Omega(w)$ plays the role of the negative log of the parameter prior distribution. Different choices of regularizer correspond to different priors.
In the case of $L^2$ regularization, minimizing with $\alpha\Omega(w) = \frac{\alpha}{2}\|w\|_2^2$ is functionally equivalent to maximizing the log of the posterior distribution (or minimizing the negative log posterior) where the prior is given by a Gaussian distribution:
$$\log p(w; \mu, \Sigma) = -\frac{1}{2}(w - \mu)^\top \Sigma^{-1} (w - \mu) - \frac{1}{2}\log|\Sigma| - \frac{d}{2}\log(2\pi)$$
where $d$ is the dimension of $w$. Ignoring terms that are not a function of $w$ (and therefore do not affect the MAP value), we can see that by choosing $\mu = 0$ and $\Sigma^{-1} = \alpha I$, we recover the functional form of $L^2$ regularization: $p(w; \mu, \Sigma) \propto e^{-\frac{\alpha}{2}\|w\|_2^2}$. Thus $L^2$ regularization can be interpreted as assuming independent Gaussian prior distributions over all the model parameters, each with precision (i.e., inverse variance) $\alpha$.
For $L^1$ regularization, minimizing with $\alpha\Omega(w) = \alpha \sum_i |w_i|$ is equivalent to maximizing the log of the posterior distribution with an isotropic Laplace distribution over $w$:
$$\log p(w; \mu, \eta) = \sum_i \log \mathrm{Laplace}(w_i; \mu_i, \eta_i) = \sum_i \left( -\frac{|w_i - \mu_i|}{\eta_i} - \log(2\eta_i) \right).$$
Once again we can ignore the second term here because it does not depend on the elements of $w$, so $L^1$ regularization is equivalent to optimizing a MAP objective with a log prior given by $\sum_i \log \mathrm{Laplace}(w_i; 0, \tfrac{1}{\alpha})$.
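A brief numeric check of these correspondences (using SciPy's Gaussian and Laplace densities; the weight values are arbitrary): each negative log prior differs from the corresponding penalty only by a constant that does not depend on $w$.

```python
import numpy as np
from scipy.stats import norm, laplace

w = np.array([0.5, -1.0, 2.0])
alpha = 0.3

# Gaussian prior with precision alpha on each weight, i.e., N(0, 1/alpha).
neg_log_gauss = -norm.logpdf(w, loc=0.0, scale=np.sqrt(1.0 / alpha)).sum()
l2_penalty = 0.5 * alpha * np.sum(w ** 2)

# Laplace prior with scale 1/alpha on each weight.
neg_log_laplace = -laplace.logpdf(w, loc=0.0, scale=1.0 / alpha).sum()
l1_penalty = alpha * np.sum(np.abs(w))

print(neg_log_gauss - l2_penalty)    # constant (d/2) * log(2*pi/alpha), independent of w
print(neg_log_laplace - l1_penalty)  # constant d * log(2/alpha), independent of w
```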
7.3 Classical Regularization as Constrained Optimization

Classical regularization adds a penalty term to the training objective:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta).$$
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush-Kuhn-Tucker (KKT) multiplier⁴, and a function representing whether the constraint is satisfied. If we wanted to constrain $\Omega(\theta)$ to be less than some constant $k$, we could construct a generalized Lagrange function
$$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k).$$
The solution to the constrained problem is given by
$$\theta^* = \arg\min_\theta \max_{\alpha,\, \alpha \ge 0} \mathcal{L}(\theta, \alpha).$$
Solving this problem requires modifying both $\theta$ and $\alpha$. Specifically, $\alpha$ must increase whenever $\Omega(\theta) > k$ and decrease whenever $\Omega(\theta) < k$. However, after we have solved the problem, we can fix $\alpha^*$ and view the problem as just a function of $\theta$:
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \alpha^*) = \arg\min_\theta J(\theta; X, y) + \alpha^*\Omega(\theta).$$
This is exactly the same as the regularized training problem of minimizing $\tilde{J}$. Note that the value of $\alpha^*$ does not directly tell us the value of $k$. In principle, one can solve for $k$, but the relationship between $k$ and $\alpha^*$ depends on the form of $J$. We can thus think of classical regularization as imposing a constraint on the weights, but with an unknown size of the constraint region. Larger $\alpha$ will result in a smaller constraint region, and smaller $\alpha$ will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As
described in Sec. 4.4, we can modify algorithms such as stochastic gradient descent
to take a step downhill on J(θ) and then project θ back to the nearest point that
satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k
is appropriate and do not want to spend time searching for the value of α that
corresponds to this k.
⁴ KKT multipliers generalize Lagrange multipliers to allow for inequality constraints.

Another reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause non-convex optimization
procedures to get stuck in local minima corresponding to small θ. When training
neural networks, this usually manifests as neural networks that train with several
“dead units”. These are units that do not contribute much to the behavior of the
function learned by the network because the weights going into or out of them are
all very small. When training with a penalty on the norm of the weights, these
configurations can be locally optimal, even if it is possible to significantly reduce
$J$ by making the weights larger. (This concern about local minima obviously does not apply when $\tilde{J}$ is convex.)
Finally, explicit constraints with reprojection can be useful because they im-
pose some stability on the optimization procedure. When using high learning
rates, it is possible to encounter a positive feedback loop in which large weights
induce large gradients which then induce a large update to the weights. If these
updates consistently increase the size of the weights, then θ rapidly moves away
from the origin until numerical overflow occurs. Explicit constraints with repro-
jection allow us to terminate this feedback loop after the weights have reached a
certain magnitude. Hinton et al. (2012c) recommend using constraints combined
with a high learning rate to allow rapid exploration of parameter space while
maintaining some stability.
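A minimal sketch of such a reprojection step, constraining the $L^2$ norm of the weight vector to a hypothetical maximum value k after each gradient update:

```python
import numpy as np

def project_to_norm_ball(w, k):
    # If ||w||_2 exceeds k, rescale w back onto the constraint region Omega(w) <= k.
    norm = np.linalg.norm(w)
    return w * (k / norm) if norm > k else w

def constrained_step(w, grad_J, epsilon, k, X, y):
    # Take a step downhill on J, then reproject onto the constraint set.
    return project_to_norm_ball(w - epsilon * grad_J(w, X, y), k)
```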
7.4 Regularization and Under-Constrained Problems

In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix $X^\top X$. This is not possible whenever $X^\top X$ is singular. This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of $X$) than input features (columns of $X$). In this case, many forms of regularization correspond to inverting $X^\top X + \alpha I$ instead. This regularized matrix is guaranteed to be invertible.
These linear problems have closed form solutions when the relevant matrix is
invertible. It is also possible for a problem with no closed form solution to be
underdetermined. For example, consider logistic regression applied to a problem
where the classes are linearly separable. If a weight vector w is able to achieve
perfect classification, then 2w will also achieve perfect classification and higher
likelihood. An iterative optimization procedure like stochastic gradient descent
will continually increase the magnitude of w and, in theory, will never halt. In
practice, a numerical implementation of gradient descent will eventually reach
sufficiently large weights to cause numerical overflow, at which point its behavior
will depend on how the programmer has decided to handle values that are not
real numbers.
Most forms of regularization are able to guarantee the convergence of iterative
methods applied to underdetermined problems. For example, weight decay will
cause gradient descent to quit increasing the magnitude of the weights when the
slope of the likelihood is equal to the weight decay coefficient. Likewise, early
stopping based on the validation set classification rate will cause the training
algorithm to terminate soon after the validation set classification accuracy has
stopped increasing. Even if the problem is linearly separable and there is no
overfitting, the validation set classification accuracy will eventually saturate to
100%, resulting in termination of the early stopping procedure.
The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems.

As we saw in Sec. 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. One definition of the pseudoinverse $X^+$ of a matrix $X$ is as the limit of linear regression with a vanishing amount of $L^2$ regularization:
$$X^+ = \lim_{\alpha \searrow 0} (X^\top X + \alpha I)^{-1} X^\top.$$
When a true inverse for $X$ exists, then $w = X^+ y$ returns the weights that exactly solve the regression problem. When $X$ is not invertible because no exact solution exists, this returns the $w$ corresponding to the least possible mean squared error. When $X$ is not invertible because many solutions exactly solve the regression problem, this returns the $w$ with the minimum possible $L^2$ norm.
Recall that the Moore-Penrose pseudoinverse can be computed easily using
the singular value decomposition. Because the SVD is robust to underdetermined
problems resulting from too few observations or too little underlying variance,
it is useful for implementing stable variants of many closed-form linear machine
learning algorithms. The stability of these algorithms can be viewed as a result of
applying the minimum amount of regularization necessary to make the problem
become determined.
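A small numeric check of this limiting relationship, using NumPy's pinv and a tiny α (the matrix here is random and underdetermined purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # fewer examples than features: X^T X is singular

alpha = 1e-8
ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T)  # (X^T X + alpha I)^{-1} X^T
pinv = np.linalg.pinv(X)                                    # Moore-Penrose pseudoinverse

print(np.max(np.abs(ridge - pinv)))  # tiny: the regularized solution approaches X^+
```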
7.5 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it
on more data. Of course, in practice, the amount of data we have is limited. One
way to get around this problem is to create fake data and add it to the training
set. For some machine learning tasks, it is reasonably straightforward to create
new fake data.
This approach is easiest for classification. A classifier needs to take a compli-
cated, high dimensional input x and summarize it with a single category identity
y. This means that the main task facing a classifier is to be invariant to a wide
variety of transformations. We can generate new (x, y) pairs easily just by trans-
forming the x inputs in our training set.
This approach is not as readily applicable to many other tasks. For example,
it is difficult to generate new fake data for a density estimation task unless we
have already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a spe-
cific classification problem: object recognition. Images are high dimensional and
include an enormous variety of factors of variation, many of which can be easily
simulated. Operations like translating the training images a few pixels in each
direction can often greatly improve generalization, even if the model has already
been designed to be partially translation invariant by using convolution and pool-
ing. Many other operations such as rotating the image or scaling the image have
also proven quite effective. One must be careful not to apply transformations
that would change the correct class. For example, optical character recognition
tasks require recognizing the difference between 'b' and 'd' and the difference between '6' and '9', so horizontal flips and 180° rotations are not appropriate ways
of augmenting datasets for these tasks. There are also transformations that we
would like our classifiers to be invariant to, but which are not easy to perform.
For example, out-of-plane rotation can not be implemented as a simple geometric
operation on the input pixels.
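A minimal sketch of this kind of augmentation for arrays of images, using only small random translations (flips are deliberately omitted since, as noted above, they can change the correct class for character recognition):

```python
import numpy as np

def random_translate(image, max_shift=2, rng=None):
    # Shift the image by up to max_shift pixels in each direction, padding with zeros.
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape[:2]
    shifted = np.zeros_like(image)
    shifted[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)] = \
        image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    return shifted

def augment(images, labels, rng=None):
    # New (x, y) pairs are created by transforming x while keeping the label y fixed.
    return np.stack([random_translate(im, rng=rng) for im in images]), labels
```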
A way to “analytically” obtain an effect similar to dataset augmentation is to
add a regularizer that prefers a class probability predictor that is invariant to tiny
movements in the directions associated with known desired invariances. These di-
rections correspond to moving the training example along the subspace spanned
by the tangent vectors associated with the manifold containing the training exam-
ple and those obtained by applying these nuisance variations (such as translation
or rotation of the image). See Section 17.5 for more detail on these ideas and
the original Tangent-Prop algorithm (Simard et al., 1992), which introduced this
idea.
For many classification and even some regression tasks, the task should still
be possible to solve even if small random noise is added to the input. Neural
networks prove not to be very robust to noise, however (Tang and Eliasmith,
2010). One way to improve the robustness of neural networks is simply to train
them with random noise applied to their inputs, and this input noise injection
is part of some unsupervised learning algorithms such as the denoising auto-
encoder (Vincent et al., 2008). This same approach also works when the noise
is applied to the hidden units, which can be seen as doing dataset augmentation
at multiple levels of abstraction. Poole et al. (2014) recently showed that this
approach can be highly effective provided that the magnitude of the noise is
carefully tuned. Dropout, a powerful regularization strategy that will be described
in Sec. 7.11, can be seen as a process of constructing new inputs by multiplying by
noise. In a multilayer network, it can often be beneficial to apply transformations
such as noise to the hidden units, as well as the inputs. This can be viewed as
augmenting the dataset as seen by the deeper layers.
When reading machine learning research papers, it is important to take the
effect of dataset augmentation into account. Often, hand-designed dataset aug-
mentation schemes can dramatically reduce the generalization error of a machine
learning technique. It is important to look for controlled experiments. When
comparing machine learning algorithm A and machine learning algorithm B, it
is necessary to make sure that both algorithms were evaluated using the same
hand-designed dataset augmentation schemes. If algorithm A performs poorly
with no dataset augmentation and algorithm B performs well when combined
with numerous synthetic transformations of the input, then it is likely the syn-
thetic transformations and not algorithm B itself that cause the improved per-
formance. Sometimes the line is blurry, such as when a new machine learning
algorithm involves injecting noise into the inputs. In these cases, it is best to
consider how generally applicable the new algorithm is, and to make sure that
pre-existing algorithms are re-run in as similar of conditions as possible.
7.6 Classical Regularization as Noise Robustness
In the machine learning literature, there have been two ways that noise has been
used as part of a regularization strategy. The first and most popular way is by
adding noise to the input. While this can be interpreted simply as a form of dataset
augmentation (as described above in Sec. 7.5), we can also interpret it as being
equivalent to more traditional forms of regularization.
The second way that noise has been used in the service of regularizing models
is by adding it to the weights. This technique has been used primarily in the
context of recurrent neural networks (Jim et al., 1996; Graves, 2011a). This can
be interpreted as a stochastic implementation of a Bayesian inference over the
weights. The Bayesian treatment of learning would consider the model weights
to be uncertain and representable via a probability distribution that reflects this
uncertainty. Adding noise to the weights is a practical, stochastic way to reflect
this uncertainty (Graves, 2011a).
In this section, we review these two strategies and provide some insight into
how noise can act to regularize the model.
7.6.1 Injecting Noise at the Input
Some classical regularization techniques can be derived in terms of training on
noisy inputs
5
. Let us consider a regression setting, where we are interested in
learning a model ˆy(x) that maps a set of features x to a scalar. The cost function
we will use is the least-squares error between the model prediction ˆy(x) and the
true value y:
J = E
p(x,y)
(ˆy(x) y)
2
, (7.15)
where we are given a dataset of m input / output pairs {(x
(1)
, y
(1)
), . . . , (x
(m)
, y
(m)
)}
and the training objective is to minimize the objective function, which is the em-
pirical average of the squared error on the training data.
Now consider that with each input presentation to the model we also include a random perturbation $\epsilon \sim \mathcal{N}(0, \nu I)$, so that the error function becomes
$$\begin{aligned}
\tilde{J}_x &= \mathbb{E}_{p(x,y,\epsilon)}\!\left[(\hat{y}(x + \epsilon) - y)^2\right] \\
&= \mathbb{E}_{p(x,y,\epsilon)}\!\left[\hat{y}^2(x + \epsilon) - 2y\hat{y}(x + \epsilon) + y^2\right] \\
&= \mathbb{E}_{p(x,y,\epsilon)}\!\left[\hat{y}^2(x + \epsilon)\right] - 2\,\mathbb{E}_{p(x,y,\epsilon)}\!\left[y\hat{y}(x + \epsilon)\right] + \mathbb{E}_{p(x,y,\epsilon)}\!\left[y^2\right] \quad (7.16)
\end{aligned}$$
Assuming small noise, we can consider the Taylor series expansion of $\hat{y}(x + \epsilon)$ around $\hat{y}(x)$:
$$\hat{y}(x + \epsilon) = \hat{y}(x) + \epsilon^\top \nabla_x \hat{y}(x) + \frac{1}{2}\epsilon^\top \nabla^2_x \hat{y}(x)\, \epsilon + O(\epsilon^3) \quad (7.17)$$

⁵ The analysis in this section is mainly based on that in Bishop (1995a,b).

Substituting this approximation for $\hat{y}(x + \epsilon)$ into the objective function (Eq. 7.16)
and using the fact that $\mathbb{E}_{p(\epsilon)}[\epsilon] = 0$ and that $\mathbb{E}_{p(\epsilon)}[\epsilon\epsilon^\top] = \nu I$ to simplify⁶, we get:
$$\begin{aligned}
\tilde{J}_x &\approx \mathbb{E}_{p(x,y,\epsilon)}\!\left[\left(\hat{y}(x) + \epsilon^\top \nabla_x \hat{y}(x) + \tfrac{1}{2}\epsilon^\top \nabla^2_x \hat{y}(x)\, \epsilon\right)^2\right] \\
&\quad - 2\,\mathbb{E}_{p(x,y,\epsilon)}\!\left[y\hat{y}(x) + y\,\epsilon^\top \nabla_x \hat{y}(x) + \tfrac{1}{2}y\,\epsilon^\top \nabla^2_x \hat{y}(x)\, \epsilon\right] + \mathbb{E}_{p(x,y,\epsilon)}\!\left[y^2\right] \\
&= \mathbb{E}_{p(x,y,\epsilon)}\!\left[(\hat{y}(x) - y)^2\right] + \mathbb{E}_{p(x,y,\epsilon)}\!\left[\hat{y}(x)\,\epsilon^\top \nabla^2_x \hat{y}(x)\, \epsilon + \left(\epsilon^\top \nabla_x \hat{y}(x)\right)^2\right] + O(\epsilon^3) \\
&\quad - 2\,\mathbb{E}_{p(x,y,\epsilon)}\!\left[\tfrac{1}{2}y\,\epsilon^\top \nabla^2_x \hat{y}(x)\, \epsilon\right] \\
&= J + \nu\,\mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)\,\nabla^2_x \hat{y}(x)\right] + \nu\,\mathbb{E}_{p(x,y)}\!\left[\|\nabla_x \hat{y}(x)\|^2\right] \quad (7.18)
\end{aligned}$$
If we consider minimizing this objective function by taking the functional gradient with respect to $\hat{y}(x)$ and setting the result to zero, we can see that
$$\hat{y}(x) = \mathbb{E}_{p(y \mid x)}[y] + O(\nu).$$
This implies that the expectation in the second-to-last term of Eq. 7.18, $\mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)\,\nabla^2_x \hat{y}(x)\right]$, reduces to $O(\nu)$ because the expectation of the difference $(\hat{y}(x) - y)$ reduces to $O(\nu)$.
This leaves us with the objective function of the form
$$\tilde{J}_x = \mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)^2\right] + \nu\,\mathbb{E}_{p(x,y)}\!\left[\|\nabla_x \hat{y}(x)\|^2\right] + O(\nu^2).$$
For small $\nu$, the minimization of $J$ with added noise on the input (with covariance $\nu I$) is equivalent to minimization of $J$ with an additional regularization term given by $\nu\,\mathbb{E}_{p(x,y)}\!\left[\|\nabla_x \hat{y}(x)\|^2\right]$.
Considering the behavior of this regularization term, we note that it has the effect of penalizing large gradients of the function $\hat{y}(x)$. That is, it has the effect of reducing the sensitivity of the output of the network with respect to small variations in its input $x$. We can interpret this as attempting to build some local robustness into the model and thereby promote generalization. We note also that for linear networks, this regularization term reduces to simple weight decay (as discussed in Sec. 7.2.1).
⁶ In this derivation we have used two properties of the trace operator: (1) a scalar is equal to its own trace; (2) for matrices $A$ and $B$ such that $AB$ is square, $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$. These are discussed in Sec. 2.10.
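In practice this regularizer is obtained implicitly, simply by perturbing each input with fresh Gaussian noise of variance ν at every presentation; a minimal sketch (grad_fn is a hypothetical function returning the gradients of the squared error with respect to the parameters):

```python
import numpy as np

def noisy_input_step(params, X_batch, y_batch, nu, epsilon, grad_fn, rng):
    # Add zero-mean Gaussian noise with covariance nu*I to the inputs,
    # then take an ordinary gradient step on the squared error.
    noise = rng.normal(scale=np.sqrt(nu), size=X_batch.shape)
    grads = grad_fn(params, X_batch + noise, y_batch)
    return [p - epsilon * g for p, g in zip(params, grads)]
```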
7.6.2 Injecting Noise at the Weights
Rather than injecting noise as part of the input, one could also consider adding
noise directly to the model parameters. As we shall see, this can also be in-
terpreted as equivalent (under some assumptions) to a more traditional form of
regularization. Adding noise to the weights has been shown to be an effective reg-
ularization strategy in the context of recurrent neural networks⁷ (Jim et al., 1996;
Graves, 2011b). In the following, we will present an analysis of the effect of weight
noise on a standard feedforward neural network (as introduced in Chapter 6).
As we did in the last section, we again consider the regression setting, where we wish to train a function $\hat{y}(x)$ that maps a set of features $x$ to a scalar, using the least-squares cost function between the model predictions $\hat{y}(x)$ and the true values $y$:
$$J = \mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)^2\right]. \quad (7.19)$$
We again assume we are given a dataset of $m$ input/output pairs $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$.
We now assume that with each input presentation we also include a random perturbation $\epsilon_W \sim \mathcal{N}(0, \eta I)$ of the network weights. Let us imagine that we have a standard $L$-layer MLP; we denote the perturbed model as $\hat{y}_{\epsilon_W}(x)$. Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes:
$$\begin{aligned}
\tilde{J}_W &= \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[(\hat{y}_{\epsilon_W}(x) - y)^2\right] \\
&= \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[\hat{y}^2_{\epsilon_W}(x) - 2y\hat{y}_{\epsilon_W}(x) + y^2\right] \quad (7.20)
\end{aligned}$$
Assuming small noise, we can consider the Taylor series expansion of $\hat{y}_{\epsilon_W}(x)$ around the unperturbed function $\hat{y}(x)$:
$$\hat{y}_{\epsilon_W}(x) = \hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) + \frac{1}{2}\epsilon_W^\top \nabla^2_W \hat{y}(x)\, \epsilon_W + O(\epsilon_W^3) \quad (7.21)$$
⁷ Recurrent neural networks will be discussed in detail in Chapter 10.

From here, we follow the same basic strategy that was laid out in the previous section in analyzing the effect of adding noise to the input. That is, we substitute the Taylor series expansion of $\hat{y}_{\epsilon_W}(x)$ into the objective function in Eq. 7.20.
$$\begin{aligned}
\tilde{J}_W &\approx \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[\left(\hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) + \tfrac{1}{2}\epsilon_W^\top \nabla^2_W \hat{y}(x)\, \epsilon_W\right)^2\right] \\
&\quad - \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[2y\left(\hat{y}(x) + \epsilon_W^\top \nabla_W \hat{y}(x) + \tfrac{1}{2}\epsilon_W^\top \nabla^2_W \hat{y}(x)\, \epsilon_W\right)\right] + \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[y^2\right] \quad (7.22) \\
&= \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[(\hat{y}(x) - y)^2\right] - 2\,\mathbb{E}_{p(x,y,\epsilon_W)}\!\left[\tfrac{1}{2}y\,\epsilon_W^\top \nabla^2_W \hat{y}(x)\, \epsilon_W\right] \\
&\quad + \mathbb{E}_{p(x,y,\epsilon_W)}\!\left[\hat{y}(x)\,\epsilon_W^\top \nabla^2_W \hat{y}(x)\, \epsilon_W + \left(\epsilon_W^\top \nabla_W \hat{y}(x)\right)^2 + O(\epsilon_W^3)\right], \quad (7.23)
\end{aligned}$$
where we have used the fact that $\mathbb{E}_{p(\epsilon_W)}[\epsilon_W] = 0$ to drop terms that are linear in $\epsilon_W$. Incorporating the assumption that $\mathbb{E}_{p(\epsilon_W)}[\epsilon_W \epsilon_W^\top] = \eta I$, we have:
$$\tilde{J}_W \approx J + \eta\,\mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)\,\nabla^2_W \hat{y}(x)\right] + \eta\,\mathbb{E}_{p(x,y)}\!\left[\|\nabla_W \hat{y}(x)\|^2\right] \quad (7.24)$$
Again, if we consider minimizing this objective function, we can see that the optimal value of $\hat{y}(x)$ is
$$\hat{y}(x) = \mathbb{E}_{p(y \mid x)}[y] + O(\eta),$$
implying that the expectation in the middle term of Eq. 7.24, $\mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)\,\nabla^2_W \hat{y}(x)\right]$, reduces to $O(\eta)$ because the expectation of the difference $(\hat{y}(x) - y)$ reduces to $O(\eta)$.
This leaves us with the objective function of the form
$$\tilde{J}_W = \mathbb{E}_{p(x,y)}\!\left[(\hat{y}(x) - y)^2\right] + \eta\,\mathbb{E}_{p(x,y)}\!\left[\|\nabla_W \hat{y}(x)\|^2\right] + O(\eta^2).$$
For small $\eta$, the minimization of $J$ with added weight noise (with covariance $\eta I$) is equivalent to minimization of $J$ with an additional regularization term: $\eta\,\mathbb{E}_{p(x,y)}\!\left[\|\nabla_W \hat{y}(x)\|^2\right]$. This form of regularization encourages the parameters to go to regions of parameter space where the weights have a relatively small influence on the output. In other words, it pushes the model into regions where it is relatively insensitive to small variations in the weights. Regularization strategies with this kind of behavior have been considered before (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression (where, for instance, $\hat{y}(x) = w^\top x + b$), this regularization term collapses into $\eta\,\mathbb{E}_{p(x)}\!\left[\|x\|^2\right]$, which is not a function of the parameters and therefore does not contribute to the gradient of $\tilde{J}_W$ with respect to the model parameters.
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over
time (indicated as number of training iterations over the dataset, or epochs). In this
example, we train a maxout network on MNIST, regularized with dropout. Observe that
the training objective decreases consistently over time, but the validation set average loss
eventually begins to increase again.
7.7 Early Stopping as a Form of Regularization
When training large models with large enough capacity, we often observe that
training error decreases steadily over time, but validation set error begins to rise
again. See Fig. 7.3 for an example of this behavior. This behavior occurs very
reliably.
This means we can obtain a model with better validation set error (and thus,
hopefully better test set error) by returning to the parameter setting at the point
in time with the lowest validation set error. Instead of running our optimization
algorithm until we reach a (local) minimum of validation error, we run it until the
error on the validation set has not improved for some amount of time. Every time
the error on the validation set improves, we store a copy of the model parameters.
When the training algorithm terminates, we return these parameters, rather than
the latest parameters. This procedure is specified more formally in Alg. 7.1.
This strategy is known as early stopping. It is probably the most commonly
used form of regularization in deep learning. Its popularity is due both to its
effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter se-
lection algorithm. In this view, the number of training steps is just another
hyperparameter. We can see in Fig. 7.3 that this hyperparameter has a U-shaped
validation set performance curve, just like most other model capacity control pa-
rameters, albeit with the addition of irregular oscillations. In this case, we are
controlling the effective capacity of the model by determining how many steps
it can take to fit the training set precisely. Most of the time, setting hyperpa-
rameters requires an expensive guess and check process, where we must set a
hyperparameter at the start of training, then run training for several steps to see
its effect. The “training time” hyperparameter is unique in that by definition
a single run of training tries out many values of the hyperparameter. The only
significant cost to choosing this hyperparameter automatically via early stopping
is running the validation set evaluation periodically during training. One typ-
ically chooses a validation set size and a period for computing validation error
such that this monitoring only adds a fraction of the training time to the total
computational cost, for example by making the number of training examples seen
between validation error measurements a multiple of the validation set size. Even
better, with access to another processor, one uses it to perform the monitoring
work, allowing one to consider larger validation sets without slowing down the
training process.
An additional cost to early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because it is acceptable to store these parameters in a slower and larger form of memory (for example, training in GPU memory, but storing the optimal parameters in host memory or on a disk drive). Since the best parameters are written infrequently and never read during training, these occasional slow writes have little effect on the total training time.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let $n$ be the number of steps between evaluations.
Let $p$ be the "patience," the number of times to observe worsening validation set error before giving up.
Let $\theta_o$ be the initial parameters.
$\theta \leftarrow \theta_o$
$i \leftarrow 0$
$j \leftarrow 0$
$v \leftarrow \infty$
$\theta^* \leftarrow \theta$
$i^* \leftarrow i$
while $j < p$ do
  Update $\theta$ by running the training algorithm for $n$ steps.
  $i \leftarrow i + n$
  $v' \leftarrow \mathrm{ValidationSetError}(\theta)$
  if $v' < v$ then
    $j \leftarrow 0$
    $\theta^* \leftarrow \theta$
    $i^* \leftarrow i$
    $v \leftarrow v'$
  else
    $j \leftarrow j + 1$
  end if
end while
Best parameters are $\theta^*$, best number of training steps is $i^*$.
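A minimal Python rendering of Algorithm 7.1 (train_for_n_steps and validation_error are hypothetical callables standing in for the training algorithm and the validation measurement):

```python
import copy
import math

def early_stopping(theta, train_for_n_steps, validation_error, n, patience):
    best_theta, best_steps, best_error = copy.deepcopy(theta), 0, math.inf
    steps, strikes = 0, 0
    while strikes < patience:
        theta = train_for_n_steps(theta, n)   # update theta by n training steps
        steps += n
        error = validation_error(theta)
        if error < best_error:                # validation improved: reset patience
            best_theta, best_steps, best_error = copy.deepcopy(theta), steps, error
            strikes = 0
        else:
            strikes += 1
    return best_theta, best_steps             # theta*, i*
```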
Early stopping is a very unobtrusive form of regularization, in that it requires
almost no change in the underlying training procedure, the objective function,
or the set of allowable parameter values. This means that it is easy to use early
stopping without damaging the learning dynamics. This is in contrast to weight
decay, where one must be careful not to use too much weight decay and trap the
network in a bad local minimum corresponding to a solution with pathologically
small weights.
Early stopping may be used either alone or in conjunction with other regu-
larization strategies. Even when using regularization strategies that modify the
objective function to encourage better generalization, it is rare for the best gen-
eralization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not
fed to the model. To best exploit this extra data, one can perform extra training
after the initial training with early stopping has completed. In the second, extra
training step, all of the training data is included. There are two basic strategies
one can use for this second training procedure.
One strategy is to initialize the model again and retrain on all of the data.
In this second training pass, we train for the same number of steps as the early
stopping procedure determined was optimal in the first pass. There are some
subtleties associated with this procedure. For example, there is not a good way
of knowing whether to retrain for the same number of parameter updates or the
same number of passes through the dataset. On the second round of training,
each pass through the dataset will require more parameter updates because the
training set is bigger. Usually, if overfitting is a serious concern, you will want to
retrain for the same number of epochs, rather than the same number of parameter
updates. If the primary difficulty is optimization rather than generalization, then
retraining for the same number of parameter updates makes more sense (but it is
also less likely that you need to use a regularization method like early stopping
in the first place). This algorithm is described more formally in Alg. 7.2.
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let $X^{(\mathrm{train})}$ and $y^{(\mathrm{train})}$ be the training set.
Split $X^{(\mathrm{train})}$ and $y^{(\mathrm{train})}$ into $(X^{(\mathrm{subtrain})}, X^{(\mathrm{valid})})$ and $(y^{(\mathrm{subtrain})}, y^{(\mathrm{valid})})$ respectively.
Run early stopping (Alg. 7.1) starting from random $\theta$ using $X^{(\mathrm{subtrain})}$ and $y^{(\mathrm{subtrain})}$ for training data and $X^{(\mathrm{valid})}$ and $y^{(\mathrm{valid})}$ for validation data. This returns $i^*$, the optimal number of steps.
Set $\theta$ to random values again.
Train on $X^{(\mathrm{train})}$ and $y^{(\mathrm{train})}$ for $i^*$ steps.
Another strategy for using all of the data is to keep the parameters obtained
from the first round of training and then continue training but now using all of
the data. At this stage, we no longer have a guide for when to stop in terms
of a number of steps. Instead, we can monitor the average loss function on the
validation set, and continue training until it falls below the value of the training
set objective at which the early stopping procedure halted. This strategy avoids
the high cost of retraining the model from scratch, but is not as well-behaved. For
Figure 7.4: An illustration of the effect of early stopping (right) as a form of regularization on the value of the optimal $w$, as compared to the $L^2$ regularization (left) discussed in Sec. 7.2.1.
example, there is not any guarantee that the objective on the validation set will
ever reach the target value, so this strategy is not even guaranteed to terminate.
This procedure is presented more formally in Alg. 7.3.
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continuing training until that value is reached.

Let $X^{(\mathrm{train})}$ and $y^{(\mathrm{train})}$ be the training set.
Split $X^{(\mathrm{train})}$ and $y^{(\mathrm{train})}$ into $(X^{(\mathrm{subtrain})}, X^{(\mathrm{valid})})$ and $(y^{(\mathrm{subtrain})}, y^{(\mathrm{valid})})$ respectively.
Run early stopping (Alg. 7.1) starting from random $\theta$ using $X^{(\mathrm{subtrain})}$ and $y^{(\mathrm{subtrain})}$ for training data and $X^{(\mathrm{valid})}$ and $y^{(\mathrm{valid})}$ for validation data. This updates $\theta$.
$\epsilon \leftarrow J(\theta, X^{(\mathrm{subtrain})}, y^{(\mathrm{subtrain})})$
while $J(\theta, X^{(\mathrm{valid})}, y^{(\mathrm{valid})}) > \epsilon$ do
  Train on $X^{(\mathrm{train})}$ and $y^{(\mathrm{train})}$ for $n$ steps.
end while
Early stopping and the use of surrogate loss functions: A useful property
of early stopping is that it can help to mitigate the problems caused by a mismatch
between the surrogate loss function whose gradient we follow downhill and the
underlying performance measure that we actually care about. For example, 0-
1 classification loss has a derivative that is zero or undefined everywhere, so it
is not appropriate for gradient-based optimization. We therefore train with a
surrogate such as the log-likelihood of the correct class label. However, 0-1 loss
is inexpensive to compute, so it can easily be used as an early stopping criterion.
Often the 0-1 loss continues to decrease long after the log-likelihood has begun
to worsen on the validation set.
Early stopping is also useful because it reduces the computational cost of the
training procedure. It is a form of regularization that does not require adding
additional terms to the surrogate loss function, so we get the benefit of regular-
ization without the cost of any additional gradient computations. It also means
that we do not spend time approaching the exact local minimum of the surrogate
loss.
How early stopping acts as a regularizer: So far we have stated that early
stopping is a regularization strategy, but we have only backed up this claim by
showing learning curves where the validation set error has a U-shaped curve.
What is the actual mechanism by which early stopping regularizes the model? (The material in this section is drawn from Bishop (1995a) and Sjöberg and Ljung (1995); consult these works for further details on the interpretation of early stopping as a regularizer.)
Early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_0. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate η. We can view the product ητ as the reciprocal of a regularization coefficient. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_0.
Indeed, we can show how—in the case of a simple linear model with a quadratic
error function and simple gradient descent—early stopping is equivalent to L2
regularization as seen in Section 7.2.1.
In order to compare with classical L2 regularization, we again consider the simple setting where the parameters to be optimized are θ = w, and we take a quadratic approximation to the objective function J in the neighborhood of the empirically optimal value of the weights w*:

Ĵ(w) = J(w*) + (1/2)(w − w*)^⊤ H (w − w*),   (7.25)

where, as before, H is the Hessian matrix of J with respect to w, evaluated at w*. Given the assumption that w* is a minimum of J(w), we know that H is positive semi-definite, and the gradient of the approximation is given by

∇_w Ĵ(w) = H(w − w*).   (7.26)
Let us consider some initial parameter vector, for simplicity the origin, i.e. w^(0) = 0. We will consider updating the parameters via gradient descent:

w^(τ) = w^(τ−1) − η ∇_w Ĵ(w^(τ−1))   (7.27)
      = w^(τ−1) − η H (w^(τ−1) − w*)   (7.28)
w^(τ) − w* = (I − ηH)(w^(τ−1) − w*).   (7.29)
Let us now consider this expression in the space of the eigenvectors of H, i.e. we will again use the eigendecomposition of H: H = QΛQ^⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.
w^(τ) − w* = (I − ηQΛQ^⊤)(w^(τ−1) − w*)
Q^⊤(w^(τ) − w*) = (I − ηΛ) Q^⊤(w^(τ−1) − w*)

Assuming w^(0) = 0, and that η is chosen small enough that |1 − ηλ_i| < 1 for every eigenvalue λ_i, applying this recursion τ times gives Q^⊤(w^(τ) − w*) = (I − ηΛ)^τ Q^⊤(w^(0) − w*) = −(I − ηΛ)^τ Q^⊤ w*. Rearranging, after τ training updates we have

Q^⊤ w^(τ) = [I − (I − ηΛ)^τ] Q^⊤ w*.   (7.30)
Now, the expression for Q^⊤ w̃ in Eqn. 7.9 for L2 regularization can be rearranged as:

Q^⊤ w̃ = (Λ + αI)^{−1} Λ Q^⊤ w*
Q^⊤ w̃ = [I − (Λ + αI)^{−1} α] Q^⊤ w*.   (7.31)
Comparing Eqs. 7.30 and 7.31, we see that if

(I − ηΛ)^τ = (Λ + αI)^{−1} α,

then L2 regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λ_i are small (i.e. ηλ_i ≪ 1 and λ_i/α ≪ 1), then

τ ≈ 1/(ηα),   α ≈ 1/(τη).   (7.32)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L2 regularization parameter, and the inverse of τη plays the role of the weight decay coefficient.
(Note that for neural networks we cannot actually initialize all of the parameters at 0, since we need to obtain symmetry breaking between hidden units, as discussed in Section 8.7.3. However, the argument above holds for any other initial value w^(0).)
Parameter values corresponding to directions of significant curvature (of the
objective function) are regularized less than directions of less curvature. Of course,
in the context of early stopping, this really means that parameters that correspond
to directions of significant curvature tend to learn early relative to parameters
corresponding to directions of less curvature.
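To make the equivalence concrete, the following numpy sketch (using a randomly generated Hessian H and optimum w*, not values from the text) compares the parameters reached after τ gradient-descent steps from the origin with the L2-regularized solution w̃ = (H + αI)^{-1} H w* for α = 1/(τη); when ηλ_i and λ_i/α are small, the two should be approximately equal.

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
H = A @ A.T / (10 * n)                # a small positive semi-definite "Hessian"
w_star = rng.normal(size=n)           # the unregularized optimum w*

eta, tau = 0.01, 100                  # learning rate and number of steps
alpha = 1.0 / (tau * eta)             # weight decay coefficient suggested by Eq. 7.32

# Early stopping view: tau gradient-descent steps on the quadratic, starting from w = 0.
w = np.zeros(n)
for _ in range(tau):
    w = w - eta * H @ (w - w_star)

# L2 regularization view: minimizer of the quadratic plus (alpha/2)||w||^2.
w_tilde = np.linalg.solve(H + alpha * np.eye(n), H @ w_star)

print(np.round(w, 3))
print(np.round(w_tilde, 3))           # approximately equal under the stated assumptions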
7.8 Parameter Tying and Parameter Sharing
Thus far, in this chapter, when we have discussed adding constraints or penalties
to the parameters, we have always done so with respect to a fixed region or point.
For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. However, sometimes we may need
other ways to express our prior knowledge about suitable values of the model
parameters. Sometimes we might not know precisely what values the parameters
should take but we know, from knowledge of the domain and model architecture,
that there should be some dependencies between the model parameters.
A common type of dependency that we often want to express is that certain
parameters should be close to one another.
Consider the following scenario: we have two models performing the same classification task (i.e. the same set of classes) but with somewhat different input distributions. Formally, we have model a with parameters w^(a) and model b with parameters w^(b). The two models map the input to two different, but related, outputs: ŷ_a = f(w^(a), x) and ŷ_b = g(w^(b), x).
Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other, i.e. that for all i, w_i^(a) should be close to w_i^(b). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form Ω(w^(a), w^(b)) = ‖w^(a) − w^(b)‖_2^2. Here we used an L2 penalty, but other choices are also possible.
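A minimal sketch of this soft parameter-tying penalty in numpy; the weight vectors and the coefficient α = 0.1 are illustrative, and in practice the penalty and its gradients would simply be added to the task losses and gradients of the two models.

import numpy as np

def tied_penalty(w_a, w_b, alpha=0.1):
    # Omega(w_a, w_b) = ||w_a - w_b||_2^2, scaled by the regularization coefficient.
    return alpha * np.sum((w_a - w_b) ** 2)

def tied_penalty_grads(w_a, w_b, alpha=0.1):
    # Gradients of the penalty with respect to each parameter vector;
    # each model is pulled toward the other.
    diff = w_a - w_b
    return 2 * alpha * diff, -2 * alpha * diff

w_a = np.array([0.5, -1.2, 3.0])
w_b = np.array([0.4, -1.0, 2.5])
print(tied_penalty(w_a, w_b))
print(tied_penalty_grads(w_a, w_b))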
This kind of approach was proposed in Lasserre et al. (2006), where they
regularized the parameters of one model, trained as a classifier in a supervised
paradigm, with the parameters of another model, this one trained in an unsu-
pervised paradigm (to capture the distribution of the observed input data). The
architectures were constructed such that many of the parameters in the classifier
model could be paired to corresponding parameters in the unsupervised model.
While a parameter norm penalty is one way to regularize parameters to be
close to one another, the more popular way is to use constraints: to force sets
of parameters to be equal. This method of regularization is often referred to as
parameter sharing, where we interpret the various models or model components as
sharing a unique set of parameters. A significant advantage of parameter sharing
over regularizing the parameters to be close (via a norm penalty) is that only a
subset of the parameters (the unique set) need to be stored in memory. In certain
models—such as the convolutional neural network—this can lead to significant
reduction in the memory footprint of the model.
Convolutional Neural Networks By far the most popular and extensive use
of parameter sharing occurs in convolutional neural networks (CNNs) applied to
computer vision.
Natural images have many statistical properties that are invariant to transla-
tion. For example, a photo of a cat remains a photo of a cat if it is translated one
pixel to the right. CNNs take this property into account by sharing parameters
across multiple image locations. The same feature (i.e., a hidden unit with the
same weights) is computed over different locations in the input. This means that
we can find a cat with the same cat detector whether the cat appears at column
i or column i + 1 in the image.
Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one
of the best examples of how to effectively incorporate domain knowledge into the
network architecture.
CNNs will be discussed in more detail in Chapter 9.
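The following numpy sketch illustrates the parameter-sharing idea in one dimension: the same 3-parameter filter is applied at every position, and an equivalent unshared (dense) layer needs a separate copy of those weights for every output position. The filter values are arbitrary.

import numpy as np

def conv1d_valid(x, k):
    # Apply the same filter k at every valid position of x.
    # Every output reuses the same 3 parameters: this is parameter sharing.
    w = len(k)
    return np.array([np.dot(x[i:i + w], k) for i in range(len(x) - w + 1)])

x = np.arange(10, dtype=float)        # a 1-D "image" with 10 pixels
k = np.array([-1.0, 0.0, 1.0])        # one shared edge-detector filter: 3 parameters

shared_out = conv1d_valid(x, k)       # 8 outputs computed from 3 shared parameters

# The unshared alternative: a dense layer producing the same 8 outputs needs
# its own weight vector per output position (80 parameters, mostly redundant copies).
W_dense = np.zeros((8, 10))
for i in range(8):
    W_dense[i, i:i + 3] = k
dense_out = W_dense @ x

print(np.allclose(shared_out, dense_out))   # True: same function, far fewer parameters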
7.9 Sparse Representations
The previous sections of this chapter were concerned with direct regularization of
the model parameters. In this section we will describe a different kind of regular-
ization strategy where the effect on the model parameters is only indirect. Specif-
ically we consider representational sparsity as a form of regularization. We have
already discussed (in Sec. 7.2.2) how L1 penalization induces a sparse parametrization, meaning that a significant number of the parameters are zero (or close to it). Representational sparsity, on the other hand, describes a representation in which a significant number of the elements are zero (or close to zero). A simplified view of this distinction can be illustrated in the context of linear regression:
  [ 18 ]   [ 4  0  0 −2  0  0 ] [  2 ]
  [  5 ]   [ 0  0 −1  0  3  0 ] [  3 ]
  [ 15 ] = [ 0  5  0  0  0  0 ] [ −2 ]
  [ −9 ]   [ 1  0  0 −1  0 −4 ] [ −5 ]
  [ −3 ]   [ 1  0  0  0 −5  0 ] [  1 ]
                                [  4 ]
  y ∈ R^m       A ∈ R^{m×n}      x ∈ R^n

  [ −14 ]   [  3 −1  2 −5  4  1 ] [  0 ]
  [   1 ]   [  4  2 −3 −1  1  3 ] [  2 ]
  [  19 ] = [ −1  5  4  2 −3 −2 ] [  0 ]
  [   2 ]   [  3  1  2 −3  0 −3 ] [  0 ]
  [  23 ]   [ −5  4 −2  2 −5 −1 ] [ −3 ]
                                  [  0 ]
  y ∈ R^m       B ∈ R^{m×n}       h ∈ R^n
The first system above is an example of a sparsely parametrized linear regression model. The second is linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
The distinction between representation sparsity and parameter sparsity is im-
portant, but more generally regularizing the representation and regularizing the
model parameters can be related to each other in a fairly natural way. Ultimately,
the goal of either strategy is to regularize the function mapping the input to the
output space (regardless of the task). Sometimes prior information is most naturally expressed as constraints or penalties directly on the model parameters. However, sometimes the connection between the parameters and the function being regularized is not well understood, and it is difficult to know, a priori, how to specify a penalty or constraint on the model parameters that matches our prior knowledge of the function being learned. In these situations it may well be more natural or more effective to regularize the representations learned by the model.
Representational regularization is mediated by the same sorts of mechanisms that we have used in parameter regularization. Specifically, both soft norm penalties and hard constraints may be used, though norm penalties have been the more popular of the two in the deep learning community.
Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation, denoted Ω(h). As before, we denote the regularized loss function by J̃:

J̃(θ; X, y) = J(θ; X, y) + αΩ(h)   (7.33)

where α ≥ 0 weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.
Just as an L1 penalty on the parameters induces parameter sparsity, an L1 penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖_1 = Σ_i |h_i|. Of course, the L1 penalty is only one choice of penalty that can result in a sparse representation. Others include the Student-t penalty (derived from a Student-t distribution prior on the elements of the representation) (Olshausen and Field, 1996; Bergstra, 2011) and KL-divergence penalties (Lee et al., 2008; Larochelle and Bengio, 2008a; Goodfellow et al., 2009) that are especially useful for representations with elements constrained to lie on the unit interval. In Sec. 15.8, sparsity-inducing penalties are discussed in the context of autoencoders.
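A minimal numpy sketch of Eq. 7.33 with the L1 penalty: the regularized loss and the (sub)gradient of the penalty with respect to the representation h. The value of α and the example vector h are arbitrary, and the stand-in task loss is assumed to have been computed elsewhere.

import numpy as np

def regularized_loss(task_loss, h, alpha=0.1):
    # Eq. 7.33 with an L1 penalty on the representation: J~ = J + alpha * ||h||_1.
    return task_loss + alpha * np.sum(np.abs(h))

def penalty_grad_wrt_h(h, alpha=0.1):
    # (Sub)gradient of the penalty with respect to h; added to dJ/dh during
    # backpropagation, it pushes elements of the representation toward zero.
    return alpha * np.sign(h)

h = np.array([0.0, 2.3, -0.1, 0.0, 1.7])
print(regularized_loss(task_loss=0.42, h=h))   # 0.42 is a stand-in task loss
print(penalty_grad_wrt_h(h))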
7.10 Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization
error by combining several models (Breiman, 1994). The idea is to train several
different models separately, then have all of the models vote on the output for
test examples. This is an example of a general strategy in machine learning
called model averaging. Techniques employing this strategy are known as ensemble
methods.
The reason that model averaging works is that different models will usually
not make all the same errors on the test set.
Consider for example a set of k regression models. Suppose that each model makes an error ε_i on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[ε_i²] = v and covariances E[ε_i ε_j] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σ_i ε_i. The expected squared error of the ensemble predictor is

E[ ((1/k) Σ_i ε_i)² ] = (1/k²) E[ Σ_i ( ε_i² + Σ_{j≠i} ε_i ε_j ) ]
                      = (1/k) v + ((k − 1)/k) c.
In the case where the errors are perfectly correlated and c = v, this reduces to v, and the model averaging does not help at all. But in the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only v/k, i.e. it decreases inversely with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.
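This prediction is easy to check by simulation. The numpy sketch below draws correlated errors for k = 10 models with variance v and covariance c (arbitrary illustrative values) and compares the empirical squared error of the averaged prediction to v/k + (k − 1)c/k.

import numpy as np

rng = np.random.default_rng(0)
k, v, c = 10, 1.0, 0.3                 # ensemble size, error variance, error covariance

# Covariance matrix of the k models' errors: v on the diagonal, c elsewhere.
Sigma = np.full((k, k), c) + (v - c) * np.eye(k)

errors = rng.multivariate_normal(np.zeros(k), Sigma, size=200_000)
ensemble_error = errors.mean(axis=1)   # error of the averaged prediction

print(ensemble_error.var())            # empirical expected squared error
print(v / k + (k - 1) / k * c)         # predicted value: v/k + (k-1)c/k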
Different ensemble methods construct the ensemble of models in different
ways. For example, each member of the ensemble could be formed by training a
completely different kind of model using a different algorithm or objective func-
tion. Bagging is a method that allows the same kind of model, training algorithm
and objective function to be reused several times.
Specifically, bagging involves constructing k different datasets. Each dataset
has the same number of examples as the original dataset, but each dataset is
constructed by sampling with replacement from the original dataset. This means
that, with high probability, each dataset is missing some of the examples from the
original dataset and also contains several duplicate examples (on average around 2/3 of the examples from the original dataset are found in the resulting training set, if it has the same size as the original). Model i is then trained on dataset
i. The differences between which examples are included in each dataset result in
differences between the trained models. See Fig. 7.5 for an example.
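A minimal sketch of the bagging procedure, assuming a hypothetical fit function that maps a dataset to a predictor; the toy model that simply predicts the mean of its training targets is only there to make the example runnable.

import numpy as np

def bagged_ensemble(X, y, fit, k=5, seed=0):
    # Train k copies of the same kind of model, each on a bootstrap resample
    # of the original dataset, and average their predictions at test time.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # sample n examples with replacement
        models.append(fit(X[idx], y[idx]))
    return lambda X_new: np.mean([m(X_new) for m in models], axis=0)

# Toy "model": predict the mean of its (resampled) training targets.
fit_mean = lambda X, y: (lambda X_new: np.full(len(X_new), y.mean()))
X = np.arange(20.0).reshape(-1, 1)
y = X[:, 0] + np.random.default_rng(1).normal(size=20)
predict = bagged_ensemble(X, y, fit_mean)
print(predict(X[:3]))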
Neural networks reach a wide enough variety of solution points that they can
often benefit from model averaging even if all of the models are trained on the same
dataset. Differences in random initialization, random selection of minibatches,
differences in hyperparameters, or different outcomes of non-deterministic imple-
mentations of neural networks are often enough to cause different members of the
ensemble to make partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing
generalization error. Its use is usually discouraged when benchmarking algorithms
for scientific papers, because any machine learning algorithm can benefit substan-
tially from model averaging at the price of increased computation and memory.
For this reason, benchmark comparisons are usually made using a single model.
Machine learning contests are usually won by methods using model averag-
ing over dozens of models. A recent prominent example is the Netflix Grand
Prize (Koren, 2009).
Not all techniques for constructing ensembles are designed to make the ensem-
ble more regularized than the individual models. For example, a technique called
boosting (Freund and Schapire, 1996b,a) constructs an ensemble with higher ca-
pacity than the individual models, and has been applied to boost either whole neu-
ral networks (Schwenk and Bengio, 1998), constructively adding neural networks
to the ensemble, or to boost hidden units (Bengio et al., 2006a), constructively
adding hidden units to a single neural network.
7.11 Dropout
Because deep models with many parameters have the high degree of expressive power necessary to capture complex tasks such as speech or object recognition,
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an '8' detector on the dataset depicted above, containing an '8', a '6', and a '9'. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the '9' and repeats the '8'. On this dataset, the detector learns that a loop on top of the digit corresponds to an '8'. On the second dataset, we repeat the '9' and omit the '6'. In this case, the detector learns that a loop on the bottom of the digit corresponds to an '8'. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the '8' are present.
they are capable of overfitting significantly. While this problem can be solved
by using a very large dataset, large datasets are not always available. Dropout
(Srivastava et al., 2014) provides a computationally inexpensive but powerful
method of regularizing a broad family of models.
Dropout can be thought of as a method of making bagging practical for large
neural networks. Bagging involves training multiple models, and evaluating mul-
tiple models on each test example. This seems impractical when each model is
a large neural network, since training and evaluating such networks is costly in
terms of runtime and memory. Dropout provides an inexpensive approximation
to training and evaluating a bagged ensemble of exponentially many neural net-
works.
Specifically, dropout trains the ensemble consisting of all sub-networks that
can be formed by removing units from an underlying base network. In most mod-
ern neural networks, based on a series of affine transformations and nonlinearities,
we can effectively remove a unit from a network by multiplying its output value
by zero. This procedure requires some slight modification for models such as ra-
dial basis function networks, which take the difference between the unit’s state
and some reference value. Here, we will present the dropout algorithm in terms
of multiplication by zero for simplicity, but it can be trivially modified to work
with other operations that remove a unit from the network.
The training procedure closely resembles bagging, but with two important differences: the sub-networks share parameters, and we do not train each member to convergence. Specifically, each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently of the others, with the probability of including a unit being a hyperparameter fixed before training begins (typically 0.5 for a hidden unit and 0.8 for an input unit). We then run forward propagation, back-propagation and the parameter update as usual, with the masked units removed from the network. In this way each update trains a different randomly sampled sub-network, and because all of the sub-networks share parameters, each one is trained for only a tiny fraction of the total training time. To make a prediction at test time, we would in principle need to combine the predictions of all of the exponentially many sub-networks. In practice this is approximated with the weight scaling inference rule: we run the full network once, but multiply each unit's outgoing weights (or, equivalently, its output) by the probability of including that unit, so that the expected total input a unit receives at test time matches the expected total input it received during training.
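A minimal numpy sketch of a single dropout forward pass through a small two-layer network; the inclusion probability of 0.5, the ReLU nonlinearity and the random weights are illustrative assumptions, not part of the text.

import numpy as np

def dropout_forward(x, W1, b1, W2, b2, include_prob=0.5, rng=None):
    h = np.maximum(0.0, W1 @ x + b1)                 # hidden layer (ReLU)
    if rng is not None:                              # training: sample a binary mask
        mask = rng.random(h.shape) < include_prob
        h = h * mask                                 # remove the masked units
    else:                                            # inference: weight scaling rule
        h = h * include_prob                         # equivalently, scale outgoing weights
    return W2 @ h + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x = rng.normal(size=3)

train_out = dropout_forward(x, W1, b1, W2, b2, rng=rng)   # a random sub-network
test_out = dropout_forward(x, W1, b1, W2, b2, rng=None)   # deterministic approximation
print(train_out, test_out)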
For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with n input variables represented by the vector v:

P(y = y | v) = softmax(W^⊤ v + b)_y.

We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

P(y = y | v; d) = softmax(W^⊤ (d ⊙ v) + b)_y.
The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members' predictions:

P_ensemble(y = y | v) = P̃_ensemble(y = y | v) / Σ_{y'} P̃_ensemble(y = y' | v)   (7.34)

where

P̃_ensemble(y = y | v) = ( Π_{d ∈ {0,1}^n} P(y = y | v; d) )^{1/2^n}.
To see that the weight scaling rule is exact, we can simplify P̃_ensemble:

P̃_ensemble(y = y | v) = ( Π_{d ∈ {0,1}^n} P(y = y | v; d) )^{1/2^n}
 = ( Π_{d ∈ {0,1}^n} softmax(W^⊤(d ⊙ v) + b)_y )^{1/2^n}
 = ( Π_{d ∈ {0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) / Σ_{y'} exp(W_{y',:}^⊤(d ⊙ v) + b_{y'}) )^{1/2^n}
 = ( Π_{d ∈ {0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) )^{1/2^n} / ( Π_{d ∈ {0,1}^n} Σ_{y'} exp(W_{y',:}^⊤(d ⊙ v) + b_{y'}) )^{1/2^n}

Because P̃_ensemble will be re-normalized, we can safely ignore multiplication by factors that are constant with respect to y:

P̃_ensemble(y = y | v) ∝ ( Π_{d ∈ {0,1}^n} exp(W_{y,:}^⊤(d ⊙ v) + b_y) )^{1/2^n}
 = exp( (1/2^n) Σ_{d ∈ {0,1}^n} ( W_{y,:}^⊤(d ⊙ v) + b_y ) )
 = exp( (1/2) W_{y,:}^⊤ v + b_y )

Substituting this back into Eq. 7.34, we obtain a softmax classifier with weights (1/2)W.
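For small n, this equivalence can be verified numerically by enumerating all 2^n masks. The sketch below (with arbitrary randomly generated W, b and v) computes the re-normalized geometric mean of Eq. 7.34 by brute force and compares it with a single forward pass using the weights (1/2)W.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, n_classes = 4, 3
W, b = rng.normal(size=(n, n_classes)), rng.normal(size=n_classes)
v = rng.normal(size=n)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Geometric mean of the predictions of all 2^n sub-models, re-normalized (Eq. 7.34).
log_probs = np.array([np.log(softmax(W.T @ (np.array(d) * v) + b))
                      for d in product([0, 1], repeat=n)])
geo = np.exp(log_probs.mean(axis=0))
p_ensemble = geo / geo.sum()

# Weight scaling rule: a single forward pass with the weights halved.
p_scaled = softmax((0.5 * W).T @ v + b)

print(np.allclose(p_ensemble, p_scaled))   # True: exact for softmax regression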
The weight scaling rule is also exact in other settings, including regression
networks with conditionally normal outputs, and deep networks that have hidden
layers without nonlinearities. However, the weight scaling rule is only an approxi-
mation for deep models that have non-linearities, and this approximation has not
been theoretically characterized. Fortunately, it works well, empirically. Good-
fellow et al. (2013a) found empirically that for deep networks with nonlinearities,
the weight scaling rule can work better (in terms of classification accuracy) than
Monte Carlo approximations to the ensemble predictor, even if the Monte Carlo
approximation is allowed to sample up to 1,000 sub-networks.
Srivastava et al. (2014) showed that dropout is more effective than other stan-
dard computationally inexpensive regularizers, such as weight decay, filter norm
constraints, and sparse activity regularization. Dropout may also be combined
with more expensive forms of regularization such as unsupervised pretraining to
yield an improvement.
One advantage of dropout is that it is very computationally cheap. Using
dropout during training requires only O(n) computation per example per update,
to generate n random binary numbers and multiply them by the state. Depending
on the implementation, it may also require O(n) memory to store these binary
numbers until the backpropagation stage. Running inference in the trained model
has the same cost per-example as if dropout were not used, though we must pay
the cost of dividing the weights by 2 once before beginning to run inference on
examples.
One significant advantage of dropout is that it does not significantly limit the
type of model or training procedure that can be used. It works well with nearly
any model that uses a distributed representation and can be trained with stochas-
tic gradient descent. This includes feedforward neural networks, probabilistic
models such as restricted Boltzmann machines (Srivastava et al., 2014), and re-
current neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). On
the other hand, if one wants to take advantage of the regularization effect of
approaches that try to capture the structure of the input distribution, such as
unsupervised pre-training, it is often necessary to change the architecture of the
model.
Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout
is a regularization technique, it reduces the effective capacity of a model. To
offset this effect, we must increase the size of the model. Typically the optimal
validation set error is much lower when using dropout, but this comes at the cost
of a much larger model and many more iterations of the training algorithm. For
very large datasets, regularization confers little reduction in generalization error.
In these cases, the computational cost of using dropout and larger models may
outweigh the benefit of regularization. As a very rough rule of thumb, the benefit
of dropout often becomes questionable when more than around 15 million labeled
training examples are available.
When extremely few labeled training examples are available, dropout is less
effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alter-
native Splicing Dataset (Xiong et al., 2011) where fewer than 5,000 examples are
available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.
Further analyses of dropout include its interpretation as a form of adaptive regularization (Wager et al., 2013) and the studies of Baldi and Sadowski (2013) and Warde-Farley et al. (2014).
The stochasticity used while training with dropout is not a necessary part
of the model’s success. It is just a means of approximating the sum over all
sub-models. Wang and Manning (2013) derived analytical approximations to
this marginalization. Their approximation, known as fast dropout resulted in
faster convergence time due to the reduced stochasticity in the computation of
the gradient. This method can also be applied at test time, as a more principled
(but also more computationally expensive) approximation to the average over
all sub-networks than the weight scaling approximation. Fast dropout has been
used to match the performance of standard dropout on small neural network
problems, but has not yet yielded a significant improvement or been applied to a
large problem.
Dropout has inspired other stochastic approaches to training exponentially
large ensembles of models that share weights. DropConnect is a special case of
dropout where each product between a single scalar weight and a single hidden
unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic
pooling is a form of randomized pooling (see Sec. 9.3) for building ensembles
of convolutional networks with each convolutional network attending to different
spatial locations of each feature map. So far, dropout remains the most widely
used implicit ensemble method.
7.12 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling
the examples (which can be seen as soft constraints imposed on the parameters)
arising out of several tasks. In the same way that additional training examples
put more pressure on the parameters of the model towards values that generalize
well, when part of a model is shared across tasks, that part of the model is more
constrained towards good values (assuming the sharing is justified), often yielding
better generalization.
Figure 7.6 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting y_i given x) share the same input x, as well as some intermediate-level representation h_shared capturing a common pool of factors.
Figure 7.6: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from h_1 and h_2 in the figure) can be learned on top of those, yielding a shared representation h_shared. The underlying assumption is that there exists a common pool of factors that explain the variations in the input x, while each task is associated with a subset of these factors. In the figure, it is additionally assumed that top-level hidden units h_1 and h_2 are specialized to each task (respectively predicting y_1 and y_2) while some intermediate-level representation h_shared is shared across all tasks. Note that in the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h_3): these are the factors that explain some of the input variations but are not relevant for predicting y_1 or y_2.
The model can generally be divided into two kinds of parts and associated
parameters:
1. Task-specific parameters (which only benefit from the examples of their task
to achieve good generalization). Example: upper layers of a neural network,
in Figure 7.6.
2. Generic parameters, shared across all the tasks (which benefit from the
pooled data of all the tasks). Example: lower layers of a neural network, in
Figure 7.6.
Improved generalization and generalization error bounds (Baxter, 1995) can be
achieved because of the shared parameters, for which statistical strength can be
greatly improved (in proportion with the increased number of examples for the
shared parameters, compared to the scenario of single-task models). Of course this
will happen only if some assumptions about the statistical relationship between
the different tasks are valid, i.e., that there is something shared across some of
the tasks.
From the point of view of deep learning, the underlying prior regarding the
data is the following: among the factors that explain the variations observed in
the data associated with the different tasks, some are shared across two or more
tasks.
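A minimal numpy sketch of the architecture in Figure 7.6: a shared lower layer produces h_shared, and two task-specific heads produce the predictions for y_1 and y_2. The layer sizes and the tanh nonlinearity are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Generic (shared) parameters: the lower layer producing h_shared.
W_shared = rng.normal(size=(8, 5)) * 0.1
# Task-specific parameters: one head per task on top of h_shared.
W_task1 = rng.normal(size=(1, 8)) * 0.1
W_task2 = rng.normal(size=(1, 8)) * 0.1

def forward(x):
    h_shared = np.tanh(W_shared @ x)        # representation shared by both tasks
    y1_hat = W_task1 @ h_shared             # prediction for task 1
    y2_hat = W_task2 @ h_shared             # prediction for task 2
    return y1_hat, y2_hat

# During training, gradients from both task losses flow into W_shared, so the
# shared parameters see the pooled examples of all tasks, while each task head
# is updated only by its own task's examples.
x = rng.normal(size=5)
print(forward(x))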
7.13 Adversarial Training
In many cases, neural networks have begun to reach human performance when
evaluated on an i.i.d. test set. It is natural therefore to wonder whether these
models have obtained a true human-level understanding of these tasks. In order
to probe the level of understanding a network has of the underlying task, we
can search for examples that the model misclassifies. Szegedy et al. (2014b)
found that even neural networks that perform at human level accuracy have a
nearly 100% error rate on examples that are intentionally constructed by using
an optimization procedure to search for an input x′ near a data point x such that the model output is very different at x′. In many cases, x′ can be so similar to x that a human observer cannot tell the difference between the original example and the adversarial example, but the network makes highly different predictions. See Fig. 7.7 for an example.
Adversarial examples have many implications, for example, in computer secu-
rity, that are beyond the scope of this chapter. However, they are interesting in
the context of regularization because one can reduce the error rate on the original
i.i.d. test set by training on adversarially perturbed examples from the training
set (Szegedy et al., 2014b).
[Figure 7.7 panels: x, classified as "panda" with 57.7% confidence; + 0.007 × sign(∇_x J(θ, x, y)), classified as "nematode" with 8.2% confidence; = x + ε sign(∇_x J(θ, x, y)), classified as "gibbon" with 99.3% confidence.]
Figure 7.7: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b).
Goodfellow et al. (2014b) showed that one of the primary causes of these ad-
versarial examples is excessive linearity. Neural networks are built out of primarily
linear building blocks, and in some empirical experiments the overall function they
implement proves to be highly linear as a result. These linear functions are easy
to optimize. Unfortunately, the value of a linear function can change very rapidly
if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖_1, which can be a very large amount
if w is high-dimensional. Adversarial training discourages this highly sensitive
locally linear behavior by encouraging the network to be locally constant in the
neighborhood of the training data. This can be seen as a way of introducing
explicitly the local smoothness prior into supervised neural nets.
This phenomenon helps to illustrate the power of using a large function family
in combination with aggressive regularization. Purely linear models, like logistic
regression, are not able to resist adversarial examples because they are forced to
be linear. Neural networks are able to represent functions that can range from
nearly linear to nearly locally constant and thus have the flexibility to capture
linear trends in the training data while still learning to resist local perturbation.
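A minimal sketch of the fast gradient sign perturbation shown in Fig. 7.7, applied to a toy logistic regression model rather than a deep network; the closed-form gradient of the cross-entropy loss with respect to the input makes the construction explicit. The weights, inputs and ε = 0.007 are illustrative. In adversarial training, such perturbed examples (paired with their original labels) are added to the training set.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps=0.007):
    # Move each input coordinate by eps in the direction that increases the loss.
    p = sigmoid(w @ x + b)            # predicted probability of class 1
    grad_x = (p - y) * w              # d(cross-entropy loss)/dx for logistic regression
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=100), 0.0
x, y = rng.normal(size=100), 1.0

x_adv = fgsm_perturb(x, y, w, b)
# Probability assigned to the true class before and after the perturbation;
# the perturbation lowers it even though each input changed by only 0.007.
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))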