Chapter 8
Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts. For example,
performing inference in models such as PCA involves solving an optimization
problem. We often use analytical optimization to write proofs or design algorithms.
Of all of the many optimization problems involved in deep learning, the most
difficult is neural network training. It is quite common to invest days to months of
time on hundreds of machines in order to solve even a single instance of the neural
network training problem. Because this problem is so important and so expensive,
a specialized set of optimization techniques has been developed for solving it.
This chapter presents these optimization techniques for neural network training.
If you are unfamiliar with the basic principles of gradient-based optimization,
we suggest reviewing Chapter 4. That chapter includes a brief overview of numerical
optimization in general.
This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.
We begin with a description of how optimization used as a training algorithm
for a machine learning task differs from pure optimization. Next, we present several
of the concrete challenges that make optimization of neural networks difficult. We
then define several practical algorithms, including both optimization algorithms
themselves and strategies for initializing the parameters. More advanced algorithms
adapt their learning rates during training or leverage information contained in
the second derivatives of the cost function. Finally, we conclude with a review of
several optimization strategies that are formed by combining simple optimization
algorithms into higher-level procedures.
8.1 How Learning Differs from Pure Optimization
Optimization algorithms used for training of deep models differ from traditional
optimization algorithms in several ways. Machine learning usually acts indirectly.
In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly. We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.
Typically, the cost function can be written as an average over the training set,
such as
J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y),    (8.1)
where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. However, it is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, in order to develop various forms of regularization or unsupervised learning.
Eq. 8.1 defines an objective function with respect to the training set. We
would usually prefer to minimize the corresponding objective function where the
expectation is taken across the data generating distribution p_data rather than just over the finite training set:

J(θ) = E_{(x,y)∼p_data} L(f(x; θ), y).    (8.2)
8.1.1 Empirical Risk Minimization
The goal of a machine learning algorithm is to reduce the expected generalization
error given by Eq. 8.2. This quantity is known as the risk. We emphasize here that
the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task
solvable by an optimization algorithm. However, when we do not know p_data(x, y)
but only have a training set of samples, we have a machine learning problem.
The simplest way to convert a machine learning problem back into an op-
timization problem is to minimize the expected loss on the training set. This
means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk
E_{x,y∼p̂_data(x,y)} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)),    (8.3)

where m is the number of training examples.
The training process based on minimizing this average training error is known
as empirical risk minimization. In this setting, machine learning is still very similar
to straightforward optimization. Rather than optimizing the risk directly, we
optimize the empirical risk, and hope that the risk decreases significantly as well.
A variety of theoretical results establish conditions under which the true risk can
be expected to decrease by various amounts.
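To make Eq. 8.3 concrete, the following short sketch computes the empirical risk for a toy problem; the linear model and squared-error loss used here are illustrative assumptions rather than choices made in the text.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average per-example loss over the training set (Eq. 8.3).

    Here the model f(x; theta) = x @ theta is a linear predictor and L is
    squared error; both are illustrative choices, not the only option.
    """
    predictions = X @ theta          # f(x^(i); theta) for every example
    losses = (predictions - y) ** 2  # per-example loss L
    return losses.mean()             # (1/m) sum_i L(f(x^(i); theta), y^(i))

# Toy usage: m = 100 examples with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y), empirical_risk(theta_true, X, y))
```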
However, empirical risk minimization is prone to overfitting. Models with
high capacity can simply memorize the training set. In many cases, empirical
risk minimization is not really feasible. The most effective modern optimization
algorithms are based on gradient descent, but many useful loss functions, such
as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined
everywhere). These two problems mean that, in the context of deep learning, we
rarely use empirical risk minimization. Instead, we must use a slightly different
approach, in which the quantity that we actually optimize is even more different
from the quantity that we truly want to optimize.
8.1.2 Surrogate Loss Functions and Early Stopping
Sometimes, the loss function we actually care about (say classification error) is not
one that can be optimized efficiently. For example, exactly minimizing expected 0-1
loss is typically intractable (exponential in the input dimension), even for a linear
classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes
a surrogate loss function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used as a
surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate
the conditional probability of the classes, given the input, and if the model can
do that well, then it can pick the classes that yield the least classification error in
expectation.
In some cases, a surrogate loss function actually results in being able to learn
more. For example, the test set 0-1 loss often continues to decrease for a long
time after the training set 0-1 loss has reached zero, when training using the
log-likelihood surrogate. This is because even when the expected 0-1 loss is zero,
one can improve the robustness of the classifier by further pushing the classes apart
from each other, obtaining a more confident and reliable classifier, thus extracting
more information from the training data than would have been possible by simply
minimizing the average 0-1 loss on the training set.
A very important difference between optimization in general and optimization
as we use it for training algorithms is that training algorithms do not usually halt
at a local minimum. Instead, a machine learning algorithm usually minimizes
a surrogate loss function but halts when a convergence criterion based on early
stopping (Sec. 7.8) is satisfied. Typically the early stopping criterion is based on
the true underlying loss function, such as 0-1 loss measured on a validation set,
and is designed to cause the algorithm to halt whenever overfitting begins to occur.
Training often halts while the surrogate loss function still has large derivatives,
which is very different from the pure optimization setting, where an optimization
algorithm is considered to have converged when the gradient becomes very small.
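The following sketch shows one way such an early stopping criterion might be wired into a training loop; the model, train_step, and validation_error callables and the patience rule are illustrative assumptions, not a prescribed interface.

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              max_steps=10000, patience=10):
    """Minimal early-stopping loop (a sketch; `model`, `train_step`, and
    `validation_error` are assumed to be supplied by the caller).

    The surrogate loss is minimized by `train_step`, but halting is based on
    the true underlying loss (e.g. 0-1 error) measured on a validation set.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    steps_without_improvement = 0
    for step in range(max_steps):
        train_step(model)                   # one update on the surrogate loss
        error = validation_error(model)     # true loss on held-out data
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
        if steps_without_improvement >= patience:
            break                           # overfitting appears to have begun
    return best_model, best_error
```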
8.1.3 Batch and Minibatch Algorithms
One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a sum
over the training examples. Optimization algorithms for machine learning typically
compute each update to the parameters based on an expected value of the cost
function estimated using only a subset of the terms of the full cost function.
For example, maximum likelihood estimation problems, when viewed in log
space, decompose into a sum over each example:
θ_ML = arg max_θ Σ_{i=1}^m log p_model(x^(i), y^(i); θ).    (8.4)
Maximizing this sum is equivalent to maximizing the expectation over the
empirical distribution defined by the training set:
J(θ) = E_{x,y∼p̂_data} log p_model(x, y; θ).    (8.5)
Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the
most commonly used property is the gradient:
∇_θ J(θ) = E_{x,y∼p̂_data} ∇_θ log p_model(x, y; θ).    (8.6)
Computing this expectation exactly is very expensive because it requires
evaluating the model on every example in the entire dataset. In practice, we can
compute these expectations by randomly sampling a small number of examples
from the dataset, then taking the average over only those examples.
Recall that the standard error of the mean (Eq. 5.30) estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples. The denominator of √n shows that there are less than linear returns to using
more examples to estimate the gradient. Compare two hypothetical estimates of
the gradient, one based on 100 examples and another based on 10,000 examples.
The latter requires 100 times more computation than the former, but reduces the
standard error of the mean only by a factor of 10. Most optimization algorithms
converge much faster (in terms of total computation, not in terms of number of
updates) if they are allowed to rapidly compute approximate estimates of the
gradient rather than slowly computing the exact gradient.
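The sub-linear return can be checked numerically; in this sketch the "gradient values" are simulated as draws from a normal distribution, which is purely an illustrative assumption.

```python
import numpy as np

# Standard error of a mean estimated from n samples is sigma / sqrt(n)
# (Eq. 5.30): 100x more samples only shrink the error by a factor of 10.
rng = np.random.default_rng(0)
population = rng.normal(loc=2.0, scale=5.0, size=1_000_000)  # toy "gradient" values
sigma = population.std()

for n in (100, 10_000):
    means = [rng.choice(population, size=n).mean() for _ in range(1000)]
    print(n, np.std(means), sigma / np.sqrt(n))  # empirical vs theoretical standard error
```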
Another consideration motivating statistical estimation of the gradient from a
small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we
are unlikely to truly encounter this worst-case situation, but we may find large
numbers of examples that all make very similar contributions to the gradient.
Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch. This terminology can be somewhat confusing
because the word “batch” is also often used to describe the minibatch used by
minibatch stochastic gradient descent. Typically the term “batch gradient descent”
implies the use of the full training set, while the use of the term “batch” to describe
a group of examples does not. For example, it is very common to use the term
“batch size” to describe the size of a minibatch.
Optimization algorithms that use only a single example at a time are sometimes
called stochastic or sometimes online methods. The term online is usually reserved
for the case where the examples are drawn from a stream of continually created
examples rather than from a fixed-size training set over which several passes are
made.
Most algorithms used for deep learning fall somewhere in between, using more
than one but less than all of the training examples. These were traditionally called
minibatch or minibatch stochastic methods and it is now common to simply call
them stochastic methods.
The canonical example of a stochastic method is stochastic gradient descent,
presented in detail in Sec. 8.3.1.
Minibatch sizes are generally driven by the following factors:
• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.

• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
Different kinds of algorithms use different kinds of information from the mini-
batch in different ways. Some algorithms are more sensitive to sampling error than
others, either because they use information that is difficult to estimate accurately
with few samples, or because they use information in ways that amplify sampling
errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes like 100. Second-order methods, which also use the Hessian matrix H and compute updates such as H⁻¹g, typically require much larger batch sizes like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H⁻¹g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by
H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H⁻¹g, even if H were estimated perfectly. Of course, H will be estimated only approximately, so the update H⁻¹g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.
It is also crucial that the minibatches be selected randomly. Computing an
unbiased estimate of the expected gradient from a set of samples requires that those
samples be independent. We also wish for two subsequent gradient estimates to be
independent from each other, so two subsequent minibatches of examples should
also be independent from each other. Many datasets are most naturally arranged
in a way where successive examples are highly correlated. For example, we might
have a dataset of medical data with a long list of blood sample test results. This
list might be arranged so that first we have five blood samples taken at different
times from the first patient, then we have three blood samples taken from the
second patient, then the blood samples from the third patient, and so on. If we
were to draw examples in order from this list, then each of our minibatches would
be extremely biased, because it would represent primarily one patient out of the
many patients in the dataset. In cases such as these where the order of the dataset
holds some significance, it is necessary to shuffle the examples before selecting
minibatches. For very large datasets, for example datasets containing billions of
examples in a data center, it can be impractical to sample examples truly uniformly
at random every time we want to construct a minibatch. Fortunately, in practice
it is usually sufficient to shuffle the order of the dataset once and then store it in
shuffled fashion. This will impose a fixed set of possible minibatches of consecutive
examples that all models trained thereafter will use, and each individual model
will be forced to reuse this ordering every time it passes through the training
data. However, this deviation from true random selection does not seem to have a
significant detrimental effect. Failing to ever shuffle the examples in any way can
seriously reduce the effectiveness of the algorithm.
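A minimal sketch of this common practice, shuffling once and then drawing consecutive slices, might look as follows; the array-based dataset is an illustrative assumption.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield minibatches after a single shuffle of the dataset.

    A sketch of the practice described above: shuffle once, then take
    consecutive slices, rather than sampling uniformly at random each time.
    """
    order = rng.permutation(len(X))      # break any ordering in the raw data
    X, y = X[order], y[order]
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2).astype(float)
y = np.arange(10).astype(float)
for xb, yb in minibatches(X, y, batch_size=4, rng=rng):
    print(xb.shape, yb)
```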
Many optimization problems in machine learning decompose over examples
well enough that we can compute entire separate updates over different examples
in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for
several other minibatches. Such asynchronous parallel distributed approaches are
discussed further in Sec. 12.1.3.
An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (Eq. 8.2) so long as no examples are repeated. Most implementations of minibatch stochastic gradient
descent shuffle the dataset once and then pass through it multiple times. On the
first pass, each minibatch is used to compute an unbiased estimate of the true
generalization error. On the second pass, the estimate becomes biased because it is
formed by re-sampling values that have already been used, rather than obtaining
new fair samples from the data generating distribution.
The fact that stochastic gradient descent minimizes generalization error is
easiest to see in the online learning case, where examples or minibatches are drawn
from a stream of data. In other words, instead of receiving a fixed-size training
set, the learner is similar to a living being who sees a new example at each instant,
with every example (x, y) coming from the data generating distribution p_data(x, y). In this scenario, examples are never repeated; every experience is a fair sample from p_data.
The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (Eq. 8.2) can be written as a sum

J(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),    (8.7)
with the exact gradient

g = ∇_θ J(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).    (8.8)
We have already seen the same fact demonstrated for the log-likelihood in Eq. 8.5 and Eq. 8.6; we observe now that this holds for other functions L besides the likelihood. A similar result can be derived when x and y are continuous, under mild assumptions regarding p_data and L.
Hence, we can obtain an unbiased estimator of the exact gradient of the generalization error by sampling a minibatch of examples {x^(1), . . . , x^(m)} with corresponding targets y^(i) from the data generating distribution p_data, and computing the gradient of the loss with respect to the parameters for that minibatch:

ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)).    (8.9)

Updating θ in the direction of ĝ performs SGD on the generalization error.
Of course, this interpretation only applies when examples are not reused.
Nonetheless, it is usually best to make several passes through the training set,
unless the training set is extremely large. When multiple such epochs are used,
only the first epoch follows the unbiased gradient of the generalization error, but
of course, the additional epochs usually provide enough benefit due to decreased
training error to offset the harm they cause by increasing the gap between training
error and test error.
With some datasets growing rapidly in size, faster than computing power, it
is becoming more common for machine learning applications to use each training
example only once or even to make an incomplete pass through the training
set. When using an extremely large training set, overfitting is not an issue, so
underfitting and computational efficiency become the predominant concerns. See
also Bottou and Bousquet (2008) for a discussion of the effect of computational
bottlenecks on generalization error, as the number of training examples grows.
8.2 Challenges in Neural Network Optimization
Optimization in general is an extremely difficult task. Traditionally, machine
learning has avoided the difficulty of general optimization by carefully designing
the objective function and constraints to ensure that the optimization problem is
convex. When training neural networks, we must confront the general non-convex
case. Even convex optimization is not without its complications. In this section,
we summarize several of the most prominent challenges involved in optimization
for training deep models.
8.2.1 Ill-Conditioning
Some challenges arise even when optimizing convex functions. Of these, the most
prominent is ill-conditioning of the Hessian matrix H. This is a very general
problem in most numerical optimization, convex or otherwise, and is described in
more detail in Sec. 4.3.1.
The ill-conditioning problem is generally believed to be present in neural
network training problems. Ill-conditioning can manifest by causing SGD to get
“stuck” in the sense that even very small steps increase the cost function.
Recall from Eq. 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg will add

(1/2) ε² g⊤Hg − ε g⊤g    (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when (1/2) ε² g⊤Hg exceeds ε g⊤g. To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm g⊤g and the
[Figure 8.1 plots gradient norm (left) and classification error rate (right) against training time in epochs.]
Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In
this example, the gradient norm increases throughout training of a convolutional network
used for object detection. (Left) A scatterplot showing how the norms of individual
gradient evaluations are distributed over time. To improve legibility, only one gradient
norm is plotted per epoch. The running average of all gradient norms is plotted as a solid
curve. The gradient norm clearly increases over time, rather than decreasing as we would
expect if the training process converged to a critical point. (Right) Despite the increasing
gradient, the training process is reasonably successful. The validation set classification
error decreases to a low level.
g⊤Hg term. In many cases, the gradient norm does not shrink significantly throughout learning, but the g⊤Hg term grows by more than an order of magnitude.
The result is that learning becomes very slow despite the presence of a strong
gradient because the learning rate must be shrunk to compensate for even stronger
curvature. Fig. 8.1 shows an example of the gradient increasing significantly during
the successful training of a neural network.
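On a toy quadratic cost with an explicitly chosen ill-conditioned Hessian (an illustrative assumption; in a real network these quantities would be estimated, for example with Hessian-vector products), one can monitor both terms as follows.

```python
import numpy as np

# Toy ill-conditioned quadratic J(theta) = 0.5 * theta^T H theta, with gradient
# g = H theta. We monitor g^T g and g^T H g during gradient descent; the
# curvature term can dominate even while the gradient norm stays large.
H = np.diag([1.0, 100.0])                 # condition number 100 (an assumption)
theta = np.array([1.0, 1.0])
epsilon = 0.009                           # close to the stability limit 2/100

for step in range(5):
    g = H @ theta
    gTg = g @ g
    gTHg = g @ H @ g
    # Predicted change in cost for a step of -epsilon * g (cf. Eq. 8.10):
    predicted = 0.5 * epsilon**2 * gTHg - epsilon * gTg
    print(step, gTg, gTHg, predicted)
    theta = theta - epsilon * g
```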
Though ill-conditioning is present in other settings besides neural network
training, some of the techniques used to combat it in other contexts are less
applicable to neural networks. For example, Newton’s method is an excellent tool
for minimizing convex functions with poorly conditioned Hessian matrices, but in
the subsequent sections we will argue that Newton’s method requires significant
modification before it can be applied to neural networks.
8.2.2 Local Minima
One of the most prominent features of a convex optimization problem is that it
can be reduced to the problem of finding a local minimum. Any local minimum is
guaranteed to be a global minimum. Some convex functions have a flat region at
the bottom rather than a single global minimum point, but any point within such
a flat region is an acceptable solution. When optimizing a convex function, we
know that we have reached a good solution if we find a critical point of any kind.
With non-convex functions, such as neural nets, it is possible to have many
local minima. Indeed, nearly any deep model is essentially guaranteed to have
an extremely large number of local minima. However, as we will see, this is not
necessarily a major problem.
Neural networks and any models with multiple equivalently parametrized latent
variables all have multiple local minima because of the model identifiability problem.
A model is said to be identifiable if a sufficiently large training set can rule out all
but one setting of the model’s parameters. Models with latent variables are often
not identifiable because we can obtain equivalent models by exchanging latent
variables with each other. For example, we could take a neural network and modify
layer 1 by swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units each, then there are n!^m ways of arranging the hidden units. This kind of non-identifiability is known as weight space symmetry.
In addition to weight space symmetry, many kinds of neural networks have
additional causes of non-identifiability. For example, in any rectified linear or
maxout network, we can scale all of the incoming weights and biases of a unit by α if we also scale all of its outgoing weights by 1/α. This means that—if the cost function does not include terms such as weight decay that depend directly on the weights rather than the models’ outputs—every local minimum of a rectified linear or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
These model identifiability issues mean that there can be an extremely large or even uncountably infinite number of local minima in a neural network cost function. However, all of these local minima arising from non-identifiability are
equivalent to each other in cost function value. As a result, these local minima are
not a problematic form of non-convexity.
Local minima can be problematic if they have high cost in comparison to the
global minimum. One can construct small neural networks, even without hidden
units, that have local minima with higher cost than the global minimum (Sontag
and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima
with high cost are common, this could pose a serious problem for gradient-based
optimization algorithms.
It remains an open question whether there are many local minima of high cost
for networks of practical interest and whether optimization algorithms encounter
them. For many years, most practitioners believed that local minima were a
common problem plaguing neural network optimization. Today, that does not
appear to be the case. The problem remains an active area of research, but experts
now suspect that, for sufficiently large neural networks, most local minima have a
low cost function value, and that it is not important to find a true global minimum
rather than to find a point in parameter space that has low but not minimal cost
(Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska
et al., 2014).
Many practitioners attribute nearly all difficulty with neural network optimiza-
tion to local minima. We encourage practitioners to carefully test for specific
problems. A test that can rule out local minima as the problem is to plot the
norm of the gradient over time. If the norm of the gradient does not shrink to
insignificant size, the problem is neither local minima nor any other kind of critical
point. This kind of negative test can rule out local minima. In high dimensional
spaces, it can be very difficult to positively establish that local minima are the
problem. Many structures other than local minima also have small gradients.
8.2.3 Plateaus, Saddle Points and Other Flat Regions
For many high-dimensional non-convex functions, local minima (and maxima)
are in fact rare compared to another kind of point with zero gradient: a saddle
point. Some points around a saddle point have greater cost than the saddle point,
while others have a lower cost. At a saddle point, the Hessian matrix has both
positive and negative eigenvalues. Points lying along eigenvectors associated with
positive eigenvalues have greater cost than the saddle point, while points lying
along negative eigenvalues have lower value. We can think of a saddle point as
being a local minimum along one cross-section of the cost function and a local
maximum along another cross-section. See Fig. 4.5 for an illustration.
Many classes of random functions exhibit the following behavior: in low-
dimensional spaces, local minima are common. In higher dimensional spaces, local
minima are rare and saddle points are more common. For a function f: ℝⁿ → ℝ of this type, the expected ratio of the number of saddle points to local minima grows exponentially with n. To understand the intuition behind this behavior, observe that the Hessian matrix at a local minimum has only positive eigenvalues. The Hessian matrix at a saddle point has a mixture of positive and negative eigenvalues. Imagine that the sign of each eigenvalue is generated by flipping a coin. In a single dimension, it is easy to obtain a local minimum by tossing a coin and getting heads once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will
be heads. See Dauphin et al. (2014) for a review of the relevant theoretical work.
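The coin-flipping intuition is easy to simulate; the independence of eigenvalue signs assumed below is part of the analogy, not a property of real Hessians.

```python
import numpy as np

# If the sign of each Hessian eigenvalue at a random critical point were an
# independent coin flip, the chance that all n are positive (a local minimum)
# would shrink as 2^-n, so saddle points dominate in high dimension.
rng = np.random.default_rng(0)
trials = 100_000
for n in (1, 2, 5, 10, 20):
    signs = rng.integers(0, 2, size=(trials, n))      # 1 = positive eigenvalue
    fraction_minima = np.mean(signs.all(axis=1))
    print(n, fraction_minima, 2.0 ** -n)
```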
An amazing property of many random functions is that the eigenvalues of the
Hessian become more likely to be positive as we reach regions of lower cost. In
our coin tossing analogy, this means we are more likely to have our coin come up heads n times if we are at a critical point with low cost. This means that local
minima are much more likely to have low cost than high cost. Critical points with
high cost are far more likely to be saddle points. Critical points with extremely
high cost are more likely to be local maxima.
This happens for many classes of random functions. Does it happen for neural
networks? Baldi and Hornik (1989) showed theoretically that shallow autoencoders
(feedforward networks trained to copy their input to their output, described in
Chapter 14) with no nonlinearities have global minima and saddle points but no
local minima with higher cost than the global minimum. They observed without
proof that these results extend to deeper networks without nonlinearities. The
output of such networks is a linear function of their input, but they are useful
to study as a model of nonlinear neural networks because their loss function is
a non-convex function of their parameters. Such networks are essentially just
multiple matrices composed together. Saxe et al. (2013) provided exact solutions
to the complete learning dynamics in such networks and showed that learning in
these models captures many of the qualitative features observed in the training of
deep models with nonlinear activation functions. Dauphin et al. (2014) showed
experimentally that real neural networks also have loss functions that contain very
many high-cost saddle points. Choromanska et al. (2014) provided additional
theoretical arguments, showing that another class of high-dimensional random
functions related to neural networks does so as well.
What are the implications of the proliferation of saddle points for training algo-
rithms? For first-order optimization algorithms that use only gradient information,
the situation is unclear. The gradient can often become very small near a saddle
point. On the other hand, gradient descent empirically seems to be able to escape
saddle points in many cases. Goodfellow et al. (2015) provided visualizations of
several learning trajectories of state-of-the-art neural networks, with an example
given in Fig. 8.2. These visualizations show a flattening of the cost function near
a prominent saddle point where the weights are all zero, but they also show the
gradient descent trajectory rapidly escaping this region. Goodfellow et al. (2015)
also argue that continuous-time gradient descent may be shown analytically to be
repelled from, rather than attracted to, a nearby saddle point, but the situation
may be different for more realistic uses of gradient descent.
For Newton’s method, it is clear that saddle points constitute a problem.
[Figure 8.2 plots J(θ) as a surface over two projections of θ.]
Figure 8.2: A visualization of the cost function of a neural network. Image adapted
with permission from Goodfellow et al. (2015). These visualizations appear similar for
feedforward neural networks, convolutional networks, and recurrent networks applied
to real object recognition and natural language processing tasks. Surprisingly, these
visualizations usually do not show many conspicuous obstacles. Prior to the success of
stochastic gradient descent for training very large models beginning in roughly 2012,
neural net cost function surfaces were generally believed to have much more non-convex
structure than is revealed by these projections. The primary obstacle revealed by this
projection is a saddle point of high cost near where the parameters are initialized, but, as
indicated by the blue path, the SGD training trajectory escapes this saddle point readily.
Most of training time is spent traversing the relatively flat valley of the cost function,
which may be due to high noise in the gradient, poor conditioning of the Hessian matrix
in this region, or simply the need to circumnavigate the tall “mountain” visible in the
figure via an indirect arcing path.
Gradient descent is designed to move “downhill” and is not explicitly designed
to seek a critical point. Newton’s method, however, is designed to solve for a
point where the gradient is zero. Without appropriate modification, it can jump
to a saddle point. The proliferation of saddle points in high dimensional spaces
presumably explains why second-order methods have not succeeded in replacing
gradient descent for neural network training. Dauphin et al. (2014) introduced
a saddle-free Newton method for second-order optimization and showed that it
improves significantly over the traditional version. Second-order methods remain
difficult to scale to large neural networks, but this saddle-free approach holds
promise if it could be scaled.
There are other kinds of points with zero gradient besides minima and saddle
points. There are also maxima, which are much like saddle points from the
perspective of optimization—many algorithms are not attracted to them, but
unmodified Newton’s method is. Maxima become exponentially rare in high
dimensional space, just like minima do.
There may also be wide, flat regions of constant value. In these locations, the
gradient and also the Hessian are all zero. Such degenerate locations pose major
problems for all numerical optimization algorithms. In a convex problem, a wide,
flat region must consist entirely of global minima, but in a general optimization
problem, such a region could correspond to a high value of the objective function.
8.2.4 Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling
cliffs, as illustrated in Fig. 8.3. These result from the multiplication of several large
weights together. On the face of an extremely steep cliff structure, the gradient
update step can move the parameters extremely far, usually jumping off of the
cliff structure altogether.
[Figure 8.3 plots J(w; b) as a surface over the parameters w and b.]
Figure 8.3: The objective function for highly nonlinear deep neural networks or for
recurrent neural networks often contains sharp nonlinearities in parameter space resulting
from the multiplication of several parameters. These nonlinearities give rise to very
high derivatives in some places. When the parameters get close to such a cliff region, a
gradient descent update can catapult the parameters very far, possibly losing most of the
optimization work that had been done. Figure adapted with permission from Pascanu
et al. (2013a).
The cliff can be dangerous whether we approach it from above or from below,
but fortunately its most serious consequences can be avoided using the gradient
clipping heuristic described in Sec. 10.11.1. The basic idea is to recall that the
gradient does not specify the optimal step size, but only the optimal direction
within an infinitesimal region. When the traditional gradient descent algorithm
proposes to make a very large step, the gradient clipping heuristic intervenes to
reduce the step size to be small enough that it is less likely to go outside the region
where the gradient indicates the direction of approximately steepest descent. Cliff
structures are most common in the cost functions for recurrent neural networks,
because such models involve a multiplication of many factors, with one factor
for each time step. Long temporal sequences thus incur an extreme amount of
multiplication.
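A minimal sketch of norm-based clipping, one common form of this heuristic, is shown below; the threshold value is an arbitrary illustrative choice.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Clip the gradient by norm (one common form of the heuristic in
    Sec. 10.11.1): keep the direction, shrink the step if it is too large."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([100.0, -400.0])           # a huge gradient from a cliff region
print(clip_gradient(g, threshold=5.0))  # same direction, norm capped at 5
```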
8.2.5 Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises
when the computational graph becomes extremely deep. Feedforward networks
with many layers have such deep computational graphs. So do recurrent networks,
described in Chapter 10, which construct very deep computational graphs by
repeatedly applying the same operation at each time step of a long temporal
sequence. Repeated application of the same parameters gives rise to especially
pronounced difficulties.
For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ)V⁻¹. In this simple case, it is straightforward to see that

W^t = (V diag(λ)V⁻¹)^t = V diag(λ)^t V⁻¹.    (8.11)
Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)^t. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.
The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view it is not surprising that x⊤W^t will eventually discard all components of x that are orthogonal to the principal eigenvector of W.
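This behavior is easy to verify numerically; the particular eigenvalues chosen below are illustrative assumptions.

```python
import numpy as np

# Repeatedly multiplying by W scales each eigen-direction by lambda_i^t, so
# components explode or vanish and x^T W^t aligns with the principal eigenvector.
rng = np.random.default_rng(0)
V = np.linalg.qr(rng.normal(size=(3, 3)))[0]          # orthogonal eigenvectors
lam = np.array([1.1, 0.9, 0.5])                       # assumed eigenvalues
W = V @ np.diag(lam) @ V.T

x = rng.normal(size=3)
for t in (1, 10, 50):
    xt = x @ np.linalg.matrix_power(W, t)
    cosine = abs(xt @ V[:, 0]) / np.linalg.norm(xt)   # alignment with top eigenvector
    print(t, np.linalg.norm(xt), cosine)
```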
Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).
We defer a further discussion of the challenges of training recurrent networks
until Sec. 10.7, after recurrent networks have been described in more detail.
8.2.6 Inexact Gradients
Most optimization algorithms are primarily motivated by the case where we have
exact knowledge of the gradient or Hessian matrix. In practice, we usually only
have a noisy or even biased estimate of these quantities. Nearly every deep learning
algorithm relies on sampling-based estimates at least insofar as using a minibatch
of training examples to compute the gradient.
In other cases, the objective function we want to minimize is actually intractable.
When the objective function is intractable, typically its gradient is intractable as
well. In such cases we can only approximate the gradient. These issues mostly arise
with the more advanced models in Part III. For example, contrastive divergence
gives a technique for approximating the gradient of the intractable log-likelihood
of a Boltzmann machine.
Various neural network optimization algorithms are designed to account for
imperfections in the gradient estimate. One can also avoid the problem by choosing
a surrogate loss function that is easier to approximate than the true loss.
8.2.7 Poor Correspondence between Local and Global Structure
Many of the problems we have discussed so far correspond to properties of the
loss function at a single point—it can be difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient.
It is possible to overcome all of these problems at a single point and still
perform poorly if the direction that results in the most improvement locally does
not point toward distant regions of much lower cost.
Goodfellow et al. (2015) argue that much of the runtime of training is due to
the length of the trajectory needed to arrive at the solution. Fig. 8.2 shows that
the learning trajectory spends most of its time tracing out a wide arc around a
mountain-shaped structure.
Much of research into the difficulties of optimization has focused on whether
training arrives at a global minimum, a local minimum, or a saddle point, but in
practice neural networks do not arrive at a critical point of any kind. Fig. 8.1
shows that neural networks often do not arrive at a region of small gradient. Indeed,
such critical points do not even necessarily exist. For example, the loss function
− log p(y | x; θ) can lack a global minimum point and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero. Likewise, a model of real values p(y | x) = N(y; f(θ), β⁻¹) can have negative log-likelihood that asymptotes to negative infinity—if f(θ) is able to correctly predict the value of all training set y targets, the learning algorithm will increase β
without bound. See Fig. 8.4 for an example of a failure of local optimization to
find a good cost function value even in the absence of any local minima or saddle
points.
Future research will need to develop further understanding of the factors that
influence the length of the learning trajectory and better characterize the outcome
[Figure 8.4 plots a one-dimensional cost function J(θ) against θ.]
Figure 8.4: Optimization based on local downhill moves can fail if the local surface does
not point toward the global solution. Here we provide an example of how this can occur,
even if there are no saddle points and no local minima. This example cost function
contains only asymptotes toward low values, not minima. The main cause of difficulty in
this case is being initialized on the wrong side of the “mountain” and not being able to
traverse it. In higher dimensional space, learning algorithms can often circumnavigate
such mountains but the trajectory associated with doing so may be long and result in
excessive training time, as illustrated in Fig. 8.2.
of the process.
Many existing research directions are aimed at finding good initial points for
problems that have difficult global structure, rather than developing algorithms
that use non-local moves.
Gradient descent and essentially all learning algorithms that are effective for
training neural networks are based on making small, local moves. The previous
sections have primarily focused on how the correct direction of these local moves
can be difficult to compute. We may be able to compute some properties of the
objective function, such as its gradient, only approximately, with bias or variance
in our estimate of the correct direction. In these cases, local descent may or may
not define a reasonably short path to a valid solution, but we are not actually
able to follow the local descent path. The objective function may have issues
such as poor conditioning or discontinuous gradients, causing the region where
the gradient provides a good model of the objective function to be very small. In
these cases, local descent with steps of size ε may define a reasonably short path to the solution, but we are only able to compute the local descent direction with steps of size δ ≪ ε. In these cases, local descent may or may not define a path
to the solution, but the path contains many steps, so following the path incurs a
high computational cost. Sometimes local information provides us no guide, when
the function has a wide flat region, or if we manage to land exactly on a critical
point (usually this latter scenario only happens to methods that solve explicitly
for critical points, such as Newton’s method). In these cases, local descent does
not define a path to a solution at all. In other cases, local moves can be too greedy
and lead us along a path that moves downhill but away from any solution, as in
Fig. 8.4, or along an unnecessarily long trajectory to the solution, as in Fig. 8.2.
Currently, we do not understand which of these problems are most relevant to
making neural network optimization difficult, and this is an active area of research.
Regardless of which of these problems are most significant, all of them might be
avoided if there exists a region of space connected reasonably directly to a solution
by a path that local descent can follow, and if we are able to initialize learning
within that well-behaved region. This last view suggests research into choosing
good initial points for traditional optimization algorithms to use.
8.2.8 Theoretical Limits of Optimization
Several theoretical results show that there are limits on the performance of any
optimization algorithm we might design for neural networks (Blum and Rivest,
1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have
little bearing on the use of neural networks in practice.
Some theoretical results apply only to the case where the units of a neural
network output discrete values. However, most neural network units output
smoothly increasing values that make optimization via local search feasible. Some
theoretical results show that there exist problem classes that are intractable, but
it can be difficult to tell whether a particular problem falls into that class. Other
results show that finding a solution for a network of a given size is intractable, but
in practice we can find a solution easily by using a larger network for which many
more parameter settings correspond to an acceptable solution. Moreover, in the
context of neural network training, we usually do not care about finding the exact
minimum of a function, but only in reducing its value sufficiently to obtain good
generalization error. Theoretical analysis of whether an optimization algorithm
can accomplish this goal is extremely difficult. Developing more realistic bounds
on the performance of optimization algorithms therefore remains an important
goal for machine learning research.
8.3 Basic Algorithms
We have previously introduced the gradient descent (Sec. 4.3) algorithm that
follows the gradient of an entire training set downhill. This may be accelerated
considerably by using stochastic gradient descent to follow the gradient of randomly
selected minibatches downhill, as discussed in Sec. 5.9 and Sec. 8.1.3.
8.3.1 Stochastic Gradient Descent
Stochastic gradient descent (SGD) and its variants are probably the most used
optimization algorithms for machine learning in general and for deep learning in
particular. As discussed in Sec. 8.1.3, it is possible to obtain an unbiased estimate
of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution.
Algorithm 8.1 shows how to follow this estimate of the gradient downhill.
Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ε_k.
Require: Initial parameter θ.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ε_k ĝ
end while
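A minimal Python rendering of Algorithm 8.1 might look as follows; the linear-regression example and the grad_minibatch interface are illustrative assumptions, not part of the algorithm itself. A fixed learning rate is used here for brevity, although in practice the rate is decayed as discussed below.

```python
import numpy as np

def sgd(grad_minibatch, theta, data, learning_rate, batch_size=32, epochs=10):
    """Minimal SGD loop in the spirit of Algorithm 8.1.

    `grad_minibatch(theta, X_batch, y_batch)` is assumed to return the average
    gradient of the per-example loss over the minibatch (Eq. 8.9).
    """
    X, y = data
    rng = np.random.default_rng(0)
    for epoch in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            g_hat = grad_minibatch(theta, X[idx], y[idx])  # gradient estimate
            theta = theta - learning_rate * g_hat          # apply update
    return theta

# Example: linear regression with squared error, gradient (2/m) X^T (X theta - y).
def grad_mse(theta, X, y):
    return 2.0 * X.T @ (X @ theta - y) / len(X)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
true_theta = rng.normal(size=5)
y = X @ true_theta + 0.01 * rng.normal(size=1000)
print(sgd(grad_mse, np.zeros(5), (X, y), learning_rate=0.05))
```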
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as ε_k.
This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum. By comparison, the true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate. A sufficient
condition to guarantee convergence of SGD is that

Σ_{k=1}^∞ ε_k = ∞,  and  Σ_{k=1}^∞ ε_k² < ∞.    (8.12)

In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ    (8.13)

with α = k/τ. After iteration τ, it is common to leave ε constant.
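Eq. 8.13 translates directly into code; the particular values of ε_0, ε_τ, and τ below are illustrative.

```python
def learning_rate(k, eps0, eps_tau, tau):
    """Linear decay of the learning rate until iteration tau (Eq. 8.13),
    constant afterwards. A sketch of the schedule described above."""
    if k < tau:
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau
    return eps_tau

# e.g. decay from 0.1 to 0.001 (roughly 1% of eps0) over 10,000 iterations
for k in (0, 5000, 10000, 20000):
    print(k, learning_rate(k, eps0=0.1, eps_tau=0.001, tau=10000))
```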
The learning rate may be chosen by trial and error, but it is usually best
to choose it by monitoring learning curves that plot the objective function as a
function of time. This is more of an art than a science, and most guidance on this
subject should be regarded with some skepticism. When using the linear schedule,
the parameters to choose are ε_0, ε_τ, and τ. Usually τ may be set to the number of iterations required to make a few hundred passes through the training set. Usually ε_τ should be set to roughly 1% the value of ε_0. The main question is how to set ε_0.
If it is too large, the learning curve will show violent oscillations, with the cost
function often increasing significantly. Gentle oscillations are fine, especially if
training with a stochastic cost function such as the cost function arising from the
use of dropout. If the learning rate is too low, learning proceeds slowly, and if the
initial learning rate is too low, learning may become stuck with a high cost value.
Typically, the optimal initial learning rate, in terms of total training time and the
final cost value, is higher than the learning rate that yields the best performance
after the first 100 iterations or so. Therefore, it is usually best to monitor the first
several iterations and use a learning rate that is higher than the best-performing
learning rate at this time, but not so high that it causes severe instability.
The most important property of SGD and related minibatch or online gradient-
based optimization is that computation time per update does not grow with the
number of training examples. This allows convergence even when the number
of training examples becomes very large. For a large enough dataset, SGD may
converge to within some fixed tolerance of its final test set error before it has
processed the entire training set.
To study the convergence rate of an optimization algorithm it is common to
measure the excess error J(θ) − min_θ J(θ), which is the amount that the current cost function exceeds the minimum possible cost. When SGD is applied to a convex problem, the excess error is O(1/√k) after k iterations, while in the strongly convex
case it is O(1/k). These bounds cannot be improved unless extra conditions are
assumed. Batch gradient descent enjoys better convergence rates than stochastic
gradient descent in theory. However, the Cramér-Rao bound (Cramér, 1946; Rao,
1945) states that generalization error cannot decrease faster than O(1/k). Bottou and Bousquet (2008) argue that it therefore may not be worthwhile to pursue an optimization algorithm that converges faster than O(1/k) for machine learning tasks—faster convergence presumably corresponds to overfitting. Moreover, the asymptotic analysis obscures many advantages that stochastic gradient descent has after a small number of steps. With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence. Most of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the O(1/k) asymptotic analysis. One can also
trade off the benefits of both batch and stochastic gradient descent by gradually
increasing the minibatch size during the course of learning.
For more information on SGD, see Bottou (1998).
8.3.2 Momentum
While stochastic gradient descent remains a very popular optimization strategy,
learning with it can sometimes be slow. The method of momentum (Polyak, 1964)
is designed to accelerate learning, especially in the face of high curvature, small but
consistent gradients, or noisy gradients. The momentum algorithm accumulates
an exponentially decaying moving average of past gradients and continues to move
in their direction. The effect of momentum is illustrated in Fig. 8.5.
Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton’s laws of motion. Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay.
Figure 8.5: Momentum aims primarily to solve two problems: poor conditioning of the
Hessian matrix and variance in the stochastic gradient. Here, we illustrate how momentum
overcomes the first of these two problems. The contour lines depict a quadratic loss
function with a poorly conditioned Hessian matrix. The red path cutting across the
contours indicates the path followed by the momentum learning rule as it minimizes this
function. At each step along the way, we draw an arrow indicating the step that gradient
descent would take at that point. We can see that a poorly conditioned quadratic objective
looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses
the canyon lengthwise, while gradient steps waste time moving back and forth across the
narrow axis of the canyon. Compare also Fig. 4.6, which shows the behavior of gradient
descent without momentum.
The update rule is given by:
v ← αv − ε ∇_θ ((1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i))),
θ ← θ + v.

The velocity v accumulates the gradient elements ε ∇_θ ((1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i))). The larger α is relative to ε, the more previous gradients affect the current direction.
The SGD algorithm with momentum is given in Algorithm 8.2.
Previously, the size of the step was simply the norm of the gradient multiplied
by the learning rate. Now, the size of the step depends on how large and how
aligned a sequence of gradients are. The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g, then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is

ε‖g‖ / (1 − α).    (8.14)

It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = .9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.
Common values of α used in practice include .5, .9, and .99. Like the learning rate, α may also be adapted over time. Typically it begins with a small value and is later raised. It is less important to adapt α over time than to shrink ε over time.
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
end while
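To make the update concrete, the following is a minimal NumPy sketch of Algorithm 8.2. The `grad_loss` and `sample_minibatch` callables are placeholders standing in for whatever model and data pipeline are being trained, and the default hyperparameter values are merely illustrative.

import numpy as np

def sgd_momentum(theta, grad_loss, sample_minibatch,
                 lr=0.01, alpha=0.9, n_steps=1000):
    # v accumulates an exponentially decaying average of past gradients.
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        X, y = sample_minibatch()          # minibatch of m examples (placeholder)
        g = grad_loss(theta, X, y)         # average gradient over the minibatch (placeholder)
        v = alpha * v - lr * g             # velocity update: v <- alpha*v - eps*g
        theta = theta + v                  # parameter update: theta <- theta + v
    return theta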
We can view the momentum algorithm as simulating a particle subject to
continuous-time Newtonian dynamics. The physical analogy can help to build
intuition for how the momentum and gradient descent algorithms behave.
The position of the particle at any point in time is given by θ(t). The particle
experiences net force f(t). This force causes the particle to accelerate:
$$f(t) = \frac{\partial^2}{\partial t^2} \theta(t). \tag{8.15}$$
Rather than viewing this as a second-order differential equation of the position,
we can introduce the variable v(t) representing the velocity of the particle at time
t and rewrite the Newtonian dynamics as a first-order differential equation:
$$v(t) = \frac{\partial}{\partial t} \theta(t), \tag{8.16}$$
$$f(t) = \frac{\partial}{\partial t} v(t). \tag{8.17}$$
The momentum algorithm then consists of solving the differential equations via
numerical simulation. A simple numerical method for solving differential equations
is Euler’s method, which simply consists of simulating the dynamics defined by
the equation by taking small, finite steps in the direction of each gradient.
This explains the basic form of the momentum update, but what specifically are
the forces? One force is proportional to the negative gradient of the cost function:
−∇_θ J(θ). This force pushes the particle downhill along the cost function surface.
The gradient descent algorithm would simply take a single step based on each
gradient, but the Newtonian scenario used by the momentum algorithm instead
uses this force to alter the velocity of the particle. We can think of the particle
as being like a hockey puck sliding down an icy surface. Whenever it descends a
steep part of the surface, it gathers speed and continues sliding in that direction
until it begins to go uphill again.
One other force is necessary. If the only force is the gradient of the cost function,
then the particle might never come to rest. Imagine a hockey puck sliding down
one side of a valley and straight up the other side, oscillating back and forth forever,
assuming the ice is perfectly frictionless. To resolve this problem, we add one
other force, proportional to −v(t). In physics terminology, this force corresponds
to viscous drag, as if the particle must push through a resistant medium such as
syrup. This causes the particle to gradually lose energy over time and eventually
converge to a local minimum.
Why do we use
v
(
t
) and viscous drag in particular? Part of the reason to
use
v
(
t
) is mathematical convenience—an integer power of the velocity is easy
to work with. However, other physical systems have other kinds of drag based
on other integer powers of the velocity. For example, a particle traveling through
the air experiences turbulent drag, with force proportional to the square of the
velocity, while a particle moving along the ground experiences dry friction, with a
force of constant magnitude. We can reject each of these options. Turbulent drag,
proportional to the square of the velocity, becomes very weak when the velocity is
small. It is not powerful enough to force the particle to come to rest. A particle
with a non-zero initial velocity that experiences only the force of turbulent drag
will move away from its initial position forever, with the distance from the starting
point growing like O(log t). We must therefore use a lower power of the velocity.
If we use a power of zero, representing dry friction, then the force is too strong.
When the force due to the gradient of the cost function is small but non-zero, the
constant force due to friction can cause the particle to come to rest before reaching
a local minimum. Viscous drag avoids both of these problems—it is weak enough
that the gradient can continue to cause motion until a minimum is reached, but
strong enough to prevent motion if the gradient does not justify moving.
8.3.3 Nesterov Momentum
Sutskever et al. (2013) introduced a variant of the momentum algorithm that was
inspired by Nesterov’s accelerated gradient method (Nesterov, 1983, 2004). The
update rules in this case are given by:
v αv
θ
1
m
m
i=1
L
f(x
(i)
; θ + αv), y
(i)
,
θ θ + v, (8.18)
where the parameters α and ε play a similar role as in the standard momentum
method. The difference between Nesterov momentum and standard momentum is
where the gradient is evaluated. With Nesterov momentum the gradient is evaluated
after the current velocity is applied. Thus one can interpret Nesterov momentum
as attempting to add a correction factor to the standard method of momentum.
The complete Nesterov momentum algorithm is presented in Algorithm 8.3.
In the convex batch gradient case, Nesterov momentum brings the rate of
convergence of the excess error from O(1/k) (after k steps) to O(1/k²), as shown
by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov
momentum does not improve the rate of convergence.
8.4 Parameter Initialization Strategies
Some optimization algorithms are not iterative by nature and simply solve for a
solution point. Other optimization algorithms are iterative by nature but, when
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding labels y^(i).
    Apply interim update: θ̃ ← θ + αv
    Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
end while
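A matching NumPy sketch of Algorithm 8.3 highlights the only change relative to standard momentum: the gradient is evaluated at the interim point θ + αv. As before, `grad_loss` and `sample_minibatch` are placeholder callables.

import numpy as np

def sgd_nesterov(theta, grad_loss, sample_minibatch,
                 lr=0.01, alpha=0.9, n_steps=1000):
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        X, y = sample_minibatch()
        theta_interim = theta + alpha * v      # apply interim (look-ahead) update
        g = grad_loss(theta_interim, X, y)     # gradient evaluated at the interim point
        v = alpha * v - lr * g
        theta = theta + v
    return theta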
applied to the right class of optimization problems, converge to acceptable solutions
in an acceptable amount of time regardless of initialization. Deep learning training
algorithms usually do not have either of these luxuries. Training algorithms for deep
learning models are usually iterative in nature and thus require the user to specify
some initial point from which to begin the iterations. Moreover, training deep
models is a sufficiently difficult task that most algorithms are strongly affected by
the choice of initialization. The initial point can determine whether the algorithm
converges at all, with some initial points being so unstable that the algorithm
encounters numerical difficulties and fails altogether. When learning does converge,
the initial point can determine how quickly learning converges and whether it
converges to a point with high or low cost. Also, points of comparable cost
can have wildly varying generalization error, and the initial point can affect the
generalization as well.
Modern initialization strategies are simple and heuristic. Designing improved
initialization strategies is a difficult task because neural network optimization is
not yet well understood. Most initialization strategies are based on achieving some
nice properties when the network is initialized. However, we do not have a good
understanding of which of these properties are preserved under which circumstances
after learning begins to proceed. A further difficulty is that some initial points
may be beneficial from the viewpoint of optimization but detrimental from the
viewpoint of generalization. Our understanding of how the initial point affects
generalization is especially primitive, offering little to no guidance for how to select
the initial point.
Perhaps the only property known with complete certainty is that the initial
parameters need to “break symmetry” between different units. If two hidden
units with the same activation function are connected to the same inputs, then
these units must have different initial parameters. If they have the same initial
parameters, then a deterministic learning algorithm applied to a deterministic cost
and model will constantly update both of these units in the same way. Even if the
model or training algorithm is capable of using stochasticity to compute different
updates for different units (for example, if one trains with dropout), it is usually
best to initialize each unit to compute a different function from all of the other
units. This may help to make sure that no input patterns are lost in the null
space of forward propagation and no gradient patterns are lost in the null space
of back-propagation. The goal of having each unit compute a different function
motivates random initialization of the parameters. We could explicitly search
for a large set of basis functions that are all mutually different from each other,
but this often incurs a noticeable computational cost. For example, if we have at
most as many outputs as inputs, we could use Gram-Schmidt orthogonalization
on an initial weight matrix, and be guaranteed that each unit computes a very
different function from each other unit. Random initialization from a high-entropy
distribution over a high-dimensional space is computationally cheaper and unlikely
to assign any units to compute the same function as each other.
Typically, we set the biases for each unit to heuristically chosen constants, and
initialize only the weights randomly. Extra parameters, for example, parameters
encoding the conditional variance of a prediction, are usually set to heuristically
chosen constants much like the biases are.
We almost always initialize all the weights in the model to values drawn
randomly from a Gaussian or uniform distribution. The choice of Gaussian
or uniform distribution does not seem to matter very much, but has not been
exhaustively studied. The scale of the initial distribution, however, does have a
large effect on both the outcome of the optimization procedure and on the ability
of the network to generalize.
Larger initial weights will yield a stronger symmetry breaking effect, helping
to avoid redundant units. They also help to avoid losing signal during forward or
back-propagation through the linear component of each layer—larger values in the
matrix result in larger outputs of matrix multiplication. Initial weights that are
too large may, however, result in exploding values during forward propagation or
back-propagation. In recurrent networks, large weights can also result in chaos
(such extreme sensitivity to small perturbations of the input that the behavior
of the deterministic forward propagation procedure appears random). To some
extent, the exploding gradient problem can be mitigated by gradient clipping
(thresholding the values of the gradients before performing a gradient descent step).
Large weights may also result in extreme values that cause the activation function
to saturate, causing complete loss of gradient through saturated units. These
competing factors determine the ideal initial scale of the weights.
The perspectives of regularization and optimization can give very different
insights into how we should initialize a network. The optimization perspective
suggests that the weights should be large enough to propagate information success-
fully, but some regularization concerns encourage making them smaller. The use
of an optimization algorithm such as stochastic gradient descent that makes small
incremental changes to the weights and tends to halt in areas that are nearer to
the initial parameters (whether due to getting stuck in a region of low gradient, or
due to triggering some early stopping criterion based on overfitting) expresses a
prior that the final parameters should be close to the initial parameters. Recall
from Sec. 7.8 that gradient descent with early stopping is equivalent to weight
decay for some models. In the general case, gradient descent with early stopping is
not the same as weight decay, but does provide a loose analogy for thinking about
the effect of initialization. We can think of initializing the parameters θ to θ₀ as
being similar to imposing a Gaussian prior p(θ) with mean θ₀. From this point
of view, it makes sense to choose θ₀ to be near 0. This prior says that it is more
likely that units do not interact with each other than that they do interact. Units
interact only if the likelihood term of the objective function expresses a strong
preference for them to interact. On the other hand, if we initialize θ₀ to large
values, then our prior specifies which units should interact with each other, and
how they should interact.
Some heuristics are available for choosing the initial scale of the weights. One
heuristic is to initialize the weights of a fully connected layer with m inputs and
n outputs by sampling each weight from $U\!\left(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}}\right)$, while Glorot and Bengio
(2010) suggest using the normalized initialization
$$W_{i,j} \sim U\left(-\sqrt{\frac{6}{m + n}}, \sqrt{\frac{6}{m + n}}\right). \tag{8.19}$$
This latter heuristic is designed to compromise between the goal of initializing
all layers to have the same activation variance and the goal of initializing all
layers to have the same gradient variance. The formula is derived using the
assumption that the network consists only of a chain of matrix multiplications,
with no nonlinearities. Real neural networks obviously violate this assumption,
but many strategies designed for the linear model perform reasonably well on its
nonlinear counterparts.
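As a concrete illustration, here is a hedged NumPy sketch of the two uniform-range heuristics just described; the function names are mine, and the choice between the two rules is the layer-by-layer judgment call discussed above.

import numpy as np

def init_uniform_fanin(m, n, rng=None):
    # Sample a weight matrix for a layer with m inputs and n outputs
    # from U(-1/sqrt(m), 1/sqrt(m)).
    rng = np.random.default_rng() if rng is None else rng
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))

def init_glorot_uniform(m, n, rng=None):
    # Normalized ("Glorot") initialization of Eq. 8.19:
    # U(-sqrt(6/(m+n)), sqrt(6/(m+n))).
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))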
Saxe et al. (2013) recommend initializing to random orthogonal matrices, with
a carefully chosen scaling or gain factor
g
that accounts for the nonlinearity applied
at each layer. They derive specific values of the scaling factor for different types of
nonlinear activation functions. This initialization scheme is also motivated by a
model of a deep network as a sequence of matrix multiplies without nonlinearities.
Under such a model, this initialization scheme guarantees that the total number of
training iterations required to reach convergence is independent of depth.
Increasing the scaling factor
g
pushes the network toward the regime where
activations increase in norm as they propagate forward through the network and
gradients increase in norm as they propagate backward. Sussillo (2014) showed
that setting the gain factor correctly is sufficient to train networks as deep as
1,000 layers, without needing to use orthogonal initializations. A key insight of
this approach is that in feedforward networks, activations and gradients can grow
or shrink on each step of forward or back-propagation, following a random walk
behavior. This is because feedforward networks use a different weight matrix at
each layer. If this random walk is tuned to preserve norms, then feedforward
networks can mostly avoid the vanishing and exploding gradients problem that
arises when the same weight matrix is used at each step, described in Sec. 8.2.5.
Unfortunately, these optimal criteria for initial weights often do not lead to
optimal performance. This may be for three different reasons. First, we may
be using the wrong criteria—it may not actually be beneficial to preserve the
norm of a signal throughout the entire network. Second, the properties imposed
at initialization may not persist after learning has begun to proceed. Third, the
criteria might succeed at improving the speed of optimization but inadvertently
increase generalization error. In practice, we usually need to treat the scale of the
weights as a hyperparameter whose optimal value lies somewhere roughly near but
not exactly equal to the theoretical predictions.
One drawback to scaling rules that set all of the initial weights to have the same
standard deviation, such as 1/√m, is that every individual weight becomes extremely
small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to
have exactly
k
non-zero weights. The idea is to keep the total amount of input to
the unit independent from the number of inputs
m
without making the magnitude
of individual weight elements shrink with
m
. Sparse initialization helps to achieve
more diversity among the units at initialization time. However, it also imposes
a very strong prior on the weights that are chosen to have large Gaussian values.
Because it takes a long time for gradient descent to shrink “incorrect” large values,
this initialization scheme can cause problems for units such as maxout units that
have several filters that must be carefully coordinated with each other.
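A hedged NumPy sketch of sparse initialization in this spirit follows; the default k = 15 and the unit-Gaussian scale of the non-zero entries are illustrative assumptions of mine, not values prescribed by the text.

import numpy as np

def sparse_init(m, n, k=15, rng=None):
    # Each of the n units receives exactly k non-zero incoming weights,
    # regardless of the number of inputs m.
    rng = np.random.default_rng() if rng is None else rng
    k = min(k, m)
    W = np.zeros((m, n))
    for j in range(n):
        idx = rng.choice(m, size=k, replace=False)   # which inputs connect to unit j
        W[idx, j] = rng.standard_normal(k)           # non-zero weights drawn from a Gaussian
    return W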
When computational resources allow it, it is usually a good idea to treat the
initial scale of the weights for each layer as a hyperparameter, and to choose these
scales using a hyperparameter search algorithm described in Sec. 11.4.2, such
as random search. The choice of whether to use dense or sparse initialization
can also be made a hyperparameter. Alternately, one can manually search for
the best initial scales. A good rule of thumb for choosing the initial scales is to
look at the range or standard deviation of activations or gradients on a single
minibatch of data. If the weights are too small, the range of activations across the
minibatch will shrink as the activations propagate forward through the network.
By repeatedly identifying the first layer with unacceptably small activations and
increasing its weights, it is possible to eventually obtain a network with reasonable
initial activations throughout. If learning is still too slow at this point, it can be
useful to look at the range or standard deviation of the gradients as well as the
activations. This procedure can in principle be automated and is generally less
computationally costly than hyperparameter optimization based on validation set
error because it is based on feedback from the behavior of the initial model on a
single batch of data, rather than on feedback from a trained model on the validation
set. While long used heuristically, this protocol has recently been specified more
formally and studied by Mishkin and Matas (2015).
So far we have focused on the initialization of the weights. Fortunately,
initialization of other parameters is typically easier.
The approach for setting the biases must be coordinated with the approach
for setting the weights. Setting the biases to zero is compatible with most weight
initialization schemes. There are a few situations where we may set some biases to
non-zero values:
If a bias is for an output unit, then it is often beneficial to initialize the bias to
obtain the right marginal statistics of the output. To do this, we assume that
the initial weights are small enough that the output of the unit is determined
only by the bias. This justifies setting the bias to the inverse of the activation
function applied to the marginal statistics of the output in the training set.
For example, if the output is a distribution over classes and this distribution
is a highly skewed distribution with the marginal probability of class i given
by element c_i of some vector c, then we can set the bias vector b by solving
the equation softmax(b) = c; a small numerical sketch of this appears after these
examples. This applies not only to classifiers but also to
models we will encounter in Part III, such as autoencoders and Boltzmann
machines. These models have layers whose output should resemble the input
data x, and it can be very helpful to initialize the biases of such layers to
match the marginal distribution over x.
Sometimes we may want to choose the bias to avoid causing too much
saturation at initialization. For example, we may set the bias of a ReLU
hidden unit to 0.1 rather than 0 to avoid saturating the ReLU at initialization.
This approach is not compatible with weight initialization schemes that do
not expect strong input from the biases though. For example, it is not
recommended for use with random walk initialization (Sussillo, 2014).
Sometimes a unit controls whether other units are able to participate in a
function. In such situations, we have a unit with output u and another unit
h ∈ [0, 1]; we can then view h as a gate that determines whether uh ≈ 1 or
uh ≈ 0. In these situations, we want to set the bias for h so that h ≈ 1 most
of the time at initialization. Otherwise u does not have a chance to learn.
For example, Jozefowicz et al. (2015b) advocate setting the bias to 1 for the
forget gate of the LSTM model, described in Sec. 10.10.
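For the first of these cases, solving softmax(b) = c has a simple closed form: b = log c, up to an additive constant that the softmax ignores. A minimal sketch, assuming c has strictly positive entries:

import numpy as np

def output_bias_from_marginals(c):
    # Any b of the form log(c) + constant satisfies softmax(b) = c.
    return np.log(np.asarray(c, dtype=float))

c = np.array([0.90, 0.05, 0.05])          # highly skewed class marginals
b = output_bias_from_marginals(c)
p = np.exp(b) / np.exp(b).sum()           # softmax(b)
assert np.allclose(p, c)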
Another common type of parameter is a variance or precision parameter. For
example, we can perform linear regression with a conditional variance estimate
using the model
$$p(y \mid x) = \mathcal{N}(y \mid w^\top x + b, 1/\beta), \tag{8.20}$$
where β is a precision parameter. We can usually initialize variance or precision
parameters to 1 safely. Another approach is to assume the initial weights are close
enough to zero that the biases may be set while ignoring the effect of the weights,
then set the biases to produce the correct marginal mean of the output, and set
the variance parameters to the marginal variance of the output in the training set.
Besides these simple constant or random methods of initializing model parame-
ters, it is possible to initialize model parameters using machine learning. A common
strategy discussed in Part III of this book is to initialize a supervised model with
the parameters learned by an unsupervised model trained on the same inputs.
One can also perform supervised training on a related task. Even performing
supervised training on an unrelated task can sometimes yield an initialization that
offers faster convergence than a random initialization. Some of these initialization
strategies may yield faster convergence and better generalization because they
encode information about the distribution in the initial parameters of the model.
Others apparently perform well primarily because they set the parameters to have
the right scale or set different units to compute different functions from each other.
8.5 Algorithms with Adaptive Learning Rates
Neural network researchers have long realized that the learning rate is reliably one
of the most difficult hyperparameters to set because it has a significant
impact on model performance. As we have discussed in Sec. 4.3 and Sec. 8.2, the
cost is often highly sensitive to some directions in parameter space and insensitive
to others. The momentum algorithm can mitigate these issues somewhat, but
does so at the expense of introducing another hyperparameter. In the face of this,
it is natural to ask if there is another way. If we believe that the directions of
sensitivity are somewhat axis-aligned, it can make sense to use a separate learning
rate for each parameter, and automatically adapt these learning rates throughout
the course of learning.
The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach
to adapting individual learning rates for model parameters during training. The
approach is based on a simple idea: if the partial derivative of the loss, with respect
to a given model parameter, remains the same sign, then the learning rate should
increase. If the partial derivative with respect to that parameter changes sign,
then the learning rate should decrease. Of course, this kind of rule can only be
applied to full batch optimization.
More recently, a number of incremental (or mini-batch-based) methods have
been introduced that adapt the learning rates of model parameters. This section
will briefly review a few of these algorithms.
8.5.1 AdaGrad
The AdaGrad algorithm, shown in Algorithm 8.4, individually adapts the learning
rates of all model parameters by scaling them inversely proportional to the square
root of the sum of all of their historical squared values (Duchi et al., 2011). The
parameters with the largest partial derivative of the loss have a correspondingly
rapid decrease in their learning rate, while parameters with small partial derivatives
have a relatively small decrease in their learning rate. The net effect is greater
progress in the more gently sloped directions of parameter space.
In the context of convex optimization, the AdaGrad algorithm enjoys some
desirable theoretical properties. However, empirically it has been found that—for
training deep neural network models—the accumulation of squared gradients
from
the beginning of training
can result in a premature and excessive decrease
in the effective learning rate. AdaGrad performs well for some but not all deep
learning models.
Algorithm 8.4 The AdaGrad algorithm
Require: Global learning rate ε
Require: Initial parameter θ
Require: Small constant δ, perhaps 10⁻⁷, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Accumulate squared gradient: r ← r + g ⊙ g
    Compute update: Δθ ← −(ε / (δ + √r)) ⊙ g  (division and square root applied element-wise)
    Apply update: θ ← θ + Δθ
end while
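A hedged NumPy sketch of the AdaGrad update in Algorithm 8.4, using the same placeholder `grad_loss` and `sample_minibatch` callables as in the earlier momentum sketches:

import numpy as np

def adagrad(theta, grad_loss, sample_minibatch,
            lr=0.01, delta=1e-7, n_steps=1000):
    r = np.zeros_like(theta)                    # accumulated squared gradients
    for _ in range(n_steps):
        X, y = sample_minibatch()
        g = grad_loss(theta, X, y)
        r = r + g * g                           # accumulate squared gradient
        theta = theta - lr * g / (delta + np.sqrt(r))   # element-wise rescaled step
    return theta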
8.5.2 RMSProp
The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the
non-convex setting by changing the gradient accumulation into an exponentially
weighted moving average. AdaGrad is designed to converge rapidly when applied
to a convex function. When applied to a non-convex function to train a neural
network, the learning trajectory may pass through many different structures and
eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the
learning rate according to the entire history of the squared gradient and may
have made the learning rate too small before arriving at such a convex structure.
RMSProp uses an exponentially decaying average to discard history from the
extreme past so that it can converge rapidly after finding a convex bowl, as if it
were an instance of the AdaGrad algorithm initialized within that bowl.
RMSProp is shown in its standard form in Algorithm 8.5 and combined with
Nesterov momentum in Algorithm 8.6. Compared to AdaGrad, the use of the
moving average introduces a new hyperparameter,
ρ
, that controls the length scale
of the moving average.
Empirically, RMSProp has been shown to be an effective and practical op-
timization algorithm for deep neural networks. It is currently one of the go-to
optimization methods being employed routinely by deep learning practitioners.
Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate ε, decay rate ρ.
Require: Initial parameter θ
Require: Small constant δ, usually 10⁻⁶, used to stabilize division by small numbers.
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
    Compute parameter update: Δθ = −(ε / √(δ + r)) ⊙ g  (1/√(δ + r) applied element-wise)
    Apply update: θ ← θ + Δθ
end while
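The corresponding hedged sketch for Algorithm 8.5 differs from the AdaGrad sketch only in the exponentially weighted accumulator:

import numpy as np

def rmsprop(theta, grad_loss, sample_minibatch,
            lr=0.001, rho=0.9, delta=1e-6, n_steps=1000):
    r = np.zeros_like(theta)
    for _ in range(n_steps):
        X, y = sample_minibatch()
        g = grad_loss(theta, X, y)
        r = rho * r + (1.0 - rho) * g * g       # decaying average of squared gradients
        theta = theta - lr * g / np.sqrt(delta + r)
    return theta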
8.5.3 Adam
Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization
algorithm and is presented in Algorithm 8.7. The name “Adam” derives from
the phrase “adaptive moments.” In the context of the earlier algorithms, it is
perhaps best seen as a variant on the combination of RMSProp and momentum
with a few important distinctions. First, in Adam, momentum is incorporated
directly as an estimate of the first order moment (with exponential weighting) of
the gradient. The most straightforward way to add momentum to RMSProp is to
apply momentum to the rescaled gradients. The use of momentum in combination
with rescaling does not have a clear theoretical motivation. Second, Adam includes
bias corrections to the estimates of both the first-order moments (the momentum
term) and the (uncentered) second-order moments to account for their initialization
at the origin (see Algorithm 8.7). RMSProp also incorporates an estimate of the
(uncentered) second-order moment, however it lacks the correction factor. Thus,
unlike in Adam, the RMSProp second-order moment estimate may have high bias
early in training. Adam is generally regarded as being fairly robust to the choice
of hyperparameters, though the learning rate sometimes needs to be changed from
the suggested default.
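A hedged NumPy sketch of the Adam update with the suggested defaults follows (the full pseudocode appears in Algorithm 8.7); `grad_loss` and `sample_minibatch` remain placeholder callables.

import numpy as np

def adam(theta, grad_loss, sample_minibatch,
         lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, n_steps=1000):
    s = np.zeros_like(theta)    # biased first moment estimate
    r = np.zeros_like(theta)    # biased second moment estimate
    for t in range(1, n_steps + 1):
        X, y = sample_minibatch()
        g = grad_loss(theta, X, y)
        s = rho1 * s + (1.0 - rho1) * g
        r = rho2 * r + (1.0 - rho2) * g * g
        s_hat = s / (1.0 - rho1 ** t)           # correct bias from initialization at zero
        r_hat = r / (1.0 - rho2 ** t)
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta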
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute interim update: θ̃ ← θ + αv
    Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
    Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
    Compute velocity update: v ← αv − (ε / √r) ⊙ g  (1/√r applied element-wise)
    Apply update: θ ← θ + v
end while
8.5.4 Choosing the Right Optimization Algorithm
In this section, we discussed a series of related algorithms that each seek to address
the challenge of optimizing deep models by adapting the learning rate for each
model parameter. At this point, a natural question is: which algorithm should one
choose?
Unfortunately, there is currently no consensus on this point. Schaul et al. (2014)
presented a valuable comparison of a large number of optimization algorithms
across a wide range of learning tasks. While the results suggest that the family of
algorithms with adaptive learning rates (represented by RMSProp and AdaDelta)
performed fairly robustly, no single best algorithm has emerged.
Currently, the most popular optimization algorithms actively in use include
SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta
and Adam. The choice of which algorithm to use, at this point, seems to depend
largely on the user’s familiarity with the algorithm (for ease of hyperparameter
tuning).
8.6 Approximate Second-Order Methods
In this section we discuss the application of second-order methods to the training
of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject.
For simplicity of exposition, the only objective function we examine is the empirical
Algorithm 8.7 The Adam algorithm
Require: Step size ε (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ₁ and ρ₂ in [0, 1). (Suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant δ used for numerical stabilization. (Suggested default: 10⁻⁸)
Require: Initial parameters θ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    t ← t + 1
    Update biased first moment estimate: s ← ρ₁ s + (1 − ρ₁) g
    Update biased second moment estimate: r ← ρ₂ r + (1 − ρ₂) g ⊙ g
    Correct bias in first moment: ŝ ← s / (1 − ρ₁^t)
    Correct bias in second moment: r̂ ← r / (1 − ρ₂^t)
    Compute update: Δθ = −ε ŝ / (√r̂ + δ)  (operations applied element-wise)
    Apply update: θ ← θ + Δθ
end while
risk:
$$J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\text{data}}(x,y)}\big[L(f(x; \theta), y)\big] = \frac{1}{m} \sum_{i=1}^{m} L\big(f(x^{(i)}; \theta), y^{(i)}\big). \tag{8.21}$$
However the methods we discuss here extend readily to more general objective
functions that, for instance, include parameter regularization terms such as those
discussed in Chapter 7.
8.6.1 Newton’s Method
In Sec. 4.3, we introduced second-order gradient methods. In contrast to first-
order methods, second-order methods make use of second derivatives to improve
optimization. The most widely used second-order method is Newton’s method. We
now describe Newton’s method in more detail, with emphasis on its application to
neural network training.
Newton’s method is an optimization scheme based on using a second-order Taylor
series expansion to approximate J(θ) near some point θ₀, ignoring derivatives
of higher order:
$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^\top H\,(\theta - \theta_0), \tag{8.22}$$
where H is the Hessian of J with respect to θ evaluated at θ₀. If we then solve for
the critical point of this function, we obtain the Newton parameter update rule:
$$\theta^* = \theta_0 - H^{-1}\nabla_\theta J(\theta_0). \tag{8.23}$$
Thus for a locally quadratic function (with positive definite H), by rescaling
the gradient by H⁻¹, Newton’s method jumps directly to the minimum. If the
objective function is convex but not quadratic (there are higher-order terms), this
update can be iterated, yielding the training algorithm associated with Newton’s
method, given in Algorithm 8.8.
Algorithm 8.8 Newton’s method with objective J(θ) = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)).
Require: Initial parameter θ₀
Require: Training set of m examples
while stopping criterion not met do
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute Hessian: H ← (1/m) ∇²_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute Hessian inverse: H⁻¹
    Compute update: Δθ = −H⁻¹g
    Apply update: θ = θ + Δθ
end while
For surfaces that are not quadratic, as long as the Hessian remains positive
definite, Newton’s method can be applied iteratively. This implies a two-step
iterative procedure. First, update or compute the inverse Hessian (i.e. by updating
the quadratic approximation). Second, update the parameters according to Eq.
8.23.
In Sec. 8.2.3, we discussed how Newton’s method is appropriate only when
the Hessian is positive definite. In deep learning, the surface of the objective
function is typically non-convex with many features, such as saddle points, that
are problematic for Newton’s method. If the eigenvalues of the Hessian are not
all positive, for example, near a saddle point, then Newton’s method can actually
cause updates to move in the wrong direction. This situation can be avoided
by regularizing the Hessian. Common regularization strategies include adding a
constant, α, along the diagonal of the Hessian. The regularized update becomes
$$\theta^* = \theta_0 - \big[H(f(\theta_0)) + \alpha I\big]^{-1} \nabla_\theta f(\theta_0). \tag{8.24}$$
This regularization strategy is used in approximations to Newton’s method, such
as the Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), and
works fairly well as long as the negative eigenvalues of the Hessian are still relatively
close to zero. In cases where there are more extreme directions of curvature, the
value of
α
would have to be sufficiently large to offset the negative eigenvalues.
However, as
α
increases in size, the Hessian becomes dominated by the
αI
diagonal
and the direction chosen by Newton’s method converges to the standard gradient
divided by
α
. When strong negative curvature is present,
α
may need to be so
large that Newton’s method would make smaller steps than gradient descent with
a properly chosen learning rate.
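A hedged NumPy sketch of a single regularized Newton step in the spirit of Eq. 8.24; `grad` and `hessian` are placeholder callables returning the gradient and Hessian of the objective, and the linear system is solved rather than forming the inverse explicitly.

import numpy as np

def regularized_newton_step(theta, grad, hessian, alpha=1e-2):
    g = grad(theta)
    H = hessian(theta)
    H_reg = H + alpha * np.eye(H.shape[0])    # add alpha along the diagonal
    return theta - np.linalg.solve(H_reg, g)  # Newton step with the regularized Hessian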
Beyond the challenges created by certain features of the objective function,
such as saddle points, the application of Newton’s method for training large neural
networks is limited by the significant computational burden it imposes. The
number of elements in the Hessian is squared in the number of parameters, so with
k parameters (and for even very small neural networks the number of parameters
k can be in the millions), Newton’s method would require the inversion of a k × k
matrix, with computational complexity of O(k³). Also, since the parameters
will change with every update, the inverse Hessian has to be computed at every
training iteration. As a consequence, only networks with a very small number
of parameters can be practically trained via Newton’s method. In the remainder
of this section, we will discuss alternatives that attempt to gain some of the
advantages of Newton’s method while side-stepping the computational hurdles.
8.6.2 Conjugate Gradients
Conjugate gradients is a method to efficiently avoid the calculation of the inverse
Hessian by iteratively descending conjugate directions. The inspiration for this
approach follows from a careful study of the weakness of the method of steepest
descent (see Sec. 4.3 for details), where line searches are applied iteratively in
the direction associated with the gradient. Fig. 8.6 illustrates how the method of
steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective
back-and-forth, zig-zag pattern. This happens because each line search direction,
when given by the gradient, is guaranteed to be orthogonal to the previous line
search direction.
Figure 8.6: The method of steepest descent applied to a quadratic cost surface. The
method of steepest descent involves jumping to the point of lowest cost along the line
defined by the gradient at the initial point on each step. This resolves some of the problems
seen with using a fixed learning rate in Fig. 4.6, but even with the optimal step size the
algorithm still makes back-and-forth progress toward the optimum. By definition, at
the minimum of the objective along a given direction, the gradient at the final point is
orthogonal to that direction.
Let the previous search direction be d_{t−1}. At the minimum, where the line
search terminates, the directional derivative is zero in direction d_{t−1}:
∇_θ J(θ) · d_{t−1} = 0. Since the gradient at this point defines the current search direction,
d_t = ∇_θ J(θ) will have no contribution in the direction d_{t−1}. Thus d_t is orthogonal
to d_{t−1}. This relationship between d_{t−1} and d_t is illustrated in Fig. 8.6 for
multiple iterations of steepest descent. As demonstrated in the figure, the choice of
orthogonal directions of descent does not preserve the minimum along the previous
search directions. This gives rise to the zig-zag pattern of progress, where by
descending to the minimum in the current gradient direction, we must re-minimize
the objective in the previous gradient direction. Thus, by following the gradient at
the end of each line search we are, in a sense, undoing progress we have already
made in the direction of the previous line search. The method of conjugate gradients
seeks to address this problem.
In the method of conjugate gradients, we seek to find a search direction that is
conjugate to the previous line search direction, i.e. it will not undo progress made
in that direction. At training iteration t, the next search direction d_t takes the
form:
$$d_t = \nabla_\theta J(\theta) + \beta_t d_{t-1}, \tag{8.25}$$
where β_t is a coefficient whose magnitude controls how much of the direction, d_{t−1},
we should add back to the current search direction.
Two directions, d_t and d_{t−1}, are defined as conjugate if d_t^⊤ H(J) d_{t−1} = 0, that is,
$$d_t^\top H\, d_{t-1} = 0. \tag{8.26}$$
The straightforward way to impose conjugacy would involve calculation of the
eigenvectors of
H
to choose
β
t
, which would not satisfy our goal of developing
a method that is more computationally viable than Newton’s method for large
problems. Can we calculate the conjugate directions without resorting to these
calculations? Fortunately the answer to that is yes.
Two popular methods for computing β_t are:
1. Fletcher-Reeves:
$$\beta_t = \frac{\nabla_\theta J(\theta_t)^\top \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^\top \nabla_\theta J(\theta_{t-1})} \tag{8.27}$$
2. Polak-Ribière:
$$\beta_t = \frac{\big(\nabla_\theta J(\theta_t) - \nabla_\theta J(\theta_{t-1})\big)^\top \nabla_\theta J(\theta_t)}{\nabla_\theta J(\theta_{t-1})^\top \nabla_\theta J(\theta_{t-1})} \tag{8.28}$$
For a quadratic surface, the conjugate directions ensure that the gradient along
the previous direction does not increase in magnitude. We therefore stay at the
minimum along the previous directions. As a consequence, in a
k
-dimensional
parameter space, conjugate gradients only requires
k
line searches to achieve the
minimum. The conjugate gradient algorithm is given in Algorithm 8.9.
Nonlinear Conjugate Gradients:
So far we have discussed the method of
conjugate gradients as it is applied to quadratic objective functions. Of course,
our primary interest in this chapter is to explore optimization methods for training
neural networks and other related deep learning models where the corresponding
objective function is far from quadratic. Perhaps surprisingly, the method of
conjugate gradients is still applicable in this setting, though with some modification.
Without any assurance that the objective is quadratic, the conjugate directions are
no longer assured to remain at the minimum of the objective for previous directions.
As a result, the nonlinear conjugate gradients algorithm includes occasional resets
where the method of conjugate gradients is restarted with line search along the
unaltered gradient.
Practitioners report reasonable results in applications of the nonlinear conjugate
gradients algorithm to training neural networks, though it is often beneficial to
Algorithm 8.9 Conjugate gradient method
Require: Initial parameters θ₀
Require: Training set of m examples
Initialize ρ₀ = 0
Initialize g₀ = 0
Initialize t = 1
while stopping criterion not met do
    Initialize the gradient g_t = 0
    Compute gradient: g_t ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute β_t = ((g_t − g_{t−1})^⊤ g_t) / (g_{t−1}^⊤ g_{t−1})  (Polak-Ribière)
    (Nonlinear conjugate gradient: optionally reset β_t to zero, for example if t is a multiple of some constant k, such as k = 5)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: ε* = argmin_ε (1/m) Σ_{i=1}^m L(f(x^(i); θ_t + ε ρ_t), y^(i))
    (On a truly quadratic cost function, analytically solve for ε* rather than explicitly searching for it)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
    t ← t + 1
end while
initialize the optimization with a few iterations of stochastic gradient descent before
commencing nonlinear conjugate gradients. Also, while the (nonlinear) conjugate
gradients algorithm has traditionally been cast as a batch method, minibatch
versions have been used successfully for the training of neural networks (Le et al.,
2011). Adaptations of conjugate gradients specifically for neural networks have
been proposed earlier, such as the scaled conjugate gradients algorithm (Moller,
1993).
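A hedged sketch of nonlinear conjugate gradients with the Polak-Ribière coefficient and periodic restarts follows; the crude backtracking line search and the small denominator guard are simplifications of mine, and `J` and `grad` are placeholder callables for the objective and its gradient.

import numpy as np

def nonlinear_cg(theta, J, grad, n_iters=100, restart_every=5):
    g_prev = grad(theta)
    rho = -g_prev                                    # first direction: steepest descent
    for t in range(1, n_iters + 1):
        # Crude backtracking search for the step size along rho.
        best_eps, eps = 0.0, 1.0
        while eps > 1e-10:
            if J(theta + eps * rho) < J(theta + best_eps * rho):
                best_eps = eps
            eps *= 0.5
        theta = theta + best_eps * rho
        g = grad(theta)
        beta = g @ (g - g_prev) / (g_prev @ g_prev + 1e-12)   # Polak-Ribiere coefficient
        if t % restart_every == 0:
            beta = 0.0                               # occasional reset along the gradient
        rho = -g + beta * rho
        g_prev = g
    return theta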
8.6.3 BFGS
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some
of the advantages of Newton’s method without the computational burden. In that
respect, BFGS is similar to CG. However, BFGS takes a more direct approach to
the approximation of Newton’s update. Recall that Newton’s update is given by
$$\theta^* = \theta_0 - H^{-1}\nabla_\theta J(\theta_0), \tag{8.29}$$
where
H
is the Hessian of
J
with respect to
θ
evaluated at
θ
0
. The primary
computational difficulty in applying Newton’s update is the calculation of the
inverse Hessian H⁻¹. The approach adopted by quasi-Newton methods (of which
the BFGS algorithm is the most prominent) is to approximate the inverse with
a matrix M_t that is iteratively refined by low rank updates to become a better
approximation of H⁻¹.
From Newton’s update, in Eq. 8.29, we can see that the parameters at successive
learning steps t and t + 1 are related via the secant condition (also known as the
quasi-Newton condition):
$$\theta_{t+1} - \theta_t = H^{-1}\big(\nabla_\theta J(\theta_{t+1}) - \nabla_\theta J(\theta_t)\big). \tag{8.30}$$
Eq. 8.30 holds precisely in the quadratic case, or approximately otherwise. The
approximation to the Hessian inverse used in the BFGS procedure is constructed
so as to satisfy this condition, with M in place of H⁻¹. Specifically, M is updated
according to:
$$M_t = M_{t-1} + \left(1 + \frac{\phi^\top M_{t-1}\phi}{\Delta^\top \phi}\right)\frac{\Delta\Delta^\top}{\Delta^\top \phi} - \frac{\Delta\phi^\top M_{t-1} + M_{t-1}\phi\Delta^\top}{\Delta^\top \phi}, \tag{8.31}$$
where g_t = ∇_θ J(θ_t), φ = g_t − g_{t−1}, and ∆ = θ_t − θ_{t−1}. Eq. 8.31 shows that the
BFGS procedure iteratively refines the approximation of the inverse of the Hessian
with low-rank updates. This means that if θ ∈ ℝⁿ, then the computational
complexity of the update is O(n²). The derivation of the BFGS approximation is
given in many textbooks on optimization, including Luenberger (1984).
Once the inverse Hessian approximation M_t is updated, the direction of descent
ρ_t is determined by ρ_t = −M_t g_t. A line search is performed in this direction to
determine the size of the step, ε*, taken in this direction. The final update to the
parameters is given by:
$$\theta_{t+1} = \theta_t + \epsilon^* \rho_t. \tag{8.32}$$
The complete BFGS algorithm is presented in Algorithm 8.10.
Like the method of conjugate gradients, the BFGS algorithm iterates a series of
line searches with the direction incorporating second-order information. However
unlike conjugate gradients, the success of the approach is not heavily dependent
on the line search finding a point very close to the true minimum along the line.
Thus, relative to conjugate gradients, BFGS has the advantage that it can spend
less time refining each line search. On the other hand, the BFGS algorithm must
store the inverse Hessian matrix,
M
, that requires
O
(
n
2
) memory, making BFGS
impractical for most modern deep learning models that typically have millions of
parameters.
Limited Memory BFGS (or L-BFGS)
The memory costs of the BFGS
algorithm can be significantly decreased by avoiding storing the complete inverse
Algorithm 8.10 BFGS method
Require: Initial parameters θ₀
Initialize inverse Hessian M₀ = I
while stopping criterion not met do
    Compute gradient: g_t = ∇_θ J(θ_t)
    Compute φ = g_t − g_{t−1}, ∆ = θ_t − θ_{t−1}
    Approximate H⁻¹: M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(∆^⊤ φ)) (∆∆^⊤)/(∆^⊤ φ) − (∆φ^⊤ M_{t−1} + M_{t−1} φ∆^⊤)/(∆^⊤ φ)
    Compute search direction: ρ_t = −M_t g_t
    Perform line search to find: ε* = argmin_ε J(θ_t + ε ρ_t)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
end while
Hessian approximation M. Alternatively, by replacing M_{t−1} in Eq. 8.31 with
an identity matrix, the BFGS search direction update formula becomes:
$$\rho_t = -g_t + a\Delta + b\phi, \tag{8.33}$$
where the scalars a and b are given by:
$$a = -\left(1 + \frac{\phi^\top \phi}{\Delta^\top \phi}\right)\frac{\Delta^\top g_t}{\Delta^\top \phi} + \frac{\phi^\top g_t}{\Delta^\top \phi}, \tag{8.34}$$
$$b = \frac{\Delta^\top g_t}{\Delta^\top \phi}, \tag{8.35}$$
with φ and ∆ as defined above. If used with exact line searches, the directions
defined by Eq. 8.33 are mutually conjugate. However, unlike the method of
conjugate gradients, this procedure remains well behaved when the minimum of
the line search is reached only approximately. This strategy can be generalized to
include more information about the Hessian by storing previous values of φ and ∆.
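In practice one rarely implements (L-)BFGS by hand; the hedged sketch below calls SciPy's L-BFGS-B implementation on a toy poorly conditioned quadratic. The objective and gradient here are illustrative placeholders, not part of the text.

import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0, 100.0])          # toy quadratic with a poorly conditioned Hessian

def J(theta):
    return 0.5 * theta @ A @ theta

def grad_J(theta):
    return A @ theta

theta0 = np.ones(3)
result = minimize(J, theta0, jac=grad_J, method="L-BFGS-B")
print(result.x)                          # close to the minimizer at the origin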
8.7 Optimization Strategies and Meta-Algorithms
Many optimization techniques are not exactly algorithms, but rather general
templates that can be specialized to yield algorithms, or subroutines that can be
incorporated into many different algorithms.
8.7.1 Batch Normalization
Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent
innovations in optimizing deep neural networks and it is actually not an optimization
algorithm at all. Instead, it is a method of adaptive reparametrization, motivated
by the difficulty of training very deep models.
Very deep models involve the composition of several functions or layers. The
gradient tells how to update each parameter, under the assumption that the other
layers do not change. In practice, we update all of the layers simultaneously.
When we make the update, unexpected results can happen because many functions
composed together are changed simultaneously, using updates that were computed
under the assumption that the other functions remain constant. As a simple
example, suppose we have a deep neural network that has only one unit per layer
and does not use an activation function at each hidden layer:
ŷ = x w₁ w₂ w₃ . . . w_l.
Here, w_i provides the weight used by layer i. The output of layer i is h_i = h_{i−1} w_i.
The output ŷ is a linear function of the input x, but a nonlinear function of the
weights w_i. Suppose our cost function has put a gradient of 1 on ŷ, so we wish to
decrease ŷ slightly. The back-propagation algorithm can then compute a gradient
g = ∇_w ŷ. Consider what happens when we make an update w ← w − εg. The
first-order Taylor series approximation of ŷ predicts that the value of ŷ will decrease
by εg^⊤g. If we wanted to decrease ŷ by 0.1, this first-order information available in
the gradient suggests we could set the learning rate ε to 0.1/(g^⊤g). However, the actual
update will include second-order and third-order effects, on up to effects of order l.
The new value of ŷ is given by
$$x\,(w_1 - \epsilon g_1)(w_2 - \epsilon g_2)\cdots(w_l - \epsilon g_l). \tag{8.36}$$
An example of one second-order term arising from this update is $\epsilon^2 g_1 g_2 \prod_{i=3}^{l} w_i$.
This term might be negligible if $\prod_{i=3}^{l} w_i$ is small, or might be exponentially large
if the weights on layers 3 through l are greater than 1. This makes it very hard
to choose an appropriate learning rate, because the effects of an update to the
parameters for one layer depends so strongly on all of the other layers. Second-order
optimization algorithms address this issue by computing an update that takes these
second-order interactions into account, but we can see that in very deep networks,
even higher-order interactions can be significant. Even second-order optimization
algorithms are expensive and usually require numerous approximations that prevent
them from truly accounting for all significant second-order interactions. Building
an
n
-th order optimization algorithm for
n >
2 thus seems hopeless. What can we
do instead?
Batch normalization provides an elegant way of reparametrizing almost any deep
network. The reparametrization significantly reduces the problem of coordinating
updates across many layers. Batch normalization can be applied to any input
or hidden layer in a network. Let
H
be a minibatch of activations of the layer
to normalize, arranged as a design matrix, with the activations for each example
appearing in a row of the matrix. To normalize H, we replace it with
$$H' = \frac{H - \mu}{\sigma}, \tag{8.37}$$
where
µ
is a vector containing the mean of each unit and
σ
is a vector containing
the standard deviation of each unit. The arithmetic here is based on broadcasting
the vector
µ
and the vector
σ
to be applied to every row of the matrix
H
. Within
each row, the arithmetic is element-wise, so H_{i,j} is normalized by subtracting µ_j
and dividing by σ_j. The rest of the network then operates on H′ in exactly the
same way that the original network operated on H.
At training time,
$$\mu = \frac{1}{m}\sum_i H_{i,:} \tag{8.38}$$
and
$$\sigma = \sqrt{\delta + \frac{1}{m}\sum_i (H - \mu)_i^2}, \tag{8.39}$$
where δ is a small positive value such as 10⁻⁸ imposed to avoid encountering the
undefined gradient of √z at z = 0. Crucially,
we back-propagate through
these operations
for computing the mean and the standard deviation, and for
applying them to normalize
H
. This means that the gradient will never propose
an operation that acts simply to increase the standard deviation or mean of
h
i
; the normalization operations remove the effect of such an action and zero
out its component in the gradient. This was a major innovation of the batch
normalization approach. Previous approaches had involved adding penalties to
the cost function to encourage units to have normalized activation statistics or
involved intervening to renormalize unit statistics after each gradient descent step.
The former approach usually resulted in imperfect normalization and the latter
usually resulted in significant wasted time as the learning algorithm repeatedly
proposed changing the mean and variance and the normalization step repeatedly
undid this change. Batch normalization reparametrizes the model to make some
units always be standardized by definition, deftly sidestepping both problems.
At test time,
µ
and
σ
may be replaced by running averages that were collected
during training time. This allows the model to be evaluated on a single example,
without needing to use definitions of
µ
and
σ
that depend on an entire minibatch.
Revisiting the ŷ = x w₁ w₂ . . . w_l example, we see that we can mostly resolve the
difficulties in learning this model by normalizing h_{l−1}. Suppose that x is drawn
from a unit Gaussian. Then h_{l−1} will also come from a Gaussian, because the
transformation from x to h_l is linear. However, h_{l−1} will no longer have zero mean
and unit variance. After applying batch normalization, we obtain the normalized
ĥ_{l−1} that restores the zero mean and unit variance properties. For almost any
update to the lower layers, ĥ_{l−1} will remain a unit Gaussian. The output ŷ may
then be learned as a simple linear function ŷ = w_l ĥ_{l−1}. Learning in this model is
now very simple because the parameters at the lower layers simply do not have an
effect in most cases; their output is always renormalized to a unit Gaussian. In
some corner cases, the lower layers can have an effect. Changing one of the lower
layer weights to 0 can make the output become degenerate, and changing the sign
of one of the lower weights can flip the relationship between ĥ_{l−1} and y. These
situations are very rare. Without normalization, nearly every update would have
an extreme effect on the statistics of h_{l−1}. Batch normalization has thus made
this model significantly easier to learn. In this example, the ease of learning of
course came at the cost of making the lower layers useless. In our linear example,
the lower layers no longer have any harmful effect, but they also no longer have
any beneficial effect. This is because we have normalized out the first and second
order statistics, which is all that a linear network can influence. In a deep neural
network with nonlinear activation functions, the lower layers can perform nonlinear
transformations of the data, so they remain useful. Batch normalization acts to
standardize only the mean and variance of each unit in order to stabilize learning,
but allows the relationships between units and the nonlinear statistics of a single
unit to change.
Because the final layer of the network is able to learn a linear transformation,
we may actually wish to remove all linear relationships between units within a
layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided
the inspiration for batch normalization. Unfortunately, eliminating all linear
interactions is much more expensive than standardizing the mean and standard
deviation of each individual unit, and so far batch normalization remains the most
practical approach.
Normalizing the mean and standard deviation of a unit can reduce the expressive
power of the neural network containing that unit. In order to maintain the
expressive power of the network, it is common to replace the batch of hidden unit
activations H with γH′ + β rather than simply the normalized H′. The variables
γ and β are learned parameters that allow the new variable to have any mean
and standard deviation. At first glance, this may seem useless—why did we set
the mean to 0, and then introduce a parameter that allows it to be set back to
any arbitrary value
β
? The answer is that the new parametrization can represent
the same family of functions of the input as the old parametrization, but the new
parametrization has different learning dynamics. In the old parametrization, the
mean of
H
was determined by a complicated interaction between the parameters
in the layers below
H
. In the new parametrization, the mean of γH′ + β is
determined solely by β. The new parametrization is much easier to learn with
gradient descent.
Most neural network layers take the form of
φ
(
XW
+
b
) where
φ
is some
fixed nonlinear activation function such as the rectified linear transformation. It
is natural to wonder whether we should apply batch normalization to the input
X
, or to the transformed value
XW
+
b
. Ioffe and Szegedy (2015) recommend
the latter. More specifically,
XW
+
b
should be replaced by a normalized version
of
XW
. The bias term should be omitted because it becomes redundant with
the
β
parameter applied by the batch normalization reparametrization. The input
to a layer is usually the output of a nonlinear activation function such as the
rectified linear function in a previous layer. The statistics of the input are thus
more non-Gaussian and less amenable to standardization by linear operations.
In convolutional networks, described in Chapter 9, it is important to apply the
same normalizing µ and σ at every spatial location within a feature map, so that
the statistics of the feature map remain the same regardless of spatial location.
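As an illustration of this convention (again a sketch rather than the original authors' code; the array layout is an assumption), the per-channel statistics of a convolutional feature map can be computed jointly over the batch and both spatial axes, so every spatial location in a channel shares the same µ and σ:

    import numpy as np

    def conv_batch_norm(F, gamma, beta, eps=1e-5):
        # F: feature maps, shape (batch, height, width, channels)
        mu = F.mean(axis=(0, 1, 2), keepdims=True)    # one mean per channel
        sigma = F.std(axis=(0, 1, 2), keepdims=True)  # one std per channel
        return gamma * (F - mu) / (sigma + eps) + beta

    rng = np.random.default_rng(1)
    F = rng.normal(size=(8, 16, 16, 3))
    out = conv_batch_norm(F, gamma=np.ones(3), beta=np.zeros(3))
    print(out.reshape(-1, 3).mean(axis=0))  # close to zero for every channel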
8.7.2 Coordinate Descent
In some cases, it may be possible to solve an optimization problem quickly by
breaking it into separate pieces. If we minimize f(x) with respect to a single variable
x_i, then minimize it with respect to another variable x_j and so on, repeatedly
cycling through all variables, we are guaranteed to arrive at a (local) minimum.
This practice is known as coordinate descent, because we optimize one coordinate
at a time. More generally, block coordinate descent refers to minimizing with
respect to a subset of the variables simultaneously. The term “coordinate descent”
is often used to refer to block coordinate descent as well as the strictly individual
coordinate descent.
Coordinate descent makes the most sense when the different variables in the
optimization problem can be clearly separated into groups that play relatively
isolated roles, or when optimization with respect to one group of variables is
significantly more efficient than optimization with respect to all of the variables.
For example, consider the cost function
J(H, W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} (X - WH)^2_{i,j}. (8.40)
This function describes a learning problem called sparse coding, where the goal is
to find a weight matrix W that can linearly decode a matrix of activation values
H to reconstruct the training set X. Most applications of sparse coding also
involve weight decay or a constraint on the norms of the columns of W, in order
to prevent the pathological solution with extremely small H and large W.
The function J is not convex. However, we can divide the inputs to the
training algorithm into two sets: the dictionary parameters W and the code
representations H. Minimizing the objective function with respect to either one of
these sets of variables is a convex problem. Block coordinate descent thus gives
us an optimization strategy that allows us to use efficient convex optimization
algorithms, by alternating between optimizing W with H fixed, then optimizing
H with W fixed.
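A minimal sketch of this alternating scheme is shown below (an illustration, not the original text; the shapes, the use of a few proximal gradient (ISTA) steps to approximate the convex H update, and the unconstrained least-squares W step are assumptions, and the norm constraint on W mentioned above is omitted for brevity). It uses the reconstruction X ≈ WH as in Eq. 8.40 written above:

    import numpy as np

    def soft_threshold(A, t):
        # proximal operator of the L1 penalty
        return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

    def sparse_coding_bcd(X, k, n_outer=20, n_ista=50, lam=1.0, seed=0):
        rng = np.random.default_rng(seed)
        m, n = X.shape
        W = rng.normal(size=(m, k))
        H = np.zeros((k, n))
        for _ in range(n_outer):
            # H step: approximately minimize lam*sum|H| + sum (X - WH)^2 over H
            # with W fixed, using a few ISTA iterations
            L = 2.0 * np.linalg.norm(W, 2) ** 2 + 1e-8  # Lipschitz constant of the smooth part
            for _ in range(n_ista):
                grad = 2.0 * W.T @ (W @ H - X)
                H = soft_threshold(H - grad / L, lam / L)
            # W step: minimize sum (X - WH)^2 over W with H fixed (least squares)
            W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T
        return W, H

    X = np.random.default_rng(2).normal(size=(10, 50))
    W, H = sparse_coding_bcd(X, k=5)
    cost = np.abs(H).sum() + ((X - W @ H) ** 2).sum()
    print(cost, (H != 0).mean())  # objective value and fraction of nonzero codes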
Coordinate descent is not a very good strategy when the value of one variable
strongly influences the optimal value of another variable, as in the function
f(x) = (x_1 - x_2)^2 + α(x_1^2 + x_2^2), where α is a positive constant. The first term encourages
the two variables to have similar value, while the second term encourages them
to be near zero. The solution is to set both to zero. Newton’s method can solve
the problem in a single step because it is a positive definite quadratic problem.
However, for small
α
, coordinate descent will make very slow progress because the
first term does not allow a single variable to be changed to a value that differs
significantly from the current value of the other variable.
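The slow progress is easy to see numerically. In the small illustration below (not from the original text), each coordinate update is computed in closed form; after ten full cycles the iterate is still far from the solution (0, 0) when α is small:

    def coordinate_descent(alpha, n_cycles, x1=1.0, x2=1.0):
        # exact coordinate minimization of f(x) = (x1 - x2)^2 + alpha*(x1^2 + x2^2)
        for _ in range(n_cycles):
            x1 = x2 / (1.0 + alpha)  # argmin over x1 with x2 fixed
            x2 = x1 / (1.0 + alpha)  # argmin over x2 with x1 fixed
        return x1, x2

    for alpha in (1.0, 0.1, 0.01):
        x1, x2 = coordinate_descent(alpha, n_cycles=10)
        print(alpha, x1, x2)  # the smaller alpha is, the farther the result is from (0, 0)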
8.7.3 Polyak Averaging
Polyak averaging (Polyak and Juditsky, 1992) consists of averaging together several
points in the trajectory through parameter space visited by an optimization
algorithm. If t iterations of gradient descent visit points θ^{(1)}, . . . , θ^{(t)}, then the
output of the Polyak averaging algorithm is \hat{\theta}^{(t)} = (1/t) \sum_i \theta^{(i)}. On some problem
classes, such as gradient descent applied to convex problems, this approach has
strong convergence guarantees. When applied to neural networks, its justification
is more heuristic, but it performs well in practice. The basic idea is that the
optimization algorithm may leap back and forth across a valley several times
without ever visiting a point near the bottom of the valley. The average of all of
the locations on either side should be close to the bottom of the valley though.
In non-convex problems, the path taken by the optimization trajectory can be
very complicated and visit many different regions. Including points in parameter
space from the distant past that may be separated from the current point by large
barriers in the cost function does not seem like a useful behavior. As a result,
when applying Polyak averaging to non-convex problems, it is typical to use an
exponentially decaying running average:
\hat{\theta}^{(t)} = \alpha \hat{\theta}^{(t-1)} + (1 - \alpha) \theta^{(t)}. (8.41)
The running average approach is used in numerous applications. See Szegedy
et al. (2015) for a recent example.
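Both averaging schemes can be written in a few lines. The sketch below (an illustration with assumed values, not code from the original text) applies the uniform Polyak average and the running average of Eq. 8.41 to the iterates of a noisy gradient descent run on a simple quadratic:

    import numpy as np

    def polyak_average(thetas):
        # uniform average of the visited points, suited to convex problems
        return np.mean(thetas, axis=0)

    def ema_update(theta_avg, theta_new, alpha=0.99):
        # exponentially decaying running average (Eq. 8.41), suited to non-convex problems
        return alpha * theta_avg + (1.0 - alpha) * theta_new

    rng = np.random.default_rng(0)
    theta = np.array([5.0, -3.0])
    theta_avg = theta.copy()
    trajectory = []
    for t in range(1000):
        grad = 2.0 * theta + rng.normal(scale=0.5, size=2)  # noisy gradient of ||theta||^2
        theta = theta - 0.05 * grad
        trajectory.append(theta.copy())
        theta_avg = ema_update(theta_avg, theta)
    print(polyak_average(trajectory), theta_avg)  # both lie close to the optimum [0, 0]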
8.7.4 Supervised Pretraining
Sometimes, directly training a model to solve a specific task can be too ambitious
if the model is complex and hard to optimize or if the task is very difficult. It is
sometimes more effective to train a simpler model to solve the task, then make the
model more complex. It can also be more effective to train the model to solve a
simpler task, then move on to confront the final task. These strategies that involve
training simple models on simple tasks before confronting the challenge of training
the desired model to perform the desired task are collectively known as pretraining.
Greedy algorithms break a problem into many components, then solve for the
optimal version of each component in isolation. Unfortunately, combining the
individually optimal components is not guaranteed to yield an optimal complete
solution. However, greedy algorithms can be computationally much cheaper than
algorithms that solve for the best joint solution, and the quality of a greedy solution
is often acceptable if not optimal. Greedy algorithms may also be followed by a
fine-tuning stage in which a joint optimization algorithm searches for an optimal
solution to the full problem. Initializing the joint optimization algorithm with a
greedy solution can greatly speed it up and improve the quality of the solution it
finds.
Pretraining algorithms, and greedy pretraining algorithms in particular, are ubiquitous in
deep learning. In this section, we specifically describe those pretraining algorithms
that break supervised learning problems into other, simpler supervised learning
problems. This approach is known as greedy supervised pretraining.
In the original (Bengio et al., 2007) version of greedy supervised pretraining,
each stage consists of a supervised learning training task involving only a subset of
the layers in the final neural network. An example of greedy supervised pretraining
[Figure 8.7: four network diagrams, panels (a)-(d), each showing an input x, hidden layers h^{(1)} (and later h^{(2)}), an output y, and weight matrices W^{(1)}, U^{(1)}, W^{(2)}, U^{(2)}.]
Figure 8.7: Illustration of one form of greedy supervised pretraining (Bengio et al., 2007).
(a) We start by training a sufficiently shallow architecture. (b) Another drawing of the
same architecture. (c) We keep only the input-to-hidden layer of the original network and
discard the hidden-to-output layer. We send the output of the first hidden layer as input
to another supervised single hidden layer MLP that is trained with the same objective
as the first network was, thus adding a second hidden layer. This can be repeated for
as many layers as desired. (d) Another drawing of the result, viewed as a feedforward
network. To further improve the optimization, we can jointly fine-tune all the layers,
either only at the end or at each stage of this process.
is illustrated in Fig. 8.7, in which each added hidden layer is pretrained as part of
a shallow supervised MLP, taking as input the output of the previously trained
hidden layer. Instead of pretraining one layer at a time, Simonyan and Zisserman
(2015) pretrain a deep convolutional network (eleven weight layers) and then use
the first four and last three layers from this network to initialize even deeper
networks (with up to nineteen layers of weights). The middle layers of the new,
very deep network are initialized randomly. The new network is then jointly trained.
Another option, explored by Yu et al. (2010), is to use the outputs of the previously
trained MLPs, as well as the raw input, as inputs for each added stage.
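The procedure of Fig. 8.7 can be sketched in a few lines of code. The example below is a schematic illustration only: the data, layer sizes, learning rate, number of stages, and the use of squared error are all assumptions of the sketch, not details from the original experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    def train_stage(inputs, targets, n_hidden, n_steps=2000, lr=0.01):
        # train a one-hidden-layer regression MLP: targets ~ relu(inputs @ W) @ U
        n_in, n_out = inputs.shape[1], targets.shape[1]
        W = rng.normal(scale=0.1, size=(n_in, n_hidden))
        U = rng.normal(scale=0.1, size=(n_hidden, n_out))
        for _ in range(n_steps):
            h = np.maximum(inputs @ W, 0.0)      # hidden layer
            err = h @ U - targets                # squared-error residual
            grad_U = h.T @ err / len(inputs)
            grad_h = err @ U.T * (h > 0)         # backprop through the rectifier
            grad_W = inputs.T @ grad_h / len(inputs)
            U -= lr * grad_U
            W -= lr * grad_W
        return W, U

    # toy data: a nonlinear regression task
    X = rng.normal(size=(256, 4))
    y = np.sin(X.sum(axis=1, keepdims=True))

    # greedy supervised pretraining: each new hidden layer is trained as part of a
    # shallow MLP whose input is the previous hidden layer's output (cf. Fig. 8.7)
    hidden_weights = []
    stage_input = X
    for stage in range(3):
        W, U = train_stage(stage_input, y, n_hidden=16)
        hidden_weights.append(W)                      # keep the input-to-hidden weights
        stage_input = np.maximum(stage_input @ W, 0)  # discard U, feed h to the next stage
    # hidden_weights now initializes a 3-hidden-layer network for joint fine-tuning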
Why would greedy supervised pretraining help? The hypothesis initially
discussed by Bengio et al. (2007) is that it helps to provide better guidance to the
intermediate levels of a deep hierarchy. In general, pretraining may help both in
terms of optimization and in terms of generalization.
An approach related to supervised pretraining extends the idea to the context
of transfer learning: Yosinski et al. (2014) pretrain a deep convolutional net with 8
layers of weights on a set of tasks (a subset of the 1000 ImageNet object categories)
and then initialize a same-size network with the first k layers of the first net. All
the layers of the second network (with the upper layers initialized randomly) are
then jointly trained to perform a different set of tasks (another subset of the 1000
ImageNet object categories), with fewer training examples than for the first set of
tasks. Other approaches to transfer learning with neural networks are discussed in
Sec. 15.2.
Another related line of work is the FitNets (Romero et al., 2015) approach. This
approach begins by training a network that has low enough depth and great enough
width (number of units per layer) to be easy to train. This network then becomes
a teacher for a second network, designated the student. The student network is
much deeper and thinner (eleven to nineteen layers) and would be difficult to train
with SGD under normal circumstances. The training of the student network is
made easier by training the student network not only to predict the output for
the original task, but also to predict the value of the middle layer of the teacher
network. This extra task provides a set of hints about how the hidden layers
should be used and can simplify the optimization problem. Additional parameters
are introduced to regress the middle layer of the 5-layer teacher network from
the middle layer of the deeper student network. However, instead of predicting
the final classification target, the objective is to predict the middle hidden layer
of the teacher network. The lower layers of the student networks thus have two
objectives: to help the outputs of the student network accomplish their task, as
well as to predict the intermediate layer of the teacher network. Although a thin
and deep network appears to be more difficult to train than a wide and shallow
network, the thin and deep network may generalize better and certainly has lower
computational cost if it is thin enough to have far fewer parameters. Without
the hints on the hidden layer, the student network performs very poorly in the
experiments, both on the training and test set. Hints on middle layers may thus
be one of the tools to help train neural networks that otherwise seem difficult to
train, but other optimization techniques or changes in the architecture may also
solve the problem.
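The hint objective can be written schematically as follows. This is a sketch with assumed shapes, with a squared-error task loss standing in for the classification loss; the hypothetical regressor W_hint plays the role of the additional parameters mentioned above.

    import numpy as np

    def fitnets_losses(student_mid, teacher_mid, student_out, targets, W_hint, hint_weight=1.0):
        # task loss: error between the student's output and the targets
        task_loss = np.mean((student_out - targets) ** 2)
        # hint loss: a learned regressor W_hint maps the student's middle layer onto
        # the teacher's middle layer, and the student is trained to match it
        hint_loss = np.mean((student_mid @ W_hint - teacher_mid) ** 2)
        return task_loss + hint_weight * hint_loss

    rng = np.random.default_rng(0)
    student_mid = rng.normal(size=(32, 64))   # student's middle-layer activations
    teacher_mid = rng.normal(size=(32, 128))  # teacher's middle-layer activations
    student_out = rng.normal(size=(32, 10))
    targets = rng.normal(size=(32, 10))
    W_hint = rng.normal(scale=0.1, size=(64, 128))
    print(fitnets_losses(student_mid, teacher_mid, student_out, targets, W_hint))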
8.7.5 Designing Models to Aid Optimization
To improve optimization, the best strategy is not always to improve the optimization
algorithm. Instead, many improvements in the optimization of deep models have
come from designing the models to be easier to optimize.
In principle, we could use activation functions that increase and decrease in
jagged non-monotonic patterns. However, this would make optimization extremely
difficult. In practice, it is more important to choose a model family that
is easy to optimize than to use a powerful optimization algorithm. Most
of the advances in neural network learning over the past 30 years have been
obtained by changing the model family rather than changing the optimization
procedure. Stochastic gradient descent with momentum, which was used to train
neural networks in the 1980s, remains in use in modern state of the art neural
network applications.
Specifically, modern neural networks reflect a design choice to use linear trans-
formations between layers and activation functions that are differentiable almost
everywhere and have significant slope in large portions of their domain. In par-
ticular, model innovations like the LSTM, rectified linear units and maxout units
have all moved toward using more linear functions than previous models like deep
networks based on sigmoidal units. These models have nice properties that make
optimization easier. The gradient flows through many layers provided that the
Jacobian of the linear transformation has reasonable singular values. Moreover,
linear functions consistently increase in a single direction, so even if the model’s
output is very far from correct, it is clear simply from computing the gradient
which direction its output should move to reduce the loss function. In other words,
modern neural nets have been designed so that their local gradient information
corresponds reasonably well to moving toward a distant solution.
Other model design strategies can help to make optimization easier. For
example, linear paths or skip connections between layers reduce the length of
the shortest path from the lower layer’s parameters to the output, and thus
mitigate the vanishing gradient problem (Srivastava et al., 2015). A related idea
to skip connections is adding extra copies of the output that are attached to the
intermediate hidden layers of the network, as in GoogLeNet (Szegedy et al., 2014a)
and deeply-supervised nets (Lee et al., 2014). These “auxiliary heads” are trained
to perform the same task as the primary output at the top of the network in order
to ensure that the lower layers receive a large gradient. When training is complete
the auxiliary heads may be discarded. This is an alternative to the pretraining
strategies, which were introduced in the previous section. In this way, one can
jointly train all the layers in a single phase while changing the architecture, so that
the intermediate layers (especially the lower ones) get some hints about what they
should do via a shorter path. These hints provide an error signal to the lower layers.
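The combined objective can be sketched as a weighted sum of per-head losses, as below. This is an illustration only: the squared-error loss stands in for the actual task loss, and the weight on the auxiliary terms is an assumed value.

    import numpy as np

    def loss_with_auxiliary_heads(primary_out, aux_outs, targets, aux_weight=0.3):
        # every head, auxiliary or primary, is trained on the same task
        primary_loss = np.mean((primary_out - targets) ** 2)
        aux_losses = [np.mean((out - targets) ** 2) for out in aux_outs]
        # the auxiliary terms inject a large gradient directly into the lower layers;
        # after training, the auxiliary heads are simply discarded
        return primary_loss + aux_weight * sum(aux_losses)

    rng = np.random.default_rng(0)
    targets = rng.normal(size=(16, 10))
    primary_out = rng.normal(size=(16, 10))                   # output at the top of the network
    aux_outs = [rng.normal(size=(16, 10)) for _ in range(2)]  # outputs attached to lower layers
    print(loss_with_auxiliary_heads(primary_out, aux_outs, targets))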
8.7.6 Continuation Methods and Curriculum Learning
As argued in Sec. 8.2.7, many of the challenges in optimization arise from the global
structure of the cost function and cannot be resolved merely by making better
estimates of local update directions. The predominant strategy for overcoming this
problem is to attempt to initialize the parameters in a region that is connected
to the solution by a short path through parameter space that local descent can
discover.
Continuation methods are a family of strategies that can make optimization
easier by choosing initial points to ensure that local optimization spends most of
its time in well-behaved regions of space. The idea behind continuation methods is
to construct a series of objective functions over the same parameters. In order to
minimize a cost function J(θ), we will construct new cost functions {J^{(0)}, . . . , J^{(n)}}.
These cost functions are designed to be increasingly difficult, with J^{(0)} being fairly
easy to minimize, and J^{(n)}, the most difficult, being J(θ), the true cost function
motivating the entire process. When we say that J^{(i)} is easier than J^{(i+1)}, we
mean that it is well behaved over more of θ space. A random initialization is more
likely to land in the region where local descent can minimize the cost function
successfully because this region is larger. The series of cost functions are designed
so that a solution to one is a good initial point of the next. We thus begin by
solving an easy problem then refine the solution to solve incrementally harder
problems until we arrive at a solution to the true underlying problem.
Traditional continuation methods (predating the use of continuation methods
for neural network training) are usually based on smoothing the objective function.
See Wu (1997) for an example of such a method and a review of some related
methods. Continuation methods are also closely related to simulated annealing,
which adds noise to the parameters (Kirkpatrick et al., 1983). Continuation
methods have been extremely successful in recent years. See Mobahi and Fisher
(2015) for an overview of recent literature, especially for AI applications.
Continuation methods traditionally were mostly designed with the goal of
overcoming the challenge of local minima. Specifically, they were designed to
reach a global minimum despite the presence of many local minima. To do so,
these continuation methods would construct easier cost functions by “blurring” the
original cost function. This blurring operation can be done by approximating
J^{(i)}(\theta) = \mathbb{E}_{\theta' \sim \mathcal{N}(\theta'; \theta, \sigma^{(i)2})} J(\theta') (8.42)
via sampling. The intuition for this approach is that some non-convex functions
become approximately convex when blurred. In many cases, this blurring preserves
enough information about the location of a global minimum that we can find the
global minimum by solving progressively less blurred versions of the problem. This
approach can break down in three different ways. First, it might successfully define
a series of cost functions where the first is convex and the optimum tracks from
one function to the next arriving at the global minimum, but it might require so
many incremental cost functions that the cost of the entire procedure remains high.
NP-hard optimization problems remain NP-hard, even when continuation methods
are applicable. The other two ways that continuation methods fail both correspond
to the method not being applicable. First, the function might not become convex,
no matter how much it is blurred. Consider, for example, the function J(θ) = −θ^⊤θ.
Second, the function may become convex as a result of blurring, but the minimum
of this blurred function may track to a local rather than a global minimum of the
original cost function.
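A toy version of this scheme is sketched below (an illustration only; the cost function, the blur schedule, and the grid search standing in for local descent are all assumptions of the sketch). The blurred objectives of Eq. 8.42 are estimated by sampling:

    import numpy as np

    rng = np.random.default_rng(0)
    Z = rng.normal(size=500)  # common random numbers for the Monte Carlo estimates

    def J(theta):
        # a one-dimensional non-convex cost with many local minima
        return theta ** 2 + 2.0 * np.sin(5.0 * theta)

    def blurred_J(theta, sigma):
        # Monte Carlo estimate of Eq. 8.42: average J over Gaussian perturbations of theta
        return np.mean(J(theta + sigma * Z))

    # crude continuation: minimize progressively less blurred objectives,
    # warm-starting each stage from the previous stage's solution
    grid = np.linspace(-3.0, 3.0, 601)
    theta = grid[np.argmin([blurred_J(t, sigma=2.0) for t in grid])]
    for sigma in (1.0, 0.5, 0.1, 0.0):
        local = theta + np.linspace(-0.5, 0.5, 201)
        values = [blurred_J(t, sigma) if sigma > 0 else J(t) for t in local]
        theta = local[np.argmin(values)]
    print(theta, J(theta))  # typically ends near the global minimum of J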
Though continuation methods were mostly originally designed to deal with the
problem of local minima, local minima are no longer believed to be the primary
problem for neural network optimization. Fortunately, continuation methods can
still help. The easier objective functions introduced by the continuation method can
eliminate flat regions, decrease variance in gradient estimates, improve conditioning
of the Hessian matrix, or do anything else that will either make local updates
easier to compute or improve the correspondence between local update directions
and progress toward a global solution.
Bengio et al. (2009) observed that an approach called curriculum learning or
shaping can be interpreted as a continuation method. Curriculum learning is based
on the idea of planning a learning process to begin by learning simple concepts
and progress to learning more complex concepts that depend on these simpler
concepts. This basic strategy was previously known to accelerate progress in animal
training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and machine
learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009)
justified this strategy as a continuation method, where earlier J^{(i)} are made easier by
increasing the influence of simpler examples (either by assigning their contributions
to the cost function larger coefficients, or by sampling them more frequently), and
experimentally demonstrated that better results could be obtained by following a
curriculum on a large-scale neural language modeling task. Curriculum learning
has been successful on a wide range of natural language (Spitkovsky et al., 2010;
Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer
vision (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013)
tasks. Curriculum learning was also verified as being consistent with the way in
which humans teach (Khan et al., 2011): teachers start by showing easier and
more prototypical examples and then help the learner refine the decision surface
with the less obvious cases. Curriculum-based strategies are more effective for
teaching humans than strategies based on uniform sampling of examples, and can
also increase the effectiveness of other teaching strategies (Basu and Christensen,
2013).
Another important contribution to research on curriculum learning arose in the
context of training recurrent neural networks to capture long-term dependencies:
Zaremba and Sutskever (2014) found that much better results were obtained with a
stochastic curriculum, in which a random mix of easy and difficult examples is always
presented to the learner, but where the average proportion of the more difficult
examples (here, those with longer-term dependencies) is gradually increased. With
a deterministic curriculum, no improvement over the baseline (ordinary training
from the full training set) was observed.
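One simple way to implement such a stochastic curriculum is sketched below. This is an illustration with an assumed schedule for the proportion of difficult examples, not code from the original experiments.

    import numpy as np

    def stochastic_curriculum_batch(easy_examples, hard_examples, progress, batch_size=32, rng=None):
        # progress in [0, 1]: the average proportion of difficult examples grows over training,
        # but every batch remains a random mix of easy and hard examples
        rng = rng or np.random.default_rng()
        p_hard = 0.1 + 0.8 * progress  # assumed schedule for the hard-example proportion
        take_hard = rng.random(batch_size) < p_hard
        easy_idx = rng.integers(len(easy_examples), size=batch_size)
        hard_idx = rng.integers(len(hard_examples), size=batch_size)
        return np.where(take_hard[:, None], hard_examples[hard_idx], easy_examples[easy_idx])

    rng = np.random.default_rng(0)
    easy = rng.normal(size=(1000, 8))         # e.g. short sequences
    hard = rng.normal(size=(1000, 8)) + 10.0  # e.g. sequences with long-term dependencies
    for progress in (0.0, 0.5, 1.0):
        batch = stochastic_curriculum_batch(easy, hard, progress, rng=rng)
        print(progress, (batch[:, 0] > 5).mean())  # observed fraction of hard examples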
We have now described the basic family of neural network models and how to
regularize and optimize them. In the chapters ahead, we turn to specializations of
the neural network family that allow neural networks to scale to very large sizes and
process input data that has special structure. The optimization methods discussed
in this chapter are often directly applicable to these specialized architectures with
little or no modification.