Chapter 8
Optimization for Training
Deep Models
Deep learning algorithms involve optimization in many contexts. For example, we
often solve optimization problems analytically in order to prove that an algorithm
has a certain property. Inference in a probabilistic model can be cast as an
optimization problem. Of all of the many optimization problems involved in deep
learning, the most difficult is neural network training. It is quite common to
invest days to months of time on hundreds of machines in order to solve even a
single instance of the neural network training problem. Because this problem is
so important and so expensive, a specialized set of optimization techniques has
been developed for solving it. This chapter presents these optimization techniques
for neural network training.
If you’re unfamiliar with the basic principles of gradient-based optimization,
we suggest reviewing Chapter 4. That chapter includes a brief overview of nu-
merical optimization in general.
This chapter focuses on one particular case of optimization: minimizing a cost
function $J(X^{(\text{train})}, \theta)$ with respect to the model parameters $\theta$.
8.1 Optimization for Model Training
Optimization algorithms used for training deep models differ from traditional
optimization algorithms in several ways. Machine learning usually acts indirectly—
we care about some performance measure P that we do not know how to directly
influence, so instead we reduce some cost function J(θ) in hope that it will improve
P . This is in contrast to pure optimization, where minimizing J is a goal in and
of itself. Optimization algorithms for training deep models also typically include
some specialization on the specific structure of machine learning cost functions.
8.1.1 Empirical Risk Minimization
Suppose that we have input features x, targets y, and some loss function L(x, y).
Our ultimate goal is to minimize $\mathbb{E}_{(x,y)\sim p(x,y)}[L(x, y)]$. This quantity is known as
the risk. If we knew the true distribution p(x, y), this would be an optimization
task solvable by an optimization algorithm. However, when we do not know
p(x, y) but only have a training set of samples from it, we have a machine learning
problem.
The simplest way to convert a machine learning problem back into an opti-
mization problem is to minimize the expected loss on the training set. This means
replacing the true distribution p(x, y) with the empirical distribution ˆp(x, y) de-
fined by the training set. We now minimize the empirical risk
$$\mathbb{E}_{(x,y)\sim \hat{p}(x,y)}[L(x, y)] = \frac{1}{m}\sum_{i=1}^{m} L(x^{(i)}, y^{(i)})$$
where m is the number of training examples.
This process is known as empirical risk minimization. In this setting, machine
learning is still very similar to straightforward optimization. Rather than opti-
mizing the risk directly, we optimize the empirical risk, and hope that the risk
decreases significantly as well. A variety of theoretical results establish conditions
under which the true risk can be expected to decrease by various amounts.
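As a concrete illustration, the short sketch below (a rough example assuming NumPy and a hypothetical squared-error loss with a linear predictor, not any particular model from this book) computes the empirical risk as a simple average of per-example losses over the training set:

import numpy as np

def squared_error(y_hat, y):
    # A per-example loss L(x, y); any differentiable loss could be substituted.
    return 0.5 * (y_hat - y) ** 2

def empirical_risk(predict, X_train, y_train):
    # Average the per-example losses over the m training examples, i.e. the
    # expectation of L under the empirical distribution defined by the training set.
    losses = [squared_error(predict(x), y) for x, y in zip(X_train, y_train)]
    return np.mean(losses)

# Hypothetical usage with a linear predictor.
w = np.array([0.5, -1.0])
X_train = np.random.randn(100, 2)
y_train = X_train @ np.array([1.0, -2.0])
print(empirical_risk(lambda x: w @ x, X_train, y_train))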
However, empirical risk minimization is prone to overfitting. Models with
high capacity can simply memorize the training set. In many cases, empirical
risk minimization is not really feasible. The most effective modern optimization
algorithms are based on gradient descent, but many useful loss functions, such
as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined
everywhere). These two problems mean that, in the context of deep learning, we
rarely use empirical risk minimization. Instead, we must use a slightly different
approach, in which the quantity that we actually optimize is even more different
from the quantity that we truly want to optimize.
TODO– make sure 0-1 loss is defined and in the index
8.1.2 Surrogate Loss Functions
TODO–coordinate with Yoshua / coordinate with MLP / ML chapters do we use
term loss function = map from a specific example to a real number or do we use it
interchangeably with objective function / cost function? it seems some literature
uses ”loss function” in a very general sense while others use it to mean specifically
a single-example cost that you can take the expectation of, etc. this terminology
seems a bit sub-optimal since it relies a lot on using English words with essentially
the same meaning to represent different things with precise technical meanings
are ”surrogate loss functions” specifically replacing the cost for an individual
examples, or does this also include things like minimizing the empirical risk rather
than the true risk, adding a regularization term to the likelihood terms, etc.?
TODO– in some cases, surrogate loss function actually results in being able
to learn more. for example, test 0-1 loss continues to decrease for a long time
after train 0-1 loss has reached zero when training using log likelihood surrogate
In some cases, using a surrogate loss function actually allows us to learn more:
for example, the test 0-1 loss often continues to decrease for a long time after the
train 0-1 loss has reached zero when training with a log-likelihood surrogate.
8.1.3 Generalization
TODO– SGD on an infinite dataset optimizes the generalization error directly
(note that SGD is not introduced until later so this will need to be presented
carefully)
A very important difference between optimization in general and optimization
as we use it for training algorithms is that training algorithms do not usually
halt at a local minimum. Instead, using a regularization method known as early
stopping (see Sec. 7.7), they halt whenever overfitting begins to occur. This
is often in the middle of a wide, flat region, but it can also occur on a steep
part of the surrogate loss function. This is in contrast to general optimization,
where convergence is usually defined by arriving at a point that is very near a (local)
minimum.
8.1.4 Batches and Minibatches
One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a
sum over the training examples. Optimization algorithms for machine learning
typically compute each update to the parameters based on a subset of the terms
of the objective function, not based on the complete objective function itself.
For example, maximum likelihood estimation problems decompose into a sum
over each example: TODO equation, using same format as in original maximum
likelihood section, which isn’t written yet
TODO stochastic gradient descent see 8.3.2
TODO examples can be redundant, so best computational efficiency comes
from minibatches TODO too small of batch size -> bad use of multicore architectures
TODO issues with Hessian being different for different batches, etc. TODO
importance of shuffling; shuffle-once versus shuffle per epoch; todo: define the term
"epoch"
8.1.5 Data Parallelism
TODO asynch implementations, hogwild, distbelief
8.2 Challenges in Optimization
8.2.1 Local Minima
TODO check whether this is already covered in numerical.tex
8.2.2 Ill-Conditioning
TODO this is definitely already covered in numerical.tex
8.2.3 Plateaus, Saddle Points, and Other Flat Regions
The long-held belief that neural networks are hopeless to train because they are
fraught with local minima has been one of the reasons for the “neural networks
winter” in the 1995-2005 decade. Indeed, one can show that there may be an
exponentially large number of local minima, even in the simplest neural network
optimization problems (Sontag and Sussmann, 1989; Brady et al., 1989; Gori and
Tesi, 1992).
Theoretical work has shown that saddle points (and the flat regions surround-
ing them) are important barriers to training neural networks, and may be more
important than local minima.
Explain (Dauphin et al., 2014; Choromanska et al., 2014)
8.2.4 Cliffs and Exploding Gradients
Whereas the issues of ill-conditioning and saddle points discussed in the previous
sections arise because of the second-order structure of the objective function (as
a function of the parameters), neural networks involve stronger non-linearities
which do not fit well with this picture. In particular, the second-order Taylor
series approximation of the objective function yields a symmetric view of the
landscape around the minimum, oriented according to the axes defined by the
principal eigenvectors of the Hessian matrix. (TODO: REFER TO A PLOT
FROM THE ILL-CONDITIONING SECTION WITH CONTOURS OF VAL-
LEY). Second-order methods and momentum or gradient-averaging methods in-
troduced in Section 8.4 are able to reduce the difficulty due to ill-conditioning by
increasing the size of the steps in the low-curvature directions (the “valley”, in
Figure 8.1) and decreasing the size of the steps in the high-curvature directions
(the steep sides of the valley, in the figure).
Figure 8.1: The traditional view of the optimization difficulty in neural networks is
inspired by the ill-conditioning problem in quadratic optimization: some directions have
a high curvature (second derivative), corresponding to the rising sides of the valley, and
other directions have a low curvature, corresponding to the smooth slope of the valley.
Most second-order methods, as well as momentum or gradient averaging methods, are
meant to address that problem by increasing the step size in the direction of the valley
(where it pays off most in the long run to move) and decreasing it in the directions of steep
rise, which would otherwise lead to oscillations. The objective is to descend smoothly,
staying at the bottom of the valley.
However, although classical second order methods can help, as shown in Fig-
ure 8.2, due to higher order derivatives, the objective function may have a lot
more non-linearity, which often does not have the nice symmetrical shapes that
the second-order “valley” picture builds in our mind. Instead, there are cliffs
where the gradient rises sharply. When the parameters approach a cliff region,
the gradient update step can move the learner towards a very bad configuration,
ruining much of the progress made during recent training iterations.
Figure 8.2: Contrary to what is shown in Figure 8.1, the cost function for highly non-
linear deep neural networks or for recurrent neural networks is typically not made of
symmetrical sides. As shown in the figure, there are sharp non-linearities that give rise
to very high derivatives in some places. When the parameters get close to such a cliff
region, a gradient descent update can catapult the parameters very far, possibly ruining
a lot of the optimization work that had been done. Figure graciously provided by Razvan
Pascanu (Pascanu, 2014).
As illustrated in Figure 8.3, the cliff can be dangerous whether we approach it
from above or from below, but fortunately there are some fairly straightforward
heuristics that allow one to avoid its most serious consequences. The basic idea
is to limit the size of the jumps that one would make. Indeed, one should keep
in mind that when we use the gradient to make an update of the parameters, we
are relying on the assumption of infinitesimal moves. There is no guarantee that
making a finite step of the parameters θ in the direction of the gradient will yield
an improvement. The only thing that is guaranteed is that a small enough step
in that direction will be helpful. As we can see from Figure 8.3, in the presence
of a cliff (and in general in the presence of very large gradients), the decrease
in the objective function expected from going in the direction of the gradient is
only valid for a very small step. In fact, because the objective function is usually
bounded in its actual value (within a finite domain), when the gradient is large at
θ, it typically only remains like this (especially, keeping its sign) in a small region
around θ. Otherwise, the value of the objective function would have to change
a lot: if the slope was consistently large in some direction as we would move in
that direction, we would be able to decrease the objective function value by a
very large amount by following it, simply because the total change is the integral
over some path of the directional derivatives along that path.
Figure 8.3: To address the presence of cliffs such as shown in Figure 8.2, a useful heuristic
is to clip the magnitude of the gradient, only keeping its direction if its magnitude is above
a threshold (which is a hyperparameter, although not a very critical one). This helps to
avoid the destructive big moves which would happen when approaching the cliff, either
from above or from below. Figure graciously provided by Razvan Pascanu (Pascanu,
2014).
The gradient clipping heuristics are described in more detail in Section 10.7.6.
The basic idea is to bound the magnitude of the update step, i.e., not trust the
gradient too much when it is very large in magnitude. The context in which such
cliffs have been shown to arise in particular is that of recurrent neural networks,
when considering long sequences, as discussed in the next section.
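As a rough illustration only (the actual heuristics are described in Section 10.7.6), the sketch below rescales a gradient whose norm exceeds a hypothetical threshold before applying an ordinary update, keeping the gradient's direction but bounding the size of the move:

import numpy as np

def clipped_update(theta, grad, learning_rate=0.01, clip_norm=1.0):
    # If the gradient norm exceeds the threshold, rescale the gradient so its
    # magnitude equals the threshold while keeping its direction.
    grad_norm = np.linalg.norm(grad)
    if grad_norm > clip_norm:
        grad = grad * (clip_norm / grad_norm)
    # Ordinary gradient step with the (possibly clipped) gradient.
    return theta - learning_rate * grad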
8.2.5 Vanishing and Exploding Gradients - An Introduction to
the Issue of Learning Long-Term Dependencies
Parametrized dynamical systems such as recurrent neural networks (Chapter 10)
face a particular optimization problem which is different but related to that of
training very deep networks. We introduce this issue here and refer the reader to
Section 10.7 for a deeper treatment along with a discussion of approaches that
have been proposed to reduce this difficulty.
Exploding or Vanishing Product of Jacobians
The simplest explanation of the problem, which is shared among very deep nets
and recurrent nets, is that in both cases the final output is the composition of a
large number of non-linear transformations. Even though each of these non-linear
stages may be relatively smooth (e.g. the composition of an affine transformation
with a hyperbolic tangent or sigmoid), their composition is going to be much
“more non-linear”, in the sense that derivatives through the whole composition
will tend to be either very small or very large, with more ups and downs. TODO:
the phrase "ups and downs" has a connotation of specifically good things and bad
things happening over time, use a different phrase. this section is also sloppily
conflating many different ideas, just having both areas of large derivatives and
small derivatives does not mean there are lots of ups and downs, consider the
function $(1 - xy)^2$, which has small derivatives near the origin and the global
minima, much larger derivatives in between.
This arises simply because the Jacobian (matrix of derivatives) of a composition
is the product of the Jacobians of each stage, i.e., if
$$f = f_T \circ f_{T-1} \circ \cdots \circ f_2 \circ f_1$$
then the Jacobian matrix of derivatives of f(x) with respect to its input vector x
is the product
$$f' = f'_T \, f'_{T-1} \cdots f'_2 \, f'_1 \qquad (8.1)$$
where
$$f' = \frac{\partial f(x)}{\partial x} \qquad \text{and} \qquad f'_t = \frac{\partial f_t(a_t)}{\partial a_t}$$
where $a_t = f_{t-1}(f_{t-2}(\dots f_2(f_1(x))))$, i.e. composition has been replaced by
matrix multiplication. This is illustrated in Figure 8.4. TODO: the above sentence
is incredibly long, split it up and probably put the definitions in the opposite order.
It also seems strange to say composition is "replaced" by matrix multiplication,
more like composition in forward prop implies matrix multiplication in backprop
Figure 8.4: When composing many non-linearities (like the activation non-linearity in a
deep or recurrent neural network), the result is highly non-linear, typically with most
of the values associated with a tiny derivative, some values with a large derivative, and
many ups and downs (not shown here).
TODO: can we avoid using capital letters for scalars here? (T) In the scalar
case, we can imagine that multiplying many numbers together tends to produce a
result that is either very large or very small. In the special case where all the numbers
in the product have the same value $\alpha$, this is obvious, since $\alpha^T$ goes to 0 if $\alpha < 1$ and goes to
$\infty$ if $\alpha > 1$, as $T$ increases. The more general case of non-identical numbers can be
understood by taking the logarithm of these numbers, considering them to be
random, and computing the variance of the sum of these logarithms. Clearly,
although some cancellation can happen, the variance grows with $T$, and in fact
if those numbers are independent, the variance grows linearly with $T$, i.e., the
size of the sum (which is the standard deviation) grows as $\sqrt{T}$, which means that
the product grows roughly as $e^{\sqrt{T}}$ (consider the variance of a log-normal variate $X$
if $\log X$ is normal with mean 0 and variance $T$).
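This argument is easy to check numerically; the following sketch (assuming NumPy) multiplies T random positive factors whose logarithms have mean zero and shows that the product is almost always either tiny or enormous:

import numpy as np

rng = np.random.default_rng(0)
for T in [10, 50, 100, 200]:
    # The log of the product is a sum of T i.i.d. terms with mean 0, so its
    # standard deviation grows as sqrt(T) and the product itself is almost
    # always either very small or very large.
    factors = np.exp(rng.normal(loc=0.0, scale=1.0, size=T))
    print(T, np.prod(factors))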
It would be interesting to push this analysis to the case of multiplying square
matrices instead of multiplying numbers, but one might expect qualitatively sim-
ilar conclusions, i.e., the size of the product somehow grows with the number of
matrices, and that it grows exponentially. In the case of matrices, one can get
a new form of cancellation due to leading eigenvectors being well aligned or not.
The product of matrices will blow up only if, among their leading eigenvectors
with eigenvalue greater than 1, there is enough “in common” (in the sense of the
appropriate dot products of leading eigenvectors of one matrix and another).
However, this analysis was for the case where these numbers are independent.
In the case of an ordinary recurrent neural network (developed in more detail in
Chapter 10), these Jacobian matrices are highly related to each other. Each layer-
wise Jacobian is actually the product of two matrices: (a) the recurrent matrix W
and (b) the diagonal matrix whose entries are the derivatives of the non-linearities
associated with the hidden units, which vary depending on the time step. This
makes it likely that successive Jacobians have similar eigenvectors, making the
product of these Jacobians explode or vanish even faster.
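A similar effect is easy to observe with matrices; in the sketch below (assuming NumPy, and using a single fixed random matrix as a crude stand-in for the successive Jacobians), the norm of the product shrinks or blows up roughly exponentially with the number of factors, depending on whether the spectral radius sits below or above 1:

import numpy as np

def product_norm(scale, T=50, dim=10, seed=0):
    # W is a stand-in for one factor in a product of Jacobians; its spectral
    # radius is roughly `scale` for this random construction.
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
    P = np.eye(dim)
    for _ in range(T):
        P = W @ P
    return np.linalg.norm(P)

print(product_norm(scale=0.9))   # norm of the product typically shrinks toward 0
print(product_norm(scale=1.1))   # norm of the product typically blows up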
Consequence for Recurrent Networks: Difficulty of Learning Long-
Term Dependencies
The consequence of the exponential convergence of these products of Jacobians
towards either very small or very large values is that it makes the learning of
long-term dependencies particularly difficult, as we explain below; this issue was
independently identified by Hochreiter (1991) and Bengio et al. (1993, 1994).
TODO: why the capital F? Can we use lowercase instead? This also appears
in rnn.tex
Consider a fairly general parametrized dynamical system (which includes
classical recurrent networks as a special case, as well as all their known
variants), processing a sequence of inputs, $x_1, \dots, x_t, \dots$, involving iterating over
the transition operator:
$$s_t = F_\theta(s_{t-1}, x_t) \qquad (8.2)$$
where $s_t$ is called the state of the system and $F_\theta$ is the recurrent function that
maps the previous state and current input to the next state. The state can be
used to produce an output via an output function,
$$o_t = g_\omega(s_t), \qquad (8.3)$$
TODO: could we avoid the capital T? and a loss $L_t$ is computed at each time step
$t$ as a function of $o_t$ and possibly of some targets $y_t$. Let us consider the gradient
of a loss $L_T$ at time $T$ with respect to the parameters $\theta$ of the recurrent function
$F_\theta$. One particular way to decompose the gradient $\frac{\partial L_T}{\partial \theta}$ using the chain rule is the
following:
$$\frac{\partial L_T}{\partial \theta} = \sum_{t \le T} \frac{\partial L_T}{\partial s_t}\,\frac{\partial s_t}{\partial \theta}
= \sum_{t \le T} \frac{\partial L_T}{\partial s_T}\,\frac{\partial s_T}{\partial s_t}\,\frac{\partial F_\theta(s_{t-1}, x_t)}{\partial \theta} \qquad (8.4)$$
where the last Jacobian matrix only accounts for the immediate effect of $\theta$ as a
parameter of $F_\theta$ when computing $s_t = F_\theta(s_{t-1}, x_t)$, i.e., not taking into account
the indirect effect of $\theta$ via $s_{t-1}$ (otherwise there would be double counting and the
result would be incorrect). To see that this decomposition is correct, please refer
to the notions of gradient computation in a flow graph introduced in Section 6.4,
and note that we can construct a graph in which $\theta$ influences each $s_t$, each of
which influences $L_T$ via $s_T$. Now let us note that each Jacobian matrix $\frac{\partial s_T}{\partial s_t}$ can
be decomposed as follows:
$$\frac{\partial s_T}{\partial s_t} = \frac{\partial s_T}{\partial s_{T-1}}\,\frac{\partial s_{T-1}}{\partial s_{T-2}} \cdots \frac{\partial s_{t+1}}{\partial s_t} \qquad (8.5)$$
which is of the same form as Eq. 8.1 discussed above, i.e., which tends to either
vanish or explode.
As a consequence, we see from Eq. 8.4 that $\frac{\partial L_T}{\partial \theta}$ is a weighted sum of terms
over spans $T - t$, with weights that are exponentially smaller (or larger) for longer-
term dependencies relating the state at $t$ to the state at $T$. As shown in Bengio
et al. (1994), in order for a recurrent network to reliably store memories, the
Jacobians $\frac{\partial s_t}{\partial s_{t-1}}$ relating each state to the next must have a determinant that is less
than 1 (i.e., yielding to the formation of attractors in the corresponding dynamical
system). Hence, when the model is able to capture long-term dependencies
it is also in a situation where gradients vanish and long-term dependencies have
an exponentially smaller weight than short-term dependencies in the total gradient.
It does not mean that it is impossible to learn, but that it might take a
very long time to learn long-term dependencies, because the signal about these
dependencies will tend to be hidden by the smallest fluctuations arising from
short-term dependencies. In practice, the experiments in Bengio et al. (1994)
show that as we increase the span of the dependencies that need to be captured,
gradient-based optimization becomes increasingly difficult, with the probability
of successful learning rapidly reaching 0 after only 10 or 20 steps in the case of
the vanilla recurrent net and stochastic gradient descent (Section 8.3.2).
For a deeper treatment of the dynamical systems view of recurrent networks,
consider Doya (1993); Bengio et al. (1994); Siegelmann and Sontag (1995), with a
review in Pascanu et al. (2013a). Section 10.7 discusses various approaches that
have been proposed to reduce the difficulty of learning long-term dependencies
(in some cases allowing one to reach hundreds of steps), but it remains one of
the main challenges in deep learning.
8.3 Optimization Algorithms
In Sec. 6.4, we discussed the backpropagation algorithm (backprop), that is,
how to efficiently compute the gradient of the loss with respect to the model
parameters. The backpropagation algorithm does not specify how we use this
gradient to update the weights of the model.
In this section we introduce a number of gradient-based learning algorithms
that have been proposed to optimize the parameters of deep learning models.
8.3.1 Gradient Descent
Gradient descent is the most basic gradient-based algorithm one might apply to
train a deep model. The algorithm involves updating the model parameters θ (in
the case of a deep neural network, these parameters would include the weights
and biases associated with each layer) with a small step in the direction of the
negative gradient of the loss function (including any regularization terms). For the case of
supervised learning with data pairs $[x^{(t)}, y^{(t)}]$ we have:
$$\theta \leftarrow \theta - \eta\, \nabla_\theta \sum_t L(f(x^{(t)}; \theta), y^{(t)}), \qquad (8.6)$$
where $\eta$ is the learning rate, an optimization hyperparameter that controls the size
of the step the parameters take in the direction of the negative gradient. Of course,
following the negative gradient in this way is only guaranteed to reduce the loss in the
limit as $\eta \to 0$.
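To make the update rule of Eq. 8.6 concrete, here is a minimal sketch (assuming NumPy and a simple linear least-squares model, not any specific model from this book) of batch gradient descent:

import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, n_steps=100):
    # Fit a linear model f(x; theta) = x . theta by repeatedly stepping in the
    # direction of the negative gradient of the mean squared error over the
    # whole training set (a batch version of Eq. 8.6, with the sum rescaled by 1/n).
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_steps):
        residual = X @ theta - y
        grad = X.T @ residual / n      # gradient of 0.5 * mean(residual**2)
        theta = theta - learning_rate * grad
    return theta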
TODO: orphaned phrase on next line, was this a note for a thing to do or was
it a typo? learning rates
8.3.2 Stochastic Gradient Descent
One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a
sum over the training examples. Optimization algorithms for machine learning
typically compute each update to the parameters based on a subset of the terms
of the objective function, not based on the complete objective function itself.
For example, maximum likelihood estimation problems decompose into a sum
over each example:
TODO equation, using same format as in original maximum likelihood section,
which isn’t written yet
More generally, we are really interested in minimizing the expected loss with the
expectation taken with respect to the data distribution, i.e. $\mathbb{E}_{X,Y}[L(f_\theta(X), Y)]$,
with $(X, Y) \sim P(X, Y)$. As discussed in Sec. 6.2, we replace this expectation
with an average over the training data (e.g., for n examples):
$$\text{expected loss} = \frac{1}{n}\sum_{t=1}^{n} L(f(x^{(t)}), y^{(t)}) \qquad (8.7)$$
This form of the loss implies that the gradient also consists of an average of the
gradient contributions for each data point:
$$\nabla_\theta\, \text{expected loss} = \frac{1}{n}\sum_{t=1}^{n} \nabla_\theta L(f(x^{(t)}), y^{(t)}) \qquad (8.8)$$
We can now interpret the right hand side of Eqn. 8.8 as an estimator of the
gradient of the expected loss. Seen in this light, it is reasonable to think about the
properties of this estimator, such as its mean and variance.
Provided that there are a relatively large number of examples in the training
set, computing the gradient over all examples in the training dataset, also known
as batch gradient descent, would yield a relatively small variance on the estimate.
In application to training deep learning models, straightforward gradient descent,
where each gradient step involves computing the gradient for all training
examples, is well known to be inefficient. This is especially true when we are
dealing with large datasets.
TODO stochastic gradient descent
TODO examples can be redundant, so best computational efficiency comes
from minibatches TODO too small of batch size -> bad use of multicore architectures
TODO issues with Hessian being different for different batches, etc.
TODO importance of shuffling; shuffle-once versus shuffle per epoch; todo: define
the term "epoch"
discuss learning rates.
Algorithm 8.1 Stochastic gradient descent (SGD) update at time t
Require: Learning rate η.
Require: Initial parameter θ
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Set g = 0
  for t = 1 to m do
    Compute gradient estimate: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Apply update: θ ← θ − ηg
end while
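A rough NumPy sketch of Algorithm 8.1 follows; grad_loss is a hypothetical helper that returns the per-example gradient ∇_θ L(f(x; θ), y) for whatever model is being trained:

import numpy as np

def sgd(theta, grad_loss, X, y, learning_rate=0.01, batch_size=32, n_epochs=10):
    # grad_loss(theta, x, y) is assumed to return the per-example gradient
    # of L(f(x; theta), y) with respect to theta.
    n = len(y)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Sum the per-example gradients over the minibatch, as in Algorithm 8.1.
            g = sum(grad_loss(theta, X[i], y[i]) for i in idx)
            theta = theta - learning_rate * g
    return theta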
8.3.3 Momentum
While stochastic gradient descent remains a very popular optimization strategy,
learning with it can sometimes be slow. This is especially true in situations
where the gradient is small, but consistent across minibatches. From the
consistency of the gradient, we know that we can afford to take larger steps in this
direction, yet we have no way of really knowing when we are in such situations.
The momentum method (Polyak, 1964) is designed to accelerate learning, especially
in the face of small and consistent gradients. The intuition behind momentum,
as the name suggests, is derived from a physical interpretation of the
optimization process. Imagine you have a small ball (think marble) that represents
the current position in parameter space (for our purposes here we can
imagine a 2-D parameter space). Now consider that the ball is on a gentle slope:
while the instantaneous force pulling the ball downhill is relatively small at any
moment, these contributions accumulate and the downhill velocity of the ball gradually
increases over time. The momentum method is designed to inject this kind of
downhill acceleration into gradient-based optimization. TODO: explain effect of
drag on the marble and why it should be linear drag
Formally, we introduce a variable v that plays the role of velocity (or momentum)
that accumulates the gradient. The update rule is given by:
$$v \leftarrow \alpha v - \eta\, \nabla_\theta \left( \frac{1}{n}\sum_{t=1}^{n} L(f(x^{(t)}; \theta), y^{(t)}) \right)$$
$$\theta \leftarrow \theta + v$$
where α ∈ [0, 1) is the momentum parameter and η is the learning rate.
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate η, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Set g = 0
  for t = 1 to m do
    Compute gradient estimate: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Compute velocity update: v ← αv − ηg
  Apply update: θ ← θ + v
end while
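The body of Algorithm 8.2 might look like the following sketch, assuming a hypothetical minibatch_gradient helper that returns the summed gradient over the sampled minibatch:

def momentum_step(theta, v, minibatch_gradient, learning_rate=0.01, alpha=0.9):
    # The velocity decays by alpha and is pushed by the negative gradient;
    # the parameters then move along the velocity, as in Algorithm 8.2.
    g = minibatch_gradient(theta)
    v = alpha * v - learning_rate * g
    theta = theta + v
    return theta, v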
Nesterov momentum: Sutskever et al. (2013) introduced a variant of the momentum
algorithm that was inspired by Nesterov's accelerated gradient method:
$$v \leftarrow \alpha v - \eta\, \nabla_\theta \left( \frac{1}{n}\sum_{t=1}^{n} L(f(x^{(t)}; \theta + \alpha v), y^{(t)}) \right),$$
$$\theta \leftarrow \theta + v,$$
where the parameters α and η play a similar role as in the standard momentum
method. The difference between Nesterov momentum and standard momentum is
where the gradient is evaluated.
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate η, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Apply interim update: θ ← θ + αv
  Set g = 0
  for t = 1 to m do
    Compute gradient (at interim point): g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Compute velocity update: v ← αv − ηg
  Apply update: θ ← θ + v
end while
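The only difference from the standard momentum step is the interim point at which the gradient is evaluated, as in this sketch (same hypothetical minibatch_gradient helper as above):

def nesterov_momentum_step(theta, v, minibatch_gradient, learning_rate=0.01, alpha=0.9):
    # Evaluate the gradient at the looked-ahead point theta + alpha * v
    # rather than at theta itself, as in Algorithm 8.3.
    g = minibatch_gradient(theta + alpha * v)
    v = alpha * v - learning_rate * g
    theta = theta + v
    return theta, v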
8.3.4 Adagrad
Algorithm 8.4 The Adagrad algorithm
Require: Global learning rate η
Require: Initial parameter θ
Initialize gradient accumulation variable r = 0
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Set g = 0
  for t = 1 to m do
    Compute gradient: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Accumulate gradient: r ← r + g²
  Compute update: ∆θ ← −(η/√r) ⊙ g   % (1/√r applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
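A per-update sketch of Algorithm 8.4 follows (assuming NumPy arrays and a precomputed minibatch gradient g; the small constant delta is an assumption added here only to avoid division by zero):

import numpy as np

def adagrad_step(theta, r, g, learning_rate=0.01, delta=1e-8):
    # Accumulate element-wise squared gradients over all updates so far.
    r = r + g * g
    # Scale each coordinate's step inversely to the square root of its history.
    delta_theta = -learning_rate * g / (np.sqrt(r) + delta)
    return theta + delta_theta, r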
8.3.5 RMSprop
Algorithm 8.5 The RMSprop algorithm
Require: Global learning rate η, decay rate ρ.
Require: Initial parameter θ
Initialize accumulation variable r = 0
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Set g = 0
  for t = 1 to m do
    Compute gradient: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Accumulate gradient: r ← ρr + (1 − ρ)g²
  Compute parameter update: ∆θ = −(η/√r) ⊙ g   % (1/√r applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
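Algorithm 8.5 changes only the accumulation into an exponentially decaying average; a sketch under the same assumptions as the Adagrad snippet above:

import numpy as np

def rmsprop_step(theta, r, g, learning_rate=0.001, rho=0.9, delta=1e-8):
    # Exponentially decaying average of element-wise squared gradients.
    r = rho * r + (1.0 - rho) * g * g
    delta_theta = -learning_rate * g / (np.sqrt(r) + delta)
    return theta + delta_theta, r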
8.3.6 Adadelta
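Adadelta (Algorithm 8.7, which appears a few pages later) replaces the global learning rate with a ratio of two decaying averages, one of squared gradients and one of squared updates; a rough per-update sketch under the same assumptions as the previous snippets:

import numpy as np

def adadelta_step(theta, r, s, g, rho=0.95, eps=1e-6):
    # r accumulates squared gradients, s accumulates squared updates.
    r = rho * r + (1.0 - rho) * g * g
    delta_theta = -np.sqrt(s + eps) / np.sqrt(r + eps) * g
    s = rho * s + (1.0 - rho) * delta_theta * delta_theta
    return theta + delta_theta, r, s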
8.3.7 No Pesky Learning Rates
Algorithm 8.6 RMSprop algorithm with Nesterov momentum
Require: Global learning rate η, decay rate ρ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Compute interim update: θ ← θ + αv
  Set g = 0
  for t = 1 to m do
    Compute gradient: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Accumulate gradient: r ← ρr + (1 − ρ)g²
  Compute velocity update: v ← αv − (η/√r) ⊙ g   % (1/√r applied element-wise)
  Apply update: θ ← θ + v
end while
8.4 Approximate Natural Gradient and Second-Order
Methods
8.5 Conjugate Gradients
8.6 BFGS
TONGA and “actual” NG, links with HF.
8.6.1 New
8.6.2 Optimization Strategies and Meta-Algorithms
8.6.3 Coordinate Descent
In some cases, it may be possible to solve an optimization problem quickly by
breaking it into separate pieces. If we minimize f(x) with respect to a single
variable $x_i$, then minimize it with respect to another variable $x_j$, and so on, we are
guaranteed to arrive at a (local) minimum. This practice is known as coordinate
descent, because we optimize one coordinate at a time. More generally, block
coordinate descent refers to minimizing with respect to a subset of the variables
simultaneously. The term "coordinate descent" is often used to refer to block
coordinate descent as well as the strictly individual coordinate descent.
Algorithm 8.7 The Adadelta algorithm
Require: Decay rate ρ, constant ε
Require: Initial parameter θ
Initialize accumulation variables r = 0, s = 0
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Set g = 0
  for t = 1 to m do
    Compute gradient: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Accumulate gradient: r ← ρr + (1 − ρ)g²
  Compute update: ∆θ = −(√(s + ε)/√(r + ε)) ⊙ g   % (operations applied element-wise)
  Accumulate update: s ← ρs + (1 − ρ)[∆θ]²
  Apply update: θ ← θ + ∆θ
end while
Coordinate descent makes the most sense when the different variables in the
optimization problem can be clearly separated into groups that play relatively
isolated roles, or when optimization with respect to one group of variables is
significantly more efficient than optimization with respect to all of the variables.
For example, the objective function most commonly used for sparse coding is not
convex. However, we can divide the inputs to the training algorithm into two
sets: the dictionary parameters and the code representations. Minimizing the
objective function with respect to either one of these sets of variables is a convex
problem. Block coordinate descent thus gives us an optimization strategy that
allows us to use efficient convex optimization algorithms.
Coordinate descent is not a very good strategy when the value of one variable
strongly influences the optimal value of another variable, as in the function
$f(x) = (x_1 - x_2)^2 + \alpha(x_1^2 + x_2^2)$, where α is a positive constant. As α approaches 0,
coordinate descent ceases to make any progress at all, while Newton's method
could solve the problem in a single step.
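The following sketch runs exact coordinate descent on this quadratic (using the closed-form minimizer of each coordinate given the other) and illustrates how progress stalls as α approaches 0:

def coordinate_descent(alpha, n_sweeps=100, x0=(1.0, 1.0)):
    # f(x) = (x1 - x2)**2 + alpha * (x1**2 + x2**2) has its minimum at the origin.
    # Minimizing exactly over one coordinate with the other fixed gives
    # x1 = x2 / (1 + alpha) and x2 = x1 / (1 + alpha).
    x1, x2 = x0
    for _ in range(n_sweeps):
        x1 = x2 / (1.0 + alpha)
        x2 = x1 / (1.0 + alpha)
    return x1, x2

print(coordinate_descent(alpha=1.0))    # converges quickly toward the origin
print(coordinate_descent(alpha=1e-4))   # barely moves toward the origin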
8.6.4 Greedy Supervised Pre-training
TODO
Algorithm 8.8 The vSGD-1 algorithm from Schaul et al. (2012)
Require: Initial parameter θ_0
Initialize accumulation variables q = 0, r = 0, s = 0
while Stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)}.
  Initialize the gradient g = 0
  for t = 1 to m do
    Compute gradient: g ← g + ∇_θ L(f(x^(t); θ), y^(t))
  end for
  Accumulate gradient: q ← ρq + (1 − ρ)g
  Accumulate squared gradient: r ← ρr + (1 − ρ)g²
  Accumulate: s ← ρs + (1 − ρ) bbprop(θ)
  Estimate learning rate (element-wise calc.): η* ← q² / (s r)
  Update memory size: ρ ← 1 − [(1 − q²/r)/(1 − ρ) + 1]^{-1}
  Compute update: ∆θ = −η* ⊙ g
  % All operations above should be interpreted as element-wise.
  Apply update: θ ← θ + ∆θ
end while
8.7 Hints, Global Optimization and Curriculum Learning
Most of the work on numerical optimization for machine learning and deep learn-
ing in particular is focused on local descent, i.e., on how to locally improve the
objective function efficiently. What the experiments with different initialization
strategies tell us is that local descent methods can get stuck, presumably near a
local minimum or near a saddle point, i.e., where gradients are small, so that the
initial value of the parameters can matter a lot.
As an illustration of this issue, consider the experiments reported by Gülçehre
and Bengio (2013), where a learning task is set up so that if the lower half of the
deep supervised network is pre-trained with respect to an appropriate sub-task,
the whole network can learn to solve the overall task, whereas random initialization
almost always fails. In these experiments, we know that the overall task can
be decomposed into two tasks: (1) identifying the presence of different objects
in an input image, and (2) verifying whether the different objects detected are of
the same class or not. Each of these two tasks (object recognition, exclusive-or)
is known to be learnable, but when we compose them, it is much more difficult
Algorithm 8.9 Conjugate gradient method
Require: Initial parameters θ_0
Initialize ρ_0 = 0
while stopping criterion not met do
  Initialize the gradient g = 0
  for t = 1 to n do   % loop over the training set
    Compute gradient: g ← g + ∇_θ L(f(x^(t); θ), y^(t))   (via backpropagation)
  end for
  Compute β_t = ((g_t − g_{t−1})^⊤ g_t) / (g_{t−1}^⊤ g_{t−1})   (Polak–Ribière)
  Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
  Perform line search to find: η* = argmin_η J(θ_t + η ρ_t)
  Apply update: θ_{t+1} = θ_t + η* ρ_t
end while
Algorithm 8.10 BFGS method
Require: Initial parameters θ_0
Initialize inverse Hessian M_0 = I
while stopping criterion not met do
  Compute gradient: g_t = ∇J(θ_t)   (via batch backpropagation)
  Compute φ = g_t − g_{t−1}, ∆ = θ_t − θ_{t−1}
  Approx H^{−1}: M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(∆^⊤ φ)) (∆ ∆^⊤)/(∆^⊤ φ) − (∆ φ^⊤ M_{t−1} + M_{t−1} φ ∆^⊤)/(∆^⊤ φ)
  Compute search direction: ρ_t = −M_t g_t
  Perform line search to find: η* = argmin_η J(θ_t + η ρ_t)
  Apply update: θ_{t+1} = θ_t + η* ρ_t
end while
to optimize the neural network (including a large variety of architectures), while
other methods such as SVMs, boosting and decision trees also fail. This is an
instance where the optimization difficulty was solved by introducing prior knowl-
edge in the form of hints, specifically hints about what the intermediate layer in
a deep net should be doing. We have already seen in Section 8.6.4 that a useful
strategy is to ask the hidden units to extract features that are useful to the super-
vised task at hand, with greedy supervised pre-training. In section 16.1 we will
discuss an unsupervised version of this idea, where we ask the intermediate layers
to extract features that are good explaining the variations in the input, without
reference to a specific supervised task. Another related line of work is the Fit-
Nets (Romero et al., 2015), where the middle layer of 5-layer supervised teacher
network is used as a hint to be predicted by the middle layer of a much deeper
student network (11 to 19 layers). In that case, additional parameters are intro-
duced to regress the middle layer of the 5-layer teacher network from the middle
layer of the deeper student network. The lower layers of the student networks
thus get two objectives: help the outputs of the student network accomplish their
task, as well as predict the intermediate layer of the teacher network. Although
a deeper network is usually more difficult to optimize, it can generalize better (it
has to extract these more abstract and non-linear features). Romero et al. (2015)
were motivated by the fact that a deep student network with a smaller number
of hidden units per layer can have far fewer parameters (and faster computation)
than a fatter, shallower network and yet achieve the same or better generalization,
thus allowing a trade-off between better generalization (with 3 times fewer parameters)
and faster test-time computation (up to 10 fold, in the paper, using a
very thin and deep network with 35 times fewer parameters). Without the hints on
the hidden layer, the student network performed very poorly in the experiments,
both on the training and test sets.
These drastic effects of initialization and hints to middle layers bring forth
the question of what is sometimes called global optimization (Horst et al., 2000),
the main subject of this section. The objective of global optimization methods is
to find better solutions than local descent minimizers, i.e., ideally find a global
minimum of the objective function and not simply a local minimum. If one could
restart a local optimization method from a very large number of initial conditions,
one could imagine that the global minimum could be found, but there are more
efficient approaches.
Two fairly general approaches to global optimization are continuation meth-
ods (Wu, 1997), a deterministic approach, and simulated annealing (Kirkpatrick
et al., 1983), a stochastic approach. They both proceed from the intuition that
if we sufficiently blur a non-convex objective function (e.g. convolve it with a
Gaussian) whose global minima are not at infinite values, then it becomes con-
vex, and finding the global optimum of that blurred objective function should be
much easier. As illustrated in Figure 8.5, by gradually changing the objective
function from a very blurred easy to optimize version to the original crisp and
difficult objective function, we are actually likely to find better local minima. In
the case of simulated annealing, the blurring occurs because of injecting noise.
With injected noise, the state of the system can sometimes go uphill, and thus
does not necessarily get stuck in a local minimum. With a lot of noise, the ef-
fective objective function (averaged over the noise) is flatter and convex, and if
the amount of noise is reduced sufficiently slowly, then one can show convergence
to the global minimum. However, the annealing schedule (the rate at which the
noise level is decreased, or equivalently the temperature is decreased when you
think of the physical annealing analogy) might need to be extremely slow, so an
NP-hard optimization problem remains NP-hard.
Figure 8.5: Optimization based on continuation methods: start by optimizing a smoothed
out version of the target objective function (possibly convex), then gradually reduce
the amount of smoothing while tracking the local optimum. This approach tends to
find better local minima than a straight local descent approach on the target objective
function. Curriculum learning (starting from easy examples and gradually introducing
with higher probability more difficult examples) can be justified under that light (Bengio
et al., 2009).
Continuation methods have been extremely successful in recent years: see
Mobahi and Fisher III (2015) for an overview of the recent literature, especially for AI
applications. Continuation methods define a family of objective functions,
indexed by a single scalar index λ, with an easy to optimize objective function at
one end (usually convex, say λ = 1) and the target objective at the other end (say
λ = 0). The idea is to first find the solution for the easy problem (λ = 1) and
then gradually decrease λ towards the more difficult objectives, while tracking the
minimum.
Curriculum learning (Bengio et al., 2009) was introduced as a general strat-
egy for machine learning that is inspired by how humans learn, starting by learn-
ing to solve simple tasks, and then exploiting what has been learned to learn
slightly more difficult and abstract tasks, etc. It was justified as a continua-
tion method (Bengio et al., 2009) in the context of deep learning, where it was
previously observed that the optimization problem can be challenging. Experi-
ments showed that better results could be obtained by following a curriculum, in
particular on a large-scale neural language modeling task. One view on curricu-
lum learning introduced in that paper is that a particular intermediate objective
function corresponds to a reweighting of the examples: initially the easy to learn
examples are given more weights or a higher probability, and harder examples
see their weight or probability gradually increased as the learner gets sufficiently
ready to learn them. The idea of curriculum learning to help train difficult to
optimize models has been taken up successfully not only in natural language
tasks (Spitkovsky et al., 2010; Collobert et al., 2011a; Mikolov et al., 2011b; Tu
and Honavar, 2011) but also in computer vision (Kumar et al., 2010; Lee and
Grauman, 2011; Supancic and Ramanan, 2013). It was also found to be con-
sistent with the way in which humans teach (Khan et al., 2011): they start by
showing easier and more prototypical examples and then help the learner refine
the decision surface with the less obvious cases. In agreement with this, it was
found that such strategies are more effective when teaching to humans (Basu and
Christensen, 2013).
Another important contribution to research on curriculum learning arose in
the context of training recurrent neural networks to capture long-term dependen-
cies (Zaremba and Sutskever, 2014): it was found that much better results were
obtained with a stochastic curriculum, in which a random mix of easy and difficult
examples is always presented to the learner, but where the average proportion of
the more difficult examples (here, those with longer-term dependencies) is grad-
ually increased. Instead, with a deterministic curriculum, no improvement over
the baseline (ordinary training with the full training set) was observed.