Chapter 8
Optimization for Training Deep Models
In Section 4.3, we saw a brief overview of numerical optimization—the task of iteratively
searching for values of x that result in maximal or minimal values of f(x). That intro-
ductory section was fairly general and described the basic ideas that are used in a range
of optimization problems. Optimization arises in many contexts in deep learning, not
just in optimizing parameters but also in inference (which optimizes over the internal
states of the network).
Here we present optimization algorithms that are intended specifically for training
deep models. In this case, we will always minimize some cost function J(X^{(train)}, θ) with respect to the model parameters θ.
8.1 Optimization for Model Training
Optimization algorithms used for training of deep models differ from traditional opti-
mization algorithms in several ways.
8.1.1 Empirical Risk Minimization
Suppose that we have input features x, targets y, and some loss function L(x, y). Our ultimate goal is to minimize E_{x,y∼p(x,y)}[L(x, y)]. This quantity is known as the risk. If we knew the true distribution p(x, y), this would be an optimization task solvable by an optimization algorithm. However, when we do not know p(x, y) but only have a
training set of samples from it, we have a machine learning problem.
The simplest way to convert a machine learning problem back into an optimization
problem is to minimize the expected loss on the training set. This means replacing the
true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

\mathbb{E}_{x,y \sim \hat{p}(x,y)}[L(x, y)] = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)})

where m is the number of training examples.
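To make the definition concrete, here is a minimal sketch (using NumPy; the predictor and the squared-error loss are illustrative choices, not anything prescribed by the text) that computes the empirical risk as the average per-example loss over a training set:

import numpy as np

def squared_error(y_hat, y):
    # A per-example loss L(f(x), y); any other loss could be substituted here.
    return 0.5 * (y_hat - y) ** 2

def empirical_risk(predict, X_train, y_train, loss=squared_error):
    # Average the per-example loss over the m training examples.
    return np.mean([loss(predict(x), y) for x, y in zip(X_train, y_train)])

# Example with a trivial identity predictor:
X_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.1, 0.9, 2.1])
print(empirical_risk(lambda x: x, X_train, y_train))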
This process is known as empirical risk minimization. In this setting, machine learn-
ing is still very similar to straightforward optimization. Rather than optimizing the risk
directly, we optimize the empirical risk, and hope that the risk decreases significantly as
well. A variety of theoretical results establish conditions under which the true risk can
be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting. Models with high ca-
pacity can simply memorize the training set. In many cases, empirical risk minimization
is not really feasible. The most effective modern optimization algorithms are based on
gradient descent, but many useful loss functions, such as 0-1 loss, have no useful deriva-
tives (the derivative is either zero or undefined everywhere). These two problems mean
that, in the context of deep learning, we rarely use empirical risk minimization. In-
stead, we must use a slightly different approach, in which the quantity that we actually
optimize is even more different from the quantity that we truly want to optimize.
TODO– make sure 0-1 loss is defined and in the index
8.1.2 Surrogate Loss Functions
TODO–coordinate with Yoshua / coordinate with MLP / ML chapters do we use term
loss function = map from a specific example to a real number or do we use it inter-
changeably with objective function / cost function? it seems some literature uses ”loss
function” in a very general sense while others use it to mean specifically a single-example
cost that you can take the expectation of, etc. this terminology seems a bit sub-optimal
since it relies a lot on using English words with essentially the same meaning to represent
different things with precise technical meanings are ”surrogate loss functions” specifi-
cally replacing the cost for an individual examples, or does this also include things like
minimizing the empirical risk rather than the true risk, adding a regularization term to
the likelihood terms, etc.?
TODO– in some cases, surrogate loss function actually results in being able to learn
more. for example, test 0-1 loss continues to decrease for a long time after train 0-1 loss
has reached zero when training using log likelihood surrogate
In some cases, using a surrogate loss function actually allows us to learn more. For example, the test-set 0-1 loss often continues to decrease for a long time after the training-set 0-1 loss has reached zero when training with a log-likelihood surrogate.
8.1.3 Generalization
TODO– SGD on an infinite dataset optimizes the generalization error directly (note that SGD is not introduced until later so this will need to be presented carefully)
A very important difference between optimization in general and optimization as we
use it for training algorithms is that training algorithms do not usually halt at a local
minimum. Instead, using a regularization method known as early stopping (see Sec.
7.8), they halt whenever overfitting begins to occur. This is often in the middle of a
wide, flat region, but it can also occur on a steep part of the surrogate loss function.
This is in contrast to general optimization, where convergence is usually defined by arriving
at a point that is very near a (local) minimum.
8.1.4 Batches and Minibatches
One aspect of machine learning algorithms that separates them from general optimiza-
tion algorithms is that the objective function usually decomposes as a sum over the
training examples. Optimization algorithms for machine learning typically compute
each update to the parameters based on a subset of the terms of the objective function,
not based on the complete objective function itself.
For example, maximum likelihood estimation problems decompose into a sum over each example:
TODO equation, using same format as in original maximum likelihood section, which isn't written yet
TODO stochastic gradient descent
TODO examples can be redundant, so best computational efficiency comes from minibatches
TODO too small of batch size -> bad use of multicore architectures
TODO issues with Hessian being different for different batches, etc.
TODO importance of shuffling; shuffle-once versus shuffle per epoch
TODO define the term “epoch”
8.1.5 Data Parallelism
TODO asynch implementations, hogwild, distbelief
8.2 Challenges in Optimization
8.2.1 Local Minima
TODO check whether this is already covered in numerical.tex
8.2.2 Ill-Conditioning
TODO this is definitely already covered in numerical.tex
8.2.3 Plateaus, Saddle Points, and Other Flat Regions
The long-held belief that neural networks are hopeless to train because they are fraught
with local minima has been one of the reasons for the “neural networks winter” in the
1995-2005 decade. Indeed, one can show that there may be an exponentially large num-
ber of local minima, even in the simplest neural network optimization problems (Sontag
and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992).
Theoretical work has shown that saddle points (and the flat regions surrounding
them) are important barriers to training neural networks, and may be more important
than local minima.
8.2.4 Cliffs and Exploding Gradients
Whereas the issues of ill-conditioning and saddle points discussed in the previous sections
arise because of the second-order structure of the objective function (as a function of the
parameters), neural networks involve stronger non-linearities which do not fit well with
this picture. In particular, the second-order Taylor series approximation of the objective
function yields a symmetric view of the landscape around the minimum, oriented ac-
cording to the axes defined by the principal eigenvectors of the Hessian matrix. (TODO: REFER TO A PLOT FROM THE ILL-CONDITIONING SECTION WITH CONTOURS OF VALLEY). Second-order methods and momentum or gradient-averaging
methods introduced in Section 8.4 are able to reduce the difficulty due to ill-conditioning
by increasing the size of the steps in the low-curvature directions (the “valley”, in Fig-
ure 8.1) and decreasing the size of the steps in the high-curvature directions (the steep
sides of the valley, in the figure).
Figure 8.1: The traditional view of the optimization difficulty in neural networks is
inspired by the ill-conditioning problem in quadratic optimization: some directions have
a high curvature (second derivative), corresponding to the rising sides of the valley, and
other directions have a low curvature, corresponding to the smooth slope of the valley.
Most second-order methods, as well as momentum or gradient averaging methods, are meant to address that problem by increasing the step size in the direction of the valley (where it pays off most in the long run to go) and decreasing it in the directions of steep rise, which would otherwise lead to oscillations. The objective is to smoothly go down,
staying at the bottom of the valley.
However, although classical second order methods can help, as shown in Figure 8.2,
due to higher order derivatives, the objective function may have a lot more non-linearity,
which often does not have the nice symmetrical shapes that the second-order “valley”
picture builds in our mind. Instead, there are cliffs where the gradient rises sharply.
When the parameters approach a cliff region, the gradient update step can move the
learner towards a very bad configuration, ruining much of the progress made during
recent training iterations.
Figure 8.2: Contrary to what is shown in Figure 8.1, the cost function for highly non-
linear deep neural networks or for recurrent neural networks is typically not made of
symmetrical sides. As shown in the figure, there are sharp non-linearities that give
rise to very high derivatives in some places. When the parameters get close to such a
cliff region, a gradient descent update can catapult the parameters very far, possibly
ruining a lot of the optimization work that had been done. Figure graciously provided
by Razvan Pascanu (Pascanu, 2014).
As illustrated in Figure 8.3, the cliff can be dangerous whether we approach it from
above or from below, but fortunately there are some fairly straightforward heuristics
that allow one to avoid its most serious consequences. The basic idea is to limit the size
of the jumps that one would make. Indeed, one should keep in mind that when we use
the gradient to make an update of the parameters, we are relying on the assumption of
infinitesimal moves. There is no guarantee that making a finite step of the parameters
θ in the direction of the gradient will yield an improvement. The only thing that is
guaranteed is that a small enough step in that direction will be helpful. As we can see
from Figure 8.3, in the presence of a cliff (and in general in the presence of very large
gradients), the decrease in the objective function expected from going in the direction
of the gradient is only valid for a very small step. In fact, because the objective function is usually bounded in its actual value (within a finite domain), when the gradient is large at θ, it typically remains that large (in particular, keeping its sign) only in a small region around θ. Otherwise, the value of the objective function would have to change a lot: if the slope were consistently large in some direction as we moved along it, we would be able to decrease the objective function value by a very large amount by following it, simply because the total change is the integral of the directional derivatives along that path.
Figure 8.3: To address the presence of cliffs such as shown in Figure 8.2, a useful heuris-
tic is to clip the magnitude of the gradient, only keeping its direction if its magnitude
is above a threshold (which is a hyper-parameter, although not a very critical one).
This helps to avoid the destructive big moves which would happen when approach-
ing the cliff, either from above or from below. Figure graciously provided by Razvan
Pascanu (Pascanu, 2014).
The gradient clipping heuristics are described in more detail in Section 10.6.7. The
basic idea is to bound the magnitude of the update step, i.e., not trust the gradient too
much when it is very large in magnitude. The context in which such cliffs have been
shown to arise in particular is that of recurrent neural networks, when considering long
sequences, as discussed in the next section.
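As a rough sketch of the norm-clipping idea (NumPy; the function name and the threshold value are ours, and the heuristics actually used in practice are deferred to Section 10.6.7):

import numpy as np

def clip_gradient_norm(g, threshold):
    # If the gradient norm exceeds the threshold, rescale it so that its norm
    # equals the threshold while its direction is preserved.
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

# Example: a very large gradient gets rescaled before the update is applied.
g = np.array([300.0, -400.0])
print(clip_gradient_norm(g, threshold=1.0))  # direction kept, norm reduced to 1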
8.2.5 Vanishing and Exploding Gradients - An Introduction to the
Issue of Learning Long-Term Dependencies
Parametrized dynamical systems such as recurrent neural networks (Chapter 10) face a
particular optimization problem which is different but related to that of training very
deep networks. We introduce this issue here and refer the reader to Section 10.6 for
a deeper treatment along with a discussion of approaches that have been proposed to
reduce this difficulty.
Exploding or Vanishing Product of Jacobians
The simplest explanation of the problem, which is shared among very deep nets and
recurrent nets, is that in both cases the final output is the composition of a large
number of non-linear transformations. Even though each of these non-linear stages may
be relatively smooth (e.g. the composition of an affine transformation with a hyperbolic
tangent or sigmoid), their composition is going to be much “more non-linear”, in the
sense that derivatives through the whole composition will tend to be either very small or
very large, with more ups and downs. This arises simply because the Jacobian (matrix
of derivatives) of a composition is the product of the Jacobians of each stage, i.e., if

f = f_T \circ f_{T-1} \circ \cdots \circ f_2 \circ f_1

then the Jacobian matrix of derivatives of f(x) with respect to its input vector x is the product

f' = f'_T \, f'_{T-1} \cdots f'_2 \, f'_1     (8.1)

where

f' = \frac{\partial f(x)}{\partial x} \quad \text{and} \quad f'_t = \frac{\partial f_t(a_t)}{\partial a_t}

where a_t = f_{t-1}(f_{t-2}(\cdots f_2(f_1(x)))), i.e., composition has been replaced by matrix multiplication. This is illustrated in Figure 8.4.
Figure 8.4: When composing many non-linearities (like the activation non-linearity in
a deep or recurrent neural network), the result is highly non-linear, typically with most
of the values associated with a tiny derivative, some values with a large derivative, and
many ups and downs (not shown here).
In the scalar case, we can imagine that the product of many numbers tends to be either very large or very small. In the special case where all the numbers in the product have the same value α, this is obvious, since α^T goes to 0 if α < 1 and goes to ∞ if α > 1 as T increases. The more general case of non-identical numbers can be understood by taking the logarithm of these numbers, considering them to be random, and computing the variance of the sum of these logarithms. Clearly, although some cancellation can happen, the variance grows with T, and in fact if those numbers are independent, the variance grows linearly with T, i.e., the size of the sum (which is the standard deviation) grows as \sqrt{T}, which means that the product grows roughly as e^{\sqrt{T}} (consider the variance of a log-normal variate X if \log X is normal with mean 0 and variance T).
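The following quick numerical sketch (NumPy; the choice of distribution is arbitrary and only for illustration) shows how a product of many independent positive factors tends to blow up or shrink:

import numpy as np

rng = np.random.default_rng(0)
T = 100
# Positive factors whose logarithms are independent with mean 0 and variance 0.25.
factors = np.exp(rng.normal(loc=0.0, scale=0.5, size=T))
log_product = np.sum(np.log(factors))
print(log_product)        # its standard deviation across runs grows as sqrt(T)
print(np.prod(factors))   # the product itself is typically very large or very small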
It would be interesting to push this analysis to the case of multiplying square matrices
instead of multiplying numbers, but one might expect qualitatively similar conclusions,
i.e., the size of the product somehow grows with the number of matrices, and that it
grows exponentially. In the case of matrices, one can get a new form of cancellation due
to leading eigenvectors being well aligned or not. The product of matrices will blow up
only if, among their leading eigenvectors with eigenvalue greater than 1, there is enough
“in common” (in the sense of the appropriate dot products of leading eigenvectors of
one matrix and another).
However, this analysis was for the case where these numbers are independent. In the
case of an ordinary recurrent neural network (developed in more detail in Chapter 10),
these Jacobian matrices are highly related to each other. Each layer-wise Jacobian is
actually the product of two matrices: (a) the recurrent matrix W and (b) the diagonal
matrix whose entries are the derivatives of the non-linearities associated with the hidden
units, which vary depending on the time step. This makes it likely that successive
Jacobians have similar eigenvectors, making the product of these Jacobians explode or
vanish even faster.
Consequence for Recurrent Networks: Difficulty of Learning Long-Term De-
pendencies
The consequence of the exponential convergence of these products of Jacobians towards
either very small or very large values is that it makes the learning of long-term depen-
dencies particularly difficult, as we explain below. This difficulty was first independently identified in Hochreiter (1991) and Bengio et al. (1993, 1994).
Consider a fairly general parametrized dynamical system (which includes classical
recurrent networks as a special case, as well as all their known variants), processing a
sequence of inputs, x_1, \ldots, x_t, \ldots, involving iterating over the transition operator:

s_t = F_\theta(s_{t-1}, x_t)     (8.2)

where s_t is called the state of the system and F_\theta is the recurrent function that maps the previous state and current input to the next state. The state can be used to produce an output via an output function,

o_t = g_\omega(s_t),     (8.3)

and a loss L_t is computed at each time step t as a function of o_t and possibly of some targets y_t. Let us consider the gradient of a loss L_T at time T with respect to the parameters \theta of the recurrent function F_\theta. One particular way to decompose the gradient \frac{\partial L_T}{\partial \theta} using the chain rule is the following:
\frac{\partial L_T}{\partial \theta} = \sum_{t \le T} \frac{\partial L_T}{\partial s_t} \frac{\partial s_t}{\partial \theta}

\frac{\partial L_T}{\partial \theta} = \sum_{t \le T} \frac{\partial L_T}{\partial s_T} \frac{\partial s_T}{\partial s_t} \frac{\partial F_\theta(s_{t-1}, x_t)}{\partial \theta}     (8.4)
where the last Jacobian matrix only accounts for the immediate effect of \theta as a parameter of F_\theta when computing s_t = F_\theta(s_{t-1}, x_t), i.e., not taking into account the indirect effect of \theta via s_{t-1} (otherwise there would be double counting and the result would be incorrect). To see that this decomposition is correct, please refer to the notions of gradient computation in a flow graph introduced in Section 6.3, and note that we can construct a graph in which \theta influences each s_t, each of which influences L_T via s_T. Now let us note that each Jacobian matrix \frac{\partial s_T}{\partial s_t} can be decomposed as follows:

\frac{\partial s_T}{\partial s_t} = \frac{\partial s_T}{\partial s_{T-1}} \frac{\partial s_{T-1}}{\partial s_{T-2}} \cdots \frac{\partial s_{t+1}}{\partial s_t}     (8.5)
which is of the same form as Eq. 8.1 discussed above, i.e., which tends to either vanish
or explode.
As a consequence, we see from Eq. 8.4 that \frac{\partial L_T}{\partial \theta} is a weighted sum of terms over spans T - t, with weights that are exponentially smaller (or larger) for longer-term dependencies relating the state at t to the state at T. As shown in Bengio et al. (1994), in order for a recurrent network to reliably store memories, the Jacobians \frac{\partial s_t}{\partial s_{t-1}} relating
each state to the next must have a determinant that is less than 1 (i.e., yielding to
the formation of attractors in the corresponding dynamical system). Hence, when the
model is able to capture long-term dependencies it is also in a situation where gradients
vanish and long-term dependencies have an exponentially smaller weight than short-term
dependencies in the total gradient. It does not mean that it is impossible to learn, but
that it might take a very long time to learn long-term dependencies, because the signal
about these dependencies will tend to be hidden by the smallest fluctuations arising from
short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that
as we increase the span of the dependencies that need to be captured, gradient-based
optimization becomes increasingly difficult, with the probability of successful learning
rapidly reaching 0 after only 10 or 20 steps in the case of the vanilla recurrent net and
stochastic gradient descent.
For a deeper treatment of the dynamical systems view of recurrent networks, con-
sider Doya (1993); Bengio et al. (1994); Siegelmann and Sontag (1995), with a review
in Pascanu et al. (2013a). Section 10.6 discusses various approaches that have been pro-
posed to reduce the difficulty of learning long-term dependencies (in some cases allowing
one to reach hundreds of steps), but it remains one of the main challenges in deep
learning.
8.3 Optimization Algorithms
In Sec. 6.3, we discussed the backpropagation algorithm (backprop): that is, how to
efficiently compute the gradient of the loss with respect to the model parameters. The
backpropagation algorithm does not specify how we use this gradient to update the
weights of the model.
In this section we introduce a number of gradient-based learning algorithms that
have been proposed to optimize the parameters of deep learning models.
8.3.1 Gradient Descent
Gradient descent is the most basic gradient-based algorithm one might apply to train a deep model. The algorithm involves updating the model parameters θ (in the case of a deep neural network, these parameters would include the weights and biases associated with each layer) with a small step in the direction of the negative gradient of the loss function (including any regularization terms). For the case of supervised learning with data pairs [x^{(t)}, y^{(t)}] we have:

\theta \leftarrow \theta - \eta \nabla_\theta \sum_t L(f(x^{(t)}; \theta), y^{(t)}; \theta),     (8.6)

where \eta is the learning rate, an optimization hyperparameter that controls the size of the step the parameters take along the negative gradient. Of course, following the gradient in this way is only guaranteed to reduce the loss in the limit as \eta \to 0.
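A minimal sketch of one such update (NumPy; grad_loss is a hypothetical stand-in for the gradient that backpropagation would return):

import numpy as np

def gradient_descent_step(theta, grad_loss, lr):
    # grad_loss(theta) stands in for the gradient of the total training loss
    # (including any regularization) computed by backpropagation.
    return theta - lr * grad_loss(theta)

# Example: descend on the quadratic loss 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, grad_loss=lambda th: th, lr=0.1)
print(theta)  # close to the minimizer at the origin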
8.3.2 Stochastic Gradient Descent
More generally, we are really interested in minimizing the expected loss, with the expectation taken with respect to the data distribution, i.e. \mathbb{E}_{X,Y}[L(f_\theta(X), Y)] with X, Y \sim P(X, Y). As discussed in Sec. 6.1, we replace this expectation with an average over the training data (e.g. for n examples):

\text{expected loss} = \frac{1}{n} \sum_{t=1}^{n} L(f(x^{(t)}), y^{(t)})     (8.7)
This form of the loss implies that the gradient also consists of an average of the gradient contributions for each data point:

\nabla_\theta \, \text{expected loss} = \frac{1}{n} \sum_{t=1}^{n} \nabla_\theta L(f(x^{(t)}), y^{(t)})     (8.8)
We can interpret the right hand side of Eqn. 8.8 as an estimator of the gradient of the expected loss. Seen in this light, it is reasonable to think about the properties of this estimator, such as its mean and variance.
Provided that there is a relatively large number of examples in the training set, computing the gradient over all examples in the training dataset (also known as batch gradient descent) would yield an estimate with relatively small variance.
When training deep learning models, straightforward gradient descent, where each gradient step involves computing the gradient over all training examples, is well known to be inefficient. This is especially true when we are dealing with large datasets.
TODO stochastic gradient descent
Algorithm 8.1 Stochastic gradient descent (SGD) update at time t
Require: Learning rate η.
Require: Initial parameter θ
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Set g = 0
    for t = 1 to m do
        Compute gradient estimate: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})
    end for
    Apply update: θ ← θ − ηg
end while
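A straightforward sketch of Algorithm 8.1 (NumPy; grad_example is a hypothetical per-example gradient function, not part of the algorithm's specification):

import numpy as np

def sgd(theta, X, Y, grad_example, lr=0.01, batch_size=32, n_steps=1000, seed=0):
    # grad_example(theta, x, y) stands in for the per-example gradient
    # that backpropagation would compute.
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
        g = np.zeros_like(theta)
        for i in idx:
            g = g + grad_example(theta, X[i], Y[i])  # accumulate gradient estimate
        theta = theta - lr * g                       # apply update
    return theta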
discuss learning rates.
8.3.3 Momentum
While stochastic gradient descent remains a very popular optimization strategy, learn-
ing with it can sometimes be slow. This is especially true in situations where the gradient is small but consistent across minibatches. From the consistency of the
gradient, we know that we can afford to take larger steps in this direction, yet we have
no way of really knowing when we are in such situations.
The momentum method (Polyak, 1964) is designed to accelerate learning, especially
in the face of small and consistent gradients. The intuition behind momentum, as the
name suggests, is derived from a physical interpretation of the optimization process.
Imagine you have a small ball (think marble) that represents the current position in
parameter space (for our purposes here we can imagine a 2-D parameter space). Now consider that the ball is on a gentle slope: while the instantaneous force pulling the ball downhill is relatively small at any moment, these contributions accumulate and the downhill velocity of the ball gradually increases over time. The momentum method is designed to inject this kind of downhill acceleration into gradient-based optimization.
Formally, we introduce a variable v that plays the role of velocity (or momentum)
that accumulates the gradient. The update rule is given by:

v \leftarrow \alpha v - \eta \nabla_\theta \left( \frac{1}{n} \sum_{t=1}^{n} L(f(x^{(t)}; \theta), y^{(t)}) \right)

\theta \leftarrow \theta + v

where the hyperparameter α controls how quickly the contributions of previous gradients decay and η is the learning rate.
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate η, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Set g = 0
    for t = 1 to m do
        Compute gradient estimate: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})
    end for
    Compute velocity update: v ← αv − ηg
    Apply update: θ ← θ + v
end while
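A sketch of the velocity and parameter updates from Algorithm 8.2 (NumPy; grad_minibatch is a hypothetical helper returning the minibatch gradient):

import numpy as np

def momentum_step(theta, v, grad_minibatch, lr=0.01, alpha=0.9):
    # grad_minibatch(theta) stands in for the minibatch gradient.
    g = grad_minibatch(theta)
    v = alpha * v - lr * g   # accumulate velocity
    theta = theta + v        # move the parameters along the velocity
    return theta, v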
Nesterov momentum Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov's accelerated gradient method:

v \leftarrow \alpha v - \eta \nabla_\theta \left( \frac{1}{n} \sum_{t=1}^{n} L(f(x^{(t)}; \theta + \alpha v), y^{(t)}) \right),

\theta \leftarrow \theta + v,
where the parameters α and η play a similar role as in the standard momentum method.
The difference between Nesterov momentum and standard momentum is where the gra-
dient is evaluated.
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate η, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Apply interim update: θ̃ ← θ + αv
    Set g = 0
    for t = 1 to m do
        Compute gradient (at interim point): g ← g + ∇_θ L(f(x^{(t)}; θ̃), y^{(t)})
    end for
    Compute velocity update: v ← αv − ηg
    Apply update: θ ← θ + v
end while
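The corresponding sketch for the Nesterov variant (same hypothetical grad_minibatch helper as above); the only change is where the gradient is evaluated:

import numpy as np

def nesterov_step(theta, v, grad_minibatch, lr=0.01, alpha=0.9):
    # Evaluate the gradient at the interim ("look-ahead") point theta + alpha * v.
    g = grad_minibatch(theta + alpha * v)
    v = alpha * v - lr * g
    theta = theta + v
    return theta, v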
8.3.4 Adagrad
8.3.5 RMSprop
Algorithm 8.4 The Adagrad algorithm
Require: Global learning rate η
Require: Initial parameter θ
Initialize gradient accumulation variable r = 0
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Set g = 0
    for t = 1 to m do
        Compute gradient: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})
    end for
    Accumulate squared gradient: r ← r + g^2
    Compute update: ∆θ ← −(η/√r) g    % (1/√r applied element-wise)
    Apply update: θ ← θ + ∆θ
end while
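A sketch of the per-step updates of Adagrad and RMSprop (Algorithms 8.4 and 8.5) in NumPy; the small constant delta added for numerical stability and the grad_minibatch helper are assumptions of this sketch rather than part of the algorithm boxes:

import numpy as np

def adagrad_step(theta, r, grad_minibatch, lr=0.01, delta=1e-8):
    g = grad_minibatch(theta)
    r = r + g ** 2                                  # accumulate squared gradients
    theta = theta - lr * g / (delta + np.sqrt(r))   # element-wise rescaled step
    return theta, r

def rmsprop_step(theta, r, grad_minibatch, lr=0.001, rho=0.9, delta=1e-8):
    g = grad_minibatch(theta)
    r = rho * r + (1 - rho) * g ** 2                # decaying average instead of a sum
    theta = theta - lr * g / (delta + np.sqrt(r))
    return theta, r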
8.3.6 Adadelta
8.3.7 No Pesky Learning Rates
8.4 Approximate Natural Gradient and Second-Order Meth-
ods
8.5 Conjugate Gradients
8.6 BFGS
TONGA and “actual” NG, links with HF.
8.6.1 New
8.6.2 Optimization Strategies and Meta-Algorithms
8.6.3 Coordinate Descent
In some cases, it may be possible to solve an optimization problem quickly by breaking
it into separate pieces. If we minimize f(x) with respect to a single variable x_i, then minimize it with respect to another variable x_j, and so on, we are guaranteed to arrive at a
(local) minimum. This practice is known as coordinate descent, because we optimize one
coordinate at a time. More generally, block coordinate descent refers to minimizing with
respect to a subset of the variables simultaneously. The term “coordinate descent” is
Algorithm 8.5 The RMSprop algorithm
Require: Global learning rate η, decay rate ρ.
Require: Initial parameter θ
Initialize accumulation variable r = 0
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Set g = 0
    for t = 1 to m do
        Compute gradient: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})
    end for
    Accumulate squared gradient: r ← ρr + (1 − ρ)g^2
    Compute parameter update: ∆θ ← −(η/√r) g    % (1/√r applied element-wise)
    Apply update: θ ← θ + ∆θ
end while
often used to refer to block coordinate descent as well as the strictly individual coordinate
descent.
Coordinate descent makes the most sense when the different variables in the opti-
mization problem can be clearly separated into groups that play relatively isolated roles,
or when optimization with respect to one group of variables is significantly more effi-
cient than optimization with respect to all of the variables. For example, the objective
function most commonly used for sparse coding is not convex. However, we can divide
the inputs to the training algorithm into two sets: the dictionary parameters and the
code representations. Minimizing the objective function with respect to either one of
these sets of variables is a convex problem. Block coordinate descent thus gives us an
optimization strategy that allows us to use efficient convex optimization algorithms.
Coordinate descent is not a very good strategy when the value of one variable strongly
influences the optimal value of another variable, as in the function f(x) = (x_1 - x_2)^2 + \alpha(x_1^2 + x_2^2) where α is a positive constant. As α approaches 0, coordinate descent ceases
to make any progress at all, while Newton’s method could solve the problem in a single
step.
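A small sketch of coordinate descent on this example (plain Python; the closed-form coordinate minimizers come from setting each partial derivative of f to zero):

def coordinate_descent(alpha, n_iters=50):
    # Minimize f(x) = (x1 - x2)^2 + alpha * (x1^2 + x2^2) one coordinate at a time.
    x1, x2 = 1.0, -1.0
    for _ in range(n_iters):
        x1 = x2 / (1.0 + alpha)  # argmin over x1 with x2 fixed
        x2 = x1 / (1.0 + alpha)  # argmin over x2 with x1 fixed
    return x1, x2

print(coordinate_descent(alpha=1.0))    # converges quickly toward (0, 0)
print(coordinate_descent(alpha=0.01))   # shrinks by only ~2% per sweep when alpha is small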
8.6.4 Greedy Supervised Pre-training
TODO
8.7 Hints and Curriculum Learning
TODO
Algorithm 8.6 RMSprop algorithm with Nesterov momentum
Require: Global learning rate η, decay rate ρ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Compute interim update: θ̃ ← θ + αv
    Set g = 0
    for t = 1 to m do
        Compute gradient: g ← g + ∇_θ L(f(x^{(t)}; θ̃), y^{(t)})
    end for
    Accumulate squared gradient: r ← ρr + (1 − ρ)g^2
    Compute velocity update: v ← αv − (η/√r) g    % (1/√r applied element-wise)
    Apply update: θ ← θ + v
end while
Algorithm 8.7 The Adadelta algorithm
Require: Decay rate ρ, constant ε
Require: Initial parameter θ
Initialize accumulation variables r = 0, s = 0
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Set g = 0
    for t = 1 to m do
        Compute gradient: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})
    end for
    Accumulate gradient: r ← ρr + (1 − ρ)g^2
    Compute update: ∆θ ← −(√(s + ε)/√(r + ε)) g    % (operations applied element-wise)
    Accumulate update: s ← ρs + (1 − ρ)[∆θ]^2
    Apply update: θ ← θ + ∆θ
end while
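A sketch of the Adadelta update from Algorithm 8.7 (NumPy; eps plays the role of the small constant ε, and grad_minibatch is again a hypothetical helper returning the minibatch gradient):

import numpy as np

def adadelta_step(theta, r, s, grad_minibatch, rho=0.95, eps=1e-6):
    g = grad_minibatch(theta)
    r = rho * r + (1 - rho) * g ** 2                    # running average of g^2
    delta = -np.sqrt(s + eps) / np.sqrt(r + eps) * g    # learning-rate-free update
    s = rho * s + (1 - rho) * delta ** 2                # running average of update^2
    return theta + delta, r, s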
Algorithm 8.8 The vSGD-1 algorithm from Schaul et al. (2012)
Require: Initial parameter θ_0
Initialize accumulation variables q = 0, r = 0, s = 0
while Stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^{(1)}, . . . , x^{(m)}}.
    Initialize the gradient g = 0
    for t = 1 to m do
        Compute gradient: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})
    end for
    Accumulate gradient: q ← ρq + (1 − ρ)g
    Accumulate squared gradient: r ← ρr + (1 − ρ)g^2
    Accumulate: s ← ρs + (1 − ρ) bbprop(θ)_i^{(j)}
    Estimate learning rate (element-wise calc.): η* ← q^2/(s r)
    Update memory size: 1/(1 − ρ) ← (1 − q^2/r) · 1/(1 − ρ) + 1
    Compute update: ∆θ = −η* g
    % All operations above should be interpreted as element-wise.
    Apply update: θ ← θ + ∆θ
end while
Algorithm 8.9 Conjugate gradient method
Require: Initial parameters θ_0
Initialize ρ_0 = 0
while stopping criterion not met do
    Initialize the gradient g = 0
    for t = 1 to n do    % loop over the training set
        Compute gradient: g ← g + ∇_θ L(f(x^{(t)}; θ), y^{(t)})    % (via backpropagation)
    end for
    Compute β_t = \frac{(g_t - g_{t-1})^\top g_t}{g_{t-1}^\top g_{t-1}}    (Polak-Ribière)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: η* = argmin_η J(θ_t + η ρ_t)
    Apply update: θ_{t+1} = θ_t + η* ρ_t
end while
Algorithm 8.10 BFGS method
Require: Initial parameters θ_0
Initialize inverse Hessian M_0 = I
while stopping criterion not met do
    Compute gradient: g_t = ∇J(θ_t)    (via batch backpropagation)
    Compute φ = g_t − g_{t−1}, ∆ = θ_t − θ_{t−1}
    Approx H^{-1}: M_t = M_{t-1} + \left(1 + \frac{\phi^\top M_{t-1}\phi}{\Delta^\top \phi}\right)\frac{\Delta\Delta^\top}{\Delta^\top \phi} - \frac{\Delta\phi^\top M_{t-1} + M_{t-1}\phi\Delta^\top}{\Delta^\top \phi}
    Compute search direction: ρ_t = −M_t g_t
    Perform line search to find: η* = argmin_η J(θ_t + η ρ_t)
    Apply update: θ_{t+1} = θ_t + η* ρ_t
end while