
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
the risk. If we knew the true distribution p(x, y), minimizing the risk would be a pure optimization task, solvable by a standard optimization algorithm. However, when we do not know
p(x, y) but only have a training set of samples from it, we have a machine learning
problem.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk
\[
\mathbb{E}_{x, y \sim \hat{p}(x, y)} \left[ L(x, y) \right]
  = \frac{1}{m} \sum_{i=1}^{m} L\!\left( x^{(i)}, y^{(i)} \right)
\]
where m is the number of training examples.
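The averaging in this equation is straightforward to express in code. The following is a minimal NumPy sketch of empirical risk minimization's objective; the linear model, squared-error loss, and synthetic data are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def loss(theta, x, y):
    # Hypothetical per-example loss: squared error of a linear model.
    return (x @ theta - y) ** 2

def empirical_risk(theta, X, Y):
    # The empirical risk is the average of the per-example losses
    # over the m training examples: (1/m) * sum_i L(x^(i), y^(i)).
    m = len(X)
    return sum(loss(theta, X[i], Y[i]) for i in range(m)) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta          # noiseless labels for illustration

print(empirical_risk(np.zeros(3), X, Y))   # high risk far from true_theta
print(empirical_risk(true_theta, X, Y))    # zero risk at the minimizer
```

Because the labels here are generated without noise, the empirical risk reaches exactly zero at the true parameters; with noisy labels it would only approach the irreducible error.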
This process is known as empirical risk minimization. In this setting, machine
learning is still very similar to straightforward optimization. Rather than opti-
mizing the risk directly, we optimize the empirical risk, and hope that the risk
decreases significantly as well. A variety of theoretical results establish conditions
under which the true risk can be expected to decrease by various amounts.
However, empirical risk minimization is prone to overfitting. Models with
high capacity can simply memorize the training set. In many cases, empirical
risk minimization is not really feasible. The most effective modern optimization
algorithms are based on gradient descent, but many useful loss functions, such as the 0-1 loss (equal to 0 when an example is classified correctly and 1 otherwise), have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we
rarely use empirical risk minimization. Instead, we must use a slightly different
approach, in which the quantity that we actually optimize is even more different
from the quantity that we truly want to optimize.
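The failure of the 0-1 loss to provide a usable gradient, and the fix via a differentiable surrogate, can be illustrated numerically. This is a small sketch, assuming labels y in {-1, +1}, a scalar score, and the log loss as the surrogate; none of these specifics come from the text.

```python
import numpy as np

def zero_one_loss(score, y):
    # 1 if the sign of the score disagrees with the label y in {-1, +1}.
    return float(np.sign(score) != y)

def log_loss(score, y):
    # Differentiable surrogate for the 0-1 loss: log(1 + exp(-y * score)).
    return np.log1p(np.exp(-y * score))

def numerical_grad(f, s, eps=1e-6):
    # Central finite difference approximation of df/ds.
    return (f(s + eps) - f(s - eps)) / (2 * eps)

y, s = 1.0, -0.5   # a misclassified example (score has the wrong sign)
g01 = numerical_grad(lambda s: zero_one_loss(s, y), s)
gsur = numerical_grad(lambda s: log_loss(s, y), s)
print(g01, gsur)
```

Away from the decision boundary the 0-1 loss is piecewise constant, so its numerical gradient is exactly zero and gradient descent receives no signal; the surrogate's gradient is nonzero and negative here, pushing the score toward the correct sign.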
8.1.2 Surrogate Loss Functions
TODO — coordinate with Yoshua and with the MLP / ML chapters: do we use the term "loss function" to mean a map from a single example to a real number, or do we use it interchangeably with "objective function" / "cost function"? Some literature uses "loss function" in a very general sense, while other literature uses it to mean specifically a single-example cost whose expectation one can take. This terminology seems sub-optimal, since it relies on English words with essentially the same meaning to represent distinct, precise technical concepts. Also: do "surrogate loss functions" specifically replace the cost for an individual example, or does the term also cover things like minimizing the empirical risk rather than the true risk, adding a regularization term to the likelihood, etc.?