Chapter 8
Optimization for training deep models
In Section 4.3, we saw a brief overview of numerical optimization: the task of iteratively searching for values of x that result in maximal or minimal values of f(x). That
introductory chapter was fairly general and described the basic ideas that are used in a
range of optimization problems. Optimization arises in many contexts in deep learning,
such as inference.
In this chapter, we present optimization algorithms that are intended specifically
for training deep models. In this case, we will always minimize some cost function
$J(X^{(\text{train})}, \theta)$ with respect to the model parameters $\theta$.
8.1 Optimization for model training
8.1.1 Early Stopping
8.1.2 Plateaus, saddle points, and other flat regions
TODO
Theoretical work has shown that saddle points (and the flat regions surrounding
them) are important barriers to training neural networks, and may be more important
than local minima.
8.1.3 Cliffs and Exploding Gradients
Whereas the issues of ill-conditioning and saddle points discussed in the previous sections arise because of the second-order structure of the objective function (as a function of the parameters), neural networks involve stronger non-linearities that this picture does not capture well. In particular, the second-order Taylor series approximation of the objective function yields a symmetric view of the landscape around the minimum, oriented according to the axes defined by the principal eigenvectors of the Hessian matrix. (TODO: REFER TO A PLOT FROM THE ILL-CONDITIONING SECTION WITH CONTOURS OF VALLEY). Second-order methods and the momentum or gradient-averaging methods introduced in Section 8.2.1 are able to reduce the difficulty due to ill-conditioning by increasing the size of the steps in the low-curvature directions (the "valley", in Figure 8.1) and decreasing the size of the steps in the high-curvature directions (the steep sides of the valley, in the figure).
Figure 8.1: The traditional view of the optimization difficulty in neural networks is
inspired by the ill-conditioning problem in quadratic optimization: some directions have
a high curvature (second derivative), corresponding to the rising sides of the valley, and
other directions have a low curvature, corresponding to the smooth slope of the valley.
Most second-order methods, as well as momentum or gradient-averaging methods, are meant to address that problem by increasing the step size along the direction of the valley (where moving pays off the most in the long run) and decreasing it in the directions of steep rise, which would otherwise lead to oscillations. The objective is to descend smoothly, staying at the bottom of the valley.
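To make the effect of momentum in this "valley" picture concrete, here is a minimal numerical sketch; the quadratic, learning rate, and momentum coefficient are illustrative choices, not values taken from the text.

```python
import numpy as np

# Illustrative ill-conditioned quadratic "valley": f(w) = 0.5 * w^T H w,
# with low curvature along the valley floor and high curvature up its sides.
H = np.diag([1.0, 50.0])

def grad(w):
    return H @ w

def gradient_descent(w, lr=0.019, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum(w, lr=0.019, beta=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)   # accumulate a velocity
        w = w + v
    return w

w0 = np.array([10.0, 1.0])
print("plain GD:", gradient_descent(w0))  # slow progress along the low-curvature axis
print("momentum:", momentum(w0))          # larger effective steps along the valley
```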
However, although classical second-order methods can help, the objective function may have a lot more non-linearity due to higher-order derivatives, as shown in Figure 8.2, and this non-linearity often does not have the nice symmetrical shape that the second-order "valley" picture builds in our mind. Instead, there are cliffs where the gradient rises sharply. When the parameters approach a cliff region, the gradient update step can move the learner towards a very bad configuration, ruining much of the progress made during recent training iterations.
Figure 8.2: Contrary to what is shown in Figure 8.1, the cost function for highly non-
linear deep neural networks or for recurrent neural networks is typically not made of
symmetrical sides. As shown in the figure, there are sharp non-linearities that give
rise to very high derivatives in some places. When the parameters get close to such a
cliff region, a gradient descent update can catapult the parameters very far, possibly
ruining a lot of the optimization work that had been done. Figure graciously provided
by Razvan Pascanu (Pascanu, 2014).
As illustrated in Figure 8.3, the cliff can be dangerous whether we approach it from
above or from below, but fortunately there are some fairly straightforward heuristics
that allow one to avoid its most serious consequences. The basic idea is to limit the size
of the jumps that one would make. Indeed, one should keep in mind that when we use
the gradient to make an update of the parameters, we are relying on the assumption of
infinitesimal moves. There is no guarantee that making a finite step of the parameters
θ in the direction of the gradient will yield an improvement. The only thing that is
guaranteed is that a small enough step in that direction will be helpful. As we can see
from Figure 8.3, in the presence of a cliff (and in general in the presence of very large
gradients), the decrease in the objective function expected from going in the direction
of the gradient is only valid for a very small step. In fact, because the objective function is usually bounded in its actual value (within a finite domain), when the gradient is large at θ, it typically remains so (in particular, keeping its sign) only in a small region around θ. Otherwise, the value of the objective function would have to change a lot: if the slope were consistently large in some direction as we moved along it, we would be able to decrease the objective function value by a very large amount simply by following it, because the total change is the integral over some path of the directional derivatives along that path.
Figure 8.3: To address the presence of cliffs such as the one shown in Figure 8.2, a useful heuristic is to clip the magnitude of the gradient, keeping only its direction when its magnitude is above a threshold (which is a hyper-parameter, although not a very critical one). This helps to avoid the destructive large moves that would otherwise happen when approaching the cliff, either from above or from below. Figure graciously provided by Razvan Pascanu (Pascanu, 2014).
The gradient clipping heuristics are described in more detail in Section 12.6.7. The
basic idea is to bound the magnitude of the update step, i.e., not trust the gradient too
much when it is very large in magnitude. The context in which such cliffs have been
shown to arise in particular is that of recurrent neural networks, when considering long
sequences, as discussed in the next section.
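As a concrete sketch of this heuristic, norm-based clipping can be written as follows; the threshold value and the surrounding update step are illustrative assumptions, not taken from the text.

```python
import numpy as np

def clip_gradient(grad, threshold=1.0):
    """Norm-based gradient clipping: if the gradient norm exceeds the
    threshold, rescale the gradient so its norm equals the threshold,
    keeping only its direction. The threshold is a hyper-parameter."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# Hypothetical usage inside a gradient-descent loop:
# g = compute_gradient(theta)                      # assumed to exist
# theta = theta - learning_rate * clip_gradient(g, threshold=5.0)
```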
8.1.4 Vanishing and Exploding Gradients - An Introduction to the
Issue of Learning Long-Term Dependencies
Parametrized dynamical systems such as recurrent neural networks (Chapter 12) face a
particular optimization problem which is different but related to that of training very
deep networks. We introduce this issue here and refer the reader to Section 12.6 for
a deeper treatment along with a discussion of approaches that have been proposed to
reduce this difficulty.
Exploding or Vanishing Product of Jacobians
The simplest explanation of the problem, which is shared among very deep nets and
recurrent nets, is that in both cases the final output is the composition of a large
number of non-linear transformations. Even though each of these non-linear stages may
be relatively smooth (e.g. the composition of an affine transformation with a hyperbolic
tangent or sigmoid), their composition is going to be much “more non-linear”, in the
sense that derivatives through the whole composition will tend to be either very small or
very large, with more ups and downs. This arises simply because the Jacobian (matrix
of derivatives) of a composition is the product of the Jacobians of each stage, i.e., if
$$f = f_T \circ f_{T-1} \circ \dots \circ f_2 \circ f_1$$
then the Jacobian matrix of derivatives of $f(x)$ with respect to its input vector $x$ is the product
$$f' = f'_T f'_{T-1} \dots f'_2 f'_1 \qquad (8.1)$$
where
$$f' = \frac{\partial f(x)}{\partial x} \qquad \text{and} \qquad f'_t = \frac{\partial f_t(a_t)}{\partial a_t},$$
where $a_t = f_{t-1}(f_{t-2}(\dots f_2(f_1(x))))$, i.e., composition has been replaced by matrix multiplication. This is illustrated in Figure 8.4.
Figure 8.4: When composing many non-linearities (like the activation non-linearity in
a deep or recurrent neural network), the result is highly non-linear, typically with most
of the values associated with a tiny derivative, some values with a large derivative, and
many ups and downs (not shown here).
In the scalar case, we can imagine that the product of many numbers tends to be either very large or very small. In the special case where all the numbers in the product have the same value $\alpha$, this is obvious, since $\alpha^T$ goes to 0 if $\alpha < 1$ and to $\infty$ if $\alpha > 1$, as $T$ increases. The more general case of non-identical numbers can be understood by taking the logarithm of these numbers, considering them to be random, and computing the variance of the sum of these logarithms. Clearly, although some cancellation can happen, the variance grows with $T$, and in fact if those numbers are independent, it grows linearly with $T$, i.e., the size of the sum grows as $\sqrt{T}$, which means that the product grows roughly as $e^{\sqrt{T}}$ (consider the variance of a log-normal variate $X$ if $\log X$ is normal with mean 0 and variance $T$).
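The following small numerical sketch illustrates this argument; the factors are drawn with standard normal logarithms, an illustrative choice not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Product of T independent positive random factors: the log of the product
# is a sum of T independent terms, so the size of that sum grows as sqrt(T)
# and the product itself tends to be either tiny or huge.
for T in (10, 100, 1000):
    log_factors = rng.normal(loc=0.0, scale=1.0, size=T)  # log of each factor
    print(T, np.exp(log_factors.sum()))
# The printed products span many orders of magnitude as T grows,
# even though each individual factor is "moderate".
```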
It would be interesting to push this analysis to the case of multiplying square matrices
instead of multiplying numbers, but one might expect qualitatively similar conclusions,
i.e., the size of the product somehow grows with the number of matrices, and that it
grows exponentially. In the case of matrices, one can get a new form of cancellation due
to leading eigenvectors being well aligned or not. The product of matrices will blow up
only if, among their leading eigenvectors with eigenvalue greater than 1, there is enough
“in common” (in the sense of the appropriate dot products of leading eigenvectors of
one matrix and another).
However, this analysis was for the case where these numbers are independent. In the
case of an ordinary recurrent neural network (developed in more detail in Chapter 12),
these Jacobian matrices are highly related to each other. Each layer-wise Jacobian is
actually the product of two matrices: (a) the recurrent matrix W and (b) the diagonal
matrix whose entries are the derivatives of the non-linearities associated with the
hidden units, which vary depending on the time step. This makes it likely that successive
Jacobians have similar eigenvectors, making the product of these Jacobians explode or
vanish even faster.
Consequence for Recurrent Networks: Difficulty of Learning Long-Term De-
pendencies
The consequence of the exponential convergence of these products of Jacobians towards either very small or very large values is that it makes the learning of long-term dependencies particularly difficult, as we explain below; this issue was first identified independently by Hochreiter (1991) and Bengio et al. (1993, 1994).
Consider a fairly general parametrized dynamical system (which includes classical
recurrent networks as a special case, as well as all their known variants), processing a
sequence of inputs $x_1, \dots, x_t, \dots$, following Eq. 12.2:
$$s_t = F_\theta(s_{t-1}, x_t) \qquad (8.2)$$
where $s_t$ is called the state of the system and $F_\theta$ is the recurrent function that maps the previous state and current input to the next state. The state can be used to produce an output via an output function
$$o_t = g_\omega(s_t), \qquad (8.3)$$
and a loss $L_t$ is computed at each time step $t$ as a function of $o_t$ and possibly of some targets $y_t$. Let us consider the gradient of a loss $L_T$ at time $T$ with respect to the parameters $\theta$ of the recurrent function $F_\theta$. One particular way to decompose the gradient $\frac{\partial L_T}{\partial \theta}$ is the following:
$$\frac{\partial L_T}{\partial \theta} = \sum_{t \le T} \frac{\partial L_T}{\partial s_T} \frac{\partial s_T}{\partial s_t} \frac{\partial F_\theta(s_{t-1}, x_t)}{\partial \theta} \qquad (8.4)$$
where the last Jacobian matrix only accounts for the immediate effect of $\theta$ as a parameter of $F_\theta$ when computing $F_\theta(s_{t-1}, x_t)$, i.e., not taking into account the indirect effect of $\theta$ via $s_{t-1}$ (otherwise there would be double counting and the result would be incorrect). To see that this decomposition is correct, please refer to the notions of gradient computation in a flow graph introduced in Section 6.3, and note that we can construct a graph in which $\theta$ influences each $s_t$, each of which influences $L_T$ via $s_T$. Now let us note that each Jacobian matrix $\frac{\partial s_T}{\partial s_t}$ can be decomposed as follows:
$$\frac{\partial s_T}{\partial s_t} = \frac{\partial s_T}{\partial s_{T-1}} \frac{\partial s_{T-1}}{\partial s_{T-2}} \dots \frac{\partial s_{t+1}}{\partial s_t} \qquad (8.5)$$
which is of the same form as Eq. 8.1 discussed above, i.e., which tends to either vanish
or explode.
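As a rough numerical illustration of Eq. 8.5, consider a vanilla tanh recurrent network, $s_t = \tanh(W s_{t-1} + U x_t)$; the state size and weight scales below are illustrative assumptions, not a prescription. The norm of the accumulated Jacobian product typically shrinks (or grows) exponentially with the span.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: state-to-state Jacobians of a vanilla tanh RNN.
# Each Jacobian is diag(1 - s_t**2) @ W, and the product of many
# of them tends to vanish or explode.
n = 50
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))  # illustrative scale
U = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))

s = np.zeros(n)
J = np.eye(n)   # accumulates the product of Jacobians ds_k/ds_{k-1}
for step in range(1, 101):
    x = rng.normal(size=n)
    s = np.tanh(W @ s + U @ x)
    J = (np.diag(1.0 - s**2) @ W) @ J
    if step % 20 == 0:
        print(step, np.linalg.norm(J))  # norm changes exponentially with the span
```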
As a consequence, we see from Eq. 8.4 that $\frac{\partial L_T}{\partial \theta}$ is a weighted sum of terms over spans $T - t$, with weights that are exponentially smaller (or larger) for longer-term dependencies relating the state at $t$ to the state at $T$. As shown in Bengio et al. (1994), in order for a recurrent network to reliably store memories, the Jacobians $\frac{\partial s_t}{\partial s_{t-1}}$ relating each state to the next must have a determinant that is less than 1 (i.e., giving rise to the formation of attractors in the corresponding dynamical system). Hence, when the
model is able to capture long-term dependencies it is also in a situation where gradients
vanish and long-term dependencies have an exponentially smaller weight than short-term
dependencies in the total gradient. It does not mean that it is impossible to learn, but
that it might take a very long time to learn long-term dependencies, because the signal
about these dependencies will tend to be hidden by the smallest fluctuations arising from
short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that
as we increase the span of the dependencies that need to be captured, gradient-based
optimization becomes increasingly difficult, with the probability of successful learning
rapidly reaching 0 after only 10 or 20 steps in the case of the vanilla recurrent net and
stochastic gradient descent.
For a deeper treatment of the dynamical systems view of recurrent networks, con-
sider Doya (1993); Bengio et al. (1994); Siegelmann and Sontag (1995), with a review
in Pascanu et al. (2013a). Section 12.6 discusses various approaches that have been
proposed to reduce the difficulty of learning long-term dependencies (in some cases al-
lowing to reach to hundreds of steps), but it remains one of the main challenges in deep
learning.
8.2 Optimization algorithms
8.2.1 Approximate Natural Gradient and Second-Order Methods
TODO: Momentum and adaptive learning rates.
TONGA and “actual” NG, links with HF.
8.2.2 Optimization strategies and meta-algorithms
8.2.3 Coordinate descent
In some cases, it may be possible to solve an optimization problem quickly by breaking
it into separate pieces. If we minimize $f(x)$ with respect to a single variable $x_i$, then minimize it with respect to another variable $x_j$ and so on, we are guaranteed to arrive at a
(local) minimum. This practice is known as coordinate descent, because we optimize one
coordinate at a time. More generally, block coordinate descent refers to minimizing with
respect to a subset of the variables simultaneously. The term “coordinate descent” is
often used to refer to block coordinate descent as well as the strictly individual coordinate
descent.
Coordinate descent makes the most sense when the different variables in the opti-
mization problem can be clearly separated into groups that play relatively isolated roles,
or when optimization with respect to one group of variables is significantly more efficient
than optimization with respect to all of the variables. For example, the objective func-
tion most commonly used for sparse coding is not convex. However, we can divide the
inputs to the training algorithm into two sets. Minimizing the objective function with
respect to either one of these sets of variables is a convex problem. Block coordinate
descent thus gives us an optimization strategy that allows us to use efficient convex
optimization algorithms.
Coordinate descent is not a very good strategy when the value of one variable strongly
influences the optimal value of another variable, as in the function $f(\boldsymbol{x}) = (x_1 - x_2)^2 + \alpha\left(x_1^2 + x_2^2\right)$
where α is a positive constant. As α approaches 0, coordinate descent ceases
to make any progress at all, while Newton’s method could solve the problem in a single
step.
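A minimal sketch of this behavior, applying exact coordinate-wise minimization to the function above; the starting point and number of sweeps are illustrative assumptions.

```python
import numpy as np

def f(x1, x2, alpha):
    return (x1 - x2) ** 2 + alpha * (x1 ** 2 + x2 ** 2)

def coordinate_descent(alpha, sweeps=100):
    x1, x2 = 1.0, -1.0
    for _ in range(sweeps):
        # Exact minimization over x1 with x2 fixed:
        # d/dx1 = 2(x1 - x2) + 2*alpha*x1 = 0  =>  x1 = x2 / (1 + alpha)
        x1 = x2 / (1.0 + alpha)
        # Exact minimization over x2 with x1 fixed (by symmetry)
        x2 = x1 / (1.0 + alpha)
    return x1, x2, f(x1, x2, alpha)

for alpha in (1.0, 0.1, 0.001):
    print(alpha, coordinate_descent(alpha))
# As alpha approaches 0, each coordinate update moves x1 and x2 only
# slightly toward each other, so progress toward the minimum at (0, 0)
# becomes extremely slow, while Newton's method would jump there in one step.
```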
8.2.4 Greedy supervised pre-training
TODO
8.3 Hints and Curriculum Learning
TODO