
which is of the same form as Eq. 8.1 discussed above and therefore tends to either
vanish or explode.
As a consequence, we see from Eq. 8.4 that $\partial L_T / \partial \theta$ is a weighted sum of terms over
spans T − t, with weights that are exponentially smaller (or larger) for longer-term
dependencies relating the state at t to the state at T. As shown in Bengio et al. (1994),
in order for a recurrent network to reliably store memories, the Jacobians $\partial s_t / \partial s_{t-1}$ relating
each state to the next must have a determinant that is less than 1 (i.e., leading to
the formation of attractors in the corresponding dynamical system). Hence, when the
model is able to capture long-term dependencies it is also in a situation where gradients
vanish and long-term dependencies have an exponentially smaller weight than short-term
dependencies in the total gradient. It does not mean that it is impossible to learn, but
that it might take a very long time to learn long-term dependencies, because the signal
about these dependencies will tend to be hidden by the smallest fluctuations arising from
short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that
as we increase the span of the dependencies that need to be captured, gradient-based
optimization becomes increasingly difficult, with the probability of successful learning
rapidly reaching 0 after only 10 or 20 steps in the case of the vanilla recurrent net and
stochastic gradient descent.
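To make this exponential scaling concrete, the following minimal NumPy sketch (an
illustration added here, not an experiment from the text) assumes a purely linear
recurrence s_t = W s_{t−1}, so that every Jacobian $\partial s_t / \partial s_{t-1}$ equals W and the product of
Jacobians over a span of length k is W^k. Taking W to be a scaled random orthogonal
matrix makes the behavior easy to read off: all singular values equal the chosen scale.

import numpy as np

# Sketch of the exponential behavior discussed above, under the simplifying
# assumption of a linear recurrence s_t = W s_{t-1}: every Jacobian
# ds_t / ds_{t-1} then equals W, and their product over a span of length k
# is W^k.  W is a scaled random orthogonal matrix, so all of its singular
# values equal `scale` and the norm of the product is exactly scale^k.

rng = np.random.RandomState(0)
dim = 10
Q, _ = np.linalg.qr(rng.randn(dim, dim))  # random orthogonal matrix

def jacobian_product_norm(scale, span):
    W = scale * Q
    J = np.eye(dim)
    for _ in range(span):
        J = W @ J  # accumulate the product of Jacobians over the span
    return np.linalg.norm(J, 2)  # spectral norm of the accumulated product

for scale in (0.9, 1.1):
    for span in (10, 50, 100):
        print("scale=%.1f span=%3d  ||prod J||_2 = %.3e"
              % (scale, span, jacobian_product_norm(scale, span)))

With the scale at 0.9 the norm of the product shrinks as 0.9^span, while at 1.1 it grows
as 1.1^span, matching the vanishing and exploding cases described above.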
For a deeper treatment of the dynamical systems view of recurrent networks, consider
Doya (1993); Bengio et al. (1994); Siegelmann and Sontag (1995), with a review in
Pascanu et al. (2013a). Section 12.6 discusses various approaches that have been
proposed to reduce the difficulty of learning long-term dependencies (in some cases
allowing dependencies spanning hundreds of steps to be captured), but it remains one of
the main challenges in deep learning.
8.2 Optimization algorithms
8.2.1 Approximate Natural Gradient and Second-Order Methods
TODO: Momentum and adaptive learning rates.
TONGA and “actual” NG, links with HF.
8.2.2 Optimization strategies and meta-algorithms
8.2.3 Coordinate descent
In some cases, it may be possible to solve an optimization problem quickly by breaking
it into separate pieces. If we minimize f(x) with respect to a single variable $x_i$, then
minimize it with respect to another variable $x_j$, and so on, we are guaranteed to arrive at a
(local) minimum. This practice is known as coordinate descent, because we optimize one
coordinate at a time. More generally, block coordinate descent refers to minimizing with
respect to a subset of the variables simultaneously. The term “coordinate descent” is
often used to refer to block coordinate descent as well as the strictly individual coordinate
descent.
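As a concrete illustration (a hypothetical example, not drawn from the text), the
following minimal NumPy sketch applies coordinate descent to the convex quadratic
f(x) = 0.5 x^T A x − b^T x. Setting the partial derivative with respect to $x_i$ to zero,
with the other coordinates held fixed, gives the exact single-coordinate minimizer
$x_i = (b_i − \sum_{j \neq i} A_{ij} x_j) / A_{ii}$, which is applied to each coordinate in turn.

import numpy as np

# Hypothetical example: coordinate descent on f(x) = 0.5 x^T A x - b^T x,
# where A is symmetric positive definite.  Each inner step minimizes f
# exactly with respect to a single coordinate, holding the others fixed.

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

x = np.zeros(2)
for sweep in range(25):
    for i in range(len(x)):
        # Exact minimization of f over the single coordinate x_i.
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print("coordinate descent:", x)
print("direct solve:      ", np.linalg.solve(A, b))

For this strictly convex problem the sweeps converge to the global minimizer; in general,
coordinate descent is only guaranteed to reach a local minimum, as stated above.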