Chapter 10
Sequence Modeling: Recurrent
and Recursive Nets
Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family of
neural networks for processing sequential data. Much as a convolutional network
is a neural network that is specialized for processing a grid of values $\mathbf{X}$ such as
an image, a recurrent neural network is a neural network that is specialized for
processing a sequence of values $x^{(1)}, \ldots, x^{(\tau)}$. Just as convolutional networks
can readily scale to images with large width and height, and some convolutional
networks can process images of variable size, recurrent networks can scale to much
longer sequences than would be practical for networks without sequence-based
specialization. Most recurrent networks can also process sequences of variable
length.
To go from multi-layer networks to recurrent networks, we need to take advantage
of one of the early ideas found in machine learning and statistical models of the
1980s: sharing parameters across different parts of a model. Parameter sharing
makes it possible to extend and apply the model to examples of different forms
(different lengths, here) and generalize across them. If we had separate parameters
for each value of the time index, we could not generalize to sequence lengths not
seen during training, nor share statistical strength across different sequence lengths
and across different positions in time. Such sharing is particularly important when
a specific piece of information can occur at multiple positions within the sequence.
For example, consider the two sentences “I went to Nepal in 2009” and “In 2009,
I went to Nepal.” If we ask a machine learning model to read each sentence and
extract the year in which the narrator went to Nepal, we would like it to recognize
the year 2009 as the relevant piece of information, whether it appears in the sixth
word or the second word of the sentence. Suppose that we trained a feedforward
network that processes sentences of fixed length. A traditional fully connected
feedforward network would have separate parameters for each input feature, so it
would need to learn all of the rules of the language separately at each position in
the sentence. By comparison, a recurrent neural network shares the same weights
across several time steps.
A related idea is the use of convolution across a 1-D temporal sequence. This
convolutional approach is the basis for time-delay neural networks (Lang and
Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). The convolution operation
allows a network to share parameters across time, but is shallow. The output
of convolution is a sequence where each member of the output is a function of
a small number of neighboring members of the input. The idea of parameter
sharing manifests in the application of the same convolution kernel at each time
step. Recurrent networks share parameters in a different way. Each member of the
output is a function of the previous members of the output. Each member of the
output is produced using the same update rule applied to the previous outputs.
This recurrent formulation results in the sharing of parameters through a very
deep computational graph.
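To make the contrast concrete, a 1-D convolution shares the same kernel across time, but each output depends only on a small neighborhood of inputs. The sketch below is our own minimal illustration, not part of the original text:

```python
import numpy as np

def conv1d(x, kernel):
    """Shared parameters but shallow: each output depends on only
    len(kernel) neighboring inputs, via the same kernel at every step."""
    k = len(kernel)
    return np.array([x[t:t + k] @ kernel for t in range(len(x) - k + 1)])
```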
For simplicity of exposition, we refer to RNNs as operating on a sequence
that contains vectors $x^{(t)}$ with the time step index $t$ ranging from $1$ to $\tau$. In
practice, recurrent networks usually operate on minibatches of such sequences,
with a different sequence length $\tau$ for each member of the minibatch. We have
omitted the minibatch indices to simplify notation. Moreover, the time step index
need not literally refer to the passage of time in the real world, but only to the
position in the sequence. RNNs may also be applied in two dimensions across
spatial data such as images, and even when applied to data involving time, the
network may have connections that go backwards in time, provided that the entire
sequence is observed before it is provided to the network.
This chapter extends the idea of a computational graph to include cycles. These
cycles represent the influence of the present value of a variable on its own value
at a future time step. Such computational graphs allow us to define recurrent
neural networks. We then describe many different ways to construct, train, and
use recurrent neural networks.
For more information on recurrent neural networks than is available in this
chapter, we refer the reader to the textbook of Graves (2012).
10.1 Unfolding Computational Graphs
A computational graph is a way to formalize the structure of a set of computations,
such as those involved in mapping inputs and parameters to outputs and loss.
Please refer to Sec. 6.5.1 for a general introduction. In this section we explain
the idea of unfolding a recursive or recurrent computation into a computational
graph that has a repetitive structure, typically corresponding to a chain of events.
Unfolding this graph results in the sharing of parameters across a deep network
structure.
For example, consider the classical form of a dynamical system:
$$s^{(t)} = f(s^{(t-1)}; \theta), \tag{10.1}$$
where $s^{(t)}$ is called the state of the system.

Eq. 10.1 is recurrent because the definition of $s$ at time $t$ refers back to the
same definition at time $t - 1$.
For a finite number of time steps $\tau$, the graph can be unfolded by applying the
definition $\tau - 1$ times. For example, if we unfold Eq. 10.1 for $\tau = 3$ time steps, we
obtain
$$s^{(3)} = f(s^{(2)}; \theta) = f(f(s^{(1)}; \theta); \theta) \tag{10.2}$$
Unfolding the equation by repeatedly applying the definition in this way has
yielded an expression that does not involve recurrence. Such an expression can
now be represented by a traditional directed acyclic computational graph. The
unfolded computational graph of Eq. 10.1 and Eq. 10.2 is illustrated in Fig. 10.1.
Figure 10.1: The classical dynamical system described by Eq. 10.1, illustrated as an
unfolded computational graph. Each node represents the state at some time $t$ and the
function $f$ maps the state at $t$ to the state at $t + 1$. The same parameters (the same value
of $\theta$ used to parametrize $f$) are used for all time steps.
As another example, let us consider a dynamical system driven by an external
signal $x^{(t)}$,
$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta), \tag{10.3}$$
where we see that the state now contains information about the whole past sequence.
Recurrent neural networks can be built in many different ways. Much as
almost any function can be considered a feedforward neural network, essentially
any function involving recurrence can be considered a recurrent neural network.
Many recurrent neural networks use Eq. 10.4 or a similar equation to define
the values of their hidden units. To indicate that the state is the hidden units of
the network, we now rewrite Eq. 10.3 using the variable $h$ to represent the state:
$$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta). \tag{10.4}$$
As illustrated in Fig. 10.2, typical RNNs add extra architectural features such as
output layers that read information out of the state $h$ to make predictions.
When the recurrent network is trained to perform a task that requires predicting
the future from the past, the network typically learns to use $h^{(t)}$ as a kind of lossy
summary of the task-relevant aspects of the past sequence of inputs up to $t$. This
summary is in general necessarily lossy, since it maps an arbitrary-length sequence
$(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(2)}, x^{(1)})$ to a fixed-length vector $h^{(t)}$. Depending on the
training criterion, this summary might selectively keep some aspects of the past
sequence with more precision than other aspects. For example, if the RNN is used
in statistical language modeling, typically to predict the next word given previous
words, it may not be necessary to store all of the information in the input sequence
up to time $t$, but rather only enough information to predict the rest of the sentence.
The most demanding situation is when we ask $h^{(t)}$ to be rich enough to allow
one to approximately recover the input sequence, as in autoencoder frameworks
(Chapter 14).
Figure 10.2: A recurrent network with no outputs. This recurrent network just processes
information from the input $x$ by incorporating it into the state $h$ that is passed forward
through time. (Left) Circuit diagram. The black square indicates a delay of 1 time step.
(Right) The same network seen as an unfolded computational graph, where each node is
now associated with one particular time instance.
The RNN of Eq. 10.4 can be drawn in two different ways. One way to draw the RNN is
with a diagram containing one node for every component that might exist in a
physical implementation of the model, such as a biological neural network. In
this view, the network defines a circuit that operates in real time, with physical
parts whose current state can influence their future state, as in the left of Fig. 10.2.
Throughout this chapter, we use a black square in a circuit diagram to indicate
that an interaction takes place with a delay of 1 time step, from the state at time
$t$ to the state at time $t + 1$. The other way to draw the RNN is as an unfolded
computational graph, in which each component is represented by many different
variables, with one variable per time step, representing the state of the component
at that point in time. Each variable for each time step is drawn as a separate node
of the computational graph, as in the right of Fig. 10.2. What we call unfolding is
the operation that maps a circuit as in the left side of the figure to a computational
graph with repeated pieces as in the right side. The unfolded graph now has a size
that depends on the sequence length.
We can represent the unfolded recurrence after $t$ steps with a function $g^{(t)}$:
$$h^{(t)} = g^{(t)}(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(2)}, x^{(1)}) = f(h^{(t-1)}, x^{(t)}; \theta) \tag{10.5}$$
The function $g^{(t)}$ takes the whole past sequence $(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(2)}, x^{(1)})$
as input and produces the current state, but the unfolded recurrent structure
allows us to factorize $g^{(t)}$ into repeated application of a function $f$. The unfolding
process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same
input size, because it is specified in terms of transition from one state to
another state, rather than in terms of a variable-length history of states.

2. It is possible to use the same transition function $f$ with the same parameters
at every time step.
These two factors make it possible to learn a single model $f$ that operates on
all time steps and all sequence lengths, rather than needing to learn a separate
model $g^{(t)}$ for all possible time steps. Learning a single, shared model allows
generalization to sequence lengths that did not appear in the training set, and
allows the model to be estimated with far fewer training examples than would be
required without parameter sharing.
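To make the factorization concrete, the following sketch (our own illustration, not from the text) shows how repeated application of a single transition function $f$ with shared parameters computes $g^{(t)}$:

```python
def unfold(x_seq, h0, f, theta):
    """Unfolded recurrence: applying the same transition f with the same
    parameters theta at every step computes g^(t)(x^(1), ..., x^(t))."""
    h = h0
    for x in x_seq:         # one application of f per time step
        h = f(h, x, theta)
    return h                # the current state h^(t)
```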
Both the recurrent graph and the unrolled graph have their uses. The recurrent
graph is succinct. The unfolded graph provides an explicit description of which
computations to perform. The unfolded graph also helps to illustrate the idea of
information flow forward in time (computing outputs and losses) and backward
in time (computing gradients) by explicitly showing the path along which this
information flows.
10.2 Recurrent Neural Networks
Armed with the graph unrolling and parameter sharing ideas of Sec. 10.1, we can
design a wide variety of recurrent neural networks.
Figure 10.3: The computational graph to compute the training loss of a recurrent network
that maps an input sequence of $x$ values to a corresponding sequence of output $o$ values.
A loss $L$ measures how far each $o$ is from the corresponding training target $y$. When using
softmax outputs, we assume $o$ is the unnormalized log probabilities. The loss $L$ internally
computes $\hat{y} = \mathrm{softmax}(o)$ and compares this to the target $y$. The RNN has input-to-hidden
connections parametrized by a weight matrix $U$, hidden-to-hidden recurrent connections
parametrized by a weight matrix $W$, and hidden-to-output connections parametrized by
a weight matrix $V$. Eq. 10.6 defines forward propagation in this model. (Left) The RNN
and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded
computational graph, where each node is now associated with one particular time instance.
Some examples of important design patterns for recurrent neural networks
include the following:

- Recurrent networks that produce an output at each time step and have
recurrent connections between hidden units, illustrated in Fig. 10.3.

- Recurrent networks that produce an output at each time step and have
recurrent connections only from the output at one time step to the hidden
units at the next time step, illustrated in Fig. 10.4.

- Recurrent networks with recurrent connections between hidden units, that
read an entire sequence and then produce a single output, illustrated in
Fig. 10.5.
Fig. 10.3 is a reasonably representative example that we return to throughout
most of the chapter.
The recurrent neural network of Fig. 10.3 and Eq. 10.6 is universal in the
sense that any function computable by a Turing machine can be computed by such
a recurrent network of a finite size. The output can be read from the RNN after
a number of time steps that is asymptotically linear in the number of time steps
used by the Turing machine and asymptotically linear in the length of the input
(Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995;
Hyotyniemi, 1996). The functions computable by a Turing machine are discrete,
so these results regard exact implementation of the function, not approximations.
The RNN, when used as a Turing machine, takes a binary sequence as input and its
outputs must be discretized to provide a binary output. It is possible to compute all
functions in this setting using a single specific RNN of finite size (Siegelmann and
Sontag (1995) use 886 units). The “input” of the Turing machine is a specification
of the function to be computed, so the same network that simulates this Turing
machine is sufficient for all problems. The theoretical RNN used for the proof
can simulate an unbounded stack by representing its activations and weights with
rational numbers of unbounded precision.
We now develop the forward propagation equations for the RNN depicted in
Fig. 10.3. The figure does not specify the choice of activation function for the
hidden units. Here we assume the hyperbolic tangent activation function. Also,
the figure does not specify exactly what form the output and loss function take.
Here we assume that the output is discrete, as if the RNN is used to predict words
or characters. A natural way to represent discrete variables is to regard the output
$o$ as giving the unnormalized log probabilities of each possible value of the discrete
variable. We can then apply the softmax operation as a post-processing step to
obtain a vector $\hat{y}$ of normalized probabilities over the output. Forward propagation
begins with a specification of the initial state $h^{(0)}$. Then, for each time step from
Figure 10.4: An RNN whose only recurrence is the feedback connection from the output
to the hidden layer. At each time step $t$, the input is $x^{(t)}$, the hidden layer activations are
$h^{(t)}$, the outputs are $o^{(t)}$, the targets are $y^{(t)}$ and the loss is $L^{(t)}$. (Left) Circuit diagram.
(Right) Unfolded computational graph. Such an RNN is less powerful (can express a
smaller set of functions) than those in the family represented by Fig. 10.3. The RNN
in Fig. 10.3 can choose to put any information it wants about the past into its hidden
representation $h$ and transmit $h$ to the future. The RNN in this figure is trained to
put a specific output value into $o$, and $o$ is the only information it is allowed to send
to the future. There are no direct connections from $h$ going forward. The previous $h$
is connected to the present only indirectly, via the predictions it was used to produce.
Unless $o$ is very high-dimensional and rich, it will usually lack important information
from the past. This makes the RNN in this figure less powerful, but it may be easier to
train because each time step can be trained in isolation from the others, allowing greater
parallelization during training, as described in Sec. 10.2.1.
$t = 1$ to $t = \tau$, we apply the following update equations:
$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$
$$h^{(t)} = \tanh(a^{(t)})$$
$$o^{(t)} = c + V h^{(t)}$$
$$\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)}) \tag{10.6}$$
where the parameters are the bias vectors $b$ and $c$ along with the weight matrices
$U$, $V$ and $W$, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden
connections. This is an example of a recurrent network that maps an
input sequence to an output sequence of the same length. The total loss for a
given sequence of $x$ values paired with a sequence of $y$ values would then be just
the sum of the losses over all the time steps. For example, if $L^{(t)}$ is the negative
log-likelihood of $y^{(t)}$ given $x^{(1)}, \ldots, x^{(t)}$, then
$$L\left(\{x^{(1)}, \ldots, x^{(\tau)}\}, \{y^{(1)}, \ldots, y^{(\tau)}\}\right) \tag{10.7}$$
$$= \sum_t L^{(t)} \tag{10.8}$$
$$= -\sum_t \log p_{\mathrm{model}}\left(y^{(t)} \mid \{x^{(1)}, \ldots, x^{(t)}\}\right), \tag{10.9}$$
where $p_{\mathrm{model}}\left(y^{(t)} \mid \{x^{(1)}, \ldots, x^{(t)}\}\right)$ is given by reading the entry for $y^{(t)}$ from the
model's output vector $\hat{y}^{(t)}$. Computing the gradient of this loss function with
respect to the parameters is an expensive operation. The gradient computation
involves performing a forward propagation pass moving left to right through our
illustration of the unrolled graph in Fig. 10.3, followed by a backward propagation
pass moving right to left through the graph. The runtime is $O(\tau)$ and cannot be
reduced by parallelization, because the forward propagation graph is inherently
sequential: each time step may only be computed after the previous one. States
computed in the forward pass must be stored until they are reused during the
backward pass, so the memory cost is also $O(\tau)$. The back-propagation algorithm
applied to the unrolled graph with $O(\tau)$ cost is called back-propagation through
time or BPTT and is discussed further in Sec. 10.2.2. The network with recurrence
between hidden units is thus very powerful but also expensive to train. Is there an
alternative?
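Before considering alternatives, it may help to see Eq. 10.6 and the loss of Eq. 10.9 as code. The following minimal NumPy sketch is our own illustration (parameter shapes assumed), not part of the original text:

```python
import numpy as np

def rnn_forward(x_seq, y_seq, h0, U, V, W, b, c):
    """Forward propagation for the RNN of Eq. 10.6, accumulating the
    negative log-likelihood loss of Eq. 10.9.
    x_seq: list of input vectors; y_seq: list of integer targets."""
    h, total_loss = h0, 0.0
    for x, y in zip(x_seq, y_seq):
        a = b + W @ h + U @ x           # pre-activation a^(t)
        h = np.tanh(a)                  # hidden state h^(t)
        o = c + V @ h                   # unnormalized log probabilities o^(t)
        y_hat = np.exp(o - o.max())     # numerically stabilized softmax
        y_hat /= y_hat.sum()
        total_loss -= np.log(y_hat[y])  # L^(t), summed over time steps
    return total_loss
```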
10.2.1 Teacher Forcing and Networks with Output Recurrence
The network with recurrent connections only from the output at one time step to
the hidden units at the next time step (shown in Fig. 10.4) is strictly less powerful
because it lacks hidden-to-hidden recurrent connections. For example, it cannot
simulate a universal Turing machine. Because this network lacks hidden-to-hidden
recurrence, it requires that the output units capture all of the information about
the past that the network will use to predict the future. Because the output units
are explicitly trained to match the training set targets, they are unlikely to capture
the necessary information about the past history of the input, unless the user
knows how to describe the full state of the system and provides it as part of the
training set targets. The advantage of eliminating hidden-to-hidden recurrence
is that, for any loss function based on comparing the prediction at time $t$ to the
training target at time $t$, all the time steps are decoupled. Training can thus be
parallelized, with the gradient for each step $t$ computed in isolation. There is no
need to compute the output for the previous time step first, because the training
set provides the ideal value of that output.
Figure 10.5: Time-unfolded recurrent neural network with a single output at the end
of the sequence. Such a network can be used to summarize a sequence and produce a
fixed-size representation used as input for further processing. There might be a target
right at the end (as depicted here) or the gradient on the output $o^{(t)}$ can be obtained by
back-propagating from further downstream modules.
Models that have recurrent connections from their outputs leading back into
the model may be trained with teacher forcing. Teacher forcing is a procedure
that emerges from the maximum likelihood criterion, in which during training the
model receives the ground truth output $y^{(t)}$ as input at time $t + 1$. We can see
this by examining a sequence with two time steps. The conditional maximum
Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique that is
applicable to RNNs that have connections from their output to their hidden states at the
next time step. (Left) At train time, we feed the correct output $y^{(t)}$ drawn from the training
set as input to $h^{(t+1)}$. (Right) When the model is deployed, the true output is generally
not known. In this case, we approximate the correct output $y^{(t)}$ with the model's output
$o^{(t)}$, and feed the output back into the model.
likelihood criterion is
$$\log p\left(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}\right) = \log p\left(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}\right) + \log p\left(y^{(1)} \mid x^{(1)}, x^{(2)}\right). \tag{10.10}$$
In this example, we see that at time $t = 2$, the model is trained to maximize the
conditional probability of $y^{(2)}$ given both the $x$ sequence so far and the previous
$y$ value from the training set. Maximum likelihood thus specifies that during training,
rather than feeding the model's own output back into itself, these connections
should be fed with the target values specifying what the correct output should be.
This is illustrated in Fig. 10.6.
We originally motivated teacher forcing as allowing us to avoid back-propagation
through time in models that lack hidden-to-hidden connections. Teacher forcing
may still be applied to models that have hidden-to-hidden connections so long as
they have connections from the output at one time step to values computed in the
next time step. However, as soon as the hidden units become a function of earlier
time steps, the BPTT algorithm is necessary. Some models may thus be trained
with both teacher forcing and BPTT.
The disadvantage of strict teacher forcing arises if the network is going to be
later used in an open-loop mode, with the network outputs (or samples from the
output distribution) fed back as input. In this case, the kind of inputs that the
network sees during training could be quite different from the kind of inputs that
it will see at test time. One way to mitigate this problem is to train with both
teacher-forced inputs and with free-running inputs, for example by predicting the
correct target a number of steps in the future through the unfolded recurrent
output-to-input paths. In this way, the network can learn to take into account
input conditions (such as those it generates itself in the free-running mode) not
seen during training and how to map the state back towards one that will make
the network generate proper outputs after a few steps. Another approach (Bengio
et al., 2015b) to mitigate the gap between the inputs seen at train time and the
inputs seen at test time randomly chooses to use generated values or actual data
values as input. This approach exploits a curriculum learning strategy to gradually
use more of the generated values as input.
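The following sketch (our own, not from the original text; the output-to-hidden matrix name `Wo` and one-hot targets are assumptions) shows a teacher-forced forward pass for an output-recurrent RNN in the style of Fig. 10.4:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward_teacher_forced(x_seq, y_seq, h0, U, Wo, V, b, c):
    """Teacher forcing: the ground-truth previous target y^(t-1), not the
    model's own prediction, feeds into the hidden state at each step, so
    the time steps decouple and can be trained in parallel."""
    h, loss = h0, 0.0
    y_prev = np.zeros(c.shape[0])            # no previous target before t = 1
    for x, y in zip(x_seq, y_seq):           # y: one-hot target vector
        h = np.tanh(b + U @ x + Wo @ y_prev) # condition on the true output
        y_hat = softmax(c + V @ h)
        loss -= np.log(y_hat @ y)            # negative log-likelihood
        y_prev = y                           # at test time: feed y_hat instead
    return loss
```

At test time, the only change is to feed `y_hat` (or a sample from it) back in place of the unavailable target, as in the right panel of Fig. 10.6.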
10.2.2 Computing the Gradient in a Recurrent Neural Network
Computing the gradient through a recurrent neural network is straightforward.
One simply applies the generalized back-propagation algorithm of Sec. 6.5.6 to the
unrolled computational graph. No specialized algorithms are necessary. The use
of back-propagation on the unrolled graph is called the back-propagation through
time (BPTT) algorithm. Gradients obtained by back-propagation may then be
used with any general-purpose gradient-based techniques to train an RNN.
To gain some intuition for how the BPTT algorithm behaves, we provide an
example of how to compute gradients by BPTT for the RNN equations above
(Eq. 10.6 and Eq. 10.7). The nodes of our computational graph include the
parameters $U$, $V$, $W$, $b$ and $c$ as well as the sequence of nodes indexed by $t$ for
$x^{(t)}$, $h^{(t)}$, $o^{(t)}$ and $L^{(t)}$. For each node $N$ we need to compute the gradient $\nabla_N L$
recursively, based on the gradient computed at nodes that follow it in the graph.
We start the recursion with the nodes immediately preceding the final loss:
$$\frac{\partial L}{\partial L^{(t)}} = 1. \tag{10.11}$$
In this derivation we assume that the outputs $o^{(t)}$ are used as the argument to the
softmax function to obtain the vector $\hat{y}$ of probabilities over the output. We also
assume that the loss is the negative log-likelihood of the true target $y^{(t)}$ given the
input so far. The gradient $\nabla_{o^{(t)}} L$ on the outputs at time step $t$, for all $i, t$, is as
follows:
$$\left(\nabla_{o^{(t)}} L\right)_i = \frac{\partial L}{\partial o^{(t)}_i} = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o^{(t)}_i} = \hat{y}^{(t)}_i - \mathbf{1}_{i, y^{(t)}}. \tag{10.12}$$
We work our way backwards, starting from the end of the sequence. At the final
time step $\tau$, $h^{(\tau)}$ only has $o^{(\tau)}$ as a descendent, so its gradient is simple:
$$\nabla_{h^{(\tau)}} L = \left(\nabla_{o^{(\tau)}} L\right) \frac{\partial o^{(\tau)}}{\partial h^{(\tau)}} = \left(\nabla_{o^{(\tau)}} L\right) V. \tag{10.13}$$
We can then iterate backwards in time to back-propagate gradients through time,
from $t = \tau - 1$ down to $t = 1$, noting that $h^{(t)}$ (for $t < \tau$) has as descendents both
$o^{(t)}$ and $h^{(t+1)}$. Its gradient is thus given by
$$\nabla_{h^{(t)}} L = \left(\nabla_{h^{(t+1)}} L\right) \frac{\partial h^{(t+1)}}{\partial h^{(t)}} + \left(\nabla_{o^{(t)}} L\right) \frac{\partial o^{(t)}}{\partial h^{(t)}} \tag{10.14}$$
$$= \left(\nabla_{h^{(t+1)}} L\right) \mathrm{diag}\left(1 - \left(h^{(t+1)}\right)^2\right) W + \left(\nabla_{o^{(t)}} L\right) V \tag{10.15}$$
where $\mathrm{diag}\left(1 - \left(h^{(t+1)}\right)^2\right)$ indicates the diagonal matrix containing the elements
$1 - \left(h^{(t+1)}_i\right)^2$. This is the Jacobian of the hyperbolic tangent associated with the
hidden unit $i$ at time $t + 1$.
Once the gradients on the internal nodes of the computational graph are
obtained, we can obtain the gradients on the parameter nodes, which have descendents
at all the time steps:
$$\nabla_c L = \sum_t \left(\nabla_{o^{(t)}} L\right) \frac{\partial o^{(t)}}{\partial c} = \sum_t \nabla_{o^{(t)}} L$$
$$\nabla_b L = \sum_t \left(\nabla_{h^{(t)}} L\right) \frac{\partial h^{(t)}}{\partial b} = \sum_t \left(\nabla_{h^{(t)}} L\right) \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right)$$
$$\nabla_V L = \sum_t \left(\nabla_{o^{(t)}} L\right) \frac{\partial o^{(t)}}{\partial V} = \sum_t \left(\nabla_{o^{(t)}} L\right) {h^{(t)}}^\top$$
$$\nabla_W L = \sum_t \left(\nabla_{h^{(t)}} L\right) \frac{\partial h^{(t)}}{\partial W} = \sum_t \left(\nabla_{h^{(t)}} L\right) \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) {h^{(t-1)}}^\top$$
$$\nabla_U L = \sum_t \left(\nabla_{h^{(t)}} L\right) \frac{\partial h^{(t)}}{\partial U} = \sum_t \left(\nabla_{h^{(t)}} L\right) \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) {x^{(t)}}^\top$$
We do not need to compute the gradient with respect to $x^{(t)}$ for training because
it does not have any parameters as ancestors in the computational graph defining
the loss.

We are abusing notation somewhat in the above equations. We correctly use
$\nabla_{h^{(t)}} L$ to indicate the full influence of $h^{(t)}$ through all paths from $h^{(t)}$ to $L$. This
is in contrast to our usage of $\frac{\partial h^{(t)}}{\partial W}$ or $\frac{\partial h^{(t)}}{\partial b}$, which we use here in an unconventional
manner. By $\frac{\partial h^{(t)}}{\partial W}$ we refer to the effect of $W$ on $h^{(t)}$ only via the use of $W$ at time
step $t$. This is not standard calculus notation, because the standard definition of
the Jacobian would actually include the complete influence of $W$ on $h^{(t)}$ via its
use in all of the preceding time steps to produce $h^{(t-1)}$. What we refer to here is
in fact the bprop method of Sec. 6.5.6, that computes the contribution of a single
edge in the computational graph to the gradient.
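Putting the recursions above together, the following NumPy sketch (our own illustration, not from the original text) implements BPTT for Eq. 10.6 with integer targets:

```python
import numpy as np

def bptt(x_seq, y_seq, h0, U, V, W, b, c):
    """Back-propagation through time for the RNN of Eq. 10.6.
    Returns the gradients of the total loss for each parameter."""
    # Forward pass: store all states for reuse in the backward pass
    # (this is the O(tau) memory cost discussed above).
    hs, y_hats, h = [h0], [], h0
    for x in x_seq:
        h = np.tanh(b + W @ h + U @ x)
        o = c + V @ h
        e = np.exp(o - o.max())
        hs.append(h)
        y_hats.append(e / e.sum())
    # Backward pass, iterating from t = tau down to t = 1.
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)              # contribution from h^(t+1)
    for t in reversed(range(len(x_seq))):
        do = y_hats[t].copy()
        do[y_seq[t]] -= 1.0                  # Eq. 10.12: y_hat - one_hot(y)
        dh = V.T @ do + dh_next              # Eq. 10.14 / Eq. 10.15
        da = (1.0 - hs[t + 1] ** 2) * dh     # Jacobian of tanh
        dc += do
        dV += np.outer(do, hs[t + 1])
        db += da
        dW += np.outer(da, hs[t])
        dU += np.outer(da, x_seq[t])
        dh_next = W.T @ da                   # propagate through time
    return dU, dV, dW, db, dc
```

A finite-difference comparison against a forward-pass implementation such as `rnn_forward` above is a simple way to check such code.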
10.2.3 Recurrent Networks as Directed Graphical Models
In the example recurrent network we have developed so far, the losses $L^{(t)}$ were
cross-entropies between training targets $y^{(t)}$ and outputs $o^{(t)}$. As with a feedforward
network, it is in principle possible to use almost any loss with a recurrent network.
The loss should be chosen based on the task. As with a feedforward network, we
usually wish to interpret the output of the RNN as a probability distribution, and
we usually use the cross-entropy associated with that distribution to define the loss.
Mean squared error is the cross-entropy loss associated with an output distribution
that is a unit Gaussian, for example, just as with a feedforward network.
When we use a predictive log-likelihood training objective, such as Eq. 10.7, we
train the RNN to estimate the conditional distribution of the next sequence element
$y^{(t)}$ given the past inputs. This may mean that we maximize the log-likelihood
$$\log p(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}), \tag{10.16}$$
or, if the model includes connections from the output at one time step to the next
time step,
$$\log p(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}, y^{(1)}, \ldots, y^{(t-1)}). \tag{10.17}$$
Decomposing the joint probability over the sequence of $y$ values as a series of
one-step probabilistic predictions is one way to capture the full joint distribution
across the whole sequence. When we do not feed past $y$ values as inputs that
condition the next step prediction, the directed graphical model contains no edges
from any $y^{(i)}$ in the past to the current $y^{(t)}$. In this case, the outputs $y$ are
conditionally independent given the sequence of $x$ values. When we do feed the
actual $y$ values (not their prediction, but the actual observed or generated values)
back into the network, the directed graphical model contains edges from all $y^{(i)}$
values in the past to the current $y^{(t)}$ value.
Figure 10.7: Fully connected graphical model for a sequence $y^{(1)}, y^{(2)}, \ldots, y^{(t)}, \ldots$: every
past observation $y^{(i)}$ may influence the conditional distribution of some $y^{(t)}$ (for $t > i$),
given the previous values. Parametrizing the graphical model directly according to this
graph (as in Eq. 10.5) might be very inefficient, with an ever growing number of inputs
and parameters for each element of the sequence. RNNs obtain the same full connectivity
but efficient parametrization, as illustrated in Fig. 10.8.
As a simple example, let us consider the case where the RNN models only a
sequence of scalar random variables $\mathbb{Y} = \{y^{(1)}, \ldots, y^{(\tau)}\}$, with no additional inputs
$x$. The input at time step $t$ is simply the output at time step $t - 1$. The RNN then
defines a directed graphical model over the $y$ variables. We parametrize the joint
distribution of these observations using the chain rule (Eq. 3.6) for conditional
probabilities:
$$P(\mathbb{Y}) = P(y^{(1)}, \ldots, y^{(\tau)}) = \prod_{t=1}^{\tau} P(y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}), \tag{10.18}$$
where the right-hand side of the bar is empty for $t = 1$, of course. Hence the
negative log-likelihood of a set of values $\{y^{(1)}, \ldots, y^{(\tau)}\}$ according to such a model
is
$$L = \sum_t L^{(t)} \tag{10.19}$$
where
$$L^{(t)} = -\log P(\mathrm{y}^{(t)} = y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}). \tag{10.20}$$
Figure 10.8: Introducing the state variable in the graphical model of the RNN, even
though it is a deterministic function of its inputs, helps to see how we can obtain a very
efficient parametrization, based on Eq. 10.4. Every stage in the sequence (for $h^{(t)}$ and
$y^{(t)}$) involves the same structure (the same number of inputs for each node) and can share
the same parameters with the other stages.
The edges in a graphical model indicate which variables depend directly on other
variables. Many graphical models aim to achieve statistical and computational
efficiency by omitting edges that do not correspond to strong interactions. For
example, it is common to make the Markov assumption that the graphical model
should only contain edges from $\{y^{(t-k)}, \ldots, y^{(t-1)}\}$ to $y^{(t)}$, rather than containing
edges from the entire past history. However, in some cases, we believe that all past
inputs should have an influence on the next element of the sequence. RNNs are
useful when we believe that the distribution over $y^{(t)}$ may depend on a value of $y^{(i)}$
from the distant past in a way that is not captured by the effect of $y^{(i)}$ on $y^{(t-1)}$.
One way to interpret an RNN as a graphical model is to view the RNN as
defining a graphical model whose structure is the complete graph, able to represent
direct dependencies between any pair of $y$ values. The graphical model over the $y$
values with the complete graph structure is shown in Fig. 10.7. The complete
graph interpretation of the RNN is based on ignoring the hidden units $h^{(t)}$ by
marginalizing them out of the model.
It is more interesting to consider the graphical model structure of RNNs that
results from regarding the hidden units $h^{(t)}$ as random variables.¹ Including the
hidden units in the graphical model reveals that the RNN provides a very efficient
parametrization of the joint distribution over the observations. Suppose that we
represented an arbitrary joint distribution over discrete values with a tabular
representation: an array containing a separate entry for each possible assignment
of values, with the value of that entry giving the probability of that assignment
occurring. If $y$ can take on $k$ different values, the tabular representation would
have $O(k^\tau)$ parameters. By comparison, due to parameter sharing, the number
of parameters in the RNN is $O(1)$ as a function of sequence length. The number
of parameters in the RNN may be adjusted to control model capacity but is not
forced to scale with sequence length. Eq. 10.4 shows that the RNN parametrizes
long-term relationships between variables efficiently, using recurrent applications
of the same function $f$ and same parameters $\theta$ at each time step. Fig. 10.8
illustrates the graphical model interpretation. Incorporating the $h^{(t)}$ nodes in
the graphical model decouples the past and the future, acting as an intermediate
quantity between them. A variable $y^{(i)}$ in the distant past may influence a variable
$y^{(t)}$ via its effect on $h$. The structure of this graph shows that the model can be
efficiently parametrized by using the same conditional probability distributions at
each time step, and that when the variables are all observed, the probability of the
joint assignment of all variables can be evaluated efficiently.
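As a concrete illustration of the gap (our own numbers, not from the text): with $k = 10$ possible values per step and $\tau = 20$ steps, a tabular representation needs on the order of $10^{20}$ entries, whereas an RNN with a few hundred hidden units has on the order of $10^5$ parameters, independent of $\tau$.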
Even with the efficient parametrization of the graphical model, some operations
remain computationally challenging. For example, it is difficult to predict missing
values in the middle of the sequence.
The price recurrent networks pay for their reduced number of parameters is
that optimizing the parameters may be difficult.
The parameter sharing used in recurrent networks relies on the assumption
that the same parameters can be used for different time steps. Equivalently, the
assumption is that the conditional probability distribution over the variables at
time $t + 1$ given the variables at time $t$ is stationary, meaning that the relationship
between the previous time step and the next time step does not depend on $t$. In
principle, it would be possible to use $t$ as an extra input at each time step and let
the learner discover any time-dependence while sharing as much as it can between
different time steps. This would already be much better than using a different
conditional probability distribution for each $t$, but the network would then have to
extrapolate when faced with new values of $t$.

¹ The conditional distribution over these variables given their parents is deterministic. This is
perfectly legitimate, though it is somewhat rare to design a graphical model with such deterministic
hidden units.
To complete our view of an RNN as a graphical model, we must describe how
to draw samples from the model. The main operation that we need to perform is
simply to sample from the conditional distribution at each time step. However,
there is one additional complication. The RNN must have some mechanism for
determining the length of the sequence. This can be achieved in various ways.
In the case when the output is a symbol taken from a vocabulary, one can
add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012).
When that symbol is generated, the sampling process stops. In the training set,
we insert this symbol as an extra member of the sequence, immediately after $x^{(\tau)}$
in each training example.
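As an illustration of this extra-symbol approach, the following sketch (our own; `step` is an assumed function returning the next state and next-symbol distribution) samples until the end-of-sequence symbol is drawn:

```python
import numpy as np

EOS = 0  # assumed index of the special end-of-sequence symbol

def sample_sequence(step, h0, max_len=1000, seed=None):
    """Draw one sequence from the model, stopping at the EOS symbol.
    step(h, y_prev) -> (new state, probability vector over symbols)."""
    rng = np.random.default_rng(seed)
    h, y, seq = h0, EOS, []
    for _ in range(max_len):                 # cap guards against non-termination
        h, probs = step(h, y)
        y = rng.choice(len(probs), p=probs)  # sample the conditional at step t
        if y == EOS:
            break
        seq.append(y)
    return seq
```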
Another option is to introduce an extra Bernoulli output to the model that
represents the decision to either continue generation or halt generation at each
time step. This approach is more general than the approach of adding an extra
symbol to the vocabulary, because it may be applied to any RNN, rather than
only RNNs that output a sequence of symbols. For example, it may be applied to
an RNN that emits a sequence of real numbers. The new output unit is usually a
sigmoid unit trained with the cross-entropy loss. In this approach the sigmoid is
trained to maximize the log-probability of the correct prediction as to whether the
sequence ends or continues at each time step.
Another way to determine the sequence length $\tau$ is to add an extra output to
the model that predicts the integer $\tau$ itself. The model can sample a value of $\tau$
and then sample $\tau$ steps worth of data. This approach requires adding an extra
input to the recurrent update at each time step so that the recurrent update is
aware of whether it is near the end of the generated sequence. This extra input
can either consist of the value of $\tau$ or can consist of $\tau - t$, the number of remaining
time steps. Without this extra input, the RNN might generate sequences that
end abruptly, such as a sentence that ends before it is complete. This approach is
based on the decomposition
$$P(x^{(1)}, \ldots, x^{(\tau)}) = P(\tau) P(x^{(1)}, \ldots, x^{(\tau)} \mid \tau). \tag{10.21}$$
The strategy of predicting $\tau$ directly is used for example by Goodfellow et al. (2014d).
10.2.4 Modeling Sequences Conditioned on Context with RNNs
In the previous section we described how an RNN could correspond to a directed
graphical model over a sequence of random variables $y^{(t)}$ with no inputs $x$. Of
course, our development of RNNs as in Eq. 10.6 included a sequence of inputs
$x^{(1)}, x^{(2)}, \ldots, x^{(\tau)}$. In general, RNNs allow the extension of the graphical model
view to represent not only a joint distribution over the $y$ variables but also a
conditional distribution over $y$ given $x$. As discussed in the context of feedforward
networks in Sec. 6.2.1.1, any model representing a variable $P(y; \theta)$ can be reinterpreted
as a model representing a conditional distribution $P(y \mid \omega)$ with $\omega = \theta$. We
can extend such a model to represent a distribution $P(y \mid x)$ by using the same
$P(y \mid \omega)$ as before, but making $\omega$ a function of $x$. In the case of an RNN, this
can be achieved in different ways. We review here the most common and obvious
choices.
Previously, we have discussed RNNs that take a sequence of vectors $x^{(t)}$ for
$t = 1, \ldots, \tau$ as input. Another option is to take only a single vector $x$ as input.
When $x$ is a fixed-size vector, we can simply make it an extra input of the RNN
that generates the $y$ sequence. Some common ways of providing an extra input to
an RNN are:

1. as an extra input at each time step, or

2. as the initial state $h^{(0)}$, or

3. both.
The first and most common approach is illustrated in Fig. 10.9. The interaction
between the input $x$ and each hidden unit vector $h^{(t)}$ is parametrized by a newly
introduced weight matrix $R$ that was absent from the model of only the sequence
of $y$ values. The same product $x^\top R$ is added as additional input to the hidden
units at every time step. We can think of the choice of $x$ as determining the value
of $x^\top R$ that is effectively a new bias parameter used for each of the hidden units.
The weights remain independent of the input. We can think of this model as taking
the parameters $\theta$ of the non-conditional model and turning them into $\omega$, where
the bias parameters within $\omega$ are now a function of the input.
Rather than receiving only a single vector $x$ as input, the RNN may receive a
sequence of vectors $x^{(t)}$ as input. The RNN described in Eq. 10.6 corresponds to a
conditional distribution $P(y^{(1)}, \ldots, y^{(\tau)} \mid x^{(1)}, \ldots, x^{(\tau)})$ that makes a conditional
independence assumption that this distribution factorizes as
$$\prod_t P(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}). \tag{10.22}$$
Figure 10.9: An RNN that maps a fixed-length vector $x$ into a distribution over sequences
$Y$. This RNN is appropriate for tasks such as image captioning, where a single image is
used as input to a model that then produces a sequence of words describing the image.
Each element $y^{(t)}$ of the observed output sequence serves both as input (for the current
time step) and, during training, as target (for the previous time step).
Figure 10.10: A conditional recurrent neural network mapping a variable-length sequence
of $x$ values into a distribution over sequences of $y$ values of the same length. Compared
to Fig. 10.3, this RNN contains connections from the previous output to the current state.
These connections allow this RNN to model an arbitrary distribution over sequences of $y$
given sequences of $x$ of the same length. The RNN of Fig. 10.3 is only able to represent
distributions in which the $y$ values are conditionally independent from each other given
the $x$ values.
To remove the conditional independence assumption, we can add connections from
the output at time $t$ to the hidden unit at time $t + 1$, as shown in Fig. 10.10. The
model can then represent arbitrary probability distributions over the $y$ sequence.
This kind of model representing a distribution over a sequence given another
sequence still has one restriction, which is that the length of both sequences must
be the same. We describe how to remove this restriction in Sec. 10.4.
Figure 10.11: Computation of a typical bidirectional recurrent neural network, meant
to learn to map input sequences $x$ to target sequences $y$, with loss $L^{(t)}$ at each step $t$.
The $h$ recurrence propagates information forward in time (towards the right) while the $g$
recurrence propagates information backward in time (towards the left). Thus at each
point $t$, the output units $o^{(t)}$ can benefit from a relevant summary of the past in its $h^{(t)}$
input and from a relevant summary of the future in its $g^{(t)}$ input.
10.3 Bidirectional RNNs
All of the recurrent networks we have considered up to now have a "causal" structure,
meaning that the state at time $t$ only captures information from the past,
$x^{(1)}, \ldots, x^{(t-1)}$, and the present input $x^{(t)}$. Some of the models we have discussed
also allow information from past $y$ values to affect the current state when the $y$
values are available.
However, in many applications we want to output a prediction of $y^{(t)}$ which
may depend on the whole input sequence. For example, in speech recognition,
the correct interpretation of the current sound as a phoneme may depend on the
next few phonemes because of co-articulation and potentially may even depend on
the next few words because of the linguistic dependencies between nearby words: if
there are two interpretations of the current word that are both acoustically plausible,
we may have to look far into the future (and the past) to disambiguate them.
This is also true of handwriting recognition and many other sequence-to-sequence
learning tasks, described in the next section.
Bidirectional recurrent neural networks (or bidirectional RNNs) were invented
to address that need (Schuster and Paliwal, 1997). They have been extremely suc-
cessful (Graves, 2012) in applications where that need arises, such as handwriting
recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recogni-
tion (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi
et al., 1999).
As the name suggests, bidirectional RNNs combine an RNN that moves forward
through time beginning from the start of the sequence with another RNN that
moves backward through time beginning from the end of the sequence. Fig. 10.11
illustrates the typical bidirectional RNN, with $h^{(t)}$ standing for the state of the
sub-RNN that moves forward through time and $g^{(t)}$ standing for the state of the
sub-RNN that moves backward through time. This allows the output units $o^{(t)}$ to
compute a representation that depends on both the past and the future but
is most sensitive to the input values around time $t$, without having to specify a
fixed-size window around $t$ (as one would have to do with a feedforward network,
a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).
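A minimal sketch of the bidirectional forward pass (our own illustration; the concatenation readout and parameter names are assumptions, not from the text):

```python
import numpy as np

def birnn_forward(x_seq, Wf, Uf, bf, Wb, Ub, bb, V, c):
    """Bidirectional RNN: h^(t) summarizes the past, g^(t) the future;
    the output at each step reads from both states."""
    hs, h = [], np.zeros(bf.shape[0])
    for x in x_seq:                        # forward-in-time recurrence
        h = np.tanh(bf + Wf @ h + Uf @ x)
        hs.append(h)
    gs, g = [None] * len(x_seq), np.zeros(bb.shape[0])
    for t in reversed(range(len(x_seq))):  # backward-in-time recurrence
        g = np.tanh(bb + Wb @ g + Ub @ x_seq[t])
        gs[t] = g
    # o^(t) depends on the whole sequence, most sensitive around time t
    return [c + V @ np.concatenate([hs[t], gs[t]]) for t in range(len(x_seq))]
```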
This idea can be naturally extended to 2-dimensional input, such as images,
by having four RNNs, each one going in one of the four directions: up, down,
left, right. At each point $(i, j)$ of a 2-D grid, an output $O_{i,j}$ could then compute a
representation that would capture mostly local information but could also depend
on long-range inputs, if the RNN is able to learn to carry that information.
Compared to a convolutional network, RNNs applied to images are typically more
expensive but allow for long-range lateral interactions between features in the
same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the
forward propagation equations for such RNNs may be written in a form that shows
they use a convolution that computes the bottom-up input to each layer, prior
to the recurrent propagation across the feature map that incorporates the lateral
interactions.
10.4 Encoder-Decoder Sequence-to-Sequence Architectures
We have seen in Fig. 10.5 how an RNN can map an input sequence to a fixed-size
vector. We have seen in Fig. 10.9 how an RNN can map a fixed-size vector to a
sequence. We have seen in Fig. 10.3, Fig. 10.4, Fig. 10.10 and Fig. 10.11 how an
RNN can map an input sequence to an output sequence of the same length.
Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture,
for learning to generate an output sequence $(y^{(1)}, \ldots, y^{(n_y)})$ given an input sequence
$(x^{(1)}, x^{(2)}, \ldots, x^{(n_x)})$. It is composed of an encoder RNN that reads the input sequence
and a decoder RNN that generates the output sequence (or computes the probability of a
given output sequence). The final hidden state of the encoder RNN is used to compute a
generally fixed-size context variable $C$ which represents a semantic summary of the input
sequence and is given as input to the decoder RNN.
Here we discuss how an RNN can be trained to map an input sequence to an
output sequence which is not necessarily of the same length. This comes up in
many applications, such as speech recognition, machine translation or question
answering, where the input and output sequences in the training set are generally
not of the same length (although their lengths might be related).
We often call the input to the RNN the "context." We want to produce a
representation of this context, $C$. The context $C$ might be a vector or sequence of
vectors that summarize the input sequence $X = (x^{(1)}, \ldots, x^{(n_x)})$.
The simplest RNN architecture for mapping a variable-length sequence to
another variable-length sequence was first proposed by Cho et al. (2014a) and
shortly after by Sutskever et al. (2014b), who independently developed that
architecture and were the first to obtain state-of-the-art translation using this
approach. The former system is based on scoring proposals generated by another
machine translation system, while the latter uses a standalone recurrent network
to generate the translations. These authors respectively called this architecture,
illustrated in Fig. 10.12, the encoder-decoder or sequence-to-sequence architecture.
The idea is very simple: (1) an encoder or reader or input RNN processes the
input sequence. The encoder emits the context $C$, usually as a simple function
of its final hidden state. (2) a decoder or writer or output RNN is conditioned on
that fixed-length vector (just like in Fig. 10.9) to generate the output sequence
$Y = (y^{(1)}, \ldots, y^{(n_y)})$. The innovation of this kind of architecture over those
presented in earlier sections of this chapter is that the lengths $n_x$ and $n_y$ can
vary from each other, while previous architectures constrained $n_x = n_y = \tau$. In a
sequence-to-sequence architecture, the two RNNs are trained jointly to maximize
the average of $\log P(y^{(1)}, \ldots, y^{(n_y)} \mid x^{(1)}, \ldots, x^{(n_x)})$ over all the pairs of $x$ and $y$
sequences in the training set. The last state $h_{n_x}$ of the encoder RNN is typically
used as a representation $C$ of the input sequence that is provided as input to the
decoder RNN.
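The data flow can be sketched as follows (our own minimal illustration, not from the text; for simplicity it assumes equal encoder and decoder state sizes and omits feeding the previous output $y^{(t-1)}$ back into the decoder, which a practical decoder would include as in Fig. 10.9):

```python
import numpy as np

def encoder_decoder_forward(x_seq, n_y, enc_params, dec_params):
    """Sequence-to-sequence forward pass: the encoder's final hidden
    state becomes the context C, used as the decoder's initial state."""
    Ue, We, be = enc_params
    Wd, Vd, bd, cd = dec_params
    # Encoder ("reader"): consume the whole variable-length input.
    h = np.zeros(be.shape[0])
    for x in x_seq:
        h = np.tanh(be + We @ h + Ue @ x)
    C = h                            # fixed-size semantic summary of x_seq
    # Decoder ("writer"): generate n_y output distributions, seeded by C.
    s, outputs = C, []
    for _ in range(n_y):
        s = np.tanh(bd + Wd @ s)     # decoder recurrence
        outputs.append(cd + Vd @ s)  # unnormalized log probabilities
    return outputs
```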
If the context
C
is a vector, then the decoder RNN is simply a vector-to-
sequence RNN as described in Sec. 10.2.4. As we have seen, there are at least two
ways for a vector-to-sequence RNN to receive input. The input can be provided as
the initial state of the RNN, or the input can be connected to the hidden units at
each time step. These two ways can also be combined.
There is no constraint that the encoder must have the same size of hidden layer
as the decoder.
One clear limitation of this architecture is when the context $C$ output by the
encoder RNN has a dimension that is too small to properly summarize a long
sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context
of machine translation. They proposed to make $C$ a variable-length sequence rather
than a fixed-size vector. Additionally, they introduced an attention mechanism
that learns to associate elements of the sequence $C$ to elements of the output
sequence. See Sec. 12.4.5.1 for more details.
10.5 Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of parameters
and associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
With the RNN architecture of Fig. 10.3, each of these three blocks is associated
with a single weight matrix. In other words, when the network is unfolded, each
of these corresponds to a shallow transformation. By a shallow transformation,
we mean a transformation that would be represented by a single layer within
a deep MLP. Typically this is a transformation represented by a learned affine
transformation followed by a fixed nonlinearity.
Would it be advantageous to introduce depth in each of these operations?
Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests
so. The experimental evidence is in agreement with the idea that we need enough
depth in order to perform the required mappings. See also Schmidhuber (1992),
El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs.
Graves et al. (2013) were the first to show a significant benefit of decomposing
the state of an RNN into multiple layers as in Fig. 10.13 (left). We can think
of the lower layers in the hierarchy depicted in Fig. 10.13a as playing a role in
transforming the raw input into a representation that is more appropriate, at
the higher levels of the hidden state. Pascanu et al. (2014a) go a step further
and propose to have a separate MLP (possibly deep) for each of the three blocks
enumerated above, as illustrated in Fig. 10.13b. Considerations of representational
capacity suggest to allocate enough capacity in each of these three steps, but doing
so by adding depth may hurt learning by making optimization difficult. In general,
it is easier to optimize shallower architectures, and adding the extra depth of
Fig. 10.13b makes the shortest path from a variable in time step $t$ to a variable
in time step $t + 1$ become longer. For example, if an MLP with a single hidden
Figure 10.13: A recurrent neural network can be made deep in many ways (Pascanu
et al., 2014a). (a) The hidden recurrent state can be broken down into groups organized
hierarchically. (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-
hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest
path linking different time steps. (c) The path-lengthening effect can be mitigated by
introducing skip connections.
layer is used for the state-to-state transition, we have doubled the length of the
shortest path between variables in any two different time steps, compared with the
ordinary RNN of Fig. 10.3. However, as argued by Pascanu et al. (2014a), this
can be mitigated by introducing skip connections in the hidden-to-hidden path, as
illustrated in Fig. 10.13c.
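As an illustration of the hierarchy of Fig. 10.13a (our own sketch; the particular stacking scheme is one of several possibilities), a two-layer state update might look like:

```python
import numpy as np

def deep_rnn_step(h1, h2, x, params):
    """One time step of a two-layer deep RNN: the lower layer transforms
    the raw input, and the upper layer reads the lower layer's new state."""
    (W1, U1, b1), (W2, U2, b2) = params
    h1 = np.tanh(b1 + W1 @ h1 + U1 @ x)   # layer 1: closer to the raw input
    h2 = np.tanh(b2 + W2 @ h2 + U2 @ h1)  # layer 2: higher-level state
    return h1, h2
```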
10.6 Recursive Neural Networks
Figure 10.14: A recursive network has a computational graph that generalizes that of the
recurrent network from a chain to a tree. A variable-size sequence $x^{(1)}, x^{(2)}, \ldots, x^{(t)}$ can
be mapped to a fixed-size representation (the output $o$), with a fixed set of parameters
(the weight matrices $U$, $V$, $W$). The figure illustrates a supervised learning case in which
some target $y$ is provided which is associated with the whole sequence.
Recursive neural networks² represent yet another generalization of recurrent
networks, with a different kind of computational graph, which is structured as a
deep tree, rather than the chain-like structure of RNNs. The typical computational
graph for a recursive network is illustrated in Fig. 10.14. Recursive neural networks
were introduced by Pollack (1990) and their potential use for learning to reason
was described by Bottou (2011). Recursive networks have been successfully
applied to processing data structures as input to neural nets (Frasconi et al.,
1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well
as in computer vision (Socher et al., 2011b).

² We suggest not abbreviating "recursive neural network" as "RNN", to avoid confusion with
"recurrent neural network."
One clear advantage of recursive nets over recurrent nets is that for a sequence
of the same length τ, the depth (measured as the number of compositions of
nonlinear operations) can be drastically reduced from τ to O(log τ), which might
help deal with long-term dependencies. An open question is how to best structure
the tree. One option is to have a tree structure that does not depend on the data,
such as a balanced binary tree. In some application domains, external methods
can suggest the appropriate tree structure. For example, when processing natural
language sentences, the tree structure for the recursive network can be fixed to
the structure of the parse tree of the sentence provided by a natural language
parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to
discover and infer the tree structure that is appropriate for any given input, as
suggested by Bottou (2011).
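To make the shared-parameter tree computation concrete, the following is a
minimal sketch in Python/NumPy of a recursive network over a balanced binary
tree. The shapes, the parameter names U, W, V and the helper represent are
illustrative assumptions, not notation from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 4, 8
U = rng.normal(scale=0.1, size=(n_h, n_in))     # leaf (input-to-hidden) weights
W = rng.normal(scale=0.1, size=(n_h, 2 * n_h))  # shared composition weights
V = rng.normal(scale=0.1, size=(1, n_h))        # hidden-to-output weights
b = np.zeros(n_h)

def represent(node):
    """Return the hidden representation of a (sub)tree.

    A node is either a leaf input vector or a (left, right) pair of subtrees.
    The same W and b are reused at every internal node: this is the parameter
    sharing that distinguishes recursive nets.
    """
    if isinstance(node, np.ndarray):            # leaf: embed the raw input
        return np.tanh(U @ node + b)
    left, right = node
    children = np.concatenate([represent(left), represent(right)])
    return np.tanh(W @ children + b)

# A balanced binary tree over four input vectors x(1)..x(4), as in Fig. 10.14.
x = [rng.normal(size=n_in) for _ in range(4)]
tree = ((x[0], x[1]), (x[2], x[3]))
o = V @ represent(tree)    # fixed-size output for the whole variable-size sequence
print(o)
```

For a balanced tree over τ leaves, the recursion depth is O(log τ), matching the
depth reduction described above.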
Many variants of the recursive net idea are possible. For example, Frasconi
et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure,
and associate the inputs and targets with individual nodes of the tree. The
computation performed by each node does not have to be the traditional artificial
neuron computation (affine transformation of all inputs followed by a monotone
nonlinearity). For example, Socher et al. (2013a) propose using tensor operations
and bilinear forms, which have previously been found useful to model relationships
between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are
represented by continuous vectors (embeddings).
10.7 The Challenge of Long-Term Dependencies
The mathematical challenge of learning long-term dependencies in recurrent net-
works was introduced in Sec. 8.2.5. The basic problem is that gradients propagated
over many stages tend to either vanish (most of the time) or explode (rarely, but
with much damage to the optimization). Even if we assume that the parameters are
such that the recurrent network is stable (can store memories, with gradients not
exploding), the difficulty with long-term dependencies arises from the exponentially
smaller weights given to long-term interactions (involving the multiplication of
many Jacobians) compared to short-term ones. Many other sources provide a
deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al.,
2013a). In this section, we describe the problem in more detail. The remaining
sections describe approaches to overcoming the problem.

Figure 10.15: When composing many nonlinear functions (like the linear-tanh layer shown
here), the result is highly nonlinear, typically with most of the values associated with a tiny
derivative, some values with a large derivative, and many alternations between increasing
and decreasing. In this plot, we plot a linear projection of a 100-dimensional hidden state
down to a single dimension, plotted on the y-axis. The x-axis is the coordinate of the
initial state along a random direction in the 100-dimensional space. We can thus view this
plot as a linear cross-section of a high-dimensional function. The plots show the function
after each time step, or equivalently, after each number of times the transition function
has been composed.
Recurrent networks involve the composition of the same function multiple
times, once per time step. These compositions can result in extremely nonlinear
behavior, as illustrated in Fig. 10.15.
In particular, the function composition employed by recurrent neural networks
somewhat resembles matrix multiplication. We can think of the recurrence relation

$$h^{(t)} = W^\top h^{(t-1)} \tag{10.23}$$

as a very simple recurrent neural network lacking a nonlinear activation function,
and lacking inputs x. As described in Sec. 8.2.5, this recurrence relation essentially
describes the power method. It may be simplified to

$$h^{(t)} = \left(W^t\right)^\top h^{(0)}, \tag{10.24}$$

and if W admits an eigendecomposition

$$W = Q \Lambda Q^\top, \tag{10.25}$$

with orthogonal Q, the recurrence may be simplified further to

$$h^{(t)} = Q^\top \Lambda^t Q h^{(0)}. \tag{10.26}$$

The eigenvalues are raised to the power of t, causing eigenvalues with magnitude
less than one to decay to zero and eigenvalues with magnitude greater than one to
explode. Any component of h^{(0)} that is not aligned with the largest eigenvector
will eventually be discarded.
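As a small numerical illustration (an assumed example, not from the text above),
we can construct a W with known eigenvalues and watch the eigenbasis coefficients
of h^{(t)} scale as λ^t:

```python
import numpy as np

# Build W = Q diag(lam) Q^T with known eigenvalues, then watch the eigenbasis
# coefficients of h^(t) = W^t h^(0) grow or decay as lam ** t (Eq. 10.26).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # orthogonal eigenvector matrix
lam = np.array([1.1, 0.9, 0.5])
W = Q @ np.diag(lam) @ Q.T                      # symmetric here, so W^T = W

h0 = rng.normal(size=3)
for t in (1, 10, 50):
    ht = np.linalg.matrix_power(W, t) @ h0
    print(t, (Q.T @ ht) / (Q.T @ h0))           # elementwise ratio equals lam ** t
```

After 50 steps the coefficient along the eigenvalue-1.1 direction has grown by
more than a factor of 100, while the others have shrunk toward zero.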
This problem is particular to recurrent networks. In the scalar case, imagine
multiplying a weight w by itself many times. The product w^t will either vanish or
explode depending on the magnitude of w. However, if we make a non-recurrent
network that has a different weight w^{(t)} at each time step, the situation is different.
If the initial state is given by 1, then the state at time t is given by $\prod_t w^{(t)}$. Suppose
that the w^{(t)} values are generated randomly, independently from one another, with
zero mean and variance v. The variance of the product is O(v^n). To obtain some
desired variance v^* we may choose the individual weights with variance $v = \sqrt[n]{v^*}$.
Very deep feedforward networks with carefully chosen scaling can thus avoid the
vanishing and exploding gradient problem, as argued by Sussillo (2014).
The vanishing and exploding gradient problem for RNNs was independently
discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994).
One may hope that the problem can be avoided simply by staying in a region of
parameter space where the gradients do not vanish or explode. Unfortunately, in
order to store memories in a way that is robust to small perturbations, the RNN
must enter a region of parameter space where gradients vanish (Bengio et al., 1993,
1994). Specifically, whenever the model is able to represent long-term dependencies,
the gradient of a long-term interaction has exponentially smaller magnitude than
the gradient of a short-term interaction. This does not mean that it is impossible
to learn, but that it might take a very long time to learn long-term dependencies,
because the signal about these dependencies will tend to be hidden by the smallest
fluctuations arising from short-term dependencies. In practice, the experiments
in Bengio et al. (1994) show that as we increase the span of the dependencies that
need to be captured, gradient-based optimization becomes increasingly difficult,
with the probability of successful training of a traditional RNN via SGD rapidly
reaching 0 for sequences of only length 10 or 20.
For a deeper treatment of the dynamical systems view of recurrent networks,
see Doya (1993), Bengio et al. (1994) and Siegelmann and Sontag (1995), with a
review in Pascanu et al. (2013a). The remaining sections of this chapter discuss
various approaches that have been proposed to reduce the difficulty of learning
long-term dependencies (in some cases allowing an RNN to learn dependencies
across hundreds of steps), but the problem of learning long-term dependencies
remains one of the main challenges in deep learning.
10.8 Echo State Networks
The recurrent weights mapping from h^{(t-1)} to h^{(t)} and the input weights mapping
from x^{(t)} to h^{(t)} are some of the most difficult parameters to learn in a recurrent
network. One proposed approach (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas,
2004; Jaeger, 2007b) to avoiding this difficulty is to set the recurrent weights
such that the recurrent hidden units do a good job of capturing the history of
past inputs, and only learn the output weights. This is the idea that was
independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004;
Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar,
except that it uses spiking neurons (with binary outputs) instead of the continuous-valued
hidden units used for ESNs. Both ESNs and liquid state machines are
termed reservoir computing (Lukoševičius and Jaeger, 2009) to denote the fact
that the hidden units form a reservoir of temporal features which may capture
different aspects of the history of inputs.
One way to think about these reservoir computing recurrent networks is that
they are similar to kernel machines: they map an arbitrary-length sequence (the
history of inputs up to time t) into a fixed-length vector (the recurrent state h^{(t)}),
on which a linear predictor (typically a linear regression) can be applied to solve
the problem of interest. The training criterion may then be easily designed to be
convex as a function of the output weights. For example, if the output consists
of linear regression from the hidden units to the output targets, and the training
criterion is mean squared error, then it is convex and may be solved reliably with
simple learning algorithms (Jaeger, 2003).
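The following is a minimal sketch of this recipe in Python/NumPy. The sizes,
the toy delayed-copy task, and the particular spectral radius are illustrative
assumptions (the role of the spectral radius is discussed below):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, T = 1, 100, 500

W_in = rng.normal(scale=0.5, size=(n_h, n_in))   # fixed random input weights
W = rng.normal(size=(n_h, n_h))
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius 1.2

x = rng.normal(size=(T, n_in))                   # input sequence
y = np.roll(x[:, 0], 3)                          # toy target: input delayed 3 steps

H = np.zeros((T, n_h))
h = np.zeros(n_h)
for t in range(T):
    h = np.tanh(W_in @ x[t] + W @ h)             # reservoir update; never trained
    H[t] = h

W_out, *_ = np.linalg.lstsq(H, y, rcond=None)    # convex fit of the readout only
print("training MSE:", np.mean((H @ W_out - y) ** 2))
```

Only W_out is learned; the input and recurrent weights stay fixed, so the
training problem reduces to ordinary least squares.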
The important question is therefore: how do we set the input and recurrent
weights so that a rich set of histories can be represented in the recurrent neural
network state? The answer proposed in the reservoir computing literature is to
view the recurrent net as a dynamical system, and set the input and recurrent
weights such that the dynamical system is near the edge of stability.
The original idea was to make the eigenvalues of the Jacobian of the state-to-
state transition function be close to 1. As explained in Sec. 8.2.5, an important
characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians
$$J^{(t)} = \frac{\partial s^{(t)}}{\partial s^{(t-1)}}.$$

Of particular importance is the spectral radius of J^{(t)}, defined to be
the maximum of the absolute values of its eigenvalues.
To understand the effect of the spectral radius, consider the simple case of
back-propagation with a Jacobian matrix J that does not change with t. This
case happens, for example, when the network is purely linear. Suppose that J has
an eigenvector v with corresponding eigenvalue λ. Consider what happens as we
propagate a gradient vector backwards through time. If we begin with a gradient
vector g, then after one step of back-propagation, we will have Jg, and after n
steps we will have J^n g. Now consider what happens if we instead back-propagate
a perturbed version of g. If we begin with g + δv, then after one step, we will
have J(g + δv). After n steps, we will have J^n(g + δv). From this we can see
that back-propagation starting from g and back-propagation starting from g + δv
diverge by δJ^n v after n steps of back-propagation. If v is chosen to be a unit
eigenvector of J with eigenvalue λ, then multiplication by the Jacobian simply
scales the difference at each step. The two executions of back-propagation are
separated by a distance of δ|λ|^n. When v corresponds to the largest value of |λ|,
this perturbation achieves the widest possible separation of an initial perturbation
of size δ.

When |λ| > 1, the deviation size δ|λ|^n grows exponentially large. When |λ| < 1,
the deviation size becomes exponentially small.
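As an illustration of this separation (an assumed toy example), we can take a
diagonal Jacobian with known eigenvalues and verify that the two back-propagation
runs diverge at exactly the rate δ|λ|^n:

```python
import numpy as np

# Back-propagating g and g + delta*v through the same Jacobian n times
# separates the two runs by delta * |lambda| ** n when v is a unit
# eigenvector of J with eigenvalue lambda.
rng = np.random.default_rng(0)
J = np.diag([1.1, 0.9, 0.5])                 # Jacobian with known eigenvalues
g = rng.normal(size=3)
delta = 1e-3
v = np.array([1.0, 0.0, 0.0])                # eigenvector with lambda = 1.1

for n in (1, 10, 50):
    Jn = np.linalg.matrix_power(J, n)
    separation = np.linalg.norm(Jn @ (g + delta * v) - Jn @ g)
    print(n, separation, delta * 1.1 ** n)   # the two quantities match
```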
Of course, this example assumed that the Jacobian was the same at every
time step, corresponding to a recurrent network with no nonlinearity. When a
nonlinearity is present, the derivative of the nonlinearity will approach zero on
many time steps, and help to prevent the explosion resulting from a large spectral
radius. Indeed, the most recent work on echo state networks advocates using a
spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).
Everything we have said about back-propagation via repeated matrix multiplication
applies equally to forward propagation in a network with no nonlinearity,
where the state evolves as h^{(t+1)} = h^{(t)} W (treating h as a row vector).
When a linear map W always shrinks h as measured by the L^2 norm, then
we say that the map is contractive. When the spectral radius is less than one, the
mapping from h^{(t)} to h^{(t+1)} is contractive, so a small change becomes smaller after
each time step. This necessarily makes the network forget information about the
past when we use a finite level of precision (such as 32-bit floating point) to store the
state vector.
The Jacobian matrix tells us how a small change of h^{(t)} propagates one step
forward, or equivalently, how the gradient on h^{(t+1)} propagates one step backward,
during back-propagation. Note that neither W nor J need be symmetric (although
they are square and real), so they can have complex-valued eigenvalues and
eigenvectors, with imaginary components corresponding to potentially oscillatory
behavior (if the same Jacobian were applied iteratively). Even though h^{(t)} or a
small variation of h^{(t)} of interest in back-propagation are real-valued, they can
be expressed in such a complex-valued basis. What matters is what happens to
the magnitude (complex absolute value) of these possibly complex-valued basis
coefficients when we multiply the matrix by the vector. An eigenvalue with
magnitude greater than one corresponds to magnification (exponential growth, if
applied iteratively), while an eigenvalue with magnitude less than one corresponds
to shrinking (exponential decay, if applied iteratively).
With a nonlinear map, the Jacobian is free to change at each step. The
dynamics therefore become more complicated. However, it remains true that a
small initial variation can turn into a large variation after several steps. One
difference between the purely linear case and the nonlinear case is that the use of
a squashing nonlinearity such as tanh can cause the recurrent dynamics to become
bounded. Note that it is possible for back-propagation to retain unbounded
dynamics even when forward propagation has bounded dynamics, for example,
when a sequence of tanh units are all in the middle of their linear regime and are
connected by weight matrices with spectral radius greater than 1. However, it is
rare for all of the tanh units to simultaneously lie at their linear activation point.

The strategy of echo state networks is simply to fix the weights to have some
spectral radius such as 3, where information is carried forward through time but
does not explode due to the stabilizing effect of saturating nonlinearities like tanh.
More recently, it has been shown that the techniques used to set the weights
in ESNs could be used to initialize the weights in a fully trainable recurrent network
(with the hidden-to-hidden recurrent weights trained using back-propagation
through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever
et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined
with the sparse initialization scheme described in Sec. 8.4.
10.9 Leaky Units and Other Strategies for Multiple
Time Scales
One way to deal with long-term dependencies is to design a model that operates
at multiple time scales, so that some parts of the model operate at fine-grained
time scales and can handle small details, while other parts operate at coarse time
scales and transfer information from the distant past to the present more efficiently.
Various strategies for building both fine and coarse time scales are possible. These
409
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
include the addition of skip connections across time, “leaky units” that integrate
signals with different time constants, and the removal of some of the connections
used to model fine-grained time scales.
10.9.1 Adding Skip Connections through Time
One way to obtain coarse time scales is to add direct connections from variables in
the distant past to variables in the present. The idea of using such skip connections
dates back to Lin et al. (1996) and follows from the idea of incorporating delays in
feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent
network, a recurrent connection goes from a unit at time t to a unit at time t + 1.
It is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in Sec. 8.2.5, gradients may vanish or explode exponentially
with respect to the number of time steps. Lin et al. (1996) introduced
recurrent connections with a time delay of d to mitigate this problem. Gradients
now diminish exponentially as a function of τ/d rather than τ. Since there are both
delayed and single-step connections, gradients may still explode exponentially in τ.
This allows the learning algorithm to capture longer dependencies, although not all
long-term dependencies may be represented well in this way.
10.9.2 Leaky Units and a Spectrum of Different Time Scales
Another way to obtain paths on which the product of derivatives is close to one is to
have units with linear self-connections and a weight near one on these connections.

When we accumulate a running average µ^{(t)} of some value v^{(t)} by applying the
update

$$\mu^{(t)} \leftarrow \alpha \mu^{(t-1)} + (1 - \alpha)\, v^{(t)},$$

the α parameter is an example of a linear self-connection from µ^{(t-1)} to µ^{(t)}.
When α is near one, the running average remembers information about the past
for a long time, and when α is near zero, information about the past is rapidly
discarded. Hidden units with linear self-connections can behave similarly to such
running averages. Such hidden units are called leaky units.
Skip connections through d time steps are a way of ensuring that a unit can
always learn to be influenced by a value from d time steps earlier. The use of a
linear self-connection with a weight near one is a different way of ensuring that the
unit can access values from the past. The linear self-connection approach allows
this effect to be adapted more smoothly and flexibly by adjusting the real-valued
α rather than by adjusting the integer-valued skip length.
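A small sketch (with an assumed toy signal) shows how α controls the time scale
over which a leaky unit remembers:

```python
import numpy as np

# A linear self-connection with weight alpha near one retains information
# over many time steps; alpha near zero forgets it quickly.
rng = np.random.default_rng(0)
v = rng.normal(size=200)
v[:10] += 5.0                               # early signal we want to remember

for alpha in (0.5, 0.99):
    mu = 0.0
    for t in range(len(v)):
        mu = alpha * mu + (1 - alpha) * v[t]   # the leaky update
    print(f"alpha={alpha}: final state {mu:.3f}")
# With alpha=0.99 the early signal still influences the final state;
# with alpha=0.5 it has been almost entirely forgotten.
```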
These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996).
Leaky units were also found to be useful in the context of echo state networks
(Jaeger et al., 2007).
There are two basic strategies for setting the time constants used by leaky
units. One strategy is to manually fix them to values that remain constant, for
example by sampling their values from some distribution once at initialization time.
Another strategy is to make the time constants free parameters and learn them.
Having such leaky units at different time scales appears to help with long-term
dependencies (Mozer, 1992; Pascanu et al., 2013a).
10.9.3 Removing Connections
Another approach to handle long-term dependencies is the idea of organizing
the state of the RNN at multiple time-scales (El Hihi and Bengio, 1996), with
information flowing more easily through long distances at the slower time scales.
This idea differs from the skip connections through time discussed earlier
because it involves actively removing length-one connections and replacing them
with longer connections. Units modified in such a way are forced to operate on a
long time scale. Skip connections through time add edges. Units receiving such
new connections may learn to operate on a long time scale but may also choose to
focus on their other short-term connections.
There are different ways in which a group of recurrent units can be forced to
operate at different time scales. One option is to make the recurrent units leaky,
but to have different groups of units associated with different fixed time scales.
This was the proposal in Mozer (1992) and has been successfully used in Pascanu
et al. (2013a). Another option is to have explicit and discrete updates taking place
at different times, with a different frequency for different groups of units. This is
the approach of El Hihi and Bengio (1996) and Koutnik et al. (2014). It worked
well on a number of benchmark datasets.
10.10 The Long Short-Term Memory and Other Gated
RNNs
As of this writing, the most effective sequence models used in practical applications
are called gated RNNs. These include the long short-term memory and networks
based on the gated recurrent unit.
Like leaky units, gated RNNs are based on the idea of creating paths through
time that have derivatives that neither vanish nor explode. Leaky units did
this with connection weights that were either manually chosen constants or were
parameters. Gated RNNs generalize this to connection weights that may change
at each time step.
Figure 10.16: Block diagram of the LSTM recurrent network “cell.” Cells are connected
recurrently to each other, replacing the usual hidden units of ordinary recurrent networks.
An input feature is computed with a regular artificial neuron unit. Its value can be
accumulated into the state if the sigmoidal input gate allows it. The state unit has a
linear self-loop whose weight is controlled by the forget gate. The output of the cell can
be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the
input unit can have any squashing nonlinearity. The state unit can also be used as an
extra input to the gating units. The black square indicates a delay of 1 time unit.
Leaky units allow the network to accumulate information (such as evidence
for a particular feature or category) over a long duration. However, once that
information has been used, it might be useful for the neural network to forget the
old state. For example, if a sequence is made of sub-sequences and we want a leaky
unit to accumulate evidence inside each sub-sequence, we need a mechanism to
forget the old state by setting it to zero. Instead of manually deciding when to
clear the state, we want the neural network to learn to decide when to do it. This
is what gated RNNs do.
10.10.1 LSTM
The clever idea of introducing self-loops to produce paths where the gradient can
flow for long durations is a core contribution of the initial long short-term memory
(LSTM) model (Hochreiter and Schmidhuber, 1997). A crucial addition has been
to make the weight on this self-loop conditioned on the context, rather than fixed
(Gers et al., 2000). By making the weight of this self-loop gated (controlled by
another hidden unit), the time scale of integration can be changed dynamically. In
this case, we mean that even for an LSTM with fixed parameters, the time scale of
integration can change based on the input sequence, because the time constants are
output by the model itself. The LSTM has been found extremely successful in many
applications, such as unconstrained handwriting recognition (Graves et al., 2009),
speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting
generation (Graves, 2013), machine translation (Sutskever et al., 2014a), image
captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015b) and
parsing (Vinyals et al., 2014a).
The LSTM block diagram is illustrated in Fig. 10.16. The corresponding
forward propagation equations are given below, in the case of a shallow recurrent
network architecture. Deeper architectures have also been successfully used (Graves
et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an element-
wise nonlinearity to the affine transformation of inputs and recurrent units, LSTM
recurrent networks have “LSTM cells” that have an internal recurrence (a self-loop),
in addition to the outer recurrence of the RNN. Each cell has the same inputs
and outputs as an ordinary recurrent network, but has more parameters and a
system of gating units that controls the flow of information. The most important
component is the state unit
s
(t)
i
that has a linear self-loop similar to the leaky
units described in the previous section. However, here, the self-loop weight (or the
associated time constant) is controlled by a forget gate unit
f
(t)
i
(for time step
t
and cell i), that sets this weight to a value between 0 and 1 via a sigmoid unit:
f
(t)
i
= σ
b
f
i
+
j
U
f
i,j
x
(t)
j
+
j
W
f
i,j
h
(t1)
j
, (10.27)
where
x
(t)
is the current input vector and
h
(t)
is the current hidden layer vector,
containing the outputs of all the LSTM cells, and
b
f
,
U
f
,
W
f
are respectively
413
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
biases, input weights and recurrent weights for the forget gates. The LSTM cell
internal state is thus updated as follows, but with a conditional self-loop weight
f
(t)
i
:
$$s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \,\sigma\left( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \right), \tag{10.28}$$

where b, U and W respectively denote the biases, input weights and recurrent
weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly
to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1),
but with its own parameters:
$$g_i^{(t)} = \sigma\left( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \right). \tag{10.29}$$
The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)},
which also uses a sigmoid unit for gating:

$$h_i^{(t)} = \tanh\left( s_i^{(t)} \right) q_i^{(t)}, \qquad q_i^{(t)} = \sigma\left( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \right), \tag{10.30}$$

which has parameters b^o, U^o, W^o for its biases, input weights and recurrent
weights, respectively. Among the variants, one can choose to use the cell state s_i^{(t)}
as an extra input (with its weight) into the three gates of the i-th unit, as shown
in Fig. 10.16. This would require three additional parameters.
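The forward step of Eqs. 10.27–10.30 can be written compactly in vector form.
The following sketch is illustrative only; the parameter dictionary, names and
shapes are assumptions rather than notation from this chapter:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM forward step in vector form, following Eqs. 10.27-10.30."""
    f = sigmoid(p["b_f"] + p["U_f"] @ x + p["W_f"] @ h_prev)   # forget gate, Eq. 10.27
    g = sigmoid(p["b_g"] + p["U_g"] @ x + p["W_g"] @ h_prev)   # input gate, Eq. 10.29
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)  # Eq. 10.28
    q = sigmoid(p["b_o"] + p["U_o"] @ x + p["W_o"] @ h_prev)   # output gate, Eq. 10.30
    h = np.tanh(s) * q                                         # cell output, Eq. 10.30
    return h, s

# Usage: random parameters for a cell layer with 3 inputs and 5 hidden units.
rng = np.random.default_rng(0)
n_in, n_h = 3, 5
p = {}
for name in ("f", "g", "o", ""):
    suffix = f"_{name}" if name else ""
    p[f"b{suffix}"] = np.zeros(n_h)
    p[f"U{suffix}"] = rng.normal(scale=0.1, size=(n_h, n_in))
    p[f"W{suffix}"] = rng.normal(scale=0.1, size=(n_h, n_h))

h, s = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(10, n_in)):   # run over a short sequence
    h, s = lstm_step(x, h, s, p)
print(h)
```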
LSTM networks have been shown to learn long-term dependencies more easily
than the simple recurrent architectures, first on artificial data sets designed for
testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter
and Schmidhuber, 1997; Hochreiter et al., 2000), then on challenging sequence
processing tasks where state-of-the-art performance was obtained (Graves, 2012;
Graves et al., 2013; Sutskever et al., 2014a). Variants and alternatives to the LSTM
have been studied and used and are discussed next.
10.10.2 Other Gated RNNs
Which pieces of the LSTM architecture are actually necessary? What other
successful architectures could be designed that allow the network to dynamically
control the time scale and forgetting behavior of different units?
Some answers to these questions are given with the recent work on gated
RNNs, whose units are also known as gated recurrent units or GRUs (Cho et al.,
2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015b; Chrupala et al., 2015).
The main difference with the LSTM is that a single gating unit simultaneously
controls the forgetting factor and the decision to update the state unit. The update
equations are the following:
$$h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \left( 1 - u_i^{(t-1)} \right) \sigma\left( b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \right), \tag{10.31}$$

where u stands for the “update” gate and r for the “reset” gate. Their values are
defined as usual:
$$u_i^{(t)} = \sigma\left( b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)} \right) \tag{10.32}$$

and

$$r_i^{(t)} = \sigma\left( b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)} \right). \tag{10.33}$$
The reset and update gates can individually “ignore” parts of the state vector.
The update gates act like conditional leaky integrators that can linearly gate any
dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely
ignore it (at the other extreme) by replacing it by the new “target state” value
(towards which the leaky integrator wants to converge). The reset gates control
which parts of the state get used to compute the next target state, introducing an
additional nonlinear effect in the relationship between past state and future state.
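As a sketch in the same spirit (with the time indexing simplified to a single step,
and illustrative parameter names), the update of Eqs. 10.31–10.33 can be written
as:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU-style update; p holds bias/weight arrays as in the LSTM sketch."""
    u = sigmoid(p["b_u"] + p["U_u"] @ x + p["W_u"] @ h)        # update gate
    r = sigmoid(p["b_r"] + p["U_r"] @ x + p["W_r"] @ h)        # reset gate
    target = sigmoid(p["b"] + p["U"] @ x + p["W"] @ (r * h))   # candidate state
    return u * h + (1 - u) * target     # conditional leaky integration toward target
```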
Many more variants around this theme can be designed. For example, the
reset gate (or forget gate) output could be shared across multiple hidden units.
Alternately, the product of a global gate (covering a whole group of units, such as
an entire layer) and a local gate (per unit) could be used to combine global control
and local control. However, several investigations over architectural variations
of the LSTM and GRU found no variant that would clearly beat both of these
across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015a). Greff
et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz
et al. (2015a) found that adding a bias of 1 to the LSTM forget gate, a practice
advocated by Gers et al. (2000), makes the LSTM as strong as the best of the
explored architectural variants.
10.11 Optimization for Long-Term Dependencies
Sec. 8.2.5 and Sec. 10.7 have described the vanishing and exploding gradient
problems that occur when optimizing RNNs over many time steps.
An interesting idea proposed by Martens and Sutskever (2011) is that second
derivatives may vanish at the same time that first derivatives vanish. Second-order
optimization algorithms may roughly be understood as dividing the first derivative
by the second derivative (in higher dimension, multiplying the gradient by the
inverse Hessian). If the second derivative shrinks at a similar rate to the first
derivative, then the ratio of first and second derivatives may remain relatively
constant. Unfortunately, second-order methods have many drawbacks, including
high computational cost, the need for a large minibatch, and a tendency to be
attracted to saddle points. Martens and Sutskever (2011) found promising results
using second-order methods. Later, Sutskever et al. (2013) found that simpler
methods such as Nesterov momentum with careful initialization could achieve
similar results. See Sutskever (2012) for more detail. Both of these approaches
have largely been replaced by simply using SGD (even without momentum) applied
to LSTMs. This is part of a continuing theme in machine learning that it is often
much easier to design a model that is easy to optimize than it is to design a more
powerful optimization algorithm.
10.11.1 Clipping Gradients
As discussed in Sec. 8.2.4, strongly nonlinear functions such as those computed by
a recurrent net over many time steps tend to have derivatives that can be either
very large or very small in magnitude. This is illustrated in Fig. 8.3 and Fig. 10.17,
in which we see that the objective function (as a function of the parameters) has a
“landscape” in which one finds “cliffs”: wide and rather flat regions separated by
tiny regions where the objective function changes quickly, forming a kind of cliff.
The difficulty that arises is that when the parameter gradient is very large, a
gradient descent parameter update could throw the parameters very far, into a
region where the objective function is larger, undoing much of the work that had
been done to reach the current solution. The gradient tells us the direction that
corresponds to the steepest descent within an infinitesimal region surrounding the
current parameters. Outside of this infinitesimal region, the cost function may
begin to curve back upwards. The update must be chosen to be small enough to
avoid traversing too much upward curvature. We typically use learning rates that
decay slowly enough that consecutive steps have approximately the same learning
rate. A step size that is appropriate for a relatively linear part of the landscape is
often inappropriate and causes uphill motion if we enter a more curved part of the
landscape on the next step.
Figure 10.17: Example of the effect of gradient clipping in a recurrent network with
two parameters w and b. Gradient clipping can make gradient descent perform more
reasonably in the vicinity of extremely steep cliffs. These steep cliffs commonly occur
in recurrent networks near where a recurrent network behaves approximately linearly.
The cliff is exponentially steep in the number of time steps because the weight matrix
is multiplied by itself once for each time step. (Left) Gradient descent without gradient
clipping overshoots the bottom of this small ravine, then receives a very large gradient
from the cliff face. The large gradient catastrophically propels the parameters outside the
axes of the plot. (Right) Gradient descent with gradient clipping has a more moderate
reaction to the cliff. While it does ascend the cliff face, the step size is restricted so that
it cannot be propelled away from the steep region near the solution. Figure adapted with
permission from Pascanu et al. (2013a).
A simple type of solution has been in use by practitioners for many years:
clipping the gradient. There are different instances of this idea (Mikolov, 2012;
Pascanu et al., 2013a). One option is to clip the parameter gradient from a
minibatch element-wise (Mikolov, 2012) just before the parameter update. Another
is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013a) just before the
parameter update:

$$\text{if } \|g\| > v, \quad g \leftarrow \frac{g\,v}{\|g\|}, \tag{10.34}$$

where v is the norm threshold and g is used to update parameters. Because the
gradient of all the parameters (including different groups of parameters, such as
weights and biases) is renormalized jointly with a single scaling factor, the latter
method has the advantage that it guarantees that each step is still in the gradient
direction, but experiments suggest that both forms work similarly. Although
the parameter update has the same direction as the true gradient, with gradient
norm clipping, the parameter update vector norm is now bounded. This bounded
gradient avoids performing a detrimental step when the gradient explodes. In
fact, even simply taking a random step when the gradient magnitude is above
a threshold tends to work almost as well. If the explosion is so severe that the
gradient is numerically Inf or NaN (considered infinite or not-a-number), then
a random step of size v can be taken and will typically move away from the
numerically unstable configuration. Clipping the gradient norm per-minibatch will
not change the direction of the gradient for an individual minibatch. However,
taking the average of the norm-clipped gradient from many minibatches is not
equivalent to clipping the norm of the true gradient (the gradient formed from
using all examples). Examples that have large gradient norm, as well as examples
that appear in the same minibatch as such examples, will have their contribution
to the final direction diminished. This stands in contrast to traditional minibatch
gradient descent, where the true gradient direction is equal to the average over all
minibatch gradients. Put another way, traditional stochastic gradient descent uses
an unbiased estimate of the gradient, while gradient descent with norm clipping
introduces a heuristic bias that we know empirically to be useful. With element-
wise clipping, the direction of the update is not aligned with the true gradient
or the minibatch gradient, but it is still a descent direction. It has also been
proposed (Graves, 2013) to clip the back-propagated gradient (with respect to
hidden units) but no comparison has been published between these variants; we
conjecture that all these methods behave similarly.
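A minimal sketch of the norm-clipping rule of Eq. 10.34, applied jointly across
all parameter groups (the function name and the list-of-arrays gradient
representation are illustrative):

```python
import numpy as np

def clip_grad_norm(grads, v):
    """Rescale the whole gradient when its joint norm exceeds the threshold v.

    grads: list of numpy arrays, one per parameter group (weights, biases, ...).
    The rescaling preserves the gradient's direction, as discussed above.
    """
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))   # joint norm over groups
    if norm > v:
        grads = [g * (v / norm) for g in grads]
    return grads
```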
10.11.2 Regularizing to Encourage Information Flow
Gradient clipping helps to deal with exploding gradients, but it does not help with
vanishing gradients. To address vanishing gradients and better capture long-term
dependencies, we discussed the idea of creating paths in the computational graph of
the unfolded recurrent architecture along which the product of gradients associated
with arcs is near 1. One approach to achieve this is with LSTMs and other self-loops
and gating mechanisms, described above in Sec. 10.10. Another idea is to regularize
or constrain the parameters so as to encourage “information flow.” In particular,
we would like the gradient vector $\nabla_{h^{(t)}} L$ being back-propagated to maintain its
magnitude, even if the loss function only penalizes the output at the end of the
sequence. Formally, we want

$$\left( \nabla_{h^{(t)}} L \right)^\top \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \tag{10.35}$$

to be as large as

$$\nabla_{h^{(t)}} L. \tag{10.36}$$
With this objective, Pascanu et al. (2013a) propose the following regularizer:

$$\Omega = \sum_t \left( \frac{ \left\| \left( \nabla_{h^{(t)}} L \right)^\top \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \right\| }{ \left\| \nabla_{h^{(t)}} L \right\| } - 1 \right)^2. \tag{10.37}$$
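A sketch of how Ω might be computed, assuming the per-time-step back-propagated
gradients and their products with the state-to-state Jacobian are already available
and treated as constants, per the approximation discussed next (all names are
illustrative):

```python
import numpy as np

def information_flow_penalty(grads, jvps):
    """Eq. 10.37: penalize attenuation of the back-propagated gradient.

    grads: list of arrays, grads[t] = gradient of L with respect to h(t).
    jvps:  list of arrays, jvps[t] = that gradient multiplied by the
           Jacobian dh(t)/dh(t-1), treated as a constant.
    """
    omega = 0.0
    for g, jv in zip(grads, jvps):
        ratio = np.linalg.norm(jv) / (np.linalg.norm(g) + 1e-8)
        omega += (ratio - 1.0) ** 2      # zero when the magnitude is preserved
    return omega
```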
Computing the gradient of this regularizer may appear difficult, but Pascanu
et al. (2013a) propose an approximation in which we consider the back-propagated
vectors $\nabla_{h^{(t)}} L$ as if they were constants (for the purpose of this regularizer, so
that there is no need to back-propagate through them). The experiments with
that there is no need to back-propagate through them). The experiments with
this regularizer suggest that, if combined with the norm clipping heuristic (which
handles gradient explosion), the regularizer can considerably increase the span of
the dependencies that an RNN can learn. Because it keeps the RNN dynamics
on the edge of explosive gradients, the gradient clipping is particularly important.
Without gradient clipping, gradient explosion prevents learning from succeeding.
A key weakness of this approach is that it is not as effective as the LSTM for
tasks where data is abundant, such as language modeling.
10.12 Explicit Memory
Intelligence requires knowledge and acquiring knowledge can be done via learning,
which has motivated the development of large-scale deep architectures. However,
there are different kinds of knowledge. Some knowledge can be implicit, sub-
conscious, and difficult to verbalize—such as how to walk, or how a dog looks
different from a cat. Other knowledge can be explicit, declarative, and relatively
straightforward to put into words—everyday commonsense knowledge, like “a cat
is a kind of animal,” or very specific facts that you need to know to accomplish
your current goals, like “the meeting with the sales team is at 3:00 PM in room
141.”
Neural networks excel at storing implicit knowledge. However, they struggle
to memorize facts. Stochastic gradient descent requires many presentations of
the same input before it can be stored in a neural network’s parameters, and even
Figure 10.18: A schematic of an example of a network with an explicit memory, capturing
some of the key design elements of the neural Turing machine. In this diagram we
distinguish the “representation” part of the model (the “task network,” here a recurrent
net in the bottom) from the “memory” part of the model (the set of cells), which can
store facts. The task network learns to “control” the memory, deciding where to read from
and where to write to within the memory (through the reading and writing mechanisms,
indicated by bold arrows pointing at the reading and writing addresses).
then, that input will not be stored especially precisely. Graves et al. (2014b)
hypothesized that this is because neural networks lack the equivalent of the working
memory system that allows human beings to explicitly hold and manipulate pieces
of information that are relevant to achieving some goal. Such explicit memory
components would allow our systems not only to rapidly and “intentionally” store
and retrieve specific facts but also to sequentially reason with them. The need
for neural networks that can process information in a sequence of steps, changing
the way the input is fed into the network at each step, has long been recognized
as important for the ability to reason rather than to make automatic, intuitive
responses to the input (Hinton, 1990).
To resolve this difficulty, Weston et al. (2014) introduced memory networks that
include a set of memory cells that can be accessed via an addressing mechanism.
Memory networks originally required a supervision signal instructing them how
to use their memory cells. Graves et al. (2014b) introduced the neural Turing
machine, which is able to learn to read from and write arbitrary content to
memory cells without explicit supervision about which actions to undertake,
and allowed end-to-end training without this supervision signal, thanks to the use
of a content-based soft attention mechanism (see Bahdanau et al. (2015) and
Sec. 12.4.5.1). This soft addressing mechanism has become standard with other
related architectures emulating algorithmic mechanisms in a way that still allows
gradient-based optimization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015;
Kumar et al., 2015; Vinyals et al., 2015a; Grefenstette et al., 2015).
Each memory cell can be thought of as an extension of the memory cells in
LSTMs and GRUs. The difference is that the network outputs an internal state
that chooses which cell to read from or write to, just as memory accesses in a
digital computer read from or write to a specific address.
It is difficult to optimize functions that produce exact, integer addresses. To
alleviate this problem, NTMs actually read from or write to many memory cells
simultaneously. To read, they take a weighted average of many cells. To write, they
modify multiple cells by different amounts. The coefficients for these operations
are chosen to be focused on a small number of cells, for example, by producing
them via a softmax function. Using these weights with non-zero derivatives allows
the functions controlling access to the memory to be optimized using gradient
descent. The gradient on these coefficients indicates whether each of them should
be increased or decreased, but the gradient will typically be large only for those
memory addresses receiving a large coefficient.
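A minimal sketch of such a soft, content-based read (the cosine similarity
measure, the softmax sharpening parameter beta, and all names are illustrative
assumptions):

```python
import numpy as np

def soft_read(memory, key, beta=10.0):
    """Differentiable read: softmax-weighted average over all memory cells.

    memory: (n_cells, cell_size) array; key: (cell_size,) query vector;
    beta sharpens the softmax so the weights focus on a few cells.
    """
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + 1e-8)   # cosine similarities
    scores = beta * sims
    w = np.exp(scores - scores.max())
    w /= w.sum()                                           # softmax addressing weights
    return w @ memory                                      # weighted average of cells

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 4))                           # 8 cells, each a 4-vector
result = soft_read(memory, key=memory[2] + 0.1 * rng.normal(size=4))
print(result)                                              # close to the contents of cell 2
```

Because the weights have non-zero derivatives, the addressing can be trained by
gradient descent, as described above.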
These memory cells are typically augmented to contain a vector, rather than
the single scalar stored by an LSTM or GRU memory cell. There are two reasons
to increase the size of the memory cell. One reason is that we have increased the
cost of accessing a memory cell. We pay the computational cost of producing a
coefficient for many cells, but we expect these coefficients to cluster around a small
number of cells. By reading a vector value, rather than a scalar value, we can
offset some of this cost. Another reason to use vector-valued memory cells is that
they allow for content-based addressing, where the weight used to read from or
write to a cell is a function of that cell. Vector-valued cells allow us to retrieve a
complete vector-valued memory if we are able to produce a pattern that matches
some but not all of its elements. This is analogous to the way that people can
recall the lyrics of a song based on a few words. We can think of a content-based
read instruction as saying, “Retrieve the lyrics of the song that has the chorus ‘We
all live in a yellow submarine.’” Content-based addressing is more useful when we
make the objects to be retrieved large—if every letter of the song were stored in a
separate memory cell, we would not be able to find them this way. By comparison,
location-based addressing is not allowed to refer to the content of the memory. We
can think of a location-based read instruction as saying, “Retrieve the lyrics of
the song in slot 347.” Location-based addressing can often be a perfectly sensible
mechanism even when the memory cells are small.
If the content of a memory cell is copied (not forgotten) at most time steps, then
the information it contains can be propagated forward in time and the gradients
propagated backward in time without either vanishing or exploding.
The explicit memory approach is illustrated in Fig. 10.18, where we see that
a “task neural network” is coupled with a memory. Although that task neural
network could be feedforward or recurrent, the overall system is a recurrent network.
The task network can choose to read from or write to specific memory addresses.
Explicit memory seems to allow models to learn tasks that ordinary RNNs or LSTM
RNNs cannot learn. One reason for this advantage may be that information and
gradients can be propagated (forward in time or backwards in time, respectively)
for very long durations.
As an alternative to back-propagation through weighted averages of memory
cells, we can interpret the memory addressing coefficients as probabilities and
stochastically read just one cell (Zaremba and Sutskever, 2015). Optimizing models
that make discrete decisions requires specialized optimization algorithms, described
in Sec. 20.9.1. So far, training these stochastic architectures that make discrete
decisions remains harder than training deterministic algorithms that make soft
decisions.
Whether it is soft (allowing back-propagation) or stochastic and hard, the mech-
anism for choosing an address is in its form identical to the attention mechanism
which had been previously introduced in the context of machine translation (Bah-
danau et al., 2015) and discussed in Sec. 12.4.5.1. The idea of attention mechanisms
for neural networks was introduced even earlier, in the context of handwriting
generation (Graves, 2013), with an attention mechanism that was constrained to
move only forward in time through the sequence. In the case of machine translation
and memory networks, at each step, the focus of attention can move to a completely
different place, compared to the previous step.
Recurrent neural networks provide a way to extend deep learning to sequential
data. They are the last major tool in our deep learning toolbox. Our discussion now
moves to how to choose and use these tools and how to apply them to real-world
tasks.