Chapter 10
Sequence Modeling: Recurrent
and Recursive Nets
One of the early ideas found in machine learning and statistical models of the 80's is that of sharing parameters¹ across different parts of a model, allowing the model to be extended and applied to examples of different forms, e.g., of different lengths. This can be found in hidden Markov models (HMMs) (Rabiner and Juang, 1986), where the same state-to-state transition matrix $P(s_t \mid s_{t-1})$ is used for different time steps, allowing one to model variable length sequences. It can also be found in the idea of the recurrent neural network (Rumelhart et al., 1986c) or RNN²: the same weights are used for different instances of the artificial neurons at different time steps, allowing us to apply the network to input sequences of different lengths. This idea is made more explicit in the early work on time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989), where a fully connected network is replaced by one with local connections that are shared across different temporal instances of the hidden units. Such networks are the ancestors of convolutional neural networks, covered in more detail in Chapter 9. Recurrent nets are covered below in Section 10.2. As shown in Section 10.1 below, the flow graph (a notion introduced in Section 6.3) associated with a recurrent network is structured like a chain. Recurrent neural networks have been generalized into recursive neural networks, in which the structure can be more general, i.e., typically a tree. Recursive neural networks are discussed in more detail in Section 10.4.
¹ See Section 7.9 for an introduction to the concept of parameter sharing.
² Unfortunately, the RNN acronym is sometimes also used for denoting recursive neural networks. However, since the RNN acronym has been around for much longer, we suggest keeping this acronym for recurrent neural networks.
10.1 Unfolding Flow Graphs and Sharing Parameters
A flow graph is a way to formalize the structure of a set of computations, such
as those involved in mapping inputs and parameters to outputs and loss. Please
refer to Section 6.3 for a general introduction. In this section we explain the idea
of unfolding a recursive or recurrent computation into a flow graph that has a
repetitive structure.
For example, consider the classical form of a dynamical system:
$$ s_t = F_\theta(s_{t-1}) \tag{10.1} $$
where $s_t$ is called the state of the system. The unfolded flow graph of such a system is illustrated in Figure 10.1.
Figure 10.1: The classical dynamical system of Equation 10.1, illustrated as an unfolded flow graph. Each node represents the state at some time $t$, and the function $F_\theta$ maps the state at $t$ to the state at $t+1$. The same parameters (the same function $F_\theta$) are used for all time steps.
As another example, let us consider a dynamical system driven by an external
signal $x_t$,
$$ s_t = F_\theta(s_{t-1}, x_t) \tag{10.2} $$
illustrated in Figure 10.2, where we see that the state now contains informa-
tion about the whole past sequence, i.e., the above equation implicitly defines a
function
$$ s_t = G_t(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1) \tag{10.3} $$
which maps the whole past sequence $(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1)$ to the current state. Equation 10.2 is actually part of the definition of a recurrent net. We can think of $s_t$ as a kind of summary of the past sequence of inputs up to $t$.
Note that this summary is in general necessarily lossy, since it maps an arbitrary length sequence $(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1)$ to a fixed length vector $s_t$. Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than other aspects. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to distinctly keep track of all the bits of information, only those required to predict the rest of the sentence. The most demanding situation is when we ask $s_t$ to be rich enough to allow
one to approximately recover the input sequence, as in auto-encoder frameworks
(Chapter 16).
If we had to define a different function $G_t$ for each possible sequence length
(imagine a separate neural network, each with a different input size), each with
its own parameters, we would not get any generalization to sequences of a size
not seen in the training set. Furthermore, one would need to see a lot more
training examples, because a separate model would have to be trained for each
sequence length. By instead defining the state through a recurrent formulation
as in Eq. 10.2, the same parameters are used for any sequence length, allowing
much better generalization properties.
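To make the sharing concrete, here is a minimal NumPy sketch (our own illustrative names and sizes, not notation from the text) in which a single parameter set $\theta = (W, U, b)$ implements the recurrence $s_t = F_\theta(s_{t-1}, x_t)$ of Eq. 10.2 and is applied unchanged to input sequences of different lengths:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_state = 3, 5

# One set of parameters theta = (W, U, b), shared across all time steps.
W = rng.standard_normal((n_state, n_state)) * 0.1   # state-to-state
U = rng.standard_normal((n_state, n_in)) * 0.1      # input-to-state
b = np.zeros(n_state)

def F_theta(s_prev, x_t):
    """One application of the recurrence s_t = F_theta(s_{t-1}, x_t)."""
    return np.tanh(b + W @ s_prev + U @ x_t)

def final_state(x_sequence):
    """Apply the same F_theta to every element of a sequence of any length."""
    s = np.zeros(n_state)
    for x_t in x_sequence:
        s = F_theta(s, x_t)
    return s

# The same parameters handle sequences of length 4 and of length 9.
short_seq = rng.standard_normal((4, n_in))
long_seq = rng.standard_normal((9, n_in))
print(final_state(short_seq).shape, final_state(long_seq).shape)  # (5,) (5,)
```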
Equation 10.2 can be drawn in two different ways. One is in a way that is inspired by what a physical implementation (such as a real neural network) might look like, i.e., like a circuit which operates in real time, as in the left of Figure 10.2.
The other is as a flow graph, in which the computations occurring at different
time steps in the circuit are unfolded as different nodes of the flow graph, as in
the right of Figure 10.2. What we call unfolding is the operation that maps a
circuit as in the left side of the figure to a flow graph with repeated pieces as
in the right side. Note how the unfolded graph now has a size that depends on
the sequence length. The black square indicates a delay of 1 time step on the
recurrent connection, from the state at time t to the state at time t + 1.
Figure 10.2: Left: the input processing part of a recurrent neural network, seen as a circuit. The black square indicates a delay of 1 time step. Right: the same seen as an unfolded flow graph, where each node is now associated with one particular time instance.
The other important observation to make from Figure 10.2 is that the same
parameters ($\theta$) are shared over different parts of the graph, corresponding here to
different time steps.
10.2 Recurrent Neural Networks
Armed with the ideas introduced in the previous section, we can design a wide variety of recurrent circuits, which are compact and simple to understand visually, and automatically obtain their equivalent unfolded graphs, which are useful computationally and also help focus on the idea of information flowing forward in time (computing outputs and losses) and backward in time (computing gradients).
Some of the early circuit designs for recurrent neural networks are illustrated
in Figures 10.3, 10.5 and 10.4. Figure 10.3 shows the vanilla recurrent network
whose equations are laid down below in Eq. 10.4, and which has been shown to be a
universal approximation machine for sequences, i.e., able to implement a Turing
machine (Siegelmann and Sontag, 1991; Siegelmann, 1995; Hyotyniemi, 1996).
On the other hand, the network with output recurrence shown in Figure 10.4
has a more limited memory or state, which is its output, i.e., the prediction
of the previous target, which potentially limits its expressive power, but also
makes it easier to train. Indeed, the ā€œintermediate stateā€ of the corresponding
unfolded deep network is not hidden anymore: targets are available to guide this
intermediate representation, which should make it easier to train. Teacher forcing
is the training process in which the fed back inputs are not the predicted outputs
but the targets themselves, as shown in figure TODO. The disadvantage of strict
teacher forcing is that if the network is going to be later used in an open-loop
mode, i.e., with the network outputs (or samples from the output distribution)
fed back as input, then the kind of inputs that the network will have seen during
training could be quite different from the kind of inputs that it will see at test
time, potentially yielding very poor generalization. One way to mitigate this
problem is to train with both teacher-forced inputs and with free-running inputs,
e.g., predicting the correct target a number of steps in the future through the
unfolded recurrent output-to-input paths. In this way, the network can learn to
take into account input conditions (such as those it generates itself in the free-
running mode) not seen during training and how to map the state back towards
one that will make the network generate proper outputs after a few steps.
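The following toy sketch (random, untrained parameters; purely illustrative) contrasts the two feeding modes for an output-recurrent network: teacher forcing feeds the ground-truth target back as the next input, while free-running mode would feed back the model's own prediction:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sym, n_state = 4, 8
U = rng.standard_normal((n_state, n_sym)) * 0.1
W = rng.standard_normal((n_state, n_state)) * 0.1
V = rng.standard_normal((n_sym, n_state)) * 0.1

def one_hot(i):
    v = np.zeros(n_sym); v[i] = 1.0; return v

def step(s, fed_back_symbol):
    """One step of an output-recurrent net: the fed-back symbol is the input."""
    s = np.tanh(W @ s + U @ one_hot(fed_back_symbol))
    logits = V @ s
    p = np.exp(logits - logits.max()); p /= p.sum()
    return s, p

targets = [2, 0, 3, 1]   # ground-truth symbols y_1..y_T
s = np.zeros(n_state)
prev = 0                 # start-symbol convention (an assumption of this sketch)
for y in targets:
    s, p = step(s, prev)
    # Teacher forcing: feed back the *target*, regardless of the prediction.
    prev = y
    # Free-running mode would instead feed back the model's own output:
    # prev = int(np.argmax(p))   # or a sample drawn from p
    print("p(target) =", round(float(p[y]), 3))
```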
The vanilla recurrent network of Figure 10.3 corresponds to the following for-
ward propagation equations, if we assume that hyperbolic tangent non-linearities
are used in the hidden units and softmax is used in output (for classification
problems):
$$
\begin{aligned}
a_t &= b + W s_{t-1} + U x_t \\
s_t &= \tanh(a_t) \\
o_t &= c + V s_t \\
p_t &= \mathrm{softmax}(o_t)
\end{aligned} \tag{10.4}
$$
where the parameters are the bias vectors b and c along with the weight matrices
U, V and W , respectively for input-to-hidden, hidden-to-output, and hidden-
to-hidden connections. This is an example of a recurrent network that maps an
input sequence to an output sequence of the same length. The total loss for a
given input/target sequence pair (x, y) would then be just the sum of the losses
over all the time steps, e.g.,
$$ L(x, y) = \sum_t L_t = \sum_t -\log p_{t,y_t} \tag{10.5} $$
where $y_t$ is the category that should be associated with time step $t$ in the output sequence.
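As a concrete illustration of Eqs. 10.4 and 10.5, here is a minimal NumPy forward pass (a sketch with arbitrary sizes and random parameters, not reference code) that computes the states, the softmax outputs and the summed negative log-likelihood for one input/target sequence:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out, T = 3, 6, 4, 5

U = rng.standard_normal((n_hid, n_in)) * 0.1   # input-to-hidden
W = rng.standard_normal((n_hid, n_hid)) * 0.1  # hidden-to-hidden
V = rng.standard_normal((n_out, n_hid)) * 0.1  # hidden-to-output
b = np.zeros(n_hid)
c = np.zeros(n_out)

x = rng.standard_normal((T, n_in))             # input sequence x_1..x_T
y = rng.integers(0, n_out, size=T)             # target categories y_1..y_T

s = np.zeros(n_hid)                            # initial state s_0
total_loss = 0.0
for t in range(T):
    a = b + W @ s + U @ x[t]                   # Eq. 10.4
    s = np.tanh(a)
    o = c + V @ s
    p = np.exp(o - o.max()); p /= p.sum()      # softmax
    total_loss += -np.log(p[y[t]])             # Eq. 10.5: L_t = -log p_{t, y_t}
print("L(x, y) =", float(total_loss))
```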
10.2.1 Computing the Gradient in a Recurrent Neural Network
Using the generalized back-propagation algorithm (for arbitrary flow-graphs) in-
troduced in Section 6.3, one can thus obtain the so-called Back-Propagation
Through Time (BPTT) algorithm. Once we know how to compute gradients,
we can in principle apply any of the gradient-based techniques to train an RNN,
introduced in Section 4.3 and developed in greater depth in Chapter 8.
Let us thus work out how to compute gradients by BPTT for the RNN equa-
tions above (Eqs. 10.4 and 10.5). The nodes of our flow graph will be the sequence
of $x_t$'s, $s_t$'s, $o_t$'s, $L_t$'s, and the parameters $U$, $V$, $W$, $b$, $c$. For each node $a$ we need to compute the gradient $\nabla_a L$ recursively, based on the gradient computed at nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss
$$ \frac{\partial L}{\partial L_t} = 1 $$
and the gradient on the outputs $i$ at time step $t$, for all $i$, $t$:
$$ \frac{\partial L}{\partial o_{t,i}} = \frac{\partial L}{\partial L_t}\,\frac{\partial L_t}{\partial o_{t,i}} = p_{t,i} - \mathbf{1}_{i,y_t} $$
and work our way backwards, starting from the end of the sequence, say $T$, at which point $s_T$ only has $o_T$ as a descendant:
$$ \nabla_{s_T} L = \nabla_{o_T} L \,\frac{\partial o_T}{\partial s_T} = \nabla_{o_T} L \; V. $$
Note how the above equation is written vector-wise and corresponds, scalar-wise, to $\frac{\partial L}{\partial s_{T,j}} = \sum_i \frac{\partial L}{\partial o_{T,i}} V_{ij}$. We can then iterate backwards in time to back-propagate gradients through time, from $t = T-1$ down to $t = 1$, noting that $s_t$ (for $t < T$) has as descendants both $o_t$ and $s_{t+1}$:
$$ \nabla_{s_t} L = \nabla_{s_{t+1}} L \,\frac{\partial s_{t+1}}{\partial s_t} + \nabla_{o_t} L \,\frac{\partial o_t}{\partial s_t} = \nabla_{s_{t+1}} L \,\mathrm{diag}(1 - s_{t+1}^2)\, W + \nabla_{o_t} L \; V $$
where $\mathrm{diag}(1 - s_{t+1}^2)$ indicates the diagonal matrix containing the elements $1 - s_{t+1,i}^2$, i.e., the derivative of the hyperbolic tangent associated with the hidden unit $i$ at time $t+1$.
Once the gradients on the internal nodes of the flow graph are obtained, we can obtain the gradients on the parameter nodes, which have descendants at all the time steps:
$$
\begin{aligned}
\nabla_c L &= \sum_t \nabla_{o_t} L \,\frac{\partial o_t}{\partial c} = \sum_t \nabla_{o_t} L \\
\nabla_b L &= \sum_t \nabla_{s_t} L \,\frac{\partial s_t}{\partial b} = \sum_t \nabla_{s_t} L \,\mathrm{diag}(1 - s_t^2) \\
\nabla_V L &= \sum_t \nabla_{o_t} L \,\frac{\partial o_t}{\partial V} = \sum_t \nabla_{o_t} L \; s_t^\top \\
\nabla_W L &= \sum_t \nabla_{s_t} L \,\frac{\partial s_t}{\partial W} = \sum_t \nabla_{s_t} L \,\mathrm{diag}(1 - s_t^2)\, s_{t-1}^\top
\end{aligned}
$$
Note in the above (and elsewhere) that whereas $\nabla_{s_t} L$ refers to the full influence of $s_t$ through all paths from $s_t$ to $L$, $\frac{\partial s_t}{\partial W}$ or $\frac{\partial s_t}{\partial b}$ refers to the immediate effect of the denominator on the numerator, i.e., when we consider the denominator as a parent of the numerator and only that direct dependency is accounted for. Otherwise, we would get ā€œdouble countingā€ of derivatives.
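To see these recursions in executable form, here is a small NumPy sketch of BPTT for the same vanilla RNN (arbitrary shapes and random data; it follows the equations of this section rather than any particular library, and adds the analogous $U$ gradient, which is not written out above):

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, n_out, T = 3, 6, 4, 5
U = rng.standard_normal((n_hid, n_in)) * 0.1
W = rng.standard_normal((n_hid, n_hid)) * 0.1
V = rng.standard_normal((n_out, n_hid)) * 0.1
b, c = np.zeros(n_hid), np.zeros(n_out)
x = rng.standard_normal((T + 1, n_in))     # x[1..T] used; index 0 unused
y = rng.integers(0, n_out, size=T + 1)     # y[1..T] used

# Forward pass, storing s_t and p_t for t = 1..T (s[0] is the initial state s_0 = 0).
s = np.zeros((T + 1, n_hid))
p = np.zeros((T + 1, n_out))
for t in range(1, T + 1):
    s[t] = np.tanh(b + W @ s[t - 1] + U @ x[t])
    o = c + V @ s[t]
    e = np.exp(o - o.max())
    p[t] = e / e.sum()

# Backward pass (BPTT): accumulate parameter gradients while iterating t = T..1.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
q_next = np.zeros(n_hid)                   # diag(1 - s_{t+1}^2) applied to the grad on s_{t+1}
for t in range(T, 0, -1):
    do = p[t].copy()
    do[y[t]] -= 1.0                        # grad on o_t: p_t - one_hot(y_t)
    ds = V.T @ do + W.T @ q_next           # grad on s_t: path via o_t plus path via s_{t+1}
    q = (1.0 - s[t] ** 2) * ds             # back through the tanh at time t
    dc += do
    dV += np.outer(do, s[t])
    db += q
    dW += np.outer(q, s[t - 1])
    dU += np.outer(q, x[t])                # U follows the same pattern (not written out above)
    q_next = q

# Quick finite-difference check of one gradient component (here db[0]).
def total_loss(b_try):
    s_, L = np.zeros(n_hid), 0.0
    for t in range(1, T + 1):
        s_ = np.tanh(b_try + W @ s_ + U @ x[t])
        o = c + V @ s_
        e = np.exp(o - o.max())
        L += -np.log((e / e.sum())[y[t]])
    return L

eps = 1e-6
delta = np.zeros(n_hid); delta[0] = eps
num = (total_loss(b + delta) - total_loss(b - delta)) / (2 * eps)
print("analytic db[0]:", db[0], " numerical:", num)
```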
10.2.2 Recurrent Networks as Generative Directed Acyclic Models
Up to here, we have not clearly stated what the losses $L_t$ associated with the outputs $o_t$ of a recurrent net should be. This is because there are many possible ways in which RNNs can be used. Here we consider the case where the RNN
models a probability distribution over a sequence of observations.
As introduced in Section 14.2.1, directed graphical models represent a probability distribution over a sequence of $T$ random variables $x = (x_1, x_2, \ldots, x_T)$ by parametrizing their joint distribution as
$$ P(x) = P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1) \tag{10.6} $$
where the right-hand side of the conditioning bar is empty for $t = 1$, of course. Hence the negative log-likelihood of $x$ according to such a model is
$$ L = \sum_t L_t $$
where
$$ L_t = -\log P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1). $$
In general directed graphical models, $x_t$ can be predicted using only a subset of its predecessors $(x_1, \ldots, x_{t-1})$. However, for RNNs, the graphical model is generally fully connected, not removing any dependency a priori. This can be achieved efficiently through the recurrent parametrization, such as in Eq. 10.2, since $s_t$ summarizes the whole previous sequence (Eq. 10.3).
Hence, instead of cutting statistical complexity by removing arcs in the di-
rected graphical model for $(x_1, \ldots, x_T)$, the core idea of recurrent networks is that we introduce a state variable which decouples all the past and future observations, but we make that state variable a generally deterministic function of the past, through the recurrence, Eq. 10.2. Consequently, the number of parameters required to parametrize $P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)$ does not grow exponentially with $t$ (as it would if we parametrized that probability by a straight probability table, when the data is discrete) but remains constant with $t$. It only grows with the dimension of the state $s_t$. The price to be paid for that great advantage
is that optimizing the parameters may be more difficult, as discussed below in
Section 10.6. The decomposition of the likelihood thus becomes:
$$ P(x) = \prod_{t=1}^{T} P(x_t \mid G_t(x_{t-1}, x_{t-2}, \ldots, x_1)) $$
where
$$ s_t = G_t(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1) = F_\theta(s_{t-1}, x_t). $$
Note that if the self-recurrence function $F_\theta$ is learned, it can discard some of the information in some of the past values $x_{t-k}$ that are not needed for predicting
the future data. In fact, because the state generally has a fixed dimension smaller
than the length of the sequences (times the dimension of the input), it has to
discard some information. However, we leave it to the learning procedure to
choose what information to keep and what information to throw away.
The above decomposition of the joint probability of a sequence of variables
into ordered conditionals precisely corresponds to what an RNN can compute:
the target at each time step t is the next element in the sequence, while the
input at each time step is the previous element in the sequence, and the output is
interpreted as parametrizing the probability distribution of the target given the
state. This is illustrated in Figure 10.6.
If the RNN is actually going to be used to generate sequences, one must also
incorporate in the output information that allows the model to stochastically decide when to stop generating new output elements. This can be achieved in various ways. In the
case when the output is a symbol taken from a vocabulary, one can add a special
symbol corresponding to the end of a sequence. When that symbol is generated, a
complete sequence has been generated. The target for that special symbol occurs
exactly once per sequence, as the last target after the last output element $x_T$. In
general, one may train a binomial output associated with that stopping variable,
for example using a sigmoid output non-linearity and the cross-entropy loss, i.e.,
again negative log-likelihood for the event ā€œend of the sequenceā€. Another kind
of solution is to model the integer T itself, through any reasonable parametric
distribution, and use the number of time steps left (and possibly the number of
time steps since the beginning of the sequence) as extra inputs at each time step.
Thus we would have decomposed $P(x_1, \ldots, x_T)$ into $P(T)$ and $P(x_1, \ldots, x_T \mid T)$. In general, one must therefore keep in mind that in order to fully generate a sequence we must not only generate the $x_t$'s, but also the sequence length $T$, either implicitly through a series of continue/stop decisions (or a special ā€œend-of-sequenceā€ symbol), or explicitly through modeling the distribution of $T$ itself as an integer random variable.
If we take the RNN equations of the previous section (Eq. 10.4 and 10.5), they could correspond to a generative RNN if we simply make the target $y_t$ equal to the next input $x_{t+1}$ (and because the outputs are the result of a softmax, it must be that the input sequence is a sequence of symbols, i.e., $x_t$ is a symbol or bounded integer).
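As a sketch of how such a generative RNN could be sampled from (the symbol set, the end-of-sequence convention and the untrained random weights are all illustrative assumptions), the loop below feeds each sampled symbol back as the next input and stops when the special end-of-sequence symbol is drawn:

```python
import numpy as np

rng = np.random.default_rng(4)
n_sym = 5                      # symbols 0..3 plus a special end-of-sequence symbol
EOS = n_sym - 1                # index of the end-of-sequence symbol (a convention we assume)
n_hid = 16
U = rng.standard_normal((n_hid, n_sym)) * 0.3
W = rng.standard_normal((n_hid, n_hid)) * 0.3
V = rng.standard_normal((n_sym, n_hid)) * 0.3
b, c = np.zeros(n_hid), np.zeros(n_sym)

def sample_sequence(max_len=50):
    s = np.zeros(n_hid)
    prev = EOS                 # start-symbol convention (an assumption of this sketch)
    out = []
    for _ in range(max_len):
        x = np.zeros(n_sym); x[prev] = 1.0
        s = np.tanh(b + W @ s + U @ x)
        o = c + V @ s
        p = np.exp(o - o.max()); p /= p.sum()
        prev = rng.choice(n_sym, p=p)      # x_{t+1} ~ P(x_{t+1} | x_1..x_t)
        if prev == EOS:                    # sampled "end of sequence": stop
            break
        out.append(int(prev))
    return out

print(sample_sequence())       # with untrained weights this is just noise
```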
Other types of data can clearly be modeled in a similar way, following the
discussions about the encoding of outputs and the probabilistic interpretation of
losses as negative log-likelihoods, in Sections 5.7 and 6.2.2.
10.2.3 RNNs to Represent Conditional Probability Distributions
In general, as discussed in Section 6.2.2 (see especially the end of that section), when we can represent a parametric probability distribution $P(y \mid \omega)$, we can make it conditional by making $\omega$ a function of the desired conditioning variable:
$$ P(y \mid \omega = f(x)). $$
In the case of an RNN, this can be achieved in different ways, and we review here
the most common and obvious choices.
If x is a fixed-size vector, then we can simply make it an extra input of the
RNN that generates the y sequence. Some common ways of providing an extra
input to an RNN are:
1. as an extra input at each time step, or
2. as the initial state $s_0$, or
3. both.
In general, one may need to add extra parameters (and parametrization) to map $x$ into the ā€œextra biasā€ going either into only $s_0$, into the other $s_t$ ($t > 0$), or into both. The first (and most commonly used) approach is illustrated in Figure 10.7.
As an example, we could imagine that x is encoding the identity of a phoneme
and the identity of a speaker, and that y represents an acoustic sequence corre-
sponding to that phoneme, as pronounced by that speaker.
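A minimal sketch of the choices listed above (hypothetical parameter names and shapes): a fixed-size conditioning vector $x$ is mapped, through extra parameters, into the initial state $s_0$ and/or into an extra per-step bias:

```python
import numpy as np

rng = np.random.default_rng(5)
n_cond, n_in, n_hid = 4, 3, 8
W = rng.standard_normal((n_hid, n_hid)) * 0.1
U = rng.standard_normal((n_hid, n_in)) * 0.1
b = np.zeros(n_hid)
R = rng.standard_normal((n_hid, n_cond)) * 0.1   # extra parameters mapping x ...
Q = rng.standard_normal((n_hid, n_cond)) * 0.1   # ... into a per-step bias and into s_0

def run(y_inputs, x_cond, use_initial_state=True, use_extra_input=True):
    """Condition the RNN over the y sequence on a fixed-size vector x."""
    s = np.tanh(Q @ x_cond) if use_initial_state else np.zeros(n_hid)   # option 2
    extra = R @ x_cond if use_extra_input else 0.0                      # option 1
    states = []
    for y_t in y_inputs:
        s = np.tanh(b + extra + W @ s + U @ y_t)
        states.append(s)
    return np.array(states)

x_cond = rng.standard_normal(n_cond)       # e.g., a code for phoneme + speaker identity
y_inputs = rng.standard_normal((6, n_in))  # the sequence being processed/generated
print(run(y_inputs, x_cond).shape)         # (6, 8)
```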
Consider the case where the input x is a sequence of the same length as the
output sequence $y$, and the $y_t$'s are independent of each other when the past input sequence is given, i.e., $P(y_t \mid y_{t-1}, \ldots, y_1, x) = P(y_t \mid x_t, x_{t-1}, \ldots, x_1)$. We therefore have a causal relationship between the $x_t$'s and the predictions of the $y_t$'s, in addition to the independence of the $y_t$'s, given $x$. Under these (pretty strong) assumptions, we can return to Fig. 10.3 and interpret the $t$-th output $o_t$ as parameters for a conditional distribution for $y_t$, given $x_t, x_{t-1}, \ldots, x_1$.
If we want to remove the conditional independence assumption, we can do so
by making the past $y_t$'s inputs into the state as well. That situation is illustrated in Fig. 10.8.
10.3 Bidirectional RNNs
All of the recurrent networks we have considered up to now have a ā€œcausalā€
structure, meaning that the state at time t only captures information from the
past, $x_1, \ldots, x_t$. However, in many applications we want to output at time $t$ a
prediction regarding an output which may depend on the whole input sequence.
For example, in speech recognition, the correct interpretation of the current sound
as a phoneme may depend on the next few phonemes because of co-articulation
and potentially may even depend on the next few words because of the linguistic
dependencies between nearby words: if there are two interpretations of the current
word that are both acoustically plausible, we may have to look far into the future
(and the past) to disambiguate them. This is also true of handwriting recognition
and many other sequence-to-sequence learning tasks.
Bidirectional recurrent neural networks (or bidirectional RNNs) were invented
to address that need (Schuster and Paliwal, 1997). They have been extremely
successful (Graves, 2012) in applications where that need arises, such as hand-
writing (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recogni-
tion (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi
et al., 1999).
As the name suggests, the basic idea behind bidirectional RNNs is to combine
a forward-going RNN and a backward-going RNN. Figure 10.9 illustrates the
typical bidirectional RNN, with $h_t$ standing for the state of the forward-going RNN and $g_t$ standing for the state of the backward-going RNN. This allows the units $o_t$ to compute a representation that depends on both the past and the future
but is most sensitive to the input values around time t, without having to specify
a fixed-size window around t (as one would have to do with a feedforward network,
a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).
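The following sketch (illustrative shapes and random weights) runs a forward RNN producing $h_t$ and a backward RNN producing $g_t$ over the same input, and combines both at each step to form $o_t$, so that every output depends on the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_hid, n_out, T = 3, 5, 2, 7
Wf = rng.standard_normal((n_hid, n_hid)) * 0.1; Uf = rng.standard_normal((n_hid, n_in)) * 0.1
Wb = rng.standard_normal((n_hid, n_hid)) * 0.1; Ub = rng.standard_normal((n_hid, n_in)) * 0.1
Vf = rng.standard_normal((n_out, n_hid)) * 0.1; Vb = rng.standard_normal((n_out, n_hid)) * 0.1
x = rng.standard_normal((T, n_in))

# Forward-going states h_1..h_T.
h = np.zeros((T, n_hid)); prev = np.zeros(n_hid)
for t in range(T):
    prev = np.tanh(Wf @ prev + Uf @ x[t]); h[t] = prev

# Backward-going states g_T..g_1.
g = np.zeros((T, n_hid)); prev = np.zeros(n_hid)
for t in range(T - 1, -1, -1):
    prev = np.tanh(Wb @ prev + Ub @ x[t]); g[t] = prev

# Each output o_t depends on both the past (through h_t) and the future (through g_t).
o = np.array([Vf @ h[t] + Vb @ g[t] for t in range(T)])
print(o.shape)   # (7, 2)
```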
This idea can be naturally extended to 2-dimensional input, such as images,
by having four RNNs, each one going in one of the four directions: up, down,
left, right. At each point $(i, j)$ of a 2-D grid, an output $o_{i,j}$ could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNNs are able to learn to carry that information.
Figure 10.10: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. In the figure, a variable-size sequence $x_1, x_2, \ldots$ can be mapped to a fixed-size representation (the output $o$), with a fixed number of parameters (e.g., the weight matrices $U$, $V$, $W$). The figure illustrates a supervised learning case in which some target $y$ is provided, associated with the whole sequence.
10.4 Recursive Neural Networks
Recursive networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which this time is structured as a tree. The
typical computational graph for a recursive network is illustrated in Figure 10.10.
Recursive neural networks were introduced by Pollack (1990) and their potential use for learning to reason was nicely laid out by Bottou (2011). Recursive
networks have been successfully applied in processing data structures as input to
neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher
et al., 2011a,c, 2013) as well as in computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a sequence of length $N$, the depth can be drastically reduced from $N$ to $O(\log N)$. An
open question is how to best structure the tree, though. One option is to have a
tree structure which does not depend on the data, e.g., a balanced binary tree.
Another is to use an external method, such as a natural language parser (Socher
et al., 2011a, 2013). Ideally, one would like the learner itself to discover and infer
the tree structure that is appropriate for any given input, as suggested in Bottou
(2011).
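As a sketch of the recursive case (a data-independent, roughly balanced binary tree, with our own parameter names), the same weight matrices are applied at every internal node, so a sequence of length $N$ is reduced in depth $O(\log N)$:

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_hid = 4, 6
U = rng.standard_normal((n_hid, n_in)) * 0.1    # leaf embedding
WL = rng.standard_normal((n_hid, n_hid)) * 0.1  # left-child weights (shared)
WR = rng.standard_normal((n_hid, n_hid)) * 0.1  # right-child weights (shared)
b = np.zeros(n_hid)

def compose(left, right):
    """Same parameters applied at every internal node of the tree."""
    return np.tanh(b + WL @ left + WR @ right)

def encode(nodes):
    """Reduce a list of vectors pairwise, building a roughly balanced tree."""
    if len(nodes) == 1:
        return nodes[0]
    nxt = []
    for i in range(0, len(nodes) - 1, 2):
        nxt.append(compose(nodes[i], nodes[i + 1]))
    if len(nodes) % 2 == 1:            # odd leftover node is carried up unchanged
        nxt.append(nodes[-1])
    return encode(nxt)

x_seq = rng.standard_normal((9, n_in))         # variable-size input sequence
leaves = [np.tanh(U @ x) for x in x_seq]
o = encode(leaves)                             # fixed-size representation of the sequence
print(o.shape)                                 # (6,)
```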
Many variants of the recursive net idea are possible. For example, in Frasconi
et al. (1997, 1998), the data is associated with a tree structure in the first place,
and inputs and/or targets with each node of the tree. The computation performed
by each node does not have to be the traditional artificial neuron computation
(affine transformation of all inputs followed by a monotone non-linearity). For
example, Socher et al. (2013) propose using tensor operations and bilinear forms,
which have previously been found useful to model relationships between con-
cepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented
by continuous vectors (embeddings).
10.5 Auto-Regressive Networks
One of the basic ideas behind recurrent networks is that of directed graphical
models with a twist: we decompose a probability distribution as a product of
conditionals without explicitly cutting any arc in the graphical model, but instead
reducing complexity by parametrizing the transition probability in a recursive
way that requires a fixed (and not exponential) number of parameters, due to a
form of parameter sharing (see Section 7.9 for an introduction to the concept).
Instead of reducing $P(x_t \mid x_{t-1}, \ldots, x_1)$ to something like $P(x_t \mid x_{t-1}, \ldots, x_{t-k})$ (assuming the $k$ previous ones as the parents), we keep the full dependency but we parametrize the conditional efficiently in a way that does not grow with $t$, exploiting parameter sharing. When the above conditional probability distribution is in
ing parameter sharing. When the above conditional probability distribution is in
some sense stationary, i.e., the relation between the past and the next observation
does not depend on t, only on the values of the past observations, then this form
of parameter sharing makes a lot of sense, and for recurrent nets it allows us to use
the same model for sequences of different lengths.
Auto-regressive networks are similar to recurrent networks in the sense that
we also decompose a joint probability over the observed variables as a product of
conditionals of the form $P(x_t \mid x_{t-1}, \ldots, x_1)$ but we drop the form of parameter
sharing that makes these conditionals all share the same parametrization. This
makes sense when the variables are not elements of a translation-invariant se-
quence, but instead form an arbitrary tuple without any particular ordering that
would correspond to a translation-invariant form of relationship between variables
at position $k$ and variables at position $k'$. In some forms of auto-regressive net-
works, such as NADE (Larochelle and Murray, 2011), described in Section 10.5.3
below, we can re-introduce a form of parameter sharing that is different from
the one found in recurrent networks, but that brings both a statistical advantage
(fewer parameters) and a computational advantage (less computation). Although we drop the sharing over time, as we see below in Section 10.5.2, using the deep learning concept of feature reuse, we can share features that have been computed for predicting $x_{t-k}$ with the sub-network that predicts $x_t$.
Figure 10.11: An auto-regressive network predicts the $i$-th variable from the $i-1$ previous ones. Left: the corresponding graphical model (which is the same as that of a recurrent network). Right: the corresponding computational graph, in the case of the logistic auto-regressive network, where each prediction has the form of a logistic regression, i.e., with $i$ free parameters (for the $i-1$ weights associated with the $i-1$ inputs, and an offset parameter).
10.5.1 Logistic Auto-Regressive Networks
Let us first consider the simplest auto-regressive network, without hidden units,
and hence no sharing at all. Each $P(x_t \mid x_{t-1}, \ldots, x_1)$ is parametrized as a linear model, e.g., a logistic regression. This model was introduced by Frey (1998) and has $O(T^2)$ parameters when there are $T$ variables to be modeled. It is illustrated in Figure 10.11, showing both the graphical model (left) and the computational graph (right).
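A minimal sketch of this model for binary variables (illustrative parameter shapes): each conditional $P(x_t = 1 \mid x_{t-1}, \ldots, x_1)$ is a separate logistic regression on the previous variables, giving $O(T^2)$ parameters in total:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 6
# One logistic regression per variable: weights over the previous t-1 inputs plus a bias.
weights = [rng.standard_normal(t) * 0.1 for t in range(T)]   # weights[t] has length t
biases = rng.standard_normal(T) * 0.1
n_params = sum(t + 1 for t in range(T))                      # O(T^2) parameters in total

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x):
    """log P(x) = sum_t log P(x_t | x_{t-1}, ..., x_1) for a binary vector x."""
    lp = 0.0
    for t in range(T):
        p1 = sigmoid(biases[t] + weights[t] @ x[:t])   # P(x_t = 1 | previous variables)
        lp += np.log(p1 if x[t] == 1 else 1.0 - p1)
    return lp

x = rng.integers(0, 2, size=T)
print(n_params, "parameters; log P(x) =", float(log_prob(x)))
```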
A clear disadvantage of the logistic auto-regressive network is that one cannot
easily increase its capacity in order to capture more complex data distributions. It
defines a parametric family of fixed capacity, like the linear regression, the logistic
regression, or the Gaussian distribution. In fact, if the variables are continuous,
one gets a linear auto-regressive model, which is thus another way to formulate
a Gaussian distribution, i.e., only capturing pairwise interactions between the
observed variables.
Figure 10.12: A neural auto-regressive network predicts the $i$-th variable $x_i$ from the $i-1$ previous ones, but is parametrized so that features (groups of hidden units denoted $h_i$) that are functions of $x_1, \ldots, x_i$ can be reused in predicting all of the subsequent variables $x_{i+1}, x_{i+2}, \ldots$.
10.5.2 Neural Auto-Regressive Networks
Neural Auto-Regressive Networks have the same left-to-right graphical model as
logistic auto-regressive networks (Figure 10.11, left) but a different parametriza-
tion that is at once more powerful (allowing one to extend the capacity as needed and
approximate any joint distribution) and can improve generalization by introduc-
ing a parameter sharing and feature sharing principle common to deep learning in
general. The first paper on neural auto-regressive networks by Bengio and Bengio
(2000b) (see also Bengio and Bengio (2000a) for the more extensive journal ver-
sion) was motivated by the objective of avoiding the curse of dimensionality arising
out of traditional non-parametric graphical models, sharing the same structure
as Figure 10.11 (left). In the non-parametric discrete distribution models, each
conditional distribution is represented by a table of probabilities, with one entry
and one parameter for each possible configuration of the variables involved. By
using a neural network instead, two advantages are obtained:
1. The parametrization of each $P(x_t \mid x_{t-1}, \ldots, x_1)$ by a neural network with $(t-1) \times K$ inputs and $K$ outputs (if the variables are discrete and take $K$ values, encoded one-hot) allows one to estimate the conditional probability without requiring an exponential number of parameters (and examples), yet still allows the model to capture high-order dependencies between the random variables.
2. Instead of having a different neural network for the prediction of each $x_t$, the left-to-right connectivity illustrated in Figure 10.12 allows one to merge all the neural networks into one. Equivalently, it means that the hidden layer features computed for predicting $x_t$ can be reused for predicting $x_{t+k}$ ($k > 0$). The hidden units are thus organized in groups that have the particularity that all the units in the $t$-th group only depend on the input values $x_1, \ldots, x_t$. In fact, the parameters used to compute these hidden units are jointly optimized to help the prediction of all the variables $x_t, x_{t+1}, x_{t+2}, \ldots$. This is an instance of the reuse principle that makes multi-task learning and transfer learning successful with neural networks and deep learning in general (see Sections 7.12 and 17.2).
Each $P(x_t \mid x_{t-1}, \ldots, x_1)$ can represent a conditional distribution by having the outputs of the neural network predict parameters of the conditional distribution of $x_t$, as discussed in Section 6.2.2. Although the original neural auto-regressive networks were initially evaluated in the context of purely discrete multivariate data (e.g., with a sigmoid output for the Bernoulli case or a softmax output for the multinoulli case), it is straightforward to extend such models to continuous variables or to joint distributions involving both discrete and continuous variables, as for example with RNADE, introduced below (Uria et al., 2013).
Figure 10.13: NADE (Neural Auto-regressive Density Estimator) is a neural auto-regressive network, i.e., the hidden units are organized in groups $h_j$ so that only the inputs $x_1, \ldots, x_i$ participate in computing $h_i$ and predicting $P(x_j \mid x_{j-1}, \ldots, x_1)$, for $j > i$. The particularity of NADE is the use of a particular weight sharing pattern: the same $W'_{jki} = W_{ki}$ is shared (same color and line pattern in the figure) for all the weights outgoing from $x_i$ to the $k$-th unit of any group $j \ge i$. The vector $(W_{1i}, W_{2i}, \ldots)$ is denoted $W_i$ here.
10.5.3 NADE
A very successful recent form of neural auto-regressive network was proposed
by Larochelle and Murray (2011). The architecture is basically the same as for
the original neural auto-regressive network of Bengio and Bengio (2000b) except
for the introduction of a weight-sharing scheme, as illustrated in Figure 10.13.
The parameters of the hidden units of different groups $j$ are shared, i.e., the weights $W'_{jki}$ from the $i$-th input $x_i$ to the $k$-th element of the $j$-th group of hidden units $h_{jk}$ ($j \ge i$) are shared:
$$ W'_{jki} = W_{ki} $$
with $(W_{1i}, W_{2i}, \ldots)$ denoted $W_i$ in Figure 10.13.
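A minimal sketch of the resulting NADE computation for binary inputs (our own variable names, following the shared-weight pattern above): because group $h_j$ uses columns $1, \ldots, j-1$ of a single shared matrix $W$, its pre-activation can be built up incrementally, so the whole log-probability costs $O(NH)$ operations:

```python
import numpy as np

rng = np.random.default_rng(9)
N, H = 8, 16                                   # N input variables, H hidden units per group
W = rng.standard_normal((H, N)) * 0.1          # shared input weights W_{ki}
V = rng.standard_normal((N, H)) * 0.1          # output weights (not tied to W)
c = np.zeros(H)                                # shared hidden biases
b = np.zeros(N)                                # output biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_prob(x):
    """log P(x) for a binary vector x, computed in O(NH) by reusing pre-activations."""
    a = c.copy()                               # pre-activation of group h_1 (no inputs yet)
    lp = 0.0
    for j in range(N):
        h = sigmoid(a)                         # h_j depends only on x_1..x_{j-1}
        p1 = sigmoid(b[j] + V[j] @ h)          # P(x_j = 1 | x_1..x_{j-1})
        lp += np.log(p1 if x[j] == 1 else 1.0 - p1)
        a = a + W[:, j] * x[j]                 # incremental update, reused for group h_{j+1}
    return lp

x = rng.integers(0, 2, size=N)
print("log P(x) =", float(nade_log_prob(x)))
```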
This particular sharing pattern is motivated in Larochelle and Murray (2011)
by the computations performed in the mean-field inference³ of an RBM, when only the first $i$ inputs are given and one tries to infer the subsequent ones. This mean-field inference corresponds to running a recurrent network with shared weights, and the first step of that inference is the same as in NADE. The only difference is that with the proposed NADE, the output weights are not forced to be simply the transposed values of the input weights (they are not tied). One could imagine actually extending this procedure to not just one time step of the mean-field recurrent inference but to $k$ steps, as in Raiko et al. (2014).

³ Here, unlike in Section 14.5, the inference is over some of the input variables that are missing, given the observed ones.
Although the neural auto-regressive networks and NADE were originally pro-
posed to deal with discrete distributions, they can in principle be generalized
to continuous ones by replacing the conditional discrete probability distributions
(for $P(x_j \mid x_{j-1}, \ldots, x_1)$) by continuous ones, following the general practice of predicting continuous random variables with neural networks (see Section 6.2.2) using
the log-likelihood framework. A fairly generic way of parametrizing a continuous
density is as a Gaussian mixture, and this avenue has been successfully evaluated
for the neural auto-regressive architecture with RNADE (Uria et al., 2013). One
interesting point to note is that stochastic gradient descent can be numerically
ill-behaved due to the interactions between the conditional means and the condi-
tional variances (especially when the variances become small). Uria et al. (2013)
have used a heuristic to rescale the gradient on the component means by the
associated standard deviation, which seems to have helped in optimizing RNADE.
Another very interesting extension of the neural auto-regressive architectures
gets rid of the need to choose an arbitrary order for the observed variables (Mur-
ray and Larochelle, 2014). The idea is to train the network to be able to cope with any order by randomly sampling orders and providing to the hidden units the information specifying which of the inputs are observed (on the right-hand, conditioning side of the bar) and which are to be predicted and are thus considered missing (on the left-hand side of the conditioning bar). This is nice because it allows one to use a trained auto-regressive network to perform any inference (i.e., predict or sample from the probability distribution over any subset of variables given any
subset) extremely efficiently. Finally, since many orders are possible, the joint
probability of some set of variables can be computed in many ways (N! for N
variables), and this can be exploited to obtain a more robust probability estima-
tion and better log-likelihood, by simply averaging the log-probabilities predicted
by different randomly chosen orders. In the same paper, the authors propose to
consider deep versions of the architecture, but unfortunately that immediately
makes computation as expensive as in the original neural auto-regressive neural
network (Bengio and Bengio, 2000b). The first layer and the output layer can
still be computed in $O(NH)$ multiply-add operations, as in the regular NADE, where $H$ is the number of hidden units (the size of the groups $h_i$ in Figures 10.13 and 10.12), whereas it is $O(N^2 H)$ in Bengio and Bengio (2000b). However, for
the other hidden layers, the computation is $O(N^2 H^2)$ if every ā€œpreviousā€ group at layer $k$ participates in predicting the ā€œnextā€ group at layer $k+1$, assuming $N$ groups of $H$ hidden units at each layer. Making the $i$-th group at layer $k+1$ only depend on the $i$-th group at layer $k$, as in Murray and Larochelle (2014), reduces the computation to $O(NH^2)$, which is still $H$ times worse than the regular NADE.
10.6 Facing the Challenge of Long-Term Dependencies
The mathematical challenge of learning long-term dependencies in recurrent net-
works was introduced in Section 8.2.5. The basic problem is that gradients prop-
agated over many stages tend to either vanish (most of the time) or explode
(rarely, but with much damage to the optimization). Even if we assume that the
parameters are such that the recurrent network is stable (can store memories,
with gradients not exploding), the difficulty with long-term dependencies arises
from the exponentially smaller weights given to long-term interactions (involving
the multiplication of many Jacobians) compared to short-term ones. See Hochreiter
(1991); Doya (1993); Bengio et al. (1994); Pascanu et al. (2013a) for a deeper
treatment.
In this section we discuss various approaches that have been proposed to
alleviate this difficulty with learning long-term dependencies.
10.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely Contractive
The recurrent weights and input weights of a recurrent network are those that
define the state representation captured by the model, i.e., how the state $s_t$ (the hidden units vector) at time $t$ (Eq. 10.2) captures and summarizes information from the previous inputs $x_1, x_2, \ldots, x_t$. Since learning the recurrent and input weights
is difficult, one option that has been proposed (Jaeger and Haas, 2004; Jaeger,
2007b; Maass et al., 2002) is to set those weights such that the recurrent hid-
den units do a good job of capturing the history of past inputs, and only learn
the output weights. This is the idea that was independently proposed for Echo
State Networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and Liquid State
Machines (Maass et al., 2002). The latter is similar, except that it uses spik-
ing neurons (with binary outputs) instead of the continuous valued hidden units
used for ESNs. Both ESNs and liquid state machines are termed reservoir com-
puting (LukoÅ”evičius and Jaeger, 2009) to denote the fact that the hidden units form a reservoir of temporal features which may capture different aspects of the
history of inputs.
One way to think about these reservoir computing recurrent networks is that
they are similar to kernel machines: they allow one to map an arbitrary length sequence (the history of inputs up to time $t$) into a fixed-length vector (the recurrent state $s_t$), on which a linear predictor (typically a linear regression) can be applied
to solve the problem of interest. The training criterion is therefore convex in the
parameters (which are just the output weights) and can actually be solved online
in the linear regression case (using online updates for linear regression (Jaeger,
2003)).
The important question is therefore: how do we set the input and recurrent
weights so that a rich set of histories can be represented in the recurrent neural
network state? The answer proposed in the reservoir computing literature is to
make the dynamical system associated with the recurrent net nearly on the edge
of chaos. (TODO make that more precise). As alluded to in 8.2.5, an important
characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians
J
t
=
āˆ‚s
t
āˆ‚s
tāˆ’1
, and in particular the spectral radius of J
t
, i.e., its largest eigenvalue.
If it is greater than 1, the dynamics can diverge, meaning that small differences in
the state value at t can yield a very large difference at T later in the sequence. To
see this, consider the simpler case where the Jacobian matrix $J$ does not change with $t$. If a change $\Delta s$ in the state at $t$ is aligned with an eigenvector $v$ of $J$ with eigenvalue $\lambda > 1$, then the small change $\Delta s$ becomes $\lambda \Delta s$ after one time step, and $\lambda^N \Delta s$ after $N$ time steps. If $\lambda > 1$ this makes the change exponentially
large. With a non-linear map, the Jacobian keeps changing and the dynamics is
more complicated but what remains is that a small initial variation can turn into
a large variation after a number of steps. In general, the recurrent dynamics are
bounded (for example, if the hidden units use a bounded non-linearity such as
the hyperbolic tangent) so that the change after N steps must also be bounded.
Instead, when the largest eigenvalue $\lambda < 1$, we say that the map from $t$ to $t+1$ is contractive: a small change gets contracted, becoming smaller after each time step. This necessarily makes the network forget information about the long-term past, but it also makes its dynamics stable and easier to use.
What has been proposed to set the weights of reservoir computing machines
is to make the Jacobians slightly contractive. This is achieved by making the
spectral radius of the weight matrix large but slightly less than 1. However,
in practice, good results are often found with a spectral radius of the recurrent
weight matrix being slightly larger than 1, e.g., 1.2 (Sutskever, 2012; Sutskever
et al., 2013). Keep in mind that with hyperbolic tangent units, the maximum
derivative is 1, so that in order to guarantee a Jacobian spectral radius less than
1, the weight matrix should have spectral radius less than 1 as well. However,
most derivatives of the hyperbolic tangent will be less than 1, which may explain
Sutskever’s empirical observation.
More recently, it has been shown that the techniques used to set the weights in
ESNs could be used to initialize the weights in a fully trainable recurrent network
(e.g., trained using back-propagation through time), helping to learn long-term
dependencies (Sutskever, 2012; Sutskever et al., 2013). In addition to setting the
spectral radius to 1.2, Sutskever sets the recurrent weight matrix to be initially
sparse, with only 15 non-zero input weights per hidden unit.
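A small sketch of the reservoir recipe (all constants are illustrative): draw random input and recurrent weights, rescale the recurrent matrix to a chosen spectral radius, run the fixed reservoir over the inputs, and fit only a linear readout by least squares:

```python
import numpy as np

rng = np.random.default_rng(10)
n_in, n_res, T = 1, 100, 300
target_radius = 1.2                          # a value reported to often work well

# Fixed (untrained) input and recurrent weights.
U = rng.standard_normal((n_res, n_in)) * 0.1
W = rng.standard_normal((n_res, n_res))
W *= target_radius / max(abs(np.linalg.eigvals(W)))   # rescale to the chosen spectral radius

x = np.sin(np.linspace(0.0, 20.0, T)).reshape(T, n_in)   # toy input signal
y = np.roll(x[:, 0], -5)                                  # toy target: the input, 5 steps ahead

# Run the reservoir; only the linear readout below is learned.
S = np.zeros((T, n_res))
s = np.zeros(n_res)
for t in range(T):
    s = np.tanh(W @ s + U @ x[t])
    S[t] = s

readout, *_ = np.linalg.lstsq(S, y, rcond=None)   # convex problem: plain linear regression
print("train MSE:", float(np.mean((S @ readout - y) ** 2)))
```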
Note that when some eigenvalues of the Jacobian are exactly 1, information
can be kept in a stable way, and back-propagated gradients neither vanish nor
explode. The next two sections show methods to make some paths in the unfolded
graph correspond to ā€œmultiplying by 1ā€ at each step, i.e., keeping information for
a very long time.
10.6.2 Combining Short and Long Paths in the Unfolded Flow Graph
An old idea that has been proposed to deal with the difficulty of learning long-
term dependencies is to use recurrent connections with long delays. Whereas
the ordinary recurrent connections are associated with a delay of 1 (relating the
state at t with the state at t + 1), it is possible to construct recurrent networks
with longer delays (Bengio, 1991), following the idea of incorporating delays in
feedforward neural networks (Lang and Hinton, 1988) in order to capture temporal
structure (with Time-Delay Neural Networks, which are the 1-D predecessors of
Convolutional Neural Networks, discussed in Chapter 9).
Figure 10.14: A recurrent neural network with delays, in which some of the connections reach back in time more than one time step. Left: connectivity of the recurrent net, with square boxes indicating the number of time delays associated with a connection. Right: the unfolded recurrent network. In the figure there are regular recurrent connections with a delay of 1 time step ($W_1$) and recurrent connections with a delay of 3 time steps ($W_3$). The advantage of these longer-delay connections is that they allow past states to be connected to future states through shorter paths (3 times shorter, here), going through these longer delay connections (in red).
As we have seen in Section 8.2.5, gradients will vanish exponentially with
respect to the number of time steps. If we have recurrent connections with a time-
delay of $D$, then instead of the vanishing or explosion going as $O(\lambda^T)$ over $T$ time steps (where $\lambda$ is the largest eigenvalue of the Jacobians $\frac{\partial s_t}{\partial s_{t-1}}$), the unfolded recurrent network now has paths through which gradients grow as $O(\lambda^{T/D})$ because the number of effective steps is $T/D$. This allows the learning algorithm
to capture longer dependencies although not all long-term dependencies may be
well represented in this way. This idea was first explored in Lin et al. (1996) and
is illustrated in Figure 10.14.
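A tiny sketch of such a recurrence with both a delay-1 and a delay-$d$ connection (weights and the delay $d$ are arbitrary); the $W_d$ connection creates unfolded paths that are $d$ times shorter:

```python
import numpy as np

rng = np.random.default_rng(11)
n_hid, n_in, T, d = 6, 2, 20, 3
W1 = rng.standard_normal((n_hid, n_hid)) * 0.1   # ordinary delay-1 recurrent weights
Wd = rng.standard_normal((n_hid, n_hid)) * 0.1   # longer, delay-d recurrent weights
U = rng.standard_normal((n_hid, n_in)) * 0.1
x = rng.standard_normal((T, n_in))

s = [np.zeros(n_hid)] * d                # states for t <= 0, taken as zero
for t in range(T):
    s_prev = s[-1]                       # state at the previous time step
    s_delayed = s[-d]                    # state d steps back, via the long-delay connection
    s.append(np.tanh(W1 @ s_prev + Wd @ s_delayed + U @ x[t]))
print(len(s) - d, "states computed; last state shape:", s[-1].shape)
```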
10.6.3 Leaky Units and a Hierarchy of Different Time Scales
A related idea in order to obtain paths on which the product of derivatives is close
to 1 is to have units with linear self-connections and a weight near 1 on these
connections. The strength of that linear self-connection corresponds to a time
scale and thus we can have different hidden units which operate at different time
scales (Mozer, 1992). Depending on how close to 1 these self-connection weights
are, information can travel forward and gradients backward with a different rate
of ā€œforgettingā€ or contraction to 0, i.e., a different time scale. One can view this
idea as a smooth variant of the idea of having different delays in the connections
presented in the previous section. Such ideas were proposed in Mozer (1992);
ElHihi and Bengio (1996), before a closely related idea discussed in the next
270
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
section of gating these self-connections in order to let the network control at what
rate each unit should be contracting.
The idea of leaky units with a self-connection actually arises naturally when
considering a continuous-time recurrent neural network such as
$$ \dot{s}_i \, \tau_i = -s_i + \sigma(b_i + W s + U x) $$
where $\sigma$ is the neural non-linearity (e.g., sigmoid or tanh), $\tau_i > 0$ is a time constant and $\dot{s}_i$ indicates the temporal derivative of unit $s_i$. A related equation is
$$ \dot{s}_i \, \tau_i = -s_i + (b_i + W \sigma(s) + U x) $$
where the state vector $s$ (with elements $s_i$) now represents the pre-activation of the hidden units.
When discretizing such equations in time (which changes the meaning of $\tau$), one gets
$$
\begin{aligned}
s_{t+1,i} - s_{t,i} &= -\frac{s_{t,i}}{\tau_i} + \frac{1}{\tau_i}\,\sigma(b_i + W s_t + U x_t) \\
s_{t+1,i} &= \left(1 - \frac{1}{\tau_i}\right) s_{t,i} + \frac{1}{\tau_i}\,\sigma(b_i + W s_t + U x_t).
\end{aligned} \tag{10.7}
$$
We see that the new value of the state is a convex linear combination of the old
value and of the value computed based on the current inputs and recurrent weights, if $1 \le \tau_i < \infty$. When $\tau_i = 1$, there is no linear self-recurrence, only the non-linear update which we find in ordinary recurrent networks. When $\tau_i > 1$, this linear recurrence allows gradients to propagate more easily. When $\tau_i$ is large, the state changes very slowly, integrating the past values associated with the input sequence.
By associating different time scales $\tau_i$ with different units, one obtains different paths corresponding to different forgetting rates. Those time constants can be fixed manually (e.g., by sampling from a distribution of time scales) or can be learned as free parameters, and having such leaky units at different time scales appears to help with long-term dependencies (Mozer, 1992; Pascanu et al., 2013a). Note that the time constant $\tau$ thus corresponds to a self-weight of $(1 - \frac{1}{\tau})$, but without any non-linearity involved in the self-recurrence.
Consider the extreme case where $\tau \to \infty$: because the leaky unit just averages contributions from the past, the contribution of each time step is equivalent and there is no associated vanishing or exploding effect. An alternative is to avoid the weight of $\frac{1}{\tau_i}$ in front of $\sigma(b_i + W s_t + U x_t)$, thus making the state sum all the past values when $\tau_i$ is large, instead of averaging them.
10.6.4 The Long-Short-Term-Memory Architecture and Other Gated RNNs
Whereas in the previous section we considered creating paths where derivatives
neither vanish nor explode too quickly by introducing self-loops, leaky units have
self-weights that are not context-dependent: they are fixed, or learned, but remain
constant during a whole test sequence.
Figure 10.15: Block diagram of the LSTM recurrent network ā€œcellā€. Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit, and its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid non-linearity, while the input unit can have any squashing non-linearity. The state unit can also be used as extra input to the gating units.
It is worthwhile considering the role played by leaky units: they allow the network to accumulate information (e.g., evidence for a particular feature or category) over a long duration. However, once that information gets used, it might be useful for the neural network to forget the old state. For example, if a sequence is made of subsequences and we want a leaky unit to accumulate evidence inside each subsequence, we need a mechanism to forget the old state by setting it to zero and starting the count afresh. Instead of manually deciding when to clear the
state, we want the neural network to learn to decide when to do it.
LSTM
This clever idea of conditioning the forgetting on the context is a core contribution
of the Long-Short-Term-Memory (LSTM) algorithm (Hochreiter and Schmidhu-
ber, 1997), described below. Several variants of the LSTM are found in the
literature (Hochreiter and Schmidhuber, 1997; Graves, 2012; Graves et al., 2013;
Graves, 2013; Sutskever et al., 2014a) but the principle is always to have a linear
self-loop through which gradients can flow for long durations. By making the
weight of this self-loop gated (controlled by another hidden unit), the time scale
of integration can be changed dynamically (even for fixed parameters, but based
on the input sequence). The LSTM has been found extremely successful in a num-
ber of applications, such as unconstrained handwriting recognition (Graves et al.,
2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), hand-
writing generation (Graves, 2013), machine translation (Sutskever et al., 2014a),
image to text conversion (captioning) (Kiros et al., 2014b; Vinyals et al., 2014b;
Xu et al., 2015b) and parsing (Vinyals et al., 2014a).
The LSTM block diagram is illustrated in Figure 10.15. The corresponding forward (state update) equations are given below, in the case of the vanilla recurrent network architecture. Deeper architectures have been successfully used in Graves et al. (2013); Pascanu et al. (2014a). Instead of a unit that simply applies a squashing function to the affine transformation of inputs and recurrent units, LSTM networks have ā€œLSTM cellsā€. Each cell has the same inputs and outputs as a vanilla recurrent network, but has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit $s_t$ that has a linear self-loop similar to the leaky units described in the previous section, but where the self-loop weight (or the associated time constant) is controlled by a forget gate unit $h^f_{t,i}$ (for time step $t$ and cell $i$), which sets this weight to a value between 0 and 1 via a sigmoid unit:
$$ h^f_{t,i} = \mathrm{sigmoid}\Big(b^f_i + \sum_j U^f_{ij} x_{t,j} + \sum_j W^f_{ij} h_{t,j}\Big) \tag{10.8} $$
where $x_t$ is the current input vector and $h_t$ is the current hidden layer vector, containing the outputs of all the LSTM cells, and $b^f$, $U^f$, $W^f$ are respectively the biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, following the pattern of Eq. 10.7, but with a conditional self-loop weight $h^f_{t,i}$:
$$ s_{t+1,i} = h^f_{t,i}\, s_{t,i} + h^e_{t,i}\, \sigma\Big(b_i + \sum_j U_{ij} x_{t,j} + \sum_j W_{ij} h_{t,j}\Big), \tag{10.9} $$
where $b$, $U$ and $W$ are respectively the biases, input weights and recurrent weights into the LSTM cell, and the external input gate unit $h^e_{t,i}$ is computed similarly to the forget gate (i.e., with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:
$$ h^e_{t,i} = \mathrm{sigmoid}\Big(b^e_i + \sum_j U^e_{ij} x_{t,j} + \sum_j W^e_{ij} h_{t,j}\Big). \tag{10.10} $$
The output $h_{t+1,i}$ of the LSTM cell can also be shut off, via the output gate $h^o_{t,i}$, which also uses a sigmoid unit for gating:
$$
\begin{aligned}
h_{t+1,i} &= \tanh(s_{t+1,i})\, h^o_{t,i} \\
h^o_{t,i} &= \mathrm{sigmoid}\Big(b^o_i + \sum_j U^o_{ij} x_{t,j} + \sum_j W^o_{ij} h_{t,j}\Big)
\end{aligned} \tag{10.11}
$$
which has parameters $b^o$, $U^o$, $W^o$ for its biases, input weights and recurrent weights, respectively. Among the variants, one can choose to use the cell state $s_{t,i}$ as an extra input (with its weight) into the three gates of the $i$-th unit, as shown in Figure 10.15. This would require three additional parameters.
LSTM networks have been shown to learn long-term dependencies more easily
than vanilla recurrent architectures, first on artificial data sets designed for test-
ing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2000), then on challenging sequence
processing tasks where state-of-the-art performance was obtained (Graves, 2012;
Graves et al., 2013; Sutskever et al., 2014a).
Other Gated RNNs
Which pieces of the LSTM architecture are actually necessary? What other suc-
cessful architectures could be designed that allow the network to dynamically
control the time scale and forgetting behavior of different units?
Some answers to these questions are given with the recent work on gated
RNNs, which was successfully used in reaching the MOSES state-of-the-art for
English-to-French machine translation (Cho et al., 2014). The main difference
with the LSTM is that a single gating unit simultaneously controls the forgetting
factor and the decision to update the state unit, which is natural if we consider the
continuous-time interpretation of the self-weight of the state, as in the equation
for leaky units, Eq. 10.7. The update equations are the following:
$$ h_{t+1,i} = h^u_{t,i}\, h_{t,i} + (1 - h^u_{t,i})\, \sigma\Big(b_i + \sum_j U_{ij} x_{t,j} + \sum_j W_{ij}\, h^r_{t,j}\, h_{t,j}\Big), \tag{10.12} $$
where $h^u$ stands for the ā€œupdateā€ gate and $h^r$ for the ā€œresetā€ gate. Their value is defined as usual:
$$ h^u_{t,i} = \mathrm{sigmoid}\Big(b^u_i + \sum_j U^u_{ij} x_{t,j} + \sum_j W^u_{ij} h_{t,j}\Big) \tag{10.13} $$
and
$$ h^r_{t,i} = \mathrm{sigmoid}\Big(b^r_i + \sum_j U^r_{ij} x_{t,j} + \sum_j W^r_{ij} h_{t,j}\Big). \tag{10.14} $$
Many more variants around this theme can be designed. For example the reset
gate (or forget gate) output could be shared across a number of hidden units. Or
the product of a global gate (covering a whole group of units, e.g., a layer) and a
local gate (per unit) could be used to combine global control and local control.
In addition, as discussed in the next section, different ways of making such
RNNs ā€œdeeperā€ are possible.
10.6.5 Deep RNNs
Figure 10.16 illustrates a number of architectural variations for RNNs which intro-
duce additional depth beyond what the vanilla architecture provides. In general,
the input-to-hidden, hidden-to-hidden, and hidden-to-output mappings can be
made more powerful than what the usual single-layer affine + nonlinearity trans-
formation provides. This idea was introduced in Pascanu et al. (2014a). The more
established way of adding depth is simply to stack different ā€œlayersā€ of RNNs on
top of each other (Schmidhuber, 1992; El Hihi and Bengio, 1996; Jaeger, 2007a;
Graves, 2013) (bottom right in the Figure), possibly with a different time scale at
different levels in the hierarchy (El Hihi and Bengio, 1996; Koutnik et al., 2014).
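As an illustration of that most established option, the following sketch stacks two vanilla recurrent layers, the hidden state of the first layer serving as the input sequence of the second; the tanh nonlinearity and the parameter names are assumptions made for illustration.

    import numpy as np

    def stacked_rnn_step(x_t, h1, h2, p):
        """One time step of a two-layer stacked ("deep") RNN."""
        # The first recurrent layer reads the external input x_t.
        h1_next = np.tanh(p["b1"] + p["U1"] @ x_t + p["W1"] @ h1)
        # The second recurrent layer reads the first layer's state as its input,
        # and could operate at a slower time scale (e.g., be updated less often).
        h2_next = np.tanh(p["b2"] + p["U2"] @ h1_next + p["W2"] @ h2)
        return h1_next, h2_next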
Figure 10.16: Examples of deeper RNN architecture variants. Top left: vanilla RNN
with input sequence and an output sequence. Top middle: one step of the corresponding
unfolded flow graph. Top right: with a deep hidden-to-hidden transformation, instead
of the usual single-layer transformation (e.g., replacing the affine + softmax layer by a
full MLP with a single hidden layer, taking both the previous state and current input
as inputs). Bottom left: same but with skip connections that allow gradients to flow
more easily backwards in time in spite of the extra non-linearity due to the intermediate
hidden layer. Bottom middle: similarly, depth can also be added in the hidden-to-output
transformation. Bottom right: a hierarchy of RNNs, which can be stacked on top of each
other in various ways.
10.6.6 Better Optimization
A central optimization difficulty with RNNs regards the learning of long-term
dependencies (Hochreiter, 1991; Bengio et al., 1993, 1994). This difficulty has
been explained in detail in Section 8.2.5. The gist of the problem is that the
composition of the non-linear recurrence with itself over many time steps
yields a highly non-linear function whose derivatives (e.g. of the state at T w.r.t.
the state at t < T, i.e. the Jacobian matrix āˆ‚s_T/āˆ‚s_t) tend to either vanish or explode
as T āˆ’ t increases, because this Jacobian is equal to the product of the state transition Jacobian
matrices āˆ‚s_{t+1}/āˆ‚s_t.
If it explodes, the parameter gradient āˆ‡_Īø L also explodes, yielding gradient-
based parameter updates that are poor. A simple solution to this problem is
discussed in the next section (Sec. 10.6.7). However, as discussed in Bengio et al.
(1994), if the state transition Jacobian matrix has eigenvalues that are larger than
1 in magnitude, then it can lead to ā€œunstableā€ dynamics, in the sense that a bit
of information cannot be stored reliably for a long time in the presence of input
ā€œnoiseā€. Indeed, the state transition Jacobian matrix eigenvalues indicate how a
small change in some direction (the corresponding eigenvector) will be expanded
(if the eigenvalue is greater than 1) or contracted (if it is less than 1).
If the eigenvalues of the state transition Jacobian are less than 1, then deriva-
tives associated with long-term effects tend to vanish as T āˆ’t increases, making
them exponentially smaller in magnitude (as components of the total gradient)
than derivatives associated with short-term effects. This therefore makes it diffi-
cult (but not impossible) to learn long-term dependencies.
An interesting idea proposed in Martens and Sutskever (2011) is that at the
same time as first derivatives are becoming smaller in directions associated with
long-term effects, so may the higher derivatives. In particular, if we use a second-
order optimization method (such as the Hessian-free method of Martens and
Sutskever (2011)), then we could differentially treat different directions: divide
the small first derivative (gradient) by a small second derivative, while not scaling
up in the directions where the second derivative is large (and hopefully, the first
derivative as well). Whereas in the scalar case, if we add a large number and a
small number, the small number is ā€œlostā€, in the vector case, if we add a large
vector to a small vector, it is still possible to recover the information about
the direction of the small vector if we have access to information (such as in the
second derivative matrix) that tells us how to rescale appropriately each direction.
One disadvantage of many second-order methods, including the Hessian-free
method, is that they tend to be geared towards ā€œbatchā€ training rather than
ā€œstochasticā€ updates (where only one or a small minibatch of examples are exam-
ined before a parameter update is made). Although the experiments on recurrent
networks applied to problems with long-term dependencies showed very encour-
aging results in Martens and Sutskever (2011), it was later shown that similar
results could be obtained by much simpler methods (Sutskever, 2012; Sutskever
et al., 2013) involving better initialization, a cheap surrogate to second-order op-
timization (a variant on the momentum technique, Section 8.4), and the clipping
trick described below.
10.6.7 Clipping Gradients
As discussed in Section 8.2.4, strongly non-linear functions such as those com-
puted by a recurrent net over many time steps tend to have derivatives that can
be either very large or very small in magnitude. This is illustrated in Figures 8.2
and 8.3, in which we see that the objective function (as a function of the param-
eters) has a ā€œlandscapeā€ in which one finds ā€œcliffsā€: wide and rather flat regions
separated by tiny regions where the objective function changes quickly, forming
a kind of cliff.
The difficulty that arises is that when the parameter gradient is very large,
a gradient descent parameter update could throw the parameters very far, into
a region where the objective function is larger, wasting a lot of the work that
had been done to reach the current solution. This is because gradient descent is
hinged on the assumption of small enough steps, and this assumption can easily
be violated when the same learning rate is used for both the flatter parts and the
steeper parts of the landscape.
Figure 10.17: Example of the effect of gradient clipping in a recurrent network with two
parameters w and b. The vertical axis is the objective function to minimize. Note the cliff
where the gradient explodes and from where gradient descent can get pushed very far.
Clipping the gradient when its norm is above a threshold (Pascanu et al., 2013a) prevents
this catastrophic outcome and helps train recurrent nets so that long-term dependencies
can be captured.
A simple type of solution has been in use by practitioners for many years:
clipping the gradient. There are different instances of this idea (Mikolov, 2012;
Pascanu et al., 2013a). One option is to clip the gradient element-wise (Mikolov,
2012). Another is to clip the norm of the gradient (Pascanu et al., 2013a). The
latter has the advantage that it guarantees that each step is still in the gradient
direction, but experiments suggest that both forms work similarly. Even simply
taking a random step when the gradient magnitude is above a threshold tends to
work almost as well.
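Both variants are easy to state in code; the sketch below shows element-wise clipping and norm clipping for a gradient stored as a NumPy array, with the threshold left as a hyperparameter to be chosen.

    import numpy as np

    def clip_elementwise(grad, threshold):
        """Clip each gradient component to [-threshold, threshold] (Mikolov, 2012)."""
        return np.clip(grad, -threshold, threshold)

    def clip_norm(grad, threshold):
        """Rescale the whole gradient when its L2 norm exceeds the threshold
        (Pascanu et al., 2013a); the direction of the step is preserved."""
        norm = np.linalg.norm(grad)
        return grad * (threshold / norm) if norm > threshold else grad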
10.6.8 Regularizing to Encourage Information Flow
Whereas clipping helps deal with exploding gradients, it does not help with
vanishing gradients. To address vanishing gradients and better capture long-term
dependencies, we discussed the idea of creating paths in the computational graph
of the unfolded recurrent architecture along which the product of gradients asso-
ciated with arcs is near 1. One approach to achieve this is with LSTM and other
self-loops and gating mechanisms, described above in Section 10.6.4. Another idea
is to regularize or constrain the parameters so as to encourage ā€œinformation flowā€.
In particular, we would like the back-propagated gradient vector āˆ‡_{s_t} L to
maintain its magnitude (even if there is only a loss at the end of the sequence),
i.e., we want (āˆ‡_{s_t} L) āˆ‚s_t/āˆ‚s_{t-1} to be as large as āˆ‡_{s_t} L.
With this objective, Pascanu et al. (2013a) propose the following regularizer:
    \Omega = \sum_t \left( \frac{\left\| (\nabla_{s_t} L)\, \frac{\partial s_t}{\partial s_{t-1}} \right\|}{\left\| \nabla_{s_t} L \right\|} - 1 \right)^2.     (10.15)
Computing the gradient of this regularizer looks difficult, but Pascanu
et al. (2013a) propose an approximation in which we consider the back-propagated
vectors āˆ‡_{s_t} L as if they were constants (for the purpose of this regularizer, i.e., no
need to back-prop through them). The experiments with this regularizer suggest
that, if combined with the norm clipping heuristic (which handles gradient explo-
sion), it can considerably increase the span of the dependencies that an RNN can
learn. Because this regularizer keeps the RNN dynamics on the edge of exploding
gradients, the gradient clipping is particularly important: otherwise gradient explosion
prevents learning from succeeding.
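As a rough illustration of how Eq. 10.15 can be evaluated under that approximation, the sketch below assumes a simple tanh recurrence s_t = tanh(W s_{t-1} + U x_t) (an assumption made only for illustration), so that the vector-Jacobian products (āˆ‡_{s_t} L) āˆ‚s_t/āˆ‚s_{t-1} have a closed form; the back-propagated gradients are passed in as precomputed constants.

    import numpy as np

    def information_flow_regularizer(grads, states, W):
        """Approximate Omega of Eq. 10.15 for a tanh RNN s_t = tanh(W s_{t-1} + U x_t).

        grads[t]  : back-propagated gradient w.r.t. s_t, treated as a constant.
        states[t] : the state s_t itself (needed for the tanh Jacobian).
        For this recurrence ds_t/ds_{t-1} = diag(1 - s_t**2) W, so the
        vector-Jacobian product equals (g * (1 - s_t**2)) @ W.
        """
        omega = 0.0
        for g, s in zip(grads, states):
            vjp = (g * (1.0 - s ** 2)) @ W          # (grad_{s_t} L) ds_t/ds_{t-1}
            omega += (np.linalg.norm(vjp) / np.linalg.norm(g) - 1.0) ** 2
        return omega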
10.6.9 Organizing the State at Multiple Time Scales
Another promising approach to handle long-term dependencies is the old idea
of organizing the state of the RNN at multiple time-scales (El Hihi and Bengio,
1996), with information flowing more easily through long distances at the slower
time scales. This is illustrated in Figure 10.18.
Figure 10.18: Example of a multi-scale recurrent net architecture (unfolded in time), with
higher levels operating at a slower time scale. Information can flow unhampered (either
forward or backward in time) over longer durations at the higher levels, thus creating
long paths (such as the red dotted path) through which long-term dependencies between
elements of the input/output sequence can be captured.
There are different ways in which a group of recurrent units can be forced to
operate at different time scales. One option is to make the recurrent units leaky
(as in Eq. 10.7), but to have different groups of units associated with different
fixed time scales. This was the proposal in Mozer (1992) and has been successfully
used in Pascanu et al. (2013a). Another option is to have explicit and discrete
updates taking place at different times, with a different frequency for different
groups of units, as in Figure 10.18. This is the approach of El Hihi and Bengio
(1996); Koutnik et al. (2014) and it also worked well on a number of benchmark
datasets.
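For the first option, a minimal sketch is given below; it assumes a leaky update of the form s_{t+1,i} = α_i s_{t,i} + (1 āˆ’ α_i) σ(b_i + Σ_j U_{ij} x_{t,j} + Σ_j W_{ij} s_{t,j}) with a fixed per-unit self-loop weight α_i = 1 āˆ’ 1/Ļ„_i (our reading of the leaky-unit pattern of Eq. 10.7); the particular time constants and sizes are hypothetical.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def leaky_rnn_step(x_t, s_t, U, W, b, tau):
        """Leaky units with fixed per-unit time constants tau (a sketch)."""
        # alpha close to 1 (large tau) means the unit changes slowly and operates
        # at a long time scale; tau = 1 recovers an ordinary recurrent unit.
        alpha = 1.0 - 1.0 / tau
        return alpha * s_t + (1.0 - alpha) * sigmoid(b + U @ x_t + W @ s_t)

    # Hypothetical grouping: half the units fast, half slow.
    n_in, n_h = 4, 6
    rng = np.random.default_rng(0)
    tau = np.array([1.0, 1.0, 1.0, 10.0, 10.0, 10.0])
    s = leaky_rnn_step(rng.standard_normal(n_in), np.zeros(n_h),
                       rng.standard_normal((n_h, n_in)),
                       rng.standard_normal((n_h, n_h)), np.zeros(n_h), tau)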
10.7 Handling Temporal Dependencies with N-Grams,
HMMs, CRFs and Other Graphical Models
This section regards probabilistic approaches to sequential data modeling which
have often been viewed as in competition with RNNs, although RNNs can be seen
as a particular form of dynamical Bayes nets (as directed graphical models with
deterministic latent variables).
10.7.1 N-grams
N-grams are non-parametric estimators of conditional probabilities based on count-
ing relative frequencies of occurrence, and they have been the core building block of
statistical language modeling for many decades (Jelinek and Mercer, 1980; Katz,
1987; Chen and Goodman, 1999). Like RNNs, they are based on the product rule
(or chain rule) decomposition of the joint probability into conditionals, Eq. 10.6,
which relies on estimates of P(x_t | x_{t-1}, . . . , x_1) to compute P(x_1, . . . , x_T). What is
particular to n-grams is that
1. they estimate these conditional probabilities based only on the last n āˆ’ 1
values (to predict the next one)
2. they assume that the data is symbolic, i.e., x_t is a symbol taken from a
finite alphabet V (for vocabulary), and
3. the conditional probability estimates are obtained from frequency counts
of all the observed length-n subsequences, hence the names unigram (for
n=1), bigram (for n=2), trigram (for n=3), and n-gram in general.
The maximum likelihood estimator for these conditional probabilities is simply
the relative frequency of occurrence of the left hand symbol in the context of the
right hand symbols, compared to all the other possible symbols in V:

    P(x_t \mid x_{t-1}, \ldots, x_{t-n+1}) = \frac{\#\{x_t, x_{t-1}, \ldots, x_{t-n+1}\}}{\sum_{x'_t \in V} \#\{x'_t, x_{t-1}, \ldots, x_{t-n+1}\}}     (10.16)
where #{a, b, c} denotes the cardinality of the set of tuples (a, b, c) in the training
set, and where the denominator is also a count (if border effects are handled
properly).
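A minimal sketch of this counting estimator, here for trigrams over a toy corpus (the corpus, the helper names and the handling of border effects are illustrative assumptions):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all length-n subsequences (n-grams) of a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def mle_conditional(context, x, counts_n, counts_ctx):
        """Maximum likelihood estimate of P(x_t = x | context), as in Eq. 10.16.

        context is the tuple (x_{t-n+1}, ..., x_{t-1}); the denominator is the
        count of the context (ignoring the border effects the text mentions).
        """
        denom = counts_ctx[context]
        return counts_n[context + (x,)] / denom if denom > 0 else 0.0

    corpus = "the cat sat on the mat the cat ran".split()   # toy corpus
    tri, bi, uni = (ngram_counts(corpus, n) for n in (3, 2, 1))
    print(mle_conditional(("the", "cat"), "sat", tri, bi))   # -> 0.5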
A fundamental limitation of the above estimator is that it is very likely to be
zero in many cases, even though the tuple (x_t, x_{t-1}, . . . , x_{t-n+1}) may show up in
the test set. In that case, the test log-likelihood would be infinitely bad (āˆ’āˆž). To
avoid that catastrophic outcome, n-grams employ some form of smoothing, i.e.,
techniques to shift probability mass from the observed tuples to unobserved ones
that are similar (a central idea behind most non-parametric statistical methods).
See Chen and Goodman (1999) for a review and empirical comparisons. One
basic technique consists in assigning a non-zero probability mass to any of the
possible next symbol values. Another very popular idea consists in backing off,
or mixing (as in mixture model), the higher-order n-gram predictor with all the
lower-order ones (with smaller n). Back-off methods look up the lower-order n-
grams if the frequency of the context x_{t-1}, . . . , x_{t-n+1} is too small, i.e., considering
the contexts x_{t-1}, . . . , x_{t-n+k}, for increasing k, until a sufficiently reliable estimate
is found.
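As an example of the mixing flavor of smoothing, the sketch below interpolates the trigram estimate with bigram and unigram estimates, reusing the counts and helpers of the previous sketch; the fixed mixture weights are arbitrary here (in practice they would be tuned, e.g., on held-out data).

    def interpolated_prob(x, context, tri, bi, uni, total, lambdas=(0.6, 0.3, 0.1)):
        """Mixture of trigram, bigram and unigram MLE estimates (simple smoothing)."""
        l3, l2, l1 = lambdas
        p3 = mle_conditional(context, x, tri, bi)        # P(x | w_{t-2}, w_{t-1})
        p2 = mle_conditional(context[-1:], x, bi, uni)   # P(x | w_{t-1})
        p1 = uni[(x,)] / total                           # P(x)
        return l3 * p3 + l2 * p2 + l1 * p1

    total = sum(uni.values())
    print(interpolated_prob("sat", ("the", "cat"), tri, bi, uni, total))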
Another interesting idea that is related to neural language models (Section 13.3)
is to break up the symbols into classes (by some form of clustering) and back off
to, or mix with, less precise models that only consider the classes of the words
in the context (i.e. aggregating statistics from a larger portion of the training
set). One can view the word classes as a very impoverished learned representa-
tion of words which helps to generalize (across words of the same class). What
distributed representations (e.g. neural word embeddings) bring is a richer notion
of similarity by which individual words keep their own identity (instead of being
indistinguishable from the other words in the same class) and yet share learned
attributes with other words with which they have some elements in common (but
not all). This richer notion of similarity makes generalization more specific
and the representation not necessarily lossy, unlike with word classes.
10.7.2 Efficient Marginalization and Inference for Temporally Struc-
tured Outputs by Dynamic Programming
Many temporal modeling approaches can be cast in the following framework,
which also includes hybrids of neural networks with HMMs and CRFs, first introduced
in Bottou et al. (1997); LeCun et al. (1998b) and later developed and applied
with great success in Graves et al. (2006); Graves (2012) with the Connectionist
Temporal Classification (CTC) approach, as well as in Do and Artières (2010) and
other more recent work (Farabet et al., 2013b; Deng et al., 2014). These ideas have
been rediscovered in a simplified form (limiting the input-output relationship to
a linear one) as CRFs (Lafferty et al., 2001), i.e., undirected graphical models
whose parameters are linear functions of input variables. In section 10.8 we
consider in more detail the neural network hybrids and the ā€œgraph transformerā€
generalizations of the ideas presented below.
All these approaches (with or without neural nets in the middle) concern the
case where we have an input sequence (discrete or continuous-valued) {x_t} and
a symbolic output sequence {y_t} (typically of the same length, although shorter
output sequences can be handled by introducing ā€œempty stringā€ values in the
output). Generalizations to non-sequential output structure have been introduced
more recently (e.g. to condition the Markov Random Fields sometimes used to
model structural dependencies in images (Stewart et al., 2007)), at the loss of
exact inference (the dynamic programming methods described below).
Optionally, one also considers a latent variable sequence {s_t} that is also discrete,
and inference needs to be done over {s_t}, either via marginalization (summing
over all possible values of the state sequence) or maximization (picking
exactly or approximately the MAP sequence, with the largest probability). If
the state variables s_t and the target variables y_t have a 1-D Markov structure to
their dependency, then computing likelihood, partition function and MAP values
can all be done efficiently by exploiting dynamic programming to factorize the
computation. On the other hand, if the state or output sequence dependencies
are captured by an RNN, then there is no finite-order Markov property and no
efficient and exact inference is generally possible. However, many reasonable ap-
proximations have been used in the past, such as with variants of the beam search
algorithm (Lowerre, 1976).
The application of the principle of dynamic programming in these setups is
the same as what is used in the Forward-Backward algorithm for graphical models
and HMMs (next Section) and the Viterbi algorithm for inference. We outline
the main lines of this type of efficient computation here.
Figure 10.19: Example of a temporally structured output graph, as can be found in CRFs,
HMMs, and neural net hybrids. Each node corresponds to a particular value of an output
random variable at a particular point in the output sequence (contrast with a graphical
model representation, where each node corresponds to a random variable). A path from
the source node to the sink node (e.g. red bold arrows) corresponds to an interpretation of
the input as a sequence of output labels. The dynamic programming recursions that are
used for computing likelihood (or conditional likelihood) or performing MAP inference
(finding the best path) involve sums or maximizations over sub-paths ending at one of
the particular interior nodes.
Let G be a directed acyclic graph whose paths correspond to the sequences
that can be selected (for MAP) or summed over (marginalized, for computing a
likelihood), as illustrated in Figure 10.19. Let z_t represent the choice variable
(e.g., s_t and y_t in the above example), and let each arc with score a correspond
to a particular value of z_t in its Markov context. In the language of undirected
graphical models, if a is the score associated with an arc from the node for
z_{t-1} = j to the one for z_t = i, then a is minus the energy of a term of the
energy function associated with the event 1_{z_{t-1}=j, z_t=i} and the associated
information from the input x (e.g. some value of x_t).
For example, with a Markov chain of order 1, each z_t = i can be linked via an arc
to a z_{t-1} = j. The order-1 Markov property means that P(z_t | z_{t-1}, z_{t-2}, . . . , z_1, x) =
P(z_t | z_{t-1}, x), where x is the conditioning information (the input sequence), so the
number of nodes would be equal to the length of the sequence, T, times the
number of values of z_t, N, and the number of arcs of the graph would be up to
T N^2 (if every value of z_t can follow every value of z_{t-1}, although in practice the
connectivity is often much smaller). A score a is computed for each arc (which
may include some component that only depends on the source or only on the
destination node), as a function of the conditioning information x. The inference
or marginalization problems involve performing the following computations.
For the marginalization task, we want to compute the sum over all complete
paths (e.g. from source to sink) of the product along the path of the exponentiated
scores associated with the arcs on that path:
    S(G) = \sum_{\text{path} \in G} \; \prod_{a \in \text{path}} e^{a}     (10.17)
where the product is over all the arcs on a path (with score a), and the sum is
over all the paths associated with complete sequences (from beginning to end of
a sequence). S(G) may correspond to a likelihood, numerator or denominator of
a probability. For example,
    P(\{z_t\} \in Y \mid x) = \frac{S(G_Y)}{S(G)}     (10.18)

where G_Y is the subgraph of G which is restricted to sequences that are compatible
with some target answer Y.
For the inference task, we want to compute

    \pi(G) = \arg\max_{\text{path} \in G} \prod_{a \in \text{path}} e^{a} = \arg\max_{\text{path} \in G} \sum_{a \in \text{path}} a

    V(G) = \max_{\text{path} \in G} \sum_{a \in \text{path}} a

where π(G) is the most probable path and V(G) is its log-score or value, and
again the set of paths considered includes all of those starting at the beginning
and ending at the end of the sequence.
The principle of dynamic programming is to recursively compute intermediate
quantities that can be reused efficiently so as to avoid actually going through an
exponential number of computations, e.g., through the exponential number of
paths to consider in the above sums or maxima. Note how it is already at play
in the underlying efficiency of back-propagation (or back-propagation through
time), where gradients w.r.t. intermediate layers or time steps or nodes in a flow
graph can be computed based on previously computed gradients (for later layers,
time steps or nodes). Here it can be achieved by considering restrictions of the
graph to those paths that end at a node n, which we denote G^n. As before, G^n_Y
indicates the additional restriction to subsequences that are compatible with the
target sequence Y, i.e., with the beginning of the sequence Y.
Figure 10.20: Illustration of the recursive computation taking place for inference or
marginalization by dynamic programming. See Figure 10.19. These recursions involve
sums or maximizations over sub-paths ending at one of the particular interior nodes (red
in the figure), each time only requiring to look up previously computed values at the
predecessor nodes (green).
We can thus perform marginalization efficiently as follows, as illustrated in
Figure 10.20:

    S(G) = \sum_{n \in \text{final}(G)} S(G^n)     (10.19)
where final(G) is the set of final nodes in the graph G, and we can recursively
compute the node-restricted sum via

    S(G^n) = \sum_{m \in \text{pred}(n)} S(G^m)\, e^{a_{m,n}}     (10.20)

where pred(n) is the set of predecessors of node n in the graph and a_{m,n} is the
log-score associated with the arc from m to n. It is easy to see that expanding
the above recursion recovers the result of Eq. 10.17.
Similarly, we can perform efficient MAP inference (also known as Viterbi
decoding) as follows:

    V(G) = \max_{n \in \text{final}(G)} V(G^n)     (10.21)

and

    V(G^n) = \max_{m \in \text{pred}(n)} V(G^m) + a_{m,n}.     (10.22)
To obtain the corresponding path, it is enough to keep track of the argmax asso-
ciated with each of the above maximizations and trace back π(G) starting from
the nodes in final(G). For example, the last element of π(G) is

    n^* \leftarrow \arg\max_{n \in \text{final}(G)} V(G^n)

and (recursively) the argmax node before n^* along the selected path is a new n^*,

    n^* \leftarrow \arg\max_{m \in \text{pred}(n^*)} V(G^m) + a_{m,n^*},

etc. Keeping track of these n^* along the way gives the selected path. Proving
that these recursive computations yield the desired results is straightforward and
left as an exercise.
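To make the recursions of Eqs. 10.19-10.22 concrete, here is a minimal sketch of both the sum (marginalization) and max (Viterbi) recursions over a DAG represented as a predecessor map; the graph representation, the explicit source node and the toy scores are assumptions made for illustration.

    import math

    def dp_sum_and_viterbi(pred, log_score, source, finals):
        """Dynamic programming over a DAG.

        pred[n]        : predecessors of node n (nodes are assumed to be listed in
                         an order where all predecessors come first).
        log_score[m, n]: log-score a_{m,n} of the arc m -> n.
        Returns S(G) (Eq. 10.19), V(G) (Eq. 10.21) and the best path.
        """
        S = {source: 1.0}            # S for the empty path at the source
        V = {source: 0.0}
        back = {}                    # argmax predecessor, for the traceback
        for n in pred:               # assumed topological order
            if n == source:
                continue
            S[n] = sum(S[m] * math.exp(log_score[m, n]) for m in pred[n])   # Eq. 10.20
            best_m = max(pred[n], key=lambda m: V[m] + log_score[m, n])     # Eq. 10.22
            V[n] = V[best_m] + log_score[best_m, n]
            back[n] = best_m
        total = sum(S[n] for n in finals)                                    # Eq. 10.19
        n_star = max(finals, key=lambda n: V[n])                             # Eq. 10.21
        path = [n_star]
        while path[-1] != source:
            path.append(back[path[-1]])                                      # traceback
        return total, V[n_star], path[::-1]

    # Tiny chain-structured example (hypothetical scores).
    pred = {"src": [], ("t1", 0): ["src"], ("t1", 1): ["src"],
            ("t2", 0): [("t1", 0), ("t1", 1)], ("t2", 1): [("t1", 0), ("t1", 1)]}
    log_score = {("src", ("t1", 0)): 0.1, ("src", ("t1", 1)): 0.5,
                 (("t1", 0), ("t2", 0)): 0.2, (("t1", 1), ("t2", 0)): -0.3,
                 (("t1", 0), ("t2", 1)): 0.0, (("t1", 1), ("t2", 1)): 0.7}
    print(dp_sum_and_viterbi(pred, log_score, "src", [("t2", 0), ("t2", 1)]))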
10.7.3 HMMs
Hidden Markov Models (HMMs) are probabilistic models of sequences that were
introduced in the 60’s (Baum and Petrie, 1966) along with the E-M algorithm
(Section 20.2). They are very commonly used to model sequential structure, in
particular having been, from the mid 80’s until recently, the technological
core of speech recognition systems (Rabiner and Juang, 1986; Rabiner, 1989).
Just like RNNs, HMMs are dynamic Bayes nets (Koller and Friedman, 2009),
i.e., the same parameters and graphical model structure are used for every time
step. Compared to RNNs, what is particular to HMMs is that the latent variable
associated with each time step (called the state) is discrete, with a separate set of
parameters associated with each state value. We consider here the most common
form of HMM, in which the Markov chain is of order 1, i.e., the state s
t
at time
t, given the previous states, only depends on the previous state s
tāˆ’1
:
P (s
t
|s
tāˆ’1
, s
tāˆ’2
, . . . , s
1
) = P (s
t
|s
tāˆ’1
),
which we call the transition or state-to-state distribution. Generalizing to higher-
order Markov chains is straightforward: for example, order-2 Markov chains can
be mapped to order-1 Markov chains by considering as order-1 ā€œstatesā€ all the
pairs (s
t
= i, s
tāˆ’1
= j).
Given the state value, a generative probabilistic model of the visible variable x_t
is defined, which specifies how each observation x_t in a sequence (x_1, x_2, . . . , x_T)
can be generated, via a model P(x_t | s_t). Two kinds of parameters are distin-
guished: those that define the transition distribution, which can be given by a
matrix

    A_{ij} = P(s_t = i \mid s_{t-1} = j),
and those that define the output model P(x_t | s_t). For example, if the data are
discrete and x_t is a symbol, then another matrix can be used to define the
output (or emission) model:

    B_{ki} = P(x_t = k \mid s_t = i).

Another common parametrization for P(x_t | s_t = i), in the case of continuous
vector-valued x_t, is the Gaussian mixture model, where we have a different mix-
ture (with its own means, covariances and component probabilities) for each state
s_t = i. Alternatively, the means and covariances (or just variances) can be shared
across states, and only the component probabilities are state-specific.
The overall likelihood of an observed sequence is thus

    P(x_1, x_2, \ldots, x_T) = \sum_{s_1, s_2, \ldots, s_T} \prod_t P(x_t \mid s_t)\, P(s_t \mid s_{t-1}).     (10.23)
In the language established earlier in Section 10.7.2, we have a graph G with
one node n per time step t and state value i, i.e., for s_t = i, and one arc between
each node n (for 1_{s_t=i}) and its predecessors m (for 1_{s_{t-1}=j}) when the transition
probability is non-zero, i.e., P(s_t = i | s_{t-1} = j) ≠ 0. Following Eq. 10.23, the
log-score a_{m,n} for the transition between m and n would then be

    a_{m,n} = \log P(x_t \mid s_t = i) + \log P(s_t = i \mid s_{t-1} = j).
As explained in Section 10.7.2, this view gives us a dynamic programming
algorithm for computing the likelihood (or the conditional likelihood given some
constraints on the set of allowed paths), called the forward-backward or sum-
product algorithm, in time O(kNT ) where T is the sequence length, N is the
number of states and k the average in-degree of each node.
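A minimal sketch of the forward pass of that algorithm, computing the likelihood of Eq. 10.23 from the transition matrix A and emission matrix B defined above (the initial state distribution and the NumPy layout are assumptions made for illustration):

    import numpy as np

    def hmm_likelihood(obs, A, B, pi):
        """Forward algorithm: P(x_1, ..., x_T) of Eq. 10.23 in O(N^2 T).

        A[i, j] = P(s_t = i | s_{t-1} = j), B[k, i] = P(x_t = k | s_t = i),
        pi[i]   = P(s_1 = i) (initial state distribution, an extra assumption),
        obs     = list of observed symbol indices.
        """
        alpha = pi * B[obs[0], :]                 # alpha_1(i) = P(x_1, s_1 = i)
        for x in obs[1:]:
            alpha = B[x, :] * (A @ alpha)         # alpha_t(i) = P(x_1..x_t, s_t = i)
        return alpha.sum()

    # Tiny example with 2 states and 3 symbols (hypothetical numbers).
    A = np.array([[0.9, 0.2], [0.1, 0.8]])        # columns sum to 1: A[i, j] = P(i | j)
    B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.7]])
    pi = np.array([0.6, 0.4])
    print(hmm_likelihood([0, 1, 2], A, B, pi))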
Although the likelihood is tractable and could be maximized by a gradient-
based optimization method, HMMs are typically trained by the E-M algorithm
(Section 20.2), which has been shown to converge rapidly (approaching the rate of
Newton-like methods) in some conditions (if we view the HMM as a big mixture,
then the condition is for the final mixture components to be well-separated, i.e.,
have little overlap) (Xu and Jordan, 1996).
At test time, the sequence of states that maximizes the joint likelihood
    P(x_1, x_2, \ldots, x_T, s_1, s_2, \ldots, s_T)
can also be obtained using a dynamic programming algorithm (called the Viterbi
algorithm). This is a form of inference (see Section 14.5) that is called MAP
(Maximum A Posteriori) inference because we want to find the most probable
value of the unobserved state variables given the observed inputs. Using the same
definitions as above (from Section 10.7.2) for the nodes and log-score of the graph
G in which we search for the optimal path, the Viterbi algorithm corresponds to
the recursion defined by Eq. 10.22.
If the HMM is structured in such a way that states have a meaning associated
with labels of interest, then from the MAP sequence one can read off the associated
labels. When the number of states is very large (which happens for example with
large-vocabulary speech recognition based on n-gram language models), even the
efficient Viterbi algorithm becomes too expensive, and approximate search must
be performed. A common family of search algorithms for HMMs is the Beam
Search algorithm (Lowerre, 1976): basically, a set of promising candidate paths
are kept and gradually extended, at each step pruning the set of candidates to only
keep the best ones according to their cumulative score (log of the joint likelihood
of states and observations up to the current time step and state being considered).
More details about speech recognition are given in Section 13.2. An HMM
can be used to associate a sequence of labels (y_1, y_2, . . . , y_N) with the input
(x_1, x_2, . . . , x_T), where the output sequence is typically shorter than the input
sequence, i.e., N < T. Knowledge of (y_1, y_2, . . . , y_N) constrains the set of com-
patible state sequences (s_1, s_2, . . . , s_T), and the generative conditional likelihood

    P(x_1, x_2, \ldots, x_T \mid y_1, y_2, \ldots, y_N) = \sum_{s_1, s_2, \ldots, s_T \in S(y_1, y_2, \ldots, y_N)} \prod_t P(x_t \mid s_t)\, P(s_t \mid s_{t-1})     (10.24)
can be computed using the same forward-backward technique, and its logarithm
maximized during training, as discussed above.
Various discriminative alternatives to the generative likelihood of Eq. 10.24
have been proposed (Brown, 1987; Bahl et al., 1987; Nadas et al., 1988; Juang and
Katagiri, 1992; Bengio et al., 1992a; Bengio, 1993; Leprieur and Haffner, 1995;
Bengio, 1999a), the simplest of which is simply P(y_1, y_2, . . . , y_N | x_1, x_2, . . . , x_T),
which is obtained from Eq. 10.24 by Bayes rule, i.e., involving a normalization
over all sequences, i.e., the unconstrained likelihood of Eq. 10.23:
    P(y_1, y_2, \ldots, y_N \mid x_1, x_2, \ldots, x_T) = \frac{P(x_1, x_2, \ldots, x_T \mid y_1, y_2, \ldots, y_N)\, P(y_1, y_2, \ldots, y_N)}{P(x_1, x_2, \ldots, x_T)}.
Both the numerator and denominator can be formulated in the framework of
the previous section (Eqs. 10.18-10.20), where for the numerator we merge (add)
the log-scores coming from the structured output model P(y_1, y_2, . . . , y_N)
and from the input likelihood model P(x_1, x_2, . . . , x_T | y_1, y_2, . . . , y_N). Again, each
node of the graph corresponds to a state of the HMM at a particular time step
t (which may or may not emit the next output symbol y_i), associated with an
input vector x_t. Instead of making the relationship to the input the result of a
. Instead of making the relationship to the input the result of a
simple parametric form (Gaussian or multinomial, typically), the scores can be
computed by a neural network (or any other parametrized differentiable function).
This gives rise to discriminative hybrids of search or graphical models with neural
networks, discussed below, Section 10.8.
10.7.4 CRFs
Whereas HMMs are typically trained to maximize the probability of an input
sequence x given a target sequence y and correspond to a directed graphical
model, Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected
graphical models that are trained to maximize the joint probability of the target
variables, given input variables, P (y|x). CRFs are special cases of the graph
transformer model introduced in Bottou et al. (1997); LeCun et al. (1998b), where
neural nets are replaced by affine transformations and there is a single graph
involved.
Many applications of CRFs involve sequences and the discussion here will be
focused on this type of application, although applications to images (e.g. for
image segmentation) are also common. Compared to other graphical models,
another characteristic of CRFs is that there are no latent variables. The general
equation for the probability distribution modeled by a CRF is basically the same
as for fully visible (not latent variable) undirected graphical models, also known as
Markov Random Fields (MRFs, see Section 14.2.2), except that the ā€œpotentialsā€
(terms of the energy function) are parametrized functions of the input variables,
and the likelihood of interest is the posterior probability P (y|x).
As in many other MRFs, CRFs often have a particular connectivity structure
in their graph, which allows one to perform learning or inference more efficiently.
In particular, when dealing with sequences, the energy function typically only has
terms that relate neighboring elements of the sequence of target variables. For
example, the target variables could form a homogeneous^4 Markov chain of order
k (given the input variables). A typical linear CRF example with binary outputs
would have the following structure:
    P(y = y \mid x) = \frac{1}{Z} \exp\left( \sum_t \Big[ y_t \Big(b + \sum_j w_j x_{tj}\Big) + \sum_{i=1}^{k} y_t y_{t-i} \Big(u_i + \sum_j v_{ij} x_{tj}\Big) \Big] \right)     (10.25)
where Z is the normalization constant, which is the sum over all y sequences of the
numerator. In that case, the score marginalization framework of Section 10.7.2
and coming from Bottou et al. (1997); LeCun et al. (1998b) can be applied by
making terms in the above exponential correspond to scores associated with nodes
t of a graph G. If there were more than two output classes, more nodes per time
step would be required but the principle would remain the same. A more general
4
meaning that the same parameters are used for every time step
formulation for Markov chains of order d is the following:
    P(y = y \mid x) = \frac{1}{Z} \exp\left( \sum_t \sum_{d'=0}^{d} f_{d'}(y_t, y_{t-1}, \ldots, y_{t-d'}, x_t) \right)     (10.26)
where f_{d'} computes a potential of the energy function, a parametrized function
of both the past target values (up to y_{t-d'}) and of the current input value x_t. For
example, as discussed below, f_{d'} could be the output of an arbitrary parametrized
computation, such as a neural network.
Although Z looks intractable, because of the Markov property of the model
(order 1, in the example), it is again possible to exploit dynamic programming to
compute Z efficiently, as per Eqs. 10.18-10.20. Again, the idea is to compute the
sub-sum for sequences of length t ≤ T (where T is the length of a target sequence
y), ending in each of the possible state values at t, e.g., y_t = 1 and y_t = 0 in the
above example. For higher order Markov chains (say order d instead of 1) and a
larger number of state values (say N instead of 2), the required sub-sums to keep
track of are for each element in the cross-product of d āˆ’ 1 state values, i.e., N^{d-1}.
For each of these elements, the new sub-sums for sequences of length t + 1 (for
each of the N values at t + 1 and corresponding N^{max(0,d-2)} past values for the
past d āˆ’ 2 time steps) can be obtained by only considering the sub-sums for the
N^{d-1} joint state values for the last d āˆ’ 1 time steps before t + 1.
Following Eq. 10.22, the same kind of decomposition can be performed to
efficiently find the MAP configuration of y’s given x, where instead of products
(sums inside the exponential) and sums (for the outer sum of these exponentials,
over different paths) we respectively have sums (corresponding to adding the sums
inside the exponential) and maxima (across the different competing ā€œprevious-
stateā€ choices).
10.8 Combining Neural Networks and Search
The idea of combining neural networks with HMMs or related search or alignment-
based components (such as graph transformers) for speech and handwriting recog-
nition dates from the early days of research on multi-layer neural networks (Bourlard
and Wellekens, 1990; Bottou et al., 1990; Bengio, 1991; Bottou, 1991; Haffner
et al., 1991; Bengio et al., 1992a; Matan et al., 1992; Bourlard and Morgan, 1993;
Bengio et al., 1995; Bengio and Frasconi, 1996; Baldi and Brunak, 1998) – and
see more references in Bengio (1999b). See also Section 13.4 for combining recurrent and
other deep learners with generative models such as CRFs, GSNs or RBMs.
The principle of efficient marginalization and inference for temporally struc-
tured outputs by exploiting dynamic programming (Sec. 10.7.2) can be applied
not just when the log-scores of Eqs. 10.17 and 10.19 are parameters or linear
functions of the input, but also when they are learned non-linear functions of the
input, e.g., via a neural network transformation, as was first done in Bottou et al.
(1997); LeCun et al. (1998b). These papers additionally introduced the powerful
idea of learned graph transformers, illustrated in Figure 10.21. In this context,
a graph transformer is a machine that can map a directed acyclic graph G_in to
another graph G_out. Both input and output graphs have paths that represent
hypotheses about the observed data.
For example, in the above papers, and as illustrated in Figure 10.22, a segmen-
tation graph transformer takes a singleton input graph (the image x) and outputs
a graph representing segmentation hypotheses (regarding sequences of segments
that could each contain a character in the image). Such a graph transformer could
be used as one layer of a graph transformer network for handwriting recognition
or document analysis for reading amounts on checks, as illustrated respectively
in Figures 10.23 and 10.24.
For example, after the segmentation graph transformer, a recognition graph
transformer could expand each node of the segmentation graph into a subgraph
whose arcs correspond to different interpretations of the segment (which char-
acter is present in the segment?). Then, a dictionary graph transformer takes
the recognition graph and expands it further by considering only the sequences
of characters that are compatible with sequences of words in the language of in-
terest. Finally, a language-model graph transformer expands sequences of word
hypotheses so as to include multiple words in the state (context) and weigh the
arcs according to the language model next-word log-probabilities.
Each of these transformations is parametrized and takes real-valued scores
on the arcs of the input graph into real-valued scores on the arcs of the output
graph. These transformations can be parametrized and learned by gradient-based
optimization over the whole series of graph transformers.
10.8.1 Approximate Search
Unfortunately, as in the above example, when the number of nodes of the graph
becomes very large (e.g., considering all previous n words to condition the log-
probability of the next one, for n large), even dynamic programming (whose
computation scales with the number of arcs) is too slow for practical applications
such as speech recognition or machine translation. A common example is when a
recurrent neural network is used to compute the arc log-scores, e.g., as in neural
language models (Section 13.3). Since the prediction at step t depends on all
t āˆ’ 1 previous choices, the number of states (nodes of the search graph G) grows
exponentially with the length of the sequence.
In that case, one has to resort to approximate search. In the case of sequential
structures as discussed in this chapter, a common family of approximate search
algorithms is the beam search (Lowerre, 1976).
The basic idea of beam search algorithms is the following (a sketch in code is given after this list):
• Break the nodes of the graph into T groups containing only ā€œcomparable
nodesā€, e.g., the group of nodes n for which the maximum length of the
paths ending at n is exactly t.
• Process these groups of nodes sequentially, keeping only at each step t a
selected subset S_t of the nodes (the ā€œbeamā€), chosen based on the subset
S_{t-1}. Each node in S_t is associated with a score \hat{V}(G^n) that represents an
approximation (a lower bound) on the maximum total log-score of the path
ending at the node, V(G^n) (defined in Eq. 10.22, Viterbi decoding).
• S_t is obtained by following all the arcs from the nodes in S_{t-1}, and sorting
all the resulting group-t nodes n according to their estimated (lower bound)
score

    \hat{V}(G^n) = \max_{m \in S_{t-1} \cap \text{pred}(n)} \hat{V}(G^m) + a_{m,n},

while keeping track of the argmax in order to trace back the estimated best
path. Only the k nodes with the highest log-score are kept and stored in
S_t, and k is called the beam width.
• The estimated best final node can be read off from \max_{n \in S_T} \hat{V}(G^n) and
the estimated best path from the associated argmax choices made along the
way, just like in the Viterbi algorithm.
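Below is a minimal sketch of this procedure, reusing the predecessor-map graph representation of the dynamic-programming sketch of Section 10.7.2; the explicit grouping of nodes by step and the data layout are illustrative assumptions.

    def beam_search(groups, pred, log_score, k):
        """Approximate Viterbi decoding with a beam of width k.

        groups[t]      : the "comparable" nodes at step t (groups[0]: start nodes).
        pred[n]        : predecessors of node n.
        log_score[m, n]: log-score a_{m,n} of the arc m -> n.
        Returns the estimated best final node, its score and the traced-back path.
        """
        v_hat = {n: 0.0 for n in groups[0]}
        back = {}
        beam = list(groups[0])
        for t in range(1, len(groups)):
            scored = []
            for n in groups[t]:
                reachable = [m for m in pred[n] if m in beam]
                if not reachable:
                    continue   # n cannot be reached from the current beam
                best_m = max(reachable, key=lambda m: v_hat[m] + log_score[m, n])
                v_hat[n] = v_hat[best_m] + log_score[best_m, n]
                back[n] = best_m
                scored.append(n)
            # Keep only the k highest-scoring nodes: the new beam S_t.
            beam = sorted(scored, key=lambda n: v_hat[n], reverse=True)[:k]
        n_star = max(beam, key=lambda n: v_hat[n])
        path = [n_star]
        while path[-1] in back:
            path.append(back[path[-1]])
        return n_star, v_hat[n_star], path[::-1]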
One problem with beam search is that the beam often ends up lacking in
diversity, making the approximation poor. For example, imagine that we have
two ā€œtypesā€ of solutions, but that each type has exponentially many variants (as
a function of t), due, e.g., to small independent variations in ways in which the
type can be expressed at each time step t. Then, even though the two types may
have close best log-score up to time t, the beam could be dominated by the one
that wins slightly, eliminating the other type from the search, although later time
steps might reveal that the second type was actually the best one.
Figure 10.3: Left: vanilla recurrent network with hidden-to-hidden recurrence, seen as a
circuit, with weight matrices U , V , W for the three different kinds of connections (input-
to-hidden, hidden-to-output, and hidden-to-hidden, respectively). Each circle indicates a
whole vector of activations. Right: the same seen as a time-unfolded flow graph, where
each node is now associated with one particular time instance.
Figure 10.4: Left: RNN whose recurrence is only through the output. Right: compu-
tational flow graph unfolded in time. At each time step t, the input is x_t, the hidden
layer activations h_t, the output o_t, the target y_t and the loss L_t. Such an RNN is less
powerful (can express a smaller set of functions) than those in the family represented by
Figure 10.3 but may be easier to train because they can exploit ā€œteacher forcingā€, i.e.,
constraining some of the units involved in the recurrent loop (here the output units) to
take some target values during training. The reason why this architecture is less powerful
is because the only state information (which carries all the information about the past)
is the previous prediction. Unless the predicted variable (and hence the target output
variable) is very high-dimensional and rich, this will usually miss important information
from the past inputs that needs to be carried in the state for future predictions.
Figure 10.5: Time-Unfolded recurrent neural network with a single output at the end
of the sequence. Such a network can be used to summarize a sequence and produce a
fixed-size representation used as input for further processing. There might be a target
right at the end (like in the figure) or the gradient on the output o
t
can be obtained by
back-propagating from further downstream modules.
Figure 10.6: A generative recurrent neural network modeling P(x_1, . . . , x_T), able to gen-
erate sequences from this distribution. Each element x_t of the observed sequence serves
both as input (for the current time step) and as target (for the previous time step).
The output o_t encodes the parameters of a conditional distribution P(x_{t+1} | x_1, . . . , x_t) =
P(x_{t+1} | o_t) for x_{t+1}, given the past sequence x_1, . . . , x_t. The loss L_t is the negative
log-likelihood associated with the output prediction (or more generally, distribution pa-
rameters) o_t, when the actual observed target is x_{t+1}. In training mode, one measures
and minimizes the sum of the losses over observed sequence(s) x. In generative mode,
x_{t+1} is sampled from the conditional distribution P(x_{t+1} | x_1, . . . , x_t) = P(x_{t+1} | o_t) (dashed
arrows) and then that generated sample x_{t+1} is fed back as input for computing the next
state s_{t+1}, the next output o_{t+1}, and generating the next sample x_{t+2}, etc.
Figure 10.7: A conditional generative recurrent neural network maps a fixed-length vector
x into a distribution over sequences Y. Each element y_t of the observed output sequence
serves both as input (for the current time step) and, during training, as target (for the
previous time step). The generative semantics are the same as in the unconditional case
(Fig. 10.6). The only difference is that the state is now conditioned on the input x,
and the same parameters (weight matrix R in the figure) are used at every time step
to parametrize that dependency. Although this was not discussed in Fig. 10.6, in both
figures one should note that the length of the sequence must also be generated (unless
known in advance). This could be done by a special binary output unit that encodes the
fact that the next output is the last.
Figure 10.8: A conditional generative recurrent neural network mapping a variable-length
sequence x into a distribution over sequences y of the same length. This architecture
assumes that the y_t's are causally related to the x_t's, i.e., that we want to predict the
y_t's only using the past x_t's. Note how the prediction of y_{t+1} is based on both the past
x's and the past y's. The dashed arrows indicate that y_t can be generated by sampling
from the output distribution o_{t-1}. When y_t is clamped (known), it is used as a target
in the loss L_{t-1}, which measures the log-probability that y_t would be sampled from the
distribution o_{t-1}.
Figure 10.9: Computation of a typical bidirectional recurrent neural network, meant to
learn to map input sequences x to target sequences y, with loss L_t at each step t. The
h recurrence propagates information forward in time (towards the right) while the g
recurrence propagates information backward in time (towards the left). Thus at each
point t, the output units o_t can benefit from a relevant summary of the past in its h_t
input and from a relevant summary of the future in its g_t input.
Figure 10.21: Illustration of the stacking of graph transformers (right, c) as a generaliza-
tion of the stacking of convolutional layers (middle, b) or of regular feedforward layers
that transform fixed-size vectors (left, a). Figure reproduced with permission from the au-
thors of Bottou et al. (1997). Quoting from that paper, (c) shows that ā€œmultilayer graph
transformer networks are composed of trainable modules that operate on and produce
graphs whose arcs carry numerical informationā€.
Figure 10.22: Illustration of the input and output of a simple graph transformer that maps
a singleton graph corresponding to an input image to a graph representing segmentation
hypotheses. Reproduced with permission from the authors of Bottou et al.
(1997).
Figure 10.23: Illustration of the graph transformer network that has been used for finding
the best segmentation of a handwritten word, for handwriting recognition. Reproduced
with permission from Bottou et al. (1997).
Figure 10.24: Illustration of the graph transformer network that has been used for reading
amounts on checks, starting from the single graph containing the image of the check
to the recognized sequences of characters corresponding to the amount on the check,
with currency and other recognized marks. Note how the grammar graph transformer
composes the grammar graph (allowed sequences of characters) and the recognition graph
(with character hypotheses associated with specific input segments, on the arcs) into an
interpretation graph that only contains the recognition graph paths that are compatible
with the grammar. Reproduced with permission from Bottou et al. (1997).