Chapter 10
Sequence Modeling: Recurrent
and Recursive Nets
One of the early ideas found in machine learning and statistical models of the 1980s is that of sharing parameters¹ across different parts of a model, which allows the model to be extended and applied to examples of different forms, e.g., of different lengths. This idea can be found in hidden Markov models (HMMs) (Rabiner and Juang, 1986), where the same state-to-state transition matrix P(s_t | s_{t-1}) is used at every time step, allowing one to model variable-length sequences. It can also be found in the idea of the recurrent neural network (Rumelhart et al., 1986c) or RNN²: the same weights are used for different instances of the artificial neurons at different time steps, allowing us to apply the network to input sequences of different lengths. This idea is made more explicit in the early work on time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989), where a fully connected network is replaced by one with local connections that are shared across different temporal instances of the hidden units. Such networks are the ancestors of convolutional neural networks, covered in more detail in Chapter 9. Recurrent nets are covered below in Section 10.2. As shown in Section 10.1 below, the flow graph (a notion introduced in Section 6.3) associated with a recurrent network is structured like a chain. Recurrent neural networks have been generalized into recursive neural networks, in which the structure can be more general, and is typically viewed as a tree. Recursive neural networks are discussed in more detail in Section 10.4.

¹ See Section 7.9 for an introduction to the concept of parameter sharing.
² Unfortunately, the RNN acronym is sometimes also used to denote recursive neural networks. However, since the RNN acronym has been around for much longer, we suggest keeping this acronym for recurrent neural networks.
10.1 Unfolding Flow Graphs and Sharing Parameters
A flow graph is a way to formalize the structure of a set of computations, such as
those involved in mapping inputs and parameters to outputs and loss. Please refer to
Section 6.3 for a general introduction. In this section we explain the idea of unfolding a
recursive or recurrent computation into a flow graph that has a repetitive structure.
For example, consider the classical form of a dynamical system:
$$s_t = F_\theta(s_{t-1}) \qquad (10.1)$$
where s_t is called the state of the system. The unfolded flow graph of such a system is shown in Figure 10.1.
Figure 10.1: The classical dynamical system of Equation 10.1, illustrated as an unfolded flow graph. Each node represents the state at some time t, and the function F_θ maps the state at time t to the state at t + 1. The same parameters (the same function F_θ) are used for all time steps.
As another example, let us consider a dynamical system driven by an external signal x_t,
$$s_t = F_\theta(s_{t-1}, x_t) \qquad (10.2)$$
illustrated in Figure 10.2, where we see that the state now contains information about the whole past sequence, i.e., the above equation implicitly defines a function
$$s_t = G_t(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1) \qquad (10.3)$$
which maps the whole past sequence (x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1) to the current state.
Equation 10.2 is actually part of the definition of a recurrent net. We can think of s_t as a kind of summary of the past sequence of inputs up to time t. Note that this summary is in general necessarily lossy, since it maps an arbitrary-length sequence (x_t, x_{t-1}, ..., x_2, x_1) to a fixed-length vector s_t. Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than other aspects. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to distinctly keep track of all the bits of information, only those required to predict the rest of the sentence. The most demanding situation is when we ask s_t to be rich enough to allow one to approximately recover the input sequence, as in auto-encoder frameworks (Chapter 16).
If we had to define a different function G_t for each possible sequence length (imagine a separate neural network for each input size), each with its own parameters, we would not get any generalization to sequences of lengths not seen in the training set. Furthermore, one would need many more training examples, because a separate model would have to be trained for each sequence length. By instead defining the state through a recurrent formulation, as in Eq. 10.2, the same parameters are used for any sequence length, allowing much better generalization properties.
Equation 10.2 can be drawn in two different ways. One is inspired by how a physical implementation (such as a real neural network) might look, i.e., like a circuit which operates in real time, as in the left of Figure 10.2. The other is as a flow graph, in which the computations occurring at different time steps in the circuit are unfolded as different nodes of the flow graph, as in the right of Figure 10.2. What we call unfolding is the operation that maps a circuit as in the left side of the figure to a flow graph with repeated pieces as in the right side. Note how the unfolded graph now has a size that depends on the sequence length. The black square indicates a delay of one time step on the recurrent connection, from the state at time t to the state at time t + 1.
Figure 10.2: Left: the input-processing part of a recurrent neural network, seen as a circuit. The black square indicates a delay of one time step. Right: the same network seen as an unfolded flow graph, where each node is now associated with one particular time instance.
The other important observation to make from Figure 10.2 is that the same parame-
ters (θ) are shared over different parts of the graph, corresponding here to different time
steps.
10.2 Recurrent Neural Networks
Armed with the ideas introduced in the previous section, we can design a wide variety of recurrent circuits, which are compact and simple to understand visually, and automatically obtain their equivalent unfolded graphs, which are useful computationally and also help focus on the flow of information forward in time (computing outputs and losses) and backward in time (computing gradients).

Some of the early circuit designs for recurrent neural networks are illustrated in Figures 10.3, 10.5 and 10.4. Figure 10.3 shows the vanilla recurrent network whose equations are laid out below in Eq. 10.4, and which has been shown to be a universal approximation machine for sequences, i.e., able to implement a Turing machine (Siegelmann and Sontag, 1991; Siegelmann, 1995; Hyotyniemi, 1996). On the other hand, the network with output recurrence shown in Figure 10.4 has a more limited memory or state, namely its output, i.e., the prediction of the previous target. This potentially limits its expressive power, but also makes it easier to train. Indeed, the "intermediate state" of the corresponding unfolded deep network is not hidden anymore: targets are available to guide this intermediate representation, which should make it easier to train. Teacher forcing is the training procedure in which the fed-back inputs are not the predicted outputs but the targets themselves, as shown in figure TODO. The disadvantage of strict teacher forcing is that if the network is later used in an open-loop mode, i.e., with the network outputs (or samples from the output distribution) fed back as inputs, then the kind of inputs that the network sees during training could be quite different from the kind of inputs that it sees at test time, potentially yielding very poor generalization. One way to mitigate this problem is to train with both teacher-forced inputs and free-running inputs, e.g., predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. In this way, the network can learn to take into account input conditions (such as those it generates itself in free-running mode) not seen during training, and learn how to map the state back towards one that will make the network generate proper outputs after a few steps.
Figure 10.3: Left: vanilla recurrent network with hidden-to-hidden recurrence, seen as a circuit, with weight matrices U, V, W for the three different kinds of connections (input-to-hidden, hidden-to-output, and hidden-to-hidden, respectively). Each circle indicates a whole vector of activations. Right: the same network seen as a time-unfolded flow graph, where each node is now associated with one particular time instance.
The vanilla recurrent network of Figure 10.3 corresponds to the following forward propagation equations, if we assume that hyperbolic tangent non-linearities are used in the hidden units and softmax is used in the output (for classification problems):
$$a_t = b + W s_{t-1} + U x_t$$
$$s_t = \tanh(a_t)$$
$$o_t = c + V s_t$$
$$p_t = \mathrm{softmax}(o_t) \qquad (10.4)$$
where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output, and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given input/target sequence pair (x, y) would then be just the sum of the losses over all the time steps, e.g.,
$$L(x, y) = \sum_t L_t = -\sum_t \log p_{y_t} \qquad (10.5)$$
where y_t is the category that should be associated with time step t in the output sequence.
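As a concrete illustration, here is a minimal NumPy sketch of the forward propagation in Eq. 10.4 and the loss in Eq. 10.5; the function and variable names, shapes, and the use of integer class targets are assumptions of this sketch.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(x_seq, y_seq, U, V, W, b, c, s0):
    """Forward pass of the vanilla RNN (Eq. 10.4) and total loss (Eq. 10.5).
    x_seq: (T, n_in) inputs; y_seq: length-T integer class targets."""
    s, loss = s0, 0.0
    states, probs = [], []
    for x_t, y_t in zip(x_seq, y_seq):
        a_t = b + W @ s + U @ x_t        # a_t = b + W s_{t-1} + U x_t
        s = np.tanh(a_t)                 # s_t = tanh(a_t)
        o_t = c + V @ s                  # o_t = c + V s_t
        p_t = softmax(o_t)               # p_t = softmax(o_t)
        loss += -np.log(p_t[y_t])        # L_t = -log p_{y_t}
        states.append(s)
        probs.append(p_t)
    return loss, states, probs
```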
Figure 10.4: Left: RNN whose recurrence is only through the output. Right: the computational flow graph unfolded in time. At each time step t, the input is x_t, the hidden layer activations are h_t, the output is o_t, the target is y_t and the loss is L_t. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by Figure 10.3, but may be easier to train because it can exploit "teacher forcing", i.e., constraining some of the units involved in the recurrent loop (here the output units) to take on target values during training. The reason why this architecture is less powerful is that the only state information (which carries all the information about the past) is the previous prediction. Unless the predicted variable (and hence the target output variable) is very high-dimensional and rich, this will usually miss important information from the past inputs that needs to be carried in the state for future predictions.
10.2.1 Computing the Gradient in a Recurrent Neural Network
Using the generalized back-propagation algorithm (for arbitrary flow graphs) introduced in Section 6.3, one can thus obtain the so-called back-propagation through time (BPTT) algorithm. Once we know how to compute gradients, we can in principle apply to an RNN any of the gradient-based training techniques introduced in Section 4.3 and developed in greater depth in Chapter 8.
Let us thus work out how to compute gradients by BPTT for the RNN equations above (Eqs. 10.4 and 10.5). The nodes of our flow graph are the sequence of x_t's, s_t's, o_t's, and L_t's, along with the parameters U, V, W, b, c. For each node a we need to compute the gradient ∇_a L recursively, based on the gradient computed at the nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss,
$$\frac{\partial L}{\partial L_t} = 1,$$
and the gradient on the outputs i at time step t, for all i, t:
$$\frac{\partial L}{\partial o_{ti}} = \frac{\partial L}{\partial L_t}\,\frac{\partial L_t}{\partial o_{ti}} = p_{t,i} - \mathbf{1}_{i,y_t}$$
Figure 10.5: Time-unfolded recurrent neural network with a single output at the end of the sequence. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as in the figure), or the gradient on the output o_t can be obtained by back-propagating from further downstream modules.
and work our way backwards, starting from the end of the sequence, say T, at which point s_T only has o_T as a descendant:
$$\nabla_{s_T} L = \nabla_{o_T} L \, \frac{\partial o_T}{\partial s_T} = \nabla_{o_T} L \; V.$$
Note how the above equation is vector-wise and corresponds, scalar-wise, to $\frac{\partial L}{\partial s_{Tj}} = \sum_i \frac{\partial L}{\partial o_{Ti}} V_{ij}$. We can then iterate backwards in time to back-propagate gradients through time, from t = T − 1 down to t = 1, noting that s_t (for t < T) has as descendants both o_t and s_{t+1}:
$$\nabla_{s_t} L = \nabla_{s_{t+1}} L \, \frac{\partial s_{t+1}}{\partial s_t} + \nabla_{o_t} L \, \frac{\partial o_t}{\partial s_t} = \nabla_{s_{t+1}} L \; \mathrm{diag}(1 - s_{t+1}^2)\, W + \nabla_{o_t} L \; V$$
where diag(1 − s_{t+1}^2) indicates the diagonal matrix containing the elements 1 − s_{t+1,i}^2, i.e., the derivative of the hyperbolic tangent associated with hidden unit i at time t + 1.
Once the gradients on the internal nodes of the flow graph are obtained, we can obtain the gradients on the parameter nodes, which have descendants at all the time steps:
$$\nabla_c L = \sum_t \nabla_{o_t} L \, \frac{\partial o_t}{\partial c} = \sum_t \nabla_{o_t} L$$
$$\nabla_b L = \sum_t \nabla_{s_t} L \, \frac{\partial s_t}{\partial b} = \sum_t \nabla_{s_t} L \; \mathrm{diag}(1 - s_t^2)$$
$$\nabla_V L = \sum_t \nabla_{o_t} L \, \frac{\partial o_t}{\partial V} = \sum_t \nabla_{o_t} L \; s_t^\top$$
$$\nabla_W L = \sum_t \nabla_{s_t} L \, \frac{\partial s_t}{\partial W} = \sum_t \nabla_{s_t} L \; \mathrm{diag}(1 - s_t^2)\, s_{t-1}^\top$$
Note in the above (and elsewhere) that whereas ∇_{s_t} L refers to the full influence of s_t through all paths from s_t to L, $\frac{\partial s_t}{\partial W}$ or $\frac{\partial s_t}{\partial b}$ refers to the immediate effect of the denominator on the numerator, i.e., when we consider the denominator as a parent of the numerator and only that direct dependency is accounted for. Otherwise, we would get "double counting" of derivatives.
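The recursions above translate directly into code. Here is a hedged sketch of BPTT for the vanilla RNN, written in the column-vector convention and consuming the states s_t and probabilities p_t computed during the forward pass; the gradient with respect to U is not derived in the text but is included here because it follows the same pattern as the one for W.

```python
import numpy as np

def bptt(x_seq, y_seq, states, probs, U, V, W, s0):
    """Back-propagation through time for Eqs. 10.4-10.5."""
    T = len(x_seq)
    grad_c, grad_b = np.zeros(V.shape[0]), np.zeros(W.shape[0])
    grad_U, grad_V, grad_W = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    grad_s_next, s_next = np.zeros(W.shape[0]), np.zeros(W.shape[0])
    for t in reversed(range(T)):
        grad_o = probs[t].copy()
        grad_o[y_seq[t]] -= 1.0                  # dL/do_t = p_t - 1_{y_t}
        # s_t feeds o_t, and (for t < T) also s_{t+1}:
        grad_s = V.T @ grad_o + W.T @ ((1 - s_next ** 2) * grad_s_next)
        s_prev = states[t - 1] if t > 0 else s0
        grad_c += grad_o                                     # sum_t dL/do_t
        grad_V += np.outer(grad_o, states[t])                # dL/do_t s_t^T
        grad_b += (1 - states[t] ** 2) * grad_s              # diag(1 - s_t^2) dL/ds_t
        grad_W += np.outer((1 - states[t] ** 2) * grad_s, s_prev)
        grad_U += np.outer((1 - states[t] ** 2) * grad_s, x_seq[t])  # same pattern as W
        grad_s_next, s_next = grad_s, states[t]
    return grad_U, grad_V, grad_W, grad_b, grad_c
```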
10.2.2 Recurrent Networks as Generative Directed Acyclic Models
Up to this point, we have not clearly stated what the losses L_t associated with the outputs o_t of a recurrent net should be. This is because there are many possible ways in which RNNs can be used. Here we consider the case where the RNN models a probability distribution over a sequence of observations.
As introduced in Section 14.2.1, directed graphical models represent a probability distribution over a sequence of T random variables x = (x_1, x_2, ..., x_T) by parametrizing their joint distribution as
$$P(x) = P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1) \qquad (10.6)$$
where the conditioning set on the right-hand side of the bar is empty for t = 1, of course. Hence the negative log-likelihood of x according to such a model is
$$L = \sum_t L_t$$
where
$$L_t = -\log P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1).$$
In general directed graphical models, x_t can be predicted using only a subset of its predecessors (x_1, ..., x_{t-1}). However, for RNNs, the graphical model is generally fully connected, not removing any dependency a priori. This can be achieved efficiently through the recurrent parametrization, such as in Eq. 10.2, since s_t summarizes the whole previous sequence (Eq. 10.3).
Hence, instead of cutting statistical complexity by removing arcs in the directed graphical model for (x_1, ..., x_T), the core idea of recurrent networks is that we introduce a state variable which decouples the past from the future observations, but we make that state variable a generally deterministic function of the past, through the recurrence of Eq. 10.2. Consequently, the number of parameters required to parametrize P(x_t | x_{t-1}, x_{t-2}, ..., x_1) does not grow exponentially with t (as it would if we parametrized that probability by a straight probability table, when the data is discrete) but remains constant with t. It only grows with the dimension of the state s_t. The price to be paid for this great advantage is that optimizing the parameters may be more difficult, as discussed below in Section 10.6. The decomposition of the likelihood thus becomes:
$$P(x) = \prod_{t=1}^{T} P(x_t \mid G_t(x_{t-1}, x_{t-2}, \ldots, x_1))$$
where
$$s_t = G_t(x_t, x_{t-1}, x_{t-2}, \ldots, x_2, x_1) = F_\theta(s_{t-1}, x_t).$$
Note that if the self-recurrence function F_θ is learned, it can discard some of the information in some of the past values x_{t-k} that is not needed for predicting the future data. In fact, because the state generally has a fixed dimension smaller than the length of the sequences (times the dimension of the input), it has to discard some information. However, we leave it to the learning procedure to choose what information to keep and what information to throw away.
The above decomposition of the joint probability of a sequence of variables into
ordered conditionals precisely corresponds to what an RNN can compute: the target at
each time step t is the next element in the sequence, while the input at each time step is
the previous element in the sequence, and the output is interpreted as parametrizing the
probability distribution of the target given the state. This is illustrated in Figure 10.6.
If the RNN is actually going to be used to generate sequences, one must also incorporate in the output information that allows one to stochastically decide when to stop generating new output elements. This can be achieved in various ways. In the case where the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence. When that symbol is generated, a complete sequence has been generated. The target for that special symbol occurs exactly once per sequence, as the last target after the last output element x_T. In general, one may train a binomial output associated with that stopping variable, for example using a sigmoid output non-linearity and the cross-entropy loss, i.e., again the negative log-likelihood of the event "end of the sequence". Another kind of solution is to model the integer T itself, through any reasonable parametric distribution, and to use the number of time steps left (and possibly the number of time steps since the beginning of the sequence) as extra inputs at each time step. We would thus have decomposed P(x_1, ..., x_T) into P(T) and P(x_1, ..., x_T | T). In general, one must therefore keep in mind that in order to fully generate a sequence we must not only generate the x_t's, but also the sequence length T, either implicitly through a series of continue/stop decisions (or a special "end-of-sequence" symbol), or explicitly through modeling the distribution of T itself as an integer random variable.
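As an illustration of the first option, here is a minimal sampling loop that stops when a special end-of-sequence symbol is drawn; the vocabulary layout (index 0 for the end-of-sequence symbol, reused as a start marker), the embedding matrix, and the parameter names are all assumptions of this sketch.

```python
import numpy as np

EOS = 0  # assumed index of the special end-of-sequence symbol

def sample_sequence(U, V, W, b, c, s0, embed, rng, max_len=100):
    """Generate symbols from P(x_{t+1} | o_t) until EOS is drawn."""
    s, x, seq = s0, EOS, []
    for _ in range(max_len):
        s = np.tanh(b + W @ s + U @ embed[x])   # state update as in Eq. 10.4
        o = c + V @ s
        p = np.exp(o - o.max())
        p /= p.sum()                            # softmax over the next symbol
        x = rng.choice(len(p), p=p)             # sample x_{t+1} ~ P(. | o_t)
        if x == EOS:
            break                               # a complete sequence has been generated
        seq.append(x)
    return seq
```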
Figure 10.6: A generative recurrent neural network modeling P(x_1, ..., x_T), able to generate sequences from this distribution. Each element x_t of the observed sequence serves both as input (for the current time step) and as target (for the previous time step). The output o_t encodes the parameters of a conditional distribution P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | o_t) for x_{t+1}, given the past sequence x_1, ..., x_t. The loss L_t is the negative log-likelihood associated with the output prediction (or more generally, distribution parameters) o_t, when the actual observed target is x_{t+1}. In training mode, one measures and minimizes the sum of the losses over the observed sequence(s) x. In generative mode, x_{t+1} is sampled from the conditional distribution P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | o_t) (dashed arrows) and then that generated sample x_{t+1} is fed back as input for computing the next state s_{t+1}, the next output o_{t+1}, and generating the next sample x_{t+2}, etc.
If we take the RNN equations of the previous section (Eqs. 10.4 and 10.5), they could correspond to a generative RNN if we simply make the target y_t equal to the next input x_{t+1} (and because the outputs are the result of a softmax, it must be that the input sequence is a sequence of symbols, i.e., that x_t is a symbol or bounded integer).
Other types of data can clearly be modeled in a similar way, following the discussions
about the encoding of outputs and the probabilistic interpretation of losses as negative
log-likelihoods, in Sections 5.6 and 6.2.2.
10.2.3 RNNs to Represent Conditional Probability Distributions
In general, as discussed in Section 6.2.2 (see especially the end of that section), when we can represent a parametric probability distribution P(y | ω), we can make it conditional by making ω a function of the desired conditioning variable:
$$P(y \mid \omega = f(x)).$$
In the case of an RNN, this can be achieved in different ways, and we review here the most common and obvious choices.
Figure 10.7: A conditional generative recurrent neural network maps a fixed-length vector x into a distribution over sequences Y. Each element y_t of the observed output sequence serves both as input (for the current time step) and, during training, as target (for the previous time step). The generative semantics are the same as in the unconditional case (Fig. 10.6). The only difference is that the state is now conditioned on the input x, and the same parameters (the weight matrix R in the figure) are used at every time step to parametrize that dependency. Although this was not discussed in Fig. 10.6, in both figures one should note that the length of the sequence must also be generated (unless it is known in advance). This could be done by a special binary output unit that encodes the fact that the next output is the last.
If x is a fixed-size vector, then we can simply make it an extra input of the RNN that generates the y sequence. Some common ways of providing an extra input to an RNN are:
1. as an extra input at each time step, or
2. as the initial state s_0, or
3. both.
In general, one may need to add extra parameters (and parametrization) to map x into the "extra bias" going either only into s_0, into the other s_t (t > 0), or into both. The first (and most commonly used) approach is illustrated in Figure 10.7.
As an example, we could imagine that x is encoding the identity of a phoneme and
the identity of a speaker, and that y represents an acoustic sequence corresponding to
that phoneme, as pronounced by that speaker.
Consider the case where the input x is a sequence of the same length as the output sequence y, and the y_t's are independent of each other when the past input sequence is given, i.e., P(y_t | y_{t-1}, ..., y_1, x) = P(y_t | x_t, x_{t-1}, ..., x_1). We therefore have a causal relationship between the x_t's and the predictions of the y_t's, in addition to the independence of the y_t's given x. Under these (pretty strong) assumptions, we can return to Fig. 10.3 and interpret the t-th output o_t as the parameters of a conditional distribution for y_t, given x_t, x_{t-1}, ..., x_1.

Figure 10.8: A conditional generative recurrent neural network mapping a variable-length sequence x into a distribution over sequences y of the same length. This architecture assumes that the y_t's are causally related to the x_t's, i.e., that we want to predict the y_t's using only the past x_t's. Note how the prediction of y_{t+1} is based on both the past x's and the past y's. The dashed arrows indicate that y_t can be generated by sampling from the output distribution o_{t-1}. When y_t is clamped (known), it is used as a target in the loss L_{t-1}, which measures the log-probability that y_t would be sampled from the distribution o_{t-1}.
If we want to remove the conditional independence assumption, we can do so by making the past y_t's inputs into the state as well. That situation is illustrated in Fig. 10.8.
10.3 Bidirectional RNNs
All of the recurrent networks we have considered up to now have a "causal" structure, meaning that the state at time t only captures information from the past, x_1, ..., x_t. However, in many applications we want to output at time t a prediction regarding an output which may depend on the whole input sequence. For example, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation, and potentially may even depend on the next few words because of the linguistic dependencies between nearby words: if there are two interpretations of the current word that are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate them. This is also true of handwriting recognition and many other sequence-to-sequence learning tasks.

Figure 10.9: Computational graph of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L_t at each step t. The h recurrence propagates information forward in time (towards the right) while the g recurrence propagates information backward in time (towards the left). Thus at each point t, the output units o_t can benefit from a relevant summary of the past in their h_t input and from a relevant summary of the future in their g_t input.
Bidirectional recurrent neural networks (or bidirectional RNNs) were invented to
address that need (Schuster and Paliwal, 1997). They have been extremely success-
ful (Graves, 2012) in applications where that need arises, such as handwriting (Graves
et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmid-
huber, 2005; Graves et al., 2013) and bioinformatics (Baldi et al., 1999).
As the name suggests, the basic idea behind bidirectional RNNs is to combine a forward-going RNN and a backward-going RNN. Figure 10.9 illustrates the typical bidirectional RNN, with h_t standing for the state of the forward-going RNN and g_t standing for the state of the backward-going RNN. This allows the output units o_t to compute a representation that depends on both the past and the future, but is most sensitive to the input values around time t, without having to specify a fixed-size window around t (as one would have to do with a feedforward network, a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).
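A minimal sketch of the bidirectional computation of Figure 10.9 follows; the affine-plus-tanh form of the state updates, the separate output matrices for h and g, and all names are assumptions of this sketch.

```python
import numpy as np

def birnn_forward(x_seq, Wh, Uh, Wg, Ug, Vh, Vg, c):
    """Forward state h_t and backward state g_t over the sequence,
    combined at each step into the output o_t."""
    T, n_h = len(x_seq), Wh.shape[0]
    h = np.zeros((T, n_h))
    g = np.zeros((T, n_h))
    h_prev = np.zeros(n_h)
    for t in range(T):                       # forward in time
        h_prev = np.tanh(Wh @ h_prev + Uh @ x_seq[t])
        h[t] = h_prev
    g_next = np.zeros(n_h)
    for t in reversed(range(T)):             # backward in time
        g_next = np.tanh(Wg @ g_next + Ug @ x_seq[t])
        g[t] = g_next
    # Each output sees a summary of the past (h_t) and of the future (g_t).
    return [c + Vh @ h[t] + Vg @ g[t] for t in range(T)]
```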
This idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right. At each point (i, j) of a 2-D grid, an output o_{i,j} could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNNs are able to learn to carry that information.
Figure 10.10: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. In the figure, a variable-length sequence x_1, x_2, ... can be mapped to a fixed-size representation (the output o), with a fixed number of parameters (e.g., the weight matrices U, V, W). The figure illustrates a supervised learning case in which a target y associated with the whole sequence is provided.
10.4 Recursive Neural Networks
Recursive networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which this time looks like a tree. The typical computational graph for a recursive network is illustrated in Figure 10.10. Recursive neural networks were introduced by Pollack (1990) and their potential use for learning to reason was nicely laid out by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013), as well as in computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a sequence of length N, the depth can be drastically reduced from N to O(log N). An open question is how best to structure the tree, though. One option is to have a tree structure which does not depend on the data, e.g., a balanced binary tree. Another is to use an external method, such as a natural language parser (Socher et al., 2011a, 2013). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).
Many variants of the recursive net idea are possible. For example, in Frasconi et al.
(1997, 1998), the data is associated with a tree structure in the first place, and inputs
and/or targets with each node of the tree. The computation performed by each node
does not have to be the traditional artificial neuron computation (affine transformation
of all inputs followed by a monotone non-linearity). For example, Socher et al. (2013)
propose using tensor operations and bilinear forms, which have previously been found
useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012)
when the concepts are represented by continuous vectors (embeddings).
10.5 Auto-Regressive Networks
One of the basic ideas behind recurrent networks is that of directed graphical models with a twist: we decompose a probability distribution as a product of conditionals without explicitly cutting any arc in the graphical model, but instead reducing complexity by parametrizing the transition probability in a recursive way that requires a fixed (and not exponential) number of parameters, thanks to a form of parameter sharing (see Section 7.9 for an introduction to the concept). Instead of reducing P(x_t | x_{t-1}, ..., x_1) to something like P(x_t | x_{t-1}, ..., x_{t-k}) (i.e., keeping only the k previous values as parents), we keep the full dependency but parametrize the conditional efficiently in a way that does not grow with t, exploiting parameter sharing. When the above conditional probability distribution is in some sense stationary, i.e., the relation between the past and the next observation does not depend on t, only on the values of the past observations, then this form of parameter sharing makes a lot of sense, and for recurrent nets it allows the same model to be used for sequences of different lengths.
Auto-regressive networks are similar to recurrent networks in the sense that we also decompose a joint probability over the observed variables as a product of conditionals of the form P(x_t | x_{t-1}, ..., x_1), but we drop the form of parameter sharing that makes these conditionals all share the same parametrization. This makes sense when the variables are not elements of a translation-invariant sequence, but instead form an arbitrary tuple without any particular ordering that would correspond to a translation-invariant form of relationship between the variables at position k and the variables at position k′. In some forms of auto-regressive networks, such as NADE (Larochelle and Murray, 2011), described in Section 10.5.3 below, we can re-introduce a form of parameter sharing that is different from the one found in recurrent networks, but that brings both a statistical advantage (fewer parameters) and a computational advantage (less computation). Although we drop the sharing over time, as we see below in Section 10.5.2, using the deep learning concept of feature re-use, we can share the features that have been computed for predicting x_{t-k} with the sub-network that predicts x_t.
Figure 10.11: An auto-regressive network predicts the i-th variable from the i − 1 previous ones. Left: the corresponding graphical model (which is the same as that of a recurrent network). Right: the corresponding computational graph, in the case of the logistic auto-regressive network, where each prediction has the form of a logistic regression, i.e., with i free parameters (i − 1 weights associated with the i − 1 inputs, and an offset parameter).
10.5.1 Logistic Auto-Regressive Networks
Let us first consider the simplest auto-regressive network, without hidden units, and hence with no sharing at all. Each P(x_t | x_{t-1}, ..., x_1) is parametrized as a linear model, e.g., a logistic regression. This model was introduced by Frey (1998) and has O(T²) parameters when there are T variables to be modeled. It is illustrated in Figure 10.11, showing both the graphical model (left) and the computational graph (right).

A clear disadvantage of the logistic auto-regressive network is that one cannot easily increase its capacity in order to capture more complex data distributions. It defines a parametric family of fixed capacity, like linear regression, logistic regression, or the Gaussian distribution. In fact, if the variables are continuous, one gets a linear auto-regressive model, which is thus another way to formulate a Gaussian distribution, i.e., capturing only pairwise interactions between the observed variables.
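For binary variables, the logistic auto-regressive model can be written down in a few lines. This sketch (with its own parameter layout, which is an assumption here) just makes the factorization and the O(T²) parameter count explicit: weights[t] holds the t − 1 weights and offset of the t-th logistic regression, hence roughly T²/2 parameters in total.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x, weights, biases):
    """log P(x) for a binary vector x of length T, with one logistic regression
    per variable: P(x_t = 1 | x_{t-1}, ..., x_1) = sigmoid(b_t + w_t . x_{<t})."""
    lp = 0.0
    for t in range(len(x)):
        z = biases[t] + (weights[t] @ x[:t] if t > 0 else 0.0)
        p = sigmoid(z)
        lp += np.log(p) if x[t] == 1 else np.log(1.0 - p)
    return lp
```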
Figure 10.12: A neural auto-regressive network predicts the i-th variable x_i from the i − 1 previous ones, but is parametrized so that features (groups of hidden units denoted h_i) that are functions of x_1, ..., x_i can be re-used in predicting all of the subsequent variables x_{i+1}, x_{i+2}, ...
10.5.2 Neural Auto-Regressive Networks
Neural auto-regressive networks have the same left-to-right graphical model as logistic auto-regressive networks (Figure 10.11, left) but a different parametrization that is at once more powerful (allowing the capacity to be extended as needed to approximate any joint distribution) and can improve generalization by introducing a parameter-sharing and feature-sharing principle common to deep learning in general. The first paper on neural auto-regressive networks, by Bengio and Bengio (2000b) (see also Bengio and Bengio (2000a) for the more extensive journal version), was motivated by the objective of avoiding the curse of dimensionality arising in traditional non-parametric graphical models, which share the same structure as Figure 10.11 (left). In the non-parametric discrete distribution models, each conditional distribution is represented by a table of probabilities, with one entry and one parameter for each possible configuration of the variables involved. By using a neural network instead, two advantages are obtained:
1. The parametrization of each P(x_t | x_{t-1}, ..., x_1) by a neural network with (t − 1) × K inputs and K outputs (if the variables are discrete and take K values, encoded one-hot) allows the conditional probability to be estimated without requiring an exponential number of parameters (and examples), yet still allows high-order dependencies between the random variables to be captured.
2. Instead of having a different neural network for the prediction of each x_t, the left-to-right connectivity illustrated in Figure 10.12 allows all the neural networks to be merged into one. Equivalently, it means that the hidden layer features computed for predicting x_t can be re-used for predicting x_{t+k} (k > 0). The hidden units are thus organized in groups that have the particularity that all the units in the t-th group only depend on the input values x_1, ..., x_t. In fact, the parameters used to compute these hidden units are jointly optimized to help the prediction of all the variables x_t, x_{t+1}, x_{t+2}, ... This is an instance of the re-use principle that makes multi-task learning and transfer learning successful with neural networks and deep learning in general (see Sections 7.12 and 17.2).
Each P(x_t | x_{t-1}, ..., x_1) can represent a conditional distribution by having the outputs of the neural network predict the parameters of the conditional distribution of x_t, as discussed in Section 6.2.2. Although the original neural auto-regressive networks were initially evaluated in the context of purely discrete multivariate data (e.g., with a sigmoid output for the Bernoulli case, or a softmax output for the Multinoulli case), it is straightforward to extend such models to continuous variables or to joint distributions involving both discrete and continuous variables, as for example with RNADE, introduced below (Uria et al., 2013).
Figure 10.13: NADE (Neural Auto-regressive Density Estimator) is a neural auto-regressive network, i.e., the hidden units are organized in groups h_j so that only the inputs x_1, ..., x_i participate in computing h_i and predicting P(x_j | x_{j-1}, ..., x_1), for j > i. The particularity of NADE is the use of a particular weight-sharing pattern: the same W′_{jki} = W_{ki} is shared (same color and line pattern in the figure) for all the weights outgoing from x_i to the k-th unit of any group j ≥ i. The vector (W_{1i}, W_{2i}, ...) is denoted W_i here.
10.5.3 NADE
A very successful recent form of neural auto-regressive network was proposed by Larochelle and Murray (2011). The architecture is basically the same as that of the original neural auto-regressive network of Bengio and Bengio (2000b), except for the introduction of a weight-sharing scheme, as illustrated in Figure 10.13: the parameters of the hidden units of different groups j are shared, i.e., the weights W′_{jki} from the i-th input x_i to the k-th element of the j-th group of hidden units h_{jk} (j ≥ i) are shared:
$$W'_{jki} = W_{ki}$$
with (W_{1i}, W_{2i}, ...) denoted W_i in Figure 10.13.
This particular sharing pattern is motivated in Larochelle and Murray (2011) by the computations performed in the mean-field inference³ of an RBM, when only the first i inputs are given and one tries to infer the subsequent ones. This mean-field inference corresponds to running a recurrent network with shared weights, and the first step of that inference is the same as in NADE. The only difference is that with the proposed NADE, the output weights are not forced to be simply the transposed values of the input weights (they are not tied). One could imagine actually extending this procedure to not just one time step of the mean-field recurrent inference but to k steps, as in Raiko et al. (2014).
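This weight sharing is also what makes NADE's forward pass cheap: the pre-activation of one group differs from that of the next by a single added term, so all N conditionals can be computed in O(NH) operations rather than O(N²H). Below is a hedged sketch for binary inputs; the separate output weights V and output biases b, and all variable names, are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_prob(x, W, V, b, c):
    """log P(x) under a NADE-style model for a binary vector x of length T.
    W: (H, T) shared input-to-hidden weights (column W[:, i] plays the role of W_i),
    V: (T, H) output weights, b: (T,) output biases, c: (H,) hidden biases."""
    a = c.copy()                 # running pre-activation, shared by all groups
    lp = 0.0
    for i in range(len(x)):
        h_i = sigmoid(a)                      # hidden group, a function of x_1..x_{i-1}
        p_i = sigmoid(b[i] + V[i] @ h_i)      # P(x_i = 1 | x_{i-1}, ..., x_1)
        lp += np.log(p_i) if x[i] == 1 else np.log(1.0 - p_i)
        a += W[:, i] * x[i]                   # add W_i x_i once; reused by all later groups
    return lp
```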
Although the neural auto-regressive networks and NADE were originally proposed to deal with discrete distributions, they can in principle be generalized to continuous ones by replacing the discrete conditional distributions (for P(x_j | x_{j-1}, ..., x_1)) with continuous ones, following the general practice of predicting continuous random variables with neural networks (see Section 6.2.2) using the log-likelihood framework. A fairly generic way of parametrizing a continuous density is as a Gaussian mixture, and this avenue has been successfully evaluated for the neural auto-regressive architecture with RNADE (Uria et al., 2013). One interesting point to note is that stochastic gradient descent can be numerically ill-behaved due to the interactions between the conditional means and the conditional variances (especially when the variances become small). Uria et al. (2013) used a heuristic that rescales the gradient on the component means by the associated standard deviation, which seems to have helped in optimizing RNADE.
Another very interesting extension of the neural auto-regressive architectures gets rid of the need to choose an arbitrary order for the observed variables (Murray and Larochelle, 2014). The idea is to train the network to be able to cope with any order by randomly sampling orders and providing to the hidden units the information specifying which of the inputs are observed (on the right-hand, conditioning side of the bar) and which are to be predicted and are thus considered missing (on the left-hand side of the conditioning bar). This is nice because it allows a trained auto-regressive network to perform any inference (i.e., to predict or sample from the probability distribution over any subset of variables given any other subset) extremely efficiently. Finally, since many orders are possible, the joint probability of a set of variables can be computed in many ways (N! for N variables), and this can be exploited to obtain a more robust probability estimate and a better log-likelihood, by simply averaging the log-probabilities predicted by different randomly chosen orders. In the same paper, the authors propose to consider deep versions of the architecture, but unfortunately that immediately makes computation as expensive as in the original neural auto-regressive network (Bengio and Bengio, 2000b). The first layer and the output layer can still be computed in O(NH) multiply-add operations, as in the regular NADE, where H is the number of hidden units (the size of the groups h_i in Figures 10.13 and 10.12), whereas it is O(N²H) in Bengio and Bengio (2000b). However, for the other hidden layers, the computation is O(N²H²) if every "previous" group at layer k participates in predicting the "next" group at layer k + 1, assuming N groups of H hidden units at each layer. Making the i-th group at layer k + 1 depend only on the i-th group at layer k, as in Murray and Larochelle (2014), reduces the cost to O(NH²), which is still H times worse than the regular NADE.

³ Here, unlike in Section 14.5, the inference is over some of the input variables that are missing, given the observed ones.
10.6 Facing the Challenge of Long-Term Dependencies
The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in Section 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. See Hochreiter (1991); Doya (1993); Bengio et al. (1994); Pascanu et al. (2013a) for a deeper treatment.

In this section we discuss various approaches that have been proposed to alleviate this difficulty with learning long-term dependencies.
10.6.1 Echo State Networks: Choosing Weights to Make Dynamics
Barely Contractive
The recurrent weights and input weights of a recurrent network are those that define the state representation captured by the model, i.e., how the state s_t (the hidden units vector) at time t (Eq. 10.2) captures and summarizes information from the previous inputs x_1, x_2, ..., x_t. Since learning the recurrent and input weights is difficult, one option that has been proposed (Jaeger and Haas, 2004; Jaeger, 2007b; Maass et al., 2002) is to set those weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and to learn only the output weights. This is the idea that was independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009), to denote the fact that the hidden units form a reservoir of temporal features which may capture different aspects of the history of inputs.
One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary-length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state s_t), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion is therefore convex in the parameters (which are just the output weights), and can actually be solved online in the linear regression case, using online updates for linear regression (Jaeger, 2003).
The important question is therefore: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to make the dynamical system associated with the recurrent net nearly on the edge of chaos (TODO: make that more precise). As alluded to in Section 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians $J_t = \frac{\partial s_t}{\partial s_{t-1}}$, and in particular the spectral radius of J_t, i.e., its largest eigenvalue. If it is greater than 1, the dynamics can diverge, meaning that small differences in the state value at t can yield a very large difference at a later time T in the sequence. To see this, consider the simpler case where the Jacobian matrix J does not change with t. If a change Δs in the state at t is aligned with an eigenvector v of J with eigenvalue λ > 1, then the small change Δs becomes λΔs after one time step, and λ^N Δs after N time steps. If λ > 1 this makes the change exponentially large. With a non-linear map, the Jacobian keeps changing and the dynamics are more complicated, but what remains true is that a small initial variation can turn into a large variation after a number of steps. In general, the recurrent dynamics are bounded (for example, if the hidden units use a bounded non-linearity such as the hyperbolic tangent), so that the change after N steps must also be bounded. Instead, when the largest eigenvalue λ < 1, we say that the map from t to t + 1 is contractive: a small change gets contracted, becoming smaller after each time step. This necessarily makes the network forget information about the long-term past, but it also makes its dynamics stable and easier to use.
The approach that has been proposed to set the weights of reservoir computing machines is to make the Jacobians slightly contractive. This is achieved by making the spectral radius of the weight matrix large but slightly less than 1. However, in practice, good results are often obtained with a spectral radius of the recurrent weight matrix slightly larger than 1, e.g., 1.2 (Sutskever, 2012; Sutskever et al., 2013). Keep in mind that with hyperbolic tangent units, the maximum derivative is 1, so that in order to guarantee a Jacobian spectral radius less than 1, the weight matrix should have spectral radius less than 1 as well. However, most derivatives of the hyperbolic tangent will be less than 1, which may explain Sutskever's empirical observation.

More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (e.g., trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In addition to setting the spectral radius to 1.2, Sutskever sets the recurrent weight matrix to be initially sparse, with only 15 non-zero input weights per hidden unit.
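A hedged sketch of this kind of initialization: draw a sparse recurrent weight matrix with a fixed number of non-zero weights per hidden unit (15 in the description above; whether they are arranged per row or per column is an assumption here) and rescale it to a target spectral radius such as 1.2.

```python
import numpy as np

def init_recurrent_weights(n_hidden, n_nonzero=15, spectral_radius=1.2, seed=0):
    """Sparse recurrent weight matrix rescaled to a chosen spectral radius."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_hidden, n_hidden))
    for i in range(n_hidden):                 # a few non-zero weights per hidden unit
        cols = rng.choice(n_hidden, size=n_nonzero, replace=False)
        W[i, cols] = rng.normal(size=n_nonzero)
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / radius)     # rescale the largest |eigenvalue|
```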
Note that when some eigenvalues of the Jacobian are exactly 1, information can be
kept in a stable way, and back-propagated gradients neither vanish nor explode. The
next two sections show methods to make some paths in the unfolded graph correspond
to “multiplying by 1” at each step, i.e., keeping information for a very long time.
10.6.2 Combining Short and Long Paths in the Unfolded Flow Graph
An old idea that has been proposed to deal with the difficulty of learning long-term
dependencies is to use recurrent connections with long delays. Whereas the ordinary
recurrent connections are associated with a delay of 1 (relating the state at t with the
state at t + 1), it is possible to construct recurrent networks with longer delays (Bengio,
1991), following the idea of incorporating delays in feedforward neural networks (Lang
and Hinton, 1988) in order to capture temporal structure (with Time-Delay Neural
Networks, which are the 1-D predecessors of Convolutional Neural Networks, discussed
in Chapter 9).
Figure 10.14: A recurrent neural network with delays, in which some of the connections reach back in time by more than one time step. Left: connectivity of the recurrent net, with square boxes indicating the number of time delays associated with a connection. Right: the unfolded recurrent network. In the figure there are regular recurrent connections with a delay of 1 time step (W_1) and recurrent connections with a delay of 3 time steps (W_3). The advantage of these longer-delay connections is that they allow past states to be connected to future states through shorter paths (3 times shorter, here), going through these longer-delay connections (in red).
As we have seen in Section 8.2.5, gradients will vanish exponentially with respect to the number of time steps. If we have recurrent connections with a time delay of D, then instead of the vanishing or explosion going as O(λ^T) over T time steps (where λ is the largest eigenvalue of the Jacobians $\frac{\partial s_t}{\partial s_{t-1}}$), the unfolded recurrent network now has paths through which gradients grow as O(λ^{T/D}), because the number of effective steps is T/D. This allows the learning algorithm to capture longer dependencies, although not all long-term dependencies may be well represented in this way. This idea was first explored in Lin et al. (1996) and is illustrated in Figure 10.14.
10.6.3 Leaky Units and a Hierarchy of Different Time Scales
A related idea in order to obtain paths on which the product of derivatives is close to 1
is to have units with linear self-connections and a weight near 1 on these connections.
The strength of that linear self-connection corresponds to a time scale and thus we
can have different hidden units which operate at different time scales (Mozer, 1992).
Depending on how close to 1 these self-connection weights are, information can travel
forward and gradients backward with a different rate of “forgetting” or contraction to
0, i.e., a different time scale. One can view this idea as a smooth variant of the idea of
having different delays in the connections presented in the previous section. Such ideas
were proposed in Mozer (1992); ElHihi and Bengio (1996), before a closely related idea
discussed in the next section of gating these self-connections in order to let the network
control at what rate each unit should be contracting.
The idea of leaky units with a self-connection actually arises naturally when considering a continuous-time recurrent neural network such as
$$\dot{s}_i \, \tau_i = -s_i + \sigma(b_i + W s + U x)$$
where σ is the neural non-linearity (e.g., sigmoid or tanh), τ_i > 0 is a time constant and $\dot{s}_i$ indicates the temporal derivative of unit s_i. A related equation is
$$\dot{s}_i \, \tau_i = -s_i + (b_i + W \sigma(s) + U x)$$
where the state vector s (with elements s_i) now represents the pre-activation of the hidden units.
When discretizing such equations in time (which changes the meaning of τ), one gets
$$s_{t+1,i} - s_{t,i} = -\frac{s_{t,i}}{\tau_i} + \frac{1}{\tau_i}\,\sigma(b_i + W s_t + U x_t)$$
$$s_{t+1,i} = \left(1 - \frac{1}{\tau_i}\right) s_{t,i} + \frac{1}{\tau_i}\,\sigma(b_i + W s_t + U x_t). \qquad (10.7)$$
We see that the new value of the state is a convex linear combination of the old value and of the value computed based on the current inputs and recurrent weights, provided 1 ≤ τ_i < ∞. When τ_i = 1, there is no linear self-recurrence, only the non-linear update found in ordinary recurrent networks. When τ_i > 1, this linear recurrence allows gradients to propagate more easily. When τ_i is large, the state changes very slowly, integrating the past values associated with the input sequence.
By associating different time scales τ_i with different units, one obtains different paths corresponding to different forgetting rates. These time constants can be fixed manually (e.g., by sampling from a distribution of time scales) or can be learned as free parameters, and having such leaky units at different time scales appears to help with long-term dependencies (Mozer, 1992; Pascanu et al., 2013a). Note that the time constant τ thus corresponds to a self-connection weight of (1 − 1/τ), but without any non-linearity involved in the self-recurrence.
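A minimal sketch of the discrete-time leaky update of Eq. 10.7, with one time constant per unit (the tanh non-linearity and the example values of τ are illustrative choices):

```python
import numpy as np

def leaky_step(s, x, W, U, b, tau):
    """One step of Eq. 10.7: a per-unit convex combination of the old state
    and the newly computed update, with time constants tau >= 1."""
    update = np.tanh(b + W @ s + U @ x)
    return (1.0 - 1.0 / tau) * s + (1.0 / tau) * update

# Units with a larger tau change slowly and remember longer;
# tau = 1 recovers the ordinary recurrent update, e.g.:
# tau = np.array([1.0, 2.0, 10.0, 100.0])
```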
Consider the extreme case where τ → ∞: because the leaky unit just averages contributions from the past, the contribution of each time step is equivalent and there is no associated vanishing or exploding effect. An alternative is to avoid the weight of 1/τ_i in front of σ(b_i + W s_t + U x_t), thus making the state sum all the past values when τ_i is large, instead of averaging them.
10.6.4 The Long-Short-Term-Memory Architecture and Other Gated
RNNs
Whereas in the previous section we considered creating paths along which derivatives neither vanish nor explode too quickly by introducing self-loops, the leaky units used there have self-weights that are not context-dependent: they are fixed, or learned, but remain constant during a whole test sequence.
Figure 10.15: Block diagram of the LSTM recurrent network "cell". Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit, and its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid non-linearity, while the input unit can have any squashing non-linearity. The state unit can also be used as an extra input to the gating units.
It is worthwhile considering the role played by leaky units: they allow the network to accumulate information (e.g., evidence for a particular feature or category) over a long duration. However, once that information has been used, it might be useful for the neural network to forget the old state. For example, if a sequence is made of subsequences and we want a leaky unit to accumulate evidence inside each subsequence, we need a mechanism to forget the old state by setting it to zero and starting afresh. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it.
LSTM
This clever idea of conditioning the forgetting on the context is a core contribution of
the Long-Short-Term-Memory (LSTM) algorithm (Hochreiter and Schmidhuber, 1997),
described below. Several variants of the LSTM are found in the literature (Hochreiter
and Schmidhuber, 1997; Graves, 2012; Graves et al., 2013; Graves, 2013; Sutskever et al.,
2014) but the principle is always to have a linear self-loop through which gradients can
flow for long durations. By making the weight of this self-loop gated (controlled by
another hidden unit), the time scale of integration can be changed dynamically (even
for fixed parameters, but based on the input sequence).
The LSTM block diagram is illustrated in Figure 10.15. The corresponding forward (state update) equations are as follows, in the case of the vanilla recurrent network architecture. Deeper architectures have been used successfully in Graves et al. (2013); Pascanu et al. (2014b). Instead of a unit that simply applies a squashing function to the affine transformation of inputs and recurrent units, LSTM networks have "LSTM cells". Each cell has the same inputs and outputs as a vanilla recurrent network, but has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_t, which has a linear self-loop similar to the leaky units described in the previous section, but where the self-loop weight (or the associated time constant) is controlled by a forget gate unit h^f_{t,i} (for time step t and cell i), which sets this weight to a value between 0 and 1 via a sigmoid unit:
$$h^f_{t,i} = \mathrm{sigmoid}\Big(b^f_i + \sum_j U^f_{ij} x_{t,j} + \sum_j W^f_{ij} h_{t,j}\Big) \qquad (10.8)$$
where x_t is the current input vector and h_t is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, following the pattern of Eq. 10.7, but with a conditional self-loop weight h^f_{t,i}:
$$s_{t+1,i} = h^f_{t,i}\, s_{t,i} + h^e_{t,i}\, \sigma\Big(b_i + \sum_j U_{ij} x_{t,j} + \sum_j W_{ij} h_{t,j}\Big) \qquad (10.9)$$
where b, U and W are respectively the biases, input weights and recurrent weights into the LSTM cell, and the external input gate unit h^e_{t,i} is computed similarly to the forget gate (i.e., with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:
$$h^e_{t,i} = \mathrm{sigmoid}\Big(b^e_i + \sum_j U^e_{ij} x_{t,j} + \sum_j W^e_{ij} h_{t,j}\Big). \qquad (10.10)$$
The output $h_{t+1,i}$ of the LSTM cell can also be shut off, via the output gate $h^o_{t,i}$, which
also uses a sigmoid unit for gating:
$$h_{t+1,i} = \tanh(s_{t+1,i}) \, h^o_{t,i}$$
$$h^o_{t,i} = \mathrm{sigmoid}\Big(b^o_i + \sum_j U^o_{ij} x_{t,j} + \sum_j W^o_{ij} h_{t,j}\Big) \qquad (10.11)$$
which has parameters $b^o$, $U^o$, $W^o$ for its biases, input weights and recurrent weights,
respectively. Among the variants, one can choose to use the cell state $s_{t,i}$ as an extra
input (with its weight) into the three gates of the $i$-th unit, as shown in Figure 10.15.
This would require three additional parameters.
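To make the gating equations concrete, here is a minimal NumPy sketch of a single LSTM forward step implementing Eqs. 10.8-10.11 in vector form. The dictionary-based parameter container and the choice of tanh as the squashing non-linearity $\sigma$ for the cell input are assumptions made for the example, not something fixed by the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_t, s_t, params):
    """One LSTM forward step following Eqs. 10.8-10.11 (vector form).

    `params` is assumed to be a dict of weight matrices / bias vectors:
      U_f, W_f, b_f : forget gate          (Eq. 10.8)
      U_e, W_e, b_e : external input gate  (Eq. 10.10)
      U,   W,   b   : cell input           (Eq. 10.9)
      U_o, W_o, b_o : output gate          (Eq. 10.11)
    """
    p = params
    # Forget gate (Eq. 10.8): controls the self-loop weight of the state.
    f = sigmoid(p["b_f"] + p["U_f"] @ x_t + p["W_f"] @ h_t)
    # External input gate (Eq. 10.10): controls how much new input is accumulated.
    e = sigmoid(p["b_e"] + p["U_e"] @ x_t + p["W_e"] @ h_t)
    # State update (Eq. 10.9): gated linear self-loop plus gated new input
    # (tanh is one possible choice of squashing non-linearity sigma).
    s_next = f * s_t + e * np.tanh(p["b"] + p["U"] @ x_t + p["W"] @ h_t)
    # Output gate (Eq. 10.11): can shut off the cell output.
    o = sigmoid(p["b_o"] + p["U_o"] @ x_t + p["W_o"] @ h_t)
    h_next = np.tanh(s_next) * o
    return h_next, s_next
```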
LSTM networks have been shown to learn long-term dependencies more easily than
vanilla recurrent architectures, first on artificial data sets designed for testing the ability
to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber,
1997; Hochreiter et al., 2000), then on challenging sequence processing tasks where
state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever
et al., 2014).
Other Gated RNNs
Which pieces of the LSTM architecture are actually necessary? What other successful
architectures could be designed that allow the network to dynamically control the time
scale and forgetting behavior of different units?
Some answers to these questions are given with the recent work on gated RNNs,
which was successfully used in reaching the MOSES state-of-the-art for English-to-
French machine translation (Cho et al., 2014). The main difference with the LSTM
is that a single gating unit simultaneously controls the forgetting factor and the decision
to update the state unit, which is natural if we consider the continuous-time interpre-
tation of the self-weight of the state, as in the equation for leaky units, Eq. 10.7. The
update equations are the following:
$$h_{t+1,i} = h^u_{t,i} h_{t,i} + (1 - h^u_{t,i}) \, \sigma\Big(b_i + \sum_j U_{ij} x_{t,j} + \sum_j W_{ij} h^r_{t,j} h_{t,j}\Big). \qquad (10.12)$$
where $h^u$ stands for the "update" gate and $h^r$ for the "reset" gate. Their values are defined as
usual:
$$h^u_{t,i} = \mathrm{sigmoid}\Big(b^u_i + \sum_j U^u_{ij} x_{t,j} + \sum_j W^u_{ij} h_{t,j}\Big) \qquad (10.13)$$
and
$$h^r_{t,i} = \mathrm{sigmoid}\Big(b^r_i + \sum_j U^r_{ij} x_{t,j} + \sum_j W^r_{ij} h_{t,j}\Big). \qquad (10.14)$$
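As a concrete illustration of Eqs. 10.12-10.14, here is a minimal NumPy sketch of one step of such a gated RNN. The parameter naming and the use of tanh as the squashing non-linearity $\sigma$ are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_rnn_step(x_t, h_t, params):
    """One step of the gated RNN of Eqs. 10.12-10.14.

    `params` is assumed to hold U, W, b (candidate state) and
    U_u, W_u, b_u / U_r, W_r, b_r (update / reset gates).
    """
    p = params
    # Update gate (Eq. 10.13): interpolates between keeping the old state
    # and writing the new candidate.
    u = sigmoid(p["b_u"] + p["U_u"] @ x_t + p["W_u"] @ h_t)
    # Reset gate (Eq. 10.14): modulates how much of the previous state
    # enters the candidate computation.
    r = sigmoid(p["b_r"] + p["U_r"] @ x_t + p["W_r"] @ h_t)
    # Candidate and interpolation (Eq. 10.12), with tanh as sigma.
    candidate = np.tanh(p["b"] + p["U"] @ x_t + p["W"] @ (r * h_t))
    return u * h_t + (1.0 - u) * candidate
```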
Many more variants around this theme can be designed. For example the reset gate
(or forget gate) output could be shared across a number of hidden units. Or the product
of a global gate (covering a whole group of units, e.g., a layer) and a local gate (per
unit) could be used to combine global control and local control.
In addition, as discussed in the next section, different ways of making such RNNs
“deeper” are possible.
10.6.5 Deep RNNs
Figure 10.16 illustrates a number of architectural variations for RNNs which introduce
additional depth beyond what the vanilla architecture provides. In general, the input-to-
hidden, hidden-to-hidden, and hidden-to-output mappings can be made more powerful
than what the usual single-layer affine + nonlinearity transformation provides. This idea
was introduced in Pascanu et al. (2014b). The more established way of adding depth
is simply to stack different “layers” of RNNs on top of each other (Schmidhuber, 1992;
El Hihi and Bengio, 1996; Jaeger, 2007a; Graves, 2013) (bottom right in the Figure),
possibly with a different time scale at different levels in the hierarchy (El Hihi and
Bengio, 1996; Koutnik et al., 2014).
Figure 10.16: Examples of deeper RNN architecture variants. Top left: vanilla RNN
with input sequence and an output sequence. Top middle: one step of the corresponding
unfolded flow graph. Top right: with a deep hidden-to-hidden transformation, instead
of the usual single-layer transformation (e.g., replacing the affine + softmax layer by a
full MLP with a single hidden layer, taking both the previous state and current input
as inputs). Bottom left: same but with skip connections that allow gradients to flow
more easily backwards in time in spite of the extra non-linearity due to the intermediate
hidden layer. Bottom middle: similarly, depth can also be added in the hidden-to-output
transformation. Bottom right: a hierarchy of RNNs, which can be stacked on top of
each other in various ways.
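As a minimal illustration of the stacked variant (bottom right of Figure 10.16), the sketch below runs one time step of a deep RNN in which each layer receives the hidden state of the layer below as its input. The parameter layout and the tanh layers are assumptions made for the example.

```python
import numpy as np

def stacked_rnn_step(x_t, hs, params):
    """One time step of a stacked ("deep") RNN.

    `hs` is the list of previous hidden states, one per layer; `params` is
    assumed to be a list of (U, W, b) triples, one per layer.
    """
    inp = x_t
    new_hs = []
    for (U, W, b), h_prev in zip(params, hs):
        h = np.tanh(b + U @ inp + W @ h_prev)  # vanilla recurrent layer
        new_hs.append(h)
        inp = h                                # feed this layer's state upward
    return new_hs
```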
10.6.6 Better Optimization
A central optimization difficulty with RNNs regards the learning of long-term dependen-
cies (Hochreiter, 1991; Bengio et al., 1993, 1994). This difficulty has been explained in
detail in Section 8.2.5. The gist of the problem is that the composition of the non-linear
recurrence with itself over many time steps yields a highly non-linear function
whose derivatives (e.g. of the state at $T$ w.r.t. the state at $t < T$, i.e. the Jacobian
matrix $\frac{\partial s_T}{\partial s_t}$) tend to either vanish or explode as $T - t$ increases, because this Jacobian is equal to
the product of the state transition Jacobian matrices $\frac{\partial s_{\tau+1}}{\partial s_\tau}$ for $t \le \tau < T$.
If it explodes, the parameter gradient $\nabla_\theta L$ also explodes, yielding gradient-based
parameter updates that are poor. A simple solution to this problem is discussed in
the next section (Sec. 10.6.7). However, as discussed in Bengio et al. (1994), if the
state transition Jacobian matrix has eigenvalues that are larger than 1 in magnitude,
then this can yield "unstable" dynamics, in the sense that a bit of information cannot
be stored reliably for a long time in the presence of input “noise”. Indeed, the state
transition Jacobian matrix eigenvalues indicate how a small change in some direction
(the corresponding eigenvector) will be expanded (if the eigenvalue is greater than 1) or
contracted (if it is less than 1).
If the eigenvalues of the state transition Jacobian are less than 1, then derivatives
associated with long-term effects tend to vanish as $T - t$ increases, making them expo-
nentially smaller in magnitude (as components of the total gradient) than derivatives
associated with short-term effects. This therefore makes it difficult (but not impossible)
to learn long-term dependencies.
An interesting idea proposed in Martens and Sutskever (2011) is that at the same
time as first derivatives are becoming smaller in directions associated with long-term
effects, so may the higher derivatives. In particular, if we use a second-order optimization
method (such as the Hessian-free method of Martens and Sutskever (2011)), then we
could differentially treat different directions: divide the small first derivative (gradient)
by a small second derivative, while not scaling up in the directions where the second
derivative is large (and hopefully, the first derivative as well). Whereas in the scalar
case, if we add a large number and a small number, the small number is “lost”, in the
vector case, if we add a large vector with a small vector, it is still possible to recover
the information about the direction of the small vector if we have access to information
(such as in the second derivative matrix) that tells us how to rescale appropriately each
direction.
One disadvantage of many second-order methods, including the Hessian-free method,
is that they tend to be geared towards “batch” training rather than “stochastic” up-
dates (where only one or a small minibatch of examples are examined before a pa-
rameter update is made). Although the experiments on recurrent networks applied to
problems with long-term dependencies showed very encouraging results in Martens and
Sutskever (2011), it was later shown that similar results could be obtained by much
simpler methods (Sutskever, 2012; Sutskever et al., 2013) involving better initialization,
a cheap surrogate to second-order optimization (a variant on the momentum technique,
Section 8.4), and the clipping trick described below.
10.6.7 Clipping Gradients
As discussed in Section 8.2.4, strongly non-linear functions such as those computed by
a recurrent net over many time steps tend to have derivatives that can be either very
large or very small in magnitude. This is illustrated in Figures 8.2 and 8.3, in which
we see that the objective function (as a function of the parameters) has a “landscape”
in which one finds “cliffs”: wide and rather flat regions separated by tiny regions where
the objective function changes quickly, forming a kind of cliff.
The difficulty that arises is that when the parameter gradient is very large, a gradient
descent parameter update could throw the parameters very far, into a region where the
objective function is larger, wasting a lot of the work that had been done to reach the
current solution. This is because gradient descent hinges on the assumption of small
enough steps, and this assumption can easily be violated when the same learning rate is
used for both the flatter parts and the steeper parts of the landscape.
Figure 10.17: Example of the effect of gradient clipping in a recurrent network with
two parameters w and b. Vertical axis is the objective function to minimize. Note the
cliff where the gradient explodes and from where gradient descent can get pushed very
far. Clipping the gradient when its norm is above a threshold (Pascanu et al., 2013a)
prevents this catastrophic outcome and helps training recurrent nets with long-term
dependencies to be captured.
A simple type of solution has been in use by practitioners for many years: clipping
the gradient. There are different instances of this idea (Mikolov, 2012; Pascanu et al.,
2013a). One option is to clip the gradient element-wise (Mikolov, 2012). Another is to
clip the norm of the gradient (Pascanu et al., 2013a). The latter has the advantage that
it guarantees that each step is still in the gradient direction, but experiments suggest
that both forms work similarly. Even simply taking a random step when the gradient
magnitude is above a threshold tends to work almost as well.
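For concreteness, here is a minimal sketch of the norm-clipping variant (Pascanu et al., 2013a); the list-of-arrays representation of the gradient is an assumption made for the example.

```python
import numpy as np

def clip_gradient_norm(grads, threshold):
    """Norm clipping: if the global gradient norm exceeds `threshold`,
    rescale every gradient so the total norm equals the threshold.
    `grads` is assumed to be a list of NumPy arrays, one per parameter."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```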
10.6.8 Regularizing to Encourage Information Flow
Whereas clipping helps dealing with exploding gradients, it does not help with vanishing
gradients. To address vanishing gradients and better capture long-term dependencies,
we discussed the idea of creating paths in the computational graph of the unfolded re-
current architecture along which the product of gradients associated with arcs is near
1. One approach to achieve this is with LSTM and other self-loops and gating mech-
anisms, described above in Section 10.6.4. Another idea is to regularize or constrain
the parameters so as to encourage “information flow”. In particular, we would like the
220
gradient vector
s
t
L being back-propagated to maintain its magnitude (even if there is
only a loss at the end of the sequence), i.e., we want
s
t
L
s
t
s
t1
to be as large as
s
t
L.
With this objective, Pascanu et al. (2013a) propose the following regularizer:
$$\Omega = \sum_t \left( \frac{\left\| \nabla_{s_t} L \, \frac{\partial s_t}{\partial s_{t-1}} \right\|}{\left\| \nabla_{s_t} L \right\|} - 1 \right)^2. \qquad (10.15)$$
It looks like computing the gradient of this regularizer is difficult, but Pascanu et al.
(2013a) propose an approximation in which we consider the back-propagated vectors
$\nabla_{s_t} L$ as if they were constants (for the purpose of this regularizer, i.e., no need to back-
prop through them). The experiments with this regularizer suggest that, if combined
with the norm clipping heuristic (which handles gradient explosion), it can considerably
increase the span of the dependencies that an RNN can learn. Because this regularizer keeps the
RNN dynamics on the edge of explosive gradients, the gradient clipping is particularly
important: otherwise gradient explosion prevents learning from succeeding.
10.6.9 Organizing the State at Multiple Time Scales
Another promising approach to handle long-term dependencies is the old idea of orga-
nizing the state of the RNN at multiple time-scales (El Hihi and Bengio, 1996), with
information flowing more easily through long distances at the slower time scales. This
is illustrated in Figure 10.18.
Figure 10.18: Example of a multi-scale recurrent net architecture (unfolded in time),
with higher levels operating at a slower time scale. Information can flow unhampered
(either forward or backward in time) over longer durations at the higher levels, thus
creating long-paths (such as the red dotted path) through which long-term dependencies
between elements of the input/output sequence can be captured.
There are different ways in which a group of recurrent units can be forced to operate
at different time scales. One option is to make the recurrent units leaky (as in Eq. 10.7),
but to have different groups of units associated with different fixed time scales. This was
the proposal in Mozer (1992) and has been successfully used in Pascanu et al. (2013a).
Another option is to have explicit and discrete updates taking place at different times,
with a different frequency for different groups of units, as in Figure 10.18. This is the
approach of El Hihi and Bengio (1996); Koutnik et al. (2014) and it also worked well on
a number of benchmark datasets.
10.7 Handling Temporal Dependencies with N-Grams, HMMs,
CRFs and Other Graphical Models
This section regards probabilistic approaches to sequential data modeling which have
often been viewed as in competition with RNNs, although RNNs can be seen as a
particular form of dynamical Bayes nets (as directed graphical models with deterministic
latent variables).
10.7.1 N-grams
N-grams are non-parametric estimators of conditional probabilities based on counting
relative frequencies of occurrence, and they have been the core building block of statis-
tical language modeling for many decades (Jelinek and Mercer, 1980; Katz, 1987; Chen
and Goodman, 1999). Like RNNs, they are based on the product rule (or chain rule)
decomposition of the joint probability into conditionals, Eq. 10.6, which relies on esti-
mates of $P(x_t \mid x_{t-1}, \ldots, x_1)$ to compute $P(x_1, \ldots, x_T)$. What is particular about n-grams is
that
1. they estimate these conditional probabilities based only on the last $n - 1$ values
(to predict the next one),

2. they assume that the data is symbolic, i.e., $x_t$ is a symbol taken from a finite
alphabet V (for vocabulary), and

3. the conditional probability estimates are obtained from frequency counts of all the
observed length-n subsequences, hence the names unigram (for n=1), bigram (for
n=2), trigram (for n=3), and n-gram in general.
The maximum likelihood estimator for these conditional probabilities is simply the
relative frequency of occurrence of the left hand symbol in the context of the right hand
symbols, compared to all the other possible symbols in V:
$$P(x_t \mid x_{t-1}, \ldots, x_{t-n+1}) = \frac{\#\{x_t, x_{t-1}, \ldots, x_{t-n+1}\}}{\sum_{x_t \in V} \#\{x_t, x_{t-1}, \ldots, x_{t-n+1}\}} \qquad (10.16)$$
where #{a, b, c} denotes the cardinality of the set of tuples (a, b, c) in the training set,
and where the denominator is also a count (if border effects are handled properly).
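As a concrete illustration of Eq. 10.16, here is a minimal sketch of a count-based maximum likelihood n-gram estimator; the corpus representation (a flat list of symbols) and the function interface are assumptions made for the example.

```python
from collections import Counter

def ngram_mle(corpus, n):
    """Maximum-likelihood n-gram estimator following Eq. 10.16.
    `corpus` is assumed to be a list of symbols from a finite vocabulary."""
    context_counts = Counter()   # counts of (x_{t-n+1}, ..., x_{t-1})
    full_counts = Counter()      # counts of (x_{t-n+1}, ..., x_t)
    for t in range(n - 1, len(corpus)):
        context = tuple(corpus[t - n + 1:t])
        full_counts[context + (corpus[t],)] += 1
        context_counts[context] += 1

    def prob(x_t, context):
        # Relative frequency of x_t in this length-(n-1) context; zero if the
        # context or tuple was never observed (the limitation discussed next).
        if context_counts[context] == 0:
            return 0.0
        return full_counts[context + (x_t,)] / context_counts[context]

    return prob

# Example usage (hypothetical data):
# p = ngram_mle(["the", "lazy", "dog", "the", "lazy", "cat"], n=3)
# p("dog", ("the", "lazy"))  -> 0.5
```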
A fundamental limitation of the above estimator is that it is very likely to be zero
in many cases, even though the tuple $(x_t, x_{t-1}, \ldots, x_{t-n+1})$ may show up in the test
set. In that case, the test log-likelihood would be infinitely bad ($-\infty$). To avoid that
catastrophic outcome, n-grams employ some form of smoothing, i.e., techniques to shift
probability mass from the observed tuples to unobserved ones that are similar (a central
idea behind most non-parametric statistical methods). See Chen and Goodman (1999)
for a review and empirical comparisons. One basic technique consists in assigning a non-
zero probability mass to any of the possible next symbol values. Another very popular
idea consists in backing off, or mixing (as in mixture model), the higher-order n-gram
predictor with all the lower-order ones (with smaller n). Back-off methods look up the
lower-order n-grams if the frequency of the context $x_{t-1}, \ldots, x_{t-n+1}$ is too small, i.e.,
considering the contexts $x_{t-1}, \ldots, x_{t-n+k}$, for increasing k, until a sufficiently reliable
estimate is found.
Another interesting idea that is related to neural language models (Section 13.3) is
to break up the symbols into classes (by some form of clustering) and back off to, or mix
with, less precise models that only consider the classes of the words in the context (i.e.
aggregating statistics from a larger portion of the training set). One can view the word
classes as a very impoverished learned representation of words which helps to generalize
(across words of the same class). What distributed representations (e.g. neural word
embeddings) bring is a richer notion of similarity by which individual words keep their
own identity (instead of being indistinguishable from the other words in the same class)
and yet share learned attributes with other words with which they have some elements
in common (but not all). This kind of richer notion of similarity makes generalization
more specific and the representation not necessarily lossy, unlike with word classes.
10.7.2 Efficient Marginalization and Inference for Temporally Struc-
tured Outputs by Dynamic Programming
Many temporal modeling approaches can be cast in the following framework, which
also includes hybrids of neural networks with HMMs, CRFs, first introduced in Bottou
et al. (1997); LeCun et al. (1998a) and later developed and applied with great success
in Graves et al. (2006); Graves (2012) with the Connectionist Temporal Classification
approach, as well as in Do and Artières (2010) and other more recent work (Farabet
et al., 2013b; Deng et al., 2014). These ideas have been rediscovered in a simplified
form (limiting the input-output relationship to a linear one) as CRFs (Lafferty et al.,
2001), i.e., undirected graphical models whose parameters are linear functions of input
variables. In section 10.8 we consider in more detail the neural network hybrids and the
“graph transformer” generalizations of the ideas presented below.
All these approaches (with or without neural nets in the middle) concern the case
where we have an input sequence (discrete or continuous-valued) $\{x_t\}$ and a symbolic
output sequence $\{y_t\}$ (typically of the same length, although shorter output sequences
can be handled by introducing “empty string” values in the output). Generalizations
to non-sequential output structure have been introduced more recently (e.g. to condi-
tion the Markov Random Fields sometimes used to model structural dependencies in
images (Stewart et al., 2007)), at the loss of exact inference (the dynamic programming
methods described below).
Optionally, one also considers a latent variable sequence $\{s_t\}$ that is also discrete,
and inference needs to be done over $\{s_t\}$, either via marginalization (summing over all
possible values of the state sequence) or maximization (picking exactly or approximately
the MAP sequence, with the largest probability). If the state variables $s_t$ and the
target variables $y_t$ have a 1-D Markov structure to their dependency, then computing
likelihood, partition function and MAP values can all be done efficiently by exploiting
dynamic programming to factorize the computation. On the other hand, if the state
or output sequence dependencies are captured by an RNN, then there is no finite-order
Markov property and no efficient and exact inference is generally possible. However,
many reasonable approximations have been used in the past, such as with variants of
the beam search algorithm (Lowerre, 1976).
The application of the principle of dynamic programming in these setups is the same
as what is used in the Forward-Backward algorithm for graphical models and HMMs
(next Section) and the Viterbi algorithm for inference. We outline the main lines of this
type of efficient computation here.
Figure 10.19: Example of a temporally structured output graph, as can be found in
CRFs, HMMs, and neural net hybrids. Each node corresponds to a particular value of
an output random variable at a particular point in the output sequence (contrast with
a graphical model representation, where each node corresponds to a random variable).
A path from the source node to the sink node (e.g. red bold arrows) corresponds to
an interpretation of the input as a sequence of output labels. The dynamic program-
ming recursions that are used for computing likelihood (or conditional likelihood) or
performing MAP inference (finding the best path) involve sums or maximizations over
sub-paths ending at one of the particular interior nodes.
Let G be a directed acyclic graph whose paths correspond to the sequences that can
be selected (for MAP) or summed over (marginalized for computing a likelihood), as
illustrated in Figure 10.19. In the above example, let $z_t$ represent the choice variable
(e.g., $s_t$ and $y_t$ in the above example), and each arc with score $a$ corresponds to a
particular value of $z_t$ in its Markov context. In the language of undirected graphical
models, if $a$ is the score associated with an arc from the node for $z_{t-1} = j$ to the one
for $z_t = i$, then $a$ is minus the energy of a term of the energy function associated with
the event $\mathbf{1}_{z_{t-1}=j,\, z_t=i}$ and the associated information from the input $x$ (e.g. some value
of $x_t$).
For example, with a Markov chain of order 1, each $z_t = i$ can be linked via an arc
to a $z_{t-1} = j$. The order 1 Markov property means that $P(z_t \mid z_{t-1}, z_{t-2}, \ldots, z_1, x) =
P(z_t \mid z_{t-1}, x)$, where $x$ is conditioning information (the input sequence), so the number
of nodes would be equal to the length of the sequence, $T$, times the number of values
of $z_t$, $N$, and the number of arcs of the graph would be up to $TN^2$ (if every value of
$z_t$ can follow every value of $z_{t-1}$, although in practice the connectivity is often much
smaller). A score a is computed for each arc (which may include some component that
only depends on the source or only on the destination node), as a function of the con-
ditioning information x. The inference or marginalization problems involve performing
the following computations.
For the marginalization task, we want to compute the sum over all complete paths
(e.g. from source to sink) of the product along the path of the exponentiated scores
associated with the arcs on that path:
$$S(G) = \sum_{\mathrm{path} \in G} \; \prod_{a \in \mathrm{path}} e^a \qquad (10.17)$$
where the product is over all the arcs on a path (with score a), and the sum is over all
the paths associated with complete sequences (from beginning to end of a sequence).
S(G) may correspond to a likelihood, numerator or denominator of a probability. For
example,
$$P(\{z_t\} \in Y \mid x) = \frac{S(G_Y)}{S(G)} \qquad (10.18)$$
where $G_Y$ is the subgraph of $G$ which is restricted to sequences that are compatible with
some target answer $Y$.
For the inference task, we want to compute
$$\pi(G) = \arg\max_{\mathrm{path} \in G} \prod_{a \in \mathrm{path}} e^a = \arg\max_{\mathrm{path} \in G} \sum_{a \in \mathrm{path}} a$$
$$V(G) = \max_{\mathrm{path} \in G} \sum_{a \in \mathrm{path}} a$$
where $\pi(G)$ is the most probable path and $V(G)$ is its log-score or value, and again the
set of paths considered includes all of those starting at the beginning and ending at the
end of the sequence.
The principle of dynamic programming is to recursively compute intermediate quan-
tities that can be re-used efficiently so as to avoid actually going through an exponential
number of computations, e.g., through the exponential number of paths to consider in
the above sums or maxima. Note how it is already at play in the underlying efficiency of
back-propagation (or back-propagation through time), where gradients w.r.t. interme-
diate layers or time steps or nodes in a flow graph can be computed based on previously
computed gradients (for later layers, time steps or nodes). Here it can be achieved by
considering the restrictions of the graph to those paths that end at a node $n$, which we
denote $G^n$. As before, $G^n_Y$ indicates the additional restriction to subsequences that are
compatible with the target sequence $Y$, i.e., with the beginning of the sequence $Y$.
Figure 10.20: Illustration of the recursive computation taking place for inference or
marginalization by dynamic programming. See Figure 10.19. These recursions involve
sums or maximizations over sub-paths ending at one of the particular interior nodes
(red in the figure), each time only requiring to look up previously computed values at
the predecessor nodes (green).
We can thus perform marginalization efficiently as follows, as illustrated in Figure 10.20:
$$S(G) = \sum_{n \in \mathrm{final}(G)} S(G^n) \qquad (10.19)$$
where final(G) is the set of final nodes in the graph G, and we can recursively compute
the node-restricted sum via
$$S(G^n) = \sum_{m \in \mathrm{pred}(n)} S(G^m) \, e^{a_{m,n}} \qquad (10.20)$$
where $\mathrm{pred}(n)$ is the set of predecessors of node $n$ in the graph and $a_{m,n}$ is the log-score
associated with the arc from $m$ to $n$. It is easy to see that expanding the above recursion
recovers the result of Eq. 10.17.
Similarly, we can perform efficient MAP inference (also known as Viterbi decoding)
as follows.
$$V(G) = \max_{n \in \mathrm{final}(G)} V(G^n) \qquad (10.21)$$
and
$$V(G^n) = \max_{m \in \mathrm{pred}(n)} V(G^m) + a_{m,n}. \qquad (10.22)$$
To obtain the corresponding path, it is enough to keep track of the argmax associated
with each of the above maximizations and trace back π(G) starting from the nodes in
final(G). For example, the last element of $\pi(G)$ is
$$n^* \in \arg\max_{n \in \mathrm{final}(G)} V(G^n)$$
and (recursively) the argmax node before $n^*$ along the selected path is a new $n^*$,
$$n^* \leftarrow \arg\max_{m \in \mathrm{pred}(n^*)} V(G^m) + a_{m,n^*},$$
etc. Keeping track of these $n^*$ along the way gives the selected path. Proving that
these recursive computations yield the desired results is straightforward and left as an
exercise.
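As a concrete illustration of the recursions of Eqs. 10.19-10.22, here is a minimal sketch that computes both the marginalization total S(G) and the Viterbi path over a directed acyclic graph. The arc-list representation (triples given in topological order of their destination nodes) and the choice of node identifiers are assumptions made for the example.

```python
import math
from collections import defaultdict

def dag_sum_and_viterbi(arcs, sources, finals):
    """Dynamic programming over a DAG of scored arcs.

    `arcs` is assumed to be a list of (m, n, a) triples with log-score a,
    sorted in topological order of the destination nodes; `sources` are the
    start nodes and `finals` the set of final nodes.
    Returns the marginalization total S(G) and the best (Viterbi) path.
    """
    S = defaultdict(float)               # S(G^n), Eq. 10.20
    V = defaultdict(lambda: -math.inf)   # V(G^n), Eq. 10.22
    back = {}                            # argmax predecessors for trace-back
    for s in sources:
        S[s] = 1.0
        V[s] = 0.0
    for m, n, a in arcs:
        S[n] += S[m] * math.exp(a)       # sum-product recursion (Eq. 10.20)
        if V[m] + a > V[n]:              # max-sum (Viterbi) recursion (Eq. 10.22)
            V[n] = V[m] + a
            back[n] = m
    total = sum(S[n] for n in finals)              # S(G), Eq. 10.19
    best_final = max(finals, key=lambda n: V[n])   # Eq. 10.21
    path = [best_final]
    while path[-1] in back:              # trace back the argmax choices
        path.append(back[path[-1]])
    return total, list(reversed(path))
```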
10.7.3 HMMs
Hidden Markov Models (HMMs) are probabilistic models of sequences that were intro-
duced in the 60’s (Baum and Petrie, 1966) along with the E-M algorithm (Section 20.2).
They are very commonly used to model sequential structure, in particular having been
since the mid 80’s and until recently the technological core of speech recognition sys-
tems (Rabiner and Juang, 1986; Rabiner, 1989). Just like RNNs, HMMs are dynamic
Bayes nets (Koller and Friedman, 2009), i.e., the same parameters and graphical model
structure are used for every time step. Compared to RNNs, what is particular to HMMs
is that the latent variable associated with each time step (called the state) is discrete,
with a separate set of parameters associated with each state value. We consider here
the most common form of HMM, in which the Markov chain is of order 1, i.e., the state
$s_t$ at time $t$, given the previous states, only depends on the previous state $s_{t-1}$:
$$P(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1) = P(s_t \mid s_{t-1}),$$
which we call the transition or state-to-state distribution. Generalizing to higher-order
Markov chains is straightforward: for example, order-2 Markov chains can be mapped to
order-1 Markov chains by considering as order-1 "states" all the pairs $(s_t = i, s_{t-1} = j)$.
Given the state value, a generative probabilistic model of the visible variable $x_t$ is
defined, that specifies how each observation $x_t$ in a sequence $(x_1, x_2, \ldots, x_T)$ can be
generated, via a model $P(x_t \mid s_t)$. Two kinds of parameters are distinguished: those that
define the transition distribution, which can be given by a matrix
$$A_{ij} = P(s_t = i \mid s_{t-1} = j),$$
and those that define the output model $P(x_t \mid s_t)$. For example, if the data are discrete
and $x_t$ is a symbol, then another matrix can be used to define the output (or emission)
model:
$$B_{ki} = P(x_t = k \mid s_t = i).$$
Another common parametrization for $P(x_t \mid s_t = i)$, in the case of continuous vector-
valued $x_t$, is the Gaussian mixture model, where we have a different mixture (with its
own means, covariances and component probabilities) for each state $s_t = i$. Alterna-
tively, the means and covariances (or just variances) can be shared across states, and
only the component probabilities are state-specific.
The overall likelihood of an observed sequence is thus
$$P(x_1, x_2, \ldots, x_T) = \sum_{s_1, s_2, \ldots, s_T} \; \prod_t P(x_t \mid s_t) \, P(s_t \mid s_{t-1}). \qquad (10.23)$$
In the language established earlier in Section 10.7.2, we have a graph G with one
node $n$ per time step $t$ and state value $i$, i.e., for $s_t = i$, and one arc between each
node $n$ (for $\mathbf{1}_{s_t=i}$) and its predecessors $m$ for $\mathbf{1}_{s_{t-1}=j}$ (when the transition probability is
non-zero, i.e., $P(s_t = i \mid s_{t-1} = j) \neq 0$). Following Eq. 10.23, the log-score $a_{m,n}$ for the
transition between $m$ and $n$ would then be
$$a_{m,n} = \log P(x_t \mid s_t = i) + \log P(s_t = i \mid s_{t-1} = j).$$
As explained in Section 10.7.2, this view gives us a dynamic programming algorithm
for computing the likelihood (or the conditional likelihood given some constraints on the
set of allowed paths), called the forward-backward or sum-product algorithm, in time
O(kNT ) where T is the sequence length, N is the number of states and k the average
in-degree of each node.
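As a concrete illustration, here is a minimal sketch of the forward pass that computes the likelihood of Eq. 10.23 by dynamic programming. The conventions for the matrices A, B and the initial distribution pi are stated in the comments and are assumptions made for the example.

```python
import numpy as np

def hmm_likelihood(A, B, pi, x):
    """Forward pass computing P(x_1, ..., x_T) of Eq. 10.23 by dynamic
    programming, in O(N^2 T) for a fully connected transition matrix.

    Assumed conventions: A[i, j] = P(s_t = i | s_{t-1} = j),
    B[k, i] = P(x_t = k | s_t = i), pi[i] = P(s_1 = i),
    and x is a list of observed symbol indices.
    """
    alpha = pi * B[x[0], :]              # alpha_1(i) = P(x_1, s_1 = i)
    for t in range(1, len(x)):
        # alpha_t(i) = P(x_t | s_t = i) * sum_j A[i, j] * alpha_{t-1}(j)
        alpha = B[x[t], :] * (A @ alpha)
    return alpha.sum()                   # marginalize over the final state
```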
Although the likelihood is tractable and could be maximized by a gradient-based
optimization method, HMMs are typically trained by the E-M algorithm (Section 20.2),
which has been shown to converge rapidly (approaching the rate of Newton-like methods)
in some conditions (if we view the HMM as a big mixture, then the condition is for the
final mixture components to be well-separated, i.e., have little overlap) (Xu and Jordan,
1996).
At test time, the sequence of states that maximizes the joint likelihood
$$P(x_1, x_2, \ldots, x_T, s_1, s_2, \ldots, s_T)$$
can also be obtained using a dynamic programming algorithm (called the Viterbi algo-
rithm). This is a form of inference (see Section 14.5) that is called MAP (Maximum A
Posteriori) inference because we want to find the most probable value of the unobserved
state variables given the observed inputs. Using the same definitions as above (from
Section 10.7.2) for the nodes and log-score of the graph G in which we search for the
optimal path, the Viterbi algorithm corresponds to the recursion defined by Eq. 10.22.
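A corresponding minimal sketch of the Viterbi recursion (Eq. 10.22) specialized to this HMM, under the same assumed conventions for A, B and pi as in the forward-pass sketch above:

```python
import numpy as np

def hmm_viterbi(A, B, pi, x):
    """MAP state sequence arg max_s P(x, s) via the Viterbi recursion.
    Assumed conventions: A[i, j] = P(s_t = i | s_{t-1} = j),
    B[k, i] = P(x_t = k | s_t = i), pi[i] = P(s_1 = i)."""
    N, T = A.shape[0], len(x)
    V = np.log(pi) + np.log(B[x[0], :])      # V for the first time step
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j] = V_{t-1}(j) + log A[i, j]
        scores = V[None, :] + np.log(A)
        back[t] = scores.argmax(axis=1)
        V = scores.max(axis=1) + np.log(B[x[t], :])
    # Trace back the argmax choices from the best final state.
    states = [int(V.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return list(reversed(states))
```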
If the HMM is structured in such a way that states have a meaning associated
with labels of interest, then from the MAP sequence one can read off the associated
labels. When the number of states is very large (which happens for example with large-
vocabulary speech recognition based on n-gram language models), even the efficient
Viterbi algorithm becomes too expensive, and approximate search must be performed. A
common family of search algorithms for HMMs is the Beam Search algorithm (Lowerre,
1976): basically, a set of promising candidate paths are kept and gradually extended,
at each step pruning the set of candidates to only keep the best ones according to their
cumulative score (log of the joint likelihood of states and observations up to the current
time step and state being considered).
More details about speech recognition are given in Section 13.2. An HMM can be
used to associate a sequence of labels $(y_1, y_2, \ldots, y_N)$ with the input $(x_1, x_2, \ldots, x_T)$,
where the output sequence is typically shorter than the input sequence, i.e., $N < T$.
Knowledge of $(y_1, y_2, \ldots, y_N)$ constrains the set of compatible state sequences $(s_1, s_2, \ldots, s_T)$,
and the generative conditional likelihood
$$P(x_1, x_2, \ldots, x_T \mid y_1, y_2, \ldots, y_N) = \sum_{s_1, s_2, \ldots, s_T \in \mathcal{S}(y_1, y_2, \ldots, y_N)} \; \prod_t P(x_t \mid s_t) \, P(s_t \mid s_{t-1}) \qquad (10.24)$$
can be computed using the same forward-backward technique, and its logarithm maxi-
mized during training, as discussed above.
Various discriminative alternatives to the generative likelihood of Eq. 10.24 have
been proposed (Brown, 1987; Bahl et al., 1987; Nadas et al., 1988; Juang and Katagiri,
1992; Bengio et al., 1992; Bengio, 1993; Leprieur and Haffner, 1995; Bengio, 1999a),
the simplest of which is simply $P(y_1, y_2, \ldots, y_N \mid x_1, x_2, \ldots, x_T)$, which is obtained from
Eq. 10.24 by Bayes rule, i.e., involving a normalization over all sequences, i.e., the
unconstrained likelihood of Eq. 10.23:
$$P(y_1, y_2, \ldots, y_N \mid x_1, x_2, \ldots, x_T) = \frac{P(x_1, x_2, \ldots, x_T \mid y_1, y_2, \ldots, y_N) \, P(y_1, y_2, \ldots, y_N)}{P(x_1, x_2, \ldots, x_T)}.$$
Both the numerator and denominator can be formulated in the framework of the previous
section (Eqs. 10.18-10.20), where for the numerator we merge (add) the log-scores coming
from the structured output model $P(y_1, y_2, \ldots, y_N)$ and from the input likelihood
model $P(x_1, x_2, \ldots, x_T \mid y_1, y_2, \ldots, y_N)$. Again, each node of the graph corresponds to a
state of the HMM at a particular time step $t$ (which may or may not emit the next output
symbol $y_i$), associated with an input vector $x_t$. Instead of making the relationship to
the input the result of a simple parametric form (Gaussian or multinomial, typically),
the scores can be computed by a neural network (or any other parametrized differentiable
function). This gives rise to discriminative hybrids of search or graphical models with
neural networks, discussed below in Section 10.8.
10.7.4 CRFs
Whereas HMMs are typically trained to maximize the probability of an input sequence
x given a target sequence y and correspond to a directed graphical model, Conditional
Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models that are
trained to maximize the joint probability of the target variables, given input variables,
P (y|x). CRFs are special cases of the graph transformer model introduced in Bottou
et al. (1997); LeCun et al. (1998a), where neural nets are replaced by affine transforma-
tions and there is a single graph involved.
Many applications of CRFs involve sequences and the discussion here will be focused
on this type of application, although applications to images (e.g. for image segmenta-
tion) are also common. Compared to other graphical models, another characteristic
of CRFs is that there are no latent variables. The general equation for the proba-
bility distribution modeled by a CRF is basically the same as for fully visible (not
latent variable) undirected graphical models, also known as Markov Random Fields
(MRFs, see Section 14.2.2), except that the “potentials” (terms of the energy function)
are parametrized functions of the input variables, and the likelihood of interest is the
posterior probability P (y|x).
As in many other MRFs, CRFs often have a particular connectivity structure in
their graph, which allows one to perform learning or inference more efficiently. In par-
ticular, when dealing with sequences, the energy function typically only has terms that
relate neighboring elements of the sequence of target variables. For example, the target
variables could form a homogeneous$^4$ Markov chain of order $k$ (given the input variables).
A typical linear CRF example with binary outputs would have the following structure:
$$P(\mathbf{y} = y \mid x) = \frac{1}{Z} \, e^{\sum_t y_t \left(b + \sum_j w_j x_{t,j}\right) + \sum_{i=1}^{k} y_t y_{t-i} \left(u_i + \sum_j v_{ij} x_{t,j}\right)} \qquad (10.25)$$
where Z is the normalization constant, which is the sum over all y sequences of the
numerator. In that case, the score marginalization framework of Section 10.7.2 and
coming from Bottou et al. (1997); LeCun et al. (1998a) can be applied by making terms
in the above exponential correspond to scores associated with nodes t of a graph G. If
there were more than two output classes, more nodes per time step would be required
but the principle would remain the same. A more general formulation for Markov chains
of order d is the following:
$$P(\mathbf{y} = y \mid x) = \frac{1}{Z} \, e^{\sum_t \sum_{d'=0}^{d} f_{d'}(y_t, y_{t-1}, \ldots, y_{t-d'}, x_t)} \qquad (10.26)$$
where $f_{d'}$ computes a potential of the energy function, a parametrized function of both
the past target values (up to $y_{t-d'}$) and of the current input value $x_t$. For example, as
discussed below $f_{d'}$ could be the output of an arbitrary parametrized computation, such
as a neural network.
Although Z looks intractable, because of the Markov property of the model (order
1, in the example), it is again possible to exploit dynamic programming to compute
Z efficiently, as per Eqs. 10.18-10.20. Again, the idea is to compute the sub-sum for
sequences of length $t \le T$ (where T is the length of a target sequence y), ending in
each of the possible state values at t, e.g., $y_t = 1$ and $y_t = 0$ in the above example.
For higher order Markov chains (say order d instead of 1) and a larger number of state
values (say N instead of 2), the required sub-sums to keep track of are for each element
in the cross-product of $d-1$ state values, i.e., $N^{d-1}$. For each of these elements, the new
sub-sums for sequences of length $t+1$ (for each of the N values at $t+1$ and corresponding
$N^{\max(0,\,d-2)}$ past values for the past $d-2$ time steps) can be obtained by only considering
the sub-sums for the $N^{d-1}$ joint state values for the last $d-1$ time steps before $t+1$.

$^4$ meaning that the same parameters are used for every time step
Following Eq. 10.22, the same kind of decomposition can be performed to efficiently
find the MAP configuration of y’s given x, where instead of products (sums inside the
exponential) and sums (for the outer sum of these exponentials, over different paths) we
respectively have sums (corresponding to adding the sums inside the exponential) and
maxima (across the different competing “previous-state” choices).
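As a concrete illustration, here is a minimal sketch of the dynamic programming computation of log Z for a simplified version of the binary, order-1 linear-chain CRF of Eq. 10.25; restricting the pairwise term to a single scalar parameter u with no pairwise input term is a simplifying assumption made for the example.

```python
import numpy as np

def crf_log_partition(x, b, w, u):
    """log Z for a binary, order-1 linear-chain CRF (simplified Eq. 10.25).

    Assumed setup: x is a (T, J) array of input features; the exponent is
    sum_t y_t * (b + w . x_t) + sum_t y_t * y_{t-1} * u, with y_t in {0, 1}.
    """
    T = x.shape[0]
    unary = b + x @ w                  # unary[t] multiplies y_t in the exponent
    # alpha[i] = log-sum over prefixes ending with y_t = i of exp(score)
    alpha = np.array([0.0, unary[0]])  # y_1 = 0 contributes nothing; y_1 = 1 adds unary[0]
    for t in range(1, T):
        new_alpha = np.empty(2)
        for i in (0, 1):               # current value y_t = i
            # previous value j contributes the pairwise term i * j * u
            terms = [alpha[j] + i * (unary[t] + j * u) for j in (0, 1)]
            new_alpha[i] = np.logaddexp(terms[0], terms[1])
        alpha = new_alpha
    return np.logaddexp(alpha[0], alpha[1])
```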
Figure 10.21: Illustration of the stacking of graph transformers (right, c) as a generaliza-
tion of the stacking of convolutional layers (middle, b) or of regular feedforward layers
that transform fixed-size vectors (left, a). Figure reproduced with permission from the
authors of Bottou et al. (1997). Quoting from that paper, (c) shows that “multilayer
graph transformer networks are composed of trainable modules that operate on and
produce graphs whose arcs carry numerical information".
10.8 Combining Neural Networks and Search
The idea of combining neural networks with HMMs or related search or alignment-based
components (such as graph transformers) for speech and handwriting recognition dates
from the early days of research on multi-layer neural networks (Bourlard and Wellekens,
1990; Bottou et al., 1990; Bengio, 1991; Bottou, 1991; Haffner et al., 1991; Bengio et al.,
1992; Matan et al., 1992; Bourlard and Morgan, 1993; Bengio et al., 1995; Bengio and
Frasconi, 1996; Baldi and Brunak, 1998) and see more references in Bengio (1999b).
See also Section 13.4 for combining recurrent and other deep learners with generative models
such as CRFs, GSNs or RBMs.
Figure 10.22: Illustration of the input and output of a simple graph transformer that
maps a singleton graph corresponding to an input image to a graph representing
segmentation hypotheses. Reproduced with permission from the authors
of Bottou et al. (1997).
The principle of efficient marginalization and inference for temporally structured
outputs by exploiting dynamic programming (Sec. 10.7.2) can be applied not just when
the log-scores of Eqs. 10.17 and 10.19 are parameters or linear functions of the input,
but also when they are learned non-linear functions of the input, e.g., via a neural
network transformation, as was first done in Bottou et al. (1997); LeCun et al. (1998a).
These papers additionally introduced the powerful idea of learned graph transformers,
illustrated in Figure 10.21. In this context, a graph transformer is a machine that can
map a directed acyclic graph $G_{\mathrm{in}}$ to another graph $G_{\mathrm{out}}$. Both input and output graphs
have paths that represent hypotheses about the observed data.
For example, in the above papers, and as illustrated in Figure 10.22, a segmentation
graph transformer takes a singleton input graph (the image x) and outputs a graph
representing segmentation hypotheses (regarding sequences of segments that could each
contain a character in the image). Such a graph transformer could be used as one layer
of a graph transformer network for handwriting recognition or document analysis for
reading amounts on checks, as illustrated respectively in Figures 10.23 and 10.24.
For example, after the segmentation graph transformer, a recognition graph trans-
former could expand each node of the segmentation graph into a subgraph whose arcs
correspond to different interpretations of the segment (which character is present in
the segment?). Then, a dictionary graph transformer takes the recognition graph and
expands it further by considering only the sequences of characters that are compatible
with sequences of words in the language of interest. Finally, a language-model graph
transformer expands sequences of word hypotheses so as to include multiple words in
the state (context) and weigh the arcs according to the language model next-word log-
probabilities.
Each of these transformations is parametrized and takes real-valued scores on the
arcs of the input graph into real-valued scores on the arcs of the output graph. These
transformations can be parametrized and learned by gradient-based optimization over
the whole series of graph transformers.
Figure 10.23: Illustration of the graph transformer network that has been used for finding
the best segmentation of a handwritten word, for handwriting recognition. Reproduced
with permission from Bottou et al. (1997).
10.8.1 Approximate Search
Unfortunately, as in the above example, when the number of nodes of the graph becomes
very large (e.g., considering all previous n words to condition the log-probability of the
next one, for n large), even dynamic programming (whose computation scales with the
number of arcs) is too slow for practical applications such as speech recognition or
machine translation. A common example is when a recurrent neural network is used to
compute the arc log-scores, e.g., as in neural language models (Section 13.3). Since the
prediction at step t depends on all $t - 1$ previous choices, the number of states (nodes
of the search graph G) grows exponentially with the length of the sequence.
In that case, one has to resort to approximate search. In the case of sequential struc-
tures as discussed in this chapter, a common family of approximate search algorithms
is the beam search (Lowerre, 1976).
The basic idea of beam search algorithms is the following:
• Break the nodes of the graph into T groups containing only "comparable nodes",
e.g., the group of nodes n for which the maximum length of the paths ending at
n is exactly t.

• Process these groups of nodes sequentially, keeping only at each step t a selected
subset $S_t$ of the nodes (the "beam"), chosen based on the subset $S_{t-1}$. Each node
in $S_t$ is associated with a score $\hat{V}(G^n)$ that represents an approximation (a lower
bound) on the maximum total log-score of the path ending at the node, $V(G^n)$
(defined in Eq. 10.22, Viterbi decoding).

• $S_t$ is obtained by following all the arcs from the nodes in $S_{t-1}$, and sorting all the
resulting group t nodes n according to their estimated (lower bound) score
$$\hat{V}(G^n) = \max_{m \in S_{t-1} \,\mathrm{and}\, m \in \mathrm{pred}(n)} \hat{V}(G^m) + a_{m,n},$$
while keeping track of the argmax in order to trace back the estimated best path.
Only the k nodes with the highest log-score are kept and stored in $S_t$, and k is
called the beam width.

• The estimated best final node can be read off from $\max_{n \in S_T} \hat{V}(G^n)$ and the esti-
mated best path from the associated argmax choices made along the way, just like
in the Viterbi algorithm (a minimal sketch is given below).
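Here is a minimal sketch of this beam search procedure; the group/arc data structures and the dictionary-based back-pointers are assumptions made for the example.

```python
import heapq

def beam_search(groups, pred_arcs, k):
    """Approximate Viterbi decoding with a beam of width k.

    Assumed inputs: `groups` is the list of node groups processed in order
    (group t contains the "comparable" nodes at step t), and `pred_arcs[n]`
    is a list of (m, a_mn) pairs giving the predecessors of node n and the
    arc log-scores. Returns the estimated best final node, its score, and
    the back-pointers needed to trace the estimated best path.
    """
    V_hat = {n: 0.0 for n in groups[0]}   # scores of the initial beam
    back = {}
    for group in groups[1:]:
        candidates = {}
        for n in group:
            best = None
            for m, a in pred_arcs.get(n, []):
                if m in V_hat and (best is None or V_hat[m] + a > best[0]):
                    best = (V_hat[m] + a, m)
            if best is not None:
                candidates[n] = best
        # Keep only the k highest-scoring nodes (the beam S_t).
        kept = heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1][0])
        V_hat = {n: score for n, (score, _) in kept}
        back.update({n: m for n, (_, m) in kept})
    best_final = max(V_hat, key=V_hat.get)
    return best_final, V_hat[best_final], back
```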
One problem with beam search is that the beam often ends up lacking in diversity,
making the approximation poor. For example, imagine that we have two “types” of
solutions, but that each type has exponentially many variants (as a function of t), due,
e.g., to small independent variations in ways in which the type can be expressed at each
time step t. Then, even though the two types may have close best log-score up to time
t, the beam could be dominated by the one that wins slightly, eliminating the other
type from the search, although later time steps might reveal that the second type was
actually the best one.
Figure 10.24: Illustration of the graph transformer network that has been used for
reading amounts on checks, starting from the single graph containing the image of
the check to the recognized sequences of characters corresponding to the amount on
the check, with currency and other recognized marks. Note how the grammar graph
transformer composes the grammar graph (allowed sequences of characters) and the
recognition graph (with character hypotheses associated with specific input segments,
on the arcs) into an interpretation graph that only contains the recognition graph paths
that are compatible with the grammar. Reproduced with permission from Bottou et al.
(1997).