Chapter 12
Sequence Modeling: Recurrent
and Recursive Nets
One of the early ideas found in machine learning and statistical models of the 1980s is
that of sharing parameters across different parts of a model (see Section 7.4 for an
introduction to the concept of parameter sharing), allowing the model to be extended and
applied to examples of different forms, e.g., of different lengths. This idea can be found
in hidden Markov models (HMMs) (Rabiner and Juang, 1986), where the same state-to-state
transition matrix P(s_t | s_{t-1}) is used for different time steps, allowing one to
model variable-length sequences. It can also be found in the idea of the recurrent neural
network (Rumelhart et al., 1986b) or RNN: the same weights are used for different
instances of the artificial neurons at different time steps, allowing us to apply the network
to input sequences of different lengths. (Unfortunately, the RNN acronym is sometimes also
used to denote recursive neural networks; however, since the RNN acronym has been around
for much longer, we suggest keeping it for recurrent neural networks.) This idea is made
more explicit in the early work on time-delay neural networks (Lang and Hinton, 1988;
Waibel et al., 1989), where a fully connected network is replaced by one with local
connections that are shared across different temporal instances of the hidden units. Such
networks are the ancestors of convolutional neural networks, covered in more detail in
Section 11. Recurrent nets are covered below in Section 12.2. As shown in Section 12.1
below, the flow graph (a notion introduced in Section 6.3) associated with a recurrent
network is structured like a chain. Recurrent neural networks have been generalized into
recursive neural networks, in which the structure can be more general, i.e., it is
typically viewed as a tree. Recursive neural networks are discussed in more detail in
Section 12.4.
12.1 Unfolding Flow Graphs and Sharing Parameters
A flow graph is a way to formalize the structure of a set of computations, such as
those involved in mapping inputs and parameters to outputs and loss. Please refer to
Section 6.3 for a general introduction. In this section we explain the idea of unfolding a
recursive or recurrent computation into a flow graph that has a repetitive structure.
For example, consider the classical form of a dynamical system:

    s_t = F_θ(s_{t-1})                                    (12.1)

where s_t is called the state of the system. The unfolded flow graph of such a system
is illustrated in Figure 12.1.
Figure 12.1: Classical dynamical system of equation 12.1, illustrated as an unfolded
flow graph. Each node represents the state at some time t and the function F_θ maps the
state at t to the state at t + 1. The same parameters (the same function F_θ) are used
for all time steps.
As another example, let us consider a dynamical system driven by an external signal x_t,

    s_t = F_θ(s_{t-1}, x_t)                               (12.2)

illustrated in Figure 12.2, where we see that the state now contains information about
the whole past sequence, i.e., the above equation implicitly defines a function

    s_t = G_t(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1)       (12.3)

which maps the whole past sequence (x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1) to the current state.
Equation 12.2 is actually part of the definition of a recurrent net. We can think of
s_t as a kind of summary of the past sequence of inputs up to time t. Note that this
summary is in general necessarily lossy, since it maps an arbitrary-length sequence
(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1) to a fixed-length vector s_t. Depending on the training
criterion, this summary might selectively keep some aspects of the past sequence with
more precision than other aspects. For example, if the RNN is used in statistical
language modeling, typically to predict the next word given previous words, it may not be
necessary to distinctly keep track of all the bits of information, only those required to
predict the rest of the sentence. The most demanding situation is when we ask s_t to be
rich enough to allow one to approximately recover the input sequence, as in auto-encoder
frameworks (Section 10.1).
If we had to define a different function G_t for each possible sequence length (imagine a
separate neural network, each with a different input size), each with its own parameters,
we would not get any generalization to sequences of a size not seen in the training set.
Furthermore, one would need to see a lot more training examples, because a separate
model would have to be trained for each sequence length. By instead defining the state
through a recurrent formulation as in Eq. 12.2, the same parameters are used for any
sequence length, allowing much better generalization properties.
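To make the parameter-sharing argument concrete, here is a minimal sketch (illustrative only; the state size, the specific choice of F_θ as a tanh-affine map, and the random parameters are assumptions made for this example) showing that one set of parameters handles sequences of any length:

```python
import numpy as np

rng = np.random.RandomState(0)
state_size, input_size = 4, 3

# A single set of parameters theta = (W, U, b) is reused at every time step.
W = rng.randn(state_size, state_size) * 0.1
U = rng.randn(state_size, input_size) * 0.1
b = np.zeros(state_size)

def F(s_prev, x_t):
    """One step of the recurrence s_t = F_theta(s_{t-1}, x_t)."""
    return np.tanh(b + W @ s_prev + U @ x_t)

def final_state(x_sequence):
    """Fold F over a sequence of any length, starting from a zero state."""
    s = np.zeros(state_size)
    for x_t in x_sequence:
        s = F(s, x_t)
    return s

# The very same parameters handle sequences of length 5 and of length 50.
print(final_state(rng.randn(5, input_size)))
print(final_state(rng.randn(50, input_size)))
```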
Equation 12.2 can be drawn in two different ways. One is inspired by how a physical
implementation (such as a real neural network) might look, i.e., like a circuit which
operates in real time, as in the left of Figure 12.2. The other is as a flow graph, in
which the computations occurring at different time steps in the circuit are unfolded as
different nodes of the flow graph, as in the right of Figure 12.2. What we call unfolding
is the operation that maps a circuit as in the left side of the figure to a flow graph
with repeated pieces as in the right side. Note how the unfolded graph now has a size
that depends on the sequence length. The black square indicates a delay of 1 time step
on the recurrent connection, from the state at time t to the state at time t + 1.
Figure 12.2: Left: the input processing part of a recurrent neural network, seen as a
circuit. The black square indicates a delay of 1 time step. Right: the same seen as an
unfolded flow graph, where each node is now associated with one particular time instance.
The other important observation to make from Figure 12.2 is that the same parameters (θ)
are shared over different parts of the graph, corresponding here to different time steps.
12.2 Recurrent Neural Networks
Armed with the ideas introduced in the previous section, we can design a wide variety
of recurrent circuits, which are compact and simple to understand visually, and
automatically obtain their equivalent unfolded graphs, which are useful computationally
and also help focus on the idea of information flow forward in time (computing outputs
and losses) and backward in time (computing gradients).
Some of the early circuit designs for recurrent neural networks are illustrated in
Figures 12.3, 12.4 and 12.5. Figure 12.3 shows the vanilla recurrent network whose
equations are laid down below in Eq. 12.4, and which has been shown to be a universal
approximation machine for sequences, i.e., able to implement a Turing machine (Siegelmann
and Sontag, 1991; Siegelmann, 1995; Hyotyniemi, 1996). On the other hand, the
network with output recurrence shown in Figure 12.4 has a more limited memory or
state, which is its output, i.e., the prediction of the previous target, which potentially
limits its expressive power, but also makes it easier to train. Indeed, the “intermediate
state” of the corresponding unfolded deep network is not hidden anymore: targets are
available to guide this intermediate representation, which should make it easier to train.
Teacher forcing is the training process in which the fed-back inputs are not the predicted
outputs but the targets themselves, as shown in figure TODO. The disadvantage of strict
teacher forcing is that if the network is going to be later used in an open-loop mode, i.e.,
with the network outputs (or samples from the output distribution) fed back as input,
then the kind of inputs that the network will have seen during training could be quite
different from the kind of inputs that it will see at test time, potentially yielding very
poor generalization. One way to mitigate this problem is to train with both teacher-forced
inputs and free-running inputs, e.g., predicting the correct target a number
of steps in the future through the unfolded recurrent output-to-input paths.
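The contrast between teacher forcing and open-loop (free-running) operation can be sketched as follows (an illustrative toy example rather than a reference implementation; the `step` function and its random parameters are hypothetical stand-ins for the output-recurrent network of Figure 12.4):

```python
import numpy as np

def step(fed_back, x_t, params):
    """Hypothetical one-step update of an output-recurrent network (Fig. 12.4)."""
    W, U, V = params
    h = np.tanh(W @ fed_back + U @ x_t)
    return V @ h                            # output o_t

def unroll_teacher_forced(xs, targets, params, o0):
    """Teacher forcing: the value fed back at step t is the previous *target*."""
    outputs, fed_back = [], o0
    for t, x_t in enumerate(xs):
        outputs.append(step(fed_back, x_t, params))
        fed_back = targets[t]               # feed back the ground-truth target
    return outputs

def unroll_free_running(xs, params, o0):
    """Open-loop mode: the value fed back is the network's own previous output."""
    outputs, fed_back = [], o0
    for x_t in xs:
        o_t = step(fed_back, x_t, params)
        outputs.append(o_t)
        fed_back = o_t                      # feed back the prediction
    return outputs

rng = np.random.RandomState(0)
dim_o, dim_x, dim_h = 2, 3, 4
params = (rng.randn(dim_h, dim_o), rng.randn(dim_h, dim_x), rng.randn(dim_o, dim_h))
xs, targets, o0 = rng.randn(6, dim_x), rng.randn(6, dim_o), np.zeros(dim_o)
print(unroll_teacher_forced(xs, targets, params, o0)[0])
print(unroll_free_running(xs, params, o0)[0])
```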
Figure 12.3: Left: vanilla recurrent network with hidden-to-hidden recurrence, seen as a
circuit, with weight matrices U, V, W for the three different kinds of connections (input-to-hidden,
hidden-to-output, and hidden-to-hidden, respectively). Each circle indicates
a whole vector of activations. Right: the same seen as a time-unfolded flow graph,
where each node is now associated with one particular time instance.
The vanilla recurrent network of Figure 12.3 corresponds to the following forward
propagation equations, if we assume that hyperbolic tangent non-linearities are used in
the hidden units and a softmax is used at the output (for classification problems):

    a_t = b + W s_{t-1} + U x_t
    s_t = tanh(a_t)
    o_t = c + V s_t
    p_t = softmax(o_t)                                    (12.4)

where the parameters are the bias vectors b and c along with the weight matrices U,
V and W, respectively for input-to-hidden, hidden-to-output, and hidden-to-hidden
connections. This is an example of a recurrent network that maps an input sequence to
an output sequence of the same length. The total loss for a given input/target sequence
pair (x, y) would then be just the sum of the losses over all the time steps, e.g.,

    L(x, y) = Σ_t L_t = - Σ_t log p_{y_t}                 (12.5)

where y_t is the category that should be associated with time step t in the output sequence.
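A direct transcription of Eqs. 12.4 and 12.5 into code might look as follows (a minimal numpy sketch with arbitrarily chosen sizes and random parameters, for illustration only):

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_out, T = 5, 8, 4, 10   # arbitrary sizes for illustration

# Parameters: biases b, c and weight matrices U, V, W (Eq. 12.4).
b, c = np.zeros(n_hidden), np.zeros(n_out)
U = rng.randn(n_hidden, n_in) * 0.1      # input-to-hidden
V = rng.randn(n_out, n_hidden) * 0.1     # hidden-to-output
W = rng.randn(n_hidden, n_hidden) * 0.1  # hidden-to-hidden

x = rng.randn(T, n_in)                   # input sequence
y = rng.randint(n_out, size=T)           # target classes, one per time step

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

s = np.zeros(n_hidden)                   # initial state s_0
loss = 0.0
for t in range(T):
    a = b + W @ s + U @ x[t]             # a_t
    s = np.tanh(a)                       # s_t
    o = c + V @ s                        # o_t
    p = softmax(o)                       # p_t
    loss += -np.log(p[y[t]])             # L_t = -log p_{y_t}  (Eq. 12.5)
print(loss)
```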
Figure 12.4: Left: RNN whose recurrence is only through the output. Right: computational
flow graph unfolded in time. At each time step t, the input is x_t, the hidden
layer activations are h_t, the output is o_t, the target is y_t and the loss is L_t. Such
an RNN is less powerful (can express a smaller set of functions) than those in the family
represented by Figure 12.3, but may be easier to train because it can exploit “teacher
forcing”, i.e., knowledge of the intermediate state (which is the output or target), which
is observed during training.
12.2.1 Computing the gradient in a recurrent neural network
Using the generalized back-propagation algorithm (for arbitrary flow graphs) introduced
in Section 6.3, one can thus obtain the so-called Back-Propagation Through Time
(BPTT) algorithm. Once we know how to compute gradients, we can in principle apply
any of the gradient-based training techniques introduced in Section 4.3 and developed in
greater depth in Chapter 8 to train an RNN.
Let us thus work out how to compute gradients by BPTT for the RNN equations
above (Eqs. 12.4 and 12.5). The nodes of our flow graph are the sequence of x_t's,
s_t's, o_t's, L_t's, and the parameters U, V, W, b, c. For each node a we need to compute
the gradient ∂L/∂a recursively, based on the gradient computed at nodes that follow it in
the graph. We start the recursion with the nodes immediately preceding the final loss,

    ∂L/∂L_t = 1

and the gradient on the outputs i at time step t, for all i, t:

    ∂L/∂o_{t,i} = (∂L/∂L_t) (∂L_t/∂o_{t,i}) = p_{t,i} - 1_{i,y_t}

and work our way backwards, starting from the end of the sequence, say T, at which
point s_T only has o_T as descendent:

    ∂L/∂s_T = (∂L/∂o_T) (∂o_T/∂s_T) = (∂L/∂o_T) V.
Figure 12.5: Time-unfolded recurrent neural network with a single output at the end
of the sequence. Such a network can be used to summarize a sequence and produce a
fixed-size representation used as input for further processing. There might be a target
right at the end (as in the figure) or the gradient on the output o_t can be obtained by
back-propagating from further downstream modules.
Note how the above equation is vector-wise and corresponds, scalar-wise, to

    ∂L/∂s_{T,j} = Σ_i (∂L/∂o_{T,i}) V_{ij}.

We can then iterate backwards in time to back-propagate gradients through time,
from t = T - 1 down to t = 1, noting that s_t (for t < T) has as descendents both o_t and
s_{t+1}:

    ∂L/∂s_t = (∂L/∂s_{t+1}) (∂s_{t+1}/∂s_t) + (∂L/∂o_t) (∂o_t/∂s_t)
            = (∂L/∂s_{t+1}) diag(1 - s_{t+1}^2) W + (∂L/∂o_t) V

where diag(1 - s_{t+1}^2) indicates the diagonal matrix containing the elements 1 - s_{t+1,i}^2,
i.e., the derivative of the hyperbolic tangent associated with the hidden unit i at time
t + 1.
Once the gradients on the internal nodes of the flow graph are obtained, we can
obtain the gradients on the parameter nodes, which have descendents at all the time
steps:

    ∂L/∂c = Σ_t (∂L/∂o_t) (∂o_t/∂c) = Σ_t ∂L/∂o_t
    ∂L/∂b = Σ_t (∂L/∂s_t) (∂s_t/∂b) = Σ_t (∂L/∂s_t) diag(1 - s_t^2)
    ∂L/∂V = Σ_t (∂L/∂o_t) (∂o_t/∂V) = Σ_t (∂L/∂o_t)^T s_t^T
    ∂L/∂W = Σ_t (∂L/∂s_t) (∂s_t/∂W) = Σ_t diag(1 - s_t^2) (∂L/∂s_t)^T s_{t-1}^T
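Putting the recursions above together, a minimal BPTT sketch for the vanilla RNN could look as follows (illustrative code, not an optimized implementation; it also accumulates the gradient on U, which follows the same pattern as the gradient on W, and the variable names are chosen for this example only):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(x, y, params):
    """Forward pass of Eqs. 12.4/12.5, storing what BPTT needs."""
    b, c, U, V, W = params
    T, n_hidden = len(x), len(b)
    s = np.zeros((T + 1, n_hidden))          # s[0] is the initial state
    p = np.zeros((T, len(c)))
    loss = 0.0
    for t in range(T):
        s[t + 1] = np.tanh(b + W @ s[t] + U @ x[t])
        p[t] = softmax(c + V @ s[t + 1])
        loss += -np.log(p[t, y[t]])
    return loss, s, p

def rnn_bptt(x, y, params):
    """Back-propagation through time for the vanilla RNN of Figure 12.3."""
    b, c, U, V, W = params
    loss, s, p = rnn_forward(x, y, params)
    gb, gc, gU, gV, gW = [np.zeros_like(m) for m in (b, c, U, V, W)]
    da_next = np.zeros_like(b)               # dL/da_{t+1}, zero past the end
    for t in reversed(range(len(x))):
        do = p[t].copy()
        do[y[t]] -= 1.0                      # dL/do_t = p_t - one_hot(y_t)
        gc += do
        gV += np.outer(do, s[t + 1])
        ds = do @ V + da_next @ W            # dL/ds_t: via o_t and via s_{t+1}
        da = ds * (1.0 - s[t + 1] ** 2)      # multiply by diag(1 - s_t^2)
        gb += da
        gW += np.outer(da, s[t])             # s[t] holds s_{t-1}
        gU += np.outer(da, x[t])             # same pattern gives the U gradient
        da_next = da
    return loss, (gb, gc, gU, gV, gW)

rng = np.random.RandomState(0)
n_in, n_hidden, n_out, T = 5, 8, 4, 10
params = (np.zeros(n_hidden), np.zeros(n_out),
          rng.randn(n_hidden, n_in) * 0.1,      # U
          rng.randn(n_out, n_hidden) * 0.1,     # V
          rng.randn(n_hidden, n_hidden) * 0.1)  # W
loss, grads = rnn_bptt(rng.randn(T, n_in), rng.randint(n_out, size=T), params)
print(loss, [g.shape for g in grads])
```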
12.2.2 Recurrent Networks as Generative Directed Acyclic Models
Up to here, we have not clearly stated what the losses L_t associated with the outputs
o_t of a recurrent net should be. This is because there are many possible ways in which
RNNs can be used. Here we consider the case where the RNN models a probability
distribution over a sequence of observations.

As introduced in Section 9.2.1, directed graphical models represent a probability
distribution over a sequence of T random variables x = (x_1, x_2, ..., x_T) by parametrizing
their joint distribution as

    P(x) = P(x_1, ..., x_T) = Π_{t=1}^{T} P(x_t | x_{t-1}, x_{t-2}, ..., x_1)    (12.6)

where the right-hand side of the conditioning bar is empty for t = 1, of course. Hence
the negative log-likelihood of x according to such a model is

    L = Σ_t L_t

where

    L_t = - log P(x_t | x_{t-1}, x_{t-2}, ..., x_1).
In general directed graphical models, x_t can be predicted using only a subset of its
predecessors (x_1, ..., x_{t-1}). However, for RNNs, the graphical model is generally fully
connected, not removing any dependency a priori. This can be achieved efficiently
through the recurrent parametrization, such as in Eq. 12.2, since s_t summarizes the
whole previous sequence (Eq. 12.3).
Hence, instead of cutting statistical complexity by removing arcs in the directed
graphical model for (x_1, ..., x_T), the core idea of recurrent networks is that we introduce
a state variable which decouples all the past and future observations, but we
make that state variable a generally deterministic function of the past, through the
recurrence, Eq. 12.2. Consequently, the number of parameters required to parametrize
P(x_t | x_{t-1}, x_{t-2}, ..., x_1) does not grow exponentially with t (as it would if we parametrized
that probability by a straight probability table, when the data is discrete) but remains
constant with t. It only grows with the dimension of the state s_t. The price to be paid
for that great advantage is that optimizing the parameters may be more difficult, as
discussed below in Section 12.6. The decomposition of the likelihood thus becomes:

    P(x) = Π_{t=1}^{T} P(x_t | G_t(x_{t-1}, x_{t-2}, ..., x_1))

where

    s_t = G_t(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1) = F_θ(s_{t-1}, x_t).
Note that if the self-recurrence function F_θ is learned, it can discard some of the
information in some of the past values x_{t-k} that are not needed for predicting the future
data. In fact, because the state generally has a fixed dimension smaller than the length
of the sequences (times the dimension of the input), it has to discard some information.
However, we leave it to the learning procedure to choose what information to keep and
what information to throw away.
Figure 12.6: A generative recurrent neural network modeling P(x_1, ..., x_T) and able
to generate sequences from this distribution. Each element x_t of the observed sequence
serves both as input (for the current time step) and as target (for the previous time
step). The output o_t encodes the parameters of a conditional distribution
P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | o_t) for x_{t+1}, given the past sequence x_1, ..., x_t. The
loss L_t is the negative log-likelihood associated with the output prediction (or more
generally, distribution parameters) o_t, when the actual observed target is x_{t+1}. In
training mode, one measures and minimizes the sum of the losses over the observed
sequence(s) x. In generative mode, x_{t+1} is sampled from the conditional distribution
P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | o_t) (dashed arrows) and then that generated sample
x_{t+1} is fed back as input for computing the next state s_{t+1}, the next output o_{t+1},
and generating the next sample x_{t+2}, etc.
The above decomposition of the joint probability of a sequence of variables into
ordered conditionals precisely corresponds to what an RNN can compute, if the target
at each time step t is the next element in the sequence, while the input at each time step is
the previous element in the sequence, and the output is interpreted as parametrizing the
probability distribution of the target given the state. This is illustrated in Figure 12.6.
If the RNN is actually going to be used to generate sequences, one must also incorporate
in the output information that allows one to stochastically decide when to stop generating
new output elements. This can be achieved in various ways. In the case when the output
is a symbol taken from a vocabulary, one can add a special symbol corresponding to
the end of a sequence. When that symbol is generated, a complete sequence has been
generated. The target for that special symbol occurs exactly once per sequence, as the
last target after the last output element x_T. In general, one may train a binomial output
associated with that stopping variable, for example using a sigmoid output non-linearity
and the cross-entropy loss, i.e., again the negative log-likelihood for the event “end of the
sequence”. Another kind of solution is to model the integer T itself, through any
reasonable parametric distribution, and use the number of time steps left (and possibly the
number of time steps since the beginning of the sequence) as extra inputs at each time
step. Thus we would have decomposed P(x_1, ..., x_T) into P(T) and P(x_1, ..., x_T | T). In
general, one must therefore keep in mind that in order to fully generate a sequence we must
not only generate the x_t's, but also the sequence length T (either implicitly through a
series of continue/stop decisions, or explicitly through modeling the distribution of T
itself as an integer random variable).
If we take the RNN equations of the previous section (Eqs. 12.4 and 12.5), they could
correspond to a generative RNN if we simply make the target y_t equal to the next input
x_{t+1} (and because the outputs are the result of a softmax, it must be that the input
sequence is a sequence of symbols, i.e., x_t is a symbol or bounded integer).
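For instance, generation with an end-of-sequence symbol might be sketched as follows (illustrative only: the vocabulary size, the reserved <eos> index, and the random parameters are assumptions made for this example; a trained model would use learned parameters instead):

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, n_hidden = 6, 8              # symbol 0 is reserved as <eos>
EOS = 0

# Hypothetical parameters of the generative RNN (Eq. 12.4).
b, c = np.zeros(n_hidden), np.zeros(vocab_size)
U = rng.randn(n_hidden, vocab_size) * 0.1   # input symbols are one-hot encoded
V = rng.randn(vocab_size, n_hidden) * 0.1
W = rng.randn(n_hidden, n_hidden) * 0.1

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def sample_sequence(max_len=50):
    """Generate symbols until <eos> is sampled (or max_len is reached)."""
    s = np.zeros(n_hidden)
    x_t = one_hot(EOS, vocab_size)       # conventionally start from <eos>
    sequence = []
    for _ in range(max_len):
        s = np.tanh(b + W @ s + U @ x_t)
        p = softmax(c + V @ s)           # P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | o_t)
        next_symbol = rng.choice(vocab_size, p=p)
        if next_symbol == EOS:
            break                        # stochastic decision to stop generating
        sequence.append(next_symbol)
        x_t = one_hot(next_symbol, vocab_size)  # feed the sample back as input
    return sequence

print(sample_sequence())
```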
Other types of data can clearly be modeled in a similar way, following the discussions
about the encoding of outputs and the probabilistic interpretation of losses as negative
log-likelihoods, in Sections 5.6 and 6.2.2.
12.2.3 RNNs to represent conditional probability distributions
In general, as discussed in Section 6.2.2 (see especially the end of that section, in
Subsection TODO), when we can represent a parametric probability distribution P(Y | ω),
we can make it conditional by making ω a function of the desired conditioning variable:

    P(Y | ω = f(X)).

In the case of an RNN, this can be achieved in different ways, and we review here the
most common and obvious choices.

If X is a fixed-size vector, then we can simply make it an extra input of the RNN
that generates the Y sequence. Some common ways of providing an extra input to an
RNN are:

1. as an extra input at each time step, or

2. as the initial state s_0, or

3. both.
Figure 12.7: A conditional generative recurrent neural network maps a fixed-length
vector x into a distribution over sequences Y. Each element y_t of the observed output
sequence serves both as input (for the current time step) and as target (for the previous
time step). The generative semantics are the same as in the unconditional case
(Fig. 12.6). The only difference is that the state is now conditioned on the input x.
Although this was not discussed in Fig. 12.6, in both figures one should note that the
length of the sequence must also be generated (unless known in advance). This could
be done by a special binary output unit that encodes the fact that the next output is
the last.
In general, one may need to add extra parameters (and parametrization) to map X = x
into the “extra bias” going either into only s_0, into the other s_t (t > 0), or into both.
The first (and most commonly used) approach is illustrated in Figure 12.7.

As an example, we could imagine that X is encoding the identity of a phoneme and
the identity of a speaker, and that Y represents an acoustic sequence corresponding to
that phoneme, as pronounced by that speaker.
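A minimal sketch of the first option, with the fixed-size vector x injected as an extra input at every time step, could look as follows (the extra weight matrix, here called R, and all other parameters are hypothetical placeholders introduced for this example):

```python
import numpy as np

rng = np.random.RandomState(0)
n_x, n_in, n_hidden, n_out = 3, 5, 8, 4

# Hypothetical parameters; R maps the fixed-size conditioning vector x
# into an "extra bias" added to the state update at every time step.
b, c = np.zeros(n_hidden), np.zeros(n_out)
U = rng.randn(n_hidden, n_in) * 0.1
V = rng.randn(n_out, n_hidden) * 0.1
W = rng.randn(n_hidden, n_hidden) * 0.1
R = rng.randn(n_hidden, n_x) * 0.1

def conditional_rnn(x_fixed, y_inputs):
    """State update s_t = tanh(b + W s_{t-1} + U y_{t-1} + R x)."""
    s = np.zeros(n_hidden)
    outputs = []
    for y_prev in y_inputs:
        s = np.tanh(b + W @ s + U @ y_prev + R @ x_fixed)
        outputs.append(c + V @ s)        # parameters of P(Y_t | ...), e.g. pre-softmax
    return outputs

x_fixed = rng.randn(n_x)                 # e.g., an encoding of phoneme and speaker identity
y_inputs = rng.randn(7, n_in)            # previous elements of the output sequence
print(conditional_rnn(x_fixed, y_inputs)[0])
```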
Consider the case where the input X is a sequence of the same length as the output
sequence Y, and the Y_t's are independent of each other when the past input sequence
is given, i.e., P(Y_t | Y_{t-1}, ..., Y_1, X) = P(Y_t | X_t, X_{t-1}, ..., X_1). We therefore have a
causal relationship between the X_t's and the predictions of the Y_t's, in addition to the
independence of the Y_t's, given X. Under these (pretty strong) assumptions, we can return
to Fig. 12.3 and interpret the t-th output o_t as parameters for a conditional distribution
for Y_t, given X_t, X_{t-1}, ..., X_1.
If we want to remove the conditional independence assumption, we can do so by making
the past Y_t's inputs into the state as well. That situation is illustrated in Fig. 12.8.
Figure 12.8: A conditional generative recurrent neural network mapping a variable-length
sequence X into a distribution over sequences Y of the same length. This
architecture assumes that the Y_t's are causally related to the X_t's, i.e., that we want to
predict the Y_t's only using the past X_t's. Note how the prediction of Y_{t+1} is based on
both the past X's and the past Y's.
12.3 Bidirectional RNNs
All of the recurrent networks we have considered up to now have a “causal” structure,
meaning that the state at time t only captures information from the past, x_1, ..., x_t.
However, in many applications we want to output at time t a prediction regarding
an output which may depend on the whole input sequence. For example, in speech
recognition, the correct interpretation of the current sound as a phoneme may depend
on the next few phonemes because of co-articulation and potentially may even depend
on the next few words because of the linguistic dependencies between nearby words: if
there are two interpretations of the current word that are both acoustically plausible, we
may have to look far into the future (and the past) to disambiguate them. This is also
true of handwriting recognition and many other sequence-to-sequence learning tasks.
Bidirectional recurrent neural networks (or bidirectional RNNs) were invented to
address that need (Schuster and Paliwal, 1997). They have been extremely successful
(Graves, 2012) in applications where that need arises, such as handwriting recognition
(Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and
Schmidhuber, 2005; Graves et al., 2013) and bioinformatics.
Figure 12.9: Computational graph of a typical bidirectional recurrent neural network,
meant to learn to map input sequences x to target sequences y, with loss L_t at each step t.
The h recurrence propagates information forward in time (towards the right) while the
g recurrence propagates information backward in time (towards the left). Thus at each
point t, the output units o_t can benefit from a relevant summary of the past in their h_t
input and from a relevant summary of the future in their g_t input.
As the name suggests, the basic idea behind bidirectional RNNs is to combine a
forward-going RNN and a backward-going RNN. Figure 12.9 illustrates the typical
bidirectional RNN, with h_t standing for the state of the forward-going RNN and g_t standing
for the state of the backward-going RNN. This allows the output units o_t to compute a
representation that depends on both the past and the future but is most sensitive to the
input values around time t, without having to specify a fixed-size window around t (as
one would have to do with a feedforward network, a convolutional network, or a regular
RNN with a fixed-size look-ahead buffer).
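The computation in Figure 12.9 can be sketched as follows (illustrative only; the parameter matrices are random placeholders and the output simply sums the contributions of the two states, whereas a real model would also include output biases and a loss):

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_out, T = 3, 5, 4, 8

# Hypothetical parameters: one set for the forward RNN (h), one for the
# backward RNN (g), and output weights reading from both states.
W_h, U_h = rng.randn(n_hidden, n_hidden) * 0.1, rng.randn(n_hidden, n_in) * 0.1
W_g, U_g = rng.randn(n_hidden, n_hidden) * 0.1, rng.randn(n_hidden, n_in) * 0.1
V_h, V_g = rng.randn(n_out, n_hidden) * 0.1, rng.randn(n_out, n_hidden) * 0.1

def bidirectional_rnn(x):
    T = len(x)
    h = [np.zeros(n_hidden)]             # forward states: h[t+1] depends on x[0..t]
    for t in range(T):
        h.append(np.tanh(W_h @ h[-1] + U_h @ x[t]))
    g = [np.zeros(n_hidden)]             # backward states, built from the end
    for t in reversed(range(T)):
        g.append(np.tanh(W_g @ g[-1] + U_g @ x[t]))
    g = list(reversed(g))                # g[t] now summarizes x[t], ..., x[T-1]
    # Each output sees a summary of the past (h) and a summary of the future (g).
    return [V_h @ h[t + 1] + V_g @ g[t] for t in range(T)]

x = rng.randn(T, n_in)
outputs = bidirectional_rnn(x)
print(len(outputs), outputs[0].shape)
```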
This idea can be naturally extended to 2-dimensional input, such as images, by
having four RNNs, each one going in one of the four directions: up, down, left, right. At
each point (i, j) of a 2-D grid, an output o_{i,j} could then compute a representation that
would capture mostly local information but could also depend on long-range inputs, if
the RNNs are able to learn to carry that information.
Figure 12.10: A recursive network has a computational graph that generalizes that of
the recurrent network from a chain to a tree. In the figure, a variable-size sequence
x_1, x_2, ... can be mapped to a fixed-size representation (the output o), with a fixed
number of parameters (e.g. the weight matrices U, V, W). The figure illustrates a
supervised learning case in which some target y is provided which is associated with the
whole sequence.
12.4 Recursive Neural Networks
Recursive networks represent yet another generalization of recurrent networks, with a
different kind of computational graph, which this time looks like a tree. The typical
computational graph for a recursive network is illustrated in Figure 12.10. Recursive
neural networks were introduced by Pollack (1990) and their potential use for learning
to reason was nicely laid down by Bottou (2011). Recursive networks have been
successfully applied in processing data structures as input to neural nets (Frasconi et al.,
1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013) as well as in
computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a sequence of the
same length N, depth can be drastically reduced from N to O(log N). An open question
is how to best structure the tree, though. One option is to have a tree structure which
does not depend on the data, e.g., a balanced binary tree. Another is to use an external
method, such as a natural language parser (Socher et al., 2011a, 2013). Ideally, one
would like the learner itself to discover and infer the tree structure that is appropriate
for any given input, as suggested in Bottou (2011).
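A minimal sketch of a recursive network reducing a sequence along a roughly balanced binary tree follows (the combination function, its parameters, and the pairing rule are assumptions made for illustration; Socher et al. use richer compositions such as bilinear and tensor forms):

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden = 3, 6

# Hypothetical parameters: V embeds leaves, (W_left, W_right, b) combine children.
V = rng.randn(n_hidden, n_in) * 0.1
W_left = rng.randn(n_hidden, n_hidden) * 0.1
W_right = rng.randn(n_hidden, n_hidden) * 0.1
b = np.zeros(n_hidden)

def combine(left, right):
    """Merge two child representations into a parent representation."""
    return np.tanh(b + W_left @ left + W_right @ right)

def recursive_net(x_sequence):
    """Reduce a variable-length sequence along a (roughly) balanced binary tree.
    Depth is O(log N) instead of the O(N) of a chain-structured recurrent net."""
    nodes = [np.tanh(V @ x) for x in x_sequence]   # leaf representations
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes) - 1, 2):
            paired.append(combine(nodes[i], nodes[i + 1]))
        if len(nodes) % 2 == 1:                    # odd node carried to next level
            paired.append(nodes[-1])
        nodes = paired
    return nodes[0]                                # fixed-size representation o

print(recursive_net(rng.randn(7, n_in)))
```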
Many variants of the recursive net idea are possible. For example, in Frasconi et al.
(1997, 1998), the data is associated with a tree structure in the first place, and inputs
and/or targets with each node of the tree. The computation performed by each node
does not have to be the traditional artificial neuron computation (affine transformation
of all inputs followed by a monotone non-linearity). For example, Socher et al. (2013)
propose using tensor operations and bilinear forms, which have previously been found
useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012)
when the concepts are represented by continuous vectors (embeddings).
12.5 Auto-Regressive Networks
One of the basic ideas behind recurrent networks is that of directed graphical models with
a twist: we decompose a probability distribution as a product of conditionals without
explicitly cutting any arc in the graphical model, but instead reducing complexity by
parametrizing the transition probability in a recursive way that requires a fixed (and
not exponential) number of parameters, thanks to a form of parameter sharing (see
Section 7.4 for an introduction to the concept). Instead of reducing P(x_t | x_{t-1}, ..., x_1)
to something like P(x_t | x_{t-1}, ..., x_{t-k}) (assuming the k previous ones as the parents), we
keep the full dependency but we parametrize the conditional efficiently in a way that does
not grow with t, exploiting parameter sharing. When the above conditional probability
distribution is in some sense stationary, i.e., the relation between the past and the next
observation does not depend on t, only on the values of the past observations, then this
form of parameter sharing makes a lot of sense, and for recurrent nets it allows us to use
the same model for sequences of different lengths.
Auto-regressive networks are similar to recurrent networks in the sense that we also
decompose a joint probability over the observed variables as a product of conditionals of
the form P(x_t | x_{t-1}, ..., x_1), but we drop the form of parameter sharing that makes these
conditionals all share the same parametrization. This makes sense when the variables
are not elements of a translation-invariant sequence, but instead form an arbitrary tuple
without any particular ordering that would correspond to a translation-invariant form of
relationship between variables at position k and variables at position k′. In some forms
of auto-regressive networks, such as NADE (Larochelle and Murray, 2011), described in
Section 12.5.3 below, we can re-introduce a form of parameter sharing that is different
from the one found in recurrent networks, but that brings both a statistical advantage
(fewer parameters) and a computational advantage (less computation). Although we drop
the sharing over time, as we see below in Section 12.5.2, using the deep learning concept
of re-use of features, we can share features that have been computed for predicting x_{t-k}
with the sub-network that predicts x_t.
Figure 12.11: An auto-regressive network predicts the i-th variable from the i - 1 previous
ones. Left: corresponding graphical model (which is the same as that of a recurrent
network). Right: corresponding computational graph, in the case of the logistic auto-regressive
network, where each prediction has the form of a logistic regression, i.e., with
i free parameters (for the i - 1 weights associated with the i - 1 inputs, and an offset
parameter).
12.5.1 Logistic Auto-Regressive Networks
Let us first consider the simplest auto-regressive network, without hidden units, and
hence no sharing at all. Each P(x_t | x_{t-1}, ..., x_1) is parametrized as a linear model, e.g., a
logistic regression. This model was introduced by Frey (1998) and has O(T^2) parameters
when there are T variables to be modeled. It is illustrated in Figure 12.11, showing both
the graphical model (left) and the computational graph (right).
A clear disadvantage of the logistic auto-regressive network is that one cannot easily
increase its capacity in order to capture more complex data distributions. It defines
a parametric family of fixed capacity, like the linear regression, the logistic regression,
or the Gaussian distribution. In fact, if the variables are continuous, one gets a linear
auto-regressive model, which is thus another way to formulate a Gaussian distribution,
i.e., only capturing pairwise interactions between the observed variables.
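A minimal sketch of a logistic auto-regressive model over T binary variables follows (random parameters for illustration; in practice they would be fit by maximum likelihood), where each x_t is predicted by a logistic regression on x_1, ..., x_{t-1}:

```python
import numpy as np

rng = np.random.RandomState(0)
T = 4                                    # number of binary variables

# One logistic regression per variable: t-1 weights plus an offset for x_t,
# which gives O(T^2) parameters in total.
weights = [rng.randn(t) * 0.1 for t in range(T)]   # weights[t] has t entries
offsets = rng.randn(T) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x):
    """log P(x) = sum_t log P(x_t | x_{t-1}, ..., x_1), each term a logistic regression."""
    total = 0.0
    for t in range(T):
        p_t = sigmoid(offsets[t] + weights[t] @ x[:t])   # P(x_t = 1 | x_{<t})
        total += np.log(p_t if x[t] == 1 else 1.0 - p_t)
    return total

x = np.array([1, 0, 1, 1])
print(log_prob(x))
```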
Figure 12.12: A neural auto-regressive network predicts the i-th variable x_i from the
i - 1 previous ones, but is parametrized so that features (groups of hidden units denoted
h_i) that are functions of x_1, ..., x_i can be re-used in predicting all of the subsequent
variables x_{i+1}, x_{i+2}, ....
12.5.2 Neural Auto-Regressive Networks
Neural auto-regressive networks have the same graphical model as logistic auto-regressive
networks (Figure 12.11, left) but a different parametrization that is at once more powerful
(allowing one to extend the capacity as needed and approximate any joint distribution)
and can improve generalization by introducing a parameter sharing and feature sharing
principle common to deep learning in general. The first paper on neural auto-regressive
networks, by Bengio and Bengio (2000b) (see also Bengio and Bengio (2000a) for the more
extensive journal version), was motivated by the objective of avoiding the curse of
dimensionality arising in traditional non-parametric graphical models sharing the same
structure as Figure 12.11 (left). In the latter, each conditional distribution is represented
by a table of probabilities, with one entry and one parameter for each possible
configuration of the variables involved. By using a neural network instead, two advantages
are obtained:

1. The parametrization of each P(x_t | x_{t-1}, ..., x_1) by a neural network with (t - 1) × K
inputs and K outputs (if the variables take K values) allows one to estimate the
conditional probability without requiring an exponential number of parameters
(and examples), yet still allows capturing high-order dependencies between the
random variables.
2. Instead of having a different neural network for the prediction of each x_t, the left-to-right
connectivity illustrated in Figure 12.12 allows one to merge all the neural
networks into one. Equivalently, it means that the hidden layer features computed
for predicting x_t can be re-used for predicting x_{t+k} (k > 0). The hidden units are
thus organized in groups that have the particularity that all the units in the t-th
group only depend on the input values x_1, ..., x_t. In fact the parameters used to
compute these hidden units are jointly optimized to help the prediction of all the
variables x_t, x_{t+1}, x_{t+2}, .... This is an instance of the re-use principle that makes
multi-task learning and transfer learning successful with neural networks and deep
learning in general (see Sections 7.11 and 10.5).
Each P(x_t | x_{t-1}, ..., x_1) can represent a conditional distribution by having outputs of
the neural network predict parameters of the conditional distribution of x_t, as discussed
in Section 6.2.2. Although the original neural auto-regressive networks were initially
evaluated in the context of purely discrete multivariate data (e.g., with a sigmoid output
for the Bernoulli case, or a softmax output for the Multinoulli case), it is straightforward
to extend such models to continuous variables or joint distributions involving both discrete
and continuous variables, as for example with RNADE introduced below (Uria et al., 2013).
Figure 12.13: NADE (Neural Auto-regressive Density Estimator) is a neural auto-regressive
network, i.e., the hidden units are organized in groups h_j so that only the
inputs x_1, ..., x_i participate in computing h_i and predicting P(x_j | x_{j-1}, ..., x_1), for j > i.
The particularity of NADE is the use of a particular weight sharing pattern: the same
W_{jki} = W_{ki} is shared (same color and line pattern in the figure) for all the weights
outgoing from x_i to the k-th unit of any group j ≥ i. The vector (W_{1i}, W_{2i}, ...) is denoted
W_i here.
12.5.3 NADE
A very successful recent form of neural auto-regressive network was proposed by Larochelle
and Murray (2011). The architecture is basically the same as for the original neural
auto-regressive network of Bengio and Bengio (2000b) except for the introduction of a
weight-sharing scheme, as illustrated in Figure 12.13. The weights W_{jki} from the i-th
input x_i to the k-th element of the j-th group of hidden units h_{jk} (j ≥ i) are shared, i.e.,

    W_{jki} = W_{ki}

with (W_{1i}, W_{2i}, ...) denoted W_i in Figure 12.13.
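The weight sharing can be written directly as code. The following sketch (illustrative only; binary variables and randomly initialized parameters) computes the NADE conditionals by accumulating the shared contributions W_i x_i one variable at a time, which is what makes the whole forward pass cost O(NH):

```python
import numpy as np

rng = np.random.RandomState(0)
N, H = 6, 10                             # number of variables, hidden units per group

# Shared parameters: W[:, i] is the vector W_i of the figure, used by every
# group h_j with j >= i; V, b, c are output weights and biases.
W = rng.randn(H, N) * 0.1
V = rng.randn(N, H) * 0.1
b = np.zeros(N)                          # output biases
c = np.zeros(H)                          # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_prob(x):
    """log P(x) for a binary vector x, computed in O(NH) multiply-adds."""
    a = c.copy()                         # running pre-activation of the hidden group
    log_p = 0.0
    for i in range(N):
        h_i = sigmoid(a)                 # hidden group h_i, a function of x_1..x_{i-1}
        p_i = sigmoid(b[i] + V[i] @ h_i) # P(x_i = 1 | x_{i-1}, ..., x_1)
        log_p += np.log(p_i if x[i] == 1 else 1.0 - p_i)
        a += W[:, i] * x[i]              # re-use of the shared weights W_i
    return log_p

x = rng.randint(2, size=N)
print(nade_log_prob(x))
```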
This particular sharing pattern is motivated in Larochelle and Murray (2011) by the
computations performed in the mean-field inference of an RBM, when only the first i
inputs are given and one tries to infer the subsequent ones. This mean-field inference
corresponds to running a recurrent network with shared weights and the first step of
that inference is the same as in NADE. The only difference is that with the proposed
NADE, the output weights are not forced to be simply transpose values of the input
weights (they are not tied). One could imagine actually extending this procedure to not
just one time step of the mean-field recurrent inference but to k steps, as in Raiko et al.
(2014).
Although the neural auto-regressive networks and NADE were originally proposed to
deal with discrete distributions, they can in principle be generalized to continuous ones
by replacing the conditional discrete probability distributions (for P(x_j | x_{j-1}, ..., x_1))
by continuous ones, following the general practice of predicting continuous random variables
with neural networks (see Section 6.2.2) using the log-likelihood framework. A fairly
generic way of parametrizing a continuous density is as a Gaussian mixture, and this
avenue has been successfully evaluated for the neural auto-regressive architecture with
RNADE (Uria et al., 2013). One interesting point to note is that stochastic gradient
descent can be numerically ill-behaved due to the interactions between the conditional
means and the conditional variances (especially when the variances become small). Uria
et al. (2013) used a heuristic to rescale the gradient on the component means by
the associated standard deviation, which seems to have helped in optimizing RNADE.
Another very interesting extension of the neural auto-regressive architectures gets
rid of the need to choose an arbitrary order for the observed variables (Murray and
Larochelle, 2014). The idea is to train the network to be able to cope with any order
by randomly sampling orders and providing to the hidden units the information specifying
which of the inputs are observed (on the right-hand, conditioning side of the bar) and which
are to be predicted and are thus considered missing (on the left-hand side of the conditioning
bar). This is nice because it allows one to use a trained auto-regressive network to perform
any inference (i.e., predict or sample from the probability distribution over any subset of
variables given any subset) extremely efficiently. Finally, since many orders are possible,
the joint probability of some set of variables can be computed in many ways (N! for N
variables), and this can be exploited to obtain a more robust probability estimation and
better log-likelihood, by simply averaging the log-probabilities predicted by different
randomly chosen orders. In the same paper, the authors propose to consider deep
versions of the architecture, but unfortunately that immediately makes computation as
expensive as in the original neural auto-regressive network (Bengio and Bengio,
2000b). The first layer and the output layer can still be computed in O(NH) multiply-adds,
as in the regular NADE, where H is the number of hidden units (the size of the
groups h_i in Figures 12.13 and 12.12), whereas it is O(N^2 H) in Bengio and Bengio
(2000b). However, for the other hidden layers, the computation is O(N^2 H^2) if every
“previous” group at layer k participates in predicting the “next” group at layer k + 1,
assuming N groups of H hidden units at each layer. Making the i-th group at layer k + 1
only depend on the i-th group at layer k, as in Murray and Larochelle (2014), reduces the
computation to O(N H^2), which is still H times worse than the regular NADE.
12.6 Facing the Challenge of Long-Term Dependencies
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics
Barely Contractive
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph
12.6.3 The Long-Short-Term-Memory architecture
12.6.4 Better Optimization
12.6.5 Clipping Gradients
Refer to clipping, Section ??.
12.6.6 Regularizing to Encourage Information Flow
12.6.7 Organizing the State at Multiple Time Scales
12.7 Handling temporal dependencies with n-grams, HMMs,
CRFs and other graphical models
12.7.1 N-grams
12.7.2 HMMs
12.7.3 CRFs
See Section 20.4 for combining recurrent and other deep learners with generative models
such as CRFs, GSNs or RBMs.
12.8 Combining Neural Networks and Search
12.8.1 Joint Training of Neural Networks and Sequential Probabilistic
Models
12.8.2 MAP and Structured Output Models
12.8.3 Back-prop through Search
12.9 Historical Notes