
One way to think about these reservoir computing recurrent networks is that
they are similar to kernel machines: they map an arbitrary-length sequence (the
history of inputs up to time $t$) into a fixed-length vector (the recurrent state
$s_t$), on which a linear predictor (typically a linear regression) can be applied
to solve the problem of interest. The training criterion is therefore convex in the
parameters (which are just the output weights) and can actually be solved online
in the linear regression case, using online updates for linear regression (Jaeger,
2003).
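To make this concrete, here is a minimal sketch of the idea in NumPy. The network size, weight scales, toy task, and ridge penalty below are arbitrary illustrative choices, not values taken from the reservoir computing literature:

```python
import numpy as np

rng = np.random.RandomState(0)
n_hidden, n_input = 200, 1

# Fixed, randomly drawn input and recurrent weights; only the readout is trained.
U = rng.uniform(-0.1, 0.1, size=(n_hidden, n_input))
W = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_hidden, n_hidden))

def reservoir_states(x_seq):
    """Map an input sequence x_1..x_T to the recurrent states s_1..s_T."""
    s = np.zeros(n_hidden)
    states = []
    for x in x_seq:
        s = np.tanh(W @ s + U @ x)
        states.append(s)
    return np.array(states)

# Toy task: predict the next value of a sine wave.
x_seq = np.sin(np.linspace(0, 8 * np.pi, 500)).reshape(-1, 1)
S = reservoir_states(x_seq)[:-1]       # states s_t used as features
y = x_seq[1:, 0]                       # next-step targets

# Convex readout training: ridge regression on the output weights only.
ridge = 1e-6
V = np.linalg.solve(S.T @ S + ridge * np.eye(n_hidden), S.T @ y)
predictions = S @ V                    # one prediction per time step
```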
The important question is therefore: how do we set the input and recurrent
weights so that a rich set of histories can be represented in the recurrent neural
network state? The answer proposed in the reservoir computing literature is to
make the dynamical system associated with the recurrent net nearly on the edge
of chaos, i.e., to keep its dynamics close to the boundary between the stable
(contractive) regime and the chaotic regime. As alluded to in Section 8.2.5, an
important characteristic of a recurrent network is the eigenvalue spectrum of the
Jacobians $J_t = \frac{\partial s_t}{\partial s_{t-1}}$, and in particular the
spectral radius of $J_t$, i.e., the largest absolute value of its eigenvalues.
If it is greater than 1, the dynamics can diverge, meaning that small differences in
the state value at time $t$ can yield a very large difference at a later time $T$ in the
sequence. To see this, consider the simpler case where the Jacobian matrix $J$ does
not change with $t$. If a change $\Delta s$ in the state at time $t$ is aligned with an
eigenvector $v$ of $J$ with eigenvalue $\lambda$, then the small change $\Delta s$
becomes $\lambda \Delta s$ after one time step, and $\lambda^N \Delta s$ after $N$
time steps. If $|\lambda| > 1$, this makes the change exponentially large. With a
non-linear map, the Jacobian keeps changing and the dynamics are more complicated,
but what remains true is that a small initial variation can turn into a large
variation after a number of steps. In general, the recurrent dynamics are bounded
(for example, if the hidden units use a bounded non-linearity such as the hyperbolic
tangent), so the change after $N$ steps must also be bounded. When instead the
spectral radius is less than 1, we say that the map from $t$ to $t + 1$ is
contractive: a small change gets contracted, becoming smaller after each time
step. This necessarily makes the network forget information about the long-term
past, but it also makes its dynamics stable and easier to use.
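As a concrete illustration, for a recurrent update of the form $s_t = \tanh(W s_{t-1} + U x_t)$ (biases omitted; this specific parametrization is an assumption made for the example) the Jacobian is $J_t = \mathrm{diag}(1 - s_t^2)\, W$, and both its spectral radius and the growth or shrinkage of a perturbation can be checked numerically. The sketch below uses arbitrary random weights and sizes, purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100

# Random recurrent and input weights (sizes and scales are arbitrary here).
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
U = rng.normal(0.0, 0.1, size=(n, n))

s_prev = rng.uniform(-1, 1, size=n)        # some previous state s_{t-1}
x = rng.normal(size=n)                     # current input x_t
s = np.tanh(W @ s_prev + U @ x)            # current state s_t

# Jacobian of the state transition: J_t = diag(1 - s_t^2) W
J = np.diag(1.0 - s ** 2) @ W
print("spectral radius of J_t:", np.max(np.abs(np.linalg.eigvals(J))))

# With a fixed linear map, a perturbation aligned with an eigenvector of
# eigenvalue lambda is multiplied by lambda at every step: it grows when
# |lambda| > 1 and shrinks when |lambda| < 1.
J_fixed = np.diag([1.2, 0.8])
delta = np.array([1e-3, 1e-3])
for N in (1, 10, 50):
    print(N, np.linalg.matrix_power(J_fixed, N) @ delta)
```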
The strategy proposed for setting the weights of reservoir computing machines
is to make the Jacobians slightly contractive. This is achieved by making the
spectral radius of the weight matrix large but slightly less than 1. However,
in practice, good results are often obtained with a spectral radius of the recurrent
weight matrix slightly larger than 1, e.g., 1.2 (Sutskever, 2012; Sutskever
et al., 2013). Keep in mind that with hyperbolic tangent units, the maximum
derivative is 1, so that in order to guarantee a Jacobian spectral radius less than
1, the weight matrix should have spectral radius less than 1 as well. However,
most derivatives of the hyperbolic tangent will be less than 1, which may explain
Sutskever's empirical observation that a slightly larger spectral radius works well.
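One common way to realize this is to draw a random recurrent weight matrix and rescale it so that its spectral radius matches a chosen target; the sketch below uses the value 1.2 mentioned above, with an arbitrary matrix size chosen only for illustration:

```python
import numpy as np

def scale_to_spectral_radius(W, target):
    """Rescale W so that its largest absolute eigenvalue equals `target`."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (target / radius)

rng = np.random.RandomState(0)
W = scale_to_spectral_radius(rng.normal(size=(300, 300)), target=1.2)
print(np.max(np.abs(np.linalg.eigvals(W))))   # approximately 1.2
```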