
history of inputs.
One way to think about these reservoir computing recurrent networks is that
they are similar to kernel machines: they map an arbitrary-length sequence (the
history of inputs up to time $t$) into a fixed-length vector (the recurrent state
$s_t$), to which a linear predictor (typically a linear regression) can be applied to
solve the problem of interest. The training criterion is therefore convex in the
parameters (which are just the output weights), and in the linear regression case
it can even be solved online, using online updates for linear regression (Jaeger,
2003).
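To make this view concrete, here is a minimal sketch of such a fixed-reservoir setup in NumPy (an illustration under stated assumptions, not code from the text; the sizes, the toy sine-prediction task, and names such as esn_states are all hypothetical): the input and recurrent weights are drawn at random, rescaled, and frozen, and only the linear readout is fit by least squares on the collected states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_inputs, n_reservoir = 1, 200

# Fixed random input and recurrent weights: these are never trained.
W_in = rng.uniform(-0.1, 0.1, size=(n_reservoir, n_inputs))
W = rng.normal(size=(n_reservoir, n_reservoir))

# Rescale W so its spectral radius (largest eigenvalue magnitude) is slightly
# below 1, keeping the state-to-state map near the edge of stability.
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))

def esn_states(inputs):
    """Run the fixed reservoir over an input sequence; return the state s_t at every step."""
    s = np.zeros(n_reservoir)
    states = []
    for x in inputs:
        s = np.tanh(W_in @ x + W @ s)   # bounded non-linearity keeps states bounded
        states.append(s.copy())
    return np.array(states)

# Toy task: one-step-ahead prediction of a noisy sine wave.
t = np.linspace(0, 40, 2000)
series = np.sin(t) + 0.05 * rng.normal(size=t.shape)
X, Y = series[:-1].reshape(-1, 1), series[1:]

# Only the output weights are learned, by ordinary least squares,
# so the training criterion is convex in the trainable parameters.
S = esn_states(X)
W_out, *_ = np.linalg.lstsq(S, Y, rcond=None)
print("train MSE:", np.mean((S @ W_out - Y) ** 2))
```

In practice, reservoir computing implementations typically also discard an initial "washout" of states and regularize the readout (e.g., with ridge regression); those details are omitted here for brevity.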
The important question is therefore: how do we set the input and recurrent
weights so that a rich set of histories can be represented in the recurrent neural
network state? The answer proposed in the reservoir computing literature is to
make the dynamical system associated with the recurrent net nearly on the
edge of stability, i.e., more precisely with the leading singular value of the
Jacobian of the state-to-state transition function close to 1. As alluded to
in section 8.2.5, an important characteristic of a recurrent network is the eigenvalue
spectrum of the Jacobians $J^{(t)} = \frac{\partial s_t}{\partial s_{t-1}}$, and in particular the spectral radius
of $J^{(t)}$, i.e., the largest absolute value of its eigenvalues. If the spectral radius is
greater than 1, the dynamics can diverge, meaning that small differences in the
state value at time $t$ can yield a very large difference at a later time $T$ in the
sequence. To see this, consider the simpler case
where the Jacobian matrix $J$ does not change with $t$. If a change $\Delta s$ in the state
at time $t$ is aligned with an eigenvector $v$ of $J$ with eigenvalue $\lambda > 1$, then the small
change $\Delta s$ becomes $\lambda \Delta s$ after one time step and $\lambda^N \Delta s$ after $N$ time steps. Since
$\lambda > 1$, the change grows exponentially with $N$. More generally, $\Delta s$ can be taken to be the
component of the state change vector in the direction of $v$. With a non-linear map,
the Jacobian keeps changing and the dynamics are more complicated, but what
remains true is that a small initial variation can turn into a large variation after a few
steps. Note that, in general, the recurrent dynamics are bounded (for example,
if the hidden units use a bounded non-linearity such as the hyperbolic tangent),
so that the change after $N$ steps must also be bounded. When instead the largest
eigenvalue satisfies $\lambda < 1$, we say that the map from $t$ to $t + 1$ is contractive: a small
change gets contracted, becoming smaller after each time step. This necessarily
makes the network forget information about the distant past, but it also
makes its dynamics stable and easier to use.
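As a rough numerical check of this argument (an illustrative experiment, not from the text; all names are hypothetical), the following NumPy snippet rescales a random matrix to a chosen spectral radius and tracks the norm of a small perturbation under repeated application of the map, both for the purely linear case and with the bounded tanh non-linearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 50, 30

def rescale(J, rho):
    """Rescale J so that its spectral radius (largest eigenvalue magnitude) equals rho."""
    return J * (rho / np.max(np.abs(np.linalg.eigvals(J))))

def linear_perturbation(rho):
    """Norm of a small perturbation after repeatedly applying the linear map s -> J s."""
    J = rescale(rng.normal(size=(n, n)), rho)
    delta = 1e-6 * rng.normal(size=n)      # small initial difference in the state
    for _ in range(steps):
        delta = J @ delta                  # leading component is scaled by lambda each step
    return np.linalg.norm(delta)

def tanh_perturbation(rho):
    """Same experiment with s -> tanh(J s): the bounded non-linearity keeps the difference bounded."""
    J = rescale(rng.normal(size=(n, n)), rho)
    s1 = rng.normal(size=n)
    s2 = s1 + 1e-6 * rng.normal(size=n)
    for _ in range(steps):
        s1, s2 = np.tanh(J @ s1), np.tanh(J @ s2)
    return np.linalg.norm(s2 - s1)

print("linear, rho = 1.2:", linear_perturbation(1.2))  # grows roughly like 1.2^30
print("linear, rho = 0.9:", linear_perturbation(0.9))  # shrinks roughly like 0.9^30
print("tanh,   rho = 1.2:", tanh_perturbation(1.2))    # can grow at first but stays bounded
```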
The approach proposed for setting the weights of reservoir computing machines
is to make the Jacobians slightly contractive. This is achieved by making the
spectral radius of the recurrent weight matrix large but slightly less than 1. In
practice, however, good results are often found with a spectral radius of the recurrent
weight matrix slightly larger than 1, e.g., 1.2 (Sutskever, 2012; Sutskever
et al., 2013). Keep in mind that with hyperbolic tangent units, the maximum