
units form a reservoir of temporal features which may capture different aspects of the
history of inputs.
One way to think about these reservoir computing recurrent networks is that they are
similar to kernel machines: they map an arbitrary-length sequence (the history of inputs
up to time $t$) into a fixed-length vector (the recurrent state $s_t$), on which a linear
predictor (typically a linear regression) can be applied to solve the problem of interest.
The training criterion is therefore convex in the parameters (which are just the output
weights), and it can actually be solved online in the linear regression case, using online
updates for linear regression (Jaeger, 2003).
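To make this concrete, here is a minimal sketch of such a machine in NumPy. The reservoir size, the tanh state update, the spectral radius of 0.9, the toy next-step prediction task and the batch ridge-regression readout (rather than the online updates of Jaeger (2003)) are illustrative choices, not the exact construction discussed here; only the output weights W_out are trained.

import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_reservoir = 1, 200          # illustrative sizes

# Fixed random input and recurrent weights; only the readout is trained.
W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_inputs))
W = rng.normal(size=(n_reservoir, n_reservoir))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # rescale to spectral radius 0.9 (arbitrary choice)

def run_reservoir(inputs):
    """Map an input sequence into the sequence of reservoir states s_t."""
    states = np.zeros((len(inputs), n_reservoir))
    s = np.zeros(n_reservoir)
    for t, x in enumerate(inputs):
        s = np.tanh(W_in @ np.atleast_1d(x) + W @ s)
        states[t] = s
    return states

# Toy task: predict the next value of a sine wave from the reservoir state.
u = np.sin(0.2 * np.arange(2000))
X = run_reservoir(u[:-1])   # one fixed-length state vector per time step
y = u[1:]                   # target: the next input value

# Convex problem: linear (ridge) regression on the states, solved in closed form.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_reservoir), X.T @ y)
print("train MSE:", np.mean((X @ W_out - y) ** 2))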
The important question is therefore: how do we set the input and recurrent weights
so that a rich set of histories can be represented in the recurrent neural network state?
The answer proposed in the reservoir computing literature is to make the dynamical
system associated with the recurrent net operate near the edge of chaos, i.e., near the
boundary between contractive dynamics, which are stable, and chaotic dynamics, in which
small perturbations are amplified. As alluded to in section 8.2.5, an important
characteristic of a recurrent network
is the eigenvalue spectrum of the Jacobians $J_t = \frac{\partial s_t}{\partial s_{t-1}}$,
and in particular the spectral radius of $J_t$, i.e., the largest absolute value among its
eigenvalues. If it is greater than 1, the dynamics can diverge, meaning that small
differences in the state value at time $t$ can yield a very large difference at a later
time $T$ in the sequence. To see this, consider the simpler case where the Jacobian
matrix $J$ does not change with $t$. If a change $\Delta s$ in the state at time $t$ is
aligned with an eigenvector $v$ of $J$ with eigenvalue $\lambda > 1$, then the small
change $\Delta s$ becomes $\lambda \Delta s$ after one time step and $\lambda^N \Delta s$
after $N$ time steps, so the change grows exponentially with $N$. With a non-linear map,
the Jacobian keeps changing and the dynamics are more complicated, but what remains true
is that a small initial variation can turn into a large variation after a number of steps.
In general, the recurrent dynamics are bounded (for example, if the hidden units use a
bounded non-linearity such as the hyperbolic tangent), so the change after $N$ steps must
also be bounded. Instead, when the spectral radius is less than 1, we say that the map
from $t$ to $t+1$ is contractive: a small change gets contracted, becoming smaller after
each time step. This necessarily makes the network forget information about the long-term
past, but it also makes its dynamics stable and easier to use.
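The contrast between these two regimes is easy to observe numerically. The short sketch below is purely illustrative: the tanh update $s_{t+1} = \tanh(W s_t)$, the matrix size and the spectral radii 0.9 and 1.5 are arbitrary choices, not taken from the references above. It iterates two nearby states and measures how far apart they end up.

import numpy as np

rng = np.random.default_rng(1)
n = 50   # illustrative state dimension

def final_gap(spectral_radius, n_steps=50):
    # Random recurrent weights rescaled to the requested spectral radius.
    W = rng.normal(size=(n, n))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    # Two nearby initial states iterated under s_{t+1} = tanh(W s_t).
    s = rng.normal(size=n)
    s_pert = s + 1e-6 * rng.normal(size=n)
    for _ in range(n_steps):
        s = np.tanh(W @ s)
        s_pert = np.tanh(W @ s_pert)
    return np.linalg.norm(s - s_pert)

print("radius 0.9:", final_gap(0.9))  # contractive: the gap typically shrinks toward 0
print("radius 1.5:", final_gap(1.5))  # expansive: the gap grows, yet stays bounded by tanh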
What has been proposed to set the weights of reservoir computing machines is to
make the Jacobians slightly contractive. This is achieved by making the spectral radius
of the recurrent weight matrix large but slightly less than 1. However, in practice, good
results are often found with a spectral radius of the recurrent weight matrix slightly
larger than 1, e.g., 1.2 (Sutskever, 2012; Sutskever et al., 2013). Keep in mind that with
hyperbolic tangent units the maximum derivative is 1, so that in order to guarantee a
Jacobian spectral radius less than 1, the weight matrix should have spectral radius less
than 1 as well. However, most derivatives of the hyperbolic tangent will be less than 1,
which may explain Sutskever’s empirical observation.
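As a small illustration of this recipe (the function name and the use of NumPy's dense eigenvalue routine are our own choices, not a prescribed implementation), a random recurrent weight matrix can be rescaled to any target spectral radius, e.g., the value of 1.2 mentioned above, by dividing by its current spectral radius:

import numpy as np

def scale_to_spectral_radius(W, target):
    """Rescale W so that its spectral radius (largest |eigenvalue|) equals `target`."""
    radius = max(abs(np.linalg.eigvals(W)))
    return W * (target / radius)

rng = np.random.default_rng(0)
W = scale_to_spectral_radius(rng.normal(size=(300, 300)), target=1.2)
print(max(abs(np.linalg.eigvals(W))))  # approximately 1.2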
More recently, it has been shown that the techniques used to set the weights in
ESNs could be used to initialize the weights in a fully trainable recurrent network (e.g.,
trained using back-propagation through time), helping to learn long-term dependencies
(Sutskever, 2012; Sutskever et al., 2013). In addition to setting the spectral radius
to 1.2, Sutskever sets the recurrent weight matrix to be initially sparse, with only 15