the Markov chains at each gradient step with their states from the previous gradient
step. This approach was first discovered under the name stochastic maximum likelihood
(SML) in the applied mathematics and statistics community (Younes, 1998) and later
independently rediscovered under the name persistent contrastive divergence (PCD, or
PCD-k to indicate the use of k Gibbs steps per update) in the deep learning community (Tieleman, 2008). See Algorithm 15.3. The basic idea of this approach is that, so
long as the steps taken by the stochastic gradient algorithm are small, the model
from the previous step will be similar to the model from the current step. It follows
that the samples from the previous model’s distribution will be very close to being fair
samples from the current model’s distribution, so a Markov chain initialized with these
samples will not require much time to mix.
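The update can be made concrete with a short sketch. Below is a minimal, illustrative PCD-k implementation for a binary RBM in NumPy; the function and variable names (pcd_update, gibbs_step, W, b, c, n_chains) and the hyperparameter values are assumptions chosen for illustration, not part of the algorithm as stated here.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b, c):
        # One full Gibbs sweep: sample hidden units given visible,
        # then visible units given hidden.
        p_h = sigmoid(v @ W + c)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        return v, h

    def pcd_update(batch, chains, W, b, c, k=1, lr=1e-2):
        # Positive phase: expected hidden activations under the data.
        ph = sigmoid(batch @ W + c)
        # Negative phase: run k Gibbs steps starting from the persistent
        # chain states carried over from the previous gradient step
        # (the core idea of PCD/SML).
        v = chains
        for _ in range(k):
            v, _ = gibbs_step(v, W, b, c)
        nh = sigmoid(v @ W + c)
        # Stochastic gradient ascent on the log-likelihood.
        W += lr * (batch.T @ ph / len(batch) - v.T @ nh / len(v))
        b += lr * (batch.mean(axis=0) - v.mean(axis=0))
        c += lr * (ph.mean(axis=0) - nh.mean(axis=0))
        return v  # updated chain states persist to the next gradient step

    # Chains are created once and threaded through every update. Because the
    # hidden states are resampled from v at each step, both layers are
    # effectively initialized from the previous model's samples.
    n_visible, n_hidden, n_chains = 784, 500, 100
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    chains = (rng.random((n_chains, n_visible)) < 0.5).astype(float)
    # for batch in minibatches: chains = pcd_update(batch, chains, W, b, c, k=1)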
Because each Markov chain is continually updated throughout the learning process,
rather than restarted at each gradient step, the chains are free to wander far enough to
find all of the model’s modes. SML is thus considerably more resistant to forming models
with spurious modes than CD is. Moreover, because it is possible to store the state of
all of the sampled variables, whether visible or latent, SML provides an initialization
point for both the hidden and visible units. CD is only able to provide an initialization
for the visible units, and therefore requires burn-in for deep models. SML is able to
train deep models efficiently. Marlin et al. (2010) compared SML to many of the other
criteria presented in this chapter. They found that SML results in the best test set log
likelihood for an RBM, and if the RBM’s hidden units are used as features for an SVM
classifier, SML results in the best classification accuracy.
SML is vulnerable to becoming inaccurate if k is too small or the learning rate is
too large; in other words, if the stochastic gradient algorithm can move the model faster than the
Markov chain can mix between steps. There is no known way to test formally whether
the chain is successfully mixing between steps. Subjectively, if the learning rate is too
high for the number of Gibbs steps, the human operator will be able to observe that
there is much more variance in the negative phase samples across gradient steps
than across different Markov chains. For example, a model trained on MNIST might
sample exclusively 7s on one step. The learning process will then push down strongly on
the mode corresponding to 7s, and the model might sample exclusively 9s on the next
step.
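One informal way to quantify this symptom, sketched below as a rough heuristic rather than a formal test: log the negative-phase states during training into an array of shape (steps, chains, units), a hypothetical record not described in the text, and compare the variance of the per-step mean across gradient steps to the variance across chains within a step.

    import numpy as np

    def mixing_ratio(neg_samples):
        # neg_samples: array of shape (n_steps, n_chains, n_visible), a
        # hypothetical log of negative-phase states collected during training.
        step_means = neg_samples.mean(axis=1)          # average state at each step
        across_steps = step_means.var(axis=0).mean()   # variance over gradient steps
        within_step = neg_samples.var(axis=1).mean()   # variance across chains
        # A ratio far above what well-mixed chains would produce suggests the
        # chains collectively track one mode per step (e.g. all 7s, then all 9s).
        return across_steps / within_step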
Care must be taken when evaluating the samples from a model trained with SML. It
is necessary to draw the samples starting from a fresh Markov chain initialized from
a random starting point after the model is done training. The samples present in
the persistent negative chains used for training have been influenced by several recent
versions of the model, and thus can make the model appear to have greater capacity
than it actually does.
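As a minimal sketch of this evaluation procedure, reusing the gibbs_step sampler from the earlier snippet: discard the persistent chains after training and run fresh chains from random states with a long burn-in (the length of 1000 steps below is an arbitrary illustrative choice).

    def sample_for_evaluation(W, b, c, n_samples=100, n_burnin=1000):
        # Fresh chains, initialized at random and never touched during training.
        v = (rng.random((n_samples, W.shape[0])) < 0.5).astype(float)
        for _ in range(n_burnin):
            v, _ = gibbs_step(v, W, b, c)
        return v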
Berglund and Raiko (2013) performed experiments to examine the bias and variance
in the estimate of the gradient provided by CD and SML. CD proves to have lower variance
than the estimator based on exact sampling. SML has higher variance. The cause of
CD’s low variance is its use of the same training points in both the positive and negative
phase. If the negative phase is initialized from different training points, the variance
rises above that of the estimator based on exact sampling.