
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
divergence (PCD, or PCD-k to indicate the use of k Gibbs steps per update) in
the deep learning community (Tieleman, 2008). See Algorithm 18.3. The basic
idea of this approach is that, so long as the steps taken by the stochastic gradient
algorithm are small, the model from the previous step will be similar
to the model from the current step. It follows that the samples from the previous
model's distribution will be very close to being fair samples from the current
model’s distribution, so a Markov chain initialized with these samples will not
require much time to mix.
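As a rough sketch of this scheme, a single PCD-k update for a binary RBM might look as follows. The RBM parameterization (weights `W`, visible bias `b`, hidden bias `c`) and the assumption of one persistent chain per minibatch example are illustrative choices, not code from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def pcd_update(batch, chains, W, b, c, lr=0.01, k=1):
    """One PCD-k gradient step for a binary RBM. `chains` holds the
    persistent negative-phase visible states carried over from the
    previous update; we assume len(chains) == len(batch)."""
    # Positive phase: hidden probabilities given the data.
    ph = sigmoid(batch @ W + c)
    # Negative phase: advance the persistent chains by k Gibbs steps,
    # starting from wherever they ended after the previous update
    # rather than reinitializing them from the data.
    v = chains
    for _ in range(k):
        h = sample_bernoulli(sigmoid(v @ W + c))
        v = sample_bernoulli(sigmoid(h @ W.T + b))
    nh = sigmoid(v @ W + c)
    # Stochastic approximation to the log-likelihood gradient.
    W += lr * (batch.T @ ph - v.T @ nh) / len(batch)
    b += lr * (batch.mean(axis=0) - v.mean(axis=0))
    c += lr * (ph.mean(axis=0) - nh.mean(axis=0))
    return v  # the chains persist into the next gradient step
```

The key difference from CD is the returned `v`: the caller feeds it back in as `chains` on the next step, so the Markov chains are never restarted during learning.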
Because each Markov chain is continually updated throughout the learning
process, rather than restarted at each gradient step, the chains are free to wander
far enough to find all of the model’s modes. SML is thus considerably more
resistant to forming models with spurious modes than CD is. Moreover, because
it is possible to store the state of all of the sampled variables, whether visible or
latent, SML provides an initialization point for both the hidden and visible units.
CD is only able to provide an initialization for the visible units, and therefore
requires burn-in for deep models. SML is able to train deep models efficiently.
Marlin et al. (2010) compared SML to many of the other criteria presented in this
chapter. They found that SML results in the best test set log-likelihood for an
RBM, and if the RBM’s hidden units are used as features for an SVM classifier,
SML results in the best classification accuracy.
SML is vulnerable to becoming inaccurate if k is too small or the learning rate is
too large — in other words, if the stochastic gradient algorithm can move the model faster
than the Markov chain can mix between steps. There is no known way to test
formally whether the chain is successfully mixing between steps. Subjectively, if
the learning rate is too high for the number of Gibbs steps, the human operator
will be able to observe that there is much more variance in the negative phase
samples across gradient steps than across different Markov chains. For
example, a model trained on MNIST might sample exclusively 7s on one step.
The learning process will then push down strongly on the mode corresponding to
7s, and the model might sample exclusively 9s on the next step.
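One way to make this subjective check concrete — under an assumed logging format, not a procedure given in the text — is to log the negative-phase samples over a window of recent gradient steps and compare the two variances directly:

```python
import numpy as np

def mixing_diagnostic(neg_samples):
    """neg_samples: array of shape (steps, chains, features) holding
    negative-phase samples logged over recent gradient steps (an
    assumed logging convention). Returns the variance of the
    chain-averaged sample across gradient steps, and the variance of
    the step-averaged sample across chains, each averaged over features."""
    var_across_steps = neg_samples.mean(axis=1).var(axis=0).mean()
    var_across_chains = neg_samples.mean(axis=0).var(axis=0).mean()
    return var_across_steps, var_across_chains
```

If the first quantity dominates the second — the chains agree with each other at every step but swing together from step to step, as in the 7s-then-9s example — that is a sign the learning rate is too high for the number of Gibbs steps.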
Care must be taken when evaluating the samples from a model trained with
SML. It is necessary to draw the samples starting from a fresh Markov chain
initialized from a random starting point after the model is done training. The
samples present in the persistent negative chains used for training have been
influenced by several recent versions of the model, and thus can make the model
appear to have greater capacity than it actually does.
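A sketch of this evaluation protocol for a binary RBM, reusing the illustrative parameter names from before (the helper names and burn-in length are assumptions, not prescriptions from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def sample_fresh(W, b, c, n_samples=16, burn_in=1000):
    """Draw evaluation samples from fresh Markov chains initialized at
    random, rather than reusing the persistent training chains, whose
    states were shaped by several stale versions of the model."""
    v = sample_bernoulli(np.full((n_samples, W.shape[0]), 0.5))
    for _ in range(burn_in):
        h = sample_bernoulli(sigmoid(v @ W + c))
        v = sample_bernoulli(sigmoid(h @ W.T + b))
    return v
```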
Berglund and Raiko (2013) performed experiments to examine the bias and
variance in the estimate of the gradient provided by CD and SML. CD proves to
have lower variance than the estimator based on exact sampling. SML has higher
variance. The cause of CD’s low variance is its use of the same training points