
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
several approximations to the negative phase. Each of these approximations can
be understood as making the negative phase computationally cheaper but also
making it push down in the wrong locations.
Because the negative phase involves drawing samples from the model's distribution, we can think of it as finding points that the model believes in strongly.
Because the negative phase acts to reduce the probability of those points, they
are generally considered to represent the model’s incorrect beliefs about the world.
They are frequently referred to in the literature as “hallucinations” or “fantasy
particles.” In fact, the negative phase has been proposed as a possible explanation
for dreaming in humans and other animals (Crick and Mitchison, 1983), the idea
being that the brain maintains a probabilistic model of the world and follows the gradient of log p̃ while experiencing real events while awake, and follows the negative gradient of log p̃ to minimize log Z while sleeping and experiencing events sampled from the current model. This view explains much of the language used to describe
algorithms with a positive and negative phase, but it has not been proven to be
correct with neuroscientific experiments. In machine learning models, it is usually
necessary to use the positive and negative phase simultaneously, rather than in
separate time periods of wakefulness and REM sleep. As we will see in Sec. 19.5,
other machine learning algorithms draw samples from the model distribution for
other purposes and such algorithms could also provide an account for the function
of dream sleep.
Given this understanding of the role of the positive and negative phase of
learning, we can attempt to design a less expensive alternative to Algorithm 18.1.
The main cost of the naive MCMC algorithm is the cost of burning in the Markov
chains from a random initialization at each step. A natural solution is to initialize
the Markov chains from a distribution that is very close to the model distribution,
so that the burn in operation does not take as many steps.
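To make the cost of the naive approach concrete, here is a minimal sketch of its negative phase for a binary restricted Boltzmann machine (an assumed model choice for illustration; the function name, the parameterization `W`, `b`, `c`, and the burn-in budget are all hypothetical, not the book's code). Every gradient step must pay for `burn_in` Gibbs sweeps starting from random noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_negative_samples(W, b, c, n_chains=64, burn_in=1000):
    """Negative-phase samples for a binary RBM via naive MCMC.

    W: (n_visible, n_hidden) weights; b, c: visible/hidden biases.
    The chains start from uniform random noise, so they must burn in
    at every gradient step -- this burn-in is the dominant cost.
    """
    n_visible = b.shape[0]
    v = (rng.random((n_chains, n_visible)) < 0.5).astype(float)
    for _ in range(burn_in):
        # One sweep of block Gibbs sampling: hidden given visible,
        # then visible given hidden.
        ph = sigmoid(v @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
    return v
```

Initializing the chains near the model distribution lets `burn_in` shrink dramatically, which is exactly the lever the algorithms below exploit.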
The contrastive divergence (CD, or CD-k to indicate CD with k Gibbs steps) algorithm initializes the Markov chain at each step with samples from the data distribution (Hinton, 2000, 2010). This approach is presented as Algorithm 18.2.
Obtaining samples from the data distribution is free, because they are already
available in the data set. Initially, the data distribution is not close to the model
distribution, so the negative phase is not very accurate. Fortunately, the positive
phase can still accurately increase the model’s probability of the data. After the
positive phase has had some time to act, the model distribution is closer to the
data distribution, and the negative phase starts to become accurate.
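The idea can be sketched for a binary RBM (the model choice, function name, and parameterization `W`, `b`, `c` are illustrative assumptions, not the chapter's Algorithm 18.2 verbatim). The only change from naive MCMC is the chain initialization: the Gibbs chain starts at the minibatch itself, so k can be small:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v_data, k=1, lr=0.01):
    """One CD-k parameter update for a binary RBM (illustrative sketch).

    W: (n_visible, n_hidden) weights; b, c: visible/hidden biases.
    v_data: (batch, n_visible) minibatch of training examples.
    """
    # Positive phase: hidden-unit probabilities given the data.
    ph_pos = sigmoid(v_data @ W + c)

    # Negative phase: initialize the chain AT THE DATA (the CD trick),
    # then run only k steps of block Gibbs sampling.
    v = v_data.astype(float)
    for _ in range(k):
        ph = sigmoid(v @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
    ph_neg = sigmoid(v @ W + c)

    # Approximate gradient: positive statistics minus negative statistics.
    n = v_data.shape[0]
    W += lr * (v_data.T @ ph_pos - v.T @ ph_neg) / n
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (ph_pos - ph_neg).mean(axis=0)
    return W, b, c
```

Because the chain starts at training points, early in learning the negative samples stay near the data rather than near the model's true modes, which is one source of the bias discussed next.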
Of course, CD is still an approximation to the correct negative phase. The
main way that CD qualitatively fails to implement the correct negative phase