Chapter 15
Confronting the Partition Function
TODO: make sure the book explains asymptotic consistency somewhere, add links to it here.
In section 9.2.2 we saw that many probabilistic models (commonly known as undirected graphical models) are defined by an unnormalized probability distribution ˜p(x; θ) and a partition function Z(θ) such that

\[
p(x; \theta) = \frac{1}{Z(\theta)} \tilde{p}(x; \theta)
\]

is a valid, normalized probability distribution. The partition function is an integral or sum over the unnormalized probability of all states. This operation is intractable for many interesting models.
As we will see in chapter 17, many deep learning models are designed to have a
tractable normalizing constant, or are designed to be used in ways that do not involve
computing p(x) at all. However, other models confront the challenge of intractable
partition functions head on. In this chapter, we describe techniques used for training
and evaluating models that have intractable partition functions.
15.1 Estimating the partition function
While much of this chapter is dedicated to describing methods for working around the unknown and intractable partition function Z(θ) associated with an undirected graphical model, in this section we discuss several methods for directly estimating the partition function.

Estimating the partition function can be important because we require it if we wish to compute the normalized likelihood of data. This is often important for evaluating the model, monitoring training performance, and comparing it to other models.
For example, imagine we have two models: M_A with p_A(x; θ_A) = (1/Z_A) ˜p_A(x; θ_A), and M_B with p_B(x; θ_B) = (1/Z_B) ˜p_B(x; θ_B). A common way to compare the models is to evaluate the likelihood of an IID test dataset D_test = {x_i^(t)}, i = 1, ..., N_test, of size N_test under both models. If ∏_i p_A(x_i^(t); θ_A) > ∏_i p_B(x_i^(t); θ_B), or equivalently if Σ_i ln p_A(x_i^(t); θ_A) − Σ_i ln p_B(x_i^(t); θ_B) > 0, then we say that M_A is a better model than M_B (or, at least, it is a better model of the test set). More specifically, to say that M_A is better than M_B, we need that

\[
\sum_i \ln p_A(x_i^{(t)}; \theta_A) - \sum_i \ln p_B(x_i^{(t)}; \theta_B) > 0
\]
\[
\iff \sum_i \left( \ln \tilde{p}_A(x_i^{(t)}; \theta_A) - \ln Z(\theta_A) \right) - \sum_i \left( \ln \tilde{p}_B(x_i^{(t)}; \theta_B) - \ln Z(\theta_B) \right) > 0
\]
\[
\iff \sum_i \left( \ln \tilde{p}_A(x_i^{(t)}; \theta_A) - \ln \tilde{p}_B(x_i^{(t)}; \theta_B) \right) - N_{\text{test}} \ln Z(\theta_A) + N_{\text{test}} \ln Z(\theta_B) > 0
\]
\[
\iff \sum_i \ln \frac{\tilde{p}_A(x_i^{(t)}; \theta_A)}{\tilde{p}_B(x_i^{(t)}; \theta_B)} - N_{\text{test}} \ln \frac{Z(\theta_A)}{Z(\theta_B)} > 0.
\]
In order to compare two models we need to compare not only their unnormalized probabilities, but also their partition functions. It is interesting to note that, in order to compare these models, we do not actually need to know the value of either partition function. We need only know their ratio. That is, we need to know their relative value, up to some shared constant. If, however, we wanted to know the actual probability of the test data under either M_A or M_B, we would need to know the actual value of the partition functions. That said, if we knew the ratio of the two partition functions, R = Z(θ_B)/Z(θ_A), and we knew the actual value of just one of the two, say Z(θ_A), we could compute the value of the other:

\[
Z(\theta_B) = R \times Z(\theta_A) = \frac{Z(\theta_B)}{Z(\theta_A)} Z(\theta_A).
\]
We can make use of this observation to estimate the partition functions of undirected
graphical models.
For a given probability distribution, say p_1(x), the partition function is defined as

\[
Z_1 = \int \tilde{p}_1(x) \, dx
\tag{15.1}
\]

where the integral is over the domain of x.
where the integral is over the domain of x. Of course, in the case of discrete x, we replace
the integral with a sum. For convenience, we have suppressed the dependency of both
the partition functions and the unnormalized distributions on the model parameters.
A simple way to estimate the partition function is to use a Monte Carlo method such as simple importance sampling. Here we consider a proposal distribution, say p_0(x), from which we can sample and evaluate both its partition function Z_0 and its unnormalized distribution ˜p_0(x).
\[
\begin{aligned}
Z_1 &= \int \tilde{p}_1(x) \, dx \\
&= \int \frac{p_0(x)}{p_0(x)} \tilde{p}_1(x) \, dx \\
&= Z_0 \int p_0(x) \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)} \, dx \\
Z_1 &\approx \frac{Z_0}{K} \sum_{k=1,\; x^{(k)} \sim p_0}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})}
\end{aligned}
\tag{15.2}
\]
In the last line, we make a Monte Carlo approximation of the integral using samples drawn from p_0(x), weighting each sample by the ratio of the unnormalized ˜p_1 and the unnormalized proposal ˜p_0, each evaluated at that sample.
If the distribution p_0 is close to p_1, this can be an effective way of estimating the partition function (Minka, 2005). Unfortunately, most of the time p_1 is both complicated (i.e., multimodal) and defined over a high dimensional space. It is difficult to find a tractable p_0 that is simple enough to evaluate while still being close enough to p_1 to result in a high quality approximation. If p_0 and p_1 are not close, most samples from p_0 will have low probability under p_1 and therefore make (relatively) negligible contributions to the sum in Eq. 15.2. Having few samples with significant weights in this sum results in an estimator with high variance, i.e., a poor quality estimator.
TODO: quantify this
We now turn to two related strategies developed to cope with the challenging task of estimating partition functions for complex distributions over high-dimensional spaces: annealed importance sampling and bridge sampling (also known as Bennett's acceptance ratio method). Both start with the basic importance sampling strategy introduced above, and both attempt to overcome the problem of the proposal p_0 being too far from p_1 by introducing intermediate distributions that attempt to bridge the gap between p_0 and p_1.
15.1.1 Annealed importance sampling
TODO: describe how this is the main way of evaluating p(x) when you want to get test set likelihoods but can't be used for training.
TODO: also mention Guillaume's "tracking the partition function" paper?
In situations where KL(p_0 || p_1) is large (i.e., where there is little overlap between p_0 and p_1), AIS attempts to bridge the gap by introducing intermediate distributions. Consider a sequence of distributions p_{η_0}, ..., p_{η_n}, with 0 = η_0 < η_1 < ··· < η_{n−1} < η_n = 1, so that the first and last distributions in the sequence are p_0 and p_1 respectively. We can now write the ratio Z_1/Z_0 as

\[
\begin{aligned}
\frac{Z_1}{Z_0} &= \frac{Z_1}{Z_0} \frac{Z_{\eta_1}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-1}}} \\
&= \frac{Z_{\eta_1}}{Z_0} \frac{Z_{\eta_2}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-2}}} \frac{Z_1}{Z_{\eta_{n-1}}} \\
&= \prod_{j=0}^{n-1} \frac{Z_{\eta_{j+1}}}{Z_{\eta_j}}
\end{aligned}
\tag{15.3}
\]
Provided the distributions p_{η_j} and p_{η_{j+1}}, for all 0 ≤ j ≤ n − 1, are sufficiently close, we can reliably estimate each of the factors Z_{η_{j+1}}/Z_{η_j} using simple importance sampling and then use these to obtain an estimate of Z_1/Z_0.
Where do these intermediate distributions come from? Just as the original proposal distribution p_0 is a design choice, so is the sequence of distributions p_{η_1}, ..., p_{η_{n−1}}. That is, it can be specifically constructed to suit the problem domain. One general-purpose and popular choice for the intermediate distributions is to use the weighted geometric average of the target distribution p_1 and the starting proposal distribution (for which the partition function is known) p_0:

\[
p_{\eta_j} \propto p_1^{\eta_j} \, p_0^{1 - \eta_j}
\tag{15.4}
\]
In order to sample from these intermediate distributions, we define a series of Markov chain transition functions T_{η_j}(x′, x) that define the probability distribution of transitioning from x′ to x. The transition operator T_{η_j}(x′, x) is defined to leave p_{η_j}(x) invariant:

\[
p_{\eta_j}(x) = \int p_{\eta_j}(x') \, T_{\eta_j}(x', x) \, dx'
\tag{15.5}
\]

These transitions may be constructed from any Markov chain Monte Carlo method (e.g., Metropolis-Hastings, Gibbs), including methods involving multiple scans or other iterations.
The AIS sampling strategy is then to generate samples from p_0 and then use the transition operators to sequentially generate samples from the intermediate distributions until we arrive at samples from the target distribution p_1:

for k = 1 ... K
    Sample x^(k)_{η_1} ∼ p_0(x)
    Sample x^(k)_{η_2} ∼ T_{η_1}(x^(k)_{η_1}, ·)
    ...
    Sample x^(k)_{η_{n−1}} ∼ T_{η_{n−2}}(x^(k)_{η_{n−2}}, ·)
    Sample x^(k)_{η_n} ∼ T_{η_{n−1}}(x^(k)_{η_{n−1}}, ·)
end
For sample k, we can derive the importance weight by chaining together the impor-
tance weights for the jumps between the intermediate distributions given in Eq. 15.3.
\[
w^{(k)} = \frac{\tilde{p}_{\eta_1}(x^{(k)}_{\eta_1})}{\tilde{p}_0(x^{(k)}_{\eta_1})} \,
\frac{\tilde{p}_{\eta_2}(x^{(k)}_{\eta_2})}{\tilde{p}_{\eta_1}(x^{(k)}_{\eta_2})} \cdots
\frac{\tilde{p}_1(x^{(k)}_1)}{\tilde{p}_{\eta_{n-1}}(x^{(k)}_1)}
\tag{15.6}
\]
To avoid computational issues such as overflow, it is probably best to do the computation in log space, i.e., ln w^(k) = ln ˜p_{η_1}(x^(k)_{η_1}) − ln ˜p_0(x^(k)_{η_1}) + ···.
With the sampling procedure thus defined and the importance weights given in Eq. 15.6, the estimate of the ratio of partition functions is given by:

\[
\frac{Z_1}{Z_0} \approx \frac{1}{K} \sum_{k=1}^{K} w^{(k)}
\tag{15.7}
\]
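The following is a minimal Python sketch of the full AIS procedure on one-dimensional Gaussians, where the true ratio Z_1/Z_0 is known; the geometric path of Eq. 15.4, the random-walk Metropolis transition operator, and the schedule length are all assumed choices made only for illustration.

```python
import numpy as np

# A minimal sketch of AIS estimating Z_1/Z_0 for unnormalized 1-D Gaussians.
# p~_0(x) = exp(-x^2/2), so Z_0 = sqrt(2*pi); p~_1(x) = exp(-(x-3)^2/8), so
# Z_1 = sqrt(8*pi); the true ratio Z_1/Z_0 is 2. The geometric path of
# Eq. 15.4 and a random-walk Metropolis transition are assumed choices.

rng = np.random.default_rng(0)

def log_p0(x): return -0.5 * x ** 2
def log_p1(x): return -0.125 * (x - 3.0) ** 2

def log_p_eta(x, eta):          # geometric intermediate distributions (Eq. 15.4)
    return eta * log_p1(x) + (1.0 - eta) * log_p0(x)

def metropolis_step(x, eta, step=1.0):
    prop = x + step * rng.normal(size=x.shape)
    accept = np.log(rng.random(x.shape)) < log_p_eta(prop, eta) - log_p_eta(x, eta)
    return np.where(accept, prop, x)

K, n = 2000, 100                        # number of chains and of distributions
etas = np.linspace(0.0, 1.0, n + 1)
x = rng.normal(size=K)                  # samples from p_0
log_w = np.zeros(K)
for j in range(1, n + 1):
    log_w += log_p_eta(x, etas[j]) - log_p_eta(x, etas[j - 1])
    x = metropolis_step(x, etas[j])     # leaves p_{eta_j} (approximately) invariant

ratio_estimate = np.exp(log_w).mean()   # Eq. 15.7, estimate of Z_1/Z_0
print(ratio_estimate)                   # should be close to 2
```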
In order to verify that this procedure defines a valid importance sampling scheme, we can show that the AIS procedure corresponds to simple importance sampling on an extended state space, with points sampled over the product space [x_{η_1}, ..., x_{η_{n−1}}, x_1] (Neal, 2001).
We define the distribution over the extended space as:

\[
\tilde{p}(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) =
\tilde{p}_1(x_1) \,
\tilde{T}_{\eta_{n-1}}(x_1, x_{\eta_{n-1}}) \,
\tilde{T}_{\eta_{n-2}}(x_{\eta_{n-1}}, x_{\eta_{n-2}}) \cdots
\tilde{T}_{\eta_1}(x_{\eta_2}, x_{\eta_1})
\tag{15.8}
\]
where ˜T_a is the reverse of the transition operator defined by T_a (via an application of Bayes' rule):

\[
\tilde{T}_a(x, x') = \frac{p_a(x')}{p_a(x)} T_a(x', x) = \frac{\tilde{p}_a(x')}{\tilde{p}_a(x)} T_a(x', x).
\tag{15.9}
\]
Plugging the above into the expression for the joint distribution on the extended state space given in Eq. 15.8, we get:

\[
\begin{aligned}
\tilde{p}(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)
&= \tilde{p}_1(x_1) \,
\frac{\tilde{p}_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde{p}_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_{\eta_{n-1}}, x_1) \,
\frac{\tilde{p}_{\eta_{n-2}}(x_{\eta_{n-2}})}{\tilde{p}_{\eta_{n-2}}(x_{\eta_{n-1}})} T_{\eta_{n-2}}(x_{\eta_{n-2}}, x_{\eta_{n-1}}) \cdots
\frac{\tilde{p}_{\eta_1}(x_{\eta_1})}{\tilde{p}_{\eta_1}(x_{\eta_2})} T_{\eta_1}(x_{\eta_1}, x_{\eta_2}) \\
&= \frac{\tilde{p}_1(x_1)}{\tilde{p}_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_{\eta_{n-1}}, x_1) \,
\frac{\tilde{p}_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde{p}_{\eta_{n-2}}(x_{\eta_{n-1}})} T_{\eta_{n-2}}(x_{\eta_{n-2}}, x_{\eta_{n-1}}) \cdots
\frac{\tilde{p}_{\eta_2}(x_{\eta_2})}{\tilde{p}_{\eta_1}(x_{\eta_2})} T_{\eta_1}(x_{\eta_1}, x_{\eta_2}) \,
\tilde{p}_{\eta_1}(x_{\eta_1})
\end{aligned}
\tag{15.10}
\]
If we now consider the sampling scheme given above as a means of generating samples from a proposal distribution q over the extended state space, then its distribution is given by:

\[
q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) =
p_0(x_{\eta_1}) \, T_{\eta_1}(x_{\eta_1}, x_{\eta_2}) \cdots T_{\eta_{n-1}}(x_{\eta_{n-1}}, x_1)
\tag{15.11}
\]
We have a joint distribution on the extended space given by Eq. 15.10. Taking q(x_{η_1}, ..., x_{η_{n−1}}, x_1) as the proposal distribution on the extended state space from which we will draw samples, it remains to determine the importance weights:

\[
w^{(k)} = \frac{\tilde{p}(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)}{q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)}
= \frac{\tilde{p}_1(x^{(k)}_1)}{\tilde{p}_{\eta_{n-1}}(x^{(k)}_1)} \cdots
\frac{\tilde{p}_{\eta_2}(x^{(k)}_{\eta_2})}{\tilde{p}_{\eta_1}(x^{(k)}_{\eta_2})} \,
\frac{\tilde{p}_{\eta_1}(x^{(k)}_{\eta_1})}{\tilde{p}_0(x^{(k)}_{\eta_1})}
\tag{15.12}
\]
These weights are the same as those proposed for AIS. Thus we can interpret AIS as simple importance sampling applied to an extended state space, and its validity follows immediately from the validity of importance sampling.

Annealed importance sampling (AIS) was first discovered by Jarzynski (1997) and then again, independently, by Neal (2001). It is currently the most common way of estimating the partition function for undirected probabilistic models. The reasons for this may have more to do with the publication of an influential paper (Salakhutdinov and Murray, 2008) describing its application to estimating the partition functions of restricted Boltzmann machines and deep belief networks than with any inherent advantage the method has over the other method described below.

A discussion of the properties of the AIS estimator (e.g., its variance and efficiency) can be found in Neal (2001).
15.1.2 Bridge sampling
Bridge sampling (Bennett, 1976) is another method that, like AIS, addresses the shortcomings of importance sampling; however, it does so in a different but related manner. Rather than chaining together a series of intermediate distributions, bridge sampling relies on a single distribution p_∗, known as the bridge, to interpolate between a distribution with a known partition function, p_0, and a distribution p_1 for which we are trying to estimate the partition function Z_1.
Bridge sampling estimates the ratio Z_1/Z_0 as the ratio of the expected importance weights between ˜p_0 and ˜p_∗ and between ˜p_1 and ˜p_∗:

\[
\frac{Z_1}{Z_0} \approx
\frac{\sum_{k=1}^{K} \frac{\tilde{p}_*(x^{(k)}_0)}{\tilde{p}_0(x^{(k)}_0)}}
     {\sum_{k=1}^{K} \frac{\tilde{p}_*(x^{(k)}_1)}{\tilde{p}_1(x^{(k)}_1)}}
\tag{15.13}
\]

where the samples x^(k)_0 are drawn from p_0 and the samples x^(k)_1 are drawn from p_1.
If the bridge distribution p_∗ is chosen carefully to have large overlap of support with both p_0 and p_1, then bridge sampling can allow the distance between the two distributions (or more formally, KL(p_0 || p_1)) to be much larger than with standard importance sampling.

It can be shown that the optimal bridging distribution is given by

\[
p_*^{(\mathrm{opt})}(x) \propto \frac{\tilde{p}_0(x) \, \tilde{p}_1(x)}{r \, \tilde{p}_0(x) + \tilde{p}_1(x)}
\]

where r = Z_1/Z_0.
This appears to be an unworkable solution, as it would seem to require the very quantity we are trying to estimate, Z_1/Z_0. However, it is possible to start with a coarse estimate of r and use the resulting bridge distribution to refine our estimate recursively (Neal, 2005).
TODO: illustration of the bridge distribution
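As a concrete (and deliberately simple) illustration of Eq. 15.13, the sketch below estimates Z_1/Z_0 for two unnormalized one-dimensional Gaussians, using the geometric mean of the two as an ad hoc bridge rather than the optimal bridge distribution; every specific choice here is an assumption made for brevity.

```python
import numpy as np

# A minimal sketch of bridge sampling (Eq. 15.13) for unnormalized 1-D
# Gaussians: p~_0(x) = exp(-x^2/2) and p~_1(x) = exp(-(x-1)^2/2), so the true
# ratio Z_1/Z_0 is 1. The bridge p~_* below is the simple geometric mean of
# p~_0 and p~_1, an assumption chosen only to keep the example short.

rng = np.random.default_rng(0)
K = 100000

def log_p0_t(x): return -0.5 * x ** 2
def log_p1_t(x): return -0.5 * (x - 1.0) ** 2
def log_pb_t(x): return 0.5 * (log_p0_t(x) + log_p1_t(x))   # bridge p~_*

x0 = rng.normal(loc=0.0, scale=1.0, size=K)   # exact samples from p_0
x1 = rng.normal(loc=1.0, scale=1.0, size=K)   # exact samples from p_1

numer = np.exp(log_pb_t(x0) - log_p0_t(x0)).mean()   # approximates Z_*/Z_0
denom = np.exp(log_pb_t(x1) - log_p1_t(x1)).mean()   # approximates Z_*/Z_1
print(numer / denom)                                  # approximates Z_1/Z_0 = 1
```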
15.1.3 Extensions
Linked importance sampling.  Both AIS and bridge sampling have their advantages. If KL(p_0 || p_1) is not too large (i.e., if p_0 and p_1 are sufficiently close), bridge sampling can be a more effective means of estimating the ratio of partition functions than AIS. If, however, the two distributions are too far apart for a single distribution p_∗ to bridge the gap, then one can at least use AIS with potentially many intermediate distributions to span the distance between p_0 and p_1. Neal (2005) showed how his linked importance sampling method leverages the power of the bridge sampling strategy to bridge the intermediate distributions used in AIS, significantly improving the overall partition function estimates.
Tracking the partition function while training.  Using a combination of bridge sampling, AIS and parallel tempering, Desjardins et al. (2011) devised a scheme to track the partition function of an RBM throughout the training process. The strategy is based on maintaining independent estimates of the partition functions of the RBM at every temperature operating in the parallel tempering scheme. The authors combined bridge sampling estimates of the ratios of partition functions of neighboring chains (i.e., from parallel tempering) with AIS estimates across time to come up with a low-variance estimate of the partition functions at every iteration of learning.
15.2 Stochastic maximum likelihood and contrastive divergence
Though the gradient of the log partition function is intractable to evaluate accurately, it is straightforward to analyze algebraically. The derivatives we need for learning are of the form ∇_θ log p(x), where θ denotes the parameters of p(x). These derivatives are given simply by

\[
\nabla_\theta \log p(x) = \nabla_\theta \left( \log \tilde{p}(x) - \log Z \right).
\]
In this chapter, we are primarily concerned with the estimation of the term on the right:

\[
\nabla_\theta \log Z
= \frac{\nabla_\theta Z}{Z}
= \frac{\nabla_\theta \sum_x \tilde{p}(x)}{Z}
= \frac{\sum_x \nabla_\theta \tilde{p}(x)}{Z}.
\]
For models that guarantee p(x) > 0 for all x, we can substitute exp(log ˜p(x)) for ˜p(x):

\[
\begin{aligned}
&= \frac{\sum_x \nabla_\theta \exp\left(\log \tilde{p}(x)\right)}{Z} \\
&= \frac{\sum_x \exp\left(\log \tilde{p}(x)\right) \nabla_\theta \log \tilde{p}(x)}{Z} \\
&= \frac{\sum_x \tilde{p}(x) \nabla_\theta \log \tilde{p}(x)}{Z} \\
&= \sum_x p(x) \nabla_\theta \log \tilde{p}(x) \\
&= \mathbb{E}_{x \sim p(x)} \nabla_\theta \log \tilde{p}(x).
\end{aligned}
\]
This derivation made use of summation over discrete x, but a similar result applies using integration over continuous x. In the continuous version of the derivation, we use Leibniz's rule for differentiation under the integral sign to obtain the identity

\[
\nabla_\theta \int \tilde{p}(x) \, dx = \int \nabla_\theta \tilde{p}(x) \, dx.
\]

This identity is only applicable when both ˜p and ∇_θ ˜p(x) are continuous. Fortunately, most machine learning models of interest have these properties.
This identity

\[
\nabla_\theta \log Z = \mathbb{E}_{x \sim p(x)} \nabla_\theta \log \tilde{p}(x)
\tag{15.14}
\]

is the basis for a variety of Monte Carlo methods for approximately maximizing the likelihood of models with intractable partition functions.
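The identity in Eq. 15.14 can be checked numerically on a model small enough to enumerate. The sketch below does so for a hypothetical log-linear model over a few binary variables; both the model and the brute-force enumeration are illustrative assumptions that would not scale.

```python
import numpy as np

# A numerical check of Eq. 15.14 on a tiny, enumerable model (an illustrative
# assumption): p~(x; theta) = exp(theta . x) over binary vectors x in {0,1}^n.
# For this model, grad_theta log p~(x) = x, so the identity says
# grad_theta log Z = E_{x ~ p}[x].

rng = np.random.default_rng(0)
n = 4
theta = rng.normal(size=n)

# Brute-force enumeration of all 2^n states (feasible only because n is tiny).
states = np.array([[(i >> j) & 1 for j in range(n)] for i in range(2 ** n)],
                  dtype=float)
log_p_tilde = states @ theta
Z = np.exp(log_p_tilde).sum()
p = np.exp(log_p_tilde) / Z

grad_exact = (p[:, None] * states).sum(axis=0)   # exact grad_theta log Z = E_p[x]

# Monte Carlo estimate of the same expectation using samples from p.
idx = rng.choice(len(states), size=100000, p=p)
grad_mc = states[idx].mean(axis=0)

print(grad_exact)
print(grad_mc)   # agrees with grad_exact up to Monte Carlo error
```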
The naive way of implementing equation 15.14 is to compute it by burning in a set
of Markov chains from a random initialization every time the gradient is needed. When
learning is performed using stochastic gradient descent, this means the chains must
be burned in once per gradient step. This approach leads to the training procedure
presented in Algorithm 15.1. The high cost of burning in the Markov chains in the inner
loop makes this procedure computationally infeasible, but this basic procedure is the
starting point that other more practical algorithms aim to approximate.
We can view the MCMC approach to maximum likelihood as trying to achieve bal-
ance between two forces, one pushing up on the model distribution where the data
occurs, and another pushing down on the model distribution where the model samples
occur. Fig. 15.1 illustrates this process. The two forces correspond to maximizing log ˜p
and minimizing log Z. Terms of the approximate gradient intended to maximize log ˜p
are referred to as the positive phase and terms of the approximate gradient intended
to minimize log Z are known as the negative phase. In this chapter, we assume the
positive phase is tractable and may be performed exactly, but other chapters, especially
chapter 16 deal with intractable positive phases. In this chapter, we present several
approximations to the negative phase. Each of these approximations can be understood
as making the negative phase computationally cheaper but also making it push down in
the wrong locations.
Because the negative phase involves drawing samples from the model’s distribution,
we can think of it as finding points that the model believes in strongly. Because the neg-
ative phase acts to reduce the probability of those points, they are generally considered
Algorithm 15.1 A naive MCMC algorithm for maximizing the log likelihood with an intractable partition function using gradient ascent.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow burn in. Perhaps 100 to train an RBM on a small image patch.
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x^(i); θ)
    Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j))
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x̃^(i); θ)
    θ ← θ + εg
end while
to represent the model’s incorrect beliefs about the world. They are frequently referred
to in the literature as “hallucinations” or “fantasy particles.” In fact, the negative phase
has been proposed as a possible explanation for dreaming in humans and other animals
(Crick and Mitchison, 1983), the idea being that the brain maintains a probabilistic
model of the world and follows the gradient of log ˜p while experiencing real events while
awake and follows the negative gradient of log ˜p to minimize log Z while sleeping and
experiencing events sampled from the current model. This view explains much of the
language used to describe algorithms with a positive and negative phase, but it has not
been proven to be correct with neuroscientific experiments. In machine learning models,
it is usually necessary to use the positive and negative phase simultaneously, rather than
in separate time periods of wakefulness and REM sleep. As we will see in section 16.6,
other machine learning algorithms draw samples from the model distribution for other
purposes and such algorithms could also provide an account for the function of dream
sleep.
Given this understanding of the role of the positive and negative phase of learning,
we can attempt to design a less expensive alternative to Algorithm 15.1. The main cost
of the naive MCMC algorithm is the cost of burning in the Markov chains from a random
initialization at each step. A natural solution is to initialize the Markov chains from a
distribution that is very close to the model distribution, so that the burn in operation
does not take as many steps.
The contrastive divergence (CD, or CD-k to indicate CD with k Gibbs steps) al-
gorithm initializes the Markov chain at each step with samples from the data distribu-
tion (Hinton, 2000). This approach is presented as Algorithm 15.2. Obtaining samples
Figure 15.1: The view of Algorithm 15.1 as having a “positive phase” and “negative
phase”. Left) In the positive phase, we sample points from the data distribution, and
push up on their unnormalized probability. This means points that are likely in the data
get pushed up on more. Right) In the negative phase, we sample points from the model
distribution, and push down on their unnormalized probability. This counteracts the
positive phase’s tendency to just add a large constant to the unnormalized probability
everywhere. When the data distribution and the model distribution are equal, the
positive phase has the same chance to push up at a point as the negative phase has to
push down. At this point, there is no longer any gradient (in expectation) and training
must terminate.
from the data distribution is free, because they are already available in the data set.
Initially, the data distribution is not close to the model distribution, so the negative
phase is not very accurate. Fortunately, the positive phase can still accurately increase
the model’s probability of the data. After the positive phase has had some time to act,
the model distribution is closer to the data distribution, and the negative phase starts
to become accurate.
Of course, CD is still an approximation to the correct negative phase. The main
way that CD qualitatively fails to implement the correct negative phase is that it fails
to suppress "spurious modes," that is, regions of high probability that are far from actual
training examples. Fig. 15.2 illustrates why this happens. Essentially, it is because
modes in the model distribution that are far from the data distribution will not be
visited by Markov chains initialized at training points, unless k is very large.
Carreira-Perpiñán and Hinton (2005) showed experimentally that the CD estimator
is biased for RBMs and fully visible Boltzmann machines, in that it converges to different
points than the maximum likelihood estimator. They argue that because the bias is
small, CD could be used as an inexpensive way to initialize a model that could later be
fine-tuned via more expensive MCMC methods. Bengio and Delalleau (2009) showed
that CD can be interpreted as discarding the smallest terms of the correct MCMC
update gradient, which explains the bias.
CD is useful for training shallow models like RBMs. These can in turn be stacked
to initialize deeper models like DBNs or DBMs. However, CD does not provide much
Figure 15.2: An illustration of how the negative phase of contrastive divergence (Al-
gorithm 15.2) can fail to suppress spurious modes. A spurious mode is a mode that
is present in the model distribution but absent in the data distribution. Because con-
trastive divergence initializes its Markov chains from data points and runs the Markov
chain for only a few steps, it is unlikely to visit modes in the model that are far from
the data points. This means that when sampling from the model, we will sometimes
get samples that do not resemble the data. It also means that due to wasting some of
its probability mass on these modes, the model will struggle to place high probability
mass on the correct modes. Note that this figure uses a somewhat simplified concept
of distance–the spurious mode is far from the correct mode along the number line in R.
This corresponds to a Markov chain based on making local moves with a single x vari-
able in R. For most deep probabilistic models, the Markov chains are based on Gibbs
sampling and can make non-local moves of individual variables but cannot move all of
the variables simultaneously. For these problems, it is usually better to consider the edit
distance between modes, rather than the Euclidean distance. However, edit distance in
a high dimensional space is difficult to depict in a 2-D plot.
Algorithm 15.2 The contrastive divergence algorithm, using gradient ascent as the optimization procedure.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(x; θ) to mix when initialized from p_data. Perhaps 1-20 to train an RBM on a small image patch.
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x^(i); θ)
    for i = 1 to m do
        x̃^(i) ← x^(i)
    end for
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j))
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x̃^(i); θ)
    θ ← θ + εg
end while
help for training deeper models directly. This is because it is difficult to obtain samples
of the hidden units given samples of the visible units. Since the hidden units are not
included in the data, initializing from training points cannot solve the problem. Even
if we initialize the visible units from the data, we will still need to burn in a Markov
chain sampling from the distribution over the hidden units conditioned on those visi-
ble samples. Most of the approximate inference techniques described in chapter 16 for
approximately marginalizing out the hidden units cannot be used to solve this prob-
lem. This is because all of the approximate marginalization methods based on giving a
lower bound on ˜p would give a lower bound on log Z. We need to minimize log Z, and
minimizing a lower bound is not a useful operation.
The CD algorithm can be thought of as penalizing the model for having a Markov
chain that changes the input rapidly when the input comes from the data. This means
training with CD somewhat resembles autoencoder training. Even though CD is more
biased than some of the other training methods, it can be useful for pre-training shallow
models that will later be stacked. This is because the earliest models in the stack are
encouraged to copy more information up to their latent variables, thereby making it
available to the later models. This should be thought of more as an often-exploitable
side effect of CD training than as a principled design advantage.
Sutskever and Tieleman (2010) showed that the CD update direction is not the
gradient of any function. This allows for situations where CD could cycle forever, but
in practice this is not a serious problem.
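A minimal NumPy sketch of CD-k in the spirit of Algorithm 15.2 is given below for a small binary RBM, whose conditionals make the Gibbs updates easy to write; the architecture size, learning rate, number of steps, and synthetic data are placeholder assumptions rather than recommendations.

```python
import numpy as np

# A minimal sketch of CD-k for a binary RBM with energy
# E(v, h) = -v.W h - b.v - c.h, following Algorithm 15.2.

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_vis, n_hid, k, lr = 6, 4, 1, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)   # visible biases
c = np.zeros(n_hid)   # hidden biases

data = (rng.random((500, n_vis)) < 0.5).astype(float)   # stand-in for real data

def sample_h(v):
    p = sigmoid(v @ W + c)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v(h):
    p = sigmoid(h @ W.T + b)
    return p, (rng.random(p.shape) < p).astype(float)

for step in range(100):
    v0 = data[rng.choice(len(data), size=32)]           # minibatch
    ph0, h0 = sample_h(v0)                               # positive phase statistics
    vk = v0.copy()
    for _ in range(k):                                   # k steps of Gibbs sampling
        _, hk = sample_h(vk)
        _, vk = sample_v(hk)
    phk, _ = sample_h(vk)                                # negative phase statistics
    # Approximate log-likelihood gradient: positive phase minus negative phase.
    W += lr * (v0.T @ ph0 - vk.T @ phk) / len(v0)
    b += lr * (v0 - vk).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)
```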
A different strategy that resolves many of the problems with CD is to initialize
the Markov chains at each gradient step with their states from the previous gradient
step. This approach was first discovered under the name stochastic maximum likelihood
(SML) in the applied mathematics and statistics community (Younes, 1998) and later
independently rediscovered under the name persistent contrastive divergence (PCD, or
PCD-k to indicate the use of k Gibbs steps per update) in the deep learning commu-
nity (Tieleman, 2008). See Algorithm 15.3. The basic idea of this approach is that, so
long as the steps taken by the stochastic gradient algorithm are small, then the model
from the previous step will be similar to the model from the current step. It follows
that the samples from the previous model’s distribution will be very close to being fair
samples from the current model’s distribution, so a Markov chain initialized with these
samples will not require much time to mix.
Because each Markov chain is continually updated throughout the learning process,
rather than restarted at each gradient step, the chains are free to wander far enough to
find all of the model’s modes. SML is thus considerably more resistant to forming models
with spurious modes than CD is. Moreover, because it is possible to store the state of
all of the sampled variables, whether visible or latent, SML provides an initialization
point for both the hidden and visible units. CD is only able to provide an initialization
for the visible units, and therefore requires burn-in for deep models. SML is able to
train deep models efficiently. Marlin et al. (2010) compared SML to many of the other
criteria presented in this chapter. They found that SML results in the best test set log
likelihood for an RBM, and if the RBM’s hidden units are used as features for an SVM
classifier, SML results in the best classification accuracy.
SML is vulnerable to becoming inaccurate if k is too small or ε is too large; in
other words, if the stochastic gradient algorithm can move the model faster than the
Markov chain can mix between steps. There is no known way to test formally whether
the chain is successfully mixing between steps. Subjectively, if the learning rate is too
high for the number of Gibbs steps, the human operator will be able to observe that
there is much more variance in the negative phase samples across gradient steps
than across different Markov chains. For example, a model trained on MNIST might
sample exclusively 7s on one step. The learning process will then push down strongly on
the mode corresponding to 7s, and the model might sample exclusively 9s on the next
step.
Care must be taken when evaluating the samples from a model trained with SML. It
is necessary to draw the samples starting from a fresh Markov chain initialized from
a random starting point after the model is done training. The samples present in
the persistent negative chains used for training have been influenced by several recent
versions of the model, and thus can make the model appear to have greater capacity
than it actually does.
Berglund and Raiko (2013) performed experiments to examine the bias and variance
in the estimate of the gradient provided by CD and SML. CD proves to have lower variance
than the estimator based on exact sampling. SML has higher variance. The cause of
CD’s low variance is its use of the same training points in both the positive and negative
phase. If the negative phase is initialized from different training points, the variance
rises above that of the estimator based on exact sampling.
Algorithm 15.3 The stochastic maximum likelihood / persistent contrastive divergence algorithm using gradient ascent as the optimization procedure.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(x; θ + εg) to burn in, starting from samples from p(x; θ). Perhaps 1 for an RBM on a small image patch, or 5-50 for a more complicated model like a DBM.
Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x^(i); θ)
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j))
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x̃^(i); θ)
    θ ← θ + εg
end while
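Assuming the hypothetical RBM sketch shown earlier for CD, the only change SML/PCD requires is to keep a set of persistent fantasy particles alive across parameter updates instead of re-initializing the chains at the data; the sketch below reuses the placeholder definitions from that sketch (W, b, c, sample_h, sample_v, data, and so on) and is illustrative only.

```python
# Persistent chains (SML / PCD): initialize fantasy particles once, then let
# them evolve across parameter updates instead of restarting at the data.
# Reuses the placeholder W, b, c, sample_h, sample_v, data, rng, n_vis, k, lr
# from the CD sketch above.
n_chains = 32
v_fantasy = (rng.random((n_chains, n_vis)) < 0.5).astype(float)

for step in range(100):
    v0 = data[rng.choice(len(data), size=n_chains)]
    ph0, _ = sample_h(v0)                      # positive phase from the data
    for _ in range(k):                         # continue the persistent chains
        _, h = sample_h(v_fantasy)
        _, v_fantasy = sample_v(h)
    phk, _ = sample_h(v_fantasy)               # negative phase from the chains
    W += lr * (v0.T @ ph0 - v_fantasy.T @ phk) / n_chains
    b += lr * (v0 - v_fantasy).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)
```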
TODO: FPCD? Rates-FPCD?
TODO: mention that all these things can be coupled with enhanced samplers, which I believe are mentioned in the intro to graphical models chapter.
One key benefit of the MCMC-based methods described in this section is that they
provide an estimate of the gradient of log Z, and thus we can essentially decompose the
problem into the log ˜p contribution and the log Z contribution. We can then use any
other method to tackle log ˜p(x), and just add our negative phase gradient onto the other
method’s gradient. In particular, this means that our positive phase can make use of
methods that provide only a lower bound on ˜p. Most of the other methods of dealing
with log Z presented in this chapter are incompatible with bound-based positive phase
methods.
15.3 Pseudolikelihood
Monte Carlo approximations to the partition function and its gradient confront the par-
tition function head on. Other approaches sidestep the issue, by training the model
without computing the partition function. Most of these approaches are based on the
observation that it is easy to compute ratios of probabilities in an unnormalized prob-
abilistic model. This is because the partition function appears in both the numerator
and the denominator of the ratio and cancels out:
\[
\frac{p(x)}{p(y)} = \frac{\frac{1}{Z} \tilde{p}(x)}{\frac{1}{Z} \tilde{p}(y)} = \frac{\tilde{p}(x)}{\tilde{p}(y)}.
\]
The pseudolikelihood is based on the observation that conditional probabilities take
this ratio-based form, and thus can be computed without knowledge of the partition
function. Suppose that we partition x into a, b, and c, where a contains the variables
we want to find the conditional distribution over, b contains the variables we want to
condition on, and c contains the variables that are not part of our query.
\[
p(a \mid b) = \frac{p(a, b)}{p(b)} = \frac{p(a, b)}{\sum_{a, c} p(a, b, c)} = \frac{\tilde{p}(a, b)}{\sum_{a, c} \tilde{p}(a, b, c)}.
\]
This quantity requires marginalizing out a, which can be a very efficient operation
provided that a and c do not contain very many variables. In the extreme case, a can
be a single variable and c can be empty, making this operation require only as many
evaluations of ˜p as there are values of a single random variable.
Unfortunately, in order to compute the log likelihood, we need to marginalize out large sets of variables. If there are n variables total, we must marginalize a set of size n − 1. By the chain rule of probability,

\[
\log p(x) = \log p(x_1) + \log p(x_2 \mid x_1) + \cdots + \log p(x_n \mid x_{1:n-1}).
\]
In this case, we have made a maximally small, but c can be as large as x_{2:n}. What if we simply move c into b to reduce the computational cost? This yields the pseudolikelihood (Besag, 1975) objective function:

\[
\sum_{i=1}^{n} \log p(x_i \mid x_{-i}).
\]

If each random variable has k different values, this requires only k × n evaluations of ˜p to compute, as opposed to the k^n evaluations needed to compute the partition function.
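A minimal sketch of the pseudolikelihood objective for a small, fully visible Boltzmann machine over binary variables is shown below; the energy function, parameters, and data are placeholder assumptions. For binary x_i, the conditional p(x_i | x_{−i}) reduces to a two-way ratio of ˜p evaluations, so Z never appears.

```python
import numpy as np

# A minimal sketch of the log pseudolikelihood for a fully visible Boltzmann
# machine with p~(x) = exp(0.5 * x.W x + b.x) over binary x. The parameters
# and data below are placeholders; only the objective computation matters.

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.1, size=n)

def log_p_tilde(x):
    return 0.5 * x @ W @ x + b @ x

def log_pseudolikelihood(x):
    """sum_i log p(x_i | x_{-i}), computed from ratios of p~ so Z cancels."""
    total = 0.0
    for i in range(n):
        x_flip = x.copy()
        x_flip[i] = 1.0 - x_flip[i]
        # For binary x_i: p(x_i | x_{-i}) = p~(x) / (p~(x) + p~(x with bit i flipped))
        a = log_p_tilde(x)
        a_flip = log_p_tilde(x_flip)
        total += a - np.logaddexp(a, a_flip)
    return total

x = (rng.random(n) < 0.5).astype(float)
print(log_pseudolikelihood(x))
```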
This may look like an unprincipled hack, but it can be proven that estimation by max-
imizing the log pseudolikelihood is asymptotically consistent (Mase, 1995). Of course,
in the case of datasets that do not approach the large sample limit, pseudolikelihood
may display different behavior from the maximum likelihood estimator.
It is possible to trade computational complexity for deviation from maximum likelihood behavior by using the generalized pseudolikelihood estimator (Huang and Ogata, 2002). The generalized pseudolikelihood estimator uses m different sets S^(i), i = 1, ..., m, of indices of variables that appear together on the left side of the conditioning bar. In the extreme case of m = 1 and S^(1) = {1, ..., n}, the generalized pseudolikelihood recovers the log likelihood. In the extreme case of m = n and S^(i) = {i}, the generalized pseudolikelihood recovers the pseudolikelihood. The generalized pseudolikelihood objective function is given by

\[
\sum_{i=1}^{m} \log p(x_{S^{(i)}} \mid x_{-S^{(i)}}).
\]
The performance of pseudolikelihood-based approaches depends largely on how the
model will be used. Pseudolikelihood tends to perform poorly on tasks that require a
good model of the full joint p(x), such as density estimation and sampling. However, it
can perform better than maximum likelihood for tasks that require only the conditional
distributions used during training, such as filling in small amounts of missing values.
Generalized pseudolikelihood techniques are especially powerful if the data has regular
structure that allows the S index sets to be designed to capture the most important
correlations while leaving out groups of variables that only have negligible correlation.
For example, in natural images, pixels that are widely separated in space also have weak
correlation, so the generalized pseudolikelihood can be applied with each S set being a
small, spatially localized window.
One weakness of the pseudolikelihood estimator is that it cannot be used with other
approximations that provide only a lower bound on ˜p(x), such as variational inference,
which will be covered in chapter 16.4. This is because ˜p appears in the denominator.
A lower bound on the denominator provides only an upper bound on the expression
as a whole, and there is no benefit to maximizing an upper bound. This makes it dif-
ficult to apply pseudolikelihood approaches to deep models such as deep Boltzmann
machines, since variational methods are one of the dominant approaches to approxi-
mately marginalizing out the many layers of hidden variables that interact with each
other. However, pseudolikelihood is still useful for deep learning, because it can be used
to train single layer models, or deep models using approximate inference methods that
are not based on lower bounds.
Pseudolikelihood has a much greater cost per gradient step than SML, due to its ex-
plicit computation of all of the conditionals. However, generalized pseudolikelihood and
similar criteria can still perform well if only one randomly selected conditional is com-
puted per example (Goodfellow et al., 2013c), thereby bringing the computational cost
down to match that of SML.
Though the pseudolikelihood estimator does not explicitly minimize log Z, it can still
be thought of as having something resembling a negative phase. The denominators of
each conditional distribution result in the learning algorithm suppressing the probability
of all states that have only one variable differing from a training example.
15.4 Score matching and ratio matching
Score matching (Hyvärinen, 2005) provides another consistent means of training a model without estimating Z or its derivatives. The strategy used by score matching is to minimize the expected squared difference between the derivatives of the model's log pdf with respect to the input and the derivatives of the data's log pdf with respect to the input:

\[
\theta^{*} = \arg\min_{\theta} J(\theta), \qquad
J(\theta) = \frac{1}{2} \mathbb{E}_x \left\| \nabla_x \log p_{\text{model}}(x; \theta) - \nabla_x \log p_{\text{data}}(x) \right\|_2^2.
\]
Because ∇_x Z = 0, this objective function avoids the difficulties associated with differentiating the partition function. However, it appears to have another difficulty: it requires knowledge of the true distribution generating the training data, p_data. Fortunately, minimizing J(θ) turns out to be equivalent to minimizing
\[
\tilde{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \frac{\partial^2}{\partial x_j^2} \log p_{\text{model}}(x^{(i)}; \theta) + \frac{1}{2} \left( \frac{\partial}{\partial x_j} \log p_{\text{model}}(x^{(i)}; \theta) \right)^{2} \right)
\]

where {x^(1), ..., x^(m)} is the training set and n is the dimensionality of x.
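As an illustration of how J̃(θ) is computed when the first and second derivatives of the model's log density are available in closed form, the sketch below evaluates it for a hypothetical zero-mean Gaussian model parameterized by a precision matrix; the model choice and data are assumptions made only to keep the derivatives simple.

```python
import numpy as np

# A minimal sketch of the score matching objective J~(theta) for a model with
# a hand-coded energy, here a zero-mean Gaussian with precision matrix Lam
# (a placeholder choice so the derivatives have closed form):
# log p~(x) = -0.5 * x.Lam.x, so
#   d/dx_j log p~(x)     = -(Lam x)_j
#   d^2/dx_j^2 log p~(x) = -Lam_jj

rng = np.random.default_rng(0)
n, m = 3, 1000
data = rng.normal(size=(m, n)) @ np.diag([1.0, 2.0, 0.5])   # stand-in data

def score_matching_objective(Lam, X):
    first = -X @ Lam                    # rows: gradient of log p~ at each x
    second = -np.diag(Lam)              # second derivatives, same for every x
    per_example = second.sum() + 0.5 * (first ** 2).sum(axis=1)
    return per_example.mean()

Lam = np.eye(n)
print(score_matching_objective(Lam, data))
# Minimizing this over Lam would recover (approximately) the data precision.
```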
Because score matching requires taking derivatives with respect to x, it is not ap-
plicable to models of discrete data. However, the latent variables in the model may be
discrete.
Like the pseudolikelihood, score matching only works when we are able to evaluate
log ˜p(x) and its derivatives directly. It is not compatible with methods that only pro-
vide a lower bound on log ˜p(x), because we are not able to conclude anything about the
relationship between the derivatives and second derivatives of the lower bound, and the
relationship of the true derivatives and second derivatives needed for score matching.
This means that score matching cannot be applied to estimating models with com-
plicated interactions between the hidden units, such as sparse coding models or deep
Boltzmann machines. Score matching can be used to pretrain the first hidden layer of
a larger model. Score matching has not been applied as a pretraining strategy for the
deeper layers of a larger model, because the hidden layers of such models usually contain
some discrete variables.
While score matching does not explicitly have a negative phase, it can be viewed as
a version of contrastive divergence using a specific kind of Markov chain (Hyvärinen,
2007a). The Markov chain in this case is not Gibbs sampling, but rather a different
approach that makes local moves guided by the gradient. Score matching is equivalent
to CD with this type of Markov chain when the size of the local moves approaches zero.
Lyu (2009) generalized score matching to the discrete case (but made an error in
their derivation that was corrected by Marlin et al. (2010)). Marlin et al. (2010) found
that generalized score matching (GSM) does not work in high dimensional discrete spaces
where the observed probability of many events is 0.
A more successful approach to extending the basic ideas of score matching to discrete data is ratio matching (Hyvärinen, 2007b). Ratio matching applies specifically to binary data. Ratio matching consists of minimizing the following objective function:

\[
J^{(\mathrm{RM})}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \frac{1}{1 + \frac{p_{\text{model}}(x^{(i)}; \theta)}{p_{\text{model}}(f(x^{(i)}, j); \theta)}} \right)^{2}
\]

where f(x, j) returns x with the bit at position j flipped. Ratio matching avoids the partition function using the same trick as the pseudolikelihood estimator: in a ratio of two probabilities, the partition function cancels out. Marlin et al. (2010) found that ratio matching outperforms SML, pseudolikelihood, and GSM in terms of the ability of models trained with ratio matching to denoise test set images.
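A minimal sketch of the ratio matching objective is shown below, reusing the same kind of placeholder fully visible Boltzmann machine as in the pseudolikelihood sketch; since only ratios of probabilities appear, the partition function cancels and only ˜p is needed.

```python
import numpy as np

# A minimal sketch of the ratio matching objective for binary data, using a
# placeholder fully visible Boltzmann machine: log p~(x) = 0.5 x.W x + b.x.
# The ratio p_model(x)/p_model(f(x, j)) equals p~(x)/p~(f(x, j)), so Z cancels.

rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.1, size=n)

def log_p_tilde(x):
    return 0.5 * x @ W @ x + b @ x

def ratio_matching_objective(X):
    total = 0.0
    for x in X:
        for j in range(n):
            x_flip = x.copy()
            x_flip[j] = 1.0 - x_flip[j]
            ratio = np.exp(log_p_tilde(x) - log_p_tilde(x_flip))
            total += (1.0 / (1.0 + ratio)) ** 2
    return total / len(X)

X = (rng.random((10, n)) < 0.5).astype(float)   # stand-in binary data
print(ratio_matching_objective(X))
```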
Like the pseudolikelihood estimator, ratio matching requires n evaluations of ˜p per
data point, making its computational cost per update roughly n times higher than that
of SML.
Like the pseudolikelihood estimator, ratio matching can be thought of as pushing
down on all fantasy states that have only one variable different from a training example.
Since ratio matching applies specifically to binary data, this means that it acts on all
fantasy states within Hamming distance 1 of the data.
Ratio matching can also be useful as the basis for dealing with high-dimensional
sparse data, such as word count vectors. This kind of data poses a challenge for MCMC-
based methods because the data is extremely expensive to represent in dense format,
yet the MCMC sampler does not yield sparse values until the model has learned to
represent the sparsity in the data distribution. Dauphin and Bengio (2013b) overcame
this issue by designing an unbiased stochastic approximation to ratio matching. The
approximation evaluates only a randomly selected subset of the terms of the objective,
and does not require the model to generate complete fantasy samples.
15.5 Denoising score matching
In some cases we may wish to regularize score matching, by fitting a distribution

\[
p_{\text{smoothed}}(x) = \int p_{\text{data}}(x + y) \, q(y \mid x) \, dy
\]

rather than the true p_data. This is especially useful because in practice we usually do not have access to the true p_data but rather only an empirical distribution defined by samples from it. Any consistent estimator will, given enough capacity, make p_model into
a set of Dirac distributions centered on the training points. Smoothing by q helps to
reduce this problem, at the loss of the asymptotic consistency property. Kingma and
LeCun (2010) introduced a procedure for performing regularized score matching with
the smoothing distribution q being normally distributed noise.
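The sketch below illustrates a denoising score matching loss with Gaussian corruption, for a hypothetical model whose score is available in closed form; the corruption level, model, and data are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of a denoising score matching loss with Gaussian corruption
# q(x~ | x) = N(x~; x, sigma^2 I). For this q, the target score is
# grad_{x~} log q(x~ | x) = (x - x~) / sigma^2, so the loss compares the
# model's score at the corrupted point to that quantity (Vincent, 2011).
# The model here is a placeholder Gaussian energy with score -Lam x~.

rng = np.random.default_rng(0)
n, sigma = 3, 0.1
Lam = np.eye(n)                      # placeholder model parameter

def model_score(x):
    return -x @ Lam                  # grad_x log p_model for this toy model

def dsm_loss(X):
    noise = sigma * rng.normal(size=X.shape)
    X_tilde = X + noise
    target = (X - X_tilde) / sigma ** 2          # = -noise / sigma^2
    diff = model_score(X_tilde) - target
    return 0.5 * (diff ** 2).sum(axis=1).mean()

X = rng.normal(size=(1000, n))       # stand-in data
print(dsm_loss(X))
```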
Surprisingly, some denoising autoencoder training algorithms correspond to training
energy-based models with denoising score matching (Vincent, 2011). The denoising
autoencoder variant of the algorithm is significantly less computationally expensive than
score matching. Swersky et al. (2011) showed how to derive the denoising autoencoder
for any energy-based model of real data. This approach is known as denoising score
matching (SMD).
15.6 Noise-contrastive estimation
Most techniques for estimating models with intractable partition functions do not pro-
vide an estimate of the partition function. SML and CD estimate only the gradient of
the log partition function, rather than the partition function itself. Score matching and
pseudolikelihood avoid computing quantities related to the partition function altogether.
Noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) takes a different strategy. In this approach, the probability distribution estimated by the model is represented explicitly as

\[
\log p_{\text{model}}(x) = \log \tilde{p}_{\text{model}}(x; \theta) + c,
\]
where c is explicitly introduced as an approximation of −log Z(θ). Rather than estimating only θ, the noise contrastive estimation procedure treats c as just another parameter and estimates θ and c simultaneously, using the same algorithm for both. The resulting log p_model(x) thus may not correspond exactly to a valid probability distribution, but will become closer and closer to being valid as the estimate of c improves.¹
Such an approach would not be possible using maximum likelihood as the criterion
for the estimator. The maximum likelihood criterion would choose to set c arbitrarily
high, rather than setting c to create a valid probability distribution.
NCE works by reducing the unsupervised learning problem of estimating p(X) to a
supervised learning problem. This supervised learning problem is constructed in such a
way that maximum likelihood estimation in this supervised learning problem defines an
asymptotically consistent estimator of the original problem.
Specifically, we introduce a second distribution, the noise distribution p_noise(x). The noise distribution should be tractable to evaluate and to sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model, we specify that p_joint model(y = 1) = 1/2, p_joint model(x | y = 1) = p_model(x), and p_joint model(x | y = 0) = p_noise(x). In other words, y is a switch variable that determines whether we will generate x from the model or from the noise distribution.

We can construct a similar joint model of training data. In this case, the switch variable determines whether we draw x from the data or from the noise distribution. Formally, p_train(y = 1) = 1/2, p_train(x | y = 1) = p_data(x), and p_train(x | y = 0) = p_noise(x).
We can now just use standard maximum likelihood learning on the supervised learning problem of fitting p_joint model to p_train:

\[
\theta, c = \arg\max_{\theta, c} \, \mathbb{E}_{x, y \sim p_{\text{train}}} \log p_{\text{joint model}}(y \mid x).
\]
It turns out that p_joint model is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

\[
\begin{aligned}
p_{\text{joint model}}(y = 1 \mid x) &= \frac{p_{\text{model}}(x)}{p_{\text{model}}(x) + p_{\text{noise}}(x)} \\
&= \frac{1}{1 + \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)}} \\
&= \frac{1}{1 + \exp\left( \log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)} \\
&= \sigma\left( -\log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right) \\
&= \sigma\left( \log p_{\text{model}}(x) - \log p_{\text{noise}}(x) \right).
\end{aligned}
\]

¹ NCE is also applicable to problems with a tractable partition function, where there is no need to introduce the extra parameter c. However, it has generated the most interest as a means of estimating models with difficult partition functions.
NCE is thus simple to apply so long as log ˜p_model is easy to backpropagate through, and, as specified above, p_noise is easy to evaluate (in order to evaluate p_joint model) and sample from (in order to generate the training data).
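The sketch below illustrates the NCE objective for a hypothetical one-dimensional unnormalized model with a learned constant c, using a standard normal noise distribution; the crude grid search stands in for gradient-based optimization, and every specific choice here is an assumption made for brevity.

```python
import numpy as np

# A minimal sketch of the NCE objective for a 1-D unnormalized model
# p~(x; theta) = exp(-theta * x^2), with a learned constant c standing in for
# -log Z, and a standard normal noise distribution. All choices below are
# illustrative placeholders.

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def log_p_model(x, theta, c):
    return -theta * x ** 2 + c            # log p~ + c

def log_p_noise(x):
    return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

def nce_objective(theta, c, data, noise):
    # Classify data as y=1 and noise samples as y=0, using
    # sigma(log p_model - log p_noise) as the classifier.
    logits_data = log_p_model(data, theta, c) - log_p_noise(data)
    logits_noise = log_p_model(noise, theta, c) - log_p_noise(noise)
    return (np.log(sigmoid(logits_data)).mean()
            + np.log(1.0 - sigmoid(logits_noise)).mean())

data = rng.normal(scale=0.5, size=2000)   # stand-in data from N(0, 0.5^2)
noise = rng.normal(size=2000)             # samples from p_noise

# Crude grid search over (theta, c); a real implementation would use gradients.
best = max((nce_objective(t, c, data, noise), t, c)
           for t in np.linspace(0.5, 4.0, 36)
           for c in np.linspace(-2.0, 2.0, 41))
print(best)   # theta should approach 1/(2*0.5^2) = 2, and c should approach -log Z
```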
NCE is most successful when applied to problems with few random variables, but
can work well even if those random variables can take on a high number of values. For
example, it has been successfully applied to modeling the conditional distribution over
a word given the context of the word (Mnih and Kavukcuoglu, 2013). Though the word
may be drawn from a large vocabulary, there is only one word.
When NCE is applied to problems with many random variables, it becomes less
efficient. The logistic regression classifier can reject a noise sample by identifying any
one variable whose value is unlikely. This means that learning slows down greatly after
p_model has learned the basic marginal statistics. Imagine learning a model of images of faces, using unstructured Gaussian noise as p_noise. If p_model learns about eyes, it can reject almost all unstructured noise samples without having learned anything about other facial features, such as mouths.
The constraint that p_noise must be easy to evaluate and easy to sample from can be overly restrictive. When p_noise is simple, most samples are likely to be too obviously distinct from the data to force p_model to improve noticeably.
Like score matching and pseudolikelihood, NCE does not work if only a lower bound on ˜p is available. Such a lower bound could be used to construct a lower bound on p_joint model(y = 1 | x), but it can only be used to construct an upper bound on p_joint model(y = 0 | x), which appears in half the terms of the NCE objective. Likewise, a lower bound on p_noise is not useful, because it provides only an upper bound on p_joint model(y = 1 | x).
TODO: put herding in this chapter? and if not, where to put it?
TODO: cite the Bregman divergence paper?