Chapter 18
Confronting the Partition Function
In Section 13.2.2 we saw that many probabilistic models (commonly known
as undirected graphical models) are defined by an unnormalized probability dis-
tribution ˜p(x; θ) or energy function (Section 13.2.4)
E(x) = -\log \tilde{p}(x). \qquad (18.1)
Because the analytic formulation of the model is via this energy function or un-
normalized probability, the complete formulation of the probability function or
probability density requires a normalization constant called the partition function
Z(θ) such that

p(x; \theta) = \frac{1}{Z(\theta)} \tilde{p}(x; \theta)
is a valid, normalized probability distribution. The partition function is an in-
tegral or sum over the unnormalized probability of all states. This operation is
intractable for many interesting models.
As we will see in chapter 20, many deep learning models are designed to have
a tractable normalizing constant, or are designed to be used in ways that do not
involve computing p(x) at all. However, other models directly confront the chal-
lenge of intractable partition functions. In this chapter, we describe techniques
used for training and evaluating models that have intractable partition functions.
18.1 The Log-Likelihood Gradient of Energy-Based
Models
What makes learning by maximum likelihood particularly difficult is that the
partition function depends on the parameters, so that the log-likelihood gradient
has a term corresponding to the gradient of the partition function:
\frac{\partial \log p(x; \theta)}{\partial \theta} = -\frac{\partial E(x)}{\partial \theta} - \frac{\partial \log Z(\theta)}{\partial \theta}. \qquad (18.2)
In the case where the energy function is analytically tractable (e.g., RBMs), the
difficult part is estimating the gradient of the partition function. Unsurpris-
ingly, since computing Z itself is intractable, we find that computing its gradient
is also intractable, but the good news is that it corresponds to an expectation
over the model distribution, which can be estimated by Monte-Carlo methods.
Though the gradient of the log partition function is intractable to evaluate
accurately, it is straightforward to analyze algebraically. The derivatives we need
for learning are of the form
\frac{\partial}{\partial \theta} \log p(x),

where θ is one of the parameters of p(x). These derivatives are given simply by

\frac{\partial}{\partial \theta} \log p(x) = \frac{\partial}{\partial \theta} \left( \log \tilde{p}(x) - \log Z \right).
In this chapter, we are primarily concerned with the estimation of the term
on the right:
\frac{\partial}{\partial \theta} \log Z
= \frac{\frac{\partial}{\partial \theta} Z}{Z}
= \frac{\frac{\partial}{\partial \theta} \sum_x \tilde{p}(x)}{Z}
= \frac{\sum_x \frac{\partial}{\partial \theta} \tilde{p}(x)}{Z}.

For models that guarantee p(x) > 0 for all x, we can substitute exp(log ˜p(x)) for ˜p(x):

\frac{\partial}{\partial \theta} \log Z
= \frac{\sum_x \frac{\partial}{\partial \theta} \exp\left(\log \tilde{p}(x)\right)}{Z}
= \frac{\sum_x \exp\left(\log \tilde{p}(x)\right) \frac{\partial}{\partial \theta} \log \tilde{p}(x)}{Z}
= \frac{\sum_x \tilde{p}(x) \frac{\partial}{\partial \theta} \log \tilde{p}(x)}{Z}
= \sum_x p(x) \frac{\partial}{\partial \theta} \log \tilde{p}(x)
= \mathbb{E}_{x \sim p(x)} \frac{\partial}{\partial \theta} \log \tilde{p}(x).
This derivation made use of summation over discrete x, but a similar result
applies using integration over continuous x. In the continuous version of the
derivation, we use Leibniz’s rule for differentiation under the integral sign to
obtain the identity
\frac{\partial}{\partial \theta} \int \tilde{p}(x) \, dx = \int \frac{\partial}{\partial \theta} \tilde{p}(x) \, dx.

This identity is only applicable under certain regularity conditions on ˜p and
\frac{\partial}{\partial \theta} \tilde{p}(x).¹ Fortunately, most machine learning models of interest have these properties.
This identity
\frac{\partial}{\partial \theta} \log Z = \mathbb{E}_{x \sim p(x)} \frac{\partial}{\partial \theta} \log \tilde{p}(x) \qquad (18.3)
is the basis for a variety of Monte Carlo methods for approximately maximizing
the likelihood of models with intractable partition functions.
Putting this result together with Eq. 18.2, we obtain the following well-known
decomposition of the gradient in terms of the gradient of the energy function at
the observed x and its average over the model distribution:

\frac{\partial \log p(x; \theta)}{\partial \theta} = -\frac{\partial E(x)}{\partial \theta} + \mathbb{E}_{x \sim p(x)} \frac{\partial E(x)}{\partial \theta}. \qquad (18.4)
The first term is called the positive phase contribution to the gradient. It corresponds
to pushing the energy down on the “positive” examples, reinforcing the interactions
that are observed between random variables when x is observed. The second term is
called the negative phase contribution to the gradient. It corresponds to pushing the
energy up everywhere else, with proportionally more push where the model currently
puts more probability mass. When a minimum of the negative log-likelihood is found,
the two terms must of course cancel each other; the only thing that then prevents the
model from putting probability mass in exactly the same way as the training distribution
is that the model may be regularized or have some constraints, e.g. be parametric.
¹ In measure theoretic terms, the conditions are: (i) ˜p must be a Lebesgue-integrable function
of x for every value of θ; (ii) \frac{\partial}{\partial \theta} \tilde{p}(x) must exist for all θ and almost all x; (iii) there exists
an integrable function R(x) that bounds \frac{\partial}{\partial \theta} \tilde{p}(x) (i.e., such that |\frac{\partial}{\partial \theta} \tilde{p}(x)| \leq R(x) for all θ and
almost all x).
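The identity in Eq. 18.3 can be checked numerically on any model small enough to enumerate. The following sketch does so for a toy distribution ˜p(x; θ) = exp(θᵀx) over binary vectors; the model and all of the names in the code are illustrative choices, not anything defined in this chapter.

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4
theta = rng.normal(size=n)

# All 2^n binary states, so Z(theta) can be computed exactly by enumeration.
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def log_ptilde(x, theta):
    return x @ theta          # unnormalized log-probability: log ptilde(x; theta) = theta^T x

log_Z = np.log(np.sum(np.exp(log_ptilde(states, theta))))
p = np.exp(log_ptilde(states, theta) - log_Z)        # normalized model distribution p(x; theta)

# Left-hand side of Eq. 18.3: gradient of log Z, computed here by central finite differences.
eps = 1e-6
lhs = np.zeros(n)
for i in range(n):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    lhs[i] = (np.log(np.sum(np.exp(log_ptilde(states, tp))))
              - np.log(np.sum(np.exp(log_ptilde(states, tm))))) / (2 * eps)

# Right-hand side: expectation under p of the gradient of log ptilde, which is simply x here.
rhs = p @ states

print(np.allclose(lhs, rhs, atol=1e-5))   # True: the two sides of Eq. 18.3 agree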
18.2 Stochastic Maximum Likelihood and Contrastive
Divergence
The naive way of implementing equation 18.3 is to compute it by burning in a set
of Markov chains from a random initialization every time the gradient is needed.
When learning is performed using stochastic gradient descent, this means the
chains must be burned in once per gradient step. This approach leads to the
training procedure presented in Algorithm 18.1. The high cost of burning in the
Markov chains in the inner loop makes this procedure computationally infeasible,
but this procedure is the starting point that other more practical algorithms aim
to approximate.
Algorithm 18.1 A naive MCMC algorithm for maximizing the log likelihood
with an intractable partition function using gradient ascent.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow burn in. Perhaps 100
  to train an RBM on a small image patch.
  while not converged do
    Sample a minibatch of m examples {x^{(1)}, . . . , x^{(m)}} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x^{(i)}; θ)
    Initialize a set of m samples {˜x^{(1)}, . . . , ˜x^{(m)}} to random values (e.g., from
    a uniform or normal distribution, or possibly a distribution with marginals
    matched to the model’s marginals).
    for i = 1 to k do
      for j = 1 to m do
        ˜x^{(j)} ← gibbs_update(˜x^{(j)})
      end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(˜x^{(i)}; θ)
    θ ← θ + εg
  end while
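As a concrete rendering of Algorithm 18.1, the sketch below writes the same loop in Python. The functions grad_log_ptilde (returning per-example gradients of log ˜p) and gibbs_update are hypothetical placeholders for whatever model is being trained; nothing here is tied to a particular library.

import numpy as np

def naive_mcmc_training(data, grad_log_ptilde, gibbs_update, theta,
                        epsilon=1e-3, k=100, m=64, n_steps=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        # Positive phase: gradient of log ptilde at a minibatch of training points.
        batch = data[rng.choice(len(data), size=m, replace=False)]
        g = grad_log_ptilde(batch, theta).mean(axis=0)

        # Negative phase: burn in fresh Markov chains from a random initialization.
        x_tilde = rng.uniform(size=batch.shape)
        for _ in range(k):
            x_tilde = gibbs_update(x_tilde, theta)
        g = g - grad_log_ptilde(x_tilde, theta).mean(axis=0)

        theta = theta + epsilon * g        # gradient ascent on the log-likelihood
    return theta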
We can view the MCMC approach to maximum likelihood as trying to achieve
balance between two forces, one pushing up on the model distribution where the
data occurs, and another pushing down on the model distribution where the model
samples occur. Fig. 18.1 illustrates this process. The two forces correspond to
maximizing log ˜p and minimizing log Z. In this chapter, we assume the positive
phase is tractable and may be performed exactly, but other chapters, especially
chapter 19, deal with intractable positive phases. In this chapter, we present
several approximations to the negative phase. Each of these approximations can
be understood as making the negative phase computationally cheaper, but also
as making it push down in the wrong locations.

Figure 18.1: The view of Algorithm 18.1 as having a “positive phase” and “negative
phase”. Left) In the positive phase, we sample points from the data distribution, and
push up on their unnormalized probability. This means points that are likely in the data
get pushed up on more. Right) In the negative phase, we sample points from the model
distribution, and push down on their unnormalized probability. This counteracts the
positive phase’s tendency to just add a large constant to the unnormalized probability
everywhere. When the data distribution and the model distribution are equal, the positive
phase has the same chance to push up at a point as the negative phase has to push down.
At this point, there is no longer any gradient (in expectation) and training must terminate.
Because the negative phase involves drawing samples from the model’s distri-
bution, we can think of it as finding points that the model believes in strongly.
Because the negative phase acts to reduce the probability of those points, they are
generally considered to represent the model’s incorrect beliefs about the world.
They are frequently referred to in the literature as “hallucinations” or “fantasy
particles.” In fact, the negative phase has been proposed as a possible explana-
tion for dreaming in humans and other animals (Crick and Mitchison, 1983), the
idea being that the brain maintains a probabilistic model of the world and fol-
lows the gradient of log ˜p while experiencing real events while awake and follows
the negative gradient of log ˜p to minimize log Z while sleeping and experiencing
events sampled from the current model. This view explains much of the language
used to describe algorithms with a positive and negative phase, but it has not
been proven to be correct with neuroscientific experiments. In machine learning
models, it is usually necessary to use the positive and negative phase simultane-
ously, rather than in separate time periods of wakefulness and REM sleep. As
we will see in section 19.6, other machine learning algorithms draw samples from
the model distribution for other purposes and such algorithms could also provide
an account for the function of dream sleep.
Given this understanding of the role of the positive and negative phase of
learning, we can attempt to design a less expensive alternative to Algorithm 18.1.
The main cost of the naive MCMC algorithm is the cost of burning in the Markov
chains from a random initialization at each step. A natural solution is to initialize
the Markov chains from a distribution that is very close to the model distribution,
so that the burn in operation does not take as many steps.
The contrastive divergence (CD, or CD-k to indicate CD with k Gibbs steps)
algorithm initializes the Markov chain at each step with samples from the data
distribution (Hinton, 2000). This approach is presented as Algorithm 18.2. Ob-
taining samples from the data distribution is free, because they are already avail-
able in the data set. Initially, the data distribution is not close to the model
distribution, so the negative phase is not very accurate. Fortunately, the positive
phase can still accurately increase the model’s probability of the data. After the
positive phase has had some time to act, the model distribution is closer to the
data distribution, and the negative phase starts to become accurate.
Algorithm 18.2 The contrastive divergence algorithm, using gradient ascent as
the optimization procedure.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow a Markov chain of
  p(x; θ) to mix when initialized from p_data. Perhaps 1–20 to train an RBM on
  a small image patch.
  while not converged do
    Sample a minibatch of m examples {x^{(1)}, . . . , x^{(m)}} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x^{(i)}; θ)
    for i = 1 to m do
      ˜x^{(i)} ← x^{(i)}
    end for
    for i = 1 to k do
      for j = 1 to m do
        ˜x^{(j)} ← gibbs_update(˜x^{(j)})
      end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(˜x^{(i)}; θ)
    θ ← θ + εg
  end while
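For a binary RBM with weights W and biases b (visible) and c (hidden), assuming the standard energy E(v, h) = −bᵀv − cᵀh − vᵀWh, the gradient of log ˜p(v) with respect to W is the outer product of v and the hidden activation probabilities σ(Wᵀv + c). A minimal sketch of one CD-k update in the spirit of Algorithm 18.2 (all names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v_data, W, b, c, k=1, epsilon=1e-2, rng=None):
    rng = rng or np.random.default_rng(0)
    m, n_hidden = v_data.shape[0], W.shape[1]

    # Positive phase statistics, computed at the data.
    h_prob_data = sigmoid(v_data @ W + c)

    # Negative phase: initialize the chains at the data and run k Gibbs sweeps.
    v = v_data.copy()
    for _ in range(k):
        h = (rng.uniform(size=(m, n_hidden)) < sigmoid(v @ W + c)).astype(float)
        v = (rng.uniform(size=v.shape) < sigmoid(h @ W.T + b)).astype(float)
    h_prob_model = sigmoid(v @ W + c)

    # Stochastic estimate of the log-likelihood gradient: positive minus negative phase.
    grad_W = (v_data.T @ h_prob_data - v.T @ h_prob_model) / m
    grad_b = (v_data - v).mean(axis=0)
    grad_c = (h_prob_data - h_prob_model).mean(axis=0)
    return W + epsilon * grad_W, b + epsilon * grad_b, c + epsilon * grad_c

In practice this update would be applied repeatedly over minibatches, usually with a decaying step size ε.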
Of course, CD is still an approximation to the correct negative phase. The
main way that CD qualitatively fails to implement the correct negative phase
is that it fails to suppress “spurious modes”: regions of high probability that
are far from actual training examples. Fig. 18.2 illustrates why this happens.
Essentially, it is because modes in the model distribution that are far from the data
distribution will not be visited by Markov chains initialized at training points,
unless k is very large.
Carreira-Perpiñán and Hinton (2005) showed experimentally that the CD es-
timator is biased for RBMs and fully visible Boltzmann machines, in that it
converges to different points than the maximum likelihood estimator. They argue
that because the bias is small, CD could be used as an inexpensive way to initial-
ize a model that could later be fine-tuned via more expensive MCMC methods.
Bengio and Delalleau (2009) showed that CD can be interpreted as discarding the
smallest terms of the correct MCMC update gradient, which explains the bias.
CD is useful for training shallow models like RBMs. These can in turn be
stacked to initialize deeper models like DBNs or DBMs. However, CD does not
provide much help for training deeper models directly. This is because it is difficult
to obtain samples of the hidden units given samples of the visible units. Since
the hidden units are not included in the data, initializing from training points
cannot solve the problem. Even if we initialize the visible units from the data,
we will still need to burn in a Markov chain sampling from the distribution over
the hidden units conditioned on those visible samples. Most of the approximate
inference techniques described in chapter 19 for approximately marginalizing out
the hidden units cannot be used to solve this problem. This is because all of the
approximate marginalization methods based on giving a lower bound on ˜p would
give a lower bound on log Z. We need to minimize log Z, and minimizing a lower
bound is not a useful operation.
The CD algorithm can be thought of as penalizing the model for having a
Markov chain that changes the input rapidly when the input comes from the
data. This means training with CD somewhat resembles autoencoder training.
Even though CD is more biased than some of the other training methods, it
can be useful for pre-training shallow models that will later be stacked. This is
because the earliest models in the stack are encouraged to copy more information
up to their latent variables, thereby making it available to the later models. This
should be thought of more of as an often-exploitable side effect of CD training
rather than a principled design advantage.
Sutskever and Tieleman (2010) showed that the CD update direction is not
the gradient of any function. This allows for situations where CD could cycle
forever, but in practice this is not a serious problem.
A different strategy that resolves many of the problems with CD is to initialize
the Markov chains at each gradient step with their states from the previous gradi-
ent step. This approach was first discovered under the name stochastic maximum
likelihood (SML) in the applied mathematics and statistics community (Younes,
1998) and later independently rediscovered under the name persistent contrastive
divergence (PCD, or PCD-k to indicate the use of k Gibbs steps per update) in
the deep learning community (Tieleman, 2008). See Algorithm 18.3. The basic
idea of this approach is that, so long as the steps taken by the stochastic gradient
algorithm are small, the model from the previous step will be similar to the model
from the current step. It follows that the samples from the previous model’s
distribution will be very close to being fair samples from the current model’s
distribution, so a Markov chain initialized with these samples will not require
much time to mix.

Figure 18.2: An illustration of how the negative phase of contrastive divergence (Algorithm
18.2) can fail to suppress spurious modes. A spurious mode is a mode that is present in
the model distribution but absent in the data distribution. Because contrastive divergence
initializes its Markov chains from data points and runs the Markov chain for only a few
steps, it is unlikely to visit modes in the model that are far from the data points. This
means that when sampling from the model, we will sometimes get samples that do not
resemble the data. It also means that due to wasting some of its probability mass on these
modes, the model will struggle to place high probability mass on the correct modes. Note
that this figure uses a somewhat simplified concept of distance: the spurious mode is far
from the correct mode along the number line in R. This corresponds to a Markov chain
based on making local moves with a single x variable in R. For most deep probabilistic
models, the Markov chains are based on Gibbs sampling and can make non-local moves of
individual variables but cannot move all of the variables simultaneously. For these problems,
it is usually better to consider the edit distance between modes, rather than the Euclidean
distance. However, edit distance in a high dimensional space is difficult to depict in a 2-D plot.
Because each Markov chain is continually updated throughout the learning
process, rather than restarted at each gradient step, the chains are free to wander
far enough to find all of the model’s modes. SML is thus considerably more
resistant to forming models with spurious modes than CD is. Moreover, because
it is possible to store the state of all of the sampled variables, whether visible or
latent, SML provides an initialization point for both the hidden and visible units.
CD is only able to provide an initialization for the visible units, and therefore
requires burn-in for deep models. SML is able to train deep models efficiently.
Marlin et al. (2010) compared SML to many of the other criteria presented in this
chapter. They found that SML results in the best test set log likelihood for an
RBM, and if the RBM’s hidden units are used as features for an SVM classifier,
SML results in the best classification accuracy.
SML is vulnerable to becoming inaccurate if k is too small or ε is too large,
in other words, if the stochastic gradient algorithm can move the model faster
than the Markov chain can mix between steps. There is no known way to test
formally whether the chain is successfully mixing between steps. Subjectively, if
the learning rate is too high for the number of Gibbs steps, the human operator
will be able to observe that there is much more variance in the negative phase
samples across gradient steps than across different Markov chains. For
example, a model trained on MNIST might sample exclusively 7s on one step.
The learning process will then push down strongly on the mode corresponding to
7s, and the model might sample exclusively 9s on the next step.
Care must be taken when evaluating the samples from a model trained with
SML. It is necessary to draw the samples starting from a fresh Markov chain
initialized from a random starting point after the model is done training. The
samples present in the persistent negative chains used for training have been
influenced by several recent versions of the model, and thus can make the model
appear to have greater capacity than it actually does.
Berglund and Raiko (2013) performed experiments to examine the bias and
variance in the estimate of the gradient provided by CD and SML. CD proves to
have lower variance than the estimator based on exact sampling. SML has higher
variance. The cause of CD’s low variance is its use of the same training points
in both the positive and negative phase. If the negative phase is initialized from
different training points, the variance rises above that of the estimator based on
exact sampling.

Algorithm 18.3 The stochastic maximum likelihood / persistent contrastive
divergence algorithm, using gradient ascent as the optimization procedure.

  Set ε, the step size, to a small positive number.
  Set k, the number of Gibbs steps, high enough to allow a Markov chain of
  p(x; θ + εg) to burn in, starting from samples from p(x; θ). Perhaps 1 for an RBM
  on a small image patch, or 5–50 for a more complicated model like a DBM.
  Initialize a set of m samples {˜x^{(1)}, . . . , ˜x^{(m)}} to random values (e.g., from a
  uniform or normal distribution, or possibly a distribution with marginals matched
  to the model’s marginals).
  while not converged do
    Sample a minibatch of m examples {x^{(1)}, . . . , x^{(m)}} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(x^{(i)}; θ)
    for i = 1 to k do
      for j = 1 to m do
        ˜x^{(j)} ← gibbs_update(˜x^{(j)})
      end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇_θ log ˜p(˜x^{(i)}; θ)
    θ ← θ + εg
  end while
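The only change a PCD/SML update makes relative to the CD-k sketch above is where the negative-phase chains start: they are passed in from the previous gradient step and returned for the next one, rather than being re-initialized at the data. A minimal sketch, reusing numpy and the sigmoid helper defined in that earlier example (again, all names are illustrative):

def pcd_update(v_data, v_persistent, W, b, c, k=1, epsilon=1e-2, rng=None):
    rng = rng or np.random.default_rng(0)
    h_prob_data = sigmoid(v_data @ W + c)          # positive phase, evaluated at the data

    v = v_persistent                               # persistent chains carry over between updates
    for _ in range(k):
        h = (rng.uniform(size=(v.shape[0], W.shape[1])) < sigmoid(v @ W + c)).astype(float)
        v = (rng.uniform(size=v.shape) < sigmoid(h @ W.T + b)).astype(float)
    h_prob_model = sigmoid(v @ W + c)              # negative phase, evaluated at the chain states

    grad_W = v_data.T @ h_prob_data / len(v_data) - v.T @ h_prob_model / len(v)
    grad_b = v_data.mean(axis=0) - v.mean(axis=0)
    grad_c = h_prob_data.mean(axis=0) - h_prob_model.mean(axis=0)
    # Return the updated parameters and the chain states to be reused at the next step.
    return W + epsilon * grad_W, b + epsilon * grad_b, c + epsilon * grad_c, v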
One key benefit to the MCMC-based methods described in this section is that
they provide an estimate of the gradient of log Z, and thus we can essentially
decompose the problem into the log ˜p contribution and the log Z contribution.
We can then use any other method to tackle log ˜p(x), and just add our negative
phase gradient onto the other method’s gradient. In particular, this means that
our positive phase can make use of methods that provide only a lower bound on
˜p. Most of the other methods of dealing with log Z presented in this chapter are
incompatible with bound-based positive phase methods.
18.3 Pseudolikelihood
Monte Carlo approximations to the partition function and its gradient directly
confront the partition function. Other approaches sidestep the issue, by training
the model without computing the partition function. Most of these approaches
are based on the observation that it is easy to compute ratios of probabilities
in an unnormalized probabilistic model. This is because the partition function
appears in both the numerator and the denominator of the ratio and cancels out:
\frac{p(x)}{p(y)} = \frac{\frac{1}{Z} \tilde{p}(x)}{\frac{1}{Z} \tilde{p}(y)} = \frac{\tilde{p}(x)}{\tilde{p}(y)}.
The pseudolikelihood is based on the observation that conditional probabilities
take this ratio-based form, and thus can be computed without knowledge of the
partition function. Suppose that we partition x into a, b, and c, where a contains
the variables we want to find the conditional distribution over, b contains the
variables we want to condition on, and c contains the variables that are not part
of our query.
p(a \mid b) = \frac{p(a, b)}{p(b)} = \frac{p(a, b)}{\sum_{a,c} p(a, b, c)} = \frac{\tilde{p}(a, b)}{\sum_{a,c} \tilde{p}(a, b, c)}.
This quantity requires marginalizing out a, which can be a very efficient operation
provided that a and c do not contain very many variables. In the extreme case, a
can be a single variable and c can be empty, making this operation require only
as many evaluations of ˜p as there are values of a single random variable.
Unfortunately, in order to compute the log likelihood, we need to marginalize
out large sets of variables. If there are n variables total, we must marginalize a
set of size n − 1. By the chain rule of probability,

\log p(x) = \log p(x_1) + \log p(x_2 \mid x_1) + \cdots + \log p(x_n \mid x_{1:n-1}).

In this case, we have made a maximally small, but c can be as large as x_{2:n}.
What if we simply move c into b to reduce the computational cost? This yields
the pseudolikelihood (Besag, 1975) objective function:

\sum_{i=1}^{n} \log p(x_i \mid x_{-i}).
If each random variable has k different values, this requires only k × n evaluations
of ˜p to compute, as opposed to the k^n evaluations needed to compute the
partition function.
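For binary variables each conditional involves only the two values of x_i, so the whole pseudolikelihood can be computed from 2n evaluations of ˜p per example. A minimal sketch, where log_ptilde is a hypothetical user-supplied function returning log ˜p for a batch of states:

import numpy as np

def log_pseudolikelihood(x, theta, log_ptilde):
    """x: (m, n) array of binary vectors; returns the mean log pseudolikelihood."""
    m, n = x.shape
    lp_x = log_ptilde(x, theta)                    # log ptilde at the observed states
    total = 0.0
    for j in range(n):
        x_flip = x.copy()
        x_flip[:, j] = 1.0 - x_flip[:, j]          # the other value of variable j
        lp_flip = log_ptilde(x_flip, theta)
        # log p(x_j | x_-j) = log ptilde(x) - log(ptilde(x) + ptilde(x with bit j flipped));
        # the partition function cancels between numerator and denominator.
        total += np.sum(lp_x - np.logaddexp(lp_x, lp_flip))
    return total / m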
This may look like an unprincipled hack, but it can be proven that estimation
by maximizing the log pseudolikelihood is asymptotically consistent (Mase, 1995).
Of course, in the case of datasets that do not approach the large sample limit,
pseudolikelihood may display different behavior from the maximum likelihood
estimator.
It is possible to trade computational complexity for deviation from maximum
likelihood behavior by using the generalized pseudolikelihood estimator (Huang
and Ogata, 2002). The generalized pseudolikelihood estimator uses m different
sets S^{(i)}, i = 1, . . . , m, of indices of variables that appear together on the left side
of the conditioning bar. In the extreme case of m = 1 and S^{(1)} = {1, . . . , n}, the
generalized pseudolikelihood recovers the log likelihood. In the extreme case of m = n
and S^{(i)} = {i}, the generalized pseudolikelihood recovers the pseudolikelihood.
The generalized pseudolikelihood objective function is given by

\sum_{i=1}^{m} \log p(x_{S^{(i)}} \mid x_{-S^{(i)}}).
The performance of pseudolikelihood-based approaches depends largely on
how the model will be used. Pseudolikelihood tends to perform poorly on tasks
that require a good model of the full joint p(x), such as density estimation and
sampling. However, it can perform better than maximum likelihood for tasks
that require only the conditional distributions used during training, such as fill-
ing in small amounts of missing values. Generalized pseudolikelihood techniques
are especially powerful if the data has regular structure that allows the S index
sets to be designed to capture the most important correlations while leaving out
groups of variables that only have negligible correlation. For example, in natural
images, pixels that are widely separated in space also have weak correlation, so
the generalized pseudolikelihood can be applied with each S set being a small,
spatially localized window.
One weakness of the pseudolikelihood estimator is that it cannot be used with
other approximations that provide only a lower bound on ˜p(x), such as variational
inference, which will be covered in section 19.4. This is because ˜p appears in the
denominator. A lower bound on the denominator provides only an upper bound
on the expression as a whole, and there is no benefit to maximizing an upper
bound. This makes it difficult to apply pseudolikelihood approaches to deep
models such as deep Boltzmann machines, since variational methods are one of
the dominant approaches to approximately marginalizing out the many layers of
hidden variables that interact with each other. However, pseudolikelihood is still
useful for deep learning, because it can be used to train single layer models, or
deep models using approximate inference methods that are not based on lower
bounds.
Pseudolikelihood has a much greater cost per gradient step than SML, due to
its explicit computation of all of the conditionals. However, generalized pseudo-
likelihood and similar criteria can still perform well if only one randomly selected
conditional is computed per example (Goodfellow et al., 2013b), thereby bringing
the computational cost down to match that of SML.
Though the pseudolikelihood estimator does not explicitly minimize log Z, it
can still be thought of as having something resembling a negative phase. The
denominators of each conditional distribution result in the learning algorithm
suppressing the probability of all states that have only one variable differing from
a training example.
18.4 Score Matching and Ratio Matching
Score matching (Hyvärinen, 2005b) provides another consistent means of training
a model without estimating Z or its derivatives. The strategy used by score
matching is to minimize the expected squared difference between the derivatives
of the model’s log pdf with respect to the input and the derivatives of the data’s
log pdf with respect to the input:

\theta^* = \min_\theta J(\theta) = \frac{1}{2} \mathbb{E}_x \left\| \nabla_x \log p_{\text{model}}(x; \theta) - \nabla_x \log p_{\text{data}}(x) \right\|_2^2.
Because ∇_x Z = 0, this objective function avoids the difficulties associated
with differentiating the partition function. However, it appears to have another
difficulty: it requires knowledge of the true distribution generating the training
data, p_data. Fortunately, minimizing J(θ) turns out to be equivalent to minimizing
\tilde{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \frac{\partial^2}{\partial x_j^2} \log p_{\text{model}}(x^{(i)}; \theta) + \frac{1}{2} \left( \frac{\partial}{\partial x_j} \log p_{\text{model}}(x^{(i)}; \theta) \right)^2 \right),

where {x^{(1)}, . . . , x^{(m)}} is the training set and n is the dimensionality of x.
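As an illustration, the sketch below evaluates J̃(θ) for a model whose unnormalized log-density is log ˜p(x) = −½ xᵀAx + bᵀx, chosen only because its first and second derivatives with respect to x have simple closed forms; the parameters A and b stand in for θ and are not taken from the text.

import numpy as np

def score_matching_objective(X, A, b):
    """X: (m, n) data matrix; A: symmetric (n, n); b: (n,). Returns the empirical Jtilde."""
    score = b - X @ A            # first derivatives d/dx_j log ptilde(x), one row per example
    second = -np.diag(A)         # second derivatives d^2/dx_j^2 log ptilde(x), constant in x
    per_example = second.sum() + 0.5 * np.sum(score ** 2, axis=1)
    return per_example.mean()

For this particular model the minimizer over A and b is the sample precision matrix and precision-times-mean, so the estimator simply fits a Gaussian to the data.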
Because score matching requires taking derivatives with respect to x, it is not
applicable to models of discrete data. However, the latent variables in the model
may be discrete.
Like the pseudolikelihood, score matching only works when we are able to eval-
uate log ˜p(x) and its derivatives directly. It is not compatible with methods that
provide only a lower bound on log ˜p(x), because a lower bound tells us nothing
about the relationship between its derivatives and second derivatives and the
true derivatives and second derivatives needed for score matching. This means
that score matching cannot be applied to
estimating models with complicated interactions between the hidden units, such
as sparse coding models or deep Boltzmann machines. Score matching can be
used to pretrain the first hidden layer of a larger model. Score matching has not
been applied as a pretraining strategy for the deeper layers of a larger model,
because the hidden layers of such models usually contain some discrete variables.
While score matching does not explicitly have a negative phase, it can be
viewed as a version of contrastive divergence using a specific kind of Markov
chain (Hyvärinen, 2007a). The Markov chain in this case is not Gibbs sampling,
but rather a different approach that makes local moves guided by the gradient.
Score matching is equivalent to CD with this type of Markov chain when the size
of the local moves approaches zero.
Lyu (2009) generalized score matching to the discrete case (but made an
error in their derivation that was corrected by Marlin et al. (2010)). Marlin
et al. (2010) found that generalized score matching (GSM) does not work in high
dimensional discrete spaces where the observed probability of many events is 0.
A more successful approach to extending the basic ideas of score matching
to discrete data is ratio matching (Hyvärinen, 2007b). Ratio matching applies
specifically to binary data. Ratio matching consists of minimizing the following
objective function:

J^{(RM)}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \frac{1}{1 + \frac{p_{\text{model}}(x^{(i)}; \theta)}{p_{\text{model}}(f(x^{(i)}, j); \theta)}} \right)^2,

where f(x, j) returns x with the bit at position j flipped. Ratio matching avoids
the partition function using the same trick as the pseudolikelihood estimator: in
a ratio of two probabilities, the partition function cancels out. Marlin et al.
(2010) found that ratio matching outperforms SML, pseudolikelihood, and GSM
in terms of the ability of models trained with ratio matching to denoise test set
images.
Like the pseudolikelihood estimator, ratio matching requires n evaluations of ˜p
per data point, making its computational cost per update roughly n times higher
than that of SML.
Like the pseudolikelihood estimator, ratio matching can be thought of as push-
ing down on all fantasy states that have only one variable different from a training
example. Since ratio matching applies specifically to binary data, this means that
it acts on all fantasy states within Hamming distance 1 of the data.
Ratio matching can also be useful as the basis for dealing with high-dimensional
sparse data, such as word count vectors. This kind of data poses a challenge for
MCMC-based methods because the data is extremely expensive to represent in
dense format, yet the MCMC sampler does not yield sparse values until the model
has learned to represent the sparsity in the data distribution. Dauphin and Ben-
gio (2013) overcame this issue by designing an unbiased stochastic approximation
to ratio matching. The approximation evaluates only a randomly selected subset
of the terms of the objective, and does not require the model to generate complete
fantasy samples.
18.5 Denoising Score Matching
In some cases we may wish to regularize score matching, by fitting a distribution

p_{\text{smoothed}}(x) = \int p_{\text{data}}(x + y) \, q(y \mid x) \, dy

rather than the true p_data. This is especially useful because in practice we usually
do not have access to the true p_data but rather only an empirical distribution
defined by samples from it. Any consistent estimator will, given enough capacity,
make p_model into a set of Dirac distributions centered on the training points.
Smoothing by q helps to reduce this problem, at the loss of the asymptotic
consistency property. Kingma and LeCun (2010b) introduced a procedure for
performing regularized score matching with the smoothing distribution q being
normally distributed noise.
Surprisingly, some denoising autoencoder training algorithms correspond to
training energy-based models with denoising score matching (Vincent, 2011b).
The denoising autoencoder variant of the algorithm is significantly less compu-
tationally expensive than score matching. Swersky et al. (2011) showed how to
derive the denoising autoencoder for any energy-based model of real data. This
approach is known as denoising score matching (SMD).
18.6 Noise-Contrastive Estimation
Most techniques for estimating models with intractable partition functions do not
provide an estimate of the partition function. SML and CD estimate only the
gradient of the log partition function, rather than the partition function itself.
Score matching and pseudolikelihood avoid computing quantities related to the
partition function altogether.
Noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) takes a
different strategy. In this approach, the probability distribution estimated by the
model is represented explicitly as

\log p_{\text{model}}(x) = \log \tilde{p}_{\text{model}}(x; \theta) + c,

where c is explicitly introduced as an approximation of −log Z(θ). Rather than
estimating only θ, the noise contrastive estimation procedure treats c as just
another parameter and estimates θ and c simultaneously, using the same algorithm
for both. The resulting p_model thus may not correspond exactly to a valid probability
distribution, but it will become closer and closer to being valid as the estimate of c
improves.²
Such an approach would not be possible using maximum likelihood as the
criterion for the estimator. The maximum likelihood criterion would choose to set
c arbitrarily high, rather than setting c to create a valid probability distribution.
NCE works by reducing the unsupervised learning problem of estimating p(x)
to a supervised learning problem. This supervised learning problem is constructed
in such a way that maximum likelihood estimation in this supervised learning
problem defines an asymptotically consistent estimator of the original problem.
Specifically, we introduce a second distribution, the noise distribution p_noise(x).
The noise distribution should be tractable to evaluate and to sample from. We
can now construct a model over both x and a new, binary class variable y. In the
new joint model, we specify that

p_{\text{joint model}}(y = 1) = \frac{1}{2},

p_{\text{joint model}}(x \mid y = 1) = p_{\text{model}}(x),

and

p_{\text{joint model}}(x \mid y = 0) = p_{\text{noise}}(x).
In other words, y is a switch variable that determines whether we will generate x
from the model or from the noise distribution.
We can construct a similar joint model of training data. In this case, the
switch variable determines whether we draw x from the data or from the noise
distribution. Formally, p_{\text{train}}(y = 1) = \frac{1}{2}, p_{\text{train}}(x \mid y = 1) = p_{\text{data}}(x), and
p_{\text{train}}(x \mid y = 0) = p_{\text{noise}}(x).
We can now just use standard maximum likelihood learning on the supervised
learning problem of fitting p_joint model to p_train:

\theta, c = \arg\max_{\theta, c} \, \mathbb{E}_{x, y \sim p_{\text{train}}} \log p_{\text{joint model}}(y \mid x).
It turns out that p_joint model is essentially a logistic regression model applied
to the difference in log probabilities of the model and the noise distribution:

p_{\text{joint model}}(y = 1 \mid x) = \frac{p_{\text{model}}(x)}{p_{\text{model}}(x) + p_{\text{noise}}(x)}

= \frac{1}{1 + \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)}}

= \frac{1}{1 + \exp\left( \log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)}

= \sigma\left( -\log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)

= \sigma\left( \log p_{\text{model}}(x) - \log p_{\text{noise}}(x) \right).

² NCE is also applicable to problems with a tractable partition function, where there is no
need to introduce the extra parameter c. However, it has generated the most interest as a means
of estimating models with difficult partition functions.
NCE is thus simple to apply so long as log ˜p_model is easy to backpropagate
through, and, as specified above, p_noise is easy to evaluate (in order to evaluate
p_joint model) and sample from (in order to generate the training data).
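A minimal sketch of the resulting objective, treating NCE as the binary classification problem described above; log_ptilde_model and log_p_noise are hypothetical stand-ins for the user's model and noise distribution, and c is the learned approximation to −log Z:

import numpy as np

def nce_loss(x_data, x_noise, theta, c, log_ptilde_model, log_p_noise):
    """Negative NCE objective (to be minimized over theta and c)."""
    def log_odds(x):
        # log p_model(x) - log p_noise(x), with log p_model(x) = log ptilde_model(x; theta) + c
        return log_ptilde_model(x, theta) + c - log_p_noise(x)

    log_sigmoid = lambda z: -np.logaddexp(0.0, -z)        # numerically stable log of the sigmoid
    loss_data = -log_sigmoid(log_odds(x_data)).mean()     # y = 1 terms: data classified as data
    loss_noise = -log_sigmoid(-log_odds(x_noise)).mean()  # y = 0 terms: noise classified as noise
    return loss_data + loss_noise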
NCE is most successful when applied to problems with few random variables,
but can work well even if those random variables can take on a high number of
values. For example, it has been successfully applied to modeling the conditional
distribution over a word given the context of the word (Mnih and Kavukcuoglu,
2013). Though the word may be drawn from a large vocabulary, there is only one
word.
When NCE is applied to problems with many random variables, it becomes
less efficient. The logistic regression classifier can reject a noise sample by identify-
ing any one variable whose value is unlikely. This means that learning slows down
greatly after p_model has learned the basic marginal statistics. Imagine learning a
model of images of faces, using unstructured Gaussian noise as p_noise. If p_model
learns about eyes, it can reject almost all unstructured noise samples without
having learned anything about other facial features, such as mouths.
The constraint that p_noise must be easy to evaluate and easy to sample from
can be overly restrictive. When p_noise is simple, most samples are likely to be too
obviously distinct from the data to force p_model to improve noticeably.
Like score matching and pseudolikelihood, NCE does not work if only a lower
bound on ˜p is available. Such a lower bound could be used to construct a lower
bound on p_joint model(y = 1 | x), but it can only be used to construct an upper
bound on p_joint model(y = 0 | x), which appears in half the terms of the NCE
objective. Likewise, a lower bound on p_noise is not useful, because it provides only
an upper bound on p_joint model(y = 1 | x).
18.7 Estimating the Partition Function
While much of this chapter is dedicated to describing methods for working around
the unknown and intractable partition function Z(θ) associated with an undirected
graphical model, in this section we discuss several methods for directly
estimating the partition function.
Estimating the partition function can be important because we require it if
we wish to compute the normalized likelihood of data. This is often important in
evaluating the model, monitoring training performance, and comparing models
to each other.
For example, imagine we have two models: M_A : p_A(x; θ_A) = ˜p_A(x; θ_A)/Z_A
and M_B : p_B(x; θ_B) = ˜p_B(x; θ_B)/Z_B. A common way to compare the models
is to evaluate the likelihood of an i.i.d. test dataset of size N_test, D_test = {x^{(t)}}_{t=1}^{N_test},
under both models. If \prod_t p_A(x^{(t)}; θ_A) > \prod_t p_B(x^{(t)}; θ_B), or equivalently
if \sum_t \ln p_A(x^{(t)}; θ_A) − \sum_t \ln p_B(x^{(t)}; θ_B) > 0, then we say that M_A is a better model
than M_B (or, at least, it is a better model of the test set). More specifically, to
say that M_A is better than M_B, we need that:

\sum_t \ln p_A(x^{(t)}; \theta_A) - \sum_t \ln p_B(x^{(t)}; \theta_B) > 0

\sum_t \left( \ln \tilde{p}_A(x^{(t)}; \theta_A) - \ln Z(\theta_A) \right) - \sum_t \left( \ln \tilde{p}_B(x^{(t)}; \theta_B) - \ln Z(\theta_B) \right) > 0

\sum_t \ln \tilde{p}_A(x^{(t)}; \theta_A) - \sum_t \ln \tilde{p}_B(x^{(t)}; \theta_B) - N_{\text{test}} \ln Z(\theta_A) + N_{\text{test}} \ln Z(\theta_B) > 0

\sum_t \ln \frac{\tilde{p}_A(x^{(t)}; \theta_A)}{\tilde{p}_B(x^{(t)}; \theta_B)} - N_{\text{test}} \ln \frac{Z(\theta_A)}{Z(\theta_B)} > 0.
In order to compare the two models we need to compare not only their unnormalized
probabilities but also their partition functions. Interestingly, we do not actually
need the value of either partition function: we need only know their ratio, that is,
their relative value up to some shared constant. If, however, we wanted to know the
actual probability of the test data under either M_A or M_B, we would need to
know the actual values of the partition functions. That said, if we knew the ratio
of the two partition functions, R = Z(θ_B)/Z(θ_A), and we knew the actual value of
just one of the two, say Z(θ_A), we could compute the value of the other:

Z(\theta_B) = R \times Z(\theta_A) = \frac{Z(\theta_B)}{Z(\theta_A)} Z(\theta_A).
We can make use of this observation to estimate the partition functions of
undirected graphical models.
For a given probability distribution, say p_1(x), the partition function is defined
as

Z_1 = \int \tilde{p}_1(x) \, dx \qquad (18.5)
where the integral is over the domain of x. Of course, in the case of discrete
x, we replace the integral with a sum. For convenience, we have suppressed the
dependency of both the partition functions and the unnormalized distributions
on the model parameters.
A simple way to estimate the partition function is to use a Monte Carlo method
such as simple importance sampling. Here we consider a proposal distribution,
say p_0(x), from which we can sample and evaluate both its partition function Z_0
and its unnormalized distribution ˜p_0(x).

Z_1 = \int \tilde{p}_1(x) \, dx
= \int \frac{p_0(x)}{p_0(x)} \tilde{p}_1(x) \, dx
= Z_0 \int p_0(x) \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)} \, dx

\hat{Z}_1 = \frac{Z_0}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})} \quad \text{s.t. } x^{(k)} \sim p_0 \qquad (18.6)

In the last line, we make a Monte Carlo approximation of the integral using
samples drawn from p_0(x) and then weigh each sample with the ratio of the
unnormalized ˜p_1 and the proposal ˜p_0, each evaluated at that sample.
If the distribution p_0 is close to p_1, this can be an effective way of estimating
the partition function (Minka, 2005). Unfortunately, most of the time p_1 is both
complicated (i.e. multimodal) and defined over a high dimensional space. It is
difficult to find a tractable p_0 that is simple enough to evaluate while still being
close enough to p_1 to result in a high quality approximation. If p_0 and p_1 are not
close, most samples from p_0 will have low probability under p_1 and therefore make
(relatively) negligible contributions to the sum in Eq. 18.6. Having few samples
with significant weights in this sum will result in an estimator with high variance,
i.e. a poor quality estimator.
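The following sketch applies the estimator of Eq. 18.6 to a toy problem where the answer is known: ˜p_1 is an unnormalized isotropic Gaussian with mean mu and standard deviation sigma, and the proposal p_0 is a standard normal with Z_0 = (2π)^{d/2}. The particular distributions are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
d, K = 2, 100_000
mu, sigma = np.array([0.5, -0.3]), 1.2

log_ptilde_1 = lambda x: -0.5 * np.sum((x - mu) ** 2, axis=1) / sigma ** 2
log_ptilde_0 = lambda x: -0.5 * np.sum(x ** 2, axis=1)
Z_0 = (2 * np.pi) ** (d / 2)

x = rng.normal(size=(K, d))                         # K samples from the proposal p_0
log_w = log_ptilde_1(x) - log_ptilde_0(x)           # importance weights, kept in log space
Z_1_hat = Z_0 * np.exp(log_w).mean()                # Eq. 18.6

Z_1_exact = (2 * np.pi * sigma ** 2) ** (d / 2)
print(Z_1_hat, Z_1_exact)                           # the two values should be close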
We now turn to two related strategies developed to cope with the challenging
task of estimating partition functions for complex distributions over high-dimensional
spaces: annealed importance sampling and Bennett’s acceptance ratio method
(bridge sampling). Both start with the simple importance sampling strategy introduced
above, and both attempt to overcome the problem of the proposal p_0 being
too far from p_1 by introducing intermediate distributions that attempt to bridge
the gap between p_0 and p_1.
18.7.1 Annealed Importance Sampling
In situations where D_KL(p_0‖p_1) is large (i.e., where there is little overlap between
p_0 and p_1), AIS attempts to bridge the gap by introducing intermediate
distributions. Consider a sequence of distributions p_{η_0}, . . . , p_{η_n}, with 0 = η_0 <
η_1 < ··· < η_{n−1} < η_n = 1, so that the first and last distributions in the sequence
are p_0 and p_1 respectively. We can now write the ratio Z_1/Z_0 as

\frac{Z_1}{Z_0} = \frac{Z_1}{Z_0} \frac{Z_{\eta_1}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-1}}}
= \frac{Z_{\eta_1}}{Z_0} \frac{Z_{\eta_2}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-2}}} \frac{Z_1}{Z_{\eta_{n-1}}}
= \prod_{j=0}^{n-1} \frac{Z_{\eta_{j+1}}}{Z_{\eta_j}} \qquad (18.7)
Provided the distributions p_{η_j} and p_{η_{j+1}}, for all 0 ≤ j ≤ n − 1, are sufficiently
close, we can reliably estimate each of the factors Z_{η_{j+1}}/Z_{η_j} using simple importance
sampling and then use these to obtain an estimate of Z_1/Z_0.
Where do these intermediate distributions come from? Just as the original
proposal distribution p_0 is a design choice, so is the sequence of distributions
p_{η_1}, . . . , p_{η_{n−1}}. That is, it can be specifically constructed to suit the problem
domain. One general-purpose and popular choice for the intermediate distributions
is to use the weighted geometric average of the target distribution p_1 and the
starting proposal distribution (for which the partition function is known) p_0:

p_{\eta_j} \propto p_1^{\eta_j} \, p_0^{1 - \eta_j} \qquad (18.8)
In order to sample from these intermediate distributions, we define a series of
Markov chain transition functions T_{η_j}(x′, x) that define the probability distribution
of transitioning from x′ to x. T_{η_j}(x′, x) is defined to leave p_{η_j}(x) invariant:

p_{\eta_j}(x) = \int p_{\eta_j}(x') \, T_{\eta_j}(x', x) \, dx' \qquad (18.9)
These transitions may be constructed as any Markov chain Monte Carlo method
(e.g., Metropolis-Hastings, Gibbs), including methods involving multiple scans or
other iterations.
The AIS sampling strategy is then to generate samples from p_0 and then use
the transition operators to sequentially generate samples from the intermediate
distributions until we arrive at samples from the target distribution p_1:

for k = 1 . . . K
  Sample x_{η_1}^{(k)} ∼ p_0(x)
  Sample x_{η_2}^{(k)} ∼ T_{η_1}(x_{η_1}^{(k)}, x)
  . . .
  Sample x_{η_{n−1}}^{(k)} ∼ T_{η_{n−2}}(x_{η_{n−2}}^{(k)}, x)
  Sample x_{η_n}^{(k)} ∼ T_{η_{n−1}}(x_{η_{n−1}}^{(k)}, x)
end
For sample k, we can derive the importance weight by chaining together the
importance weights for the jumps between the intermediate distributions given in
Eq. 18.7:

w^{(k)} = \frac{\tilde{p}_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde{p}_0(x_{\eta_1}^{(k)})} \; \frac{\tilde{p}_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde{p}_{\eta_1}(x_{\eta_2}^{(k)})} \; \cdots \; \frac{\tilde{p}_1(x_{\eta_n}^{(k)})}{\tilde{p}_{\eta_{n-1}}(x_{\eta_n}^{(k)})} \qquad (18.10)

To avoid computational issues such as overflow, it is probably best to do the
computation in log space, i.e. \ln w^{(k)} = \ln \tilde{p}_{\eta_1}(x) - \ln \tilde{p}_0(x) + \ldots.
With the sampling procedure thus defined and the importance weights given
in Eq. 18.10, the estimate of the ratio of partition functions is given by:

\frac{Z_1}{Z_0} \approx \frac{1}{K} \sum_{k=1}^{K} w^{(k)} \qquad (18.11)
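A compact sketch of the whole procedure, on the same toy Gaussian target used in the importance sampling example above, with geometric-average intermediates (Eq. 18.8) and a few Metropolis-Hastings random-walk steps as each transition operator (any transition that leaves p_{η_j} invariant would do); every particular choice here is illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, K, n_dists = 2, 5_000, 50
mu, sigma = np.array([0.5, -0.3]), 1.2

log_ptilde_1 = lambda x: -0.5 * np.sum((x - mu) ** 2, axis=1) / sigma ** 2
log_ptilde_0 = lambda x: -0.5 * np.sum(x ** 2, axis=1)
log_ptilde_eta = lambda x, eta: eta * log_ptilde_1(x) + (1 - eta) * log_ptilde_0(x)  # Eq. 18.8
Z_0 = (2 * np.pi) ** (d / 2)

etas = np.linspace(0.0, 1.0, n_dists + 1)
x = rng.normal(size=(K, d))                     # K independent AIS chains, initialized from p_0
log_w = np.zeros(K)

for eta_prev, eta in zip(etas[:-1], etas[1:]):
    # Accumulate the importance weight for the jump eta_prev -> eta (one factor of Eq. 18.10).
    log_w += log_ptilde_eta(x, eta) - log_ptilde_eta(x, eta_prev)
    # A few Metropolis-Hastings random-walk steps whose stationary distribution is p_eta.
    for _ in range(2):
        proposal = x + 0.5 * rng.normal(size=x.shape)
        log_accept = log_ptilde_eta(proposal, eta) - log_ptilde_eta(x, eta)
        accept = np.log(rng.uniform(size=K)) < log_accept
        x[accept] = proposal[accept]

Z_1_hat = Z_0 * np.exp(log_w).mean()            # Eq. 18.11, rescaled by the known Z_0
print(Z_1_hat, (2 * np.pi * sigma ** 2) ** (d / 2))   # estimate vs. exact value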
In order to verify that this procedure defines a valid importance sampling
scheme, we can show that the AIS procedure corresponds to simple importance
sampling on an extended state space, with points sampled over the product space
[x_{η_1}, . . . , x_{η_{n−1}}, x_1] (Neal, 2001).
We define the distribution over the extended space as:

\tilde{p}(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) = \tilde{p}_1(x_1) \, \tilde{T}_{\eta_{n-1}}(x_1, x_{\eta_{n-1}}) \, \tilde{T}_{\eta_{n-2}}(x_{\eta_{n-1}}, x_{\eta_{n-2}}) \cdots \tilde{T}_{\eta_1}(x_{\eta_2}, x_{\eta_1}) \qquad (18.12)
where \tilde{T}_a is the reverse of the transition operator defined by T_a (via an application
of Bayes’ rule):

\tilde{T}_a(x, x') = \frac{p_a(x')}{p_a(x)} T_a(x', x) = \frac{\tilde{p}_a(x')}{\tilde{p}_a(x)} T_a(x', x). \qquad (18.13)
Plugging the above into the expression for the joint distribution on the extended
state space given in Eq. 18.12, we get:

\tilde{p}(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)
= \tilde{p}_1(x_1) \, \frac{\tilde{p}_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde{p}_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_{\eta_{n-1}}, x_1) \, \frac{\tilde{p}_{\eta_{n-2}}(x_{\eta_{n-2}})}{\tilde{p}_{\eta_{n-2}}(x_{\eta_{n-1}})} T_{\eta_{n-2}}(x_{\eta_{n-2}}, x_{\eta_{n-1}}) \cdots \frac{\tilde{p}_{\eta_1}(x_{\eta_1})}{\tilde{p}_{\eta_1}(x_{\eta_2})} T_{\eta_1}(x_{\eta_1}, x_{\eta_2})
= \frac{\tilde{p}_1(x_1)}{\tilde{p}_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_{\eta_{n-1}}, x_1) \, \frac{\tilde{p}_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde{p}_{\eta_{n-2}}(x_{\eta_{n-1}})} T_{\eta_{n-2}}(x_{\eta_{n-2}}, x_{\eta_{n-1}}) \cdots \frac{\tilde{p}_{\eta_2}(x_{\eta_2})}{\tilde{p}_{\eta_1}(x_{\eta_2})} T_{\eta_1}(x_{\eta_1}, x_{\eta_2}) \, \tilde{p}_{\eta_1}(x_{\eta_1}) \qquad (18.14)
If we now consider the sampling scheme given above as a means of generating
samples from a proposal distribution q over the extended state space, its distribution
is given by:

q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) = p_0(x_{\eta_1}) \, T_{\eta_1}(x_{\eta_1}, x_{\eta_2}) \cdots T_{\eta_{n-1}}(x_{\eta_{n-1}}, x_1) \qquad (18.15)
We have a joint distribution on the extended space, given by Eq. 18.14. Taking
q(x_{η_1}, . . . , x_{η_{n−1}}, x_1) as the proposal distribution on the extended state space from
which we will draw samples, it remains to determine the importance weights:

w^{(k)} = \frac{\tilde{p}(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)}{q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)} = \frac{\tilde{p}_1(x_1^{(k)})}{\tilde{p}_{\eta_{n-1}}(x_1^{(k)})} \cdots \frac{\tilde{p}_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde{p}_{\eta_1}(x_{\eta_2}^{(k)})} \, \frac{\tilde{p}_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde{p}_0(x_{\eta_1}^{(k)})} \qquad (18.16)
These weights are the same as proposed for AIS. Thus we can interpret AIS as
simple importance sampling applied to an extended state space, and its validity
follows immediately from the validity of importance sampling.
Annealed importance sampling (AIS) was first discovered by Jarzynski (1997)
and then again, independently, by Neal (2001). It is currently the most common
way of estimating the partition function for undirected probabilistic models. The
reasons for this may have more to do with the publication of an influential paper
(Salakhutdinov and Murray, 2008) describing its application to estimating the
partition function of restricted Boltzmann machines and deep belief networks
than with any inherent advantage the method has over the other method described
below.
A discussion of the properties of the AIS estimator (e.g., its variance and
efficiency) can be found in Neal (2001).
18.7.2 Bridge Sampling
Bridge sampling (Bennett, 1976) is another method that, like AIS, addresses the
shortcomings of importance sampling; however, it does so in a different but related
manner. Rather than chaining together a series of intermediate distributions,
bridge sampling relies on a single distribution p_*, known as the bridge,
to interpolate between a distribution with known partition function, p_0, and a
distribution p_1 for which we are trying to estimate the partition function Z_1.
Bridge sampling estimates the ratio Z_1/Z_0 as the ratio of the expected importance
weights between ˜p_0 and ˜p_* and between ˜p_1 and ˜p_*:

\frac{Z_1}{Z_0} \approx \sum_{k=1}^{K} \frac{\tilde{p}_*(x_0^{(k)})}{\tilde{p}_0(x_0^{(k)})} \Bigg/ \sum_{k=1}^{K} \frac{\tilde{p}_*(x_1^{(k)})}{\tilde{p}_1(x_1^{(k)})} \qquad (18.17)
If the bridge distribution p_* is chosen carefully to have a large overlap of support
with both p_0 and p_1, then bridge sampling can allow the distance between the two
distributions (or more formally, D_KL(p_0‖p_1)) to be much larger than with standard
importance sampling.
It can be shown that the optimal bridging distribution is given by

p_*^{(\text{opt})}(x) \propto \frac{\tilde{p}_0(x) \, \tilde{p}_1(x)}{r \, \tilde{p}_0(x) + \tilde{p}_1(x)},

where r = Z_1/Z_0.
This appears to be an unworkable solution, as it would seem to require the
very quantity we are trying to estimate, i.e. Z_1/Z_0. However, it is possible to
start with a coarse estimate of r and use the resulting bridge distribution to refine
our estimate recursively (Neal, 2005).
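A minimal sketch of the estimator in Eq. 18.17, on the same toy Gaussian pair used in the AIS example, with the geometric mean of ˜p_0 and ˜p_1 as a simple (suboptimal) bridge. Samples from p_1 are drawn directly here because the toy target is Gaussian; for a real undirected model they would have to come from MCMC.

import numpy as np

rng = np.random.default_rng(0)
d, K = 2, 100_000
mu, sigma = np.array([0.5, -0.3]), 1.2

log_ptilde_1 = lambda x: -0.5 * np.sum((x - mu) ** 2, axis=1) / sigma ** 2
log_ptilde_0 = lambda x: -0.5 * np.sum(x ** 2, axis=1)
log_ptilde_bridge = lambda x: 0.5 * (log_ptilde_0(x) + log_ptilde_1(x))   # geometric-mean bridge
Z_0 = (2 * np.pi) ** (d / 2)

x0 = rng.normal(size=(K, d))                    # samples from p_0
x1 = mu + sigma * rng.normal(size=(K, d))       # samples from p_1

numerator = np.exp(log_ptilde_bridge(x0) - log_ptilde_0(x0)).mean()    # E_p0[ptilde_*/ptilde_0]
denominator = np.exp(log_ptilde_bridge(x1) - log_ptilde_1(x1)).mean()  # E_p1[ptilde_*/ptilde_1]
Z_1_hat = Z_0 * numerator / denominator                                # Eq. 18.17, times Z_0
print(Z_1_hat, (2 * np.pi * sigma ** 2) ** (d / 2))                    # estimate vs. exact value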
18.7.3 Extensions
Linked importance sampling. Both AIS and bridge sampling have their advantages.
If D_KL(p_0‖p_1) is not too large (i.e. if p_0 and p_1 are sufficiently close),
bridge sampling can be a more effective means of estimating the ratio of partition
functions than AIS. If, however, the two distributions are too far apart for a single
distribution p_* to bridge the gap, then one can at least use AIS with potentially
many intermediate distributions to span the distance between p_0 and p_1. Neal
(2005) showed how his linked importance sampling method leveraged the power
of the bridge sampling strategy to bridge the intermediate distributions used in
AIS to significantly improve the overall partition function estimates.
Tracking the partition function while training. Using a combination of
bridge sampling, AIS and parallel tempering, Desjardins et al. (2011) devised
a scheme to track the partition function of an RBM throughout the training
process. The strategy is based on the maintenance of independent estimates of
the partition functions of the RBM at every temperature operating in the parallel
tempering scheme. The authors combined bridge sampling estimates of the ratios
of partition functions of neighboring chains (i.e. from parallel tempering) with
AIS estimates across time to come up with a low variance estimate of the partition
functions at every iteration of learning.