
in high-dimensional cases, MCMC samples become very correlated. We refer to such behavior as slow mixing or even failure to mix. MCMC methods with slow mixing can be seen as implicitly performing a noisy gradient descent on the energy, or equivalently a noisy hill climbing on the probability, in the space of configurations of the state of the chain (the random variables being sampled). The chain tends to take small steps (in the space of the state of the Markov chain) from a configuration $x^{(t-1)}$ to a configuration $x^{(t)}$, with the energy $E(x^{(t)})$ generally lower than or approximately equal to the energy $E(x^{(t-1)})$, reflecting a preference for moves that yield lower-energy configurations.
When starting from a rather improbable configuration (of higher energy than the typical configurations under $p(x)$), the chain tends to gradually reduce the energy of the state
and only occasionally move to another mode. Once the chain has found a region
of low energy (such as the region around a connected manifold associated with a
category), which we call a mode, the chain will tend to walk around that mode
(following a kind of random walk). Once in a while it will step out of that mode
and generally return to it or (if it finds an escape route) move towards another
mode.
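As a concrete illustration of this descend-then-random-walk behavior, the sketch below runs a random-walk Metropolis chain (one simple MCMC variant, not the only one discussed in this chapter) on a toy double-well energy $E(x) = 4(x^2 - 1)^2$, which has modes near $x = \pm 1$ separated by a barrier at $x = 0$. The energy function, step size, and seed are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # Toy double-well energy: modes near x = -1 and x = +1, barrier at x = 0.
    # (An assumption for this illustration, not a model from the text.)
    return 4.0 * (x**2 - 1.0)**2

def metropolis_chain(x0, n_steps, step_size=0.2):
    """Random-walk Metropolis chain targeting p(x) proportional to exp(-energy(x))."""
    x = x0
    xs = np.empty(n_steps)
    for t in range(n_steps):
        proposal = x + step_size * rng.standard_normal()
        delta = energy(x) - energy(proposal)  # log of the acceptance ratio
        # Accept with probability min(1, exp(delta)): downhill moves are
        # always accepted, uphill moves are exponentially penalized.
        if delta >= 0 or rng.random() < np.exp(delta):
            x = proposal
        xs[t] = x
    return xs

xs = metropolis_chain(x0=3.0, n_steps=20_000)
# The chain first descends toward the mode near x = +1, then random-walks
# within it; crossings to the mode near x = -1 are rare.
crossings = np.sum(np.sign(xs[1:]) != np.sign(xs[:-1]))
print("fraction of time spent at x > 0:", np.mean(xs > 0))
print("number of sign changes (mode hops):", crossings)
```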
This is very clear when we consider the Gibbs sampling algorithm (Sec. 17.3): each step samples a variable from a conditional distribution proportional to $e^{-E}$, so higher-energy configurations are unlikely to be visited, with a penalty that is exponential in the difference of energies between the two configurations. In this context, consider the probability of going from one mode to a nearby mode within a given number of steps. What determines that probability is the shape of the “energy barrier” between these modes. Transitions between two modes that are separated by a high energy barrier (a region of low probability) are exponentially less likely, in terms of the height of the energy barrier. This is illustrated in Fig. 17.1. The problem arises when there are multiple modes with high probability that are separated by regions of low probability, especially when each Gibbs sampling step must update only a small subset of variables whose values are largely determined by the other variables.
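To make the exponential dependence concrete, here is a rough back-of-the-envelope calculation (the symbols $x_{\text{mode}}$, $x_{\text{barrier}}$, and $\Delta E$ are notation introduced here for illustration, not from the text). Under a Boltzmann distribution $p(x) \propto e^{-E(x)}$, a barrier configuration whose energy exceeds that of a mode configuration by $\Delta E$ satisfies

$$\frac{p(x_{\text{barrier}})}{p(x_{\text{mode}})} = \frac{e^{-(E(x_{\text{mode}}) + \Delta E)}}{e^{-E(x_{\text{mode}})}} = e^{-\Delta E},$$

so the chain visits barrier configurations roughly $e^{\Delta E}$ times less often than mode configurations, and the expected number of steps before a crossing grows accordingly: a barrier of height $\Delta E = 10$ already suppresses such visits by a factor of $e^{-10} \approx 4.5 \times 10^{-5}$.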
As a simple example, consider an energy-based model over two variables $a$ and $b$, which are both binary with a sign, taking on values $-1$ and $1$. If $E(a, b) = -wab$ for some large positive number $w$, then the model expresses a strong belief that $a$ and $b$ have the same sign. Consider updating $b$ using a Gibbs sampling step with $a = 1$. The conditional distribution over $b$ is given by $P(b = 1 \mid a = 1) = \sigma(2w)$. If $w$ is large, the sigmoid saturates, and the probability of also assigning $b$ to be $1$ is close to $1$. Likewise, if $a = -1$, the probability of assigning $b$ to be $-1$ is close to $1$. According to $P_{\text{model}}(a, b)$, both signs of both variables are equally likely. According to $P_{\text{model}}(a \mid b)$, both variables should have the same sign. This means that Gibbs sampling will only very rarely flip the signs of these variables.
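A minimal sketch of this two-variable Gibbs sampler (the function names and step counts are illustrative choices, not from the text) makes the rarity of sign flips easy to measure empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sign_flip_rate(w, n_steps=100_000):
    """Gibbs sampling for E(a, b) = -w*a*b with a, b in {-1, +1}.

    Each conditional is P(a = 1 | b) = sigmoid(2*w*b), and symmetrically
    for b given a. Returns the fraction of sweeps on which a changed sign.
    """
    a, b = 1, 1
    flips = 0
    for _ in range(n_steps):
        old_a = a
        # One Gibbs sweep: resample a from P(a | b), then b from P(b | a).
        a = 1 if rng.random() < sigmoid(2.0 * w * b) else -1
        b = 1 if rng.random() < sigmoid(2.0 * w * a) else -1
        if a != old_a:
            flips += 1
    return flips / n_steps

for w in [0.5, 1.0, 2.0, 4.0]:
    print(f"w = {w}: sign-flip rate ~ {gibbs_sign_flip_rate(w):.5f}")
```

With the conditionals $P(a = 1 \mid b) = \sigma(2wb)$ and $P(b = 1 \mid a) = \sigma(2wa)$, the measured flip rate drops rapidly as $w$ grows, matching the saturation argument above.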