Chapter 15
Monte Carlo Methods
15.1 Markov Chain Monte Carlo Methods
Drawing a sample x from the probability distribution p(x) defined by a structured model is an important operation. The following techniques are described by Koller and Friedman (2009).
Sampling from an energy-based model is not straightforward. Suppose we have an
EBM defining a distribution p(a, b). In order to sample a, we must draw it from p(a | b),
and in order to sample b, we must draw it from p(b | a). It seems to be an intractable
chicken-and-egg problem. Directed models avoid this problem because their graph G is directed and acyclic. In ancestral sampling, one simply samples each of the variables in topological
order, conditioning on each variable’s parents, which are guaranteed to have already
been sampled. This defines an efficient, single-pass method of obtaining a sample.
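For example, consider a hypothetical chain-structured model a → b → c over binary variables. Ancestral sampling then amounts to three table lookups performed in topological order. The following sketch (in Python with NumPy; the conditional probability tables are invented for the example) illustrates the single pass:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical directed model a -> b -> c over binary variables.
# All conditional probability tables below are made up for illustration.
p_a = np.array([0.6, 0.4])                # p(a)
p_b_given_a = np.array([[0.7, 0.3],       # p(b | a = 0)
                        [0.2, 0.8]])      # p(b | a = 1)
p_c_given_b = np.array([[0.9, 0.1],       # p(c | b = 0)
                        [0.4, 0.6]])      # p(c | b = 1)

def ancestral_sample():
    # Visit variables in topological order; every parent is already sampled.
    a = rng.choice(2, p=p_a)
    b = rng.choice(2, p=p_b_given_a[a])
    c = rng.choice(2, p=p_c_given_b[b])
    return a, b, c

Each call to ancestral_sample draws one exact sample from the joint distribution p(a)p(b | a)p(c | b) in a single pass, with no iteration or burn-in required.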
In an EBM, it turns out that we can get around this chicken-and-egg problem by sampling using a Markov chain. A Markov chain is defined by a state x and a transition distribution T(x′ | x). Running the Markov chain means repeatedly updating the state x to a value x′ sampled from T(x′ | x).
Under certain conditions, a Markov chain is eventually guaranteed to draw x from an equilibrium distribution π(x′), defined by the condition

∀x′, π(x′) = Σ_x T(x′ | x) π(x).
When the state x is discrete, we can describe π and T in the language of linear algebra. We can think of π as a vector, with the probability of each possible value x stored in the element indexed by x, so that element x holds π(x). Likewise, we can think of T as a matrix, with row index x′ and column index x, whose (x′, x) entry is T(x′ | x). A matrix like T, whose entries are non-negative and whose columns each sum to 1, is called a stochastic matrix. With this representation, the above equation becomes

T π = π,
an eigenvector equation: π is an eigenvector of T with eigenvalue 1. The Perron-Frobenius theorem shows that 1 is the largest possible eigenvalue of a stochastic matrix, and that under mild conditions (for example, T(x′ | x) > 0 for all x and x′) it is the only eigenvalue equal to 1. We can also see this equation as a fixed point equation for the update of the distribution associated with each step of the Markov chain. If we start a chain by picking x_0 ∼ p_0, then after one step the state is distributed as p_1 = T p_0, and after t steps as p_t = T p_{t−1} = T^t p_0. If this recursion converges (the chain then has a so-called stationary distribution), it converges to a fixed point, which is precisely p_t = π as t → ∞, and the dynamical systems view agrees with the eigenvector view.
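To make the correspondence between the two views concrete, the following sketch checks numerically that iterating p_t = T p_{t−1} converges to the eigenvector of T with eigenvalue 1. The 3-state stochastic matrix is arbitrary, invented for the demonstration:

import numpy as np

# An arbitrary 3-state stochastic matrix: entries are non-negative
# and each column sums to 1.
T = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

# Dynamical systems view: iterate p_t = T p_{t-1} from any starting point.
p = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    p = T @ p

# Eigenvector view: the eigenvector of T with eigenvalue 1,
# normalized so that its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = v / v.sum()

print(np.allclose(p, pi))  # True: both views yield the same distribution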
This fixed point condition guarantees that repeated applications of the transition sampling procedure do not change the distribution over the state of the Markov chain. Running the
Markov chain until it reaches its equilibrium distribution is called “burning in” the
Markov chain.
Unfortunately, there is no theory to predict how many steps the Markov chain must
run before reaching its equilibrium distribution, nor any way to tell for sure that this
event has happened. Also, even though successive samples come from the same distri-
bution, they are highly correlated with each other, so to obtain multiple samples one
should run the Markov chain for many steps between collecting each sample. Markov
chains tend to get stuck in a single mode of π(x) for many steps. The speed with which a Markov chain moves from mode to mode is called its mixing rate. Since burning in a Markov chain and getting it to mix well may take many sampling steps, sampling correctly from an EBM is still a somewhat costly procedure.
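In practice, one therefore discards the states visited during an initial burn-in period and then keeps only every k-th state thereafter. A minimal sketch of this bookkeeping, assuming a function sample_transition(x) (a hypothetical name) that draws x′ from T(x′ | x):

def collect_samples(x0, sample_transition, n_samples,
                    n_burn_in=1000, thinning=100):
    # Discard the burn-in period, during which the chain is assumed
    # to approach its equilibrium distribution.
    x = x0
    for _ in range(n_burn_in):
        x = sample_transition(x)
    # Keep only every `thinning`-th state to reduce the correlation
    # between successive collected samples.
    samples = []
    for _ in range(n_samples):
        for _ in range(thinning):
            x = sample_transition(x)
        samples.append(x)
    return samples

Note that the burn-in length and thinning interval here are heuristics; as discussed above, no general theory specifies correct values for them.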
Of course, all of this depends on ensuring π(x) = p(x). Fortunately, this is easy so long as p(x) is defined by an EBM. The simplest method is to use Gibbs sampling, in which sampling from T(x′ | x) is accomplished by selecting one variable x_i and sampling it from p conditioned on its neighbors in G. It is also possible to sample several variables at the same time so long as they are conditionally independent given all of their neighbors. A more general recipe is the Metropolis-Hastings algorithm, which draws a candidate state from a proposal distribution and then accepts or rejects it with a probability chosen so that the equilibrium distribution is p(x); Gibbs sampling may be viewed as the special case of Metropolis-Hastings whose proposals are always accepted.
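For illustration, here is a minimal sketch of one Metropolis-Hastings transition for an EBM defining p(x) ∝ exp(−E(x)). The function names energy and propose are hypothetical, and the proposal is assumed to be symmetric, in which case the acceptance probability reduces to a comparison of energies:

import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings_step(x, energy, propose):
    # One transition for p(x) proportional to exp(-energy(x)).
    # `propose` is assumed symmetric: q(x' | x) = q(x | x'),
    # so the acceptance ratio simplifies to exp(energy(x) - energy(x_new)).
    x_new = propose(x)
    accept_prob = min(1.0, np.exp(energy(x) - energy(x_new)))
    return x_new if rng.random() < accept_prob else x

This acceptance rule is exactly what makes p(x) the equilibrium distribution: it enforces detailed balance, a condition defined in section 15.1.1 below.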
To see how slow mixing arises, consider an EBM over two binary variables a, b ∈ {−1, 1} with energy E(a, b) = −w a b for some large positive w, so that the model strongly prefers the two variables to take the same state. A Gibbs update of a conditioned on b sets a = b with high probability, and likewise for b conditioned on a, so a chain started in the mode a = b = 1 remains there for a long time before it manages to flip both variables and visit the mode a = b = −1. Figure 15.1 illustrates the same difficulty for Gibbs sampling in continuous distributions whose variables are strongly correlated or whose modes are widely separated, and figure 15.2 shows how slow mixing manifests in deep probabilistic models.
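A minimal sketch of Gibbs sampling for this two-variable example (the coupling strength w = 3 is an arbitrary choice) makes the sticking behavior easy to observe:

import numpy as np

rng = np.random.default_rng(0)
w = 3.0  # arbitrary coupling strength; larger w means slower mixing

def gibbs_update(other):
    # For E(a, b) = -w * a * b with states in {-1, +1},
    # p(x_i = +1 | x_other) = 1 / (1 + exp(-2 * w * x_other)).
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * w * other))
    return 1 if rng.random() < p_plus else -1

a, b = 1, 1                  # initialize in the a = b = 1 mode
chain = []
for _ in range(20):
    a = gibbs_update(b)      # draw a from p(a | b)
    b = gibbs_update(a)      # draw b from p(b | a)
    chain.append((a, b))
print(chain)                 # mostly (1, 1): the chain rarely changes modes

With w = 3, each conditional update agrees with the other variable with probability of about 0.998, so flips of the joint state are rare and the chain visits the second mode only after many steps.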
15.1.1 Markov Chain Theory
Perron's theorem states that a square matrix with strictly positive entries has a unique eigenvalue of maximal magnitude, that this eigenvalue is real and positive, and that an associated eigenvector can be chosen with strictly positive entries. For a stochastic matrix T, the maximal eigenvalue is exactly 1, so a transition operator with T(x′ | x) > 0 for all x and x′ has a unique equilibrium distribution π, as claimed above.

A transition operator T satisfies detailed balance with respect to a distribution π when

π(x) T(x′ | x) = π(x′) T(x | x′)

for all pairs of states x and x′. A Markov chain whose transition operator satisfies detailed balance with respect to π is called reversible, and detailed balance is a sufficient (though not necessary) condition for π to be a stationary distribution of the chain.
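The sufficiency follows from a one-line computation: summing both sides of the detailed balance condition over x and using the fact that T(x | x′) sums to 1 over x gives

Σ_x π(x) T(x′ | x) = Σ_x π(x′) T(x | x′) = π(x′) Σ_x T(x | x′) = π(x′),

which is exactly the equilibrium condition from the beginning of this section.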
Figure 15.1: Paths followed by Gibbs sampling for three distributions, with the Markov chain initialized at the mode in all cases. Left) A multivariate normal distribution
with two independent variables. Gibbs sampling mixes well because the variables are
independent. Center) A multivariate normal distribution with highly correlated vari-
ables. The correlation between variables makes it difficult for the Markov chain to mix.
Because each variable must be updated conditioned on the other, the correlation reduces
the rate at which the Markov chain can move away from the starting point. Right) A
mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs
sampling mixes very slowly because it is difficult to change modes while altering only
one variable at a time.
Figure 15.2: An illustration of the slow mixing problem in deep probabilistic models.
Each panel should be read left to right, top to bottom. Left) Consecutive samples
from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST
dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is performed in a deep graphical model, this similarity is based more on semantic features than on raw visual features, but it is still difficult for the Gibbs chain to transition from
one mode of the distribution to another, for example by changing the digit identity.
Right) Consecutive ancestral samples from a generative adversarial network. Because
ancestral sampling generates each sample independently from the others, there is no
mixing problem.