Chapter 15
Monte Carlo Methods
15.1 Markov Chain Monte Carlo Methods
Drawing a sample x from the probability distribution p(x) defined by a structured model is an important operation. The following techniques are described by Koller and Friedman (2009).
Sampling from an energy-based model is not straightforward. Suppose we have an
EBM defining a distribution p(a, b). In order to sample a, we must draw it from p(a | b),
and in order to sample b, we must draw it from p(b | a). It seems to be an intractable
chicken-and-egg problem. Directed models avoid this problem because their graph G is directed and acyclic. In ancestral sampling, one simply samples each of the variables in topological
order, conditioning on each variable’s parents, which are guaranteed to have already
been sampled. This defines an efficient, single-pass method of obtaining a sample.
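For example, consider a hypothetical chain-structured model a → b → c over binary variables. Ancestral sampling then amounts to three table lookups performed in topological order. The following sketch (in Python with NumPy; the conditional probability tables are invented for the example) illustrates the single pass:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical directed model a -> b -> c over binary variables.
# All conditional probability tables below are made up for illustration.
p_a = np.array([0.6, 0.4])                # p(a)
p_b_given_a = np.array([[0.7, 0.3],       # p(b | a = 0)
                        [0.2, 0.8]])      # p(b | a = 1)
p_c_given_b = np.array([[0.9, 0.1],       # p(c | b = 0)
                        [0.4, 0.6]])      # p(c | b = 1)

def ancestral_sample():
    # Visit variables in topological order; every parent is already sampled.
    a = rng.choice(2, p=p_a)
    b = rng.choice(2, p=p_b_given_a[a])
    c = rng.choice(2, p=p_c_given_b[b])
    return a, b, c

Each call to ancestral_sample draws one exact sample from the joint distribution p(a)p(b | a)p(c | b) in a single pass, with no iteration or burn-in required.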
In an EBM, it turns out that we can get around this chicken-and-egg problem by sampling using a Markov chain. A Markov chain is defined by a state x and a transition distribution T(x′ | x). Running the Markov chain means repeatedly updating the state x to a value x′ sampled from T(x′ | x).
Under certain conditions, a Markov chain is eventually guaranteed to draw x from an equilibrium distribution π(x′), defined by the condition

∀x′, π(x′) = Σ_x T(x′ | x) π(x).
When the state x is discrete, we can describe π and T in the language of linear algebra. We can think of π as a vector, with the probability of each possible value x stored in the element indexed by x, so that element x holds π(x). Likewise, we can think of T as a matrix, with row index x′ and column index x, whose (x′, x) entry is T(x′ | x). A matrix like T, whose entries are non-negative and whose columns each sum to 1, is called a stochastic matrix. With this representation, the above equation becomes

T π = π,
an eigenvector equation: π is an eigenvector of T with eigenvalue 1. The Perron-Frobenius theorem shows that 1 is the largest possible eigenvalue of a stochastic matrix, and that under mild conditions (for example, T(x′ | x) > 0 for all x and x′) it is the only eigenvalue equal to 1. We can also see this equation as a fixed point equation for the update of the distribution associated with each step of the Markov chain. If we start a chain by picking x_0 ∼ p_0, then after one step the state is distributed as p_1 = T p_0, and after t steps as p_t = T p_{t−1} = T^t p_0. If this recursion converges (the chain then has a so-called stationary distribution), it converges to a fixed point, which is precisely p_t = π as t → ∞, and the dynamical systems view agrees with the eigenvector view.
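To make the correspondence between the two views concrete, the following sketch checks numerically that iterating p_t = T p_{t−1} converges to the eigenvector of T with eigenvalue 1. The 3-state stochastic matrix is arbitrary, invented for the demonstration:

import numpy as np

# An arbitrary 3-state stochastic matrix: entries are non-negative
# and each column sums to 1.
T = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

# Dynamical systems view: iterate p_t = T p_{t-1} from any starting point.
p = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    p = T @ p

# Eigenvector view: the eigenvector of T with eigenvalue 1,
# normalized so that its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = v / v.sum()

print(np.allclose(p, pi))  # True: both views yield the same distribution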
This fixed point condition guarantees that repeated applications of the transition sampling procedure do not change the distribution over the state of the Markov chain. Running the
Markov chain until it reaches its equilibrium distribution is called “burning in” the
Markov chain.
Unfortunately, there is no theory to predict how many steps the Markov chain must
run before reaching its equilibrium distribution, nor any way to tell for sure that this
event has happened. Also, even though successive samples come from the same distri-
bution, they are highly correlated with each other, so to obtain multiple samples one
should run the Markov chain for many steps between collecting each sample. Markov
chains tend to get stuck in a single mode of π(x) for many steps. The speed with which a Markov chain moves from mode to mode is called its mixing rate. Since burning in a Markov chain and getting it to mix well may take many sampling steps, sampling correctly from an EBM is still a somewhat costly procedure.
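In practice, one therefore discards the states visited during an initial burn-in period and then keeps only every k-th state thereafter. A minimal sketch of this bookkeeping, assuming a function sample_transition(x) (a hypothetical name) that draws x′ from T(x′ | x):

def collect_samples(x0, sample_transition, n_samples,
                    n_burn_in=1000, thinning=100):
    # Discard the burn-in period, during which the chain is assumed
    # to approach its equilibrium distribution.
    x = x0
    for _ in range(n_burn_in):
        x = sample_transition(x)
    # Keep only every `thinning`-th state to reduce the correlation
    # between successive collected samples.
    samples = []
    for _ in range(n_samples):
        for _ in range(thinning):
            x = sample_transition(x)
        samples.append(x)
    return samples

Note that the burn-in length and thinning interval here are heuristics; as discussed above, no general theory specifies correct values for them.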
Of course, all of this depends on ensuring π(x) = p(x). Fortunately, this is easy so long as p(x) is defined by an EBM. The simplest method is to use Gibbs sampling, in which sampling from T(x′ | x) is accomplished by selecting one variable x_i and sampling it from p conditioned on its neighbors in G. It is also possible to sample several variables at the same time so long as they are conditionally independent given all of their neighbors. A more general recipe is the Metropolis-Hastings algorithm, which draws a candidate state from a proposal distribution and then accepts or rejects it with a probability chosen so that the equilibrium distribution is p(x); Gibbs sampling may be viewed as the special case of Metropolis-Hastings whose proposals are always accepted.
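For illustration, here is a minimal sketch of one Metropolis-Hastings transition for an EBM defining p(x) ∝ exp(−E(x)). The function names energy and propose are hypothetical, and the proposal is assumed to be symmetric, in which case the acceptance probability reduces to a comparison of energies:

import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings_step(x, energy, propose):
    # One transition for p(x) proportional to exp(-energy(x)).
    # `propose` is assumed symmetric: q(x' | x) = q(x | x'),
    # so the acceptance ratio simplifies to exp(energy(x) - energy(x_new)).
    x_new = propose(x)
    accept_prob = min(1.0, np.exp(energy(x) - energy(x_new)))
    return x_new if rng.random() < accept_prob else x

This acceptance rule is exactly what makes p(x) the equilibrium distribution: it enforces detailed balance, a condition defined in section 15.1.1 below.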
To see how slow mixing arises, consider an EBM over two binary variables a, b ∈ {−1, 1} with energy E(a, b) = −w a b for some large positive w, so that the model strongly prefers the two variables to take the same state. A Gibbs update of a conditioned on b sets a = b with high probability, and likewise for b conditioned on a, so a chain started in the mode a = b = 1 remains there for a long time before it manages to flip both variables and visit the mode a = b = −1. Figure 15.1 illustrates the same difficulty for Gibbs sampling in continuous distributions whose variables are strongly correlated or whose modes are widely separated, and figure 15.2 shows how slow mixing manifests in deep probabilistic models.
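A minimal sketch of Gibbs sampling for this two-variable example (the coupling strength w = 3 is an arbitrary choice) makes the sticking behavior easy to observe:

import numpy as np

rng = np.random.default_rng(0)
w = 3.0  # arbitrary coupling strength; larger w means slower mixing

def gibbs_update(other):
    # For E(a, b) = -w * a * b with states in {-1, +1},
    # p(x_i = +1 | x_other) = 1 / (1 + exp(-2 * w * x_other)).
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * w * other))
    return 1 if rng.random() < p_plus else -1

a, b = 1, 1                  # initialize in the a = b = 1 mode
chain = []
for _ in range(20):
    a = gibbs_update(b)      # draw a from p(a | b)
    b = gibbs_update(a)      # draw b from p(b | a)
    chain.append((a, b))
print(chain)                 # mostly (1, 1): the chain rarely changes modes

With w = 3, each conditional update agrees with the other variable with probability of about 0.998, so flips of the joint state are rare and the chain visits the second mode only after many steps.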
15.1.1 Markov Chain Theory
Perron's theorem states that a square matrix with strictly positive entries has a unique eigenvalue of maximal magnitude, that this eigenvalue is real and positive, and that an associated eigenvector can be chosen with strictly positive entries. For a stochastic matrix T, the maximal eigenvalue is exactly 1, so a transition operator with T(x′ | x) > 0 for all x and x′ has a unique equilibrium distribution π, as claimed above.

A transition operator T satisfies detailed balance with respect to a distribution π when

π(x) T(x′ | x) = π(x′) T(x | x′)

for all pairs of states x and x′. A Markov chain whose transition operator satisfies detailed balance with respect to π is called reversible, and detailed balance is a sufficient (though not necessary) condition for π to be a stationary distribution of the chain.
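The sufficiency follows from a one-line computation: summing both sides of the detailed balance condition over x and using the fact that T(x | x′) sums to 1 over x gives

Σ_x π(x) T(x′ | x) = Σ_x π(x′) T(x | x′) = π(x′) Σ_x T(x | x′) = π(x′),

which is exactly the equilibrium condition from the beginning of this section.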
Figure 15.1: Paths followed by Gibbs sampling for three distributions, with the Markov chain initialized at the mode in all cases. Left) A multivariate normal distribution
with two independent variables. Gibbs sampling mixes well because the variables are
independent. Center) A multivariate normal distribution with highly correlated vari-
ables. The correlation between variables makes it difficult for the Markov chain to mix.
Because each variable must be updated conditioned on the other, the correlation reduces
the rate at which the Markov chain can move away from the starting point. Right) A
mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs
sampling mixes very slowly because it is difficult to change modes while altering only
one variable at a time.
Figure 15.2: An illustration of the slow mixing problem in deep probabilistic models.
Each panel should be read left to right, top to bottom. Left) Consecutive samples
from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST
dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is performed in a deep graphical model, this similarity is based more on semantic features than on raw visual features, but it is still difficult for the Gibbs chain to transition from
one mode of the distribution to another, for example by changing the digit identity.
Right) Consecutive ancestral samples from a generative adversarial network. Because
ancestral sampling generates each sample independently from the others, there is no
mixing problem.