Chapter 9
Structured Probabilistic Models:
A Deep Learning Perspective
Deep learning draws upon many modeling formalisms that researchers can use to guide
their design efforts and describe their algorithms. One of these formalisms is the idea of
structured probabilistic models . A structured probabilistic model is a way of describing
a probability distribution, using a graph to describe which random variables in the
probability distribution interact with each other directly. Here we use “graph” in the
graph theory sense: a set of vertices connected to one another by a set of edges. Because
the structure of the model is defined by a graph, these models are often also referred to
as graphical models .
The graphical models research community is large and has developed many different
models, training algorithms, and inference algorithms. In this chapter, we provide basic
background on some of the most central ideas of graphical models, with an emphasis on
the concepts that have proven most useful to the deep learning research community. If
you already have a strong background in graphical models, you may wish to skip most
of this chapter. However, even a graphical model expert may benefit from reading the
final section of this chapter, section 9.7, in which we highlight some of the unique ways
that graphical models are used for deep learning algorithms. Deep learning practitioners
tend to use very different model structures, learning algorithms, and inference procedures
than are commonly used by the rest of the graphical models research community. In this
chapter, we identify these differences in preferences and explain the reasons for them.
In this chapter we first describe the challenges of building large-scale probabilistic
models in section 9.1. Next, we describe how to use a graph to describe the structure
of a probability distribution in section 9.2. We then revisit the challenges we described
in section 9.1 and show how the structured approach to probabilistic modeling can
overcome these challenges in section 9.3. One of the major difficulties in graphical
modeling is understanding which variables need to be able to interact directly, i.e., which
graph structures are most suitable for a given problem. We outline two approaches to
resolving this difficulty by learning about the dependencies in section 9.4. We then
describe how to draw samples from these models in section 9.5 and how to infer the
values of latent variables in section 9.6. Finally, we close with a discussion of the
unique emphasis that deep learning practitioners place on specific approaches to
graphical modeling in section 9.7.
9.1 The Challenge of Unstructured Modeling
The goal of deep learning is to scale machine learning to the kinds of challenges needed to
solve artificial intelligence. The feature spaces we work with are often very large, because
they are intended to be reasonable approximations of the kind of features that people
and animals observe through their various sensory organs. For example, many deep
learning researchers are interested in modeling the distribution over natural images. By
“natural images” we mean images that might be observed by a camera in a reasonably
ordinary environment, as opposed to synthetically rendered images, screen shots of web
pages, etc. Of course, natural images are very high dimensional. Even though much
research into natural image modeling focuses on modeling only small patches, a 32 × 32
pixel image with three color channels has 3,072 features. See Fig. 9.1 for an example of
this kind of data, and the kind of samples a machine learning model is able to generate
after learning to represent this distribution.
Modeling a rich distribution over thousands or millions of random variables is a
challenging task, both computationally and statistically. Suppose we only wanted to
model binary variables. This is the simplest possible case, and yet already it seems
overwhelming. There are 2^3072 possible binary images of this form. This number is
over 10^800 times larger than the estimated number of atoms in the universe.
In general, if we wish to model a distribution over a random vector x containing
n discrete variables capable of taking on k values each, then the naive approach of
representing P (x) by storing a lookup table with one probability value per possible
outcome requires k^n parameters!
This is not feasible for several reasons:
Memory: the cost of storing the representation: For all but very small
values of n and k, representing the distribution as a table will require too many
values to store.
Statistical efficiency: in order to avoid overfitting, a reasonable rule of thumb
is that one should have about ten times more training examples than parameters
in the model. No dataset with even a tiny fraction of this number of examples is
available for the table-based model. Any such model will overfit the training set
very badly, unless some very strong regularizer sufficiently cripples the model in
other ways.
Runtime: the cost of inference: Suppose we want to perform an inference
task where we use our model of P(x) to compute some other distribution, such as
P(x_1) or P(x_2 | x_1). Computing these distributions will require summing across
the entire table, so the runtime of these operations is as high as the intractable
memory cost of storing the model.
Figure 9.1: Probabilistic modeling of natural images. Top: Example 32 × 32 pixel color
images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). Bottom: Samples
drawn from a structured probabilistic model trained on this dataset. Each sample
appears at the same position in the grid as the training example that is closest to it in
Euclidean space. This comparison allows us to see that the model is truly synthesizing
new images, rather than memorizing the training data. Contrast of both sets of images
has been adjusted for display. Figure reproduced with permission from (Courville et al.,
2011).
Runtime: the cost of sampling: Likewise, suppose we want to draw a sample
from the model. The naive way to do this is to sample some value u ∼ U(0, 1),
then iterate through the table adding up the probability values until they exceed
u and return the outcome whose probability value was added last. This requires
reading through the whole table in the worst case, so it has the same exponential
cost as the other operations.
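To make the naive procedure concrete, here is a minimal sketch for a toy table over
n = 3 binary variables; the table entries are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy lookup-table model over n = 3 binary variables: one probability
# per outcome, 2**3 = 8 entries that sum to 1 (random placeholders here).
table = rng.random(8)
table /= table.sum()

def naive_sample(table, rng):
    """Inverse-CDF sampling: accumulate probabilities until the running
    sum exceeds a uniform draw u. In the worst case this reads the whole
    (exponentially large) table."""
    u = rng.random()
    cumulative = 0.0
    for outcome, p in enumerate(table):
        cumulative += p
        if cumulative > u:
            return outcome
    return len(table) - 1  # guard against floating-point round-off

print(f"sampled outcome: {naive_sample(table, rng):03b}")
```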
The problem with the table-based approach is that we are allowing every possible
kind of interaction between every possible subset of variables. The probability distri-
butions we encounter in real tasks are much simpler than this. Usually, most variables
influence each other only indirectly.
For example, consider modeling the finishing times of a team in a relay race. Suppose
the team consists of three runners, Alice, Bob, and Carol. At the start of the race, Alice
carries a baton and begins running around a track. After completing her lap around
the track, she hands the baton to Bob. Bob then runs his own lap and hands the
baton to Carol, who runs the final lap. We can model each of their finishing times as
a continuous random variable. Alice’s finishing time does not depend on anyone else’s,
since she goes first. Bob’s finishing time depends on Alice’s, because Bob does not
have the opportunity to start his lap until Alice has completed hers. If Alice finishes
faster, Bob will finish faster, all else being equal. Finally, Carol’s finishing time depends
on both her teammates. If Alice is slow, Bob will probably finish late too, and Carol
will have quite a late starting time and thus is likely to have a late finishing time as
well. However, Carol’s finishing time depends only indirectly on Alice’s finishing time
via Bob’s. If we already know Bob’s finishing time, we won’t be able to estimate Carol’s
finishing time better by finding out what Alice’s finishing time was. This means we can
model the relay race using only two interactions: Alice’s effect on Bob, and Bob’s effect
on Carol. We can omit the third, indirect interaction between Alice and Carol from our
model.
Structured probabilistic models provide a formal framework for modeling only direct
interactions between random variables. This allows the models to have significantly
fewer parameters, which can in turn be estimated reliably from less data. These smaller
models also have dramatically reduced computation cost in terms of storing the model,
performing inference in the model, and drawing samples from the model.
9.2 A Graphical Syntax for Describing Model Structure
The graphs used by structured probabilistic models represent random variables using
vertices (also called nodes). These graphs represent direct interactions between random
variables using edges between nodes. Different types of graphs assign different exact
meanings to these edges.
[Figure: a chain of three nodes, t_0 → t_1 → t_2, labeled Alice, Bob, and Carol.]
Figure 9.2: A directed graphical model depicting the relay race example. Alice’s finishing
time t_0 influences Bob’s finishing time t_1, because Bob does not get to start running
until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so
Bob’s finishing time t_1 influences Carol’s finishing time t_2.
9.2.1 Directed Models
One kind of structured probabilistic model is the directed graphical model, otherwise
known as the belief network or Bayesian network^1 (Pearl, 1985). Directed graphical
models are called “directed” because the edges in them are directed, that is, they point
from one vertex to another. This direction is represented in the drawing with an arrow.
In a directed graphical model, the arrow indicates which variable’s probability
distribution is defined in terms of the other’s. If the probability distribution over a random
variable b is defined in terms of the state of a random variable a, then we draw an arrow
from a to b.

^1 Judea Pearl suggested using the term Bayes network when one wishes to “emphasize the judgmental
nature” of the values computed by the network, i.e., to highlight that they usually represent degrees of
belief rather than frequencies of events.
Let’s continue with the relay race example from section 9.1. Suppose we name
Alice’s finishing time t_0, Bob’s finishing time t_1, and Carol’s finishing time t_2. As we
saw earlier, our estimate of t_1 depends on t_0. Our estimate of t_2 depends directly on t_1,
but only indirectly on t_0. We can draw this relationship in a directed graphical model,
illustrated in Fig. 9.2.
Formally, a directed graphical model defined on variables x is defined by a directed
acyclic graph G whose vertices are the random variables in the model, and a set of local
conditional probability distributions p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents
of x_i in G. The probability distribution over x is given by

p(x) = Π_i p(x_i | Pa_G(x_i)).
In our relay race example, this means that, using the graph drawn in Fig. 9.2,

p(t_0, t_1, t_2) = p(t_0) p(t_1 | t_0) p(t_2 | t_1).
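To make this factorization concrete, here is a minimal sketch of drawing a joint sample
from it; the Gaussian conditional distributions and their parameters are hypothetical,
since the text does not specify the actual distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian conditionals for the relay race: each runner's lap
# takes roughly 90 seconds, and each finishing time (measured from the
# start of the race) is centered on the previous runner's finishing time.
def sample_t0():
    return rng.normal(90.0, 5.0)        # p(t0)

def sample_t1(t0):
    return t0 + rng.normal(90.0, 5.0)   # p(t1 | t0)

def sample_t2(t1):
    return t1 + rng.normal(90.0, 5.0)   # p(t2 | t1)

# Ancestral sampling: draw each variable after its parents, following the
# factorization p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1).
t0 = sample_t0()
t1 = sample_t1(t0)
t2 = sample_t2(t1)
print(f"finishing times: t0={t0:.1f}s t1={t1:.1f}s t2={t2:.1f}s")
```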
Suppose we represented time by discretizing time ranging from minute 0 to minute
10 into 6-second chunks. This would make t_0, t_1, and t_2 each be discrete variables with
100 possible values. If we attempted to represent p(t_0, t_1, t_2) with a table, it would need
to store 999,999 values (100 values of t_0 × 100 values of t_1 × 100 values of t_2, minus 1,
since the probability of one of the configurations is made redundant by the constraint
that the sum of the probabilities be 1). If instead, we only make a table for each of the
conditional probability distributions, then the distribution over t_0 requires 99 values,
the table over t_0 and t_1 requires 9,900 values, and so does the table over t_1 and t_2.
This comes to a total of 19,899 values. This means that using the directed graphical
model reduced our number of parameters by a factor of more than 50!
In general, to model n discrete variables each having k values, the cost of the single
table approach scales like O(k^n), as we’ve observed before. Now suppose we build a
directed graphical model over these variables. If m is the maximum number of variables
appearing in a single conditional probability distribution, then the cost of the tables
for the directed model scales like O(k^m). As long as we can design a model such that
m ≪ n, we get very dramatic savings.
In other words, so long as each variable has few parents in the graph, the distri-
bution can be represented with very few parameters. Some restrictions on the graph
structure (e.g. it is a tree) can also guarantee that operations like computing marginal
or conditional distributions over subsets of variables are efficient.
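These counts are easy to reproduce programmatically; in this sketch the helper
function is our own:

```python
def cpd_table_size(k, num_parents):
    """Free parameters in one conditional probability table for a variable
    with k values: (k - 1) probabilities per joint parent configuration."""
    return (k - 1) * k ** num_parents

k = 100  # 6-second bins over 10 minutes
joint = k ** 3 - 1                         # one giant table over (t0, t1, t2)
factored = (cpd_table_size(k, 0)           # p(t0): 99 values
            + cpd_table_size(k, 1)         # p(t1 | t0): 9,900 values
            + cpd_table_size(k, 1))        # p(t2 | t1): 9,900 values
print(joint, factored, joint / factored)   # 999999 19899 ~50.25
```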
It’s important to realize what kinds of information can be encoded in the graph, and
what can’t be. The graph just encodes simplifying assumptions about which variables
are conditionally independent from each other. It’s also possible to make other kinds of
simplifying assumptions. For example, suppose we assume Bob always runs at the same
speed, regardless of how Alice performed. (In reality, he might get overconfident and
lazy and run slower if Alice was especially fast, or he might try especially hard to make
up time if Alice was slower than usual.) Then the only effect Alice has on Bob’s finishing
time is that we must add Alice’s finishing time to the total amount of time we think
Bob needs to run. This could let us make a model with O(k) parameters instead of
O(k^2). However, note that t_0 and t_1 are still directly dependent with this assumption.
It is a useful assumption, but not one that can be encoded in a graph. It is just part of
our definition of the conditional probability distribution p(t_1 | t_0).
9.2.2 Undirected Models
Directed graphical models make a lot of intuitive sense in many situations. Usually
these are situations where we understand the causality, and the causality only flows in
one direction (such as in the relay race example: earlier runners cause effects on the
finishing times of later runners, while later runners do not cause any effects on the
finishing times of earlier runners). Not all situations fit this model, so directed models
are not always the most natural formalism to use.
Suppose we want to model a distribution over three binary variables: whether or not
you are sick, whether or not your coworker is sick, and whether or not your roommate is
sick. As in the relay race example, we can still make simplifying assumptions about the
kinds of interactions that take place: assuming that your coworker and your roommate
do not know each other, it is very unlikely that one of them will give the other a disease
such as a cold directly. This event can be seen as so rare that it is acceptable not to
model it. However, it is reasonably likely that either of them could give you a cold, and
[Figure: an undirected graph with nodes h_r, h_y, and h_c, with an edge between h_r
and h_y and an edge between h_y and h_c.]
Figure 9.3: An undirected graph representing how your roommate’s health h_r, your
health h_y, and your work colleague’s health h_c affect each other. You and your
roommate might infect each other with a cold, and you and your work colleague might
do the same, but assuming that your roommate and your colleague don’t know each
other, they can only infect each other indirectly via you.
that you could pass it on to the other. We can model the indirect transmission of a cold
from your coworker to your roommate by modeling the transmission of the cold from
your coworker to you and the transmission of the cold from you to your roommate.
In this case, it’s just as easy for you to cause your roommate to get sick as it is for
your roommate to make you sick, so there is not a clean, uni-directional narrative on
which to base the model. To handle this kind of situation, we can use another kind of
model.
Undirected models, otherwise known as Markov random fields (MRFs) or Markov
networks (Kindermann, 1980), are graphical models in which the edges are not directed.
If two nodes are connected by an edge, then the random variables corresponding to those
nodes interact with each other directly.
Let’s call the random variable representing your health h_y, the random variable
representing your roommate’s health h_r, and the random variable representing your
colleague’s health h_c. See Fig. 9.3 for a drawing of the graph representing this scenario.
Formally, an undirected graphical model is a structured probabilistic model defined
on an undirected graph G. For each clique C in the graph,^2 a factor φ(C) (also called a
clique potential) measures the affinity of the variables in that clique for being in each
of their possible states. The factors are constrained to be non-negative. Together they
define an unnormalized probability distribution

˜p(x) = Π_{C∈G} φ(C).

^2 A clique of the graph is a subset of nodes that are all connected to each other by an arc of the
graph.
The unnormalized probability distribution is efficient to work with so long as all the
cliques are small. It encodes the idea that states with higher affinity are more likely.
However, unlike in a Bayesian network, there is little structure to the definition of the
cliques, so there is nothing to guarantee that multiplying them together will yield a valid
probability distribution. See Fig. 9.4 for an example of reading factorization information
from an undirected graph.
9.2.3 The Partition Function
While the unnormalized probability distribution is guaranteed to be non-negative
everywhere, it is not guaranteed to sum or integrate to 1. To obtain a valid probability
distribution, we must use the corresponding normalized probability distribution,^3

p(x) = (1/Z) ˜p(x),

where Z is the value that results in the probability distribution summing or integrating
to 1:

Z = ∫ ˜p(x) dx.
You can think of Z as a constant when the φ functions are held constant. Note
that if the φ functions have parameters, then Z is a function of those parameters. It
is common in the literature to write Z with its arguments omitted to save space. Z is
known as the partition function, a term borrowed from statistical physics.
Since Z is an integral or sum over all possible joint assignments of the state x it is
often intractable to compute. In order to be able to obtain the normalized probability
distribution of an undirected model, the model structure and the definitions of the φ
functions must be conducive to computing Z efficiently.
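For a model this small, Z can be computed by brute-force enumeration. The sketch
below does so for the cold example of section 9.2.2, with made-up potential values
that reward agreement between your health and each acquaintance’s:

```python
import itertools

# Made-up clique potentials (0 = healthy, 1 = sick); the value 2.0 is an
# arbitrary choice that makes agreeing states more likely.
def phi(a, b):
    return 2.0 if a == b else 1.0

def p_tilde(h_y, h_r, h_c):
    # One factor per clique: (h_y, h_r) and (h_y, h_c).
    return phi(h_y, h_r) * phi(h_y, h_c)

# Brute-force partition function: sum over all 2**3 joint states. This is
# feasible only because the state space is tiny.
Z = sum(p_tilde(*s) for s in itertools.product([0, 1], repeat=3))

def p(h_y, h_r, h_c):
    return p_tilde(h_y, h_r, h_c) / Z

print(Z, p(1, 1, 1))  # Z = 18.0, p(everyone sick) = 4/18
```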
Note that for p(x) to exist, Z must exist. The generic definition of Z does not
guarantee that it exists in general. For example, suppose we want to model a single
scalar variable x ∈ R with a single clique potential φ(x) = x^2. In this case,

Z = ∫_{−∞}^{∞} x^2 dx.
Since this integral diverges, there is no probability distribution corresponding to this
choice of φ(x). Sometimes the choice of some parameter of the φ functions determines
whether the probability distribution is defined. For example, for φ(x) = exp(βx^2), the
β parameter determines whether Z exists. Negative β results in a Gaussian distribution
over x, but all other values of β make φ impossible to normalize.
One key difference between directed modeling and undirected modeling is that di-
rected models are defined directly in terms of probability distributions from the start,
while undirected models are defined more loosely by φ functions that are then converted
into probability distributions. This changes the intuitions one must develop in order to
work with these models. One key idea to keep in mind while working with undirected
models is that the domain of each of the variables has a dramatic consequence on the
kind of probability distribution that a given set of φ functions results in. For example,
consider an n-dimensional vector-valued random variable x and an undirected model
parameterized by a vector of biases b. Suppose we have one clique for each element of
x, φ_i(x_i) = exp(b_i x_i). What kind of probability distribution does this result in? The
answer is that we don’t have enough information, because we have not yet specified
the domain of x. If x ∈ R^n, then the integral defining Z diverges and no probability
distribution exists. If x ∈ {0, 1}^n, then p(x) factorizes into n independent distributions,
with p(x_i = 1) = sigmoid(b_i). If the domain of x is the set of elementary basis vectors
({[1, 0, . . . , 0], [0, 1, . . . , 0], . . . , [0, 0, . . . , 1]}), then p(x) = softmax(b), so a large value
of b_i actually reduces p(x_j = 1) for j ≠ i. Often, it is possible to leverage the effect of
a carefully chosen domain of a variable in order to obtain complicated behavior.

^3 A distribution defined by normalizing a product of clique potentials is also called a Gibbs
distribution.

[Figure: an undirected graph with nodes A, B, C in the top row and D, E, F in the
bottom row, and edges A-B, B-C, A-D, B-E, and E-F.]
Figure 9.4: This graph implies that p(A, B, C, D, E, F) can be written as
(1/Z) φ_{A,B}(A, B) φ_{B,C}(B, C) φ_{A,D}(A, D) φ_{B,E}(B, E) φ_{E,F}(E, F) for an
appropriate choice of the φ functions.
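The contrast between the last two domains above is easy to verify numerically; the
bias values in this sketch are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b = np.array([1.0, 2.0, 3.0])  # arbitrary bias values

# Domain {0, 1}^n: phi_i(x_i) = exp(b_i * x_i) normalizes coordinate-wise,
# giving n independent Bernoulli distributions.
p_independent = sigmoid(b)               # p(x_i = 1) for each i

# Domain = elementary basis (one-hot) vectors: the same potentials
# normalize jointly, giving a single categorical distribution.
p_one_hot = np.exp(b) / np.exp(b).sum()  # softmax(b)

print(p_independent)  # entries need not sum to 1
print(p_one_hot)      # entries sum to 1; raising one b_i lowers the rest
```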
9.2.4 Energy-Based Models
Many interesting theoretical results about undirected models depend on the assumption
that ∀x, ˜p(x) > 0. A convenient way to enforce this is to use an energy-based model
(EBM), where

˜p(x) = exp(−E(x))     (9.1)

and E(x) is known as the energy function. Because exp(z) is positive for all z, this
guarantees that no energy function will result in a probability of zero for any state
x. Being completely free to choose the energy function makes learning simpler. If we
learned the clique potentials directly, we would need to use constrained optimization,
and we would need to impose some specific minimal probability value. By learning the
energy function, we can use unconstrained optimization,^4 and the probabilities in the
model can approach arbitrarily close to zero but never reach it.
Any distribution of the form given by equation 9.1 is an example of a Boltzmann
distribution. For this reason, many energy-based models are called Boltzmann machines.
There is no accepted guideline for when to call a model an energy-based model and when
to call it a Boltzmann machine. The term Boltzmann machine was first introduced to
describe a model with exclusively binary variables, but today many models such as the
mean-covariance restricted Boltzmann machine incorporate real-valued variables as well.
Cliques in an undirected graph correspond to factors of the unnormalized probabil-
ity function. Because exp(a) exp(b) = exp(a + b), this means that different cliques in
the undirected graph correspond to the different terms of the energy function. In other
words, an energy-based model is just a special kind of Markov network: the exponen-
tiation makes each term in the energy function correspond to a factor for a different
clique. See Fig. 9.5 for an example of how to read the form of the energy function from
an undirected graph structure.
^4 For some models, we may still need to use constrained optimization to make sure Z exists.
[Figure: the same graph as Fig. 9.4, with nodes A, B, C and D, E, F.]
Figure 9.5: This graph implies that E(A, B, C, D, E, F) can be written as E_{A,B}(A, B) +
E_{B,C}(B, C) + E_{A,D}(A, D) + E_{B,E}(B, E) + E_{E,F}(E, F) for an appropriate choice
of the per-clique energy functions. Note that we can obtain the φ functions in Fig. 9.4
by setting each φ to the exp of the corresponding negative energy, e.g.,
φ_{A,B}(A, B) = exp(−E_{A,B}(A, B)).
As a historical note, observe that the − sign in equation 9.1 does not change the
representational power of the energy-based model. From a machine learning point of view, the
negation serves no purpose. Some machine learning researchers (e.g., Smolensky (1986),
who referred to negative energy as harmony) have worked on related ideas that omit the
negation. However, in the field of statistical physics, energy is a useful concept because
it refers to real energy of physical particles. Many advances in probabilistic modeling
were originally developed by statistical physicists, and terminology such as “energy” and
“partition function” remains associated with these techniques, even though their math-
ematical applicability is broader than the physics context in which they were developed.
9.2.5 Separation and D-Separation
The main purpose of a graphical model is to specify which interactions do not occur in a
given probability distribution so that we can save computational resources and estimate
the model with greater statistical efficiency. The edges in the graph show which variables
directly interact, but it can also be useful to know which variables interact indirectly,
and in what context.
Identifying the independences in a graph is very simple in the case of undirected
models. In the context of undirected models, this independence implied by the graph
is called separation. We say that a set of variables A is separated from another set
of variables B given a third set of variables S if the graph structure implies that A is
independent from B given S. Determining which variables are separated is simple. If two
variables A and B are connected by a path involving only unobserved variables, then
those variables are not separated. If these variables cannot be shown to depend indirectly
on each other in this manner, then they are separated. We refer to paths involving only
unobserved variables as “active” and paths including an observed variable as “inactive.”
When we draw a graph, we can indicate observed variables by shading them in. See
Fig. 9.6 for a depiction of what an active and an inactive path looks like when drawn in
this way. See Fig. 9.7 for an example of reading separation from a graph.
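Separation can be checked mechanically with a graph search that is only allowed to
pass through unobserved nodes. The following sketch does this; the edge set used in
the test is one choice consistent with the caption of Fig. 9.7:

```python
from collections import deque

def separated(graph, A, B, observed):
    """Breadth-first search that may pass only through unobserved nodes.
    A and B are separated given `observed` iff no such path connects them."""
    frontier = deque(a for a in A if a not in observed)
    reached = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False  # active path found: not separated
        for neighbor in graph[node]:
            if neighbor not in reached and neighbor not in observed:
                reached.add(neighbor)
                frontier.append(neighbor)
    return True

# One edge set consistent with the caption of Fig. 9.7: A-B, B-C, A-D, B-D.
g = {"A": {"B", "D"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"A", "B"}}
print(separated(g, {"A"}, {"C"}, {"B"}))  # True: observing B blocks A-C
print(separated(g, {"A"}, {"D"}, {"B"}))  # False: the direct A-D path is active
```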
Similar concepts apply to directed models, except that in the context of directed
[Figure: (a) a chain of nodes A, S, B with S unshaded; (b) the same chain with S
shaded to indicate that it is observed.]
Figure 9.6: a) The path between random variable A and random variable B through
S is active, because S is not observed. This means that A and B are not separated.
b) Here S is shaded in, to indicate that it is observed. Because the only path between
A and B is through S, and that path is inactive, we can conclude that A and B are
separated given S.
[Figure: an undirected graph over nodes A, B, C, and D; B is shaded to indicate
that it is observed.]
Figure 9.7: An example of reading separation properties from an undirected graph. B
is shaded to indicate that it is observed. Because observing B blocks the only path from
A to C, we say that A and C are separated from each other given B. The observation
of B also blocks one path between A and D, but there is a second, active path between
them. Therefore, A and D are not separated given B.
models, these concepts are referred to as d-separation. The “d” stands for “dependence.”
D-separation for directed graphs is defined the same as separation for undirected graphs:
We say that a set of variables A is d-separated from another set of variables B given
a third set of variables S if the graph structure implies that A is independent from B
given S.
As with undirected models, we can examine the independences implied by the graph
by looking at what active paths exist in the graph. As before, two variables are depen-
dent if there is an active path between them, and d-separated if no such path exists. In
directed nets, determining whether a path is active is somewhat more complicated. See
Fig. 9.8 for a guide to identifying active paths in a directed model. See Fig. 9.9 for an
example of reading some properties from a graph.
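Determining whether an active path exists in a directed graph can also be mechanized.
Below is a sketch of the standard reachability procedure (the “Bayes ball” algorithm
described by Koller and Friedman (2009)); the function name and graph encoding are
our own, and the test graph follows the caption of Fig. 9.9:

```python
def d_separated(parents, X, Y, observed):
    """Standard reachability check ("Bayes ball"): X and Y are d-separated
    given `observed` iff no active path connects them. `parents` maps each
    node to its set of parents in the DAG."""
    children = {n: set() for n in parents}
    for n in parents:
        for p in parents[n]:
            children[p].add(n)
    # Nodes that are observed or have an observed descendant: exactly the
    # colliders that pass the ball between their parents.
    unblocked = set()
    stack = list(observed)
    while stack:
        n = stack.pop()
        if n not in unblocked:
            unblocked.add(n)
            stack.extend(parents[n])
    visited = set()
    stack = [(X, "up")]  # leaving X toward its parents counts as "up"
    while stack:
        node, direction = stack.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == Y and node not in observed:
            return False  # an active path reaches Y
        if direction == "up" and node not in observed:
            stack.extend((p, "up") for p in parents[node])
            stack.extend((c, "down") for c in children[node])
        elif direction == "down":
            if node not in observed:      # chain continues downward
                stack.extend((c, "down") for c in children[node])
            if node in unblocked:         # collider with observed descendant
                stack.extend((p, "up") for p in parents[node])
    return True

# The graph of Fig. 9.9 as described in its caption: A -> C <- B, C -> D, C -> E.
pa = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}, "E": {"C"}}
print(d_separated(pa, "A", "B", set()))   # True
print(d_separated(pa, "A", "E", {"C"}))   # True
print(d_separated(pa, "A", "B", {"C"}))   # False: explaining away
print(d_separated(pa, "A", "B", {"D"}))   # False: descendant of C observed
```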
It is important to remember that separation and d-separation tell us only about
those conditional independences that are implied by the graph. There is no requirement
that the graph imply all independences that are present. In particular, it is always
legitimate to use the complete graph (the graph with all possible edges) to represent
any distribution. In fact, some distributions contain independences that are not possible
to represent with graphical notation. Context-specific independences are independences
that are present depending on the value of some variables in the network. For example,
consider a model of three binary variables, A, B, and C. Suppose that when A is 0, B
and C are independent, but when A is 1, B is deterministically equal to C. Encoding
the behavior when A = 1 requires an edge connecting B and C. The graph then fails
to indicate that B and C are independent when A = 1.
In general, a graph will never imply that an independence exists when it does not.
However, a graph may fail to encode an independence.
9.2.6 Operations on a Graph
TODO: conversion between directed and undirected models
TODO: marginalizing variables out of a graph
9.2.7 Factor Graphs
Factor graphs are another way of drawing undirected models that resolve an ambiguity
in the graphical representation of standard undirected model syntax. In an undirected
model, the scope of every φ function must be a subset of some clique in the graph.
However, it is not necessary that there exist any φ whose scope contains the entirety of
every clique. Factor graphs explicitly represent the scope of each φ function. Specifically,
a factor graph is a graphical representation of an undirected model that consists of
a bipartite undirected graph. Some of the nodes are drawn as circles. These nodes
correspond to random variables as in a standard undirected model. The rest of the
nodes are drawn as squares. These nodes correspond to the factors φ of the unnormalized
probability distribution. Variables and factors may be connected with undirected edges.
A variable and a factor are connected in the graph if and only if the variable is one
of the arguments to the factor in the unnormalized probability distribution. No factor
[Figure: the four kinds of length-two active paths between A and B through S: (a) a
chain A → S → B (or its reverse), (b) a common cause A ← S → B, (c) a V-structure
A → S ← B, and (d) a V-structure A → S ← B whose child S has a descendant C.]
Figure 9.8: All of the kinds of active paths of length two that can exist between random
variables A and B. a) Any path with arrows proceeding directly from A to B or vice
versa. This kind of path becomes blocked if S is observed. We have already seen this
kind of path in the relay race example. b) A and B are connected by a common cause S.
For example, suppose S is a variable indicating whether or not there is a hurricane and
A and B measure the wind speed at two different nearby weather monitoring outposts.
If we observe very high winds at station A, we might expect to also see high winds at
B. This kind of path can be blocked by observing S. If we already know there is a
hurricane, we expect to see high winds at B, regardless of what is observed at A. A
lower than expected wind at A (for a hurricane) would not change our expectation of
winds at B (knowing there is a hurricane). However, if S is not observed, then A and
B are dependent, i.e., the path is active. c) A and B are both parents of S. This is
called a V-structure or the collider case. A and B are related by the explaining away
effect. In this case, the path is actually active when S is observed. For example, suppose
S is a variable indicating that your colleague is not at work. A represents her being
sick, while B represents her being on vacation. If you observe that she is not at work,
you can presume she is probably sick or on vacation, but it’s not especially likely that
both have happened at the same time. If you find out that she is on vacation, this fact
is sufficient to explain her absence, and you can infer that she is probably not also sick.
d) The explaining away effect happens even if any descendant of S is observed! For
example, suppose that C is a variable representing whether you have received a report
from your colleague. If you notice that you have not received the report, this increases
your estimate of the probability that she is not at work today, which in turn makes it
more likely that she is either sick or on vacation. The only way to block a path through
a V-structure is for neither the shared child nor any of its descendants to be observed.
[Figure: a directed graph with edges A → C, B → C, C → D, and C → E.]
Figure 9.9: From this graph, we can read out several d-separation properties. Examples
include: A and B are d-separated given the empty set; A and E are d-separated given
C; D and E are d-separated given C. Note that there are a few d-separations that do
not occur: A and B are not d-separated given C; A and B are still not d-separated
given D.
may be connected to another factor in the graph, nor can a variable be connected to a
variable. See Fig. 9.10 for an example of how factor graphs can resolve ambiguity in the
interpretation of undirected networks.
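A sketch of the two readings in Fig. 9.10, with arbitrary random tables standing in
for the factors, makes the cost difference concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # hypothetical number of states per variable

# Interpretation (b) of Fig. 9.10: a single factor f1 whose scope is the
# entire clique {A, B, C}. Its table has k**3 entries.
f1_joint = rng.random((k, k, k))

# Interpretation (c): three pairwise factors over the same clique.
# Together they store only 3 * k**2 entries.
f1, f2, f3 = rng.random((k, k)), rng.random((k, k)), rng.random((k, k))

def p_tilde_b(a, b, c):
    return f1_joint[a, b, c]

def p_tilde_c(a, b, c):
    return f1[a, b] * f2[b, c] * f3[a, c]

# Both define unnormalized distributions over the same undirected graph,
# but (c) is asymptotically cheaper: O(k**2) vs. O(k**3) values.
print(f1_joint.size, f1.size + f2.size + f3.size)  # 1000 vs. 300
```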
9.3 Advantages of Structured Modeling
TODO: hammer point that graphical models convey information by leaving edges out
TODO: revisit each of the three challenges from sec:unstructured
TODO: don’t forget to teach about ancestral and Gibbs sampling while showing the
reduced cost of sampling
TODO: benefit of separating representation from learning and inference
9.4 Learning about Dependencies
We consider here two types of random variables: observed or “visible” variables v and
latent or “hidden” variables h. v corresponds to the variables actually provided in the
data set during training. h consists of variables that are introduced to the model in
order to help it explain the structure in v. Generally the exact semantics of h depend
on the model parameters and are created by the learning algorithm. The motivation for
this is twofold.
9.4.1 Latent Variables Versus Structure Learning
Often the different elements of v are highly dependent on each other. A good model of
v which did not contain any latent variables would need to have very large numbers of
parents per node in a Bayesian network or very large cliques in a Markov network. Just
representing these higher-order interactions is costly, both in a computational sense,
because the number of parameters that must be stored in memory scales exponentially
with the number of members in a clique, but also in a statistical sense, because this
exponential number of parameters requires a wealth of data to estimate accurately.

[Figure: three drawings over variables A, B, C: (a) an undirected network with a
three-variable clique, (b) a factor graph with one factor f_1 over all three variables,
(c) a factor graph with pairwise factors f_1, f_2, f_3.]
Figure 9.10: An example of how a factor graph can resolve ambiguity in the
interpretation of undirected networks. a) An undirected network with a clique involving three
variables A, B, and C. b) A factor graph corresponding to the same undirected model.
This factor graph has one factor over all three variables. c) Another valid factor graph
for the same undirected model. This factor graph has three factors, each over only
two variables. Note that representation, inference, and learning are all asymptotically
cheaper in (c) compared to (b), even though both require the same undirected graph to
represent.
There is also the problem of learning which variables need to be in such large cliques.
An entire field of machine learning called structure learning is devoted to this problem
(Koller and Friedman, 2009). Most structure learning techniques involve fitting a model
with a specific structure to the data, assigning it some score that rewards high training
set accuracy and penalizes model complexity, then greedily adding or subtracting an
edge from the graph in a way that is expected to increase the score.
Using latent variables instead avoids this whole problem. A fixed structure over
visible and hidden variables can use direct interactions between visible and hidden units
to impose indirect interactions between visible units. Using simple parameter learning
techniques we can learn a model with a fixed structure that imputes the right structure
on the marginal p(v).
9.4.2 Latent Variables for Feature Learning
Another advantage of using latent variables is that they often develop useful semantics.
As discussed in section 3.10.5, the mixture of Gaussians model learns a latent variable
that corresponds to which category of examples the input was drawn from. Other more
sophisticated models with more latent variables can create even richer descriptions of
the input. Many approaches to feature learning accomplish it by learning latent
variables. Often, given some model of v and h, it turns out that E[h | v] or
argmax_h p(h, v) is a good feature mapping for v.
TODO: structure learning
TODO: latent variables
9.5 Markov Chain Monte Carlo Methods
Drawing a sample x from the probability distribution p(x) defined by a structured model
is an important operation. The following techniques are described in (Koller and Fried-
man, 2009).
Sampling from an energy-based model is not straightforward. Suppose we have an
EBM defining a distribution p(a, b). In order to sample a, we must draw it from p(a | b),
and in order to sample b, we must draw it from p(b | a). It seems to be an intractable
chicken-and-egg problem. Directed models avoid this because their graph G is directed
and acyclic. In ancestral sampling, one simply samples each of the variables in topological
order, conditioning on each variable’s parents, which are guaranteed to have already
been sampled. This defines an efficient, single-pass method of obtaining a sample.
In an EBM, it turns out that we can get around this chicken-and-egg problem by
sampling using a Markov chain. A Markov chain is defined by a state x and a transition
distribution T(x′ | x). Running the Markov chain means repeatedly updating the state
x to a value x′ sampled from T(x′ | x).

Under certain conditions, a Markov chain is eventually guaranteed to draw x′ from
an equilibrium distribution π(x′), defined by the condition

∀x′, π(x′) = Σ_x T(x′ | x) π(x).
TODO: this vector/matrix view needs a whole lot more exposition. It is only literally
a vector/matrix when the state is discrete. Unpack into multiple sentences; the
parenthetical is hard to parse. Is the term “stochastic matrix” defined anywhere? Make
sure it’s in the index at least. Whoever finishes writing this section should also finish
making the math notation consistent.
We can think of π as a vector, with the probability for each possible value x stored in
the element indexed by x, π(x), and T as a corresponding stochastic matrix, with row
index x′ and column index x, i.e., with non-negative entries that sum to 1 over the
elements of a column. Then the above equation becomes

Tπ = π,

an eigenvector equation that says that π is an eigenvector of T with eigenvalue 1. It
can be shown (by the Perron-Frobenius theorem) that this is the largest possible
eigenvalue, and the only one with value 1 under mild conditions (for example,
T(x′ | x) > 0). We can also
see this equation as a fixed point equation for the update of the distribution associated
with each step of the Markov chain. If we start a chain by picking x_0 ∼ p_0, then we
get a distribution p_1 = T p_0 after one step, and p_t = T p_{t−1} = T^t p_0 after t steps.
If this recursion converges (the chain has a so-called stationary distribution), then it
converges to a fixed point, which is precisely p_t = π for t → ∞, and the dynamical
systems view meets and agrees with the eigenvector view.
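Both views are easy to check numerically for a small discrete chain; in this sketch
the two-state transition matrix is hypothetical:

```python
import numpy as np

# A made-up 2-state chain. Column x of T holds T(x' | x), following the
# column-stochastic convention above, so each column sums to 1.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

p = np.array([1.0, 0.0])  # p_0: start deterministically in state 0
for _ in range(50):
    p = T @ p             # p_t = T p_{t-1}
print(p)                  # approaches the stationary distribution pi

# The eigenvector view: pi is the eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(pi / pi.sum())      # same distribution, up to normalization
```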
This condition guarantees that repeated applications of the transition sampling pro-
cedure don’t change the distribution over the state of the Markov chain. Running the
Markov chain until it reaches its equilibrium distribution is called “burning in” the
Markov chain.
Unfortunately, there is no theory to predict how many steps the Markov chain must
run before reaching its equilibrium distribution, nor any way to tell for sure that this
event has happened. Also, even though successive samples come from the same distri-
bution, they are highly correlated with each other, so to obtain multiple samples one
should run the Markov chain for many steps between collecting each sample. Markov
chains tend to get stuck in a single mode of π(x) for several steps. The speed with which
a Markov chain moves from mode to mode is called its mixing rate. Since burning in
a Markov chain and getting it to mix well may take several sampling steps, sampling
correctly from an EBM is still a somewhat costly procedure.
TODO: mention Metropolis-Hastings
Of course, all of this depends on ensuring π(x) = p(x). Fortunately, this is easy
so long as p(x) is defined by an EBM. The simplest method is to use Gibbs sampling,
in which sampling from T(x′ | x) is accomplished by selecting one variable x_i and
sampling it from p conditioned on its neighbors in G. It is also possible to sample
several variables at the same time so long as they are conditionally independent given
all of their neighbors.
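As a concrete illustration, here is a minimal sketch of Gibbs sampling for an EBM over
two binary variables that prefer to take the same state; the energy function and its
weight w are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical EBM over binary a, b that prefer to agree:
# E(a, b) = -w * a * b with w > 0, so p_tilde(a, b) = exp(w * a * b).
w = 2.0

def sample_given(other):
    """Sample one variable from its conditional given the other:
    p(x = 1 | other) = exp(w * other) / (exp(w * other) + 1)."""
    p1 = 1.0 / (1.0 + np.exp(-w * other))
    return int(rng.random() < p1)

a, b = 0, 0
samples = []
for step in range(10000):
    a = sample_given(b)   # resample a from p(a | b)
    b = sample_given(a)   # resample b from p(b | a)
    samples.append((a, b))

agree = np.mean([x == y for x, y in samples])
print(f"fraction of samples with a == b: {agree:.2f}")  # well above 0.5
```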
TODO: discussion of mixing example with 2 binary variables that prefer to both
have the same state IG’s graphic from lecture on adversarial nets
TODO: add section on MCMC; it needs to be developed here so both the generative
autoencoders and the advanced deep nets can refer back to it.
TODO: there is some discussion of Markov chains already when describing how to
sample from an EBM; determine how to present content. NOTE: there seems to be
stuff about MCMC in section 7.3 already.
TODO: refer to this figure in the text
9.6 Inference and Approximate Inference Over Latent Vari-
ables
As soon as we introduce latent variables in a graphical model, a question arises: how
do we choose values of the latent variables h given values of the visible variables v?
This is what we call inference, in particular inference over the latent variables. The
general question of inference is to guess the values of some variables given others.
TODO: mention loopy BP, show how it is very expensive for DBMs
TODO: briefly explain what variational inference is and reference approximate in-
ference chapter
Figure 9.11: Paths followed by Gibbs sampling for three distributions, with the Markov
chain initialized at the mode in each case. Left) A multivariate normal distribution
with two independent variables. Gibbs sampling mixes well because the variables are
independent. Center) A multivariate normal distribution with highly correlated vari-
ables. The correlation between variables makes it difficult for the Markov chain to mix.
Because each variable must be updated conditioned on the other, the correlation reduces
the rate at which the Markov chain can move away from the starting point. Right) A
mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs
sampling mixes very slowly because it is difficult to change modes while altering only
one variable at a time.
Figure 9.12: An illustration of the slow mixing problem in deep probabilistic models.
Each panel should be read left to right, top to bottom. Left) Consecutive samples
from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST
dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is
performed in a deep graphical model, this similarity is based more on semantic features
than on raw visual ones, but it is still difficult for the Gibbs chain to transition from
one mode of the distribution to another, for example by changing the digit identity.
Right) Consecutive ancestral samples from a generative adversarial network. Because
ancestral sampling generates each sample independently from the others, there is no
mixing problem.
9.7 The Deep Learning Approach to Structured Proba-
bilistic Modeling
Deep learning practitioners generally use the same basic computational tools as other
machine learning practitioners who work with structured probabilistic models. However,
in the context of deep learning, we usually make different design decisions about how to
combine these tools, resulting in overall algorithms and models that have a very different
flavor from more traditional graphical models.
The most striking difference between the deep learning style of graphical model de-
sign and the traditional style of graphical model design is that the deep learning style
heavily emphasizes the use of latent variables. Deep learning models typically have more
latent variables than observed variables. Moreover, the practitioner typically does not
intend for the latent variables to take on any specific semantics ahead of time— the
training algorithm is free to invent the concepts it needs to model a particular dataset.
The latent variables are usually not very easy for a human to interpret after the fact,
though visualization techniques may allow some rough characterization of what they
represent. Complicated non-linear interactions between variables are accomplished via
indirect connections that flow through multiple latent variables. By contrast, traditional
graphical models usually contain variables that are at least occasionally observed, even
if many of the variables are missing at random from some training examples. Com-
plicated non-linear interactions between variables are modeled by using higher-order
terms, with structure learning algorithms used to prune connections and control model
capacity. When latent variables are used, they are often designed with some specific
semantics in mind: the topic of a document, the intelligence of a student, the disease
causing a patient’s symptoms, etc. These models are often much
more interpretable by human practitioners and often have more theoretical guarantees,
yet are less able to scale to complex problems and are not reusable in as many different
contexts as deep models.
Another obvious difference is the kind of graph structure typically used in the deep
learning approach. This is tightly linked with the choice of inference algorithm. Tradi-
tional approaches to graphical models typically aim to maintain the tractability of exact
inference. When this constraint is too limiting, a popular approximate inference algorithm is
loopy belief propagation. Both of these approaches often work well with very sparsely
connected graphs. By comparison, very few interesting deep models admit exact infer-
ence, and loopy belief propagation is almost never used for deep learning. Most deep
models are designed to make Gibbs sampling or variational inference algorithms, rather
than loopy belief propagation, efficient. Another consideration is that deep learning
models contain a very large number of latent variables, making efficient numerical code
essential. As a result of these design constraints, most deep learning models are orga-
nized into regular repeating patterns of units grouped into layers, but neighboring layers
may be fully connected to each other. When sparse connections are used, they usually
follow a regular pattern, such as the block connections used in convolutional models.
Finally, the deep learning approach to graphical modeling is characterized by a
marked tolerance of the unknown. Rather than simplifying the model until all quantities
[Figure: a bipartite undirected graph with hidden units h_1, h_2, h_3, h_4 in one layer
and visible units v_1, v_2, v_3 in the other.]
Figure 9.13: An example RBM drawn as a Markov network.
we might want can be computed exactly, we increase the power of the model until it is
just barely possible to train or use. We often use models whose marginal distributions
cannot be computed, and are satisfied simply to draw approximate samples from these
models. We often train models with an intractable objective function that we cannot
even approximate in a reasonable amount of time, but we are still able to approximately
train the model if we can efficiently obtain an estimate of the gradient of such a func-
tion. The deep learning approach is often to figure out what the minimum amount
of information we absolutely need is, and then to figure out how to get a reasonable
approximation of that information as quickly as possible.
9.7.1 Example: The Restricted Boltzmann Machine
TODO: rework this section. Add pointer to Chapter 17.1. TODO: what do we want to
exemplify here?
The restricted Boltzmann machine (RBM) (Smolensky, 1986), or harmonium, is an
example of a model that TODO: what do we want to exemplify here?
It is an energy-based model with binary visible and hidden units. Its energy function
is

E(v, h) = −b⊤v − c⊤h − v⊤W h,

where b, c, and W are unconstrained, real-valued, learnable parameters. The model
is depicted graphically in Fig. 9.13. As this figure makes clear, an important aspect
of this model is that there are no direct interactions between any two visible units or
between any two hidden units (hence the “restricted”; a general Boltzmann machine
may have arbitrary connections).
The restrictions on the RBM structure yield the nice properties

p(h | v) = Π_i p(h_i | v)

and

p(v | h) = Π_i p(v_i | h).
The individual conditionals are simple to compute as well. For example,

p(h_i = 1 | v) = σ(v⊤W_{:,i} + c_i).
Together these properties allow for efficient block Gibbs sampling, alternating be-
tween sampling all of h simultaneously and sampling all of v simultaneously.
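A minimal sketch of this block Gibbs procedure follows; the layer sizes, initialization
scale, and number of steps are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy RBM with hypothetical, randomly initialized parameters.
n_v, n_h = 6, 4
W = 0.1 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)  # visible biases
c = np.zeros(n_h)  # hidden biases

def sample_h_given_v(v):
    """All hidden units are conditionally independent given v, so the
    whole layer is sampled in one vectorized step."""
    p = sigmoid(v @ W + c)             # p(h_i = 1 | v)
    return (rng.random(n_h) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(W @ h + b)             # p(v_i = 1 | h)
    return (rng.random(n_v) < p).astype(float)

# Block Gibbs sampling: alternate between the two layers.
v = rng.integers(0, 2, n_v).astype(float)
for step in range(1000):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h)
```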
Since the energy function itself is just a linear function of the parameters, it is easy
to take the needed derivatives. For example,

∂E(v, h)/∂W_{i,j} = −v_i h_j.
These two properties, efficient Gibbs sampling and efficient derivatives, make it
possible to train the RBM with stochastic approximations to ∇_θ log Z.
TODO: do we want to say anything about Sum-Product Networks here? If not here,
where else in the book? Cite James Martens’ new paper on their representational
capacity.