Chapter 9

Structured Probabilistic Models:

A Deep Learning Perspective

Deep learning draws upon many modeling formalisms that researchers can use to guide

their design eﬀorts and describe their algorithms. One of these formalisms is the idea of

structured probabilistic models. A structured probabilistic model is a way of describing

a probability distribution, using a graph to describe which random variables in the

probability distribution interact with each other directly. Here we use “graph” in the

graph theory sense: a set of vertices connected to one another by a set of edges. Because

the structure of the model is deﬁned by a graph, these models are often also referred to

as graphical models.

The graphical models community is large and has developed many diﬀerent models,

training algorithms, and inference algorithms. In this chapter, we provide basic background
on some of the most central ideas of graphical models, with an emphasis on the

concepts that have proven most useful to the deep learning community. If you already

have a strong background in graphical models, you may wish to skip most of this chapter.
However, even a graphical model expert may benefit from reading the final section

of this chapter, section 9.5, in which we highlight some of the unique ways that graphical

models are used for deep learning algorithms. Deep learning practitioners tend to use

very different model structures, learning algorithms, and inference procedures from those

commonly used by the rest of the graphical models community. In this chapter, we

identify these diﬀerences in preferences and explain the reasons for them.

In this chapter we ﬁrst describe the challenges of building large-scale probabilistic

models in section 9.1. Next, we describe how to use a graph to describe the structure

of a probability distribution in section 9.2. We then revisit the challenges we described

in section 9.1 and show how the structured approach to probabilistic modeling can

overcome these challenges in section 9.3. One of the major diﬃculties in graphical

modeling is understanding which variables need to be able to interact directly, i.e., which

graph structures are most suitable for a given problem. We outline two approaches to

resolving this diﬃculty by learning about the dependencies in section 9.4. Finally, we

close with a discussion of the unique emphasis that deep learning practitioners place on

speciﬁc approaches to graphical modeling in section 9.5.

9.1 The challenge of unstructured modeling

The goal of deep learning is to scale machine learning to the kinds of challenges needed to

solve artiﬁcial intelligence. The feature spaces we work with are often very large, because

they are intended to be reasonable approximations of the kind of features that people

and animals observe through their various sensory organs. For example, many deep

learning researchers are interested in modeling the distribution over natural images. By

“natural images” we mean images that might be observed by a camera in a reasonably

ordinary environment, as opposed to synthetically rendered images, screenshots of web

pages, etc. Of course, natural images are very high dimensional. Even though much

research into natural image modeling focuses on modeling only small patches, a 32 × 32

pixel image with three color channels has 3,072 features. See Fig. 9.1 for an example of

this kind of data, and the kind of samples a machine learning model is able to generate

after learning to represent this distribution.

Modeling a rich distribution over thousands or millions of random variables is a

challenging task, both computationally and statistically. Suppose we only wanted to

model binary variables. This is the simplest possible case, and yet already it seems

overwhelming. There are $2^{3072}$ possible binary images of this form. This number is over $10^{800}$
times larger than the estimated number of atoms in the universe.
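The size of this number can be checked with a short calculation. The sketch below is illustrative only; it assumes the commonly quoted rough estimate of 10^80 atoms in the observable universe, a figure that is not stated in the text above.

```python
import math

# Number of binary 32 x 32 x 3 images, expressed as a power of ten.
num_features = 32 * 32 * 3                    # 3,072 binary variables
log10_images = num_features * math.log10(2)   # log10(2**3072)

# Assumption: roughly 10**80 atoms in the observable universe (a common estimate).
log10_atoms = 80

# Orders of magnitude by which the image count exceeds the atom count.
log10_ratio = log10_images - log10_atoms

print(f"2**{num_features} is about 10**{log10_images:.1f}")
print(f"ratio to the atom count is about 10**{log10_ratio:.1f}")
```

Working in logarithms avoids having to print a number with more than 900 digits.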

In general, if we wish to model a distribution over a vector x containing n discrete

variables capable of taking on k values each, then the naive approach of representing

P(x) by storing a lookup table with one probability value per possible outcome requires
$k^n$ parameters!

This is not feasible for several reasons:

• Memory: the cost of storing the representation: For all but very small

values of n and k, representing the distribution as a table will require too many

values to store.

• Statistical eﬃciency: in order to avoid overﬁtting, a reasonable rule of thumb

is that one should have about ten times more training examples than parameters

in the model. No dataset with even a tiny fraction of this number of examples
is available for the table-based model. Any such model will overfit the training
set very badly, unless some very strong regularizer sufficiently cripples the model in

other ways.

• Runtime: the cost of inference: Suppose we want to perform an inference

task where we use our model of P(x) to compute some other distribution, such as

$P(x_1)$ or $P(x_2 \mid x_1)$. Computing these distributions will require summing across

the entire table, so the runtime of these operations is as high as the intractable

memory cost of storing the model.

Figure 9.1: Probabilistic modeling of natural images. Top: Example 32 × 32 pixel

color images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). Bottom:

Samples drawn from a structured probabilistic model trained on this dataset. Each

sample appears at the same position in the grid as the training example that is closest

to it in Euclidean space. This comparison allows us to see that the model is truly

synthesizing new images, rather than memorizing the training data. Contrast of both

sets of images has been adjusted for display. Figure reproduced with permission from

(Courville et al., 2011).

• Runtime: the cost of sampling: Likewise, suppose we want to draw a sample

from the model. The naive way to do this is to sample some value u ∼ U(0, 1),

then iterate through the table adding up the probability values until they exceed

u and return the outcome whose probability value was added last. This requires

reading through the whole table in the worst case, so it has the same exponential

cost as the other operations.
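The naive sampling procedure in the last bullet can be sketched as follows. This is a minimal illustration, not an efficient sampler; the function name `sample_from_table` and the four-row table are hypothetical.

```python
import random

def sample_from_table(table, u):
    """Return the first outcome whose cumulative probability exceeds u.

    `table` is a list of (outcome, probability) pairs that sum to 1, and
    `u` is a draw from U(0, 1). The loop may read the whole table.
    """
    cumulative = 0.0
    for outcome, prob in table:
        cumulative += prob
        if cumulative > u:
            return outcome
    # Guard against floating-point round-off when u is very close to 1.
    return table[-1][0]

# Hypothetical table over two binary variables (k = 2, n = 2, so k**n = 4 rows).
table = [((0, 0), 0.1), ((0, 1), 0.2), ((1, 0), 0.3), ((1, 1), 0.4)]

u = random.random()   # u ~ U(0, 1)
print(sample_from_table(table, u))
```

Because the loop walks the table row by row, its worst-case cost grows with the number of rows, which is $k^n$ for the full table.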

The problem with the table-based approach is that we are allowing every possible

kind of interaction between every possible subset of variables. The probability distributions
we encounter in real tasks are much simpler than this. Usually, most variables

inﬂuence each other only indirectly.

For example, consider modeling the ﬁnishing times of a team in a relay race. Suppose

the team consists of three runners, Alice, Bob, and Carol. At the start of the race, Alice

carries a baton and begins running around a track. After completing her lap around

the track, she hands the baton to Bob. Bob then runs his own lap and hands the

baton to Carol, who runs the ﬁnal lap. We can model each of their ﬁnishing times as

a continuous random variable. Alice’s ﬁnishing time does not depend on anyone else’s,

since she goes ﬁrst. Bob’s ﬁnishing time depends on Alice’s, because Bob does not have

the opportunity to start his lap until Alice has completed hers. If Alice ﬁnishes faster,

Bob will finish faster, all else being equal. Finally, Carol’s finishing time depends on both

her teammates. If Alice is slow, Bob will probably ﬁnish late too, and Carol will have

quite a late starting time and thus is likely to have a late ﬁnishing time as well. However,

Carol’s ﬁnishing time depends only indirectly on Alice’s ﬁnishing time via Bob’s. If we

already know Bob’s ﬁnishing time, we won’t be able to estimate Carol’s ﬁnishing time

better by ﬁnding out what Alice’s ﬁnishing time was. This means we can model the

relay race using only two interactions: Alice’s eﬀect on Bob, and Bob’s eﬀect on Carol.

We can omit the third, indirect interaction between Alice and Carol from our model.

Structured probabilistic models provide a formal framework for modeling only direct

interactions between random variables. This allows the models to have signiﬁcantly

fewer parameters, which in turn can be estimated reliably from less data. These smaller
models also have dramatically reduced computational cost in terms of storing the model,

performing inference in the model, and drawing samples from the model.

9.2 A graphical syntax for describing model structure

The graphs used by structured probabilistic models represent random variables using

vertices (also called nodes). These graphs represent direct interactions between random

variables using edges between nodes. Diﬀerent types of graphs assign diﬀerent exact

meanings to these edges.

Figure 9.2: A directed graphical model depicting the relay race example. Alice’s finishing
time $t_0$ influences Bob’s finishing time $t_1$, because Bob does not get to start running
until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so
Bob’s finishing time $t_1$ influences Carol’s finishing time $t_2$.

9.2.1 Directed models

One kind of structured probabilistic model is the directed graphical model, otherwise
known as the belief network or Bayesian network (Pearl, 1985). Directed graphical
models are called “directed” because the edges in them are directed, that is, they point
from one vertex to another. This direction is represented in the drawing with an arrow.
In a directed graphical model, the arrow indicates which variable’s probability distribution
is defined in terms of the other’s. If the probability distribution over a variable b is
defined in terms of the state of a variable a, then we draw an arrow from a to b.

Let’s continue with the relay race example from section 9.1. Suppose we name
Alice’s finishing time $t_0$, Bob’s finishing time $t_1$, and Carol’s finishing time $t_2$. As we
saw earlier, our estimate of $t_1$ depends on $t_0$. Our estimate of $t_2$ depends directly on $t_1$
but only indirectly on $t_0$. We can draw this relationship in a directed graphical model,
illustrated in Fig. 9.2.

Formally, a directed graphical model defined on variables x is defined by a directed
acyclic graph G whose vertices are the random variables in the model, and a set of local
conditional probability distributions $p(x_i \mid \mathrm{Pa}_G(x_i))$, where $\mathrm{Pa}_G(x_i)$ gives the parents of
$x_i$ in G. The probability distribution over x is given by

$$p(x) = \prod_i p(x_i \mid \mathrm{Pa}_G(x_i)).$$

In our relay race example, this means that, using the graph drawn in Fig. 9.2,

$$p(t_0, t_1, t_2) = p(t_0)\, p(t_1 \mid t_0)\, p(t_2 \mid t_1).$$
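As a small numerical illustration of this chain factorization, the sketch below builds the joint distribution of three finishing times from one marginal table and two conditional tables. The two-valued discretization ("early" or "late") and all probability values are made up for the example.

```python
from itertools import product

# Hypothetical conditional tables for three discrete finishing times, each with
# two possible values (0 = "early", 1 = "late"). The numbers are made up.
p_t0 = [0.7, 0.3]                 # p(t0)
p_t1_given_t0 = [[0.8, 0.2],      # p(t1 | t0 = 0)
                 [0.1, 0.9]]      # p(t1 | t0 = 1)
p_t2_given_t1 = [[0.9, 0.1],      # p(t2 | t1 = 0)
                 [0.2, 0.8]]      # p(t2 | t1 = 1)

def joint(t0, t1, t2):
    """p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)."""
    return p_t0[t0] * p_t1_given_t0[t0][t1] * p_t2_given_t1[t1][t2]

# The product of normalized conditionals is itself normalized.
total = sum(joint(*state) for state in product([0, 1], repeat=3))
print(total)
```

No table indexed by all three variables at once is ever stored; the joint is assembled on demand from the local tables.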

Suppose we represented time by discretizing time ranging from minute 0 to minute
10 into 6-second chunks. This would make $t_0$, $t_1$, and $t_2$ each be discrete variables with
100 possible values. If we attempted to represent $p(t_0, t_1, t_2)$ with a table, it would need
to store 999,999 values (100 values of $t_0$ × 100 values of $t_1$ × 100 values of $t_2$, minus 1,
since the probability of one of the configurations is made redundant by the constraint
that the sum of the probabilities be 1). If instead, we only make a table for each of the
conditional probability distributions, then the distribution over $t_0$ requires 99 values,
the table over $t_0$ and $t_1$ requires 9,900 values, and so does the table over $t_1$ and $t_2$. This
comes to a total of 19,899 values. This means that using the directed graphical model
reduced our number of parameters by a factor of more than 50!

Judea Pearl suggested using the term Bayes Network when one wishes to “emphasize the judgmental
nature” of the values computed by the network, i.e., to highlight that they usually represent degrees of
belief rather than frequencies of events.

In general, to model n discrete variables each having k values, the cost of the single
table approach scales like $O(k^n)$, as we’ve observed before. Now suppose we build a
directed graphical model over these variables. If m is the maximum number of variables
appearing in a single conditional probability distribution, then the cost of the tables
for the directed model scales like $O(k^m)$. As long as we can design a model such that
$m \ll n$, we get very dramatic savings.
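The parameter counts for the relay race can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and `chain_params` assumes the simplest structured case, a chain in which every variable has at most one parent (so m = 2).

```python
def full_table_params(k, n):
    """Free parameters in one table over n variables with k values each."""
    return k ** n - 1

def chain_params(k, n):
    """Free parameters for a chain x1 -> x2 -> ... -> xn.

    The first variable needs k - 1 values; each later variable needs a
    conditional table with k rows of k - 1 free values.
    """
    return (k - 1) + (n - 1) * k * (k - 1)

k, n = 100, 3   # the relay race: three times, 100 possible values each
single = full_table_params(k, n)
chain = chain_params(k, n)
print(single, chain, single / chain)
```

Increasing n makes the gap grow quickly: the single table grows exponentially in n, while the chain grows only linearly.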

In other words, so long as each variable has few parents in the graph, the distribution
can be represented with very few parameters. Some restrictions on the graph
structure (e.g., it is a tree) can also guarantee that operations like computing marginal
or conditional distributions over subsets of variables are efficient.

It’s important to realize what kinds of information can be encoded in the graph, and
what can’t be. The graph just encodes simplifying assumptions about which variables
are conditionally independent from each other. It’s also possible to make other kinds of
simplifying assumptions. For example, suppose we assume Bob always runs the same
regardless of how Alice performed. (In reality, he might get overconfident and lazy and
run slower if Alice was especially fast, or he might try especially hard to make up time
if Alice was slower than usual.) Then the only effect Alice has on Bob’s finishing time
is that we must add Alice’s finishing time to the total amount of time we think Bob needs
to run. This could let us make a model that needs only $O(k)$ parameters for Bob instead of $O(k^2)$. However,
note that $t_0$ and $t_1$ are still directly dependent with this assumption. It is a useful
assumption, but not one that can be encoded in a graph. It’s just part of our definition
of the conditional probability distribution $p(t_1 \mid t_0)$.
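A minimal sketch of this kind of assumption: the conditional distribution of Bob’s finishing time is built from a single distribution over his own lap time. The function name, the discretization into integer time steps, and the lap-time probabilities are all hypothetical.

```python
def p_t1_given_t0(t1, t0, run_time_probs):
    """p(t1 | t0) when Bob's own lap time does not depend on Alice's result.

    run_time_probs[d] is the assumed probability that Bob's lap takes d time
    steps, so the conditional depends on t0 only through the difference t1 - t0.
    """
    d = t1 - t0
    return run_time_probs[d] if 0 <= d < len(run_time_probs) else 0.0

# Hypothetical lap-time distribution over 0..3 time steps (k = 4 entries).
run_time_probs = [0.0, 0.25, 0.5, 0.25]

# The same k numbers define the conditional for every value of t0 ...
row_t0_is_3 = [p_t1_given_t0(t1, 3, run_time_probs) for t1 in range(10)]
row_t0_is_5 = [p_t1_given_t0(t1, 5, run_time_probs) for t1 in range(10)]

# ... but the rows differ, so t1 still depends directly on t0.
print(row_t0_is_3)
print(row_t0_is_5)
```

The saving comes from parameter sharing inside the conditional distribution, not from removing an edge, which is why the graph looks the same with or without the assumption.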

9.2.2 Undirected models

Directed graphical models make a lot of intuitive sense in many situations. Usually

these are situations where we understand the causality, and the causality only flows in
one direction, as in the relay race example: earlier runners affect the
finishing times of later runners, but later runners do not affect the finishing
times of earlier runners. Not all situations fit this model, so directed models are not

always the most natural formalism to use.

Suppose we want to model a distribution over three binary variables: whether or not

you are sick, whether or not your coworker is sick, and whether or not your roommate is

sick. As in the relay race example, we can still make simplifying assumptions about the

kinds of interactions that take place: assuming that your coworker and your roommate

do not know each other, it is very unlikely that one of them will give the other a disease

such as a cold directly. This event can be seen as so rare that it is acceptable not to

model it. However, it is reasonably likely that either of them could give you a cold, and

that you could pass it on to the other. We can model the indirect transmission of a cold

from your coworker to your roommate by modeling the transmission of the cold from

your coworker to you and the transmission of the cold from you to your roommate.

In this case, it’s just as easy for you to cause your roommate to get sick as it is for
your roommate to make you sick, so there is not a clean, unidirectional narrative on
which to base the model. To handle this kind of situation, we can use another kind of
model.

Figure 9.3: An undirected graph representing how your roommate’s health $h_r$, your
health $h_y$, and your work colleague’s health $h_c$ affect each other. You and your roommate
might infect each other with a cold, and you and your work colleague might do the same,
but assuming that your roommate and your colleague don’t know each other, they can
only infect each other indirectly via you.

Undirected models, otherwise known as Markov random ﬁelds (MRFs) or Markov

networks (Kindermann, 1980), are graphical models in which the edges are not directed.

If two nodes are connected by an edge, then the random variables corresponding to those

nodes interact with each other directly.

Let’s call the random variable representing your health $h_y$, the random variable
representing your roommate’s health $h_r$, and the random variable representing your
colleague’s health $h_c$. See Fig. 9.3 for a drawing of the graph representing this scenario.

Formally, an undirected graphical model is a structured probabilistic model defined
on an undirected graph G. For each clique C in the graph, a factor φ(C) (also called
a clique potential) measures the affinity of the variables in that clique for being in each
of their possible states. The factors are constrained to be non-negative. Together they
define an unnormalized probability distribution

$$\tilde{p}(x) = \prod_{C \in G} \phi(C).$$

The unnormalized probability distribution is eﬃcient to work with so long as all the

cliques are small. It encodes the idea that states with higher aﬃnity are more likely.

However, unlike in a Bayesian network, there is little structure to the deﬁnition of the

cliques, so there is nothing to guarantee that multiplying their factors together will yield a valid
probability distribution. While the unnormalized probability distribution is guaranteed
to be non-negative everywhere, it is not guaranteed to sum or integrate to 1. To obtain

a valid probability distribution, we must use the corresponding normalized probability

distribution,

$$p(x) = \frac{1}{Z} \tilde{p}(x)$$

where Z is the value that results in the probability distribution summing or integrating
to 1,

$$Z = \int \tilde{p}(x)\, dx.$$

A clique of the graph is a subset of nodes that are all connected to each other by an edge of the graph.

A distribution defined by normalizing a product of clique potentials is also called a Gibbs distribution.

You can think of Z as a constant when the φ functions are held constant. Note

that if the φ functions have parameters, then Z is a function of those parameters. It is

common in the literature to write Z with its arguments omitted to save space. Z is

known as the partition function, a term borrowed from statistical physics.
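For a model small enough to enumerate, the partition function can be computed by brute force. The sketch below uses the three binary health variables from the cold example, named `h_r`, `h_y`, and `h_c` in the code for roommate, you, and colleague. The potential `phi` and its values are made up for illustration.

```python
from itertools import product

def phi(a, b):
    """Hypothetical pairwise clique potential: agreeing states get higher affinity."""
    return 2.0 if a == b else 1.0

def p_tilde(h_r, h_y, h_c):
    """Unnormalized probability: one factor per edge (roommate-you, you-colleague)."""
    return phi(h_r, h_y) * phi(h_y, h_c)

states = list(product([0, 1], repeat=3))   # all joint assignments
Z = sum(p_tilde(*s) for s in states)       # partition function

def p(h_r, h_y, h_c):
    """Normalized probability p(x) = p_tilde(x) / Z."""
    return p_tilde(h_r, h_y, h_c) / Z

print(Z, sum(p(*s) for s in states))
```

Here the sum has only eight terms. With n binary variables it has 2^n terms, which is why this enumeration is only a teaching device.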

Since Z is an integral or sum over all possible joint assignments of the state x, it is

often intractable to compute. In order to be able to obtain the normalized probability

distribution of an undirected model, the model structure and the deﬁnitions of the φ

functions must be conducive to computing Z eﬃciently.

Note that for p(x) to exist, Z must exist. The generic definition of Z does not
guarantee that it exists in general. For example, suppose we want to model a single
scalar variable $x \in \mathbb{R}$ with a single clique potential $\phi(x) = x^2$. In this case,

$$Z = \int_{-\infty}^{\infty} x^2 \, dx.$$

Since this integral diverges, there is no probability distribution corresponding to this
choice of φ(x). Sometimes the choice of some parameter of the φ functions determines
whether the probability distribution is defined. For example, for $\phi(x) = \exp(\beta x^2)$, the
β parameter determines whether Z exists. Negative β results in a Gaussian distribution
over x, but all other values of β make φ impossible to normalize.
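For reference, the negative-β case can be worked out with the standard Gaussian integral. The symbol α below is introduced only for convenience, with β = −α and α > 0; it does not appear elsewhere in the chapter.

```latex
Z = \int_{-\infty}^{\infty} \exp(-\alpha x^2)\, dx = \sqrt{\frac{\pi}{\alpha}},
\qquad
p(x) = \frac{1}{Z} \exp(-\alpha x^2) = \sqrt{\frac{\alpha}{\pi}} \exp(-\alpha x^2)
```

This is a Gaussian density with mean 0 and variance 1/(2α) = −1/(2β). For β ≥ 0 the integrand does not decay as |x| grows, so the integral defining Z diverges.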

One key difference between directed modeling and undirected modeling is that directed
models are defined directly in terms of probability distributions from the start,

while undirected models are deﬁned more loosely by φ functions that are then converted

into probability distributions. This changes the intuitions one must develop in order to

work with these models. One key idea to keep in mind while working with undirected

models is that the domain of each of the variables has a dramatic consequence on the

kind of probability distribution that a given set of φ functions results in. For example,

consider an n-dimensional vector-valued random variable x and an undirected model
parameterized by a vector of biases b. Suppose we have one clique for each element of
x, $\phi_i(x_i) = \exp(b_i x_i)$. What kind of probability distribution does this result in? The
answer is that we don’t have enough information, because we have not yet specified
the domain of x. If $x \in \mathbb{R}^n$, then the integral defining Z diverges and no probability
distribution exists. If $x \in \{0, 1\}^n$, then p(x) factorizes into n independent distributions,
with $p(x_i = 1) = \mathrm{sigmoid}(b_i)$. If the domain of x is the set of elementary basis vectors
({[1, 0, . . . , 0], [0, 1, . . . , 0], . . . , [0, 0, . . . , 1]}), then p(x) = softmax(b), so a large value of
$b_i$ actually reduces $p(x_j = 1)$ for $j \neq i$. Often, it is possible to leverage the effect of a
carefully chosen domain of a variable in order to obtain complicated behavior from a
simple energy function. We’ll explore a practical application, probabilistic max pooling,
later (TODO: add reference to the advanced graphical models chapter).
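The two finite domains can be checked numerically by enumerating every state and normalizing. The sketch below does this for a three-element bias vector. The bias values and helper names are hypothetical.

```python
import math
from itertools import product

def p_tilde(x, b):
    """Unnormalized probability: product of phi_i(x_i) = exp(b_i * x_i)."""
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)))

def normalize(domain, b):
    """Return {x: p(x)} over an explicitly enumerated, finite domain."""
    Z = sum(p_tilde(x, b) for x in domain)
    return {x: p_tilde(x, b) / Z for x in domain}

b = [0.5, -1.0, 2.0]   # hypothetical bias values
n = len(b)

# Domain {0, 1}^n: each marginal p(x_i = 1) equals sigmoid(b_i).
binary = normalize(list(product([0, 1], repeat=n)), b)
marginals = [sum(p for x, p in binary.items() if x[i] == 1) for i in range(n)]
sigmoids = [1.0 / (1.0 + math.exp(-bi)) for bi in b]

# Domain of one-hot vectors: p(x = e_i) equals softmax(b)_i.
one_hot = [tuple(int(i == j) for j in range(n)) for i in range(n)]
categorical = normalize(one_hot, b)
softmax = [math.exp(bi) / sum(math.exp(bj) for bj in b) for bi in b]

print(marginals, sigmoids)
print([categorical[e] for e in one_hot], softmax)
```

The same `p_tilde` is used in both cases; only the set of states summed over in Z changes.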

See Fig. 9.4 for an example of reading factorization information from an undirected

graph.

Figure 9.4: This graph implies that p(A, B, C, D, E, F) can be written as
$\frac{1}{Z} \phi_{A,B}(A, B)\, \phi_{B,C}(B, C)\, \phi_{A,D}(A, D)\, \phi_{B,E}(B, E)\, \phi_{E,F}(E, F)$ for an appropriate choice of
the φ functions.

Many interesting theoretical results about undirected models depend on the assumption
that $\forall x, \tilde{p}(x) > 0$. A convenient way to enforce this is to use an energy-based model
(EBM) where

$$\tilde{p}(x) = \exp(-E(x)) \qquad (9.1)$$

and E(x) is known as the energy function. Because exp(z) is positive for all z, this
guarantees that no energy function will result in a probability of zero for any state
x. Being completely free to choose the energy function makes learning simpler. If we
learned the clique potentials directly, we would need to use constrained optimization,
and we would need to impose some specific minimal probability value. By learning the
energy function, we can use unconstrained optimization, and the probabilities in the
model can approach arbitrarily close to zero but never reach it.

Cliques in an undirected graph correspond to factors of the unnormalized probability

function. Because exp(a) exp(b) = exp(a + b), this means that cliques in the undirected

graph correspond to the diﬀerent terms of the energy function. In other words, an

energy-based model is just a special kind of Markov network: the exponentiation makes

each term in the energy function correspond to a factor for a diﬀerent clique. See Fig. 9.5

for an example of how to read the form of the energy function from an undirected graph

structure.
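Written out, the correspondence between energy terms and factors is a one-line identity. The per-clique notation $E_C$, $x_C$, and $\phi_C$ is introduced here for illustration, with $x_C$ denoting the variables that belong to clique C.

```latex
\tilde{p}(x) = \exp\bigl(-E(x)\bigr)
  = \exp\Bigl(-\sum_{C} E_C(x_C)\Bigr)
  = \prod_{C} \exp\bigl(-E_C(x_C)\bigr)
  = \prod_{C} \phi_C(x_C)
```

Each factor is therefore $\phi_C(x_C) = \exp(-E_C(x_C))$, which is strictly positive for any finite energy.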

As a historical note, observe that the − sign in Eq. 9.1 does not change the representational
power of the energy-based model. From a machine learning point of view, the

negation serves no purpose. Some machine learning researchers (e.g., Smolensky (1986),

who referred to negative energy as harmony) have worked on related ideas that omit the

negation. However, in the ﬁeld of statistical physics, energy is a useful concept because

it refers to real energy of physical particles. Many advances in probabilistic modeling

were originally developed by statistical physicists, and terminology such as “energy” and

“partition function” remains associated with these techniques, even though their mathematical
applicability is broader than the physics context in which they were developed.

For some models, we may still need to use constrained optimization to make sure Z exists.

Figure 9.5: This graph implies that E(A, B, C, D, E, F) can be written as
$E_{A,B}(A, B) + E_{B,C}(B, C) + E_{A,D}(A, D) + E_{B,E}(B, E) + E_{E,F}(E, F)$ for an appropriate choice of the
per-clique energy functions. Note that we can obtain the φ functions in Fig. 9.4 by
setting each φ to the exp of the corresponding negative energy, e.g., $\phi_{A,B}(A, B) = \exp(-E_{A,B}(A, B))$.

Figure 9.6: a) The path between random variable A and random variable B through
S is active, because S is not observed. This means that A and B are not separated.
b) Here S is shaded in, to indicate that it is observed. Because the only path between
A and B is through S, and that path is inactive, we can conclude that A and B are
separated given S.

9.2.3 Separation and d-separation

The main purpose of a graphical model is to specify which interactions do not occur in a

given probability distribution so that we can save computational resources and estimate

the model with greater statistical eﬃciency. The edges in the graph show which variables

directly interact, but it can also be useful to know which variables indirectly interact,

and in what context.

Identifying the independences in a graph is very simple in the case of undirected

models. In the context of undirected models, this independence implied by the graph

is called separation. We say that a set of variables A is separated from another set

of variables B given a third set of variables S if the graph structure implies that A is

independent from B given S. Determining which variables are separated is simple. If two

variables A and B are connected by a path involving only unobserved variables, then

those variables are not separated. If these variables cannot be shown to depend indirectly

on each other in this manner, then they are separated. We refer to paths involving only

unobserved variables as “active” and paths including an observed variable as “inactive.”

When we draw a graph, we can indicate observed variables by shading them in. See

Fig. 9.6 for a depiction of what an active and an inactive path look like when drawn in

this way. See Fig. 9.7 for an example of reading separation from a graph.
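The separation test described above amounts to a graph search that is not allowed to pass through observed nodes. The sketch below is one straightforward way to write it; the function name `separated` is hypothetical, and the example graph uses the edges listed in the caption of Fig. 9.4.

```python
from collections import deque

def separated(edges, a, b, observed):
    """Return True if a and b are separated given the set `observed`.

    a and b are separated when every path between them in the undirected
    graph passes through at least one observed node. a and b themselves
    are assumed not to be in `observed`.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Breadth-first search from a that never steps onto an observed node.
    frontier, visited = deque([a]), {a}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return False   # found an active path
        for nxt in neighbors.get(node, ()):
            if nxt not in visited and nxt not in observed:
                visited.add(nxt)
                frontier.append(nxt)
    return True

# Edges of the graph described in the caption of Fig. 9.4.
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("B", "E"), ("E", "F")]
print(separated(edges, "A", "C", {"B"}))   # B lies on the only A-C path
print(separated(edges, "D", "F", set()))   # nothing observed on the path D-A-B-E-F
```

The search visits each node and edge at most once, so the test is cheap even for large graphs.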

Figure 9.7: An example of reading separation properties from an undirected graph. B
is shaded to indicate that it is observed. Because observing B blocks the only path from
A to C, we say that A and C are separated from each other given B. The observation
of B also blocks one path between A and D, but there is a second, active path between
them. Therefore, A and D are not separated given B.

Similar concepts apply to directed models, except that in the context of directed

models, these concepts are referred to as d-separation. The “d” stands for “dependence.”

D-separation for directed graphs is defined in the same way as separation for undirected graphs:

We say that a set of variables A is d-separated from another set of variables B given

a third set of variables S if the graph structure implies that A is independent from B

given S.

As with undirected models, we can examine the independences implied by the graph

by looking at what active paths exist in the graph. As before, two variables are dependent
if there is an active path between them, and d-separated if no such path exists. In

directed nets, determining whether a path is active is somewhat more complicated. See

Fig. 9.8 for a guide to identifying active paths in a directed model. See Fig. 9.9 for an

example of reading some properties from a graph.
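D-separation can also be tested programmatically. The sketch below does not enumerate active paths as the figures do. Instead it uses a standard equivalent test from the graphical models literature, the moralized ancestral graph criterion: keep only the queried and observed nodes and their ancestors, connect parents that share a child, drop edge directions, and then run an ordinary separation test. The function name and the small example graphs are hypothetical.

```python
from itertools import combinations

def d_separated(edges, a, b, observed):
    """Return True if a and b are d-separated given `observed` in a DAG.

    `edges` is a list of (parent, child) pairs. Uses the moralized
    ancestral graph criterion rather than enumerating paths.
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # 1. Keep only a, b, the observed nodes, and all of their ancestors.
    keep, stack = set(), [a, b, *observed]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: link each child to its parents and "marry" co-parents.
    neighbors = {node: set() for node in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for p, q in combinations(ps, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)

    # 3. Ordinary separation test: search from a while avoiding observed nodes.
    visited, stack = {a}, [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        for nxt in neighbors[node]:
            if nxt not in visited and nxt not in observed:
                visited.add(nxt)
                stack.append(nxt)
    return True

chain = [("A", "S"), ("S", "B")]        # A -> S -> B
fork = [("S", "A"), ("S", "B")]         # common cause S
collider = [("A", "S"), ("B", "S")]     # V-structure A -> S <- B
with_child = collider + [("S", "C")]    # C is a descendant of S

print(d_separated(chain, "A", "B", {"S"}), d_separated(collider, "A", "B", {"S"}))
```

Observing S blocks the chain and the fork, while for the collider it is the observation of S, or of its descendant C, that makes A and B dependent.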

It is important to remember that separation and d-separation tell us only about

those conditional independences that are implied by the graph. There is no requirement

that the graph imply all independences that are present. In particular, it is always

legitimate to use the complete graph (the graph with all possible edges) to represent

any distribution. In fact, some distributions contain independences that are not possible

to represent with graphical notation. Context-specific independences are independences
that are present only for certain values of some variables in the network. For example,
consider a model of three binary variables, A, B, and C. Suppose that when A is 0, B
and C are independent, but when A is 1, B is deterministically equal to C. Encoding
the behavior when A = 1 requires an edge connecting B and C. The graph then fails
to indicate that B and C are independent when A = 0.

TODO: should we say anything about sum-product networks? Do we discuss them
in this book?

In general, a graph will never imply that an independence exists when it does not.

However, a graph may fail to encode an independence.

Figure 9.8: All of the kinds of active paths of length two that can exist between random
variables A and B.

a) Any path with arrows proceeding directly from A to B or vice
versa. This kind of path becomes blocked if S is observed. We have already seen this
kind of path in the relay race example.

b) A and B are connected by a common cause S.
For example, suppose S is a variable indicating whether or not there is a hurricane and
A and B measure the wind speed at two different nearby weather monitoring outposts.
If we observe very high winds at station A, we might expect to also see high winds
at B. This kind of path can be blocked by observing S. If we already know there is
a hurricane, we expect to see high winds at B, regardless of what is observed at A. A
lower than expected wind at A (for a hurricane) would not change our expectation of
winds at B (knowing there is a hurricane). However, if S is not observed, then A and
B are dependent, i.e., the path is active.

c) A and B are both parents of S. This is
called a V-structure or the collider case. A and B are related by the explaining away
effect. In this case, the path is actually active when S is observed. For example, suppose
S is a variable indicating that your colleague is not at work. A represents her being
sick, while B represents her being on vacation. If you observe that she is not at work,
you can presume she is probably sick or on vacation, but it’s not especially likely that
both have happened at the same time. If you find out that she is on vacation, this fact
is sufficient to explain her absence, and you can infer that she is probably not also sick.

d) The explaining away effect happens even if any descendant of S is observed! For
example, suppose that C is a variable representing whether you have received a report
from your colleague. If you notice that you have not received the report, this increases
your estimate of the probability that she is not at work today, which in turn makes it
more likely that she is either sick or on vacation. The only way to block a path through
a V-structure is to observe neither the shared child nor any of its descendants.

Figure 9.9: From this graph, we can read out several d-separation properties. Examples
include: A and B are d-separated given the empty set; A and E are d-separated given
C; D and E are d-separated given C. Note that there are a few d-separations that do
not occur: A and B are not d-separated given C; A and B are also not d-separated
given D.

9.2.4 Operations on a graph

TODO: conversion between directed and undirected models
TODO: marginalizing variables out of a graph

9.2.5 Other forms of graphical syntax

Undirected models and directed models describe the basic computational principles that

are used in structured probabilistic modeling. A few other kinds of graphical notation

are used to simplify the presentation of structured models in visual form. These forms

of graphical syntax do not change what computational operations can be performed;

they just help to visually represent information in a richer or more succinct manner.

Factor graphs are another way of drawing undirected models that resolve an ambigu-

ity in the standard undirected model syntax. In an undirected model, the scope of every

φ function must be a subset of some clique in the graph. However, it is not necessary

that there exist any φ whose scope contains the entirety of every clique. Factor graphs

explicitly represent the scope of each φ function. Speciﬁcally, a factor graph is a graph-

ical representation of an undirected model that consists of a bipartite undirected graph.

Some of the nodes are drawn as circles. These nodes correspond to random variables

as in a standard undirected model. The rest of the nodes are drawn as squares. These

nodes correspond to the factors φ of the unnormalized probability distribution. Vari-

ables and factors may be connected with undirected edges. A variable and a factor are

connected in the graph if and only if the variable is one of the arguments to the factor

in the unnormalized probability distribution. No factor may be connected to another

factor in the graph, nor can a variable be connected to a variable. See Fig. 9.10 for an

example of how factor graphs can resolve ambiguity in undirected networks.
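To make the ambiguity concrete, the following Python sketch stores a factor graph simply as a list of factor scopes and compares two factorizations of the same clique: one factor over all of the variables versus one factor per pair of variables. The helper names (`table_entries`, `implied_edges`, `clique_factorizations`) and the restriction to binary variables with tabular factors are assumptions made for illustration only.

```python
from itertools import combinations

def table_entries(scopes, num_states=2):
    """Number of table entries needed to store one tabular factor per scope."""
    return sum(num_states ** len(scope) for scope in scopes)

def implied_edges(scopes):
    """Edges of the undirected model implied by a factor graph: two variables
    are neighbors whenever they appear together in the scope of some factor."""
    edges = set()
    for scope in scopes:
        edges.update(frozenset(pair) for pair in combinations(scope, 2))
    return edges

def clique_factorizations(n):
    """Two factor graphs for a clique over n variables (labelled 0..n-1)."""
    variables = tuple(range(n))
    one_big_factor = [variables]                         # a single factor over the clique
    pairwise_factors = list(combinations(variables, 2))  # one factor per pair
    return one_big_factor, pairwise_factors

for n in (3, 10, 20):
    big, pairwise = clique_factorizations(n)
    # Both factor graphs correspond to the same undirected graph.
    assert implied_edges(big) == implied_edges(pairwise)
    print(n, table_entries(big), table_entries(pairwise))
```

For three binary variables the two representations are of similar size, but the single large factor grows exponentially with the size of the clique while the pairwise factorization grows only quadratically. The undirected graph alone cannot distinguish the two cases; the factor graph can.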

TODO: plate models. Have I been using them wrong? Does the plate imply that all
the interactions are the same?

Figure 9.10: An example of how a factor graph can resolve ambiguity in an undirected
network. a) An undirected network with a clique involving three variables A, B, and
C. b) A factor graph corresponding to the same undirected model. This factor graph
has one factor over all three variables. c) Another valid factor graph for the same
undirected model. This factor graph has three factors, each over only two variables.
Note that representation, inference, and learning are all asymptotically cheaper in (c)
compared to (b), even though both require the same undirected graph to represent.

TODO: do we want to talk about dynamic bayes nets in this book?

9.3 Advantages of structured modeling

Drawing a sample x from the probability distribution p(x) deﬁned by a structured

model is an important operation. The following techniques are described by Koller and
Friedman (2009).

Sampling from an energy-based model is not straightforward. Suppose we have an
EBM defining a distribution p(a, b). In order to sample a, we must draw it from p(a | b),
and in order to sample b, we must draw it from p(b | a). It seems to be an intractable
chicken-and-egg problem. Directed models avoid this because their graph G is directed
and acyclic. In ancestral sampling, one simply samples each of the variables in topological
order, conditioning on each variable's parents, which are guaranteed to have already
been sampled. This defines an efficient, single-pass method of obtaining a sample.
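The single-pass procedure can be sketched in a few lines of Python. The three-variable chain A → B → C, its conditional probability tables, and the function name `ancestral_sample` are hypothetical choices made for illustration; only the standard library is used.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical directed model A -> B -> C over binary variables.
# parents[x] lists the parents of x; cpt[x] maps a tuple of parent values
# to P(x = 1 | parents).
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.1, (1,): 0.7},
}

def ancestral_sample(parents, cpt, rng):
    """Draw one joint sample by visiting the variables in topological order."""
    # TopologicalSorter expects node -> predecessors and yields predecessors
    # first, so every parent is sampled before its children.
    order = TopologicalSorter(parents).static_order()
    sample = {}
    for var in order:
        parent_values = tuple(sample[p] for p in parents[var])
        p_one = cpt[var][parent_values]
        sample[var] = 1 if rng.random() < p_one else 0
    return sample

rng = random.Random(0)
samples = [ancestral_sample(parents, cpt, rng) for _ in range(20000)]
freq_b = sum(s["B"] for s in samples) / len(samples)
# The exact marginal is P(B = 1) = 0.3 * 0.9 + 0.7 * 0.2 = 0.41,
# so freq_b should be close to that value.
print(freq_b)
```

Each sample costs one pass over the variables, and successive samples are independent of one another.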

In an EBM, it turns out that we can get around this chicken-and-egg problem by
sampling using a Markov chain. A Markov chain is defined by a state x and a transition
distribution T(x′ | x). Running the Markov chain means repeatedly updating the state
x to a value x′ sampled from T(x′ | x).

Under certain conditions, a Markov chain is eventually guaranteed to draw x from
an equilibrium distribution π(x′), defined by the condition

$$\forall x', \quad \pi(x') = \sum_{x} T(x' \mid x)\, \pi(x).$$

This condition guarantees that repeated applications of the transition sampling pro-

cedure don’t change the distribution over the state of the Markov chain. Running the

Markov chain until it reaches its equilibrium distribution is called “burning it in”, and

the initial run of samples required to do so is called burn in.
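The equilibrium condition can be checked numerically on a small discrete chain. In the sketch below, the three-state transition table `T` is a made-up example (not taken from any model in this chapter), stored so that `T[j][i]` is the probability of moving to state j from state i.

```python
# Hypothetical 3-state chain: T[j][i] = T(x' = j | x = i).
# Each column sums to 1 because it is a distribution over the next state.
T = [
    [0.90, 0.20, 0.10],
    [0.05, 0.70, 0.30],
    [0.05, 0.10, 0.60],
]

def step(T, dist):
    """Apply the transition operator once: new[j] = sum_i T[j][i] * dist[i]."""
    n = len(dist)
    return [sum(T[j][i] * dist[i] for i in range(n)) for j in range(n)]

# Start with all probability mass on the last state and run the chain.
dist = [0.0, 0.0, 1.0]
for _ in range(200):
    dist = step(T, dist)

# At equilibrium, one more application of T leaves the distribution unchanged.
after_one_more = step(T, dist)
max_change = max(abs(a - b) for a, b in zip(dist, after_one_more))
print(dist, max_change)
```

Here the distribution over states is tracked exactly, which is only feasible because the state space is tiny. In the models of interest we can only draw samples from the chain, which is why deciding when burn in is complete is difficult.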

Unfortunately, there is in general no practical way to predict how many steps the
Markov chain must run before reaching its equilibrium distribution, nor any way to tell
for sure that this event has happened. Also, even though successive samples come from
the same distribution, they are highly correlated with each other, so to obtain multiple
approximately independent samples one should run the Markov chain for several steps
between collecting each sample. Markov chains tend to get stuck in a single mode of π(x)
for many steps. The speed with which a Markov chain moves from mode to mode is called
its mixing rate. Since burning in a Markov chain and getting it to mix well may take
many sampling steps, sampling correctly from an EBM is still a somewhat costly procedure.

Of course, all of this depends on ensuring π(x) = p(x). Fortunately, this is easy
so long as p(x) is defined by an EBM. The simplest method is to use Gibbs sampling,
in which sampling from T(x′ | x) is accomplished by selecting one variable x_i and
sampling it from p conditioned on its neighbors in G. It is also possible to sample
several variables at the same time so long as they are conditionally independent given
all of their neighbors.
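As a minimal sketch of Gibbs sampling, consider a hypothetical EBM over two variables a, b ∈ {−1, +1} with energy E(a, b) = −w a b, so that for w > 0 the two variables prefer to take the same state. Under this assumed energy, each conditional is a logistic sigmoid: P(a = +1 | b) = σ(2 w b). The function names below are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gibbs_chain(w, num_steps, rng):
    """Gibbs sampling for the EBM p(a, b) ∝ exp(w * a * b), with a, b in {-1, +1}."""
    a, b = 1, 1
    states = []
    for _ in range(num_steps):
        # Resample a from p(a | b): P(a = +1 | b) = sigmoid(2 * w * b).
        a = 1 if rng.random() < sigmoid(2.0 * w * b) else -1
        # Resample b from p(b | a), which has the same form by symmetry.
        b = 1 if rng.random() < sigmoid(2.0 * w * a) else -1
        states.append((a, b))
    return states

def agreement_and_flips(states):
    """Fraction of states with a == b, and how often a changes sign between steps."""
    agree = sum(a == b for a, b in states) / len(states)
    flips = sum(s[0] != t[0] for s, t in zip(states, states[1:]))
    return agree, flips

rng = random.Random(0)
weak = agreement_and_flips(gibbs_chain(0.5, 20000, rng))
strong = agreement_and_flips(gibbs_chain(3.0, 20000, rng))
print(weak, strong)
```

With a weak coupling the chain changes state frequently. With a strong coupling the two modes (+1, +1) and (−1, −1) hold nearly all of the probability mass, and the chain switches between them only rarely, because doing so requires passing through a low-probability state one variable at a time.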

TODO: discussion of mixing example with 2 binary variables that prefer to both

have the same state IG’s graphic from lecture on adversarial nets

TODO: hammer point that graphical models convey information by leaving edges

out TODO: revisit each of the three challenges from sec:unstructured TODO: don’t

forget to teach about ancestral and gibbs sampling while showing the reduced cost of

sampling TODO: beneﬁt of separating representation from learning and inference

9.4 Learning about dependencies

Throughout most of the rest of this book we will discuss models that have two types

of variables: observed or “visible” variables v and latent or “hidden” variables h. v

corresponds to the variables actually provided in the design matrix X during training.

h consists of variables that are introduced to the model in order to help it explain the

structure in v. Generally the exact semantics of h depend on the model parameters and

are created by the learning algorithm. The motivation for this is twofold.

9.4.1 Latent variables versus structure learning

Often the diﬀerent elements of v are highly dependent on each other. A good model of

v which did not contain any latent variables would need to have very large numbers of

parents per node in a Bayesian network or very large cliques in a Markov network. Just
representing these higher order interactions is costly, both in a computational sense,
because the number of parameters that must be stored in memory scales exponentially
with the number of members in a clique, and in a statistical sense, because this
exponential number of parameters requires a wealth of data to estimate accurately.

There is also the problem of learning which variables need to be in such large cliques.

An entire ﬁeld of machine learning called structure learning is devoted to this problem

(Koller and Friedman, 2009). Most structure learning techniques involve ﬁtting a model

with a speciﬁc structure to the data, assigning it some score that rewards high training

set accuracy and penalizes model complexity, then greedily adding or subtracting an

edge from the graph in a way that is expected to increase the score.

Using latent variables instead avoids this whole problem. A ﬁxed structure over

visible and hidden variables can use direct interactions between visible and hidden units

to impose indirect interactions between visible units. Using simple parameter learning

techniques we can learn a model with a ﬁxed structure that imputes the right structure

on the marginal p(v).
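A small enumeration illustrates how a fixed structure with a hidden unit induces dependencies among visible units. The model below is a hypothetical EBM with two binary visible units and one binary hidden unit, with assumed energy E(v, h) = −w (v1 + v2) h. There is no direct v1–v2 term in the energy, yet the marginal p(v) does not factorize when w ≠ 0.

```python
import math
from itertools import product

def marginal_over_visible(w):
    """Marginal p(v1, v2) for E(v, h) = -w * (v1 + v2) * h, all variables in {0, 1}."""
    unnormalized = {}
    for v1, v2 in product((0, 1), repeat=2):
        # Sum the unnormalized probability exp(-E) over both hidden states.
        unnormalized[(v1, v2)] = sum(math.exp(w * (v1 + v2) * h) for h in (0, 1))
    Z = sum(unnormalized.values())
    return {v: u / Z for v, u in unnormalized.items()}

def dependence(w):
    """p(v1=1, v2=1) - p(v1=1) * p(v2=1); zero exactly when v1 and v2 are independent."""
    p = marginal_over_visible(w)
    p_v1 = p[(1, 0)] + p[(1, 1)]
    p_v2 = p[(0, 1)] + p[(1, 1)]
    return p[(1, 1)] - p_v1 * p_v2

print(dependence(0.0))  # no visible-hidden interaction: v1 and v2 are independent
print(dependence(2.0))  # positive: the hidden unit couples v1 and v2
```

Changing the single parameter w changes how strongly the visible units depend on each other, without adding or removing any edge in the graph.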

9.4.2 Latent variables for feature learning

Another advantage of using latent variables is that they often develop useful semantics.
For example, a mixture model learns a latent variable that corresponds to which category
of examples the input was drawn from. Other more sophisticated models with more latent
variables can create even richer descriptions of the input. Often, given some model of
v and h, it turns out that E[h | v] or argmax_h p(h, v) is a good feature mapping for v.

TODO: structure learning TODO: latent variables

9.5 The deep learning approach to structured probabilistic modeling

TODO: we tend to use densely connected graphs TODO: we tend not to do struc-

ture learning TODO: we tend to use a lot of latent variables TODO: importance of

layer structures for block gibbs TODO: scarcity of exact answers (samples instead of

marginals, follow gradient without computing objective, etc. get by with minimum that

you *need*)

9.5.1 Example: The restricted Boltzmann machine

TODO: rework this section. Add pointer to Chapter 17.1.

The restricted Boltzmann machine (RBM) (Smolensky, 1986) or harmonium is an

example of a model that TODO what do we want to exemplify here?

It is an energy-based model with binary visible and hidden units. Its energy function is

$$E(v, h) = -b^\top v - c^\top h - v^\top W h,$$

where b, c, and W are unconstrained, real-valued, learnable parameters. The model
is depicted graphically in Fig. 9.11. As this figure makes clear, an important aspect
of this model is that there are no direct interactions between any two visible units or
between any two hidden units (hence the "restricted"; a general Boltzmann machine
may have arbitrary connections).

Figure 9.11: An example RBM drawn as a Markov network

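The restrictions on the RBM structure yield the nice properties

$$p(h \mid v) = \prod_i p(h_i \mid v)$$

and

$$p(v \mid h) = \prod_i p(v_i \mid h).$$

The individual conditionals are simple to compute as well, for example

$$p(h_i = 1 \mid v) = \sigma\left(v^\top W_{:,i} + c_i\right),$$

where σ is the logistic sigmoid and c is the vector of hidden biases from the energy function.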

Together these properties allow for eﬃcient block Gibbs sampling, alternating be-

tween sampling all of h simultaneously and sampling all of v simultaneously.
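The alternation can be sketched as follows, using only the standard library. The sketch assumes the energy function given above, with b the visible biases, c the hidden biases, and `W[i][j]` coupling visible unit i to hidden unit j. The function names and the use of nested lists for W are illustrative choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(v, W, c, rng):
    """Sample every h_j independently: P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W[i][j])."""
    return [
        1 if rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0
        for j in range(len(c))
    ]

def sample_visible(h, W, b, rng):
    """Sample every v_i independently: P(v_i = 1 | h) = sigmoid(b_i + sum_j W[i][j] h_j)."""
    return [
        1 if rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
        for i in range(len(b))
    ]

def block_gibbs(W, b, c, num_steps, rng):
    """Alternate between resampling all visible units and all hidden units."""
    v = [rng.randint(0, 1) for _ in b]   # arbitrary initial visible state
    h = sample_hidden(v, W, c, rng)
    for _ in range(num_steps):
        v = sample_visible(h, W, b, rng)
        h = sample_hidden(v, W, c, rng)
    return v, h

# Tiny hypothetical RBM with 3 visible and 2 hidden units.
rng = random.Random(0)
W = [[1.5, -0.5], [1.5, -0.5], [-1.0, 2.0]]
v, h = block_gibbs(W, b=[0.0, 0.0, 0.0], c=[0.0, 0.0], num_steps=100, rng=rng)
print(v, h)
```

Because the units within a layer are conditionally independent given the other layer, each list comprehension could be replaced by a single vectorized operation; the loops here are written out only to keep the sketch free of dependencies.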

Since the energy function itself is just a linear function of the parameters, it is easy
to take the needed derivatives. For example,

$$\frac{\partial}{\partial W_{i,j}} E(v, h) = -v_i h_j.$$

These two properties, efficient Gibbs sampling and efficient derivatives, make it
possible to train the RBM with stochastic approximations to ∇_θ log Z (the gradient of
log Z with respect to the model parameters θ).
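The derivative of the energy with respect to a single weight can be verified with a finite-difference check. The sketch below assumes the energy E(v, h) = −bᵀv − cᵀh − vᵀWh with `W[i][j]` coupling visible unit i to hidden unit j; the parameter values and function names are arbitrary illustrations.

```python
def energy(v, h, W, b, c):
    """E(v, h) = -b^T v - c^T h - v^T W h."""
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = sum(cj * hj for cj, hj in zip(c, h))
    interaction = sum(
        v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction

def finite_difference(v, h, W, b, c, i, j, eps=1e-6):
    """Central-difference estimate of dE/dW[i][j]."""
    W_plus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus = [row[:] for row in W]
    W_minus[i][j] -= eps
    return (energy(v, h, W_plus, b, c) - energy(v, h, W_minus, b, c)) / (2 * eps)

# Arbitrary example values for a 3-visible, 2-hidden RBM.
v, h = [1, 0, 1], [1, 1]
W = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.7]]
b, c = [0.1, -0.1, 0.2], [0.0, 0.3]

max_error = max(
    abs(finite_difference(v, h, W, b, c, i, j) - (-v[i] * h[j]))
    for i in range(len(v))
    for j in range(len(h))
)
print(max_error)  # should be close to zero, up to floating-point rounding
```

Because the energy is linear in W, the central difference agrees with the analytic derivative −v_i h_j up to rounding error.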

9.6 Markov chain Monte Carlo methods

TODO: add section on MCMC, it needs to be developed here so both the generative

autoencoders and the advanced deep nets can refer back to it TODO: there is some

discussion of markov chains already when describing how to sample from an EBM,

determine how to present content. NOTE: there seems to be stuﬀ about MCMC in

section 7.3 already

TODO: refer to this figure in the text

TODO: refer to this ﬁgure in the text

Figure 9.12: Paths followed by Gibbs sampling for three distributions, with the Markov
chain initialized at a mode in each case. Left) A multivariate normal distribution

with two independent variables. Gibbs sampling mixes well because the variables are

independent. Center) A multivariate normal distribution with highly correlated vari-

ables. The correlation between variables makes it diﬃcult for the Markov chain to mix.

Because each variable must be updated conditioned on the other, the correlation reduces

the rate at which the Markov chain can move away from the starting point. Right) A

mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs

sampling mixes very slowly because it is diﬃcult to change modes while altering only

one variable at a time.
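The effect of correlation on mixing can be reproduced with a short simulation. The sketch below runs Gibbs sampling on a zero-mean, unit-variance bivariate normal with correlation ρ, for which each conditional is itself normal: x1 | x2 ~ N(ρ x2, 1 − ρ²). The function names and the use of lag-1 autocorrelation as a rough measure of mixing are choices made for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, num_steps, rng):
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    x1, x2 = 0.0, 0.0                      # start at the mode
    cond_std = math.sqrt(1.0 - rho * rho)  # standard deviation of each conditional
    path = []
    for _ in range(num_steps):
        x1 = rng.gauss(rho * x2, cond_std)  # sample x1 given x2
        x2 = rng.gauss(rho * x1, cond_std)  # sample x2 given x1
        path.append((x1, x2))
    return path

def lag1_autocorrelation(values):
    """Sample correlation between consecutive values of a sequence."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    cov = sum((values[t] - mean) * (values[t + 1] - mean) for t in range(n - 1)) / (n - 1)
    return cov / var

rng = random.Random(0)
independent = [p[0] for p in gibbs_bivariate_normal(0.0, 20000, rng)]
correlated = [p[0] for p in gibbs_bivariate_normal(0.99, 20000, rng)]
print(lag1_autocorrelation(independent), lag1_autocorrelation(correlated))
```

When ρ = 0, consecutive samples are essentially uncorrelated. When ρ is close to 1, each update can move only a short distance from the previous state, so consecutive samples are almost identical and many steps are needed to explore the distribution.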

Figure 9.13: An illustration of the slow mixing problem in deep probabilistic models.

Each panel should be read left to right, top to bottom. Left) Consecutive samples

from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST

dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is

performed in a deep graphical model, this similarity is based more on semantic than
on raw visual features, but it is still difficult for the Gibbs chain to transition from

one mode of the distribution to another, for example by changing the digit identity.

Right) Consecutive ancestral samples from a generative adversarial network. Because

ancestral sampling generates each sample independently from the others, there is no

mixing problem.
