Chapter 14
Structured Probabilistic Models:
A Deep Learning Perspective
Deep learning draws upon many modeling formalisms that researchers can use to guide
their design efforts and describe their algorithms. One of these formalisms is the idea of
structured probabilistic models. A structured probabilistic model is a way of describing
a probability distribution, using a graph to describe which random variables in the
probability distribution interact with each other directly. Here we use “graph” in the
graph theory sense–a set of vertices connected to one another by a set of edges. Because
the structure of the model is defined by a graph, these models are often also referred to
as graphical models.
The graphical models research community is large and has developed many different
models, training algorithms, and inference algorithms. In this chapter, we provide basic
background on some of the most central ideas of graphical models, with an emphasis on
the concepts that have proven most useful to the deep learning research community. If
you already have a strong background in graphical models, you may wish to skip most
of this chapter. However, even a graphical model expert may benefit from reading the
final section of this chapter, section 14.6, in which we highlight some of the unique ways
that graphical models are used for deep learning algorithms. Deep learning practitioners
tend to use very different model structures, learning algorithms, and inference procedures
than are commonly used by the rest of the graphical models research community. In this
chapter, we identify these differences in preferences and explain the reasons for them.
In this chapter we first describe the challenges of building large-scale probabilistic
models in section 14.1. Next, we describe how to use a graph to describe the structure
of a probability distribution in section 14.2. We then revisit the challenges we described
in section 14.1 and show how the structured approach to probabilistic modeling can
overcome these challenges in section 14.3. One of the major difficulties in graphical
modeling is understanding which variables need to be able to interact directly, i.e., which
graph structures are most suitable for a given problem. We outline two approaches to
resolving this difficulty by learning about the dependencies in section 14.4. Finally, we
close with a discussion of the unique emphasis that deep learning practitioners place on
specific approaches to graphical modeling in section 14.6.
14.1 The Challenge of Unstructured Modeling
The goal of deep learning is to scale machine learning to the kinds of challenges needed
to solve artificial intelligence. This means being able to understand high-dimensional
data with rich structure. For example, we would like AI algorithms to be able to understand natural images (images that might be captured by a camera in a reasonably ordinary environment, as opposed to synthetically rendered images, screenshots of web pages, etc.), audio waveforms representing speech, and documents containing multiple words and punctuation characters.
Classification algorithms can take such a rich high-dimensional input and summarize it with a categorical label: what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts
of the input. For example, when recognizing an object in a photo, it is usually possible
to ignore the background of the photo.
It is possible to ask probabilistic models to do many other tasks. These tasks are
often more expensive than classification. Some of them require producing multiple
output values. Most require a complete understanding of the entire structure of the
input, with no option to ignore sections of it. These tasks include:
• Density estimation: given an input x, the machine learning system returns an estimate of p(x). This requires only a single output, but it does require a complete understanding of the entire input. If even one element of the vector is unusual, the system must assign it a low probability.
• Denoising: given a damaged or incorrectly observed input x̃, the machine learning system returns an estimate of the original or correct x. For example, the machine learning system might be asked to remove dust or scratches from an old photograph. This requires multiple outputs (every element of the estimated clean example x) and an understanding of the entire input (since even one uncorrected damaged area would still reveal the estimate as damaged).
• Missing value imputation: given observations of some elements of x, the model is asked to return estimates of, or a probability distribution over, some or all of the unobserved elements of x. This requires multiple outputs, and because the model could be asked to restore any of the elements of x, it must understand the entire input.
• Sampling: the model generates new samples from the distribution p(x). Applications include speech synthesis, i.e., producing new waveforms that sound like natural human speech. This requires multiple output values and a good model of the entire input. If the samples have even one element drawn from the wrong distribution, then the sampling process is wrong.
For an example of the sampling task on small natural images, see Fig. 14.1.
Modeling a rich distribution over thousands or millions of random variables is a
challenging task, both computationally and statistically. Suppose we only wanted to model binary variables. This is the simplest possible case, and yet already it seems overwhelming. For a small 32 × 32 pixel color (RGB) image, there are 2^3072 possible binary images of this form. This number is over 10^800 times larger than the estimated number of atoms in the universe.
In general, if we wish to model a distribution over a random vector x containing n discrete variables capable of taking on k values each, then the naive approach of representing P(x) by storing a lookup table with one probability value per possible outcome requires k^n parameters!
This is not feasible for several reasons:
• Memory (the cost of storing the representation): For all but very small values of n and k, representing the distribution as a table will require too many values to store.
• Statistical efficiency: As the number of parameters in a model increases, so does the number of training examples needed to choose the values of those parameters using a statistical estimator. Because the table-based model has an astronomical number of parameters, it will require an astronomically large training set to fit accurately. Any such model will overfit the training set very badly.
• Runtime (the cost of inference): Suppose we want to perform an inference task where we use our model of the joint distribution P(x) to compute some other distribution, such as the marginal distribution P(x_1) or the conditional distribution P(x_2 | x_1). Computing these distributions will require summing across the entire table, so the runtime of these operations is as high as the intractable memory cost of storing the model.
• Runtime (the cost of sampling): Likewise, suppose we want to draw a sample from the model. The naive way to do this is to sample some value u ∼ U(0, 1), then iterate through the table adding up the probability values until they exceed u, and return the outcome whose probability value was added last. This requires reading through the whole table in the worst case, so it has the same exponential cost as the other operations.
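To make the runtime concrete, here is a minimal Python sketch of this naive sampler; the table, with one entry per joint outcome, and the linear scan are exactly the unstructured approach described above, and the variable sizes and probabilities are placeholders:

```python
import itertools
import random

k, n = 2, 4  # k values per variable, n variables: the table has k**n entries

# A hypothetical lookup table: one probability value per possible outcome.
outcomes = list(itertools.product(range(k), repeat=n))
weights = [random.random() for _ in outcomes]
total = sum(weights)
table = {x: w / total for x, w in zip(outcomes, weights)}

def naive_sample(table):
    """Scan the whole table until the cumulative probability exceeds u.

    Worst-case cost is O(k**n), the same exponential cost as storage."""
    u = random.uniform(0.0, 1.0)
    cumulative = 0.0
    for outcome, p in table.items():
        cumulative += p
        if cumulative >= u:
            return outcome
    return outcome  # guard against floating-point round-off

print(naive_sample(table))
```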
The problem with the table-based approach is that we are explicitly modeling every
possible kind of interaction between every possible subset of variables. The probability
distributions we encounter in real tasks are much simpler than this. Usually, most
variables influence each other only indirectly.
[Figure: a grid of CIFAR-10 training images (top) and a grid of model samples (bottom); images omitted.]
Figure 14.1: Probabilistic modeling of natural images. Top: Example 32×32 pixel color images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). Bottom: Samples drawn from a structured probabilistic model trained on this dataset. Each sample appears at the same position in the grid as the training example that is closest to it in Euclidean space. This comparison allows us to see that the model is truly synthesizing new images, rather than memorizing the training data. The contrast of both sets of images has been adjusted for display. Figure reproduced with permission from Courville et al. (2011).

For example, consider modeling the finishing times of a team in a relay race. Suppose the team consists of three runners, Alice, Bob, and Carol. At the start of the race, Alice
carries a baton and begins running around a track. After completing her lap around
the track, she hands the baton to Bob. Bob then runs his own lap and hands the
baton to Carol, who runs the final lap. We can model each of their finishing times as
a continuous random variable. Alice’s finishing time does not depend on anyone else’s,
since she goes first. Bob’s finishing time depends on Alice’s, because Bob does not
have the opportunity to start his lap until Alice has completed hers. If Alice finishes
faster, Bob will finish faster, all else being equal. Finally, Carol’s finishing time depends
on both her teammates. If Alice is slow, Bob will probably finish late too, and Carol
will have quite a late starting time and thus is likely to have a late finishing time as
well. However, Carol’s finishing time depends only indirectly on Alice’s finishing time
via Bob’s. If we already know Bob’s finishing time, we won’t be able to estimate Carol’s
finishing time better by finding out what Alice’s finishing time was. This means we can
model the relay race using only two interactions: Alice’s effect on Bob, and Bob’s effect
on Carol. We can omit the third, indirect interaction between Alice and Carol from our
model.
Structured probabilistic models provide a formal framework for modeling only direct
interactions between random variables. This allows the models to have significantly fewer parameters, which can in turn be estimated reliably from less data. These smaller
models also have dramatically reduced computation cost in terms of storing the model,
performing inference in the model, and drawing samples from the model.
14.2 Using Graphs to Describe Model Structure
Structured probabilistic models use graphs (in the graph theory sense of “nodes” or
“vertices” connected by edges) to represent interactions between random variables. Each
node represents a random variable. Each edge represents a direct interaction. These
direct interactions imply other, indirect interactions, but only the direct interactions
need to be explicitly modeled.
There is more than one way to describe the interactions in a probability distribution
using a graph. In the following sections we describe some of the most popular and useful
approaches.
14.2.1 Directed Models
One kind of structured probabilistic model is the directed graphical model, otherwise known as the belief network or Bayesian network (Pearl, 1985). (Judea Pearl suggested using the term Bayes network when one wishes to “emphasize the judgmental” nature of the values computed by the network, i.e., to highlight that they usually represent degrees of belief rather than frequencies of events.)

Directed graphical models are called “directed” because their edges are directed, that is, they point from one vertex to another. This direction is represented in the drawing with an arrow. The direction of the arrow indicates which variable’s probability distribution is defined in terms of the other’s. Drawing an arrow from a to b means that we define the probability distribution over b via a conditional distribution, with a as one of the variables on the right side of the conditioning bar. In other words, the distribution over b depends on the value of a.
[Figure: a three-node chain with directed edges t_0 → t_1 → t_2, labeled Alice, Bob, and Carol.]
Figure 14.2: A directed graphical model depicting the relay race example. Alice’s finishing time t_0 influences Bob’s finishing time t_1, because Bob does not get to start running until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so Bob’s finishing time t_1 influences Carol’s finishing time t_2.
Let’s continue with the relay race example from Section 14.1. Suppose we name Alice’s finishing time t_0, Bob’s finishing time t_1, and Carol’s finishing time t_2. As we saw earlier, our estimate of t_1 depends on t_0. Our estimate of t_2 depends directly on t_1, but only indirectly on t_0. We can draw this relationship in a directed graphical model, illustrated in Fig. 14.2.
Formally, a directed graphical model defined on variables x is defined by a directed acyclic graph G whose vertices are the random variables in the model, and a set of local conditional probability distributions p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents of x_i in G. The probability distribution over x is given by

p(x) = ∏_i p(x_i | Pa_G(x_i)).
In our relay race example, this means that, using the graph drawn in Fig. 14.2,
p(t_0, t_1, t_2) = p(t_0) p(t_1 | t_0) p(t_2 | t_1).
This is our first time seeing a structured probabilistic model in action. We can
examine the cost of using it, in order to observe how structured modeling has many
advantages relative to unstructured modeling.
Suppose we represent time by discretizing the range from minute 0 to minute 10 into 6-second chunks. This would make t_0, t_1, and t_2 each a discrete variable with 100 possible values. If we attempted to represent p(t_0, t_1, t_2) with a table, it would need to store 999,999 values (100 values of t_0 × 100 values of t_1 × 100 values of t_2, minus 1, since the probability of one of the configurations is made redundant by the constraint that the probabilities sum to 1). If instead we make a table only for each of the conditional probability distributions, then the distribution over t_0 requires 99 values, the table defining t_1 given t_0 requires 9,900 values, and so does the table defining t_2 given t_1. This comes to a total of 19,899 values. This means that using the directed graphical model reduced our number of parameters by a factor of more than 50!
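The arithmetic above is easy to verify in code. Here is a sketch with randomly filled placeholder tables (only the table shapes matter, not the particular probabilities):

```python
import numpy as np

k = 100  # number of 6-second bins per finishing time
rng = np.random.default_rng(0)

def normalize(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

p_t0 = normalize(rng.random(k), axis=0)                # k - 1 = 99 free values
p_t1_given_t0 = normalize(rng.random((k, k)), axis=1)  # k(k - 1) = 9,900 free values
p_t2_given_t1 = normalize(rng.random((k, k)), axis=1)  # another 9,900 free values

def joint(t0, t1, t2):
    """p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1)."""
    return p_t0[t0] * p_t1_given_t0[t0, t1] * p_t2_given_t1[t1, t2]

factorized = (k - 1) + 2 * k * (k - 1)  # 19,899 free parameters
full_table = k**3 - 1                   # 999,999 free parameters
print(factorized, full_table, full_table / factorized)  # ratio is just over 50
print(joint(10, 25, 40))
```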
In general, to model n discrete variables each having k values, the cost of the single table approach scales like O(k^n), as we’ve observed before. Now suppose we build a directed graphical model over these variables. If m is the maximum number of variables appearing (on either side of the conditioning bar) in a single conditional probability distribution, then the cost of the tables for the directed model scales like O(k^m). As long as we can design a model such that m ≪ n, we get very dramatic savings.
In other words, so long as each variable has few parents in the graph, the distri-
bution can be represented with very few parameters. Some restrictions on the graph
structure (e.g. it is a tree) can also guarantee that operations like computing marginal
or conditional distributions over subsets of variables are efficient.
It’s important to realize what kinds of information can be encoded in the graph, and what kinds can’t be. The graph just encodes simplifying assumptions about which variables are conditionally independent from each other. It’s also possible to make other kinds of simplifying assumptions. For example, suppose we assume that Bob’s own running time is the same regardless of how Alice performed. (In reality, Alice’s performance probably influences Bob’s performance: depending on Bob’s personality, if Alice runs especially fast in a given race, this might encourage Bob to push hard and match her exceptional performance, or it might make him overconfident and lazy.) Then the only effect Alice has on Bob’s finishing time is that we must add Alice’s finishing time to the total amount of time we think Bob needs to run. This observation allows us to define a model with O(k) parameters instead of O(k²). However, note that t_0 and t_1 are still directly dependent with this assumption, because t_1 represents the absolute time at which Bob finishes, not the total time he himself spends running. This means our graph must still contain an arrow from t_0 to t_1. The assumption that Bob’s personal running time is independent of all other factors cannot be encoded in a graph over t_0, t_1, and t_2. Instead, we encode this information in the definition of the conditional distribution itself. The conditional distribution is no longer a k × (k − 1)-element table indexed by t_0 and t_1, but is now a slightly more complicated formula using only k − 1 parameters. The directed graphical model syntax does not place any constraint on how we define our conditional distributions. It only defines which variables they are allowed to take in as arguments.
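As a sketch of this last idea, the shifted-time assumption can be written as a conditional distribution built from a single length-k table d over Bob’s personal lap time, leaving only k − 1 free parameters. The distribution d below is an invented placeholder, and the sketch ignores truncation at the end of the time range:

```python
import numpy as np

k = 100
rng = np.random.default_rng(1)

# One table over Bob's *personal* lap time: k - 1 free parameters in all.
d = rng.random(k)
d /= d.sum()

def p_t1_given_t0(t1, t0):
    """Bob's finishing time is Alice's finishing time plus his own lap time.

    Not a free-form k-by-k table, yet t_0 and t_1 remain directly dependent,
    so the graph still needs the edge t_0 -> t_1."""
    lap = t1 - t0
    return d[lap] if 0 <= lap < k else 0.0

print(p_t1_given_t0(30, 12))  # probability that Bob's lap took 18 bins
```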
14.2.2 Undirected Models
Directed graphical models give us one language for describing structured probabilistic
models. Another popular language is that of undirected models, otherwise known as
Markov random fields (MRFs) or Markov networks (Kindermann, 1980). As their
name implies, undirected models use graphs whose edges are undirected.
Directed models are most naturally applicable to situations where there is a clear
reason to draw each arrow in one particular direction. Often these are situations where
we understand the causality, and the causality only flows in one direction. One such
situation is the relay race example. Earlier runners affect the finishing times of later
runners; later runners do not affect the finishing times of earlier runners.
Not all situations we might want to model have such a clear direction to their in-
teractions. When the interactions seem to have no intrinsic direction, or to operate in
both directions, it may be more appropriate to use an undirected model.
As an example of such a situation, suppose we want to model a distribution over three binary variables: whether or not you are sick, whether or not your coworker is sick, and whether or not your roommate is sick.

[Figure: an undirected chain h_r — h_y — h_c.]
Figure 14.3: An undirected graph representing how your roommate’s health h_r, your health h_y, and your work colleague’s health h_c affect each other. You and your roommate might infect each other with a cold, and you and your work colleague might do the same, but assuming that your roommate and your colleague don’t know each other, they can only infect each other indirectly via you.

As in the relay race example, we can make
simplifying assumptions about the kinds of interactions that take place. Assuming that
your coworker and your roommate do not know each other, it is very unlikely that one
of them will give the other a disease such as a cold directly. This event can be seen as
so rare that it is acceptable not to model it. However, it is reasonably likely that either
of them could give you a cold, and that you could pass it on to the other. We can model
the indirect transmission of a cold from your coworker to your roommate by modeling
the transmission of the cold from your coworker to you and the transmission of the cold
from you to your roommate.
In this case, it’s just as easy for you to cause your roommate to get sick as it is for your
roommate to make you sick, so there is not a clean, uni-directional narrative on which to
base the model. This motivates using an undirected model. As with directed models, if
two nodes in an undirected model are connected by an edge, then the random variables
corresponding to those nodes interact with each other directly. Unlike directed models,
the edge in an undirected model has no arrow, and is not associated with a conditional
probability distribution.
Let’s call the random variable representing your health h_y, the random variable representing your roommate’s health h_r, and the random variable representing your colleague’s health h_c. See Fig. 14.3 for a drawing of the graph representing this scenario.
Formally, an undirected graphical model is a structured probabilistic model defined on an undirected graph G. For each clique C in the graph (a clique is a subset of nodes that are all connected to one another by an edge of the graph), a factor φ(C) (also called a clique potential) measures the affinity of the variables in that clique for being in each of their possible joint states. The factors are constrained to be non-negative. Together they define an unnormalized probability distribution

p̃(x) = ∏_{C∈G} φ(C).
The unnormalized probability distribution is efficient to work with so long as all the
cliques are small. It encodes the idea that states with higher affinity are more likely.
However, unlike in a Bayesian network, there is little structure to the definition of the
cliques, so there is nothing to guarantee that multiplying them together will yield a
valid probability distribution. See Fig. 14.4 for an example of reading factorization
information from an undirected graph.
Our example of the cold spreading between you, your roommate, and your colleague contains two cliques. One clique contains h_y and h_c. The factor for this clique can be defined by a table, and might have values resembling these:

            h_y = 0    h_y = 1
  h_c = 0      2          1
  h_c = 1      1         10
A state of 1 indicates good health, while a state of 0 indicates poor health (having
been infected with a cold). Both of you are usually healthy, so the corresponding state
has the highest affinity. The state where only one of you is sick has the lowest affinity,
because this is a rare state. The state where both of you are sick (because one of you
has infected the other) is a higher affinity state, though still not as common as the state
where both are healthy.
To complete the model, we would need to also define a similar factor for the clique
containing h_y and h_r.
14.2.3 The Partition Function
While the unnormalized probability distribution is guaranteed to be non-negative everywhere, it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution, we must use the corresponding normalized probability distribution (a distribution defined by normalizing a product of clique potentials is also called a Gibbs distribution):

p(x) = (1/Z) p̃(x),

where Z is the value that results in the probability distribution summing or integrating to 1:

Z = ∫ p̃(x) dx.
You can think of Z as a constant when the φ functions are held constant. Note that if
the φ functions have parameters, then Z is a function of those parameters. It is common
in the literature to write Z with its arguments omitted to save space. Z is known as the
partition function, a term borrowed from statistical physics.
Since Z is an integral or sum over all possible joint assignments of the state x, it is
often intractable to compute. In order to be able to obtain the normalized probability
distribution of an undirected model, the model structure and the definitions of the φ
functions must be conducive to computing Z efficiently. In the context of deep learning,
Z is usually intractable, and we must resort to approximations. Such approximate
algorithms are the topic of Chap. 19.
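For a model as small as the cold example, Z can be computed by brute-force enumeration. The sketch below reuses the φ table given above for the (h_y, h_c) clique; reusing the same numbers for the (h_y, h_r) clique is an assumption made only to keep the example short:

```python
import itertools

# Clique potentials, keyed by (h_y, h_c) and (h_y, h_r) respectively.
# The first table is copied from the text; the second is an assumed copy.
phi_yc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}
phi_yr = dict(phi_yc)

def p_tilde(h_y, h_r, h_c):
    return phi_yc[(h_y, h_c)] * phi_yr[(h_y, h_r)]

states = list(itertools.product([0, 1], repeat=3))
Z = sum(p_tilde(*s) for s in states)  # sum over all 2**3 joint states

def p(h_y, h_r, h_c):
    return p_tilde(h_y, h_r, h_c) / Z

print(Z)           # 130.0 for these tables
print(p(1, 1, 1))  # the all-healthy state is the most probable
```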
[Figure: an undirected graph with edges A–B, B–C, A–D, B–E, and E–F.]
Figure 14.4: This graph implies that p(A, B, C, D, E, F) can be written as (1/Z) φ_{A,B}(A, B) φ_{B,C}(B, C) φ_{A,D}(A, D) φ_{B,E}(B, E) φ_{E,F}(E, F) for an appropriate choice of the φ functions.

One important consideration to keep in mind when designing undirected models is that it is possible for Z not to exist. This happens if some of the variables in the model are continuous and the integral of p̃ over their domain diverges. For example, suppose we want to model a single scalar variable x ∈ ℝ with a single clique potential φ(x) = x². In this case,

Z = ∫ x² dx.
Since this integral diverges, there is no probability distribution corresponding to this
choice of φ(x). Sometimes the choice of some parameter of the φ functions determines
whether the probability distribution is defined. For example, for φ(x; β) = exp(−βx²), the β parameter determines whether Z exists. Positive β results in a Gaussian distribution over x, but all other values of β make φ impossible to normalize.
One key difference between directed modeling and undirected modeling is that di-
rected models are defined directly in terms of probability distributions from the start,
while undirected models are defined more loosely by φ functions that are then converted
into probability distributions. This changes the intuitions one must develop in order to
work with these models. One key idea to keep in mind while working with undirected models is that the domain of each of the variables has a dramatic effect on the kind of probability distribution that a given set of φ functions corresponds to. For example, consider an n-dimensional vector-valued random variable x and an undirected model parameterized by a vector of biases b. Suppose we have one clique for each element of x, φ_i(x_i) = exp(b_i x_i). What kind of probability distribution does this result in? The answer is that we don’t have enough information, because we have not yet specified the domain of x. If x ∈ ℝ^n, then the integral defining Z diverges, and no probability distribution exists. If x ∈ {0, 1}^n, then p(x) factorizes into n independent distributions, with p(x_i = 1) = sigmoid(b_i). If the domain of x is the set of elementary basis vectors ({[1, 0, …, 0], [0, 1, …, 0], …, [0, 0, …, 1]}), then p(x) = softmax(b), so a large value of b_i actually reduces p(x_j = 1) for j ≠ i. Often, it is possible to leverage the effect of a carefully chosen domain of a variable in order to obtain complicated behavior from a relatively simple set of φ functions. We’ll explore a practical application of this idea later, in Sec. 21.6.
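The two discrete cases can be checked numerically. Here is a sketch, with an arbitrary bias vector b chosen only for illustration:

```python
import itertools
import numpy as np

b = np.array([1.0, -0.5, 2.0])  # arbitrary biases for illustration
n = len(b)

def p_tilde(x):
    # Product of the per-element cliques: prod_i exp(b_i x_i) = exp(b . x).
    return np.exp(b @ np.asarray(x, dtype=float))

def normalize_over(domain):
    w = np.array([p_tilde(x) for x in domain])
    return w / w.sum()

# Domain 1: x in {0,1}^n gives n independent Bernoulli(sigmoid(b_i)) variables.
cube = list(itertools.product([0, 1], repeat=n))
p_cube = normalize_over(cube)
marginal = sum(p for x, p in zip(cube, p_cube) if x[0] == 1)
print(marginal, 1.0 / (1.0 + np.exp(-b[0])))  # both equal sigmoid(b_0)

# Domain 2: x restricted to one-hot vectors gives p(x) = softmax(b).
one_hot = [tuple(row) for row in np.eye(n, dtype=int)]
print(normalize_over(one_hot), np.exp(b) / np.exp(b).sum())  # both softmax(b)
```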
[Figure: an undirected graph with edges a–b, b–c, a–d, b–e, and e–f, the same structure as in Fig. 14.4.]
Figure 14.5: This graph implies that E(a, b, c, d, e, f) can be written as E_{a,b}(a, b) + E_{b,c}(b, c) + E_{a,d}(a, d) + E_{b,e}(b, e) + E_{e,f}(e, f) for an appropriate choice of the per-clique energy functions. Note that we can obtain the φ functions in Fig. 14.4 by setting each φ to the exp of the corresponding negative energy, e.g., φ_{a,b}(a, b) = exp(−E_{a,b}(a, b)).
14.2.4 Energy-Based Models
Many interesting theoretical results about undirected models depend on the assumption that ∀x, p̃(x) > 0. A convenient way to enforce this is to use an energy-based model (EBM), where

p̃(x) = exp(−E(x))     (14.1)

and E(x) is known as the energy function. Because exp(z) is positive for all z, this guarantees that no energy function will result in a probability of zero for any state x. Being completely free to choose the energy function makes learning simpler. If we learned the clique potentials directly, we would need to use constrained optimization, and we would need to arbitrarily impose some specific minimal probability value. By learning the energy function, we can use unconstrained optimization (though for some models we may still need constrained optimization to make sure Z exists), and the probabilities in the model can approach arbitrarily close to zero but never reach it.
Any distribution of the form given by equation 14.1 is an example of a Boltzmann distribution. For this reason, many energy-based models are called Boltzmann machines. There is no accepted guideline for when to call a model an energy-based model and when to call it a Boltzmann machine. The term Boltzmann machine was first introduced to describe a model with exclusively binary variables, but today many models, such as the mean-covariance restricted Boltzmann machine, incorporate real-valued variables as well.
Cliques in an undirected graph correspond to factors of the unnormalized probabil-
ity function. Because exp(a) exp(b) = exp(a + b), this means that different cliques in
the undirected graph correspond to the different terms of the energy function. In other
words, an energy-based model is just a special kind of Markov network: the exponentia-
tion makes each term in the energy function correspond to a factor for a different clique.
See Fig. 14.5 for an example of how to read the form of the energy function from an
undirected graph structure.
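As a quick numerical check of this correspondence, here is a sketch with two invented per-clique energy terms over binary variables:

```python
import numpy as np

# Two invented per-clique energy terms over binary variables a, b, c.
def E_ab(a, b): return -1.5 * a * b
def E_bc(b, c): return 0.7 * b * c

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            # exp(-(E1 + E2)) factorizes into exp(-E1) * exp(-E2):
            # one factor per clique, as the text describes.
            from_energy = np.exp(-(E_ab(a, b) + E_bc(b, c)))
            from_factors = np.exp(-E_ab(a, b)) * np.exp(-E_bc(b, c))
            assert np.isclose(from_energy, from_factors)
```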
One part of the definition of an energy-based model serves no functional purpose from a machine learning point of view: the negative sign in Eq. 14.1. This sign could be incorporated into the definition of E, or for many functions E the learning algorithm could simply learn parameters with the opposite sign.
[Figure: two drawings of the path a — s — b; in (b) the node s is shaded to indicate that it is observed.]
Figure 14.6: a) The path between random variable a and random variable b through s is active, because s is not observed. This means that a and b are not separated. b) Here s is shaded in, to indicate that it is observed. Because the only path between a and b is through s, and that path is inactive, we can conclude that a and b are separated given s.
The sign is present primarily to preserve
compatibility between the machine learning literature and the physics literature. Many
advances in probabilistic modeling were originally developed by statistical physicists, for
whom E refers to actual, physical energy and does not have arbitrary sign. Terminology
such as “energy” and “partition function” remains associated with these techniques, even
though their mathematical applicability is broader than the physics context in which
they were developed. Some machine learning researchers (e.g., Smolensky (1986), who
referred to negative energy as harmony) have chosen to omit the negation, but this is
not the standard convention.
14.2.5 Separation and D-Separation
The edges in a graphical model tell us which variables directly interact. We often need
to know which variables indirectly interact. Some of these indirect interactions can
be enabled or disabled by observing other variables. More formally, we would like to
know which subsets of variables are conditionally independent from each other, given
the values of other subsets of variables.
Identifying the conditional independences in a graph is very simple in the case of
undirected models. In this case, conditional independence implied by the graph is called
separation. We say that a set of variables A is separated from another set of variables B
given a third set of variables S if the graph structure implies that A is independent from
B given S. If two variables a and b are connected by a path involving only unobserved
variables, then those variables are not separated. If no path exists between them, or
all paths contain an observed variable, then they are separated. We refer to paths
involving only unobserved variables as “active” and paths including an observed variable
as “inactive.”
When we draw a graph, we can indicate observed variables by shading them in. See Fig. 14.6 for a depiction of how active and inactive paths in an undirected graph look when drawn in this way. See Fig. 14.7 for an example of reading separation from an undirected graph.
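Separation is also easy to test mechanically: since a path is active only if it avoids observed variables, a graph search that refuses to enter observed nodes reaches exactly the variables that are not separated from the start set. A sketch (the edge-list encoding, and the edge set, which is chosen to be consistent with the description of Fig. 14.7, are assumptions of this example):

```python
from collections import deque

def separated(edges, A, B, S):
    """True if A and B are separated given S in an undirected graph,
    i.e., every path between them passes through an observed node in S."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    frontier = deque(a for a in A if a not in S)
    reached = set(frontier)
    while frontier:
        u = frontier.popleft()
        for v in neighbors.get(u, ()):
            if v not in reached and v not in S:  # observed nodes block paths
                reached.add(v)
                frontier.append(v)
    return reached.isdisjoint(B)

edges = [("a", "b"), ("b", "c"), ("a", "d"), ("b", "d")]
print(separated(edges, {"a"}, {"c"}, {"b"}))  # True: b blocks the only path
print(separated(edges, {"a"}, {"d"}, {"b"}))  # False: a direct edge stays active
```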
[Figure: an undirected graph over a, b, c, and d in which b lies on every path from a to c, and a and d are connected both directly and through b.]
Figure 14.7: An example of reading separation properties from an undirected graph. Here b is shaded to indicate that it is observed. Because observing b blocks the only path from a to c, we say that a and c are separated from each other given b. The observation of b also blocks one path between a and d, but there is a second, active path between them. Therefore, a and d are not separated given b.

Similar concepts apply to directed models, except that in the context of directed models, these concepts are referred to as d-separation. The “d” stands for “dependence.” D-separation for directed graphs is defined analogously to separation for undirected graphs: we say that a set of variables A is d-separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S.
As with undirected models, we can examine the independences implied by the graph
by looking at what active paths exist in the graph. As before, two variables are depen-
dent if there is an active path between them, and d-separated if no such path exists. In
directed nets, determining whether a path is active is somewhat more complicated. See
Fig. 14.8 for a guide to identifying active paths in a directed model. See Fig. 14.9 for
an example of reading some properties from a graph.
It is important to remember that separation and d-separation tell us only about
those conditional independences that are implied by the graph. There is no requirement
that the graph imply all independences that are present. In particular, it is always le-
gitimate to use the complete graph (the graph with all possible edges) to represent any
distribution. In fact, some distributions contain independences that are not possible to
represent with existing graphical notation. Context-specific independences are indepen-
dences that are present dependent on the value of some variables in the network. For
example, consider a model of three binary variables, a, b, and c. Suppose that when a is
0, b and c are independent, but when a is 1, b is deterministically equal to c. Encoding
the behavior when a = 1 requires an edge connecting b and c. The graph then fails to
indicate that b and c are independent when a = 0.
In general, a graph will never imply that an independence exists when it does not.
However, a graph may fail to encode an independence.
14.2.6 Converting Between Undirected and Directed Graphs
In common parlance, we often refer to certain model classes as being undirected or
directed. For example, we typically refer to RBMs as undirected and sparse coding as
directed. This way of speaking can be somewhat misleading, because no probabilistic model is inherently directed or undirected. Instead, some models are most easily described using a directed graph, while others are most easily described using an undirected graph.
[Figure: four small directed graphs over a, s, and b, one per panel; in panel (d), s has a child c.]
Figure 14.8: All of the kinds of active paths of length two that can exist between random variables a and b. a) Any path with arrows proceeding directly from a to b or vice versa. This kind of path becomes blocked if s is observed. We have already seen this kind of path in the relay race example. b) a and b are connected by a common cause s. For example, suppose s is a variable indicating whether or not there is a hurricane, and a and b measure the wind speed at two different nearby weather monitoring outposts. If we observe very high winds at station a, we might expect to also see high winds at b. This kind of path can be blocked by observing s. If we already know there is a hurricane, we expect to see high winds at b, regardless of what is observed at a; a lower than expected wind at a (for a hurricane) would not change our expectation of winds at b (knowing there is a hurricane). However, if s is not observed, then a and b are dependent, i.e., the path is active. c) a and b are both parents of s. This is called a V-structure or the collider case, and it causes a and b to be related by the explaining away effect. In this case, the path is actually active when s is observed. For example, suppose s is a variable indicating that your colleague is not at work. The variable a represents her being sick, while b represents her being on vacation. If you observe that she is not at work, you can presume she is probably sick or on vacation, but it’s not especially likely that both have happened at the same time. If you find out that she is on vacation, this fact is sufficient to explain her absence, and you can infer that she is probably not also sick. d) The explaining away effect happens even if any descendant of s is observed! For example, suppose that c is a variable representing whether you have received a report from your colleague. If you notice that you have not received the report, this increases your estimate of the probability that she is not at work today, which in turn makes it more likely that she is either sick or on vacation. The only way to block a path through a V-structure is to observe neither the shared child nor any of its descendants.
[Figure: a directed graph in which a and b are both parents of c, and c is the parent of d and e.]
Figure 14.9: From this graph, we can read out several d-separation properties. Examples include:
• a and b are d-separated given the empty set.
• a and e are d-separated given c.
• d and e are d-separated given c.
We can also see that some variables are no longer d-separated when we observe some variables:
• a and b are not d-separated given c.
• a and b are not d-separated given d.
Every probability distribution can be represented by either a directed model or by an undirected model. In the worst case, one can always represent any distribution by using a “complete graph.” In the case of a directed model, the complete graph is any directed acyclic graph where we impose some ordering on the random variables, and each variable has all the variables that precede it in the ordering as its ancestors in the graph. For an undirected model, the complete graph is simply a graph containing a single clique encompassing all of the variables.

Of course, the utility of a graphical model is that the graph implies that some variables do not interact directly. The complete graph is not very useful, because it does not imply any independences.
When we represent a probability distribution with a graph, we want to choose a graph
that implies as many independences as possible, without implying any independences
that do not actually exist.
From this point of view, some distributions can be represented more efficiently us-
ing directed models, while other distributions can be represented more efficiently using
undirected models. In other words, directed models can encode some independences
that undirected models cannot encode, and vice versa.
Directed models are able to use one specific kind of substructure that undirected
models cannot represent perfectly. This substructure is called an immorality. The
structure occurs when two random variables a and b are both parents of a third random
variable c, and there is no edge directly connecting a and b in either direction. (The
name “immorality” may seem strange; it was coined in the graphical models literature
as a joke about unmarried parents.) To convert a directed model with graph D into an
undirected model, we need to create a new graph U. For every pair of variables x and
y, we add an undirected edge connecting x and y to U if there is a directed edge (in
either direction) connecting x and y in D or if x and y are both parents in D of a third
variable z. The resulting U is known as a moralized graph. See Fig. 14.10 for examples
of converting directed models to undirected models via moralization.
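Moralization is mechanical, so it is easy to state as code. A sketch, assuming the directed graph is encoded as a mapping from each variable to the set of its parents:

```python
from itertools import combinations

def moralize(parents):
    """Return the edge set of the moralized undirected graph for a DAG
    given as {node: set of parents}."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))  # keep each edge, drop direction
        for p, q in combinations(sorted(pa), 2):
            edges.add(frozenset((p, q)))      # "marry" every pair of co-parents
    return edges

# The immorality at the center of Fig. 14.10: a -> c <- b, with no a-b edge.
print(moralize({"a": set(), "b": set(), "c": {"a", "b"}}))
# The result is the clique over all three variables: {a-c, b-c, a-b}.
```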
Likewise, undirected models can include substructures that no directed model can
represent perfectly. Specifically, a directed graph D cannot capture all of the conditional
independences implied by an undirected graph U if U contains a loop of length greater
than three, unless that loop also contains a chord. A loop is a sequence of variables
connected by undirected edges, with the last variable in the sequence connected back
to the first variable in the sequence. A chord is a connection between any two non-
consecutive variables in this sequence. If U has loops of length four or greater and does
not have chords for these loops, we must add the chords before we can convert it to
a directed model. Adding these chords discards some of the independence information
that was encoded in U. The graph formed by adding chords to U is known as a chordal
or triangulated graph, because all the loops can now be described in terms of smaller,
triangular loops. To build a directed graph D from the chordal graph, we need to also
assign directions to the edges. When doing so, we must not create a directed cycle in
D, or the result does not define a valid directed probabilistic model. One way to assign
directions to the edges in D is to impose an ordering on the random variables, then
point each edge from the node that comes earlier in the ordering to the node that comes later in the ordering.
[Figure: three directed graphs (a chain over a, b, and c; an immorality with a and b both parents of c; and a sparse-coding-style graph with latent variables h_1, h_2, h_3 as parents of visible variables v_1, v_2, v_3) shown with their moralized undirected counterparts.]
Figure 14.10: Examples of converting directed models to undirected models by constructing moralized graphs. Left) This simple chain can be converted to a moralized graph merely by replacing its directed edges with undirected edges. The resulting undirected model implies exactly the same set of independences and conditional independences. Center) This graph is the simplest directed model that cannot be converted to an undirected model without losing some independences. This graph consists entirely of a single immorality. Because a and b are parents of c, they are connected by an active path when c is observed. To capture this dependence, the undirected model must include a clique encompassing all three variables. This clique fails to encode the fact that a and b are marginally independent. Right) In general, moralization may add many edges to the graph, thus losing many implied independences. For example, this sparse coding graph requires adding moralizing edges between every pair of latent variables, thus introducing a quadratic number of new direct dependences.
In many cases, we may want to convert an undirected model to a directed model,
or vice versa. To do so, we choose the graph in the new format that implies as many
independences as possible, while not implying any independences that were not implied
by the original graph.
14.2.7 Marginalizing Variables out of a Graph
TODO: marginalizing variables out of a graph
14.2.8 Factor Graphs
Factor graphs are another way of drawing undirected models that resolve an ambiguity
in the graphical representation of standard undirected model syntax. In an undirected
model, the scope of every φ function must be a subset of some clique in the graph.
However, it is not necessary that there exist any φ whose scope contains the entirety of
every clique. Factor graphs explicitly represent the scope of each φ function. Specifically,
a factor graph is a graphical representation of an undirected model that consists of
a bipartite undirected graph. Some of the nodes are drawn as circles. These nodes
correspond to random variables as in a standard undirected model. The rest of the
nodes are drawn as squares. These nodes correspond to the factors φ of the unnormalized
probability distribution. Variables and factors may be connected with undirected edges.
A variable and a factor are connected in the graph if and only if the variable is one
of the arguments to the factor in the unnormalized probability distribution. No factor
may be connected to another factor in the graph, nor can a variable be connected to a
variable. See Fig. 14.11 for an example of how factor graphs can resolve ambiguity in
the interpretation of undirected networks.
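In code, a factor graph is naturally a list of factors, each carrying its scope explicitly, which removes exactly the ambiguity the figure below illustrates. A sketch with placeholder tables; here k = 10 states per variable, so one factor over three variables stores k³ = 1000 entries, while three pairwise factors store only 3k² = 300:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
k = 10  # states per variable

def make_factor(scope):
    """A factor is its scope plus a table with one entry per joint state."""
    return scope, rng.random((k,) * len(scope))

def p_tilde(factors, assignment):
    value = 1.0
    for scope, table in factors:
        value *= table[tuple(assignment[v] for v in scope)]
    return value

# One factor over the whole clique, as in panel (b): k**3 = 1000 entries.
one_big = [make_factor(("a", "b", "c"))]
# Three pairwise factors, as in panel (c): 3 * k**2 = 300 entries.
pairwise = [make_factor(s) for s in (("a", "b"), ("b", "c"), ("a", "c"))]

x = {"a": 7, "b": 0, "c": 4}
print(p_tilde(one_big, x), p_tilde(pairwise, x))  # two different models
```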
[Figure: (a) an undirected graph over A, B, and C; (b) a factor graph for the same model with a single factor f_1 connected to A, B, and C; (c) another factor graph for the same model with pairwise factors f_1, f_2, and f_3.]
Figure 14.11: An example of how a factor graph can resolve ambiguity in the interpretation of undirected networks. a) An undirected network with a clique involving three variables A, B, and C. b) A factor graph corresponding to the same undirected model. This factor graph has one factor over all three variables. c) Another valid factor graph for the same undirected model. This factor graph has three factors, each over only two variables. Note that representation, inference, and learning are all asymptotically cheaper in (c) compared to (b), even though both require the same undirected graph to represent.

14.3 Advantages of Structured Modeling

TODO: note that we have already shown that some things are cheaper in the sections where we introduce the modeling syntax.
TODO: revisit each of the three challenges from section 14.1.
TODO: hammer the point that graphical models convey information by leaving edges out.
TODO: show the reduced cost of sampling, but first the reader needs to know about ancestral and Gibbs sampling.
TODO: benefit of separating representation from learning and inference.
14.4 Learning About Dependencies
We consider here two types of random variables: observed or “visible” variables v and
latent or “hidden” variables h. The observed variables v correspond to the variables
actually provided in the data set during training. h consists of variables that are in-
troduced to the model in order to help it explain the structure in v. Generally the
exact semantics of h depend on the model parameters and are created by the learning
algorithm. The motivation for this is twofold.
14.4.1 Latent Variables Versus Structure Learning
Often the different elements of v are highly dependent on each other. A good model of
v which did not contain any latent variables would need to have very large numbers of
parents per node in a Bayesian network or very large cliques in a Markov network. Just
representing these higher-order interactions is costly, both in a computational sense, because the number of parameters that must be stored in memory scales exponentially with the number of members in a clique, and in a statistical sense, because this exponential number of parameters requires a wealth of data to estimate accurately.
There is also the problem of learning which variables need to be in such large cliques.
An entire field of machine learning called structure learning is devoted to this problem. For a good reference on structure learning, see Koller and Friedman (2009). Most
structure learning techniques are a form of greedy search. A structure is proposed, a
model with that structure is trained, then given a score. The score rewards high training
set accuracy and penalizes model complexity. Candidate structures with a small number
of edges added or removed are then proposed as the next step of the search, and the
search proceeds to a new structure that is expected to increase the score.
Using latent variables instead of adaptive structure avoids the need to perform dis-
crete searches and multiple rounds of training. A fixed structure over visible and hidden
variables can use direct interactions between visible and hidden units to impose indirect
interactions between visible units. Using simple parameter learning techniques we can
learn a model with a fixed structure that imposes the right structure on the marginal p(v).
14.4.2 Latent Variables for Feature Learning
Another advantage of using latent variables is that they often develop useful semantics.
As discussed in section 3.10.5, the mixture of Gaussians model learns a latent variable
that corresponds to which category of examples the input was drawn from. This means
that the latent variable in a mixture of Gaussians model can be used to do classification.
In Chapter 16 we saw how simple probabilistic models like sparse coding learn latent
variables that can be used as input features for a classifier, or as coordinates along a
manifold. Other models can be used in this same way, but deeper models and models
with different kinds of interactions can create even richer descriptions of the input. Many approaches accomplish feature learning by learning latent variables. Often, given some model of v and h, it turns out that E[h | v] or argmax_h p(h, v) is a good feature mapping for v.
14.5 Inference and Approximate Inference Over Latent Variables

As soon as we introduce latent variables in a graphical model, this raises the question: how do we choose values of the latent variables h given values of the visible variables v? This is what we call inference, in particular inference over the latent variables. The general question of inference is to guess some variables given others.
TODO: inference has definitely been introduced above.
TODO: mention loopy BP; show how it is very expensive for DBMs.
TODO: briefly explain what variational inference is and reference the approximate inference chapter.
14.5.1 Reparametrization Trick
Sometimes, in order to estimate the stochastic gradient of an expected loss over some random variable h, with respect to parameters that influence h, we would like to compute gradients through h, i.e., on the parameters that influenced the probability distribution from which h was sampled. If h is continuous-valued, this is generally possible by using the reparametrization trick, i.e., rewriting

h ∼ p(h | θ)     (14.2)

as

h = f(θ, η),     (14.3)
where η is some independent noise source of the appropriate dimension with density
p(η), and f is a continuous (differentiable almost everywhere) function. Basically, the
reparametrization trick is the idea that if the random variable to be integrated over is
continuous, we can back-propagate through the process that gave rise to it in order to figure out how to change that process.
For example, let us suppose we want to estimate the expected gradient

∇_θ ∫ L(h) p(h | θ) dh,     (14.4)

where the parameter vector θ influences the random variable h, which in turn influences our loss L. A very efficient way to achieve this (Kingma and Welling, 2014b; Rezende et al., 2014), compared to approaches that do not back-propagate through the generation of h, is to perform the reparametrization in Eq. 14.3 and the corresponding change of variable in the integral of Eq. 14.4, integrating over η rather than h:

∇_θ ∫ L(f(θ, η)) p(η) dη.     (14.5)

We can now move the derivative inside the integral, getting

g = ∫ p(η) ∂L(f(θ, η))/∂θ dη.

Finally, we get a stochastic gradient estimator

ĝ = ∂L(f(θ, η))/∂θ,

where we sampled η ∼ p(η), and E[ĝ] = g.
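A minimal NumPy sketch of this estimator, under the assumption that h is Gaussian with mean μ and fixed standard deviation σ, reparametrized as h = μ + ση with η ∼ N(0, 1). With L(h) = h², the true gradient of E[L(h)] with respect to μ is 2μ, which makes the estimate easy to check:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.5, 0.8  # theta = mu here; sigma is held fixed for simplicity

# Reparametrization: h = f(theta, eta) = mu + sigma * eta, with eta ~ N(0, 1).
eta = rng.standard_normal(100_000)
h = mu + sigma * eta

# Chain rule through f: dL/dmu = L'(h) * dh/dmu, with L(h) = h**2, dh/dmu = 1.
grad_samples = 2.0 * h
print(grad_samples.mean())  # close to the analytic gradient 2 * mu = 3.0
```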
This trick was used by Bengio (2013); Bengio et al. (2013a) to train a neural network
with stochastic hidden units. It was described at the same time by Kingma (2013), but
see the further developments in Kingma and Welling (2014b). It was used to train gen-
erative stochastic networks (GSNs) (Bengio et al., 2014a,b), described in Section 21.10,
which can be viewed as recurrent networks with noise injected both in input and hidden
units (with each time step corresponding to one step of a generative Markov chain). The
reparametrization trick was also used to estimate the parameter gradient in variational
auto-encoders (Kingma and Welling, 2014a; Rezende et al., 2014; Kingma et al., 2014),
which are described in Section 21.8.2.
14.6 The Deep Learning Approach to Structured Probabilistic Modeling
Deep learning practitioners generally use the same basic computational tools as other
machine learning practitioners who work with structured probabilistic models. However,
in the context of deep learning, we usually make different design decisions about how to
combine these tools, resulting in overall algorithms and models that have a very different
flavor from more traditional graphical models.
The most striking difference between the deep learning style of graphical model
design and the traditional style of graphical model design is that the deep learning style
heavily emphasizes the use of latent variables. Deep learning models typically have more
latent variables than observed variables. Moreover, the practitioner typically does not
intend for the latent variables to take on any specific semantics ahead of time— the
training algorithm is free to invent the concepts it needs to model a particular dataset.
The latent variables are usually not very easy for a human to interpret after the fact,
though visualization techniques may allow some rough characterization of what they
represent. Complicated non-linear interactions between variables are accomplished via
indirect connections that flow through multiple latent variables. By contrast, traditional
graphical models usually contain variables that are at least occasionally observed, even if
many of the variables are missing at random from some training examples. Complicated
non-linear interactions between variables are modeled by using higher-order terms, with
structure learning algorithms used to prune connections and control model capacity.
When latent variables are used, they are often designed with some specific semantics
in mind—the topic of a document, the intelligence of a student, the disease causing a
patient’s symptoms, etc. These models are often much more interpretable by human
practitioners and often have more theoretical guarantees, yet are less able to scale to
complex problems and are not re-useable in as many different contexts as deep models.
Another obvious difference is the kind of graph structure typically used in the deep
learning approach. This is tightly linked with the choice of inference algorithm. Tradi-
tional approaches to graphical models typically aim to maintain the tractability of exact inference. When this constraint is too limiting, a popular approximate inference algorithm is loopy belief propagation. Both of these approaches often work well with very sparsely
connected graphs. By comparison, very few interesting deep models admit exact infer-
ence, and loopy belief propagation is almost never used for deep learning. Most deep
models are designed to make Gibbs sampling or variational inference algorithms, rather
than loopy belief propagation, efficient. Another consideration is that deep learning
models contain a very large number of latent variables, making efficient numerical code
essential. As a result of these design constraints, most deep learning models are orga-
nized into regular repeating patterns of units grouped into layers, but neighboring layers
may be fully connected to each other. When sparse connections are used, they usually
follow a regular pattern, such as the block connections used in convolutional models.
Finally, the deep learning approach to graphical modeling is characterized by a marked tolerance of the unknown. Rather than simplifying the model until all quantities we might want can be computed exactly, we increase the power of the model until it is just barely possible to train or use.
[Figure: a bipartite undirected graph with hidden units h_1, h_2, h_3, h_4 in one layer and visible units v_1, v_2, v_3 in the other, with edges only between the two layers.]
Figure 14.12: An example RBM drawn as a Markov network.
We often use models whose marginal distributions
cannot be computed, and are satisfied simply to draw approximate samples from these
models. We often train models with an intractable objective function that we cannot
even approximate in a reasonable amount of time, but we are still able to approximately
train the model if we can efficiently obtain an estimate of the gradient of such a func-
tion. The deep learning approach is often to figure out what the minimum amount
of information we absolutely need is, and then to figure out how to get a reasonable
approximation of that information as quickly as possible.
14.6.1 Example: The Restricted Boltzmann Machine
The restricted Boltzmann machine (RBM) (Smolensky, 1986), or harmonium, is a canonical example of how graphical models are used for deep learning. It is an energy-based model with binary visible and hidden units. Its energy function is
E(v, h) = −b⊤v − c⊤h − v⊤Wh,
where b, c, and W are unconstrained, real-valued, learnable parameters. The model
is depicted graphically in Fig. 14.12. As this figure makes clear, an important aspect
of this model is that there are no direct interactions between any two visible units or
between any two hidden units (hence the “restricted”; a general Boltzmann machine may have arbitrary connections).
The restrictions on the RBM structure yield the nice properties
p(h | v) = ∏_i p(h_i | v)

and

p(v | h) = ∏_i p(v_i | h).
The individual conditionals are simple to compute as well. For example,

p(h_i = 1 | v) = σ(v⊤W_{:,i} + c_i),

where c_i is the bias of hidden unit i.
Together these properties allow for efficient block Gibbs sampling, alternating be-
tween sampling all of h simultaneously and sampling all of v simultaneously.
Since the energy function itself is just a linear function of the parameters, it is easy
to take the needed derivatives. For example,
∂E(v, h)/∂W_{i,j} = −v_i h_j.
These two properties, efficient Gibbs sampling and efficient derivatives, make it possible to train the RBM with stochastic approximations to ∇_θ log Z.
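A minimal NumPy sketch of these properties: the factorized conditionals derived from the energy function above, and block Gibbs sampling that alternates between the two layers. The sizes and the small random weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
n_v, n_h = 6, 4
W = 0.1 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)  # visible biases
c = np.zeros(n_h)  # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v):
    # p(h_i = 1 | v) = sigma(v^T W_{:,i} + c_i), independently for each i.
    return (rng.random(n_h) < sigmoid(v @ W + c)).astype(float)

def sample_v_given_h(h):
    # By symmetry, p(v_j = 1 | h) = sigma(W_{j,:} h + b_j).
    return (rng.random(n_v) < sigmoid(W @ h + b)).astype(float)

# Block Gibbs sampling: each step resamples an entire layer at once.
v = (rng.random(n_v) < 0.5).astype(float)
for _ in range(1000):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h)
```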