
invent the concepts it needs to model a particular dataset. The latent variables are
usually not very easy for a human to interpret after the fact, though visualization
techniques may allow some rough characterization of what they represent. When
latent variables are used in the context of traditional graphical models, they are
often designed with some specific semantics in mind—the topic of a document,
the intelligence of a student, the disease causing a patient’s symptoms, etc. These
models are often much more interpretable by human practitioners and often have
more theoretical guarantees, yet are less able to scale to complex problems and are
not reusable in as many different contexts as deep models.
Another obvious difference is the kind of connectivity typically used in the deep
learning approach. Deep graphical models typically have large groups of units that
are all connected to other groups of units, so that the interactions between two
groups may be described by a single matrix. Traditional graphical models have very
few connections and the choice of connections for each variable may be individually
designed. This is tightly linked with the choice of inference algorithm. Traditional
approaches to graphical models typically aim to maintain the tractability of exact
inference. When this constraint is too limiting, a popular approximate inference
algorithm is loopy belief propagation. Both of these approaches
often work well with very sparsely connected graphs. By comparison, models used
in deep learning tend to make use of the idea of distributed representations. This
means that each observed variable $v_i$ is connected to many latent variables
$h_j$ that provide a distributed representation of $v_i$ (and probably several other observed
variables too). Distributed representations have many advantages, but from the
point of view of graphical models and computational complexity, distributed
representations have the disadvantage of usually yielding graphs that are not
sparse enough for the traditional techniques of exact inference and loopy belief
propagation to be relevant. As a consequence, one of the most striking differences
between the larger graphical models community and the deep graphical models
community is that loopy belief propagation is almost never used for deep learning.
Most deep models are instead designed to make Gibbs sampling or variational
inference algorithms efficient. Another consideration is that deep learning models
contain a very large number of latent variables, making efficient numerical code
essential. This provides an additional motivation, besides the choice of high-level
inference algorithm, for grouping the units into layers with a matrix describing the
interaction between two layers. This allows the individual steps of the algorithm
to be implemented with efficient matrix product operations, or sparsely connected
generalizations, like block diagonal matrix products or convolutions.
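As a concrete illustration (not from the text), the sketch below shows block Gibbs
sampling in a binary restricted Boltzmann machine: the interaction between the visible
and hidden layers is a single dense weight matrix W, and each sampling step updates an
entire layer at once with one matrix product. All names and sizes here (W, b, c,
n_visible, n_hidden, the number of Gibbs steps) are illustrative, not prescribed by
the chapter.

    # Minimal sketch: block Gibbs sampling in a binary RBM, assuming
    # layer-to-layer interactions are captured by one weight matrix W.
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, batch = 784, 500, 64

    # Single dense matrix describing visible-hidden interactions,
    # plus per-layer bias vectors.
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # visible biases
    c = np.zeros(n_hidden)    # hidden biases

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v):
        # Sample every hidden unit given the visible layer, then every
        # visible unit given the hidden layer. Each conditional for a whole
        # layer is computed with a single matrix product.
        p_h = sigmoid(v @ W + c)                          # P(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(v.dtype)
        p_v = sigmoid(h @ W.T + b)                        # P(v_i = 1 | h)
        v_new = (rng.random(p_v.shape) < p_v).astype(v.dtype)
        return v_new, h

    # Run a short chain from random visible states.
    v = (rng.random((batch, n_visible)) < 0.5).astype(np.float64)
    for _ in range(10):
        v, h = gibbs_step(v)

In a sparsely connected generalization, the dense products v @ W and h @ W.T would be
replaced by block diagonal matrix products or convolutions, but the layer-at-a-time
structure of the update stays the same.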
Finally, the deep learning approach to graphical modeling is characterized by
a marked tolerance of the unknown. Rather than simplifying the model until