Chapter 9
Structured Probabilistic Models:
A Deep Learning Perspective
Deep learning draws upon many modeling formalisms that researchers can use to guide
their design efforts and describe their algorithms. One of these formalisms is the idea of
structured probabilistic models . A structured probabilistic model is a way of describing
a probability distribution, using a graph to describe which random variables in the
probability distribution interact with each other directly. Here we use “graph” in the
graph theory sense: a set of vertices connected to one another by a set of edges. Because
the structure of the model is defined by a graph, these models are often also referred to
as graphical models .
The graphical models research community is large and has developed many different
models, training algorithms, and inference algorithms. In this chapter, we provide basic
background on some of the most central ideas of graphical models, with an emphasis on
the concepts that have proven most useful to the deep learning research community. If
you already have a strong background in graphical models, you may wish to skip most
of this chapter. However, even a graphical model expert may benefit from reading the
final section of this chapter, section 9.7, in which we highlight some of the unique ways
that graphical models are used for deep learning algorithms. Deep learning practitioners
tend to use very different model structures, learning algorithms, and inference procedures
than are commonly used by the rest of the graphical models research community. In this
chapter, we identify these differences in preferences and explain the reasons for them.
In this chapter we first describe the challenges of building large-scale probabilistic
models in section 9.1. Next, we describe how to use a graph to describe the structure
of a probability distribution in section 9.2. We then revisit the challenges we described
in section 9.1 and show how the structured approach to probabilistic modeling can
overcome these challenges in section 9.3. One of the major difficulties in graphical
modeling is understanding which variables need to be able to interact directly, i.e., which
graph structures are most suitable for a given problem. We outline two approaches to
resolving this difficulty by learning about the dependencies in section 9.4. We then
describe how to draw samples from these models in section 9.5 and how to infer the
values of latent variables in section 9.6. Finally, we close with a discussion of the
unique emphasis that deep learning practitioners place on specific approaches to
graphical modeling in section 9.7.
9.1 The Challenge of Unstructured Modeling
The goal of deep learning is to scale machine learning to the kinds of challenges needed to
solve artificial intelligence. The feature spaces we work with are often very large, because
they are intended to be reasonable approximations of the kind of features that people
and animals observe through their various sensory organs. For example, many deep
learning researchers are interested in modeling the distribution over natural images. By
“natural images” we mean images that might be observed by a camera in a reasonably
ordinary environment, as opposed to synthetically rendered images, screen shots of web
pages, etc. Of course, natural images are very high dimensional. Even though much
research into natural image modeling focuses on modeling only small patches, a 32 × 32
pixel image with three color channels has 3,072 features. See Fig. 9.1 for an example of
this kind of data, and the kind of samples a machine learning model is able to generate
after learning to represent this distribution.
Modeling a rich distribution over thousands or millions of random variables is a
challenging task, both computationally and statistically. Suppose we only wanted to
model binary variables. This is the simplest possible case, and yet already it seems
overwhelming. There are 2^3072 possible binary images of this form. This number is
over 10^800 times larger than the estimated number of atoms in the universe.
In general, if we wish to model a distribution over a random vector x containing
n discrete variables capable of taking on k values each, then the naive approach of
representing P (x) by storing a lookup table with one probability value per possible
outcome requires k^n parameters!
This is not feasible for several reasons:
Memory: the cost of storing the representation: For all but very small
values of n and k, representing the distribution as a table will require too many
values to store.
Statistical efficiency: in order to avoid overfitting, a reasonable rule of thumb
is that one should have about ten times more training examples than parameters
in the model. No dataset with even a tiny fraction of this number of examples is
available for the table-based model. Any such model will overfit the training set
very badly, unless some very strong regularizer sufficiently cripples the model in
other ways.
Runtime: the cost of inference: Suppose we want to perform an inference
task where we use our model of P(x) to compute some other distribution, such as
P(x_1) or P(x_2 | x_1). Computing these distributions will require summing across
the entire table, so the runtime of these operations is as high as the intractable
memory cost of storing the model.
Figure 9.1: Probabilistic modeling of natural images. Top: Example 32 × 32 pixel color
images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). Bottom: Samples
drawn from a structured probabilistic model trained on this dataset. Each sample
appears at the same position in the grid as the training example that is closest to it in
Euclidean space. This comparison allows us to see that the model is truly synthesizing
new images, rather than memorizing the training data. Contrast of both sets of images
has been adjusted for display. Figure reproduced with permission from (Courville et al.,
2011).
Runtime: the cost of sampling: Likewise, suppose we want to draw a sample
from the model. The naive way to do this is to sample some value u ∼ U(0, 1),
then iterate through the table adding up the probability values until they exceed
u and return the outcome whose probability value was added last. This requires
reading through the whole table in the worst case, so it has the same exponential
cost as the other operations.
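To make the naive procedure concrete, here is a minimal sketch for a toy table over
n = 3 binary variables; the table entries are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy lookup-table model over n = 3 binary variables: one probability
# per outcome, 2**3 = 8 entries that sum to 1 (random placeholders here).
table = rng.random(8)
table /= table.sum()

def naive_sample(table, rng):
    """Inverse-CDF sampling: accumulate probabilities until the running
    sum exceeds a uniform draw u. In the worst case this reads the whole
    (exponentially large) table."""
    u = rng.random()
    cumulative = 0.0
    for outcome, p in enumerate(table):
        cumulative += p
        if cumulative > u:
            return outcome
    return len(table) - 1  # guard against floating-point round-off

print(f"sampled outcome: {naive_sample(table, rng):03b}")
```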
The problem with the table-based approach is that we are allowing every possible
kind of interaction between every possible subset of variables. The probability distri-
butions we encounter in real tasks are much simpler than this. Usually, most variables
influence each other only indirectly.
For example, consider modeling the finishing times of a team in a relay race. Suppose
the team consists of three runners, Alice, Bob, and Carol. At the start of the race, Alice
carries a baton and begins running around a track. After completing her lap around
the track, she hands the baton to Bob. Bob then runs his own lap and hands the
baton to Carol, who runs the final lap. We can model each of their finishing times as
a continuous random variable. Alice’s finishing time does not depend on anyone else’s,
since she goes first. Bob’s finishing time depends on Alice’s, because Bob does not
have the opportunity to start his lap until Alice has completed hers. If Alice finishes
faster, Bob will finish faster, all else being equal. Finally, Carol’s finishing time depends
on both her teammates. If Alice is slow, Bob will probably finish late too, and Carol
will have quite a late starting time and thus is likely to have a late finishing time as
well. However, Carol’s finishing time depends only indirectly on Alice’s finishing time
via Bob’s. If we already know Bob’s finishing time, we won’t be able to estimate Carol’s
finishing time better by finding out what Alice’s finishing time was. This means we can
model the relay race using only two interactions: Alice’s effect on Bob, and Bob’s effect
on Carol. We can omit the third, indirect interaction between Alice and Carol from our
model.
Structured probabilistic models provide a formal framework for modeling only direct
interactions between random variables. This allows the models to have significantly
fewer parameters, which can in turn be estimated reliably from less data. These smaller
models also have dramatically reduced computation cost in terms of storing the model,
performing inference in the model, and drawing samples from the model.
9.2 A Graphical Syntax for Describing Model Structure
The graphs used by structured probabilistic models represent random variables using
vertices (also called nodes). These graphs represent direct interactions between random
variables using edges between nodes. Different types of graphs assign different exact
meanings to these edges.
[Figure: a chain of three nodes, t_0 → t_1 → t_2, labeled Alice, Bob, and Carol.]
Figure 9.2: A directed graphical model depicting the relay race example. Alice’s finishing
time t_0 influences Bob’s finishing time t_1, because Bob does not get to start running
until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so
Bob’s finishing time t_1 influences Carol’s finishing time t_2.
9.2.1 Directed Models
One kind of structured probabilistic model is the directed graphical model, otherwise
known as the belief network or Bayesian network^1 (Pearl, 1985). Directed graphical
models are called “directed” because the edges in them are directed, that is, they point
from one vertex to another. This direction is represented in the drawing with an arrow.
In a directed graphical model, the arrow indicates which variable’s probability
distribution is defined in terms of the other’s. If the probability distribution over a random
variable b is defined in terms of the state of a random variable a, then we draw an arrow
from a to b.

^1 Judea Pearl suggested using the term Bayes network when one wishes to “emphasize the judgmental
nature” of the values computed by the network, i.e., to highlight that they usually represent degrees of
belief rather than frequencies of events.
Let’s continue with the relay race example from section 9.1. Suppose we name
Alice’s finishing time t_0, Bob’s finishing time t_1, and Carol’s finishing time t_2. As we
saw earlier, our estimate of t_1 depends on t_0. Our estimate of t_2 depends directly on t_1,
but only indirectly on t_0. We can draw this relationship in a directed graphical model,
illustrated in Fig. 9.2.
Formally, a directed graphical model defined on variables x is defined by a directed
acyclic graph G whose vertices are the random variables in the model, and a set of local
conditional probability distributions p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents
of x_i in G. The probability distribution over x is given by

p(x) = Π_i p(x_i | Pa_G(x_i)).
In our relay race example, this means that, using the graph drawn in Fig. 9.2,

p(t_0, t_1, t_2) = p(t_0) p(t_1 | t_0) p(t_2 | t_1).
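To make this factorization concrete, here is a minimal sketch of drawing a joint sample
from it; the Gaussian conditional distributions and their parameters are hypothetical,
since the text does not specify the actual distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian conditionals for the relay race: each runner's lap
# takes roughly 90 seconds, and each finishing time (measured from the
# start of the race) is centered on the previous runner's finishing time.
def sample_t0():
    return rng.normal(90.0, 5.0)        # p(t0)

def sample_t1(t0):
    return t0 + rng.normal(90.0, 5.0)   # p(t1 | t0)

def sample_t2(t1):
    return t1 + rng.normal(90.0, 5.0)   # p(t2 | t1)

# Ancestral sampling: draw each variable after its parents, following the
# factorization p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1).
t0 = sample_t0()
t1 = sample_t1(t0)
t2 = sample_t2(t1)
print(f"finishing times: t0={t0:.1f}s t1={t1:.1f}s t2={t2:.1f}s")
```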
Suppose we represented time by discretizing time ranging from minute 0 to minute
10 into 6-second chunks. This would make t_0, t_1, and t_2 each be discrete variables with
100 possible values. If we attempted to represent p(t_0, t_1, t_2) with a table, it would need
to store 999,999 values (100 values of t_0 × 100 values of t_1 × 100 values of t_2, minus 1,
since the probability of one of the configurations is made redundant by the constraint
that the sum of the probabilities be 1). If instead, we only make a table for each of the
conditional probability distributions, then the distribution over t_0 requires 99 values,
the table over t_0 and t_1 requires 9,900 values, and so does the table over t_1 and t_2.
This comes to a total of 19,899 values. This means that using the directed graphical
model reduced our number of parameters by a factor of more than 50!
In general, to model n discrete variables each having k values, the cost of the single
table approach scales like O(k^n), as we’ve observed before. Now suppose we build a
directed graphical model over these variables. If m is the maximum number of variables
appearing in a single conditional probability distribution, then the cost of the tables
for the directed model scales like O(k^m). As long as we can design a model such that
m ≪ n, we get very dramatic savings.
In other words, so long as each variable has few parents in the graph, the distri-
bution can be represented with very few parameters. Some restrictions on the graph
structure (e.g. it is a tree) can also guarantee that operations like computing marginal
or conditional distributions over subsets of variables are efficient.
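These counts are easy to reproduce programmatically; in this sketch the helper
function is our own:

```python
def cpd_table_size(k, num_parents):
    """Free parameters in one conditional probability table for a variable
    with k values: (k - 1) probabilities per joint parent configuration."""
    return (k - 1) * k ** num_parents

k = 100  # 6-second bins over 10 minutes
joint = k ** 3 - 1                         # one giant table over (t0, t1, t2)
factored = (cpd_table_size(k, 0)           # p(t0): 99 values
            + cpd_table_size(k, 1)         # p(t1 | t0): 9,900 values
            + cpd_table_size(k, 1))        # p(t2 | t1): 9,900 values
print(joint, factored, joint / factored)   # 999999 19899 ~50.25
```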
It’s important to realize what kinds of information can be encoded in the graph, and
what can’t be. The graph just encodes simplifying assumptions about which variables
are conditionally independent from each other. It’s also possible to make other kinds of
simplifying assumptions. For example, suppose we assume Bob always runs at the same
speed, regardless of how Alice performed. (In reality, he might get overconfident and
lazy and run slower if Alice was especially fast, or he might try especially hard to make
up time if Alice was slower than usual.) Then the only effect Alice has on Bob’s finishing
time is that we must add Alice’s finishing time to the total amount of time we think
Bob needs to run. This could let us make a model with O(k) parameters instead of
O(k^2). However, note that t_0 and t_1 are still directly dependent with this assumption.
It is a useful assumption, but not one that can be encoded in a graph. It is just part of
our definition of the conditional probability distribution p(t_1 | t_0).
9.2.2 Undirected Models
Directed graphical models make a lot of intuitive sense in many situations. Usually
these are situations where we understand the causality, and the causality only flows in
one direction (such as in the relay race example: earlier runners cause effects on the
finishing times of later runners, while later runners do not cause any effects on the
finishing times of earlier runners). Not all situations fit this model, so directed models
are not always the most natural formalism to use.
Suppose we want to model a distribution over three binary variables: whether or not
you are sick, whether or not your coworker is sick, and whether or not your roommate is
sick. As in the relay race example, we can still make simplifying assumptions about the
kinds of interactions that take place: assuming that your coworker and your roommate
do not know each other, it is very unlikely that one of them will give the other a disease
such as a cold directly. This event can be seen as so rare that it is acceptable not to
model it. However, it is reasonably likely that either of them could give you a cold, and
[Figure: an undirected graph with nodes h_r, h_y, and h_c, with an edge between h_r
and h_y and an edge between h_y and h_c.]
Figure 9.3: An undirected graph representing how your roommate’s health h_r, your
health h_y, and your work colleague’s health h_c affect each other. You and your
roommate might infect each other with a cold, and you and your work colleague might
do the same, but assuming that your roommate and your colleague don’t know each
other, they can only infect each other indirectly via you.
that you could pass it on to the other. We can model the indirect transmission of a cold
from your coworker to your roommate by modeling the transmission of the cold from
your coworker to you and the transmission of the cold from you to your roommate.
In this case, it’s just as easy for you to cause your roommate to get sick as it is for
your roommate to make you sick, so there is not a clean, uni-directional narrative on
which to base the model. To handle this kind of situation, we can use another kind of
model.
Undirected models, otherwise known as Markov random fields (MRFs) or Markov
networks (Kindermann, 1980), are graphical models in which the edges are not directed.
If two nodes are connected by an edge, then the random variables corresponding to those
nodes interact with each other directly.
Let’s call the random variable representing your health h_y, the random variable
representing your roommate’s health h_r, and the random variable representing your
colleague’s health h_c. See Fig. 9.3 for a drawing of the graph representing this scenario.
Formally, an undirected graphical model is a structured probabilistic model defined
on an undirected graph G. For each clique C in the graph,^2 a factor φ(C) (also called a
clique potential) measures the affinity of the variables in that clique for being in each
of their possible states. The factors are constrained to be non-negative. Together they
define an unnormalized probability distribution

˜p(x) = Π_{C∈G} φ(C).

^2 A clique of the graph is a subset of nodes that are all connected to each other by an arc of the
graph.
The unnormalized probability distribution is efficient to work with so long as all the
cliques are small. It encodes the idea that states with higher affinity are more likely.
However, unlike in a Bayesian network, there is little structure to the definition of the
cliques, so there is nothing to guarantee that multiplying them together will yield a valid
probability distribution. See Fig. 9.4 for an example of reading factorization information
from an undirected graph.
9.2.3 The Partition Function
While the unnormalized probability distribution is guaranteed to be non-negative
everywhere, it is not guaranteed to sum or integrate to 1. To obtain a valid probability
distribution, we must use the corresponding normalized probability distribution,^3

p(x) = (1/Z) ˜p(x),

where Z is the value that results in the probability distribution summing or integrating
to 1:

Z = ∫ ˜p(x) dx.
You can think of Z as a constant when the φ functions are held constant. Note
that if the φ functions have parameters, then Z is a function of those parameters. It
is common in the literature to write Z with its arguments omitted to save space. Z is
known as the partition function, a term borrowed from statistical physics.
Since Z is an integral or sum over all possible joint assignments of the state x it is
often intractable to compute. In order to be able to obtain the normalized probability
distribution of an undirected model, the model structure and the definitions of the φ
functions must be conducive to computing Z efficiently.
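For a model this small, Z can be computed by brute-force enumeration. The sketch
below does so for the cold example of section 9.2.2, with made-up potential values
that reward agreement between your health and each acquaintance’s:

```python
import itertools

# Made-up clique potentials (0 = healthy, 1 = sick); the value 2.0 is an
# arbitrary choice that makes agreeing states more likely.
def phi(a, b):
    return 2.0 if a == b else 1.0

def p_tilde(h_y, h_r, h_c):
    # One factor per clique: (h_y, h_r) and (h_y, h_c).
    return phi(h_y, h_r) * phi(h_y, h_c)

# Brute-force partition function: sum over all 2**3 joint states. This is
# feasible only because the state space is tiny.
Z = sum(p_tilde(*s) for s in itertools.product([0, 1], repeat=3))

def p(h_y, h_r, h_c):
    return p_tilde(h_y, h_r, h_c) / Z

print(Z, p(1, 1, 1))  # Z = 18.0, p(everyone sick) = 4/18
```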
Note that for p(x) to exist, Z must exist. The generic definition of Z does not
guarantee that it exists in general. For example, suppose we want to model a single
scalar variable x ∈ R with a single clique potential φ(x) = x^2. In this case,

Z = ∫_{−∞}^{∞} x^2 dx.
Since this integral diverges, there is no probability distribution corresponding to this
choice of φ(x). Sometimes the choice of some parameter of the φ functions determines
whether the probability distribution is defined. For example, for φ(x) = exp(βx^2), the
β parameter determines whether Z exists. Negative β results in a Gaussian distribution
over x, but all other values of β make φ impossible to normalize.
One key difference between directed modeling and undirected modeling is that di-
rected models are defined directly in terms of probability distributions from the start,
while undirected models are defined more loosely by φ functions that are then converted
into probability distributions. This changes the intuitions one must develop in order to
work with these models. One key idea to keep in mind while working with undirected
models is that the domain of each of the variables has a dramatic consequence on the
kind of probability distribution that a given set of φ functions results in. For example,
consider an n-dimensional vector-valued random variable x and an undirected model
parameterized by a vector of biases b. Suppose we have one clique for each element of
x, φ_i(x_i) = exp(b_i x_i). What kind of probability distribution does this result in? The
answer is that we don’t have enough information, because we have not yet specified
the domain of x. If x ∈ R^n, then the integral defining Z diverges and no probability
distribution exists. If x ∈ {0, 1}^n, then p(x) factorizes into n independent distributions,
with p(x_i = 1) = sigmoid(b_i). If the domain of x is the set of elementary basis vectors
({[1, 0, . . . , 0], [0, 1, . . . , 0], . . . , [0, 0, . . . , 1]}), then p(x) = softmax(b), so a large value
of b_i actually reduces p(x_j = 1) for j ≠ i. Often, it is possible to leverage the effect of
a carefully chosen domain of a variable in order to obtain complicated behavior.

^3 A distribution defined by normalizing a product of clique potentials is also called a Gibbs
distribution.

[Figure: an undirected graph with nodes A, B, C in the top row and D, E, F in the
bottom row, and edges A-B, B-C, A-D, B-E, and E-F.]
Figure 9.4: This graph implies that p(A, B, C, D, E, F) can be written as
(1/Z) φ_{A,B}(A, B) φ_{B,C}(B, C) φ_{A,D}(A, D) φ_{B,E}(B, E) φ_{E,F}(E, F) for an
appropriate choice of the φ functions.
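The contrast between the last two domains above is easy to verify numerically; the
bias values in this sketch are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b = np.array([1.0, 2.0, 3.0])  # arbitrary bias values

# Domain {0, 1}^n: phi_i(x_i) = exp(b_i * x_i) normalizes coordinate-wise,
# giving n independent Bernoulli distributions.
p_independent = sigmoid(b)               # p(x_i = 1) for each i

# Domain = elementary basis (one-hot) vectors: the same potentials
# normalize jointly, giving a single categorical distribution.
p_one_hot = np.exp(b) / np.exp(b).sum()  # softmax(b)

print(p_independent)  # entries need not sum to 1
print(p_one_hot)      # entries sum to 1; raising one b_i lowers the rest
```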
9.2.4 Energy-Based Models
Many interesting theoretical results about undirected models depend on the assumption
that ∀x, ˜p(x) > 0. A convenient way to enforce this is to use an energy-based model
(EBM), where

˜p(x) = exp(−E(x))     (9.1)

and E(x) is known as the energy function. Because exp(z) is positive for all z, this
guarantees that no energy function will result in a probability of zero for any state
x. Being completely free to choose the energy function makes learning simpler. If we
learned the clique potentials directly, we would need to use constrained optimization,
and we would need to impose some specific minimal probability value. By learning the
energy function, we can use unconstrained optimization,^4 and the probabilities in the
model can approach arbitrarily close to zero but never reach it.
Any distribution of the form given by equation 9.1 is an example of a Boltzmann
distribution. For this reason, many energy-based models are called Boltzmann machines.
There is no accepted guideline for when to call a model an energy-based model and when
to call it a Boltzmann machine. The term Boltzmann machine was first introduced to
describe a model with exclusively binary variables, but today many models such as the
mean-covariance restricted Boltzmann machine incorporate real-valued variables as well.
Cliques in an undirected graph correspond to factors of the unnormalized probabil-
ity function. Because exp(a) exp(b) = exp(a + b), this means that different cliques in
the undirected graph correspond to the different terms of the energy function. In other
words, an energy-based model is just a special kind of Markov network: the exponen-
tiation makes each term in the energy function correspond to a factor for a different
clique. See Fig. 9.5 for an example of how to read the form of the energy function from
an undirected graph structure.
^4 For some models, we may still need to use constrained optimization to make sure Z exists.
[Figure: the same graph as Fig. 9.4, with nodes A, B, C and D, E, F.]
Figure 9.5: This graph implies that E(A, B, C, D, E, F) can be written as E_{A,B}(A, B) +
E_{B,C}(B, C) + E_{A,D}(A, D) + E_{B,E}(B, E) + E_{E,F}(E, F) for an appropriate choice
of the per-clique energy functions. Note that we can obtain the φ functions in Fig. 9.4
by setting each φ to the exp of the corresponding negative energy, e.g.,
φ_{A,B}(A, B) = exp(−E_{A,B}(A, B)).
As a historical note, observe that the − sign in equation 9.1 does not change the
representational power of the energy-based model. From a machine learning point of view, the
negation serves no purpose. Some machine learning researchers (e.g., Smolensky (1986),
who referred to negative energy as harmony) have worked on related ideas that omit the
negation. However, in the field of statistical physics, energy is a useful concept because
it refers to real energy of physical particles. Many advances in probabilistic modeling
were originally developed by statistical physicists, and terminology such as “energy” and
“partition function” remains associated with these techniques, even though their math-
ematical applicability is broader than the physics context in which they were developed.
9.2.5 Separation and D-Separation
The main purpose of a graphical model is to specify which interactions do not occur in a
given probability distribution so that we can save computational resources and estimate
the model with greater statistical efficiency. The edges in the graph show which variables
directly interact, but it can also be useful to know which variables interact indirectly,
and in what context.
Identifying the independences in a graph is very simple in the case of undirected
models. In the context of undirected models, this independence implied by the graph
is called separation. We say that a set of variables A is separated from another set
of variables B given a third set of variables S if the graph structure implies that A is
independent from B given S. Determining which variables are separated is simple. If two
variables A and B are connected by a path involving only unobserved variables, then
those variables are not separated. If these variables cannot be shown to depend indirectly
on each other in this manner, then they are separated. We refer to paths involving only
unobserved variables as “active” and paths including an observed variable as “inactive.”
When we draw a graph, we can indicate observed variables by shading them in. See
Fig. 9.6 for a depiction of what an active and an inactive path looks like when drawn in
this way. See Fig. 9.7 for an example of reading separation from a graph.
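Separation can be checked mechanically with a graph search that is only allowed to
pass through unobserved nodes. The following sketch does this; the edge set used in
the test is one choice consistent with the caption of Fig. 9.7:

```python
from collections import deque

def separated(graph, A, B, observed):
    """Breadth-first search that may pass only through unobserved nodes.
    A and B are separated given `observed` iff no such path connects them."""
    frontier = deque(a for a in A if a not in observed)
    reached = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False  # active path found: not separated
        for neighbor in graph[node]:
            if neighbor not in reached and neighbor not in observed:
                reached.add(neighbor)
                frontier.append(neighbor)
    return True

# One edge set consistent with the caption of Fig. 9.7: A-B, B-C, A-D, B-D.
g = {"A": {"B", "D"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"A", "B"}}
print(separated(g, {"A"}, {"C"}, {"B"}))  # True: observing B blocks A-C
print(separated(g, {"A"}, {"D"}, {"B"}))  # False: the direct A-D path is active
```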
Similar concepts apply to directed models, except that in the context of directed
[Figure: (a) a chain of nodes A, S, B with S unshaded; (b) the same chain with S
shaded to indicate that it is observed.]
Figure 9.6: a) The path between random variable A and random variable B through
S is active, because S is not observed. This means that A and B are not separated.
b) Here S is shaded in, to indicate that it is observed. Because the only path between
A and B is through S, and that path is inactive, we can conclude that A and B are
separated given S.
[Figure: an undirected graph over nodes A, B, C, and D; B is shaded to indicate
that it is observed.]
Figure 9.7: An example of reading separation properties from an undirected graph. B
is shaded to indicate that it is observed. Because observing B blocks the only path from
A to C, we say that A and C are separated from each other given B. The observation
of B also blocks one path between A and D, but there is a second, active path between
them. Therefore, A and D are not separated given B.
models, these concepts are referred to as d-separation. The “d” stands for “dependence.”
D-separation for directed graphs is defined the same as separation for undirected graphs:
We say that a set of variables A is d-separated from another set of variables B given
a third set of variables S if the graph structure implies that A is independent from B
given S.
As with undirected models, we can examine the independences implied by the graph
by looking at what active paths exist in the graph. As before, two variables are depen-
dent if there is an active path between them, and d-separated if no such path exists. In
directed nets, determining whether a path is active is somewhat more complicated. See
Fig. 9.8 for a guide to identifying active paths in a directed model. See Fig. 9.9 for an
example of reading some properties from a graph.
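Determining whether an active path exists in a directed graph can also be mechanized.
Below is a sketch of the standard reachability procedure (the “Bayes ball” algorithm
described by Koller and Friedman (2009)); the function name and graph encoding are
our own, and the test graph follows the caption of Fig. 9.9:

```python
def d_separated(parents, X, Y, observed):
    """Standard reachability check ("Bayes ball"): X and Y are d-separated
    given `observed` iff no active path connects them. `parents` maps each
    node to its set of parents in the DAG."""
    children = {n: set() for n in parents}
    for n in parents:
        for p in parents[n]:
            children[p].add(n)
    # Nodes that are observed or have an observed descendant: exactly the
    # colliders that pass the ball between their parents.
    unblocked = set()
    stack = list(observed)
    while stack:
        n = stack.pop()
        if n not in unblocked:
            unblocked.add(n)
            stack.extend(parents[n])
    visited = set()
    stack = [(X, "up")]  # leaving X toward its parents counts as "up"
    while stack:
        node, direction = stack.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == Y and node not in observed:
            return False  # an active path reaches Y
        if direction == "up" and node not in observed:
            stack.extend((p, "up") for p in parents[node])
            stack.extend((c, "down") for c in children[node])
        elif direction == "down":
            if node not in observed:      # chain continues downward
                stack.extend((c, "down") for c in children[node])
            if node in unblocked:         # collider with observed descendant
                stack.extend((p, "up") for p in parents[node])
    return True

# The graph of Fig. 9.9 as described in its caption: A -> C <- B, C -> D, C -> E.
pa = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}, "E": {"C"}}
print(d_separated(pa, "A", "B", set()))   # True
print(d_separated(pa, "A", "E", {"C"}))   # True
print(d_separated(pa, "A", "B", {"C"}))   # False: explaining away
print(d_separated(pa, "A", "B", {"D"}))   # False: descendant of C observed
```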
It is important to remember that separation and d-separation tell us only about
those conditional independences that are implied by the graph. There is no requirement
that the graph imply all independences that are present. In particular, it is always
legitimate to use the complete graph (the graph with all possible edges) to represent
any distribution. In fact, some distributions contain independences that are not possible
to represent with graphical notation. Context-specific independences are independences
that are present depending on the value of some variables in the network. For example,
consider a model of three binary variables, A, B, and C. Suppose that when A is 0, B
and C are independent, but when A is 1, B is deterministically equal to C. Encoding
the behavior when A = 1 requires an edge connecting B and C. The graph then fails
to indicate that B and C are independent when A = 1.
In general, a graph will never imply that an independence exists when it does not.
However, a graph may fail to encode an independence.
9.2.6 Operations on a Graph
TODO: conversion between directed and undirected models
TODO: marginalizing variables out of a graph
9.2.7 Factor Graphs
Factor graphs are another way of drawing undirected models that resolve an ambiguity
in the graphical representation of standard undirected model syntax. In an undirected
model, the scope of every φ function must be a subset of some clique in the graph.
However, it is not necessary that there exist any φ whose scope contains the entirety of
every clique. Factor graphs explicitly represent the scope of each φ function. Specifically,
a factor graph is a graphical representation of an undirected model that consists of
a bipartite undirected graph. Some of the nodes are drawn as circles. These nodes
correspond to random variables as in a standard undirected model. The rest of the
nodes are drawn as squares. These nodes correspond to the factors φ of the unnormalized
probability distribution. Variables and factors may be connected with undirected edges.
A variable and a factor are connected in the graph if and only if the variable is one
of the arguments to the factor in the unnormalized probability distribution. No factor
[Figure: the four kinds of length-two active paths between A and B through S: (a) a
chain A → S → B (or its reverse), (b) a common cause A ← S → B, (c) a V-structure
A → S ← B, and (d) a V-structure A → S ← B whose child S has a descendant C.]
Figure 9.8: All of the kinds of active paths of length two that can exist between random
variables A and B. a) Any path with arrows proceeding directly from A to B or vice
versa. This kind of path becomes blocked if S is observed. We have already seen this
kind of path in the relay race example. b) A and B are connected by a common cause S.
For example, suppose S is a variable indicating whether or not there is a hurricane and
A and B measure the wind speed at two different nearby weather monitoring outposts.
If we observe very high winds at station A, we might expect to also see high winds at
B. This kind of path can be blocked by observing S. If we already know there is a
hurricane, we expect to see high winds at B, regardless of what is observed at A. A
lower than expected wind at A (for a hurricane) would not change our expectation of
winds at B (knowing there is a hurricane). However, if S is not observed, then A and
B are dependent, i.e., the path is active. c) A and B are both parents of S. This is
called a V-structure or the collider case. A and B are related by the explaining away
effect. In this case, the path is actually active when S is observed. For example, suppose
S is a variable indicating that your colleague is not at work. A represents her being
sick, while B represents her being on vacation. If you observe that she is not at work,
you can presume she is probably sick or on vacation, but it’s not especially likely that
both have happened at the same time. If you find out that she is on vacation, this fact
is sufficient to explain her absence, and you can infer that she is probably not also sick.
d) The explaining away effect happens even if any descendant of S is observed! For
example, suppose that C is a variable representing whether you have received a report
from your colleague. If you notice that you have not received the report, this increases
your estimate of the probability that she is not at work today, which in turn makes it
more likely that she is either sick or on vacation. The only way to block a path through
a V-structure is for neither the shared child nor any of its descendants to be observed.
[Figure: a directed graph with edges A → C, B → C, C → D, and C → E.]
Figure 9.9: From this graph, we can read out several d-separation properties. Examples
include: A and B are d-separated given the empty set; A and E are d-separated given
C; D and E are d-separated given C. Note that there are a few d-separations that do
not occur: A and B are not d-separated given C; A and B are still not d-separated
given D.
may be connected to another factor in the graph, nor can a variable be connected to a
variable. See Fig. 9.10 for an example of how factor graphs can resolve ambiguity in the
interpretation of undirected networks.
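A sketch of the two readings in Fig. 9.10, with arbitrary random tables standing in
for the factors, makes the cost difference concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # hypothetical number of states per variable

# Interpretation (b) of Fig. 9.10: a single factor f1 whose scope is the
# entire clique {A, B, C}. Its table has k**3 entries.
f1_joint = rng.random((k, k, k))

# Interpretation (c): three pairwise factors over the same clique.
# Together they store only 3 * k**2 entries.
f1, f2, f3 = rng.random((k, k)), rng.random((k, k)), rng.random((k, k))

def p_tilde_b(a, b, c):
    return f1_joint[a, b, c]

def p_tilde_c(a, b, c):
    return f1[a, b] * f2[b, c] * f3[a, c]

# Both define unnormalized distributions over the same undirected graph,
# but (c) is asymptotically cheaper: O(k**2) vs. O(k**3) values.
print(f1_joint.size, f1.size + f2.size + f3.size)  # 1000 vs. 300
```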
9.3 Advantages of Structured Modeling
TODO: hammer point that graphical models convey information by leaving edges out
TODO: revisit each of the three challenges from sec:unstructured
TODO: don’t forget to teach about ancestral and Gibbs sampling while showing the
reduced cost of sampling
TODO: benefit of separating representation from learning and inference
9.4 Learning about Dependencies
We consider here two types of random variables: observed or “visible” variables v and
latent or “hidden” variables h. v corresponds to the variables actually provided in the
data set during training. h consists of variables that are introduced to the model in
order to help it explain the structure in v. Generally the exact semantics of h depend
on the model parameters and are created by the learning algorithm. The motivation for
this is twofold.
9.4.1 Latent Variables Versus Structure Learning
Often the different elements of v are highly dependent on each other. A good model of
v which did not contain any latent variables would need to have very large numbers of
parents per node in a Bayesian network or very large cliques in a Markov network. Just
representing these higher-order interactions is costly, both in a computational sense,
because the number of parameters that must be stored in memory scales exponentially
with the number of members in a clique, but also in a statistical sense, because this
exponential number of parameters requires a wealth of data to estimate accurately.

[Figure: three drawings over variables A, B, C: (a) an undirected network with a
three-variable clique, (b) a factor graph with one factor f_1 over all three variables,
(c) a factor graph with pairwise factors f_1, f_2, f_3.]
Figure 9.10: An example of how a factor graph can resolve ambiguity in the
interpretation of undirected networks. a) An undirected network with a clique involving three
variables A, B, and C. b) A factor graph corresponding to the same undirected model.
This factor graph has one factor over all three variables. c) Another valid factor graph
for the same undirected model. This factor graph has three factors, each over only
two variables. Note that representation, inference, and learning are all asymptotically
cheaper in (c) compared to (b), even though both require the same undirected graph to
represent.
There is also the problem of learning which variables need to be in such large cliques.
An entire field of machine learning called structure learning is devoted to this problem
(Koller and Friedman, 2009). Most structure learning techniques involve fitting a model
with a specific structure to the data, assigning it some score that rewards high training
set accuracy and penalizes model complexity, then greedily adding or subtracting an
edge from the graph in a way that is expected to increase the score.
Using latent variables instead avoids this whole problem. A fixed structure over
visible and hidden variables can use direct interactions between visible and hidden units
to impose indirect interactions between visible units. Using simple parameter learning
techniques we can learn a model with a fixed structure that imputes the right structure
on the marginal p(v).
9.4.2 Latent Variables for Feature Learning
Another advantage of using latent variables is that they often develop useful semantics.
As discussed in section 3.10.5, the mixture of Gaussians model learns a latent variable
that corresponds to which category of examples the input was drawn from. Other more
sophisticated models with more latent variables can create even richer descriptions of
the input. Many approaches to feature learning accomplish it by learning latent
variables. Often, given some model of v and h, it turns out that E[h | v] or
argmax_h p(h, v) is a good feature mapping for v.
TODO: structure learning
TODO: latent variables
9.5 Markov Chain Monte Carlo Methods
Drawing a sample x from the probability distribution p(x) defined by a structured model
is an important operation. The following techniques are described in (Koller and Fried-
man, 2009).
Sampling from an energy-based model is not straightforward. Suppose we have an
EBM defining a distribution p(a, b). In order to sample a, we must draw it from p(a | b),
and in order to sample b, we must draw it from p(b | a). It seems to be an intractable
chicken-and-egg problem. Directed models avoid this because their graph G is directed
and acyclic. In ancestral sampling, one simply samples each of the variables in topological
order, conditioning on each variable’s parents, which are guaranteed to have already
been sampled. This defines an efficient, single-pass method of obtaining a sample.
In an EBM, it turns out that we can get around this chicken-and-egg problem by
sampling using a Markov chain. A Markov chain is defined by a state x and a transition
distribution T(x′ | x). Running the Markov chain means repeatedly updating the state
x to a value x′ sampled from T(x′ | x).

Under certain conditions, a Markov chain is eventually guaranteed to draw x′ from
an equilibrium distribution π(x′), defined by the condition

∀x′, π(x′) = Σ_x T(x′ | x) π(x).
TODO: this vector/matrix view needs a whole lot more exposition. It is only literally
a vector/matrix when the state is discrete. Unpack into multiple sentences; the
parenthetical is hard to parse. Is the term “stochastic matrix” defined anywhere? Make
sure it’s in the index at least. Whoever finishes writing this section should also finish
making the math notation consistent.
We can think of π as a vector, with the probability for each possible value x stored in
the element indexed by x, π(x), and T as a corresponding stochastic matrix, with row
index x′ and column index x, i.e., with non-negative entries that sum to 1 over the
elements of a column. Then the above equation becomes

Tπ = π,

an eigenvector equation that says that π is an eigenvector of T with eigenvalue 1. It
can be shown (by the Perron-Frobenius theorem) that this is the largest possible
eigenvalue, and the only one with value 1 under mild conditions (for example,
T(x′ | x) > 0). We can also
see this equation as a fixed point equation for the update of the distribution associated
with each step of the Markov chain. If we start a chain by picking x_0 ∼ p_0, then we
get a distribution p_1 = T p_0 after one step, and p_t = T p_{t−1} = T^t p_0 after t steps.
If this recursion converges (the chain has a so-called stationary distribution), then it
converges to a fixed point, which is precisely p_t = π for t → ∞, and the dynamical
systems view meets and agrees with the eigenvector view.
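Both views are easy to check numerically for a small discrete chain; in this sketch
the two-state transition matrix is hypothetical:

```python
import numpy as np

# A made-up 2-state chain. Column x of T holds T(x' | x), following the
# column-stochastic convention above, so each column sums to 1.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])

p = np.array([1.0, 0.0])  # p_0: start deterministically in state 0
for _ in range(50):
    p = T @ p             # p_t = T p_{t-1}
print(p)                  # approaches the stationary distribution pi

# The eigenvector view: pi is the eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(pi / pi.sum())      # same distribution, up to normalization
```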
This condition guarantees that repeated applications of the transition sampling pro-
cedure don’t change the distribution over the state of the Markov chain. Running the
Markov chain until it reaches its equilibrium distribution is called “burning in” the
Markov chain.
Unfortunately, there is no theory to predict how many steps the Markov chain must
run before reaching its equilibrium distribution, nor any way to tell for sure that this
event has happened. Also, even though successive samples come from the same distri-
bution, they are highly correlated with each other, so to obtain multiple samples one
should run the Markov chain for many steps between collecting each sample. Markov
chains tend to get stuck in a single mode of π(x) for several steps. The speed with which
a Markov chain moves from mode to mode is called its mixing rate. Since burning in
a Markov chain and getting it to mix well may take several sampling steps, sampling
correctly from an EBM is still a somewhat costly procedure.
TODO: mention Metropolis-Hastings
Of course, all of this depends on ensuring π(x) = p(x). Fortunately, this is easy
so long as p(x) is defined by an EBM. The simplest method is to use Gibbs sampling,
in which sampling from T(x′ | x) is accomplished by selecting one variable x_i and
sampling it from p conditioned on its neighbors in G. It is also possible to sample
several variables at the same time so long as they are conditionally independent given
all of their neighbors.
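As a concrete illustration, here is a minimal sketch of Gibbs sampling for an EBM over
two binary variables that prefer to take the same state; the energy function and its
weight w are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical EBM over binary a, b that prefer to agree:
# E(a, b) = -w * a * b with w > 0, so p_tilde(a, b) = exp(w * a * b).
w = 2.0

def sample_given(other):
    """Sample one variable from its conditional given the other:
    p(x = 1 | other) = exp(w * other) / (exp(w * other) + 1)."""
    p1 = 1.0 / (1.0 + np.exp(-w * other))
    return int(rng.random() < p1)

a, b = 0, 0
samples = []
for step in range(10000):
    a = sample_given(b)   # resample a from p(a | b)
    b = sample_given(a)   # resample b from p(b | a)
    samples.append((a, b))

agree = np.mean([x == y for x, y in samples])
print(f"fraction of samples with a == b: {agree:.2f}")  # well above 0.5
```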
TODO: discussion of mixing example with 2 binary variables that prefer to both
have the same state IG’s graphic from lecture on adversarial nets
TODO: add section on MCMC; it needs to be developed here so both the generative
autoencoders and the advanced deep nets can refer back to it.
TODO: there is some discussion of Markov chains already when describing how to
sample from an EBM; determine how to present content. NOTE: there seems to be
stuff about MCMC in section 7.3 already.
TODO: refer to this figure in the text
9.6 Inference and Approximate Inference Over Latent Vari-
ables
As soon as we introduce latent variables in a graphical model, a question arises: how
do we choose values of the latent variables h given values of the visible variables v?
This is what we call inference, in particular inference over the latent variables. The
general question of inference is to guess the values of some variables given others.
TODO: mention loopy BP, show how it is very expensive for DBMs
TODO: briefly explain what variational inference is and reference approximate in-
ference chapter
Figure 9.11: Paths followed by Gibbs sampling for three distributions, with the Markov
chain initialized at the mode in each case. Left) A multivariate normal distribution
with two independent variables. Gibbs sampling mixes well because the variables are
independent. Center) A multivariate normal distribution with highly correlated vari-
ables. The correlation between variables makes it difficult for the Markov chain to mix.
Because each variable must be updated conditioned on the other, the correlation reduces
the rate at which the Markov chain can move away from the starting point. Right) A
mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs
sampling mixes very slowly because it is difficult to change modes while altering only
one variable at a time.
Figure 9.12: An illustration of the slow mixing problem in deep probabilistic models.
Each panel should be read left to right, top to bottom. Left) Consecutive samples
from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST
dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is
performed in a deep graphical model, this similarity is based more on semantic features
than on raw visual ones, but it is still difficult for the Gibbs chain to transition from
one mode of the distribution to another, for example by changing the digit identity.
Right) Consecutive ancestral samples from a generative adversarial network. Because
ancestral sampling generates each sample independently from the others, there is no
mixing problem.
9.7 The Deep Learning Approach to Structured Proba-
bilistic Modeling
Deep learning practitioners generally use the same basic computational tools as other
machine learning practitioners who work with structured probabilistic models. However,
in the context of deep learning, we usually make different design decisions about how to
combine these tools, resulting in overall algorithms and models that have a very different
flavor from more traditional graphical models.
The most striking difference between the deep learning style of graphical model de-
sign and the traditional style of graphical model design is that the deep learning style
heavily emphasizes the use of latent variables. Deep learning models typically have more
latent variables than observed variables. Moreover, the practitioner typically does not
intend for the latent variables to take on any specific semantics ahead of time— the
training algorithm is free to invent the concepts it needs to model a particular dataset.
The latent variables are usually not very easy for a human to interpret after the fact,
though visualization techniques may allow some rough characterization of what they
represent. Complicated non-linear interactions between variables are accomplished via
indirect connections that flow through multiple latent variables. By contrast, traditional
graphical models usually contain variables that are at least occasionally observed, even
if many of the variables are missing at random from some training examples. Com-
plicated non-linear interactions between variables are modeled by using higher-order
terms, with structure learning algorithms used to prune connections and control model
capacity. When latent variables are used, they are often designed with some specific
semantics in mind: the topic of a document, the intelligence of a student, the disease
causing a patient’s symptoms, etc. These models are often much
more interpretable by human practitioners and often have more theoretical guarantees,
yet are less able to scale to complex problems and are not reusable in as many different
contexts as deep models.
Another obvious difference is the kind of graph structure typically used in the deep
learning approach. This is tightly linked with the choice of inference algorithm. Tradi-
tional approaches to graphical models typically aim to maintain the tractability of exact
inference. When this constraint is too limiting, a popular approximate inference algorithm is
loopy belief propagation. Both of these approaches often work well with very sparsely
connected graphs. By comparison, very few interesting deep models admit exact infer-
ence, and loopy belief propagation is almost never used for deep learning. Most deep
models are designed to make Gibbs sampling or variational inference algorithms, rather
than loopy belief propagation, efficient. Another consideration is that deep learning
models contain a very large number of latent variables, making efficient numerical code
essential. As a result of these design constraints, most deep learning models are orga-
nized into regular repeating patterns of units grouped into layers, but neighboring layers
may be fully connected to each other. When sparse connections are used, they usually
follow a regular pattern, such as the block connections used in convolutional models.
Finally, the deep learning approach to graphical modeling is characterized by a
marked tolerance of the unknown. Rather than simplifying the model until all quantities
[Figure: a bipartite undirected graph with hidden units h_1, h_2, h_3, h_4 in one layer
and visible units v_1, v_2, v_3 in the other.]
Figure 9.13: An example RBM drawn as a Markov network.
we might want can be computed exactly, we increase the power of the model until it is
just barely possible to train or use. We often use models whose marginal distributions
cannot be computed, and are satisfied simply to draw approximate samples from these
models. We often train models with an intractable objective function that we cannot
even approximate in a reasonable amount of time, but we are still able to approximately
train the model if we can efficiently obtain an estimate of the gradient of such a func-
tion. The deep learning approach is often to figure out what the minimum amount
of information we absolutely need is, and then to figure out how to get a reasonable
approximation of that information as quickly as possible.
9.7.1 Example: The Restricted Boltzmann Machine
TODO: rework this section. Add pointer to Chapter 17.1. TODO: what do we want to
exemplify here?
The restricted Boltzmann machine (RBM) (Smolensky, 1986), or harmonium, is an
example of a model that TODO: what do we want to exemplify here?
It is an energy-based model with binary visible and hidden units. Its energy function
is

E(v, h) = −b⊤v − c⊤h − v⊤W h,

where b, c, and W are unconstrained, real-valued, learnable parameters. The model
is depicted graphically in Fig. 9.13. As this figure makes clear, an important aspect
of this model is that there are no direct interactions between any two visible units or
between any two hidden units (hence the “restricted”; a general Boltzmann machine
may have arbitrary connections).
The restrictions on the RBM structure yield the nice properties

p(h | v) = Π_i p(h_i | v)

and

p(v | h) = Π_i p(v_i | h).
The individual conditionals are simple to compute as well. For example,

p(h_i = 1 | v) = σ(v⊤W_{:,i} + c_i).
Together these properties allow for efficient block Gibbs sampling, alternating be-
tween sampling all of h simultaneously and sampling all of v simultaneously.
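A minimal sketch of this block Gibbs procedure follows; the layer sizes, initialization
scale, and number of steps are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy RBM with hypothetical, randomly initialized parameters.
n_v, n_h = 6, 4
W = 0.1 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)  # visible biases
c = np.zeros(n_h)  # hidden biases

def sample_h_given_v(v):
    """All hidden units are conditionally independent given v, so the
    whole layer is sampled in one vectorized step."""
    p = sigmoid(v @ W + c)             # p(h_i = 1 | v)
    return (rng.random(n_h) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(W @ h + b)             # p(v_i = 1 | h)
    return (rng.random(n_v) < p).astype(float)

# Block Gibbs sampling: alternate between the two layers.
v = rng.integers(0, 2, n_v).astype(float)
for step in range(1000):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)
print(v, h)
```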
Since the energy function itself is just a linear function of the parameters, it is easy
to take the needed derivatives. For example,

∂E(v, h)/∂W_{i,j} = −v_i h_j.
These two properties, efficient Gibbs sampling and efficient derivatives, make it
possible to train the RBM with stochastic approximations to ∇_θ log Z.
TODO: do we want to say anything about Sum-Product Networks here? If not here,
where else in the book? Cite James Martens’ new paper on their representational
capacity.