Chapter 19
Approximate inference
Many probabilistic models are difficult to train because it is difficult to perform
inference in them. In the context of deep learning, we usually have a set of visible
variables v and a set of latent variables h. The challenge of inference usually refers
to the difficult problem of computing p(h | v) or taking expectations with respect
to it. Such operations are often necessary for tasks like maximum likelihood
learning.
Many simple graphical models with only one hidden layer, such as restricted
Boltzmann machines and probabilistic PCA, are defined in a way that makes
inference operations like computing p(h | v), or taking expectations with respect to
it, simple. Unfortunately, most graphical models with multiple layers of hidden
variables, such as deep belief networks and deep Boltzmann machines, have
intractable posterior distributions. Exact inference requires an exponential amount
of time in these models. Even some models with only a single layer, such as sparse
coding, have this problem.
Intractable inference problems usually arise from interactions between latent
variables in a structured graphical model. See Fig. 19.1 for some examples. These
interactions may be due to direct interactions in undirected models or “explaining
away” interactions between mutual ancestors of the same visible unit in directed
models.
Figure 19.1: Intractable inference problems are usually the result of interactions between
latent variables in a structured graphical model. These can be due to direct edges,
or due to paths that are activated when the child of a V-structure is observed. Left)
A semi-restricted Boltzmann machine with connections between hidden units. These
direct connections between latent variables make the posterior distribution complicated.
Center) A deep Boltzmann machine, organized into layers of variables without intra-layer
connections, still has an intractable posterior distribution because of the connections
between layers. Right) This directed model has interactions between latent variables
when the visible variables are observed, because every two latent variables are co-parents.
Note that it is still possible to have these graph structures yet have tractable inference.
For example, probabilistic PCA has the graph structure shown on the right, yet still has
simple inference due to special properties of the specific conditional distributions it uses
(linear-Gaussian conditionals with mutually orthogonal basis vectors).
19.1 Inference as Optimization
Many approaches to confronting the problem of difficult inference make use of the
observation that exact inference can be described as an optimization problem.
Specifically, assume we have a probabilistic model consisting of observed vari-
ables v and latent variables h. We would like to compute the log probability of
the observed data, log p(v; θ). Sometimes it is too difficult to compute log p(v; θ)
if it is costly to marginalize out h. Instead, we can compute a lower bound on it.
This bound is called the evidence lower bound (ELBO). Other names for this lower
bound include the negative variational free energy and the negative Helmholtz free
energy. Specifically, this lower bound is defined as
\[ \mathcal{L}(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\big( q(h) \,\|\, p(h \mid v; \theta) \big), \]
where q is an arbitrary probability distribution over h.
It is straightforward to see that this is a lower bound on log p(v):
\begin{align*}
\log p(v) &= \log p(v) + \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} + \sum_h q(h \mid v) \log p(v) \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \left( \log \frac{p(v, h)}{q(h \mid v)} - \log p(v) \right) \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{p(v)\, q(h \mid v)} \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(h \mid v)}{q(h \mid v)} \\
&= \mathcal{L}(q) + D_{\mathrm{KL}}\big( q \,\|\, p(h \mid v) \big).
\end{align*}
Because the difference between log p(v) and L(v, θ, q) is given by the KL divergence,
and because the KL divergence is always non-negative, we can see that L always has
at most the same value as the desired log probability, and is equal to it if and
only if q is the same distribution as p(h | v).
Surprisingly, L can be considerably easier to compute for some distributions
q. Simple algebra shows that we can rearrange L into a much more convenient
form:
\begin{align*}
\mathcal{L}(v, \theta, q) &= \log p(v; \theta) - D_{\mathrm{KL}}\big( q(h) \,\|\, p(h \mid v; \theta) \big) \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q}\big[ \log q(h) - \log p(h \mid v) \big] \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q}\left[ \log q(h) - \log \frac{p(h, v)}{p(v)} \right] \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q}\big[ \log q(h) - \log p(h, v) + \log p(v) \big] \\
&= -\mathbb{E}_{h \sim q}\big[ \log q(h) - \log p(h, v) \big].
\end{align*}
This yields the more canonical definition of the evidence lower bound,
\[ \mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}\big[ \log p(h, v) \big] + H(q). \tag{19.1} \]
The first term of L is known as the energy term. The second term is known
as the entropy term. For an appropriate choice of q, both terms can be easy to
compute. The only question is how close to p(h | v) the distribution q will be.
This determines how good an approximation of log p(v) the bound L will be.
We can thus think of inference as the procedure for finding the q that max-
imizes L. Exact inference maximizes L perfectly. Throughout this chapter, we
will show how many forms of approximate inference are possible. No matter what
choice of q we use, L gives us a lower bound on the log-likelihood. We can get
tighter or looser bounds that are cheaper or more expensive to compute depending
on how we choose to approach this optimization problem. We can obtain a poorly
matched q but reduce the computational cost by using an imperfect optimization
procedure, or by using a perfect optimization procedure over a restricted family
of q distributions.
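To make the bound concrete, here is a minimal numerical sketch (an illustration added for this section, not part of the formal development; the toy model, its parameters, and the function names are invented for the example). It evaluates L(v, θ, q) = E_{h∼q}[log p(h, v)] + H(q) for a model with a single discrete latent variable, and checks that the bound never exceeds log p(v) and becomes tight when q equals the exact posterior p(h | v).

```python
import numpy as np

# Toy model: h takes one of K discrete values, v is a noisy scalar observation.
# p(h) is a categorical prior, p(v | h) = N(v; mu[h], 1).
K = 3
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
mu = np.array([-2.0, 0.0, 3.0])

def log_joint(v):
    """log p(h, v) for every value of h (vector of length K)."""
    log_lik = -0.5 * (v - mu) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_prior + log_lik

def elbo(v, q):
    """L(v, q) = E_{h~q}[log p(h, v)] + H(q)."""
    eps = 1e-12  # avoid log(0) for zero-probability states
    return np.sum(q * log_joint(v)) - np.sum(q * np.log(q + eps))

v = 1.0
log_pv = np.logaddexp.reduce(log_joint(v))       # exact log p(v)
posterior = np.exp(log_joint(v) - log_pv)        # exact p(h | v)

q_uniform = np.ones(K) / K
print(elbo(v, q_uniform), "<=", log_pv)          # a strict lower bound
print(elbo(v, posterior), "==", log_pv)          # tight when q = p(h | v)
```

Running the sketch, the uniform q gives a strictly smaller value than log p(v), while plugging in the exact posterior recovers log p(v) up to numerical precision.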
19.2 Expectation Maximization
Expectation maximization (EM) is a popular training algorithm for models with
latent variables. It consists of alternating between two steps until convergence:
The E-step (expectation step): Set q(h^{(i)}) = p(h^{(i)} | v^{(i)}; θ) for each index i
of the training examples v^{(i)} we want to train on (both batch and minibatch
variants are valid). By this we mean q is defined in terms of the current
value of θ; if we later vary θ, then p(h | v; θ) will change but q(h) will not.
The M-step (maximization step): Completely or partially maximize
\[ \sum_i \mathcal{L}\big( v^{(i)}, \theta, q \big) \]
with respect to θ using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one
step, we maximize L with respect to q, and on the other, we maximize L with
respect to θ.
Stochastic gradient ascent on latent variable models can be seen as a special
case of the EM algorithm where the M step consists of taking a single gradient
step. Other variants of the EM algorithm can make much larger steps. For some
model families, the M step can even be performed analytically, jumping all the
way to the optimal solution given the current q.
Even though the E-step involves exact inference, we can think of the EM
algorithm as using approximate inference in some sense. Specifically, the M-step
assumes that the same value of q can be used for all values of θ. This will introduce
a gap between L and the true log p(v) as the M-step moves θ further and further
from the value for which q was computed. Fortunately, the E-step reduces the gap
to zero again the next time we go through the loop.
The EM algorithm is a workhorse of classical machine learning, and it is used in
deep learning in the sense that stochastic gradient ascent on latent variable models
can be seen as EM with a very simple and small M-step. However, because L
cannot be maximized analytically for many interesting deep models, the more
general EM framework as a whole is typically not explored in the deep learning
research community.
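As a concrete illustration of the alternation, the sketch below runs EM on a mixture of two Gaussians (an example chosen only because both steps have closed forms; the synthetic data, the number of components, and all variable names are arbitrary choices for this illustration). The E-step sets q to the exact posterior over the component identity for each example, and the M-step maximizes Σ_i L(v^{(i)}, θ, q) with respect to θ analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussian components.
v = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

K = 2
pi = np.ones(K) / K                 # mixing proportions
mu = rng.normal(size=K)             # component means
var = np.ones(K)                    # component variances

def log_normal(x, m, s2):
    return -0.5 * ((x - m) ** 2 / s2 + np.log(2 * np.pi * s2))

for step in range(100):
    # E-step: q(h = k | v_i) = p(h = k | v_i; theta), the exact posterior.
    log_r = np.log(pi) + log_normal(v[:, None], mu, var)
    log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)               # responsibilities, shape (N, K)

    # M-step: maximize sum_i L(v_i, theta, q) over theta in closed form.
    Nk = r.sum(axis=0)
    pi = Nk / len(v)
    mu = (r * v[:, None]).sum(axis=0) / Nk
    var = (r * (v[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)
```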
19.3 MAP Inference: Sparse Coding as a Probabilistic Model
Many versions of sparse coding can be cast as probabilistic models. For example,
suppose we encode visible data v ∈ R^n with latent variables h ∈ R^m. We
can use a prior to encourage our latent code variables to be sparse, such as a
factorial Laplace prior:
\[ p(h_i) = \frac{\lambda}{2} e^{-\lambda |h_i|}. \]
We can define the visible units to be Gaussian with an affine transformation from
the code to the mean of the Gaussian:
\[ v \sim \mathcal{N}\!\left( v \mid \mu + W h, \; \beta^{-1} \right), \]
where β is a diagonal precision matrix to maintain tractability.
Computing p(h | v) is difficult. The Laplace prior is not conjugate to the Gaussian
likelihood, so the posterior over h does not take any simple parametric form, and
because every pair of hidden units are co-parents of each visible unit, observing v
couples all of the elements of h through explaining away.
One operation that we can do is perform maximum a posteriori (MAP) infer-
ence, which means solving the following optimization problem:
\[ h^* = \arg\max_h p(h \mid v). \]
This yields the familiar sparse coding optimization problem
\[ h^* = \arg\min_h \; \lambda \| h \|_1 + \frac{1}{2} (v - \mu - W h)^\top \beta \, (v - \mu - W h), \]
where the L1 penalty arises from the negative log of the Laplace prior and the
weighted squared error arises from the negative log of the Gaussian likelihood.
This shows that the popular feature extraction strategy for sparse coding can
be justified as having a probabilistic interpretation: it may be MAP inference in
this probabilistic model (there are other probabilistic models that yield the same
optimization problem, so we cannot positively identify this specific model from
the feature extraction process).
Excitingly, MAP inference of h given v also has an interpretation in terms
of maximizing the evidence lower bound. Specifically, MAP inference maximizes
L with respect to q under the constraint that q take the form of a Dirac
distribution. During learning of sparse coding, we alternate between using convex
optimization to extract the codes, and using convex optimization to update W
to achieve the optimal reconstruction given the codes. This turns out to be
equivalent to maximizing L with respect to θ for the q that was obtained from
MAP inference. The learning algorithm can be thought of as EM restricted to
using a Dirac posterior. In other words, rather than performing learning exactly
using standard inference, we learn to maximize a bound on the true likelihood,
using exact MAP inference.
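The following sketch carries out this MAP inference numerically (an illustration only, assuming µ = 0 and β = I for simplicity; the choice of ISTA as the solver, the value of λ, and the step-size rule are not prescribed by the model). It minimizes λ‖h‖₁ + ½‖v − Wh‖² by proximal gradient descent, alternating a gradient step on the smooth reconstruction term with soft thresholding for the L1 penalty.

```python
import numpy as np

def sparse_code_map(v, W, lam=0.1, n_steps=200):
    """MAP inference h* = argmin_h lam*||h||_1 + 0.5*||v - W h||^2.

    For simplicity this assumes mu = 0 and beta = I; the general case only
    rescales the residual by beta and shifts v by mu.
    """
    m = W.shape[1]
    h = np.zeros(m)
    # Step size: 1 / Lipschitz constant of the gradient of the smooth term.
    step = 1.0 / np.linalg.norm(W, 2) ** 2
    for _ in range(n_steps):
        grad = W.T @ (W @ h - v)                   # gradient of 0.5*||v - W h||^2
        z = h - step * grad                        # gradient step
        h = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))
h_true = np.zeros(50)
h_true[[3, 17, 31]] = [1.0, -2.0, 0.5]
v = W @ h_true + 0.01 * rng.normal(size=20)
print(np.nonzero(sparse_code_map(v, W))[0])        # indices of active code units
```

Other solvers from the sparse coding literature, such as feature-sign search or FISTA, optimize the same objective and can be substituted for the simple loop above.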
19.4 Variational Inference and Learning
One common difficulty in probabilistic modeling is that the posterior distribution
p(h | v) is infeasible to compute for many models with hidden variables h and
visible variables v. Expectations with respect to this distribution may also be
intractable.
Consider as an example the binary sparse coding model. In this model, the
input v ∈ R^n is formed by adding Gaussian noise to the sum of m different
components, each of which can be present or absent. Each component is switched
on or off by the corresponding hidden unit in h ∈ {0, 1}^m:
\[ p(h_i = 1) = \sigma(b_i), \]
\[ p(v \mid h) = \mathcal{N}\!\left( v \mid W h, \; \beta^{-1} \right), \]
where b is a learnable set of biases, W is a learnable weight matrix, and β
is a learnable, diagonal precision matrix.
Training this model with maximum likelihood requires taking the derivative
with respect to the parameters. Consider the derivative with respect to one of
the biases:
\begin{align*}
\frac{\partial}{\partial b_i} \log p(v)
&= \frac{\frac{\partial}{\partial b_i} p(v)}{p(v)} \\
&= \frac{\frac{\partial}{\partial b_i} \sum_h p(h, v)}{p(v)} \\
&= \frac{\frac{\partial}{\partial b_i} \sum_h p(h) p(v \mid h)}{p(v)} \\
&= \frac{\sum_h p(v \mid h) \frac{\partial}{\partial b_i} p(h)}{p(v)} \\
&= \sum_h p(h \mid v) \frac{\frac{\partial}{\partial b_i} p(h)}{p(h)} \\
&= \mathbb{E}_{h \sim p(h \mid v)} \frac{\partial}{\partial b_i} \log p(h).
\end{align*}
This requires computing expectations with respect to p(h | v). Unfortunately,
p(h | v) is a complicated distribution. See Fig. 19.2 for the graph structure of
p(h, v) and p(h | v). The posterior distribution corresponds to the complete
graph over the hidden units, so variable elimination algorithms do not help us to
compute the required expectations any faster than brute force.
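To see the cost concretely, the brute-force computation below (a sketch written for this section; the dimensions and parameter values are arbitrary) evaluates E_{h∼p(h|v)}[∂/∂b_i log p(h)] by enumerating all 2^m configurations of h. Even at m = 10 this means summing over 1,024 posterior terms, and the cost doubles with every additional hidden unit.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bias_gradient_bruteforce(v, W, b, beta):
    """E_{h ~ p(h|v)}[d/db log p(h)] for binary sparse coding, by enumeration."""
    n, m = W.shape
    hs = np.array(list(itertools.product([0, 1], repeat=m)), dtype=float)  # 2^m rows
    # log p(h) = sum_i h_i log sigma(b_i) + (1 - h_i) log sigma(-b_i)
    log_ph = hs @ np.log(sigmoid(b)) + (1 - hs) @ np.log(sigmoid(-b))
    # log p(v | h) = log N(v; W h, beta^{-1}) with diagonal precision beta
    resid = v - hs @ W.T
    log_pvh = 0.5 * np.sum(np.log(beta / (2 * np.pi))) - 0.5 * (resid ** 2) @ beta
    log_joint = log_ph + log_pvh
    post = np.exp(log_joint - np.logaddexp.reduce(log_joint))   # p(h | v)
    # d/db_i log p(h) = h_i - sigma(b_i)
    return post @ (hs - sigmoid(b))

rng = np.random.default_rng(0)
n, m = 5, 10                     # already 2^10 = 1024 posterior terms
W = rng.normal(size=(n, m))
b = rng.normal(size=m)
beta = np.ones(n)
v = rng.normal(size=n)
print(bias_gradient_bruteforce(v, W, b, beta))
```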
One solution to this problem is to use variational methods. Variational meth-
ods involve using a simple distribution q(h) to approximate the true, complicated
posterior p(h | v). The name “variational” derives from their frequent use of a
branch of mathematics called calculus of variations. However, not all variational
methods use calculus of variations.
Variational inference consists of maximizing the bound L rather than the intractable
log-likelihood itself, and it usually does so over a restricted family of distributions q,
chosen so that the expectations defining L are easy to compute. A typical restriction
is to require that q factorize over the hidden units, as in the mean field approach
sketched in the next section.
19.4.1 Discrete Latent Variables
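For the binary sparse coding model defined above, a natural variational family is the set of factorial distributions q(h) = ∏_i ĥ_i^{h_i} (1 − ĥ_i)^{1 − h_i}, parametrized by a vector ĥ giving the probability that each unit is active. Setting ∂L/∂ĥ_i = 0 yields a fixed-point condition in which each ĥ_i is a logistic function of the input and of the current values of the other units. The code below is a sketch of iterating that update (an illustration assuming this factorial family; the damping coefficient, sweep count, and test data are arbitrary choices made for the example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_bsc(v, W, b, beta, n_sweeps=50, damping=0.5):
    """Factorial (mean field) q(h) for binary sparse coding.

    Iterates the fixed-point condition obtained by setting dL/dh_hat_i = 0:
      h_hat_i = sigma(b_i + v' beta W_i - 0.5 W_i' beta W_i
                      - sum_{j != i} W_i' beta W_j h_hat_j)
    """
    m = W.shape[1]
    G = W.T @ (beta[:, None] * W)            # G[i, j] = W_i' beta W_j
    drive = b + W.T @ (beta * v) - 0.5 * np.diag(G)
    h_hat = np.full(m, 0.5)
    for _ in range(n_sweeps):
        for i in range(m):                   # cyclic coordinate updates
            interaction = G[i] @ h_hat - G[i, i] * h_hat[i]   # sum over j != i
            new = sigmoid(drive[i] - interaction)
            h_hat[i] = damping * h_hat[i] + (1 - damping) * new
    return h_hat

rng = np.random.default_rng(0)
n, m = 5, 10
W = rng.normal(size=(n, m))
b = -1.0 * np.ones(m)                        # biases favoring inactive units
beta = np.ones(n)
v = W[:, [2, 7]].sum(axis=1) + 0.05 * rng.normal(size=n)
print(np.round(mean_field_bsc(v, W, b, beta), 2))
```

Each sweep costs only O(m^2) operations, in contrast to the exponential cost of the exact posterior expectations above.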
Figure 19.2: The graph structure of a binary sparse coding model with four hidden units.
Left) The graph structure of p(h, v). Note that the edges are directed, and that every
two hidden units are co-parents of every visible unit. Right) The graph structure of p(h | v).
In order to account for the active paths between co-parents, the posterior distribution
needs an edge between all of the hidden units.
19.4.2 Calculus of Variations
Many machine learning techniques are based on minimizing a function J(θ) by
finding the input vector θ ∈ R^n for which it takes on its minimal value. This can
be accomplished with multivariate calculus and linear algebra, by solving for the
critical points where ∇_θ J(θ) = 0. In some cases, we actually want to solve for a
function f(x), such as when we want to find the probability density function over
some random variable. This is what calculus of variations enables us to do.
A function of a function f is known as a functional J[f]. Much as we can
take partial derivatives of a function with respect to elements of its vector-valued
argument, we can take functional derivatives, also known as variational derivatives,
of a functional J[f] with respect to individual values of the function
f(x). The functional derivative of the functional J with respect to the value of
the function f at point x is denoted δ/δf(x) J.
A complete formal development of functional derivatives is beyond the scope
of this book. For our purposes, it is sufficient to state that for differentiable
functions f(x) and differentiable functions g(y, x) with continuous derivatives,
that
\[ \frac{\delta}{\delta f(x)} \int g\big( f(x), x \big)\, dx = \frac{\partial}{\partial y} g\big( f(x), x \big). \tag{19.2} \]
To gain some intuition for this identity, one can think of f(x) as being a vector
with uncountably many elements, indexed by a real vector x. In this (somewhat
incomplete) view, the identity providing the functional derivatives is the same as
we would obtain for a vector θ ∈ R^n indexed by positive integers:
\[ \frac{\partial}{\partial \theta_i} \sum_j g(\theta_j, j) = \frac{\partial}{\partial \theta_i} g(\theta_i, i). \]
Many results in other machine learning publications are presented using the more
general Euler-Lagrange equation which allows g to depend on the derivatives of f
as well as the value of f, but we do not need this fully general form for the results
presented in this book.
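A quick numerical check of Eq. 19.2 can help build intuition (a sketch constructed for this explanation; the particular g, f, and grid are arbitrary). It discretizes f into a vector, perturbs one entry, and compares the resulting finite-difference estimate of the functional derivative with ∂/∂y g(f(x), x) evaluated at that point.

```python
import numpy as np

# J[f] = integral of g(f(x), x) dx with g(y, x) = x * y**2,
# so Eq. 19.2 predicts delta J / delta f(x) = 2 * x * f(x).
xs = np.linspace(0.0, 1.0, 1001)
dx = xs[1] - xs[0]
f = np.sin(2 * np.pi * xs)

def J(f_vals):
    return np.sum(xs * f_vals ** 2) * dx      # discretized functional

i = 400                                       # check the derivative at x = xs[i]
eps = 1e-6
f_pert = f.copy()
f_pert[i] += eps
# Perturbing f on one grid cell changes J by (delta J / delta f(x_i)) * eps * dx.
finite_diff = (J(f_pert) - J(f)) / (eps * dx)
predicted = 2 * xs[i] * f[i]
print(finite_diff, predicted)                 # the two values should agree closely
```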
To optimize a function with respect to a vector, we take the gradient of the
function with respect to the vector and solve for the point where every element
of the gradient is equal to zero. Likewise, we can optimize a functional by solving
for the function where the functional derivative at every point is equal to zero.
As an example of how this process works, consider the problem of finding the
probability distribution function over x ∈ R that has maximal Shannon entropy.
Recall that the entropy of a probability distribution p(x) is defined as
\[ H[p] = -\mathbb{E}_x \log p(x). \]
For continuous values, the expectation is an integral:
\[ H[p] = -\int p(x) \log p(x)\, dx. \]
We cannot simply maximize H[p] with respect to the function p(x), because
the result might not be a probability distribution. Instead, we need to use Lagrange
multipliers to add a constraint that p(x) integrate to 1. Also, the entropy
increases without bound as the variance increases, so we can only search for the
distribution with maximal entropy for fixed variance σ². Finally, the problem
is underdetermined because the distribution can be shifted arbitrarily without
changing the entropy. To impose a unique solution, we add a constraint that the
mean of the distribution be µ. The Lagrangian functional for this optimization
problem is
\begin{align*}
\mathcal{L}[p] &= \lambda_1 \left( \int p(x)\, dx - 1 \right) + \lambda_2 \big( \mathbb{E}[x] - \mu \big) + \lambda_3 \big( \mathbb{E}[(x - \mu)^2] - \sigma^2 \big) + H[p] \\
&= \int \Big( \lambda_1 p(x) + \lambda_2 p(x) x + \lambda_3 p(x) (x - \mu)^2 - p(x) \log p(x) \Big)\, dx - \lambda_1 - \mu \lambda_2 - \sigma^2 \lambda_3 .
\end{align*}
To minimize the Lagrangian with respect to p, we set the functional derivatives
equal to 0:
\[ \forall x, \quad \frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0. \]
This condition now tells us the functional form of p(x). By algebraically
rearranging the equation, we obtain
\[ p(x) = \exp\!\left( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \right). \]
We never assumed directly that p(x) would take this functional form; we
obtained the expression itself by analytically minimizing a functional. To finish
the minimization problem, we must choose the λ values to ensure that all of our
constraints are satisfied. We are free to choose any λ values, because the gradient
of the Lagrangian with respect to the λ variables is zero so long as the constraints
are satisfied. To satisfy all of the constraints, we may set λ_1 = 1 - log(σ√(2π)),
λ_2 = 0, and λ_3 = -1/(2σ²) to obtain
\[ p(x) = \mathcal{N}(x \mid \mu, \sigma^2). \]
This is one reason for using the normal distribution when we do not know the
true distribution. Because the normal distribution has the maximum entropy, we
impose the least possible amount of structure by making this assumption.
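A small numerical check of this maximum entropy property (an illustration added here; the grid and the particular comparison distributions are arbitrary choices) computes the differential entropy -∫ p(x) log p(x) dx on a dense grid for three distributions that share the same mean and variance. The Gaussian comes out with the largest entropy, followed by the Laplace and then the uniform distribution.

```python
import numpy as np

mu, sigma = 0.0, 1.0
xs = np.linspace(-20, 20, 200001)
dx = xs[1] - xs[0]

def entropy(p):
    """Differential entropy -integral p log p dx, computed on the grid."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

# Three densities with mean mu and variance sigma^2.
gauss = np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
b = sigma / np.sqrt(2)                               # Laplace scale for variance sigma^2
laplace = np.exp(-np.abs(xs - mu) / b) / (2 * b)
w = sigma * np.sqrt(12)                              # uniform width for variance sigma^2
uniform = np.where(np.abs(xs - mu) <= w / 2, 1.0 / w, 0.0)

for name, p in [("gaussian", gauss), ("laplace", laplace), ("uniform", uniform)]:
    print(name, entropy(p))                          # gaussian has the largest entropy
```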
What about the probability distribution function that minimizes the entropy?
It turns out that there is no specific function that achieves minimal entropy. As
functions place more mass on x = µ±σ and less on all other values of x, they lose
entropy. However, any function placing exactly zero mass on all but two points
does not integrate to one, and is not a valid probability distribution. There thus
is no single minimal entropy probability distribution function, much as there is
no single minimal positive real number.
19.4.3 Continuous Latent Variables
19.5 Stochastic Inference
19.6 Learned Approximate Inference
In Sec. 18.2 we saw that one possible explanation for the role of dream
sleep in human beings and animals is that dreams could provide the negative
phase samples that Monte Carlo training algorithms use to approximate the neg-
ative gradient of the log partition function of undirected models. Another possible
explanation for biological dreaming is that it is providing samples from p(h, v)
which can be used to train an inference network to predict h given v. In some
senses, this explanation is more satisfying than the partition function explanation.
Monte Carlo algorithms generally do not perform well if they are run using only
the positive phase of the gradient for several steps then with only the negative
phase of the gradient for several steps. Human beings and animals are usually
awake for several consecutive hours then asleep for several consecutive hours, and
it is not readily apparent how this schedule could support Monte Carlo training
of an undirected model. Learning algorithms based on maximizing L can be run
with prolonged periods of improving q and prolonged periods of improving θ,
however. If the role of biological dreaming is to train networks for predicting q,
then this explains how animals are able to remain awake for several hours (the
longer they are awake, the greater the gap between L and log p(v), but L will re-
main a lower bound) and to remain asleep for several hours (the generative model
itself is not modified during sleep) without damaging their internal models. Of
course, these ideas are purely speculative, and there is no hard evidence to sug-
gest that dreaming accomplishes either of these goals. Dreaming may also serve
reinforcement learning rather than probabilistic modeling, by sampling synthetic
experiences from the animal’s transition model, on which to train the animal’s
policy. Or sleep may serve some other purpose not yet anticipated by the machine
learning community.