Chapter 20
Approximate Inference
Many probabilistic models are difficult to train because it is difficult to perform
inference in them. In the context of deep learning, we usually have a set of visible
variables v and a set of latent variables h. The challenge of inference usually refers to
the difficult problem of computing p(h | v) or taking expectations with respect to it.
Such operations are often necessary for tasks like maximum likelihood learning.
Many simple graphical models with only one hidden layer, such as restricted Boltzmann machines and probabilistic PCA, are defined in a way that makes inference operations like computing p(h | v), or taking expectations with respect to it, simple. Unfortunately, most graphical models with multiple layers of hidden variables, such as deep belief networks and deep Boltzmann machines, have intractable posterior distributions. Exact inference requires an exponential amount of time in these models. Even some models with only a single layer, such as sparse coding, have this problem.
Intractable inference problems usually arise from interactions between latent vari-
ables in a structured graphical model. See Fig. 20.1 for some examples. These inter-
actions may be due to direct interactions in undirected models or “explaining away”
interactions between mutual ancestors of the same visible unit in directed models.
20.1 Inference as Optimization
Many approaches to confronting the problem of difficult inference make use of the ob-
servation that exact inference can be described as an optimization problem.
Specifically, assume we have a probabilistic model consisting of observed variables v
and latent variables h. We would like to compute the log probability of the observed
data, log p(v; θ). It is often too difficult to compute log p(v; θ) because it is costly to marginalize out h. Instead, we can compute a lower bound on it. This bound is called
the evidence lower bound (ELBO). Other names for this lower bound include the negative
variational free energy and the negative Helmholtz free energy. Specifically, this lower bound is defined as
$$\mathcal{L}(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\!\left( q(h \mid v) \,\Vert\, p(h \mid v; \theta) \right),$$
where q(h | v) is an arbitrary probability distribution over h.
Figure 20.1: Intractable inference problems are usually the result of interactions between latent variables in a structured graphical model. These can be due to direct edges, or due to paths that are activated when the child of a V-structure is observed. Left) A semi-restricted Boltzmann machine with connections between hidden units. These direct connections between latent variables make the posterior distribution complicated. Center) A deep Boltzmann machine, organized into layers of variables without intra-layer connections, still has an intractable posterior distribution because of the connections between layers. Right) This directed model has interactions between latent variables when the visible variables are observed, because every two latent variables are co-parents of the same visible unit. Note that it is still possible to have these graph structures yet have tractable inference. For example, probabilistic PCA has the graph structure shown on the right, yet has simple inference due to special properties of the specific conditional distributions it uses (linear-Gaussian conditionals with mutually orthogonal basis vectors).
It is straightforward to see that this is a lower bound on log p(v):
$$
\begin{aligned}
\log p(v) &= \log p(v) + \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} + \sum_h q(h \mid v) \log p(v) \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \left( \log \frac{p(v, h)}{q(h \mid v)} - \log p(v) \right) \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{p(v)\, q(h \mid v)} \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(h \mid v)}{q(h \mid v)} \\
&= \mathcal{L}(v, \theta, q) + D_{\mathrm{KL}}\!\left( q(h \mid v) \,\Vert\, p(h \mid v) \right).
\end{aligned}
$$
The second equality uses the fact that ∑_h q(h | v) = 1 to rewrite log p(v) as an expectation under q.
Because the difference between log p(v) and L(v, θ, q) is given by the KL divergence, and because the KL divergence is always non-negative, we can see that L always has at most the same value as the desired log probability, and is equal to it if and only if q is the same distribution as p(h | v).
Surprisingly, L can be considerably easier to compute for some distributions q. Sim-
ple algebra shows that we can rearrange L into a much more convenient form:
$$
\begin{aligned}
\mathcal{L}(v, \theta, q) &= \log p(v; \theta) - D_{\mathrm{KL}}\!\left( q(h \mid v) \,\Vert\, p(h \mid v; \theta) \right) \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log p(h \mid v) \right] \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log \frac{p(h, v)}{p(v)} \right] \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log p(h, v) + \log p(v) \right] \\
&= -\mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log p(h, v) \right].
\end{aligned}
$$
This yields the more canonical definition of the evidence lower bound,
$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q} \left[ \log p(h, v) \right] + H(q). \qquad (20.1)$$
The first term of L is known as the energy term. The second term is known as the
entropy term. For an appropriate choice of q, both terms can be easy to compute. The
only question is how close to p(h | v) the distribution q will be. This determines how good an approximation L is to log p(v).
We can thus think of inference as the procedure for finding the q that maximizes
L. Exact inference maximizes L perfectly. Throughout this chapter, we will show how
many forms of approximate inference are possible. No matter what choice of q we use,
L will give us a lower bound on the likelihood. We can get tighter or looser bounds that
are cheaper or more expensive to compute depending on how we choose to approach this
optimization problem. We can obtain a poorly matched q but reduce the computation
cost by using an imperfect optimization procedure, or by using a perfect optimization
procedure over a restricted family of q distributions.
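To make the bound concrete, here is a minimal numerical sketch. It builds a small, entirely hypothetical discrete joint distribution p(h, v), evaluates L(v, θ, q) = E_{h∼q}[log p(h, v)] + H(q) for an arbitrary q, and checks that the bound is tight when q equals the exact posterior p(h | v).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(h, v) over a discrete latent h (4 states)
# and a discrete observation v (3 states). Rows index h, columns index v.
p_hv = rng.random((4, 3))
p_hv /= p_hv.sum()

v = 1                              # the observed value of v
p_v = p_hv[:, v].sum()             # p(v), by marginalizing out h
p_h_given_v = p_hv[:, v] / p_v     # exact posterior p(h | v)

def elbo(q):
    """L(v, q) = E_{h~q}[log p(h, v)] + H(q), with 0 log 0 treated as 0."""
    joint = p_hv[:, v]
    nz = q > 0
    energy = np.sum(q[nz] * np.log(joint[nz]))
    entropy = -np.sum(q[nz] * np.log(q[nz]))
    return energy + entropy

# An arbitrary q gives a strict lower bound on log p(v) ...
q_bad = np.full(4, 0.25)
print(elbo(q_bad), "<=", np.log(p_v))

# ... and the bound is tight when q equals the true posterior.
print(np.isclose(elbo(p_h_given_v), np.log(p_v)))
```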
20.2 Expectation Maximization
Expectation maximization (EM) is a popular training algorithm for models with latent
variables. It consists of alternating between two steps until convergence:
The E-step (Expectation step): Set q(h^{(i)} | v^{(i)}) = p(h^{(i)} | v^{(i)}; θ) for all indices i of the training examples v^{(i)} we want to train on (both batch and minibatch variants are valid). By this we mean q is defined in terms of the current value of θ; if we vary θ afterwards, then p(h | v; θ) will change but q(h | v) will not.
The M-step (Maximization step): Completely or partially maximize ∑_i L(v^{(i)}, θ, q) with respect to θ, using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one step,
we maximize L with respect to q, and on the other, we maximize L with respect to θ.
Stochastic gradient ascent on latent variable models can be seen as a special case
of the EM algorithm where the M step consists of taking a single gradient step. Other
variants of the EM algorithm can make much larger steps. For some model families, the
M step can even be performed analytically, jumping all the way to the optimal solution
given the current q.
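The following sketch illustrates this coordinate ascent view on a model where the M-step is analytic, a standard two-component Gaussian mixture (not a model discussed in this chapter); the data and the initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from two Gaussians.
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

# Initial parameters theta = (mixing weights, means, variances).
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for step in range(50):
    # E-step: q(h | x) = p(h | x; theta), the exact posterior responsibilities.
    logp = (np.log(pi)
            - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize sum_i L(x_i, theta, q) over theta, analytically.
    n_k = q.sum(axis=0)
    pi = n_k / len(x)
    mu = (q * x[:, None]).sum(axis=0) / n_k
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

print(pi, mu, var)
```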
Even though the E-step involves exact inference, we can think of the EM algorithm as using approximate inference in some sense. Specifically, the M-step assumes that the same value of q can be used for all values of θ. This introduces a gap between L and the true log p(v) as the M-step moves θ further from the value used to construct q. Fortunately, the E-step reduces the gap to zero again at the start of the next iteration.
The EM algorithm is a workhorse of classical machine learning, and it can be considered present in deep learning in the sense that stochastic gradient ascent can be seen as EM with a very simple and small M-step. However, because L cannot be analytically maximized for many interesting deep models, the more general EM framework as a whole is typically not explored in the deep learning research community.
20.3 MAP Inference: Sparse Coding as a Probabilistic
Model
Many versions of sparse coding can be cast as probabilistic models. For example, suppose we encode visible data v ∈ ℝ^n with latent variables h ∈ ℝ^m. We can use a prior that encourages our latent code variables to be sparse, such as a factorial Laplace prior:
$$p(h) = \prod_i \frac{\lambda}{2} e^{-\lambda |h_i|}.$$
We can define the visible units to be Gaussian with an affine transformation from the code to the mean of the Gaussian:
$$v \sim \mathcal{N}(v \mid \mu + W h, \beta^{-1}),$$
where β is a diagonal precision matrix, to maintain tractability.
Computing p(h | v) is difficult: the Laplace prior is not conjugate to the Gaussian likelihood, so the posterior over h has no simple closed form, and the expectations with respect to it that maximum likelihood learning requires are intractable.
One operation that we can do is perform maximum a posteriori (MAP) inference, which means solving the following optimization problem:
$$h^* = \arg\max_h p(h \mid v).$$
This yields the familiar sparse coding optimization problem
$$h^* = \arg\min_h \left[ \lambda \| h \|_1 + \frac{1}{2} (v - \mu - W h)^\top \beta\, (v - \mu - W h) \right],$$
which can be solved with gradient-based methods or with specialized algorithms such as feature-sign search or ISTA.
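As a rough illustration, the following sketch performs this MAP inference with ISTA; the parameter values are hypothetical placeholders, and the fixed step size is one simple valid choice rather than the only one.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_map(v, W, mu, beta, lam, n_steps=500):
    """MAP inference h* = argmin_h 0.5*(v - mu - W h)^T diag(beta) (v - mu - W h)
    + lam * ||h||_1 for the Laplace-prior sparse coding model, via ISTA."""
    h = np.zeros(W.shape[1])
    B = np.diag(beta)
    # A conservative fixed step size: 1 / L, where L bounds the curvature
    # of the smooth quadratic term.
    step = 1.0 / np.linalg.norm(W.T @ B @ W, 2)
    for _ in range(n_steps):
        grad = -W.T @ B @ (v - mu - W @ h)   # gradient of the smooth term
        h = soft_threshold(h - step * grad, step * lam)
    return h

# Hypothetical parameters, only to show the call signature.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))
mu, beta, lam = np.zeros(20), np.ones(20), 0.5
v = rng.normal(size=20)
print(sparse_code_map(v, W, mu, beta, lam)[:10])
```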
This shows that the popular feature extraction strategy for sparse coding can be justified as having a probabilistic interpretation: it may be MAP inference in this probabilistic model (there are other probabilistic models that yield the same optimization problem, so we cannot positively identify this specific model from the feature extraction process).
Excitingly, MAP inference of h given v also has an interpretation in terms of maxi-
mizing the evidence lower bound. Specifically, MAP inference maximizes L with respect
to q under the constraint that q take the form of a Dirac distribution. During
learning of sparse coding, we alternate between using convex optimization to extract
the codes, and using convex optimization to update W to achieve the optimal recon-
struction given the codes. This turns out to be equivalent to maximizing L with respect
to θ for the q that was obtained from MAP inference. The learning algorithm can be
thought of as EM restricted to using a Dirac posterior. In other words, rather than
performing learning exactly using standard inference, we learn to maximize a bound on
the true likelihood, using exact MAP inference.
20.4 Variational Inference and Learning
One common difficulty in probabilistic modeling is that the posterior distribution p(h | v)
is infeasible to compute for many models with hidden variables h and visible variables
v. Expectations with respect to this distribution may also be intractable.
Consider as an example the binary sparse coding model. In this model, the input v ∈ ℝ^n is formed by adding Gaussian noise to the sum of m different components, each of which can be present or absent. Each component is switched on or off by the corresponding hidden unit in h ∈ {0, 1}^m:
$$p(h_i = 1) = \sigma(b_i),$$
$$p(v \mid h) = \mathcal{N}(v \mid W h, \beta^{-1}),$$
where b is a learnable set of biases, W is a learnable weight matrix, and β is a learnable diagonal precision matrix.
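A short sketch of ancestral sampling from this generative model, with made-up parameter values, may help make the definition concrete (σ denotes the logistic sigmoid):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 10, 6                       # visible and hidden dimensions
b = rng.normal(size=m)             # learnable biases
W = rng.normal(size=(n, m))        # learnable weight matrix
beta = np.ones(n) * 4.0            # learnable diagonal precision

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Ancestral sampling from p(h, v) = p(h) p(v | h).
h = (rng.random(m) < sigmoid(b)).astype(float)   # h_i ~ Bernoulli(sigma(b_i))
v = rng.normal(W @ h, np.sqrt(1.0 / beta))       # v ~ N(W h, beta^{-1})
print(h, v)
```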
Training this model with maximum likelihood requires taking the derivative with
respect to the parameters. Consider the derivative with respect to one of the biases:
$$
\begin{aligned}
\frac{\partial}{\partial b_i} \log p(v)
&= \frac{\frac{\partial}{\partial b_i} p(v)}{p(v)} \\
&= \frac{\frac{\partial}{\partial b_i} \sum_h p(h, v)}{p(v)} \\
&= \frac{\frac{\partial}{\partial b_i} \sum_h p(h) p(v \mid h)}{p(v)} \\
&= \frac{\sum_h p(v \mid h) \frac{\partial}{\partial b_i} p(h)}{p(v)} \\
&= \sum_h p(h \mid v) \frac{\frac{\partial}{\partial b_i} p(h)}{p(h)} \\
&= \mathbb{E}_{h \sim p(h \mid v)} \frac{\partial}{\partial b_i} \log p(h).
\end{aligned}
$$
This requires computing expectations with respect to p(h | v). Unfortunately, p(h |
v) is a complicated distribution. See Fig. 20.2 for the graph structure of p(h, v) and p(h |
v). The posterior distribution corresponds to the complete graph over the hidden units,
so variable elimination algorithms do not help us to compute the required expectations
any faster than brute force.
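For a small number of hidden units, the required expectation can be computed by brute-force enumeration of all 2^m configurations of h, which makes the exponential cost explicit. The following sketch does so for hypothetical parameter values, using the identity ∂/∂b_i log p(h) = h_i − σ(b_i) for the Bernoulli prior above.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, m = 5, 10                        # 2**m posterior configurations
b = rng.normal(size=m)
W = rng.normal(size=(n, m))
beta = np.ones(n)
v = rng.normal(size=n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Enumerate all 2**m binary codes to compute p(h | v) exactly.
H = np.array(list(product([0.0, 1.0], repeat=m)))    # shape (2**m, m)
log_p_h = H @ np.log(sigmoid(b)) + (1 - H) @ np.log(1 - sigmoid(b))
# log p(v | h), up to an additive constant that cancels when normalizing.
log_p_v_given_h = -0.5 * np.sum(beta * (v - H @ W.T) ** 2, axis=1)
log_joint = log_p_h + log_p_v_given_h
post = np.exp(log_joint - log_joint.max())
post /= post.sum()                                   # p(h | v)

# d/db_i log p(v) = E_{h ~ p(h|v)}[h_i - sigma(b_i)]
grad_b = post @ H - sigmoid(b)
print(grad_b)
```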
One solution to this problem is to use variational methods. Variational methods
involve using a simple distribution q(h) to approximate the true, complicated posterior
p(h | v). The name “variational” derives from their frequent use of a branch of math-
ematics called calculus of variations. However, not all variational methods use calculus
of variations.
Variational inference works by maximizing the lower bound L with respect to q rather than by computing p(h | v) exactly. To keep this maximization tractable, q is usually also restricted to a particular family of distributions, most commonly a factorial (mean field) family.
Figure 20.2: The graph structure of a binary sparse coding model with four hidden units. Left) The graph structure of p(h, v). Note that the edges are directed, and that every two hidden units are co-parents of every visible unit. Right) The graph structure of p(h | v). In order to account for the active paths between co-parents, the posterior distribution needs an edge between every pair of hidden units.
20.4.1 Discrete Latent Variables
For discrete latent variables, a typical strategy is mean field inference: we restrict q to be a factorial distribution over the hidden units and maximize L with respect to its parameters, usually by iterating fixed point equations. In the binary sparse coding model above, for example, each factor q(h_i | v) is a Bernoulli distribution over the corresponding h_i.
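The following sketch illustrates mean field coordinate ascent on a toy problem where the target posterior is small enough to store as an explicit table; it is only meant to show the shape of the fixed point updates, not the binary sparse coding derivation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small target distribution over three binary latent variables, stored as
# an explicit 2x2x2 table. Think of it as p(h | v) for some fixed v; in a
# real model it would be too large to store.
p = rng.random((2, 2, 2))
p /= p.sum()
log_p = np.log(p)

# Mean field approximation q(h) = q1(h1) q2(h2) q3(h3).
# q[i] stores the Bernoulli parameter q_i(h_i = 1).
q = np.full(3, 0.5)

for sweep in range(20):
    for i in range(3):
        other = [j for j in range(3) if j != i]
        # Current distribution over the other two variables under q.
        q_other = np.outer([1 - q[other[0]], q[other[0]]],
                           [1 - q[other[1]], q[other[1]]])
        # Coordinate update: log q_i(h_i) = E_{q_{-i}}[log p(h)] + const.
        lp = np.moveaxis(log_p, i, 0)                 # axis 0 is now h_i
        expected = np.tensordot(lp, q_other, axes=([1, 2], [0, 1]))
        expected -= expected.max()                    # numerical stability
        qi = np.exp(expected)
        q[i] = qi[1] / qi.sum()

# The factorial q only approximates p; its marginals are usually close to,
# but not exactly equal to, the exact marginals of p.
exact = [p.sum(axis=tuple(j for j in range(3) if j != i))[1] for i in range(3)]
print(q)
print(exact)
```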
20.4.2 Calculus of Variations
Many machine learning techniques are based on minimizing a function J(θ) by finding the input vector θ ∈ ℝ^n for which it takes on its minimal value. This can be accomplished with multivariate calculus and linear algebra, by solving for the critical points where ∇_θ J(θ) = 0. In some cases, we actually want to solve for a function f(x), such as when we want to find the probability density function over some random variable. This is what calculus of variations enables us to do.
A function of a function f is known as a functional J[f]. Much as we can take partial derivatives of a function with respect to elements of its vector-valued argument, we can take functional derivatives, also known as variational derivatives, of a functional J[f] with respect to individual values of the function f(x). The functional derivative of the functional J with respect to the value of the function f at point x is denoted
$$\frac{\delta}{\delta f(x)} J.$$
A complete formal development of functional derivatives is beyond the scope of this
book. For our purposes, it is sufficient to state that for differentiable functions f(x) and
differentiable functions g(y, x) with continuous derivatives, that
$$\frac{\delta}{\delta f(x)} \int g(f(x), x)\, dx = \frac{\partial}{\partial y} g(f(x), x). \qquad (20.2)$$
To gain some intuition for this identity, one can think of f(x) as being a vector with uncountably many elements, indexed by a real vector x. In this (somewhat incomplete) view, the identity providing the functional derivatives is the same as we would obtain for a vector θ ∈ ℝ^n indexed by positive integers:
$$\frac{\partial}{\partial \theta_i} \sum_j g(\theta_j, j) = \frac{\partial}{\partial \theta_i} g(\theta_i, i).$$
Many results in other machine learning publications are presented using the more general
Euler-Lagrange equation which allows g to depend on the derivatives of f as well as the
value of f, but we do not need this fully general form for the results presented in this
book.
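The discrete analogue given above can be checked numerically. The following sketch uses an arbitrary choice of g, purely for illustration, and compares a finite-difference derivative of the sum with the ordinary derivative of a single term.

```python
import numpy as np

def g(theta_j, j):
    # An arbitrary smooth per-element function; hypothetical choice.
    return np.sin(theta_j) * (j + 1) + theta_j ** 2

def J(theta):
    return sum(g(theta[j], j) for j in range(len(theta)))

theta = np.array([0.3, -1.2, 0.7])
i, eps = 1, 1e-6

# Finite-difference derivative of the sum J with respect to theta_i ...
theta_plus = theta.copy()
theta_plus[i] += eps
numeric = (J(theta_plus) - J(theta)) / eps

# ... equals the ordinary derivative of g with respect to its first
# argument, evaluated at (theta_i, i): d/dy g(y, j) = cos(y)*(j+1) + 2y.
analytic = np.cos(theta[i]) * (i + 1) + 2 * theta[i]
print(numeric, analytic)
```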
To optimize a function with respect to a vector, we take the gradient of the function
with respect to the vector and solve for the point where every element of the gradient is
equal to zero. Likewise, we can optimize a functional by solving for the function where
the functional derivative at every point is equal to zero.
As an example of how this process works, consider the problem of finding the proba-
bility distribution function over x R that has maximal Shannon entropy. Recall that
the entropy of a probability distribution p(x) is defined as
$$H[p] = -\mathbb{E}_x \log p(x).$$
For continuous values, the expectation is an integral:
$$H[p] = -\int p(x) \log p(x)\, dx.$$
We cannot simply maximize H[p] with respect to the function p(x), because the result might not be a probability distribution. Instead, we need to use Lagrange multipliers to add a constraint that p(x) integrate to 1. Also, the entropy increases without bound as the variance increases, so we can only search for the distribution with maximal entropy for fixed variance σ^2. Finally, the problem is underdetermined because the distribution can be shifted arbitrarily without changing the entropy. To impose a unique solution, we add a constraint that the mean of the distribution be µ. The Lagrangian functional for this optimization problem is
$$
\begin{aligned}
\mathcal{L}[p] &= \lambda_1 \left( \int p(x)\, dx - 1 \right) + \lambda_2 \left( \mathbb{E}[x] - \mu \right) + \lambda_3 \left( \mathbb{E}[(x - \mu)^2] - \sigma^2 \right) + H[p] \\
&= \int \left( \lambda_1 p(x) + \lambda_2 p(x) x + \lambda_3 p(x)(x - \mu)^2 - p(x) \log p(x) \right) dx - \lambda_1 - \mu \lambda_2 - \sigma^2 \lambda_3.
\end{aligned}
$$
To minimize the Lagrangian with respect to p, we set the functional derivatives equal
to 0:
$$\forall x, \quad \frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0.$$
This condition now tells us the functional form of p(x). By algebraically re-arranging
the equation, we obtain
$$p(x) = \exp\left( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \right).$$
We never assumed directly that p(x) would take this functional form; we obtained
the expression itself by analytically minimizing a functional. To finish the minimization
problem, we must choose the λ values to ensure that all of our constraints are satisfied.
We are free to choose any λ values, because the gradient of the Lagrangian with respect
to the λ variables is zero so long as the constraints are satisfied. To satisfy all of the constraints, we may set λ_1 = 1 − log σ√(2π), λ_2 = 0, and λ_3 = −1/(2σ^2) to obtain
$$p(x) = \mathcal{N}(x \mid \mu, \sigma^2).$$
This is one reason for using the normal distribution when we do not know the true
distribution. Because the normal distribution has the maximum entropy, we impose the
least possible amount of structure by making this assumption.
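As a quick numerical check of this maximum entropy property, the following sketch (with an arbitrary choice of σ) compares the differential entropy of a Gaussian with that of a Laplace and a uniform density of the same mean and variance:

```python
import numpy as np

sigma = 1.5
x = np.linspace(-30, 30, 200001)
dx = x[1] - x[0]

def entropy(p):
    """Differential entropy -integral of p log p, approximated on the grid."""
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) * dx

# Three densities with the same mean (0) and the same variance sigma**2.
gaussian = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
b = sigma / np.sqrt(2)                 # Laplace scale matching the variance
laplace = np.exp(-np.abs(x) / b) / (2 * b)
half_width = sigma * np.sqrt(3)        # uniform on [-h, h] has variance h**2/3
uniform = np.where(np.abs(x) <= half_width, 1 / (2 * half_width), 0.0)

# The Gaussian should have the largest entropy of the three.
print(entropy(gaussian), entropy(laplace), entropy(uniform))
```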
What about the probability distribution function that minimizes the entropy? It
turns out that there is no specific function that achieves minimal entropy. As functions
place more mass on x = µ ± σ and less on all other values of x, they lose entropy.
However, any function placing exactly zero mass on all but two points does not integrate
to one, and is not a valid probability distribution. There thus is no single minimal
entropy probability distribution function, much as there is no single minimal positive
real number.
20.4.3 Continuous Latent Variables
When our model contains continuous latent variables, we can still maximize L with respect to q, but we must now use the calculus of variations developed above to carry out the maximization over a family of probability density functions.
20.5 Stochastic Inference
An alternative to optimizing over a restricted family of distributions q is to represent the approximate posterior with samples, for example by drawing Monte Carlo samples from p(h | v) or an approximation of it and using them to estimate the expectations required for learning.
20.6 Learned Approximate Inference
Rather than running an iterative optimization procedure separately for every value of v, we can train a separate inference network that learns to predict a good q directly from v; the wake-sleep algorithm is an early example of this approach.
In Section 19.2 we saw that one possible explanation for the role of dream sleep
in human beings and animals is that dreams could provide the negative phase samples
that Monte Carlo training algorithms use to approximate the negative gradient of the
log partition function of undirected models. Another possible explanation for biolog-
ical dreaming is that it is providing samples from p(h, v) which can be used to train
an inference network to predict h given v. In some senses, this explanation is more
satisfying than the partition function explanation. Monte Carlo algorithms generally
do not perform well if they are run using only the positive phase of the gradient for
several steps then with only the negative phase of the gradient for several steps. Hu-
man beings and animals are usually awake for several consecutive hours then asleep
for several consecutive hours, and it is not readily apparent how this schedule could
support Monte Carlo training of an undirected model. Learning algorithms based on
maximizing L can be run with prolonged periods of improving q and prolonged periods of improving θ, however. If the role of biological dreaming is to train networks
for predicting q, then this explains how animals are able to remain awake for several
hours (the longer they are awake, the greater the gap between L and log p(v), but L
will remain a lower bound) and to remain asleep for several hours (the generative model
itself is not modified during sleep) without damaging their internal models. Of course,
these ideas are purely speculative, and there is no hard evidence to suggest that dream-
ing accomplishes either of these goals. Dreaming may also serve reinforcement learning
rather than probabilistic modeling, by sampling synthetic experiences from the animal’s
transition model, on which to train the animal’s policy. Or sleep may serve some other
purpose not yet anticipated by the machine learning community.