
= log p(v; θ) − E_{h∼q}[log q(h) − log p(h, v) + log p(v)]
= −E_{h∼q}[log q(h) − log p(h, v)].
This yields the more canonical definition of the evidence lower bound,
L(v, θ, q) = E_{h∼q}[log p(h, v)] + H(q).    (20.1)
The first term of L is known as the energy term. The second term is known
as the entropy term. For an appropriate choice of q, both terms can be easy to
compute. The only question is how close the distribution q is to p(h | v).
This determines how good an approximation L is to log p(v).
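Subtracting equation 20.1 from log p(v) makes this precise: the gap between the
log-likelihood and the bound is the KL divergence from q to the exact posterior,

log p(v; θ) − L(v, θ, q) = E_{h∼q}[log q(h) − log p(h | v; θ)] = D_KL(q(h) ‖ p(h | v; θ)) ≥ 0,

so the bound is tight exactly when q matches p(h | v; θ).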
We can thus think of inference as the procedure for finding the q that maximizes
L. Exact inference maximizes L perfectly. Throughout this chapter, we will show
how many forms of approximate inference are possible. No matter what choice of
q we use, L will give us a lower bound on the log-likelihood log p(v; θ). We can get
tighter or looser bounds that are cheaper or more expensive to compute, depending
on how we choose to approach this optimization problem. We can obtain a poorly
matched q but reduce the computational cost by using an imperfect optimization
procedure, or by using a perfect optimization procedure over a restricted family
of q distributions.
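To make this tradeoff concrete, here is a minimal Python sketch (not from the
text) that evaluates the bound of equation 20.1 on a toy model with one binary
latent variable; the probability tables and the names p_h, p_v_given_h, and elbo
are invented purely for illustration. With q equal to the exact posterior the
bound matches log p(v); with a mismatched q, such as the uniform distribution
standing in for a restricted family, the bound is strictly looser but cheaper
to choose.

import numpy as np

# Toy model: binary latent h and binary visible v (probabilities are made up).
p_h = np.array([0.6, 0.4])                    # p(h)
p_v_given_h = np.array([[0.9, 0.1],           # p(v | h = 0)
                        [0.2, 0.8]])          # p(v | h = 1)

v = 1                                         # the observed value
p_joint = p_h * p_v_given_h[:, v]             # p(h, v) as a vector over h
log_p_v = np.log(p_joint.sum())               # exact log p(v)

def elbo(q):
    # L(v, q) = E_{h~q}[log p(h, v)] + H(q), as in equation 20.1.
    energy = np.sum(q * np.log(p_joint))
    entropy = -np.sum(q * np.log(q))
    return energy + entropy

q_exact = p_joint / p_joint.sum()             # q(h) = p(h | v): tight bound
q_uniform = np.array([0.5, 0.5])              # mismatched q: looser bound

print(log_p_v, elbo(q_exact))                 # equal up to rounding
print(elbo(q_uniform))                        # strictly below log p(v)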
20.2 Expectation Maximization
Expectation maximization (EM) is a popular training algorithm for models with
latent variables. It consists of alternating between two steps until convergence:
• The E-step (Expectation step): Set q(h^{(i)}) = p(h^{(i)} | v^{(i)}; θ) for all
indices i of the training examples v^{(i)} we want to train on (both batch and
minibatch variants are valid). By this we mean q is defined in terms of the
current value of θ; if we vary θ, then p(h | v; θ) will change, but q(h) will not.
• The M-step (Maximization step): Completely or partially maximize
∑_i L(v^{(i)}, θ, q) with respect to θ using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one
step, we maximize L with respect to q, and on the other, we maximize L with
respect to θ.
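As a concrete illustration of this coordinate ascent (a sketch under arbitrary
choices of data, initialization, and iteration count, not an implementation from
the text), the following Python code runs EM for a two-component univariate
Gaussian mixture, a family in which the M-step maximization has a closed form.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-dimensional data drawn from two Gaussian clusters (illustrative only).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# theta = (mixing weights, means, variances) of a two-component Gaussian mixture.
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

def log_gaussian(x, mean, var):
    # Log density of a univariate Gaussian, broadcast over components.
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

for _ in range(50):
    # E-step: q(h^{(i)}) = p(h^{(i)} | x^{(i)}; theta) for every training example i,
    # computed in log space and renormalized for numerical stability.
    log_joint = np.log(weights) + log_gaussian(x[:, None], means, variances)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    q = np.exp(log_joint)
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize sum_i L(x^{(i)}, theta, q) over theta; for this family the
    # maximizer is available in closed form.
    n_k = q.sum(axis=0)
    weights = n_k / len(x)
    means = (q * x[:, None]).sum(axis=0) / n_k
    variances = (q * (x[:, None] - means) ** 2).sum(axis=0) / n_k

print(weights, means, variances)  # approaches roughly (0.4, 0.6), (-2, 3), (1, 1)

The E-step is carried out in log space and renormalized per example for numerical
stability; the M-step reestimates the mixing weights, means, and variances from
the expected assignments q.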
Stochastic gradient ascent on latent variable models can be seen as a special
case of the EM algorithm where the M-step consists of taking a single gradient
step. Other variants of the EM algorithm can make much larger steps. For some
model families, the M-step can even be performed analytically, jumping all the
way to the optimal solution given the current q.
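To write this single gradient step explicitly: the entropy term H(q) does not
depend on θ, so one step of gradient ascent on ∑_i L(v^{(i)}, θ, q) amounts to

θ ← θ + ε ∇_θ ∑_i E_{h∼q(h^{(i)})}[log p(h, v^{(i)}; θ)],

where ε denotes a generic learning rate (a symbol introduced here only for
illustration) and q is held fixed at the value chosen by the most recent E-step.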