Chapter 20
Approximate Inference
Many probabilistic models are difficult to train because it is difficult to perform
inference in them. In the context of deep learning, we usually have a set of visible
variables v and a set of latent variables h. The challenge of inference usually refers to
the difficult problem of computing p(h | v) or taking expectations with respect to it.
Such operations are often necessary for tasks like maximum likelihood learning.
Many simple graphical models with only one hidden layer, such as restricted Boltzmann machines and probabilistic PCA, are defined in a way that makes inference operations like computing p(h | v), or taking expectations with respect to it, simple. Unfortunately, most graphical models with multiple layers of hidden variables, such as deep belief networks and deep Boltzmann machines, have intractable posterior distributions. Exact inference requires an exponential amount of time in these models. Even some models with only a single layer, such as sparse coding, have this problem.
Intractable inference problems usually arise from interactions between latent vari-
ables in a structured graphical model. See Fig. 20.1 for some examples. These inter-
actions may be due to direct interactions in undirected models or “explaining away”
interactions between mutual ancestors of the same visible unit in directed models.
20.1 Inference as Optimization
Many approaches to confronting the problem of difficult inference make use of the ob-
servation that exact inference can be described as an optimization problem.
Specifically, assume we have a probabilistic model consisting of observed variables v
and latent variables h. We would like to compute the log probability of the observed
data, log p(v; θ). It is often too difficult to compute log p(v; θ) because it is costly to marginalize out h. Instead, we can compute a lower bound on it. This bound is called
the evidence lower bound (ELBO). Other names for this lower bound include the negative
variational free energy and the negative Helmholtz free energy. Specifically, this lower bound is defined as
$$\mathcal{L}(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\!\left( q(h \mid v) \,\Vert\, p(h \mid v; \theta) \right),$$
where q(h | v) is an arbitrary probability distribution over h.
Figure 20.1: Intractable inference problems are usually the result of interactions between latent variables in a structured graphical model. These can be due to direct edges, or due to paths that are activated when the child of a V-structure is observed. Left) A semi-restricted Boltzmann machine with connections between hidden units. These direct connections between latent variables make the posterior distribution complicated. Center) A deep Boltzmann machine, organized into layers of variables without intra-layer connections, still has an intractable posterior distribution because of the connections between layers. Right) This directed model has interactions between latent variables when the visible variables are observed, because every two latent variables are co-parents of the same visible unit. Note that it is still possible to have these graph structures yet have tractable inference. For example, probabilistic PCA has the graph structure shown on the right, yet has simple inference due to special properties of the specific conditional distributions it uses (linear-Gaussian conditionals with mutually orthogonal basis vectors).
It is straightforward to see that this is a lower bound on log p(v):
$$
\begin{aligned}
\log p(v) &= \log p(v) + \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} + \sum_h q(h \mid v) \log p(v) \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \left( \log \frac{p(v, h)}{q(h \mid v)} - \log p(v) \right) \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(v, h)}{p(v)\, q(h \mid v)} \\
&= \sum_h q(h \mid v) \log \frac{p(v, h)}{q(h \mid v)} - \sum_h q(h \mid v) \log \frac{p(h \mid v)}{q(h \mid v)} \\
&= \mathcal{L}(v, \theta, q) + D_{\mathrm{KL}}\!\left( q(h \mid v) \,\Vert\, p(h \mid v) \right).
\end{aligned}
$$
The second equality uses the fact that ∑_h q(h | v) = 1 to rewrite log p(v) as an expectation under q.
Because the difference between log p(v) and L(v, θ, q) is given by the KL divergence, and because the KL divergence is always non-negative, we can see that L always has at most the same value as the desired log probability, and is equal to it if and only if q is the same distribution as p(h | v).
Surprisingly, L can be considerably easier to compute for some distributions q. Sim-
ple algebra shows that we can rearrange L into a much more convenient form:
$$
\begin{aligned}
\mathcal{L}(v, \theta, q) &= \log p(v; \theta) - D_{\mathrm{KL}}\!\left( q(h \mid v) \,\Vert\, p(h \mid v; \theta) \right) \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log p(h \mid v) \right] \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log \frac{p(h, v)}{p(v)} \right] \\
&= \log p(v; \theta) - \mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log p(h, v) + \log p(v) \right] \\
&= -\mathbb{E}_{h \sim q} \left[ \log q(h \mid v) - \log p(h, v) \right].
\end{aligned}
$$
This yields the more canonical definition of the evidence lower bound,
$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q} \left[ \log p(h, v) \right] + H(q). \qquad (20.1)$$
The first term of L is known as the energy term. The second term is known as the
entropy term. For an appropriate choice of q, both terms can be easy to compute. The
only question is how close to p(h | v) the distribution q will be. This determines how good an approximation L is to log p(v).
We can thus think of inference as the procedure for finding the q that maximizes
L. Exact inference maximizes L perfectly. Throughout this chapter, we will show how
many forms of approximate inference are possible. No matter what choice of q we use,
L will give us a lower bound on the likelihood. We can get tighter or looser bounds that
are cheaper or more expensive to compute depending on how we choose to approach this
optimization problem. We can obtain a poorly matched q but reduce the computation
cost by using an imperfect optimization procedure, or by using a perfect optimization
procedure over a restricted family of q distributions.
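To make the bound concrete, here is a minimal numerical sketch. It builds a small, entirely hypothetical discrete joint distribution p(h, v), evaluates L(v, θ, q) = E_{h∼q}[log p(h, v)] + H(q) for an arbitrary q, and checks that the bound is tight when q equals the exact posterior p(h | v).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(h, v) over a discrete latent h (4 states)
# and a discrete observation v (3 states). Rows index h, columns index v.
p_hv = rng.random((4, 3))
p_hv /= p_hv.sum()

v = 1                              # the observed value of v
p_v = p_hv[:, v].sum()             # p(v), by marginalizing out h
p_h_given_v = p_hv[:, v] / p_v     # exact posterior p(h | v)

def elbo(q):
    """L(v, q) = E_{h~q}[log p(h, v)] + H(q), with 0 log 0 treated as 0."""
    joint = p_hv[:, v]
    nz = q > 0
    energy = np.sum(q[nz] * np.log(joint[nz]))
    entropy = -np.sum(q[nz] * np.log(q[nz]))
    return energy + entropy

# An arbitrary q gives a strict lower bound on log p(v) ...
q_bad = np.full(4, 0.25)
print(elbo(q_bad), "<=", np.log(p_v))

# ... and the bound is tight when q equals the true posterior.
print(np.isclose(elbo(p_h_given_v), np.log(p_v)))
```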
20.2 Expectation Maximization
Expectation maximization (EM) is a popular training algorithm for models with latent
variables. It consists of alternating between two steps until convergence:
The E-step (Expectation step): Set q(h^{(i)} | v^{(i)}) = p(h^{(i)} | v^{(i)}; θ) for all indices i of the training examples v^{(i)} we want to train on (both batch and minibatch variants are valid). By this we mean q is defined in terms of the current value of θ; if we vary θ afterwards, then p(h | v; θ) will change but q(h | v) will not.
The M-step (Maximization step): Completely or partially maximize ∑_i L(v^{(i)}, θ, q) with respect to θ, using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one step,
we maximize L with respect to q, and on the other, we maximize L with respect to θ.
Stochastic gradient ascent on latent variable models can be seen as a special case
of the EM algorithm where the M step consists of taking a single gradient step. Other
variants of the EM algorithm can make much larger steps. For some model families, the
M step can even be performed analytically, jumping all the way to the optimal solution
given the current q.
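The following sketch illustrates this coordinate ascent view on a model where the M-step is analytic, a standard two-component Gaussian mixture (not a model discussed in this chapter); the data and the initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from two Gaussians.
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

# Initial parameters theta = (mixing weights, means, variances).
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for step in range(50):
    # E-step: q(h | x) = p(h | x; theta), the exact posterior responsibilities.
    logp = (np.log(pi)
            - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)
    q = np.exp(logp - logp.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize sum_i L(x_i, theta, q) over theta, analytically.
    n_k = q.sum(axis=0)
    pi = n_k / len(x)
    mu = (q * x[:, None]).sum(axis=0) / n_k
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

print(pi, mu, var)
```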
Even though the E-step involves exact inference, we can think of the EM algorithm as using approximate inference in some sense. Specifically, the M-step assumes that the same value of q can be used for all values of θ. This introduces a gap between L and the true log p(v) as the M-step moves θ further from the value used to construct q. Fortunately, the E-step reduces the gap to zero again at the start of the next iteration.
The EM algorithm is a workhorse of classical machine learning, and it can be considered present in deep learning in the sense that stochastic gradient ascent can be seen as EM with a very simple and small M-step. However, because L cannot be analytically maximized for many interesting deep models, the more general EM framework as a whole is typically not explored in the deep learning research community.
20.3 MAP Inference: Sparse Coding as a Probabilistic
Model
Many versions of sparse coding can be cast as probabilistic models. For example, suppose we encode visible data v ∈ ℝ^n with latent variables h ∈ ℝ^m. We can use a prior that encourages our latent code variables to be sparse, such as a factorial Laplace prior:
$$p(h) = \prod_i \frac{\lambda}{2} e^{-\lambda |h_i|}.$$
We can define the visible units to be Gaussian with an affine transformation from the code to the mean of the Gaussian:
$$v \sim \mathcal{N}(v \mid \mu + W h, \beta^{-1}),$$
where β is a diagonal precision matrix, to maintain tractability.
Computing p(h | v) is difficult: the Laplace prior is not conjugate to the Gaussian likelihood, so the posterior over h has no simple closed form, and the expectations with respect to it that maximum likelihood learning requires are intractable.
One operation that we can do is perform maximum a posteriori (MAP) inference, which means solving the following optimization problem:
$$h^* = \arg\max_h p(h \mid v).$$
This yields the familiar sparse coding optimization problem
$$h^* = \arg\min_h \left[ \lambda \| h \|_1 + \frac{1}{2} (v - \mu - W h)^\top \beta\, (v - \mu - W h) \right],$$
which can be solved with gradient-based methods or with specialized algorithms such as feature-sign search or ISTA.
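As a rough illustration, the following sketch performs this MAP inference with ISTA; the parameter values are hypothetical placeholders, and the fixed step size is one simple valid choice rather than the only one.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_map(v, W, mu, beta, lam, n_steps=500):
    """MAP inference h* = argmin_h 0.5*(v - mu - W h)^T diag(beta) (v - mu - W h)
    + lam * ||h||_1 for the Laplace-prior sparse coding model, via ISTA."""
    h = np.zeros(W.shape[1])
    B = np.diag(beta)
    # A conservative fixed step size: 1 / L, where L bounds the curvature
    # of the smooth quadratic term.
    step = 1.0 / np.linalg.norm(W.T @ B @ W, 2)
    for _ in range(n_steps):
        grad = -W.T @ B @ (v - mu - W @ h)   # gradient of the smooth term
        h = soft_threshold(h - step * grad, step * lam)
    return h

# Hypothetical parameters, only to show the call signature.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))
mu, beta, lam = np.zeros(20), np.ones(20), 0.5
v = rng.normal(size=20)
print(sparse_code_map(v, W, mu, beta, lam)[:10])
```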
This shows that the popular feature extraction strategy for sparse coding can be justified as having a probabilistic interpretation: it may be MAP inference in this probabilistic model (there are other probabilistic models that yield the same optimization problem, so we cannot positively identify this specific model from the feature extraction process).
Excitingly, MAP inference of h given v also has an interpretation in terms of maxi-
mizing the evidence lower bound. Specifically, MAP inference maximizes L with respect
to q under the constraint that q take the form of a Dirac distribution. During
learning of sparse coding, we alternate between using convex optimization to extract
the codes, and using convex optimization to update W to achieve the optimal recon-
struction given the codes. This turns out to be equivalent to maximizing L with respect
to θ for the q that was obtained from MAP inference. The learning algorithm can be
thought of as EM restricted to using a Dirac posterior. In other words, rather than
performing learning exactly using standard inference, we learn to maximize a bound on
the true likelihood, using exact MAP inference.
20.4 Variational Inference and Learning
One common difficulty in probabilistic modeling is that the posterior distribution p(h | v)
is infeasible to compute for many models with hidden variables h and visible variables
v. Expectations with respect to this distribution may also be intractable.
Consider as an example the binary sparse coding model. In this model, the input v ∈ ℝ^n is formed by adding Gaussian noise to the sum of m different components, each of which can be present or absent. Each component is switched on or off by the corresponding hidden unit in h ∈ {0, 1}^m:
$$p(h_i = 1) = \sigma(b_i),$$
$$p(v \mid h) = \mathcal{N}(v \mid W h, \beta^{-1}),$$
where b is a learnable set of biases, W is a learnable weight matrix, and β is a learnable diagonal precision matrix.
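A short sketch of ancestral sampling from this generative model, with made-up parameter values, may help make the definition concrete (σ denotes the logistic sigmoid):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 10, 6                       # visible and hidden dimensions
b = rng.normal(size=m)             # learnable biases
W = rng.normal(size=(n, m))        # learnable weight matrix
beta = np.ones(n) * 4.0            # learnable diagonal precision

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Ancestral sampling from p(h, v) = p(h) p(v | h).
h = (rng.random(m) < sigmoid(b)).astype(float)   # h_i ~ Bernoulli(sigma(b_i))
v = rng.normal(W @ h, np.sqrt(1.0 / beta))       # v ~ N(W h, beta^{-1})
print(h, v)
```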
Training this model with maximum likelihood requires taking the derivative with
respect to the parameters. Consider the derivative with respect to one of the biases:
$$
\begin{aligned}
\frac{\partial}{\partial b_i} \log p(v)
&= \frac{\frac{\partial}{\partial b_i} p(v)}{p(v)} \\
&= \frac{\frac{\partial}{\partial b_i} \sum_h p(h, v)}{p(v)} \\
&= \frac{\frac{\partial}{\partial b_i} \sum_h p(h) p(v \mid h)}{p(v)} \\
&= \frac{\sum_h p(v \mid h) \frac{\partial}{\partial b_i} p(h)}{p(v)} \\
&= \sum_h p(h \mid v) \frac{\frac{\partial}{\partial b_i} p(h)}{p(h)} \\
&= \mathbb{E}_{h \sim p(h \mid v)} \frac{\partial}{\partial b_i} \log p(h).
\end{aligned}
$$
This requires computing expectations with respect to p(h | v). Unfortunately, p(h |
v) is a complicated distribution. See Fig. 20.2 for the graph structure of p(h, v) and p(h |
v). The posterior distribution corresponds to the complete graph over the hidden units,
so variable elimination algorithms do not help us to compute the required expectations
any faster than brute force.
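For a small number of hidden units, the required expectation can be computed by brute-force enumeration of all 2^m configurations of h, which makes the exponential cost explicit. The following sketch does so for hypothetical parameter values, using the identity ∂/∂b_i log p(h) = h_i − σ(b_i) for the Bernoulli prior above.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, m = 5, 10                        # 2**m posterior configurations
b = rng.normal(size=m)
W = rng.normal(size=(n, m))
beta = np.ones(n)
v = rng.normal(size=n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Enumerate all 2**m binary codes to compute p(h | v) exactly.
H = np.array(list(product([0.0, 1.0], repeat=m)))    # shape (2**m, m)
log_p_h = H @ np.log(sigmoid(b)) + (1 - H) @ np.log(1 - sigmoid(b))
# log p(v | h), up to an additive constant that cancels when normalizing.
log_p_v_given_h = -0.5 * np.sum(beta * (v - H @ W.T) ** 2, axis=1)
log_joint = log_p_h + log_p_v_given_h
post = np.exp(log_joint - log_joint.max())
post /= post.sum()                                   # p(h | v)

# d/db_i log p(v) = E_{h ~ p(h|v)}[h_i - sigma(b_i)]
grad_b = post @ H - sigmoid(b)
print(grad_b)
```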
One solution to this problem is to use variational methods. Variational methods
involve using a simple distribution q(h) to approximate the true, complicated posterior
p(h | v). The name “variational” derives from their frequent use of a branch of math-
ematics called calculus of variations. However, not all variational methods use calculus
of variations.
Variational inference works by maximizing the lower bound L with respect to q rather than by computing p(h | v) exactly. To keep this maximization tractable, q is usually also restricted to a particular family of distributions, most commonly a factorial (mean field) family.
Figure 20.2: The graph structure of a binary sparse coding model with four hidden units. Left) The graph structure of p(h, v). Note that the edges are directed, and that every two hidden units are co-parents of every visible unit. Right) The graph structure of p(h | v). In order to account for the active paths between co-parents, the posterior distribution needs an edge between every pair of hidden units.
20.4.1 Discrete Latent Variables
For discrete latent variables, a typical strategy is mean field inference: we restrict q to be a factorial distribution over the hidden units and maximize L with respect to its parameters, usually by iterating fixed point equations. In the binary sparse coding model above, for example, each factor q(h_i | v) is a Bernoulli distribution over the corresponding h_i.
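The following sketch illustrates mean field coordinate ascent on a toy problem where the target posterior is small enough to store as an explicit table; it is only meant to show the shape of the fixed point updates, not the binary sparse coding derivation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small target distribution over three binary latent variables, stored as
# an explicit 2x2x2 table. Think of it as p(h | v) for some fixed v; in a
# real model it would be too large to store.
p = rng.random((2, 2, 2))
p /= p.sum()
log_p = np.log(p)

# Mean field approximation q(h) = q1(h1) q2(h2) q3(h3).
# q[i] stores the Bernoulli parameter q_i(h_i = 1).
q = np.full(3, 0.5)

for sweep in range(20):
    for i in range(3):
        other = [j for j in range(3) if j != i]
        # Current distribution over the other two variables under q.
        q_other = np.outer([1 - q[other[0]], q[other[0]]],
                           [1 - q[other[1]], q[other[1]]])
        # Coordinate update: log q_i(h_i) = E_{q_{-i}}[log p(h)] + const.
        lp = np.moveaxis(log_p, i, 0)                 # axis 0 is now h_i
        expected = np.tensordot(lp, q_other, axes=([1, 2], [0, 1]))
        expected -= expected.max()                    # numerical stability
        qi = np.exp(expected)
        q[i] = qi[1] / qi.sum()

# The factorial q only approximates p; its marginals are usually close to,
# but not exactly equal to, the exact marginals of p.
exact = [p.sum(axis=tuple(j for j in range(3) if j != i))[1] for i in range(3)]
print(q)
print(exact)
```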
20.4.2 Calculus of Variations
Many machine learning techniques are based on minimizing a function J(θ) by finding the input vector θ ∈ ℝ^n for which it takes on its minimal value. This can be accomplished with multivariate calculus and linear algebra, by solving for the critical points where ∇_θ J(θ) = 0. In some cases, we actually want to solve for a function f(x), such as when we want to find the probability density function over some random variable. This is what calculus of variations enables us to do.
A function of a function f is known as a functional J[f]. Much as we can take partial derivatives of a function with respect to elements of its vector-valued argument, we can take functional derivatives, also known as variational derivatives, of a functional J[f] with respect to individual values of the function f(x). The functional derivative of the functional J with respect to the value of the function f at point x is denoted
$$\frac{\delta}{\delta f(x)} J.$$
A complete formal development of functional derivatives is beyond the scope of this
book. For our purposes, it is sufficient to state that for differentiable functions f(x) and
differentiable functions g(y, x) with continuous derivatives, that
$$\frac{\delta}{\delta f(x)} \int g(f(x), x)\, dx = \frac{\partial}{\partial y} g(f(x), x). \qquad (20.2)$$
To gain some intuition for this identity, one can think of f(x) as being a vector with uncountably many elements, indexed by a real vector x. In this (somewhat incomplete) view, the identity providing the functional derivatives is the same as we would obtain for a vector θ ∈ ℝ^n indexed by positive integers:
$$\frac{\partial}{\partial \theta_i} \sum_j g(\theta_j, j) = \frac{\partial}{\partial \theta_i} g(\theta_i, i).$$
Many results in other machine learning publications are presented using the more general
Euler-Lagrange equation which allows g to depend on the derivatives of f as well as the
value of f, but we do not need this fully general form for the results presented in this
book.
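The discrete analogue given above can be checked numerically. The following sketch uses an arbitrary choice of g, purely for illustration, and compares a finite-difference derivative of the sum with the ordinary derivative of a single term.

```python
import numpy as np

def g(theta_j, j):
    # An arbitrary smooth per-element function; hypothetical choice.
    return np.sin(theta_j) * (j + 1) + theta_j ** 2

def J(theta):
    return sum(g(theta[j], j) for j in range(len(theta)))

theta = np.array([0.3, -1.2, 0.7])
i, eps = 1, 1e-6

# Finite-difference derivative of the sum J with respect to theta_i ...
theta_plus = theta.copy()
theta_plus[i] += eps
numeric = (J(theta_plus) - J(theta)) / eps

# ... equals the ordinary derivative of g with respect to its first
# argument, evaluated at (theta_i, i): d/dy g(y, j) = cos(y)*(j+1) + 2y.
analytic = np.cos(theta[i]) * (i + 1) + 2 * theta[i]
print(numeric, analytic)
```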
To optimize a function with respect to a vector, we take the gradient of the function
with respect to the vector and solve for the point where every element of the gradient is
equal to zero. Likewise, we can optimize a functional by solving for the function where
the functional derivative at every point is equal to zero.
As an example of how this process works, consider the problem of finding the proba-
bility distribution function over x R that has maximal Shannon entropy. Recall that
the entropy of a probability distribution p(x) is defined as
$$H[p] = -\mathbb{E}_x \log p(x).$$
For continuous values, the expectation is an integral:
$$H[p] = -\int p(x) \log p(x)\, dx.$$
We cannot simply maximize H[p] with respect to the function p(x), because the result might not be a probability distribution. Instead, we need to use Lagrange multipliers to add a constraint that p(x) integrate to 1. Also, the entropy increases without bound as the variance increases, so we can only search for the distribution with maximal entropy for fixed variance σ^2. Finally, the problem is underdetermined because the distribution can be shifted arbitrarily without changing the entropy. To impose a unique solution, we add a constraint that the mean of the distribution be µ. The Lagrangian functional for this optimization problem is
$$
\begin{aligned}
\mathcal{L}[p] &= \lambda_1 \left( \int p(x)\, dx - 1 \right) + \lambda_2 \left( \mathbb{E}[x] - \mu \right) + \lambda_3 \left( \mathbb{E}[(x - \mu)^2] - \sigma^2 \right) + H[p] \\
&= \int \left( \lambda_1 p(x) + \lambda_2 p(x) x + \lambda_3 p(x)(x - \mu)^2 - p(x) \log p(x) \right) dx - \lambda_1 - \mu \lambda_2 - \sigma^2 \lambda_3.
\end{aligned}
$$
To minimize the Lagrangian with respect to p, we set the functional derivatives equal
to 0:
$$\forall x, \quad \frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0.$$
This condition now tells us the functional form of p(x). By algebraically re-arranging
the equation, we obtain
$$p(x) = \exp\left( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \right).$$
We never assumed directly that p(x) would take this functional form; we obtained
the expression itself by analytically minimizing a functional. To finish the minimization
problem, we must choose the λ values to ensure that all of our constraints are satisfied.
We are free to choose any λ values, because the gradient of the Lagrangian with respect
to the λ variables is zero so long as the constraints are satisfied. To satisfy all of the constraints, we may set λ_1 = 1 − log σ√(2π), λ_2 = 0, and λ_3 = −1/(2σ^2) to obtain
$$p(x) = \mathcal{N}(x \mid \mu, \sigma^2).$$
This is one reason for using the normal distribution when we do not know the true
distribution. Because the normal distribution has the maximum entropy, we impose the
least possible amount of structure by making this assumption.
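As a quick numerical check of this maximum entropy property, the following sketch (with an arbitrary choice of σ) compares the differential entropy of a Gaussian with that of a Laplace and a uniform density of the same mean and variance:

```python
import numpy as np

sigma = 1.5
x = np.linspace(-30, 30, 200001)
dx = x[1] - x[0]

def entropy(p):
    """Differential entropy -integral of p log p, approximated on the grid."""
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) * dx

# Three densities with the same mean (0) and the same variance sigma**2.
gaussian = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
b = sigma / np.sqrt(2)                 # Laplace scale matching the variance
laplace = np.exp(-np.abs(x) / b) / (2 * b)
half_width = sigma * np.sqrt(3)        # uniform on [-h, h] has variance h**2/3
uniform = np.where(np.abs(x) <= half_width, 1 / (2 * half_width), 0.0)

# The Gaussian should have the largest entropy of the three.
print(entropy(gaussian), entropy(laplace), entropy(uniform))
```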
What about the probability distribution function that minimizes the entropy? It
turns out that there is no specific function that achieves minimal entropy. As functions
place more mass on x = µ ± σ and less on all other values of x, they lose entropy.
However, any function placing exactly zero mass on all but two points does not integrate
to one, and is not a valid probability distribution. There thus is no single minimal
entropy probability distribution function, much as there is no single minimal positive
real number.
20.4.3 Continuous Latent Variables
When our model contains continuous latent variables, we can still maximize L with respect to q, but we must now use the calculus of variations developed above to carry out the maximization over a family of probability density functions.
20.5 Stochastic Inference
An alternative to optimizing over a restricted family of distributions q is to represent the approximate posterior with samples, for example by drawing Monte Carlo samples from p(h | v) or an approximation of it and using them to estimate the expectations required for learning.
20.6 Learned Approximate Inference
Rather than running an iterative optimization procedure separately for every value of v, we can train a separate inference network that learns to predict a good q directly from v; the wake-sleep algorithm is an early example of this approach.
In Section 19.2 we saw that one possible explanation for the role of dream sleep
in human beings and animals is that dreams could provide the negative phase samples
that Monte Carlo training algorithms use to approximate the negative gradient of the
log partition function of undirected models. Another possible explanation for biolog-
ical dreaming is that it is providing samples from p(h, v) which can be used to train
an inference network to predict h given v. In some senses, this explanation is more
satisfying than the partition function explanation. Monte Carlo algorithms generally
do not perform well if they are run using only the positive phase of the gradient for
several steps then with only the negative phase of the gradient for several steps. Hu-
man beings and animals are usually awake for several consecutive hours then asleep
for several consecutive hours, and it is not readily apparent how this schedule could
support Monte Carlo training of an undirected model. Learning algorithms based on
maximizing L can be run with prolonged periods of improving q and prolonged periods of improving θ, however. If the role of biological dreaming is to train networks
for predicting q, then this explains how animals are able to remain awake for several
hours (the longer they are awake, the greater the gap between L and log p(v), but L
will remain a lower bound) and to remain asleep for several hours (the generative model
itself is not modified during sleep) without damaging their internal models. Of course,
these ideas are purely speculative, and there is no hard evidence to suggest that dream-
ing accomplishes either of these goals. Dreaming may also serve reinforcement learning
rather than probabilistic modeling, by sampling synthetic experiences from the animal’s
transition model, on which to train the animal’s policy. Or sleep may serve some other
purpose not yet anticipated by the machine learning community.