
= log p(v; θ) − E_{h∼q}[log q(h) − log p(h, v) + log p(v)]
= −E_{h∼q}[log q(h) − log p(h, v)].
This yields the more canonical definition of the evidence lower bound,
L(v, θ, q) = E_{h∼q}[log p(h, v)] + H(q).    (20.1)
The first term of L is known as the energy term. The second term is known
as the entropy term. For an appropriate choice of q, both terms can be easy to
compute. The only question is how close the distribution q is to p(h | v).
This determines how good an approximation L is to log p(v).
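Subtracting equation 20.1 from log p(v) makes this precise: the gap between the
log-likelihood and the bound is the KL divergence from q to the exact posterior,

log p(v; θ) − L(v, θ, q) = E_{h∼q}[log q(h) − log p(h | v; θ)] = D_KL(q(h) ‖ p(h | v; θ)) ≥ 0,

so the bound is tight exactly when q matches p(h | v; θ).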
We can thus think of inference as the procedure for finding the q that maximizes
L. Exact inference maximizes L perfectly. Throughout this chapter, we will show
how many forms of approximate inference are possible. No matter what choice of
q we use, L will give us a lower bound on the log-likelihood log p(v; θ). We can get
tighter or looser bounds that are cheaper or more expensive to compute, depending
on how we choose to approach this optimization problem. We can obtain a poorly
matched q but reduce the computational cost by using an imperfect optimization
procedure, or by using a perfect optimization procedure over a restricted family
of q distributions.
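To make this tradeoff concrete, here is a minimal Python sketch (not from the
text) that evaluates the bound of equation 20.1 on a toy model with one binary
latent variable; the probability tables and the names p_h, p_v_given_h, and elbo
are invented purely for illustration. With q equal to the exact posterior the
bound matches log p(v); with a mismatched q, such as the uniform distribution
standing in for a restricted family, the bound is strictly looser but cheaper
to choose.

import numpy as np

# Toy model: binary latent h and binary visible v (probabilities are made up).
p_h = np.array([0.6, 0.4])                    # p(h)
p_v_given_h = np.array([[0.9, 0.1],           # p(v | h = 0)
                        [0.2, 0.8]])          # p(v | h = 1)

v = 1                                         # the observed value
p_joint = p_h * p_v_given_h[:, v]             # p(h, v) as a vector over h
log_p_v = np.log(p_joint.sum())               # exact log p(v)

def elbo(q):
    # L(v, q) = E_{h~q}[log p(h, v)] + H(q), as in equation 20.1.
    energy = np.sum(q * np.log(p_joint))
    entropy = -np.sum(q * np.log(q))
    return energy + entropy

q_exact = p_joint / p_joint.sum()             # q(h) = p(h | v): tight bound
q_uniform = np.array([0.5, 0.5])              # mismatched q: looser bound

print(log_p_v, elbo(q_exact))                 # equal up to rounding
print(elbo(q_uniform))                        # strictly below log p(v)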
20.2 Expectation Maximization
Expectation maximization (EM) is a popular training algorithm for models with
latent variables. It consists of alternating between two steps until convergence:
• The E-step (Expectation step): Set q(h^{(i)}) = p(h^{(i)} | v^{(i)}; θ) for all
indices i of the training examples v^{(i)} we want to train on (both batch and
minibatch variants are valid). By this we mean q is defined in terms of the
current value of θ; if we vary θ, then p(h | v; θ) will change, but q(h) will not.
• The M-step (Maximization step): Completely or partially maximize
∑_i L(v^{(i)}, θ, q) with respect to θ using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one
step, we maximize L with respect to q, and on the other, we maximize L with
respect to θ.
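As a concrete illustration of this coordinate ascent (a sketch under arbitrary
choices of data, initialization, and iteration count, not an implementation from
the text), the following Python code runs EM for a two-component univariate
Gaussian mixture, a family in which the M-step maximization has a closed form.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-dimensional data drawn from two Gaussian clusters (illustrative only).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# theta = (mixing weights, means, variances) of a two-component Gaussian mixture.
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

def log_gaussian(x, mean, var):
    # Log density of a univariate Gaussian, broadcast over components.
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

for _ in range(50):
    # E-step: q(h^{(i)}) = p(h^{(i)} | x^{(i)}; theta) for every training example i,
    # computed in log space and renormalized for numerical stability.
    log_joint = np.log(weights) + log_gaussian(x[:, None], means, variances)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    q = np.exp(log_joint)
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize sum_i L(x^{(i)}, theta, q) over theta; for this family the
    # maximizer is available in closed form.
    n_k = q.sum(axis=0)
    weights = n_k / len(x)
    means = (q * x[:, None]).sum(axis=0) / n_k
    variances = (q * (x[:, None] - means) ** 2).sum(axis=0) / n_k

print(weights, means, variances)  # approaches roughly (0.4, 0.6), (-2, 3), (1, 1)

The E-step is carried out in log space and renormalized per example for numerical
stability; the M-step reestimates the mixing weights, means, and variances from
the expected assignments q.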
Stochastic gradient ascent on latent variable models can be seen as a special
case of the EM algorithm where the M-step consists of taking a single gradient
step. Other variants of the EM algorithm can make much larger steps. For some
model families, the M-step can even be performed analytically, jumping all the
way to the optimal solution given the current q.
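To write this single gradient step explicitly: the entropy term H(q) does not
depend on θ, so one step of gradient ascent on ∑_i L(v^{(i)}, θ, q) amounts to

θ ← θ + ε ∇_θ ∑_i E_{h∼q(h^{(i)})}[log p(h, v^{(i)}; θ)],

where ε denotes a generic learning rate (a symbol introduced here only for
illustration) and q is held fixed at the value chosen by the most recent E-step.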