We can thus think of inference as the procedure for finding the q that maximizes
L. Exact inference maximizes L perfectly. Throughout this chapter, we will show how
many forms of approximate inference are possible. No matter what choice of q we use,
L will give us a lower bound on the log-likelihood log p(v). We can get tighter or looser bounds that
are cheaper or more expensive to compute depending on how we choose to approach this
optimization problem. We can obtain a poorly matched q but reduce the computational
cost by using an imperfect optimization procedure, or by using a perfect optimization
procedure over a restricted family of q distributions.
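As a reminder of the bound in question, L admits the standard decomposition used to define the evidence lower bound earlier in the book (notation as there):

\[
L(v, \theta, q)
  = \mathbb{E}_{h \sim q}\big[\log p(h, v; \theta)\big] + H(q)
  = \log p(v; \theta) - D_{\mathrm{KL}}\big(q(h \mid v) \,\|\, p(h \mid v; \theta)\big),
\]

so for a fixed θ, maximizing L over q is the same as minimizing the KL divergence from q to the true posterior, and the bound is tight exactly when q(h | v) = p(h | v; θ).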
20.2 Expectation Maximization
Expectation maximization (EM) is a popular training algorithm for models with latent
variables. It consists of alternating between two steps until convergence:
• The E-step (Expectation step): Set q(h^(i)) = p(h^(i) | v^(i); θ) for all indices i of the
training examples v^(i) we want to train on (both batch and minibatch variants are
valid). By this we mean q is defined in terms of the current value of θ; if we vary
θ, then p(h | v; θ) will change but q(h) will not.
• The M-step (Maximization step): Completely or partially maximize ∑_i L(v^(i), θ, q)
with respect to θ using your optimization algorithm of choice.
This can be viewed as a coordinate ascent algorithm to maximize L. On one step,
we maximize L with respect to q, and on the other, we maximize L with respect to θ.
Stochastic gradient ascent on latent variable models can be seen as a special case
of the EM algorithm where the M-step consists of taking a single gradient step. Other
variants of the EM algorithm can make much larger steps. For some model families, the
M-step can even be performed analytically, jumping all the way to the optimal solution
given the current q.
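To make the alternation concrete, the following is a minimal sketch, not taken from the text, of EM for a mixture of k isotropic Gaussians, a model family whose M-step has a closed-form solution; the function name em_gmm and all variable names are illustrative.

import numpy as np

def em_gmm(v, k, n_iters=100, seed=0):
    """Minimal EM for a mixture of k isotropic Gaussians over data v of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = v.shape
    # theta = (mixing proportions pi, means mu, isotropic variances var)
    pi = np.full(k, 1.0 / k)
    mu = v[rng.choice(n, size=k, replace=False)]
    var = np.full(k, v.var())
    for _ in range(n_iters):
        # E-step: q(h^(i) = j) = p(h^(i) = j | v^(i); theta), held fixed during the M-step.
        sq_dist = ((v[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # (n, k)
        log_joint = (np.log(pi)
                     - 0.5 * d * np.log(2 * np.pi * var)
                     - 0.5 * sq_dist / var)                              # log p(h = j, v^(i); theta)
        log_joint -= log_joint.max(axis=1, keepdims=True)                # numerical stability
        q = np.exp(log_joint)
        q /= q.sum(axis=1, keepdims=True)                                # posterior responsibilities
        # M-step: maximize sum_i L(v^(i), theta, q) over theta, available in closed form here.
        nk = q.sum(axis=0)                                               # effective count per component
        pi = nk / n
        mu = (q.T @ v) / nk[:, None]
        sq_dist = ((v[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        var = (q * sq_dist).sum(axis=0) / (nk * d)
    return pi, mu, var

In the stochastic gradient view described above, the closed-form M-step updates would instead be replaced by one or a few gradient steps on ∑_i L(v^(i), θ, q) with q held fixed.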
Even though the E-step involves exact inference, we can think of the EM algorithm
as using approximate inference in some sense. Specifically, the M-step assumes that the
same value of q can be used for all values of θ. This introduces a gap between L
and the true log p(v) as the M-step moves θ further and further away from the value for
which q was computed. Fortunately, the E-step reduces the gap to zero again at the start
of the next pass through the loop.
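To state the gap explicitly (using the same decomposition as above), at any point during the M-step we have

\[
\log p(v; \theta) - L(v, \theta, q) = D_{\mathrm{KL}}\big(q(h \mid v) \,\|\, p(h \mid v; \theta)\big),
\]

which is zero immediately after the E-step, because q was set to the exact posterior at the current value of θ, and which generally grows as the M-step moves θ away from that value.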
The EM algorithm is a workhorse of classical machine learning, and it is used in deep
learning in the sense that stochastic gradient ascent can be seen as EM with a very
simple and small M-step. However, because L cannot be maximized analytically for many
interesting deep models, the more general EM framework as a whole is typically not
explored in the deep learning research community.
TODO–cite the emview paper
20.3 MAP Inference: Sparse Coding as a Probabilistic Model
TODO synch up with other sections on sparse coding