
TODO synch with other sparse coding sections, make sure the other sections
talk about using gradient descent, feature sign, ISTA, etc.
This shows that the popular feature extraction strategy for sparse coding can
be justified as having a probabilistic interpretation: it can be viewed as MAP inference in
this probabilistic model. (There are other probabilistic models that yield the same
optimization problem, so we cannot positively identify this specific model from
the feature extraction process alone.)
Excitingly, MAP inference of h given v also has an interpretation in terms
of maximizing the evidence lower bound. Specifically, MAP inference maximizes
L with respect to q under the constraint that q take the form of a Dirac
distribution. During learning of sparse coding, we alternate between using convex
optimization to extract the codes and using convex optimization to update W
to achieve the optimal reconstruction given the codes. This turns out to be
equivalent to maximizing L with respect to θ for the q that was obtained from
MAP inference. The learning algorithm can thus be thought of as EM restricted to
using a Dirac posterior. In other words, rather than performing learning exactly
using standard inference, we learn to maximize a bound on the true likelihood,
using exact MAP inference.
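As a concrete illustration, here is a minimal sketch of this alternation, assuming the Laplace-prior sparse coding objective of the previous section (minimize λ‖H‖₁ + ‖V − HW‖² with examples in the rows of V) and using ISTA for the code step; the function names and the choice of λ are illustrative, not taken from the text.

import numpy as np

def soft_threshold(X, t):
    # Proximal operator of the L1 penalty.
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def map_codes_ista(V, W, lam, n_steps=200):
    # MAP inference of the codes H (the Dirac q): minimize
    # 0.5 * ||V - H W||_F^2 + lam * ||H||_1 over H with ISTA.
    H = np.zeros((V.shape[0], W.shape[0]))
    step = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-8)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_steps):
        grad = (H @ W - V) @ W.T                      # gradient of the quadratic term
        H = soft_threshold(H - step * grad, step * lam)
    return H

def update_dictionary(V, H):
    # Convex update of W given fixed codes: least-squares reconstruction.
    W, *_ = np.linalg.lstsq(H, V, rcond=None)
    return W

def learn_sparse_coding(V, n_code, lam=0.1, n_iters=20, seed=0):
    # EM-like alternation restricted to a Dirac posterior over the codes.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_code, V.shape[1]))
    for _ in range(n_iters):
        H = map_codes_ista(V, W, lam)   # "E-step": MAP inference of the codes
        W = update_dictionary(V, H)     # "M-step": optimal W given the codes
    return W, H

Any of the other solvers mentioned above (gradient descent, feature sign search) could replace the ISTA step; the EM interpretation only requires that the code step return a MAP code.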
19.4 Variational Inference and Learning
One common difficulty in probabilistic modeling is that the posterior distribution
p(h | v) is infeasible to compute for many models with hidden variables h and
visible variables v. Expectations with respect to this distribution may also be
intractable.
Consider as an example the binary sparse coding model. In this model, the
input v ∈ R^n is formed by adding Gaussian noise to the sum of m different
components, which can each be present or absent. Each component is switched
on or off by the corresponding hidden unit in h ∈ {0, 1}^m:

p(h_i = 1) = σ(b_i),
p(v | h) = N(v | Wh, β^{-1}),
where b is a learnable set of biases, W is a learnable weight matrix, and β
is a learnable, diagonal precision matrix.
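To make the model concrete, the following sketch evaluates its joint log-probability and, for very small m, the exact posterior by enumeration; the helper names and the storage of the diagonal precision β as a vector are assumptions for illustration, not from the text.

import numpy as np

def log_sigmoid(x):
    # Numerically stable log σ(x).
    return -np.logaddexp(0.0, -x)

def log_joint(v, h, W, b, beta):
    # log p(v, h) = log p(h) + log p(v | h) for binary sparse coding.
    # v: (n,), h: (m,) in {0, 1}, W: (n, m), b: (m,), beta: (n,) diagonal precision.
    log_prior = np.sum(h * log_sigmoid(b) + (1.0 - h) * log_sigmoid(-b))
    r = v - W @ h
    log_lik = 0.5 * np.sum(np.log(beta) - np.log(2.0 * np.pi) - beta * r ** 2)
    return log_prior + log_lik

def exact_posterior(v, W, b, beta):
    # p(h | v) by enumerating all 2**m configurations of h.
    # The 2**m cost is exactly why exact inference becomes infeasible as m grows.
    m = b.shape[0]
    hs = np.array([[(k >> i) & 1 for i in range(m)] for k in range(2 ** m)], dtype=float)
    log_p = np.array([log_joint(v, h, W, b, beta) for h in hs])
    p = np.exp(log_p - log_p.max())
    return hs, p / p.sum()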
Training this model with maximum likelihood requires taking the derivative
with respect to the parameters. Consider the derivative with respect to one of
the biases:
∂/∂b_i log p(v).
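Expanding this derivative with the standard manipulation (sketched here, not drawn verbatim from the text) shows why the posterior appears:

\frac{\partial}{\partial b_i} \log p(v)
= \frac{\partial}{\partial b_i} \log \sum_h p(h)\, p(v \mid h)
= \frac{\sum_h p(v \mid h)\, \frac{\partial}{\partial b_i} p(h)}{\sum_h p(h)\, p(v \mid h)}
= \sum_h p(h \mid v)\, \frac{\partial}{\partial b_i} \log p(h)
= \mathbb{E}_{h \sim p(h \mid v)} \left[ \frac{\partial}{\partial b_i} \log p(h) \right],

an expectation with respect to p(h | v) over all 2^m configurations of h, which is what makes exact maximum likelihood learning of this model difficult.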