
in this chapter, some of which are just 100 lines away. Predictive sparse de-
composition (PSD) is a variant that combines sparse coding and a parametric
encoder (Kavukcuoglu et al., 2008b), i.e., it has both a parametric encoder and
iterative inference. It has been applied to unsupervised feature learning for ob-
ject recognition in images and video (Kavukcuoglu et al., 2009, 2010b; Jarrett
et al., 2009a; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The
representation is considered to be a free variable (possibly a latent variable if we
choose a probabilistic interpretation) and the training criterion combines a sparse
coding criterion with a term that encourages the optimized sparse representation
h (after inference) to be close to the output of the encoder f(x):
\[ L = \min_h \|x - g(h)\|^2 + \lambda |h|_1 + \gamma \|h - f(x)\|^2 \qquad (15.14) \]
where f is the encoder and g is the decoder. As in sparse coding, an iterative
optimization is performed for each example x in order to obtain a representation
h. However, because the iterations can be initialized from the output of the
encoder, i.e., with h = f(x), only a few steps (e.g., 10) are necessary to obtain
good results; the authors used simple gradient descent on h. Once h has been
obtained, both g and f are updated to minimize the above criterion. The first
two terms are the same as in L1 sparse coding, while the third encourages f to
predict the outcome of the sparse coding optimization, making it a good
initialization for the iterative inference. Hence f can be used as a parametric
approximation to the non-parametric encoder implicitly defined by sparse coding.
This is one of the first instances of learned approximate inference (see also
Sec. 19.6). Note that this is different from separately doing sparse coding
(i.e., training g) and then training an approximate inference mechanism f, since
the encoder and decoder are trained together to be “compatible” with each other.
Hence the decoder is learned in such a way that the iterative inference tends to
find solutions that the parametric encoder f can approximate well.
A similar example is the variational auto-encoder, in which the encoder
acts as approximate inference for the decoder, and both are trained jointly (Sec-
tion 20.9.3). See also Section 20.9.4 for a probabilistic interpretation of PSD in
terms of a variational lower bound on the log-likelihood.
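To make the procedure concrete, the following is a minimal NumPy sketch of one PSD
training step, assuming a linear decoder g(h) = W h and a tanh encoder
f(x) = tanh(V x + b); the parameter names (W, V, b), dimensions, step sizes and
number of inference steps are illustrative choices, not taken from the cited papers.

    import numpy as np

    rng = np.random.default_rng(0)
    n_x, n_h = 64, 128                 # input and code dimensions (arbitrary)
    lam, gamma = 0.1, 1.0              # sparsity and prediction weights in Eq. 15.14
    W = rng.normal(scale=0.1, size=(n_x, n_h))   # decoder: g(h) = W @ h
    V = rng.normal(scale=0.1, size=(n_h, n_x))   # encoder weights
    b = np.zeros(n_h)                            # encoder bias

    def f(x):
        # parametric encoder f(x)
        return np.tanh(V @ x + b)

    def psd_criterion(x, h):
        # Eq. 15.14: reconstruction + L1 sparsity + prediction penalty
        return (np.sum((x - W @ h) ** 2)
                + lam * np.sum(np.abs(h))
                + gamma * np.sum((h - f(x)) ** 2))

    def infer_h(x, n_steps=10, lr=0.01):
        # iterative inference: gradient descent on h, initialized at h = f(x)
        fx = f(x)
        h = fx.copy()
        for _ in range(n_steps):
            grad_h = (-2.0 * W.T @ (x - W @ h)
                      + lam * np.sign(h)
                      + 2.0 * gamma * (h - fx))
            h -= lr * grad_h
        return h

    def update_params(x, h, lr=1e-3):
        # with h fixed, take a gradient step on the decoder (W) and encoder (V, b)
        global W, V, b
        W += lr * 2.0 * np.outer(x - W @ h, h)
        fx = f(x)
        delta = 2.0 * gamma * (h - fx) * (1.0 - fx ** 2)   # backprop through tanh
        V += lr * np.outer(delta, x)
        b += lr * delta

    x = rng.normal(size=n_x)       # one training example
    h = infer_h(x)                 # a few inference steps suffice (see text)
    update_params(x, h)
    print(psd_criterion(x, h))     # at recognition time, h = f(x) is used directly

Note that in this sketch the encoder parameters receive a gradient only through the
prediction term γ||h − f(x)||^2, while the decoder is updated through the
reconstruction term, mirroring the decomposition of the criterion described above.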
In practical applications of PSD, the iterative optimization is only used during
training; at test time, the learned features are computed directly with the encoder
f. This makes computation fast at recognition time and also makes it easy to use
the trained encoder f as an initialization (unsupervised pre-training) for the lower
layers of a deep net. Like other unsupervised feature learning schemes, PSD can be
stacked greedily, e.g.,