With regularized auto-encoders such as sparse auto-encoders and contractive auto-encoders, the regularizer instead corresponds to a log-prior over the representation, or over latent variables. In the case of sparse auto-encoders, predictive sparse decomposition and contractive auto-encoders, the regularizer specifies a preference over functions of the data rather than over parameters. This makes such a regularizer data-dependent, unlike the classical parameter log-prior. Specifically, in the case of the sparse auto-encoder, it expresses a preference for encoders whose outputs are closer to 0. Indirectly (when we marginalize over the training distribution), this of course also indicates a preference over parameters.
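For instance, a classical weight decay penalty λ||θ||^2 depends only on the parameters θ and is the same for every input, whereas the sparsity penalty λ|h|_1 with h = f(x) is evaluated at the encoder output and therefore varies with the data seen during training.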
13.3.2 Predictive Sparse Decomposition
Predictive Sparse Decomposition (PSD) is a variant that combines sparse coding and an
auto-encoder (Kavukcuoglu et al., 2008b). It has been applied to unsupervised feature
learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010;
Jarrett et al., 2009a; Farabet et al., 2011), as well as for audio (Henaff et al., 2011).
The representation is considered to be a free variable (possibly a latent variable if we
choose a probabilistic interpretation) and the training criterion combines a sparse coding
criterion with a term that encourages the optimized sparse representation h to be close
to the output of the encoder f(x):
    L = min_h ||x − g(h)||^2 + λ|h|_1 + γ||h − f(x)||^2                    (13.8)
where f is the encoder and g is the decoder. As in sparse coding, an iterative optimization is performed for each example x in order to obtain a representation h. However, because the iterations can be initialized from the output of the encoder, i.e., with h = f(x), only a few steps (e.g., 10) are necessary to obtain good results; the authors used simple gradient descent on h. Once h is settled, both g and f are updated towards minimizing the above criterion (sketched in code below). The first two terms are the same as in L1 sparse coding, while the third encourages f to predict the outcome of the sparse coding optimization, making it a better choice for the initialization of the iterative optimization.
Hence f can be used as a parametric approximation to the non-parametric encoder implicitly defined by sparse coding. It is one of the first instances of learned approximate inference (see also Section 16.6). Note that this is different from separately doing sparse coding (i.e., training g) and then training an approximate inference mechanism f, since the encoder and decoder are trained together to be “compatible” with each other. As a result, the decoder will be learned in such a way that inference tends to find solutions that can be well approximated by the approximate inference. A similar example is the
variational auto-encoder, in which the encoder acts as approximate inference for the
decoder, and both are trained jointly (Section 17.7.1). See also Section 17.7.2 for a
probabilistic interpretation of PSD in terms of a variational lower bound on the log-
likelihood.
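To make the procedure concrete, the following sketch implements one PSD training step in numpy, following Eq. 13.8. The linear encoder and decoder, the learning rates, and the variable names are illustrative assumptions for this sketch, not the exact architecture or hyper-parameters of Kavukcuoglu et al.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 64, 128
lam, gamma = 0.5, 1.0                              # lambda and gamma of Eq. 13.8
W_e = 0.1 * rng.standard_normal((n_hid, n_in))     # encoder parameters, f(x) = W_e x
W_d = 0.1 * rng.standard_normal((n_in, n_hid))     # decoder parameters, g(h) = W_d h

def f(x):                                          # parametric encoder
    return W_e @ x

def g(h):                                          # decoder
    return W_d @ h

def psd_loss(x, h):                                # Eq. 13.8 for one example
    return (np.sum((x - g(h)) ** 2)
            + lam * np.sum(np.abs(h))
            + gamma * np.sum((h - f(x)) ** 2))

def psd_step(x, n_inner=10, lr_h=0.01, lr_theta=0.001):
    global W_e, W_d
    # Inference: a few gradient steps on h, initialized at the encoder output f(x).
    h = f(x)
    for _ in range(n_inner):
        grad_h = (-2.0 * W_d.T @ (x - g(h))        # from ||x - g(h)||^2
                  + lam * np.sign(h)               # subgradient of lambda * |h|_1
                  + 2.0 * gamma * (h - f(x)))      # from gamma * ||h - f(x)||^2
        h = h - lr_h * grad_h
    loss_value = psd_loss(x, h)
    # Learning: with h held fixed, update decoder and encoder on the same criterion.
    W_d = W_d - lr_theta * (-2.0 * np.outer(x - g(h), h))
    W_e = W_e - lr_theta * (-2.0 * gamma * np.outer(h - f(x), x))
    return h, loss_value

x = rng.standard_normal(n_in)                      # one training example
h, L_value = psd_step(x)

Initializing h at f(x) is what makes a small, fixed number of inner steps sufficient; a cold start, as in ordinary sparse coding, would require many more iterations.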
In practical applications of PSD, the iterative optimization is only used during training; f alone is used to compute the learned features (e.g., to pre-train a supervised network).
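Continuing the sketch above, test-time feature extraction then amounts to a single forward pass through the learned encoder, with no inner optimization loop:

x_test = rng.standard_normal(n_in)                 # a hypothetical test example
features = f(x_test)                               # learned features for the downstream task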