Chapter 16
Linear Factor Models and
Auto-Encoders
Linear factor models are generative unsupervised learning models in which we imag-
ine that some unobserved factors h explain the observed variables x through a linear
transformation. Auto-encoders are unsupervised learning methods that learn a representation of the data, typically obtained by a non-linear parametric transformation of the data, i.e., from x to h, most often via a feedforward neural network, but not necessarily. Like the linear factor models, they also learn a transformation going backwards from the representation to the data, from h to x. Linear factor models therefore only specify
a parametric decoder, whereas auto-encoders also specify a parametric encoder. Some
linear factor models, like PCA, actually correspond to an auto-encoder (a linear one),
but for others the encoder is implicitly defined via an inference mechanism that searches
for an h that could have generated the observed x.
The idea of auto-encoders has been part of the historical landscape of neural networks
for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994) but
has really picked up speed in recent years. Auto-encoders remained somewhat marginal for
many years, in part due to what was an incomplete understanding of the mathematical
interpretation and geometrical underpinnings of auto-encoders, which are developed
further in Chapter 18 and Section 21.10.
An auto-encoder is simply a neural network that tries to copy its input
to its output. The architecture of an auto-encoder is typically decomposed into the
following parts, illustrated in Figure 16.1:
• an input, x
• an encoder function f
• a code or internal representation h = f(x)
• a decoder function g
• an output, also called reconstruction, r = g(h) = g(f(x))
Figure 16.1: General schema of an auto-encoder, mapping an input x to an output
(called reconstruction) r through an internal representation or code h. The auto-encoder
has two components: the encoder f (mapping x to h) and the decoder g (mapping h
to r).
• a loss function L computing a scalar L(r, x) measuring how good a reconstruction r is of the given input x. The objective is to minimize the expected value of L over the training set of examples {x}.
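To make the schema above concrete, here is a minimal sketch of an auto-encoder with these parts, written in PyTorch (the framework choice, dimensions and hyper-parameters are illustrative assumptions, not prescribed by the text):

```python
# A minimal sketch of the encoder/decoder/reconstruction-loss structure described
# above. All sizes and hyper-parameters are illustrative only.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # encoder f: maps input x to code h
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.Sigmoid())
        # decoder g: maps code h back to a reconstruction r
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)          # h = f(x)
        r = self.decoder(h)          # r = g(h) = g(f(x))
        return r, h

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # squared reconstruction error L(r, x)

x = torch.rand(64, 784)              # a stand-in minibatch of training examples
for step in range(100):
    r, h = model(x)
    loss = loss_fn(r, x)             # minimize expected L(r, x) over the data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Without a constraint such as a small code dimension, nothing prevents such a network from learning a trivial copy of its input, which is the issue taken up in the next section.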
16.1 Regularized Auto-Encoders
Predicting the input may sound useless: what could prevent the auto-encoder from simply copying its input into its output? In the 20th century, this was avoided by constraining the architecture of the auto-encoder, forcing the dimension of the code h to be smaller than the dimension of the input x.
Figure 16.2 illustrates the two typical cases of auto-encoders: undercomplete vs over-
complete, i.e., with the dimension of the representation h respectively smaller vs larger
than the input x. Whereas early work with auto-encoders, just like PCA, used undercompleteness – i.e. a bottleneck in the sequence of layers – to avoid learning the identity
function, more recent work allows overcomplete representations. What we have learned
in recent years is that it is possible to make the auto-encoder meaningfully capture the
structure of the input distribution even if the representation is overcomplete, with other
forms of constraint or regularization. In fact, once you realize that auto-encoders can
Figure 16.2: Left: undercomplete representation (dimension of code h is less than di-
mension of input x). Right: overcomplete representation. Overcomplete auto-encoders
require some other form of regularization (instead of the constraint on the dimension of
h) to avoid the trivial solution where r = x for all x.
capture the input distribution (indirectly, not as an explicit probability function), you also realize that it should need more capacity as one increases the complexity of the distribution to be captured (and the amount of data available): it should not be limited by the input dimension. This is a problem in particular with shallow auto-encoders, which have a single hidden layer (for the code). Indeed, that hidden layer size controls both the dimensionality reduction constraint (the code size at the bottleneck) and the capacity (which allows a more complex distribution to be learned).
Besides the bottleneck constraint, alternative constraints or regularization methods have been explored that can guarantee that the auto-encoder does something useful and does not just learn a trivial identity-like function:
• Sparsity of the representation or of its derivative: even if the intermediate representation has a very high dimensionality, the effective local dimensionality (number of degrees of freedom that capture a coordinate system among the probable x's) could be much smaller if most of the elements of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to zero). When ||∂h_i/∂x|| is close to zero, h_i does not participate in encoding local changes in x. There is a geometrical interpretation of this situation in terms of manifold learning that is discussed in more depth in Chapter 18. The discussion in Chapter 17 also explains how an auto-encoder naturally tends towards learning a coordinate system for the actual factors of variation in the data. At least four types of “auto-encoders” clearly fall in this category of sparse representation:
Sparse coding (Olshausen and Field, 1996) has been heavily studied as an unsupervised feature learning and feature inference mechanism. It is a linear factor model rather than an auto-encoder, because it has no explicit parametric encoder, and instead uses an iterative inference procedure to compute the code. Sparse coding looks for representations that are both sparse and explain the input through the decoder. Instead of being a parametric function of the input, the code is considered a free variable that is obtained through an optimization, i.e., a particular form of inference:

h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (16.1)

where L is the reconstruction loss, f the (non-parametric) encoder, g the (parametric) decoder, Ω(h) is a sparsity regularizer, and in practice the minimization can be approximate (a minimal sketch of such an iterative inference procedure is given at the end of this list). Sparse coding has a manifold or geometric interpretation that is discussed in Section 16.7. It also has an interpretation as a directed graphical model, described in more detail in Section 20.3. To achieve sparsity, the objective function to optimize includes a term that is minimized when the representation has many zero or near-zero values, such as the L1 penalty |h|_1 = Σ_i |h_i|.
An interesting variation of sparse coding combines the freedom to choose the
representation through optimization and a parametric encoder. It is called
predictive sparse decomposition (PSD) (Kavukcuoglu et al., 2008a) and
is briefly described in Section 16.7.2.
At the other end of the spectrum are simply sparse auto-encoders, which combine the standard auto-encoder schema with a sparsity penalty that encourages the output of the encoder to be sparse. These are described in Section 16.7.1. Besides the L1 penalty, other sparsity penalties that have been explored include the Student-t penalty (Olshausen and Field, 1996; Bergstra, 2011),

Σ_i log(1 + α²h_i²)

(i.e. where αh_i has a Student-t prior density) and the KL-divergence penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008a),

−Σ_i ( t log h_i + (1 − t) log(1 − h_i) ),

with a target sparsity level t, for h_i ∈ (0, 1), e.g. obtained through a sigmoid non-linearity.
Contractive auto-encoders (Rifai et al., 2011b), covered in Section 16.9, explicitly penalize ||∂h/∂x||²_F, i.e., the sum of the squared norms of the vectors ∂h_i(x)/∂x (each indicating how much hidden unit h_i responds to changes in x and what direction of change in x that unit is most sensitive to, around a particular x). With such a regularization penalty, the auto-encoder is called contractive¹ because the mapping from input x to representation h is encouraged to be contractive, i.e., to have small derivatives in all directions. Note that a sparsity regularization indirectly leads to a contractive mapping as well, when the non-linearity used happens to have a zero derivative at h_i = 0 (which is the case for the sigmoid non-linearity).
¹ A function f(x) is contractive if ||f(x) − f(y)|| < ||x − y|| for nearby x and y, or equivalently if the norm of its derivative satisfies ||f′(x)|| < 1.
• Robustness to injected noise or missing information: if noise is injected
in inputs or hidden units, or if some inputs are missing, while the neural network
is asked to reconstruct the clean and complete input, then it cannot simply learn
the identity function. It has to capture the structure of the data distribution
in order to optimally perform this reconstruction. Such auto-encoders are called
denoising auto-encoders and are discussed in more detail in Section 16.8. There is
a tight connection between the denoising auto-encoders and the contractive auto-
encoders: it can be shown (Alain and Bengio, 2013) that in the limit of small
Gaussian injected input noise, the denoising reconstruction error is equivalent to
a contractive penalty on the reconstruction function that maps x to r = g(f(x)).
In other words, since both x and x + ε (where ε is some small noise vector) must yield the same target output x, the reconstruction function is encouraged to be insensitive to changes in all directions ε. The only thing that prevents
reconstruction r from simply being a constant (completely insensitive to the input
x), is that one also has to reconstruct correctly for different training examples x.
However, the auto-encoder can learn to be approximately constant around training
examples x while producing a different answer for different training examples. As
discussed in Section 18.3, if the examples are near a low-dimensional manifold, this
encourages the representation to vary only on the manifold and be locally constant
in directions orthogonal to the manifold, i.e., the representation locally captures
a (not necessarily Euclidean, not necessarily orthogonal) coordinate system for
the manifold. In addition to the denoising auto-encoder, the variational auto-
encoder (Section 21.8.2) and the generative stochastic networks (Section 21.10)
also involve the injection of noise, but typically in the representation-space itself,
thus introducing the notion of h as a latent variable.
• Pressure of a Prior on the Representation: an interesting way to generalize the notion of regularization applied to the representation is to introduce in the cost function for the auto-encoder a log-prior term

−log P(h)

which captures the assumption that we would like to find a representation that has a simple distribution (if P(h) has a simple form, such as a factorized distribution²), or at least one that is simpler than the original data distribution. Among all the encoding functions f, we would like to pick one that
1. can be inverted (easily), and this is achieved by minimizing some reconstruc-
tion loss, and
2. yields representations h whose distribution is “simpler”, i.e., can be captured
with less capacity than the original training distribution itself.
The sparse variants described above clearly fall in that framework. The varia-
tional auto-encoder (Section 21.8.2) provides a clean mathematical framework for
justifying the above pressure of a top-level prior when the objective is to model
the data generating distribution.
² All the sparse priors we have described correspond to a factorized distribution.
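As mentioned in the sparse coding item above, the minimization in Eq. 16.1 is usually carried out iteratively. The following is a minimal sketch of one standard way to do it, a proximal gradient (ISTA) update, for a squared reconstruction loss with an L1 penalty; the dictionary, step size and number of iterations are illustrative assumptions, not the specific procedure of any paper cited here:

```python
# A minimal sketch of iterative inference for Eq. 16.1 with L(g(h), x) = ||x - W h||^2
# and Omega(h) = |h|_1, using ISTA (gradient step followed by soft-thresholding).
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, lam = 20, 50, 0.5

W = rng.normal(size=(d_x, d_h)) / np.sqrt(d_x)    # decoder ("dictionary") weights
x = rng.normal(size=d_x)                          # an input to encode

step = 1.0 / np.linalg.norm(W, 2) ** 2            # step size from a Lipschitz bound
h = np.zeros(d_h)                                 # the code, a free variable
for _ in range(200):
    grad = W.T @ (W @ h - x)                      # gradient of 0.5 ||x - W h||^2
    z = h - step * grad
    h = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold

print("nonzero code elements:", int(np.sum(h != 0.0)), "out of", d_h)
print("reconstruction error :", float(np.sum((x - W @ h) ** 2)))
```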
From the point of view of regularization (Chapter 7), adding the −log P(h) term to the objective function (e.g. for encouraging sparsity) or adding a contractive penalty does not fit the traditional view of a prior on the parameters. Instead, the prior on the latent variables acts as a data-dependent prior, in the sense that it depends on the
particular values h that are going to be sampled (usually from a posterior or an encoder),
based on the input example x. Of course, indirectly, this is also a regularization on the
parameters, but one that depends on the particular data distribution.
16.2 Representational Power, Layer Size and Depth
Nothing in the above description of auto-encoders restricts the encoder or decoder to be shallow, but in the literature on the subject, most trained auto-encoders have had a single hidden layer which is also the representation layer or code³.
³ As argued in this book, this is probably not a good choice: we would like to independently control the constraints on the representation (e.g., dimension and sparsity of the code) and the capacity of the encoder.
For one, we know from the usual universal approximation abilities of single hidden-layer neural networks that a sufficiently large hidden layer can approximate any function to a given accuracy. This observation justifies overcomplete auto-encoders: in order to represent a rich enough distribution, one probably needs many hidden units in the inter-
mediate representation layer. We also know that Principal Components Analysis (PCA)
corresponds to an undercomplete auto-encoder with no intermediate non-linearity, and
that PCA can only capture a set of directions of variation that are the same every-
where in space. This notion is discussed in more detail in Chapter 18 in the context of
manifold learning.
For two, it has also been reported many times that training a deep neural network,
and in particular a deep auto-encoder (i.e. with a deep encoder and a deep decoder) is
more difficult than training a shallow one. This was actually a motivation for the initial
work on the greedy layerwise unsupervised pre-training procedure, described below in
Section 17.1, by which we only need to train a series of shallow auto-encoders in order to
initialize a deep auto-encoder. It was shown early on (Hinton and Salakhutdinov, 2006)
that, if trained properly, such deep auto-encoders could yield much better compression
than corresponding shallow or linear auto-encoders (which are basically doing the same
as PCA, see Section 16.5 below). As discussed in Section 17.7, deeper architectures
can be in some cases exponentially more efficient (both in terms of computation and
statistically) than shallow ones. However, because we can usefully pre-train a deep net
by training and stacking shallow ones, it remains interesting to consider single-layer (or at least shallow and easy-to-train) auto-encoders, as has been done in most of the
literature discussed in this chapter.
16.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss L) make
sense when the loss L is simply the squared reconstruction error, but there are many
cases where this is not appropriate, e.g., when x is a vector of discrete variables or
when P(x|h) is not well approximated by a Gaussian distribution⁴. Just like in the
case of other types of neural networks (starting with the feedforward neural networks,
Section 6.2.2), it is convenient to define the loss L as a negative log-likelihood over
some target random variables. This probabilistic interpretation is particularly impor-
tant for the discussion in Sections 21.8.2, 21.9 and 21.10 about generative extensions of
auto-encoders and stochastic recurrent networks, where the output of the auto-encoder
is interpreted as a probability distribution P (x|h), for reconstructing x, given hid-
den units h. This distribution captures not just the expected reconstruction but also
the uncertainty about the original x (which gave rise to h, either deterministically or
stochastically, given h). In the simplest and most ordinary cases, this distribution factorizes, i.e., P(x|h) = Π_i P(x_i|h). This covers the usual cases of x_i|h being Gaussian (for unbounded real values) and x_i|h having a Bernoulli distribution (for binary values x_i), but one can readily generalize this to other distributions, such as mixtures (see Sections 3.10.5 and 6.2.2).
Thus we can generalize the notion of decoding function g(h) to decoding distribution
P (x|h). Similarly, we can generalize the notion of encoding function f(x) to encoding
distribution Q(h|x), as illustrated in Figure 16.3. We use this to capture the fact
that noise is injected at the level of the representation h, now considered like a latent
variable. This generalization is crucial in the development of the variational auto-encoder
(Section 21.8.2) and the generalized stochastic networks (Section 21.10).
We also find a stochastic encoder and a stochastic decoder in the RBM, described
in Section 21.1. In that case, the encoding distribution Q(h|x) and P (x|h) “match”, in
the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distribution which has both
Q(h|x) and P (x|h) as conditionals. This is not true in general for two independently
parametrized conditionals like Q(h|x) and P (x|h), although the work on generative
stochastic networks (Alain et al., 2015) shows that learning will tend to make them
compatible asymptotically (with enough capacity and examples).
⁴ See the link between squared error and normal density in Sections 5.6 and 6.2.2.
Figure 16.3: Basic scheme of a stochastic auto-encoder, in which both the encoder and
the decoder are not simple functions but instead involve some noise injection, meaning
that their output can be seen as sampled from a distribution, Q(h|x) for the encoder
and P(x|h) for the decoder. RBMs are a special case where P = Q (in the sense of a unique joint distribution corresponding to both conditionals) but in general these two distributions are not necessarily conditional distributions compatible with a unique joint distribution
P (x, h).
16.4 Linear Factor Models
Now that we have introduced the notion of a probabilistic decoder, let us focus on a
very special case where the latent variable h generates x via a linear transformation plus
noise, i.e., classical linear factor models, which do not necessarily have a corresponding
parametric encoder.
The idea of discovering explanatory factors that have a simple joint distribution among themselves is old, e.g., factor analysis (see below), and was first explored in the context where the relationship between factors and data is linear, i.e., we assume that the data was generated as follows. First, sample the real-valued factors,

h ∼ P(h),    (16.2)

and then sample the real-valued observable variables given the factors:

x = W h + b + noise    (16.3)

where the noise is typically Gaussian and diagonal (independent across dimensions).
This is illustrated in Figure 16.4.
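A minimal sketch of sampling from this generative process (Eqs. 16.2 and 16.3), assuming a standard Gaussian prior over h and diagonal Gaussian noise, is given below; all dimensions and parameter values are illustrative:

```python
# Sampling from the linear factor model x = W h + b + noise, with h ~ N(0, I)
# and diagonal Gaussian noise (the factor-analysis case). Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, n = 3, 5, 1000           # latent dim, observed dim, number of samples

W = rng.normal(size=(d_x, d_h))    # factor loadings
b = rng.normal(size=d_x)           # offset
sigma = 0.1 * np.ones(d_x)         # per-dimension noise standard deviations

h = rng.normal(size=(n, d_h))              # h ~ P(h) = N(0, I)          (16.2)
noise = rng.normal(size=(n, d_x)) * sigma
x = h @ W.T + b + noise                    # x = W h + b + noise         (16.3)

# Empirically, Cov[x] should be close to W W^T + diag(sigma^2).
print(np.round(np.cov(x, rowvar=False) - (W @ W.T + np.diag(sigma**2)), 1))
```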
16.5 Probabilistic PCA and Factor Analysis
Probabilistic PCA (Principal Components Analysis) and factor analysis are both special
cases of the above equations (16.2 and 16.3) and only differ in the choices made for the prior (over latent variables, not parameters) and the noise distribution.
Figure 16.4: Basic scheme of a linear factor model, in which we assume that an observed
data vector x is obtained by a linear combination of latent factors h, plus some noise.
Different models, such as probabilistic PCA, factor analysis or ICA, make different
choices about the form of the noise and of the prior P (h).
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

h ∼ N(0, I)

while the observed variables x_i are assumed to be conditionally independent, given h, i.e., the noise is assumed to come from a diagonal covariance Gaussian distribution, with covariance matrix ψ = diag(σ²), where σ² = (σ_1², σ_2², ...) is a vector of per-variable variances.
The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a Gaussian-distributed (multivariate normal) random variable, with

x ∼ N(b, W W^T + ψ)

where we see that the weights W induce a dependency between two variables x_i and x_j through a kind of auto-encoder path, whereby x_i influences ĥ_k = W_k x via w_ki (for every k) and ĥ_k influences x_j via w_kj.
In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances σ_i² equal to each other. In that case the covariance of x is just W W^T + σ²I, where σ² is now a scalar, i.e.,

x ∼ N(b, W W^T + σ²I)

or equivalently

x = W h + b + σz

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².
What the probabilistic PCA model is basically saying is that the covariance is mostly captured by the latent variables h, up to some small residual reconstruction error σ².
As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In
that case, the conditional expected value of h given x becomes an orthogonal projection
onto the space spanned by the d columns of W , like in PCA. See Section 18.1 for a
discussion of the “inference” mechanism associated with PCA (probabilistic or not),
i.e., recovering the expected value of the latent factors h
i
given the observed input x.
That section also explains the very insightful geometric and manifold interpretation of
PCA.
However, as σ → 0, the density model becomes very sharp around these d dimensions spanned by the columns of W, as discussed in Section 18.1, which would not make it a
very faithful model of the data, in general (not just because the data may live on a
higher-dimensional manifold, but more importantly because the real data manifold may
not be a flat hyperplane - see Chapter 18 for more).
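For illustration, the maximum likelihood parameters of probabilistic PCA can also be obtained in closed form from the eigendecomposition of the sample covariance (Tipping and Bishop, 1999), rather than by EM. The sketch below follows that recipe; the stand-in data and the choice of d are illustrative assumptions:

```python
# A sketch of the closed-form maximum likelihood estimate for probabilistic PCA.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # stand-in data
d = 2                                                      # number of components

b = X.mean(axis=0)
S = np.cov(X - b, rowvar=False)                            # sample covariance
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]         # sort descending

# Residual variance: average of the discarded eigenvalues.
sigma2 = eigvals[d:].mean()
# Loadings: leading eigenvectors scaled by the variance they explain beyond sigma2
# (up to an arbitrary rotation, omitted here).
W = eigvecs[:, :d] * np.sqrt(eigvals[:d] - sigma2)

# The model covariance W W^T + sigma^2 I approximates S in the leading directions.
print(np.round(W @ W.T + sigma2 * np.eye(X.shape[1]) - S, 2))
```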
16.5.1 ICA
Independent Component Analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling linear factors that seeks non-Gaussian projections of the data. Like probabilistic PCA and factor analysis, it also fits the linear factor model of Eqs. 16.2 and 16.3. What is particular about ICA is that unlike PCA and factor analysis it does not assume that the latent variable prior is Gaussian. It only assumes that it is factorized, i.e.,

P(h) = Π_i P(h_i).    (16.4)
Since there is no parametric assumption behind the prior, we really have a so-called semi-parametric model, with parts of the model being parametric (P(x|h)) and parts being non-specified or non-parametric (P(h)). In practice, this typically calls for non-Gaussian priors: if the prior were Gaussian, then one could not distinguish between the factors h and a rotation of h. Indeed, note that if

h = U z

with U an orthonormal (rotation) square matrix, i.e.,

z = U^T h,

then, although h might have a N(0, I) distribution, the z also have unit covariance, i.e., they are uncorrelated:

Var[z] = E[z z^T] = E[U^T h h^T U] = U^T Var[h] U = U^T U = I.
In other words, imposing independence among Gaussian factors does not allow one to disentangle them: we could just as well recover any rotation of these factors. It means that, given the observed x, even though we might assume the right generative model, PCA cannot recover the original generative factors. However, if we assume that the latent variables are non-Gaussian, then we can recover them, and this is what ICA is trying to achieve. Indeed, under these generative model assumptions, the true underlying factors can be recovered (Comon, 1994). Many ICA algorithms therefore look for projections of the data s = V x that are maximally non-Gaussian. An intuitive explanation for these approaches is that although the true latent variables h may be non-Gaussian, almost any linear combination of them will look more Gaussian, because of the central limit theorem. Since linear combinations of the x_i's are also linear combinations of the h_j's, to recover the h_j's we just need to find the linear combinations that are maximally non-Gaussian (while keeping these different projections orthogonal to each other).
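The following small numerical illustration (not an ICA algorithm) shows the central-limit-theorem intuition above: linearly mixing independent non-Gaussian sources yields observations whose marginals look more Gaussian, as measured here by excess kurtosis. All quantities are illustrative assumptions:

```python
# Independent non-Gaussian (Laplace) sources become "more Gaussian" when mixed.
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(v):
    v = (v - v.mean()) / v.std()
    return (v**4).mean() - 3.0          # 0 for a Gaussian, about 3 for Laplace

n, d = 100000, 4
h = rng.laplace(size=(n, d))            # independent, sparse (non-Gaussian) sources
W = rng.normal(size=(d, d))             # arbitrary mixing matrix
x = h @ W.T                             # observed mixtures

print("sources :", np.round([excess_kurtosis(h[:, i]) for i in range(d)], 2))
print("mixtures:", np.round([excess_kurtosis(x[:, i]) for i in range(d)], 2))
# Many ICA algorithms search (after whitening) for projections that maximize
# such a non-Gaussianity measure, thereby undoing the mixing.
```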
There is an interesting connection between ICA and sparsity, since the dominant
form of non-Gaussianity in real data is due to sparsity, i.e., concentration of probability
at or near 0. Non-Gaussian distributions typically have more mass around zero, although one can also get non-Gaussianity by increasing skewness (asymmetry) or kurtosis.
Just as PCA can be generalized to the non-linear auto-encoders described later in this chapter, ICA can be generalized to a non-linear generative model, e.g., x = f(h) + noise. See Hyvärinen and Pajunen (1999) for the initial work on non-linear ICA and its successful use with ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000).
16.5.2 Sparse Coding as a Generative Model
One particularly interesting form of non-Gaussianity arises with distributions that are sparse. These typically have not just a peak at 0 but also a fat tail⁵. Like the other linear factor models (Eq. 16.3), sparse coding corresponds to a linear factor model, but one with a “sparse” latent variable h, i.e., P(h) puts high probability at or around 0. Unlike with ICA (previous section), the latent variable prior is parametric. For example, the factorized Laplace density prior is

P(h) = Π_i P(h_i) = Π_i (λ/2) e^{−λ|h_i|}    (16.5)
and the factorized Student-t prior is

P(h) = Π_i P(h_i) ∝ Π_i 1/(1 + h_i²/ν)^{(ν+1)/2}.    (16.6)
Both of these densities have a strong preference for near-zero values but, unlike the Gaussian, accommodate large values. In the standard sparse coding models, the reconstruction noise is assumed to be Gaussian, so that the corresponding reconstruction error is the squared error.
⁵ With probability going to 0, as the values increase in magnitude, at a rate that is slower than for the Gaussian, i.e., less than quadratic in the log-domain.
Regarding sparsity, note that the actual value h_i = 0 has zero measure under both densities, meaning that the posterior distribution P(h|x) will not generate values h = 0. However, sparse coding is normally considered under a maximum a posteriori (MAP) inference framework, in which the inferred values of h are those that maximize the posterior, and these are often exactly zero if the prior is sufficiently concentrated around 0. The inferred values are those defined in Eq. 16.1, reproduced here,

h = f(x) = arg min_h L(g(h), x) + λΩ(h)

where L(g(h), x) is interpreted as −log P(x|g(h)) and Ω(h) as −log P(h). This MAP inference view of sparse coding and an interesting probabilistic interpretation of sparse coding are further discussed in Section 20.3.
To relate the generative model of sparse coding to ICA, note how the prior imposes not just sparsity but also independence of the latent variables h_i under P(h), which may help to separate different explanatory factors, unlike PCA, factor analysis or probabilistic PCA, which rely on a Gaussian prior that yields a factorized prior under any rotation of the factors (multiplication by an orthonormal matrix), as demonstrated in Section 16.5.1.
See Section 18.2 about the manifold interpretation of sparse coding.
TODO: relate to and point to Spike-and-slab sparse coding (Goodfellow et al., 2012)
(section?)
16.6 Probabilistic Interpretation of Reconstruction Error
as Log-Likelihood
Although traditional auto-encoders (like traditional neural networks) were introduced with an associated training loss, that training loss can generally be given a probabilistic interpretation as a conditional log-likelihood of the original input x, given the representation h.
We have already covered negative log-likelihood as a loss function in general for feedforward neural networks in Section 6.2.2. Like prediction error for regular feedforward neural networks, reconstruction error for auto-encoders does not have to be squared error. When we view the loss as negative log-likelihood, we interpret the reconstruction error as

L = −log P(x|h)

where h is the representation, which may generally be obtained through an encoder taking x as input.
Figure 16.5: The computational graph of an auto-encoder, which is trained to maximize
the probability assigned by the decoder g to the data point x, given the output of the
encoder h = f(x). The training objective is thus L = −log P(x|g(f(x))), which ends up
being squared reconstruction error if we choose a Gaussian reconstruction distribution
with mean g(f (x)), and cross-entropy if we choose a factorized Bernoulli reconstruction
distribution with means g(f(x)).
An advantage of this view is that it immediately tells us what kind of loss function
one should use depending on the nature of the input. If the input is real-valued and
unbounded, then squared error is a reasonable choice of reconstruction error, and cor-
responds to P (x|h) being Normal. If the input is a vector of bits, then cross-entropy
is a more reasonable choice, and corresponds to P(x|h) = Π_i P(x_i|h) with x_i|h being Bernoulli-distributed. We then view the decoder g(h) as computing the parameters of the reconstruction distribution, i.e., P(x|h) = P(x|g(h)).
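The following sketch makes the two cases above explicit, viewing each reconstruction loss as a negative log-likelihood; the shapes and the stand-in decoder outputs are illustrative assumptions:

```python
# Squared error as Gaussian NLL, and cross-entropy as factorized Bernoulli NLL.
import torch
import torch.nn.functional as F

x_real = torch.randn(8, 20)           # real-valued inputs
g_real = torch.randn(8, 20)           # decoder output = mean of Gaussian P(x|h)
# Up to an additive constant, -log N(x; g, I) equals half the squared error:
nll_gaussian = 0.5 * ((x_real - g_real) ** 2).sum(dim=1)

x_bits = torch.randint(0, 2, (8, 20)).float()    # binary inputs
g_logits = torch.randn(8, 20)                    # decoder output = Bernoulli logits
# -log P(x|h) for a factorized Bernoulli is the cross-entropy:
nll_bernoulli = F.binary_cross_entropy_with_logits(
    g_logits, x_bits, reduction="none").sum(dim=1)

print(nll_gaussian.mean().item(), nll_bernoulli.mean().item())
```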
Another advantage of this view is that we can think about the training of the decoder as estimating the conditional distribution P(x|h), which comes in handy in the probabilistic interpretation of denoising auto-encoders, allowing us to talk about the distribution
P (x) explicitly or implicitly represented by the auto-encoder (see Sections 16.8, 21.8.2
and 21.9 for more details). In the same spirit, we can rethink the notion of encoder from
a simple function to a conditional distribution Q(h|x), with a special case being when
Q(h|x) is a Dirac at some particular value. Equivalently, thinking about the encoder
as a distribution corresponds to injecting noise inside the auto-encoder. This view is
developed further in Sections 21.8.2 and 21.10.
16.7 Sparse Representations
Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e., one
whose elements are often either zero or close to zero. Sparse coding was introduced in
Section 16.5.2 as a linear factor model in which the prior P (h) on the representation
h = f(x) encourages values at or near 0. In Section 16.7.1, we see how ordinary auto-
encoders can be prevented from learning a useless identity transformation by using a
sparsity penalty rather than a bottleneck. The main difference between a sparse auto-
encoder and sparse coding is that sparse coding has no explicit parametric encoder,
whereas sparse auto-encoders have one. The “encoder” of sparse coding is the algorithm
that performs the approximate inference, i.e., looks for
h∗(x) = arg max_h log P(h|x) = arg min_h ||x − (b + W h)||²/σ² − log P(h)    (16.7)

where σ² is a reconstruction variance parameter (which should equal the average squared reconstruction error⁶), and P(h) is a “sparse” prior that puts more probability mass around h = 0, such as the Laplacian prior, with factorized marginals

P(h_i) = (λ/2) e^{−λ|h_i|}    (16.8)

or the Student-t prior, with factorized marginals

P(h_i) ∝ 1/(1 + h_i²/ν)^{(ν+1)/2}.    (16.9)
The advantages of such a non-parametric encoder and the sparse coding approach over
sparse auto-encoders are that
1. it can in principle minimize the combination of reconstruction error and log-prior
better than any parametric encoder,
2. it performs what is called explaining away (see Figure 14.8), i.e., it allows the model to “choose” some “explanations” (hidden factors) and inhibit the others.
The disadvantages are that
1. computing time for encoding the given input x, i.e., performing inference (com-
puting the representation h that goes with the given x) can be substantially larger
than with a parametric encoder (because an optimization must be performed for
each example x), and
2. the resulting encoder function could be non-smooth and possibly too non-linear
(with two nearby x’s being associated with very different h’s), potentially making
it more difficult for the downstream layers to properly generalize.
6
but can be lumped into the regularizer λ which controls the strength of the sparsity prior, defined
in Eq. 16.8, for example.
In Section 16.7.2, we describe PSD (Predictive Sparse Decomposition), which com-
bines a non-parametric encoder (as in sparse coding, with the representation obtained
via an optimization) and a parametric encoder (like in the sparse auto-encoder). Sec-
tion 16.8 introduces the Denoising Auto-Encoder (DAE), which puts pressure on the
representation by requiring it to extract information about the underlying distribution
and where it concentrates, so as to be able to denoise a corrupted input. Section 16.9
describes the Contractive Auto-Encoder (CAE), which optimizes an explicit regulariza-
tion penalty that aims at making the representation as insensitive as possible to the
input, while keeping the information sufficient to reconstruct the training examples.
16.7.1 Sparse Auto-Encoders
A sparse auto-encoder is simply an auto-encoder whose training criterion involves a sparsity penalty Ω(h) in addition to the reconstruction error:

L = −log P(x|g(h)) + Ω(h)    (16.10)

where g(h) is the decoder output and typically we have h = f(x), the encoder output. We can think of the penalty Ω(h) simply as a regularizer or as a log-prior on the representations h. For example, the sparsity penalty corresponding to the Laplace prior ((λ/2) e^{−λ|h_i|}) is the absolute value sparsity penalty (see also Eq. 16.8 above):

Ω(h) = λ Σ_i |h_i|
−log P(h) = Σ_i ( −log(λ/2) + λ|h_i| ) = const + Ω(h)    (16.11)

where the constant term depends only on λ and not on h (we typically ignore it in the training criterion because we treat λ as a hyper-parameter rather than a parameter). Similarly (as per Eq. 16.9), the sparsity penalty corresponding to the Student-t prior (Olshausen and Field, 1997) is

Ω(h) = Σ_i ((ν + 1)/2) log(1 + h_i²/ν)    (16.12)
where ν is considered to be a hyper-parameter.
The early work on sparse auto-encoders (Ranzato et al., 2007, 2008) considered
various forms of sparsity and proposed a connection between sparsity regularization and
the partition function gradient in energy-based models (see Section TODO). The idea
is that a regularizer such as sparsity makes it difficult for an auto-encoder to achieve
zero reconstruction error everywhere. If we consider reconstruction error as a proxy
for energy (unnormalized log-probability of the data), then minimizing the training
set reconstruction error forces the energy to be low on training examples, while the
regularizer prevents it from being low everywhere. The same role is played by the
gradient of the partition function in energy-based models such as the RBM (Section
TODO).
However, the sparsity penalty of sparse auto-encoders does not need to have a probabilistic interpretation. For example, Goodfellow et al. (2009) successfully used the following sparsity penalty, which does not try to bring h_i all the way down to 0, but only towards some low target value such as ρ = 0.05:

Ω(h) = −Σ_i ( ρ log h_i + (1 − ρ) log(1 − h_i) )    (16.13)

where 0 < h_i < 1, usually with h_i = sigmoid(a_i). This is just the cross-entropy between the Bernoulli distributions with probability p = h_i and the target Bernoulli distribution with probability p = ρ.
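For concreteness, here is a sketch of how both penalties discussed above (the absolute value penalty of Eq. 16.11 and the cross-entropy penalty of Eq. 16.13) can be computed on the code of an encoder; the encoder, weights and hyper-parameter values are illustrative assumptions:

```python
# Two sparsity penalties applied to the code h of an auto-encoder.
import torch

torch.manual_seed(0)
x = torch.rand(8, 50)
W = 0.1 * torch.randn(50, 20, requires_grad=True)
h = torch.sigmoid(x @ W)                  # code h = f(x), with h_i in (0, 1)

lam, rho = 0.1, 0.05
l1_penalty = lam * h.abs().sum(dim=1).mean()                 # Eq. 16.11
kl_penalty = -(rho * torch.log(h) +                          # Eq. 16.13
               (1 - rho) * torch.log(1 - h)).sum(dim=1).mean()

# Either penalty would be added to the reconstruction error (Eq. 16.10), e.g.:
# loss = reconstruction_nll + l1_penalty    (or + kl_penalty)
print(l1_penalty.item(), kl_penalty.item())
```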
One way to achieve actual zeros in h for sparse (and denoising) auto-encoders was introduced in Glorot et al. (2011c). The idea is to use a half-rectifier (also simply called the “rectifier”) or ReLU (Rectified Linear Unit, introduced in Glorot et al. (2011b) for deep supervised networks and earlier in Nair and Hinton (2010) in the context of RBMs) as the output non-linearity of the encoder. With a prior that actually pushes the representations towards zero (like the absolute value penalty), one can thus indirectly control the
average number of zeros in the representation. ReLUs were first successfully used for
deep feedforward networks in Glorot et al. (2011a), achieving for the first time the ability
to train fairly deep supervised networks without the need for unsupervised pre-training,
and this turned out to be an important component in the 2012 object recognition break-
through with deep convolutional networks (Krizhevsky et al., 2012b).
Interestingly, the “regularizer” used in sparse auto-encoders does not conform to
the classical interpretation of regularizers as priors on the parameters. That classical
interpretation of the regularizer comes from the MAP (Maximum A Posteriori) point
estimation (see Section 5.5.1) of parameters associated with the Bayesian view of param-
eters as random variables and considering the joint distribution of data x and parameters
θ (see Section 5.7):
arg max_θ P(θ|x) = arg max_θ ( log P(x|θ) + log P(θ) )
where the first term on the right is the usual data log-likelihood term and the second
term, the log-prior over parameters, incorporates the preference over particular values
of θ.
With regularized auto-encoders such as sparse auto-encoders and contractive auto-
encoders, instead, the regularizer corresponds to a log-prior over the representation, or
over latent variables. In the case of sparse auto-encoders, predictive sparse decomposi-
tion and contractive auto-encoders, the regularizer specifies a preference over functions
of the data, rather than over parameters. This makes such a regularizer data-dependent,
unlike the classical parameter log-prior. Specifically, in the case of the sparse auto-
encoder, it says that we prefer an encoder whose output produces values closer to 0.
Indirectly (when we marginalize over the training distribution), this is also indicating a
preference over parameters, of course.
16.7.2 Predictive Sparse Decomposition
Predictive Sparse Decomposition (PSD) is a variant that combines sparse coding and a
parametric encoder (Kavukcuoglu et al., 2008b), i.e., it has both a parametric encoder
and iterative inference. It has been applied to unsupervised feature learning for object
recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009b;
Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The representation is
considered to be a free variable (possibly a latent variable if we choose a probabilistic
interpretation) and the training criterion combines a sparse coding criterion with a term
that encourages the optimized sparse representation h (after inference) to be close to
the output of the encoder f(x):
L = ||x − g(h)||² + λ|h|_1 + γ||h − f(x)||²    (16.14)
where f is the encoder and g is the decoder. Like in sparse coding, for each example x
an iterative optimization is performed in order to obtain a representation h. However,
because the iterations can be initialized from the output of the encoder, i.e., with h =
f(x), only a few steps (e.g. 10) are necessary to obtain good results. Simple gradient
descent on h has been used by the authors. After h is settled, both g and f are updated
towards minimizing the above criterion. The first two terms are the same as in L1 sparse
coding while the third one encourages f to predict the outcome of the sparse coding
optimization, making it a better choice for the initialization of the iterative optimization.
Hence f can be used as a parametric approximation to the non-parametric encoder
implicitly defined by sparse coding. It is one of the first instances of learned approximate
inference (see also Sec. 20.6). Note that this is different from separately doing sparse
coding (i.e., training g) and then training an approximate inference mechanism f, since
both the encoder and decoder are trained together to be “compatible” with each other.
Hence the decoder will be learned in such a way that inference will tend to find solutions
that can be well approximated by the approximate inference. A similar example is the
variational auto-encoder, in which the encoder acts as approximate inference for the
decoder, and both are trained jointly (Section 21.8.2). See also Section 21.8.3 for a
probabilistic interpretation of PSD in terms of a variational lower bound on the log-
likelihood.
In practical applications of PSD, the iterative optimization is only used during train-
ing, and f is used to compute the learned features. It makes computation fast at
recognition time and also makes it easy to use the trained features f as initialization
(unsupervised pre-training) for the lower layers of a deep net. Like other unsupervised
feature learning schemes, PSD can be stacked greedily, e.g., training a second PSD on
top of the features extracted by the first one, etc.
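The following is a sketch of a single PSD training step as described above: a few gradient steps on the code h, initialized at the encoder output f(x), followed by an update of the encoder and decoder parameters. The architecture, optimizers and hyper-parameters are illustrative assumptions, not the authors' implementation:

```python
# One PSD training step: iterative inference of h, then a parameter update.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_h = 50, 100
f = nn.Sequential(nn.Linear(d_x, d_h), nn.Tanh())   # parametric encoder
g = nn.Linear(d_h, d_x, bias=False)                 # linear decoder (dictionary)
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
lam, gamma = 0.1, 1.0

def psd_criterion(x, h):
    # Eq. 16.14, summed over the minibatch.
    return (((x - g(h)) ** 2).sum(dim=1)
            + lam * h.abs().sum(dim=1)
            + gamma * ((h - f(x)) ** 2).sum(dim=1)).sum()

x = torch.rand(16, d_x)

# Inference: optimize h for this minibatch, starting from the encoder output.
h = f(x).detach().clone().requires_grad_(True)
h_opt = torch.optim.SGD([h], lr=0.1)
for _ in range(10):                       # a few steps, as described in the text
    h_opt.zero_grad()
    psd_criterion(x, h).backward()
    h_opt.step()

# Learning: with h fixed, update the encoder and decoder parameters.
opt.zero_grad()
psd_criterion(x, h.detach()).backward()
opt.step()
```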
16.8 Denoising Auto-Encoders
The Denoising Auto-Encoder (DAE) was first proposed (Vincent et al., 2008, 2010) as
a means of forcing an auto-encoder to learn to capture the data distribution without an
explicit constraint on either the dimension or the sparsity of the learned representation.
It was motivated by the idea that in order to fully capture a complex distribution, an
auto-encoder needs to have at least as many hidden units as needed by the complexity
of that distribution. Hence its dimensionality should not be restricted to the input
dimension.
The principle of the denoising auto-encoder is deceptively simple and illustrated in
Figure 16.6: the encoder sees as input a corrupted version of the input, but the decoder
tries to reconstruct the clean uncorrupted input.
Figure 16.6: The computational graph of a denoising auto-encoder, which is trained to reconstruct the clean data point x from its corrupted version x̃, i.e., to minimize the loss L = −log P(x|g(f(x̃))), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃|x).
Mathematically, and following the notation used in this chapter, this can be formalized as follows. We introduce a corruption process C(x̃|x) which represents a conditional distribution over corrupted samples x̃, given a data sample x. The auto-encoder then learns a reconstruction distribution P(x|x̃) estimated from training pairs (x, x̃), as follows:
1. Sample a training example x from the data generating distribution (the training set).
2. Sample a corrupted version x̃ from the conditional distribution C(x̃|x).
3. Use (x, x̃) as a training example for estimating the auto-encoder reconstruction distribution P(x|x̃) = P(x|g(h)), with h the output of encoder f(x̃) and g(h) the output of the decoder.
Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log P(x|h), i.e., the denoising reconstruction error, using back-propagation to compute gradients, just like for regular feedforward neural networks (the only difference being the corruption of the input and the choice of target output).
We can view this training objective as performing stochastic gradient descent on the denoising reconstruction error, but where the “noise” now has two sources:
1. the choice of training sample x from the data set, and
2. the random corruption applied to x to obtain x̃.
We can therefore consider that the DAE is performing stochastic gradient descent on the following expectation:

−E_{x∼Q(x)} E_{x̃∼C(x̃|x)} log P(x|g(f(x̃)))

where Q(x) is the training distribution.
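A minimal sketch of this training procedure, assuming Gaussian corruption and squared (Gaussian negative log-likelihood, up to constants) reconstruction error, is given below; the data, architecture and noise level are illustrative:

```python
# Denoising auto-encoder training with Gaussian corruption C(x_tilde | x).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_h, sigma = 50, 100, 0.3           # an overcomplete code is allowed here
f = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())   # encoder
g = nn.Linear(d_h, d_x)                              # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

data = torch.rand(512, d_x)              # stand-in training set, Q(x)
for step in range(200):
    x = data[torch.randint(0, len(data), (64,))]     # sample clean examples x
    x_tilde = x + sigma * torch.randn_like(x)        # sample x_tilde ~ C(.|x)
    r = g(f(x_tilde))                                # reconstruct from x_tilde
    loss = ((r - x) ** 2).sum(dim=1).mean()          # denoising error vs clean x
    opt.zero_grad()
    loss.backward()
    opt.step()
```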
Figure 16.7: A denoising auto-encoder is trained to reconstruct the clean data point x from its corrupted version x̃. In the figure, the corruption process C(x̃|x) is illustrated by a grey circle of equiprobable corruptions (and a grey arrow for the corruption process) acting on examples x (red crosses) lying near a low-dimensional manifold near which probability concentrates. When the denoising auto-encoder is trained to minimize the average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E[x|x̃], which approximately points orthogonally towards the manifold, since it estimates the center of mass of the clean points x which could have given rise to x̃. The auto-encoder thus learns a vector field g(f(x)) − x (the green arrows) and it turns out that this vector field estimates the gradient field ∂log Q(x)/∂x (up to a multiplicative factor that is the average root mean square reconstruction error), where Q is the unknown data generating distribution.
16.8.1 Learning a Vector Field that Estimates a Gradient Field
As illustrated in Figure 16.7, a very important property of DAEs is that their training criterion makes the auto-encoder learn a vector field (g(f(x)) − x) that estimates the gradient field (or score) ∂log Q(x)/∂x, as per Eq. 16.15. A first result in this direction was proven by Vincent (2011a), showing that minimizing squared reconstruction error in a denoising auto-encoder with Gaussian noise was related to score matching (Hyvärinen, 2005a), making the denoising criterion a regularized form of score matching called denoising score matching (Kingma and LeCun, 2010a). Score matching is an alternative to maximum likelihood and provides a consistent estimator. It is discussed further in Section 19.4. The denoising version is discussed in Section 19.5.
The connection between denoising auto-encoders and score matching was first made (Vin-
cent, 2011a) in the case where the denoising auto-encoder has a particular parametriza-
tion (one hidden layer, sigmoid activation functions on hidden units, linear reconstruc-
tion), in which case the denoising criterion actually corresponds to a regularized form
of score matching on a Gaussian RBM (with binomial hidden units and Gaussian vis-
ible units). The connection between ordinary auto-encoders and Gaussian RBMs had
previously been made by Bengio and Delalleau (2009), which showed that contrastive di-
vergence training of RBMs was related to an associated auto-encoder gradient, and later
by Swersky (2010), which showed that non-denoising reconstruction error corresponded
to score matching plus a regularizer.
The fact that the denoising criterion yields an estimator of the score for general encoder/decoder parametrizations has been proven (Alain and Bengio, 2012, 2013) in the case where the corruption and the reconstruction distributions are Gaussian (and of course x is continuous-valued), i.e., with the squared error denoising error

||g(f(x̃)) − x||²

and corruption

C(x̃|x) = N(x̃; µ = x, Σ = σ²I)

with noise variance σ².
Figure 16.8: Vector field learned by a denoising auto-encoder around a 1-D curved
manifold near which the data (orange circles) concentrates in a 2-D space. Each arrow
is proportional to the reconstruction minus input vector of the auto-encoder and points
towards higher probability according to the implicitly estimated probability distribution.
Note that the vector field has zeros at both peaks of the estimated density function (on
the data manifolds) and at troughs (local minima) of that density function, e.g., on the
curve that separates different arms of the spiral or in the middle of it.
More precisely, the main theorem states that (g(f(x)) − x)/σ² is a consistent estimator of ∂log Q(x)/∂x, where Q(x) is the data generating distribution:

(g(f(x)) − x)/σ²  ≈  ∂log Q(x)/∂x,    (16.15)
so long as f and g have sufficient capacity to represent the true score (and assuming
that the expected training criterion can be minimized, as usual when proving consistency
associated with a training objective).
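A small closed-form sanity check of this relationship (requiring no training) can be done with 1-D Gaussian data, for which the optimal denoising reconstruction E[x|x̃] is known analytically; all numerical values below are illustrative assumptions:

```python
# Checking Eq. 16.15 for 1-D Gaussian data Q(x) = N(mu, s2) and Gaussian
# corruption of variance sigma2, where E[x | x_tilde] has a closed form.
import numpy as np

mu, s2 = 1.0, 4.0            # data mean and variance (illustrative)
sigma2 = 0.01                # small corruption variance
x = np.linspace(-3.0, 5.0, 9)

# Optimal denoising reconstruction for this corruption level:
r = (s2 * x + sigma2 * mu) / (s2 + sigma2)     # E[x_clean | x_tilde = x]

score_estimate = (r - x) / sigma2              # left-hand side of Eq. 16.15
true_score = (mu - x) / s2                     # d/dx log N(x; mu, s2)

print(np.round(score_estimate, 3))
print(np.round(true_score, 3))                 # nearly identical for small sigma2
```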
Note that in general, there is no guarantee that the reconstruction g(f(x)) minus
the input x corresponds to the gradient of something (the estimated score should be
the gradient of the estimated log-density with respect to the input x). That is why
the early results (Vincent, 2011a) are specialized to particular parametrizations where
g(f(x)) − x is the derivative of something. See a more general treatment by Kamyshanska
and Memisevic (2015).
Although it was intuitively appealing that in order to denoise correctly one must
capture the training distribution, the above consistency result makes it mathematically
very clear in what sense the DAE is capturing the input distribution: it is estimating the gradient of its log-density (i.e., the negative gradient of its energy function), i.e., learning to point towards more probable (lower energy) configurations. Figure 16.8 (see details of the experiment
in Alain and Bengio (2013)) illustrates this. Note how the norm of reconstruction error
(i.e. the norm of the vectors shown in the figure) is related to but different from the
energy (unnormalized log-density) associated with the estimated model. The energy
should be low only where the probability is high. The reconstruction error (norm of
the estimated score vector) is low where probability is near a peak of probability (or a
trough of energy), but it can also be low at maxima of energy (minima of probability).
Section 21.9 continues the discussion of the relationship between denoising auto-
encoders and probabilistic modeling by showing how one can generate from the distribu-
tion implicitly estimated by a denoising auto-encoder. Whereas Alain and Bengio (2013)
generalized the score estimation result of Vincent (2011a) to arbitrary parametrizations,
the result from Bengio et al. (2013b), discussed in Section 21.9, provides a probabilistic
and in fact generative interpretation to every denoising auto-encoder.
16.9 Contractive Auto-Encoders
The Contractive Auto-Encoder or CAE (Rifai et al., 2011a,c) introduces an explicit
regularizer on the code h = f(x), encouraging the derivatives of f to be as small as
possible:
Ω(h) = ||∂f(x)/∂x||²_F    (16.16)
which is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix
of partial derivatives associated with the encoder function. Whereas the denoising auto-
encoder learns to contract the reconstruction function (the composition of the encoder
and decoder), the CAE learns to specifically contract the encoder. See Figure 18.13
for a view of how contraction near the data points makes the auto-encoder capture the
manifold structure.
If it weren’t for the opposing force of reconstruction error, which attempts to make
the code h keep all the information necessary to reconstruct training examples, the CAE
penalty would yield a code h that is constant and does not depend on the input x. The
compromise between these two forces yields an auto-encoder whose derivatives ∂f(x)/∂x are tiny in most directions, except those that are needed to reconstruct training examples,
i.e., the directions that are tangent to the manifold near which data concentrate. Indeed,
in order to distinguish (and thus, reconstruct correctly) two nearby examples on the
manifold, one must assign them a different code, i.e., f(x) must vary as x moves from
one to the other, i.e., in the direction of a tangent to the manifold.
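For a one-hidden-layer sigmoid encoder, the penalty of Eq. 16.16 can be computed in closed form, since ∂h_i/∂x = h_i(1 − h_i)W_i, where W_i is the i-th row of the encoder weights. The sketch below follows this; shapes and weight values are illustrative assumptions:

```python
# Closed-form contractive penalty (Eq. 16.16) for a sigmoid encoder h = sigmoid(W x + b).
import torch

torch.manual_seed(0)
d_x, d_h = 50, 100
W = torch.randn(d_h, d_x) * 0.1
b = torch.zeros(d_h)
x = torch.rand(8, d_x)

h = torch.sigmoid(x @ W.T + b)                       # code h = f(x)
jac_factor = (h * (1 - h)) ** 2                      # per-unit (h_i (1 - h_i))^2
contractive_penalty = (jac_factor * (W ** 2).sum(dim=1)).sum(dim=1).mean()

# This term would be added (with a coefficient) to the reconstruction error,
# encouraging the encoder to be insensitive to most input directions.
print(contractive_penalty.item())
```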
Figure 16.9: Average (over test examples) of the singular value spectrum of the Jacobian matrix ∂f(x)/∂x for the encoder f learned by a regular auto-encoder (AE) versus a contractive auto-encoder (CAE). This illustrates how the contractive regularizer yields a smaller set of directions in input space (those corresponding to large singular values of the Jacobian) which provoke a response in the representation h, while the representation remains almost insensitive to most directions of change in the input.
Interestingly, this penalty forces the representation most strongly to be invariant in directions orthogonal to the manifold. This can be seen clearly by comparing the singular value spectrum of the Jacobian ∂f(x)/∂x for different auto-encoders, as shown
in Figure 16.9. We see that the CAE manages to concentrate the sensitivity of the
representation in fewer dimensions than a regular (or sparse) auto-encoder. Figure 18.3
illustrates tangent vectors obtained by a CAE on the MNIST digits dataset, showing
that the leading tangent vectors correspond to small deformations such as translation.
More impressively, Figure 16.10 shows tangent vectors learned on 32×32 color (RGB)
CIFAR-10 images by a CAE, compared to the tangent vectors obtained by a non-distributed representation learner (a mixture of local PCAs).
One practical issue with the CAE regularization criterion is that although it is cheap
to compute in the case of a single hidden layer auto-encoder, it becomes much more ex-
pensive in the case of deeper auto-encoders. The strategy followed by Rifai et al. (2011a)
is to separately pre-train each single-layer auto-encoder stacked to form a deeper auto-
encoder. However, a deeper encoder could be advantageous in spite of the computational
overhead, as argued by Schulz and Behnke (2012).
Another practical issue is that the contraction penalty on the encoder f could yield
Figure 16.10: Illustration of tangent vectors (bottom) of the manifold estimated by a contractive auto-encoder (CAE), at some input point (left, CIFAR-10 image of a dog). See also Fig. 18.3. Each image on the right corresponds to a tangent vector, either estimated by a local PCA (equivalent to a Gaussian mixture), top, or by a CAE (bottom). The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix ∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture local tangents that differ at different points, the local PCA does not have enough training data to meaningfully capture good tangent directions, whereas the CAE does (because it exploits parameter sharing across different locations that share a subset of active hidden units). The CAE tangent directions typically correspond to moving or changing parts of the object (such as the head or legs), which corresponds to plausible changes in the input image.
useless results if the decoder g were to exactly compensate for it (e.g. by being scaled up by exactly the same amount as f is scaled down). In Rifai et al. (2011a), this is addressed by tying the weights of f and g, both being of the form of an affine transformation followed by a non-linearity (e.g. sigmoid), with the weights of g being the transpose of the weights of f.