Chapter 16
Linear Factor Models and
Auto-Encoders
Linear factor models are generative unsupervised learning models in which we
imagine that some unobserved factors h explain the observed variables x through
a linear transformation. Auto-encoders are unsupervised learning methods that
learn a representation of the data, typically through a non-linear parametric
transformation from x to h (most often a feedforward neural network, but not
necessarily). They also learn a transformation going backwards from the
representation to the data, from h to x, like the linear factor models.
Linear factor models therefore only specify a parametric decoder, whereas auto-
encoders also specify a parametric encoder. Some linear factor models, like PCA,
actually correspond to an auto-encoder (a linear one), but for others the encoder
is implicitly defined via an inference mechanism that searches for an h that could
have generated the observed x.
The idea of auto-encoders has been part of the historical landscape of neural
networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel,
1994) but has really picked up speed in recent years. Auto-encoders remained
somewhat marginal for many years, in part due to an incomplete understanding of
their mathematical interpretation and geometrical underpinnings, which are
developed further in Chapter 18 and Section 21.11.
An auto-encoder is simply a neural network that tries to copy its in-
put to its output. The architecture of an auto-encoder is typically decomposed
into the following parts, illustrated in Figure 16.1:
• an input, x
• an encoder function f
• a code or internal representation h = f(x)
Figure 16.1: General schema of an auto-encoder, mapping an input x to an output (called
reconstruction) r through an internal representation or code h. The auto-encoder has
two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
• a decoder function g
• an output, also called the reconstruction, r = g(h) = g(f(x))
• a loss function L computing a scalar L(r, x) measuring how good a reconstruction r is of the given input x. The objective is to minimize the expected value of L over the training set of examples {x}.
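To make these parts concrete, here is a minimal sketch (added for illustration, not part of the original text) of a one-hidden-layer auto-encoder with a squared-error loss, written in NumPy; the dimensions, the tanh non-linearity and the random initialization are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 8, 3                                 # input and code dimensions (arbitrary)
W_enc = rng.normal(scale=0.1, size=(n_code, n_in))  # encoder weights
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_in, n_code))  # decoder weights
b_dec = np.zeros(n_in)

def f(x):                                # encoder: x -> h
    return np.tanh(W_enc @ x + b_enc)

def g(h):                                # decoder: h -> r
    return W_dec @ h + b_dec

def loss(r, x):                          # squared reconstruction error L(r, x)
    return np.sum((r - x) ** 2)

x = rng.normal(size=n_in)                # one input example
h = f(x)                                 # code
r = g(h)                                 # reconstruction
print(loss(r, x))
```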
16.1 Regularized Auto-Encoders
Predicting the input may sound useless: what could prevent the auto-encoder
from simply copying its input into its output? In the 20th century, this was
achieved by constraining the architecture of the auto-encoder, forcing the
dimension of the code h to be smaller than the dimension of the input x.
Figure 16.2 illustrates the two typical cases of auto-encoders: undercomplete
vs overcomplete, i.e., with the dimension of the representation h respectively
smaller vs larger than the input x. Whereas early work with auto-encoders, just
like PCA, uses undercompleteness (i.e., a bottleneck in the sequence of layers)
to avoid learning the identity function, more recent work allows overcomplete
Figure 16.2: Left: undercomplete representation (dimension of code h is less than di-
mension of input x). Right: overcomplete representation. Overcomplete auto-encoders
require some other form of regularization (instead of the constraint on the dimension of
h) to avoid the trivial solution where r = x for all x.
representations. What we have learned in recent years is that it is possible to
make the auto-encoder meaningfully capture the structure of the input distribu-
tion even if the representation is overcomplete, with other forms of constraint
or regularization. In fact, once we realize that auto-encoders can capture the
input distribution (indirectly, not as an explicit probability function), we also
realize that they should need more capacity as the complexity of the distribution
to be captured (and the amount of data available) increases: their capacity should
not be limited by the input dimension. This is a problem in particular with shallow
auto-encoders, which have a single hidden layer (for the code). Indeed, that
hidden layer size controls both the dimensionality reduction constraint (the code
size at the bottleneck) and the capacity (which allows learning a more complex
distribution).
Besides the bottleneck constraint, alternative constraints or regularization
methods have been explored that can guarantee that the auto-encoder does something
useful and does not just learn a trivial identity-like function:
Sparsity of the representation or of its derivative: even if the in-
termediate representation has a very high dimensionality, the effective local
dimensionality (number of degrees of freedom that capture a coordinate sys-
tem among the probable x’s) could be much smaller if most of the elements
of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to zero). When
||∂h_i/∂x|| is close to zero, h_i does not participate in encoding local changes in
x. There is a geometrical interpretation of this situation in terms of mani-
fold learning that is discussed in more depth in Chapter 18. The discussion
in Chapter 17 also explains how an auto-encoder naturally tends towards
learning a coordinate system for the actual factors of variation in the data.
At least four types of “auto-encoders” clearly fall in this category of sparse
representation:
Sparse coding (Olshausen and Field, 1996) has been heavily studied
as an unsupervised feature learning and feature inference mechanism.
It is a linear factor model rather than an auto-encoder, because it has
no explicit parametric encoder; instead it uses an iterative inference
procedure to compute the code. Sparse coding looks for representations
that are both sparse and explain the input through the decoder. Instead
of the code being a parametric function of the input, it is considered
as a free variable that is obtained through an optimization,
i.e., a particular form of inference:

h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (16.1)
where L is the reconstruction loss, f the (non-parametric) encoder, g
the (parametric) decoder, Ω(h) is a sparsity regularizer, and in practice
the minimization can be approximate. Sparse coding has a manifold
or geometric interpretation that is discussed in Section 16.7. It also
has an interpretation as a directed graphical model, described in more
details in Section 20.3. To achieve sparsity, the objective function to
optimize includes a term that is minimized when the representation has
many zero or near-zero values, such as the L1 penalty |h|_1 = Σ_i |h_i|.
An interesting variation of sparse coding combines the freedom to
choose the representation through optimization and a parametric en-
coder. It is called predictive sparse decomposition (PSD) (Kavukcuoglu
et al., 2008a) and is briefly described in Section 16.7.2.
At the other end of the spectrum are sparse auto-encoders,
which simply combine the standard auto-encoder schema with a sparsity penalty
that encourages the output of the encoder to be sparse. These are
described in Section 16.7.1. Besides the L1 penalty, other sparsity
penalties that have been explored include the Student-t penalty (Ol-
shausen and Field, 1996; Bergstra, 2011),
Σ_i log(1 + α²h_i²)

(i.e. where αh_i has a Student-t prior density) and the KL-divergence
penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008a),

−Σ_i (t log h_i + (1 − t) log(1 − h_i)),

with a target sparsity level t, for h_i ∈ (0, 1), e.g. obtained through a sigmoid
non-linearity.
Contractive auto-encoders (Rifai et al., 2011b), covered in Section 16.9,
explicitly penalize ||∂h/∂x||²_F, i.e., the sum of the squared norms
of the vectors ∂h_i(x)/∂x (each indicating how much each hidden unit h_i
responds to changes in x and what direction of change in x that unit is
most sensitive to, around a particular x). With such a regularization
penalty, the auto-encoder is called contractive¹ because the mapping
from input x to representation h is encouraged to be contractive, i.e.,
to have small derivatives in all directions. Note that a sparsity regularization
indirectly leads to a contractive mapping as well, when the
non-linearity used happens to have a zero derivative at h_i = 0 (which
is the case for the sigmoid non-linearity).
Robustness to injected noise or missing information: if noise is in-
jected in inputs or hidden units, or if some inputs are missing, while the
neural network is asked to reconstruct the clean and complete input, then it
cannot simply learn the identity function. It has to capture the structure
of the data distribution in order to optimally perform this reconstruction.
Such auto-encoders are called denoising auto-encoders and are discussed in
more detail in Section 16.8. There is a tight connection between the denois-
ing auto-encoders and the contractive auto-encoders: it can be shown (Alain
and Bengio, 2013) that in the limit of small Gaussian injected input noise,
the denoising reconstruction error is equivalent to a contractive penalty on
the reconstruction function that maps x to r = g(f(x)). In other words,
since both x and x + ε (where ε is some small noise vector) must yield
the same target output x, the reconstruction function is encouraged to be
insensitive to changes in all directions ε. The only thing that prevents the
reconstruction r from simply being a constant (completely insensitive to the
¹ A function f(x) is contractive if ||f(x) − f(y)|| < ||x − y|| for nearby x and y, or equivalently
if its derivative ||f′(x)|| < 1.
input x) is that one also has to reconstruct correctly for different train-
ing examples x. However, the auto-encoder can learn to be approximately
constant around training examples x while producing a different answer for
different training examples. As discussed in Section 18.4, if the examples
are near a low-dimensional manifold, this encourages the representation to
vary only on the manifold and be locally constant in directions orthogonal
to the manifold, i.e., the representation locally captures a (not necessarily
Euclidean, not necessarily orthogonal) coordinate system for the manifold.
In addition to the denoising auto-encoder, the variational auto-encoder (Sec-
tion 21.9.2) and the generative stochastic networks (Section 21.11) also in-
volve the injection of noise, but typically in the representation-space itself,
thus introducing the notion of h as a latent variable.
Pressure of a Prior on the Representation: an interesting way to
generalize the notion of regularization applied to the representation is to
introduce in the cost function for the auto-encoder a log-prior term

−log P(h)
which captures the assumption that we would like to find a representation
that has a simple distribution (if P(h) has a simple form, such as a fac-
torized distribution²), or at least one that is simpler than the original data
distribution. Among all the encoding functions f , we would like to pick one
that
1. can be inverted (easily), and this is achieved by minimizing some re-
construction loss, and
2. yields representations h whose distribution is “simpler”, i.e., can be
captured with less capacity than the original training distribution itself.
The sparse variants described above clearly fall in that framework. The vari-
ational auto-encoder (Section 21.9.2) provides a clean mathematical frame-
work for justifying the above pressure of a top-level prior when the objective
is to model the data generating distribution.
From the point of view of regularization (Chapter 7), adding the −log P(h)
term to the objective function (e.g. for encouraging sparsity) or adding a
contractive penalty does not fit the traditional view of a prior on the parameters. Instead,
the prior on the latent variables acts like a data-dependent prior, in the sense that
it depends on the particular values h that are going to be sampled (usually from a
posterior or an encoder), based on the input example x. Of course, indirectly, this
² All the sparse priors we have described correspond to a factorized distribution.
is also a regularization on the parameters, but one that depends on the particular
data distribution.
16.2 Representational Power, Layer Size and Depth
Nothing in the above description of auto-encoders restricts the encoder or decoder
to be shallow, but in the literature on the subject, most trained auto-encoders
have had a single hidden layer which is also the representation layer or code.³
For one, we know from the universal approximation abilities of single
hidden-layer neural networks that a sufficiently large hidden layer can approximate
any function to a given accuracy. This observation justifies overcomplete auto-
encoders: in order to represent a rich enough distribution, one probably needs
many hidden units in the intermediate representation layer. We also know that
Principal Components Analysis (PCA) corresponds to an undercomplete auto-
encoder with no intermediate non-linearity, and that PCA can only capture a set
of directions of variation that are the same everywhere in space. This notion is
discussed in more details in Chapter 18 in the context of manifold learning.
For two, it has also been reported many times that training a deep neural
network, and in particular a deep auto-encoder (i.e. with a deep encoder and a
deep decoder) is more difficult than training a shallow one. This was actually a
motivation for the initial work on the greedy layerwise unsupervised pre-training
procedure, described below in Section 17.1, by which we only need to train a series
of shallow auto-encoders in order to initialize a deep auto-encoder. It was shown
early on (Hinton and Salakhutdinov, 2006) that, if trained properly, such deep
auto-encoders could yield much better compression than corresponding shallow or
linear auto-encoders (which are basically doing the same as PCA, see Section 16.5
below). As discussed in Section 17.7, deeper architectures can be in some cases
exponentially more efficient (both in terms of computation and statistically) than
shallow ones. However, because we can usefully pre-train a deep net by training
and stacking shallow ones, it remains interesting to consider single-layer (or at
least shallow and easy to train) auto-encoders, as has been done in most of the
literature discussed in this chapter.
16.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss
L) make sense when the loss L is simply the squared reconstruction error, but
³ As argued in this book, this is probably not a good choice, and we would like to independently
control the constraints on the representation, e.g. dimension and sparsity of the code, and the
capacity of the encoder.
there are many cases where this is not appropriate, e.g., when x is a vector
of discrete variables or when P(x|h) is not well approximated by a Gaussian
distribution.⁴ Just like in the case of other types of neural networks (starting with
the feedforward neural networks, Section 6.2.2), it is convenient to define the loss L
as a negative log-likelihood over some target random variables. This probabilistic
interpretation is particularly important for the discussion in Sections 21.9.2, 21.10
and 21.11 about generative extensions of auto-encoders and stochastic recurrent
networks, where the output of the auto-encoder is interpreted as a probability
distribution P (x|h), for reconstructing x, given hidden units h. This distribution
captures not just the expected reconstruction but also the uncertainty about the
original x (which gave rise to h, either deterministically or stochastically). In
the simplest and most ordinary cases, this distribution factorizes, i.e.,
P(x|h) = Π_i P(x_i | h). This covers the usual cases of x_i | h being Gaussian (for
unbounded real values) and x_i | h having a Bernoulli distribution (for binary values
x_i), but one can readily generalize this to other distributions, such as mixtures
(see Sections 3.10.5 and 6.2.2).
Figure 16.3: Basic scheme of a stochastic auto-encoder, in which both the encoder and
the decoder are not simple functions but instead involve some noise injection, meaning
that their output can be seen as sampled from a distribution, Q(h|x) for the encoder and
P (x|h) for the decoder. RBMs are a special case where P = Q (in the sense of a unique
joint distribution corresponding to both conditionals) but in general these two distributions are not
necessarily conditional distributions compatible with a unique joint distribution P (x, h).
Thus we can generalize the notion of decoding function g(h) to decoding dis-
tribution P (x|h). Similarly, we can generalize the notion of encoding function
f(x) to encoding distribution Q(h|x), as illustrated in Figure 16.3. We use this
to capture the fact that noise is injected at the level of the representation h,
⁴ See the link between squared error and normal density in Sections 5.7 and 6.2.2.
now considered like a latent variable. This generalization is crucial in the de-
velopment of the variational auto-encoder (Section 21.9.2) and the generalized
stochastic networks (Section 21.11).
We also find a stochastic encoder and a stochastic decoder in the RBM, de-
scribed in Section 21.2. In that case, the encoding distribution Q(h|x) and P(x|h)
“match”, in the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distri-
bution which has both Q(h|x) and P (x|h) as conditionals. This is not true in
general for two independently parametrized conditionals like Q(h|x) and P (x|h),
although the work on generative stochastic networks (Alain et al., 2015) shows
that learning will tend to make them compatible asymptotically (with enough
capacity and examples).
Figure 16.4: Basic scheme of a linear factors model, in which we assume that an observed
data vector x is obtained by a linear combination of latent factors h, plus some noise.
Different models, such as probabilistic PCA, factor analysis or ICA, make different choices
about the form of the noise and of the prior P (h).
16.4 Linear Factor Models
Now that we have introduced the notion of a probabilistic decoder, let us focus
on a very special case where the latent variable h generates x via a linear trans-
formation plus noise, i.e., classical linear factor models, which do not necessarily
have a corresponding parametric encoder.
The idea of discovering explanatory factors that have a simple joint distribu-
tion among themselves is old (see, e.g., factor analysis, below), and was first
explored in the context where the relationship between factors and data is
linear, i.e., we assume that the data was generated as follows. First, sample the
real-valued factors,
h ∼ P(h),    (16.2)
and then sample the real-valued observable variables given the factors:
x = W h + b + noise (16.3)
where the noise is typically Gaussian and diagonal (independent across dimen-
sions). This is illustrated in Figure 16.4.
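To make the generative process of Eqs. 16.2 and 16.3 concrete, the following sketch (added for illustration, not taken from the text) samples data from a linear factor model with a standard Gaussian prior on h and diagonal Gaussian noise; the dimensions and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, n = 5, 2, 100000                # observed dim, latent dim, number of samples

W = rng.normal(size=(d_x, d_h))           # factor loadings
b = rng.normal(size=d_x)                  # offset
sigma = 0.1 * np.ones(d_x)                # per-dimension noise std (diagonal noise)

h = rng.normal(size=(n, d_h))             # h ~ P(h) = N(0, I)          (Eq. 16.2)
noise = rng.normal(size=(n, d_x)) * sigma
x = h @ W.T + b + noise                   # x = W h + b + noise         (Eq. 16.3)

# The marginal covariance of x should be close to W W^T + diag(sigma^2).
gap = np.abs(np.cov(x, rowvar=False) - (W @ W.T + np.diag(sigma**2)))
print(gap.max())                          # small (sampling error only)
```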
16.5 Probabilistic PCA and Factor Analysis
Probabilistic PCA (Principal Components Analysis), factor analysis and other
linear factor models are special cases of the above equations (16.2 and 16.3) and
only differ in the choices made for the prior (over latent, not parameters) and
noise distributions.
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable
prior is just the unit variance Gaussian
h ∼ N(0, I)
while the observed variables x_i are assumed to be conditionally independent, given
h, i.e., the noise is assumed to be coming from a diagonal covariance Gaussian
distribution, with covariance matrix ψ = diag(σ²), with σ² = (σ_1², σ_2², ...) a
vector of per-variable variances.
The role of the latent variables is thus to capture the dependencies between
the different observed variables x_i. Indeed, it can easily be shown that x is just
a Gaussian-distributed (multivariate normal) random variable, with

x ∼ N(b, WW^T + ψ)

where we see that the weights W induce a dependency between two variables x_i
and x_j through a kind of auto-encoder path, whereby x_i influences ĥ_k = W_k x
via w_ki (for every k) and ĥ_k influences x_j via w_kj.
In order to cast PCA in a probabilistic framework, we can make a slight
modification to the factor analysis model, making the conditional variances σ_i²
equal to each other. In that case the covariance of x is just WW^T + σ²I, where
σ² is now a scalar, i.e.,

x ∼ N(b, WW^T + σ²I)

or equivalently

x = Wh + b + σz

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an
iterative EM algorithm for estimating the parameters W and σ².
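Tipping and Bishop (1999) also give a closed-form maximum-likelihood solution in terms of the leading eigenvectors of the sample covariance, with σ² estimated as the average of the discarded eigenvalues. The sketch below (added for illustration, not part of the original text) implements that closed form and checks it on synthetic data generated from the model; all names and numbers are arbitrary.

```python
import numpy as np

def probabilistic_pca(X, d):
    """Closed-form ML fit of probabilistic PCA (Tipping & Bishop, 1999).
    X: (n, D) data matrix; d: number of latent factors."""
    b = X.mean(axis=0)
    S = np.cov(X - b, rowvar=False)                     # sample covariance
    eigval, eigvec = np.linalg.eigh(S)                  # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]      # sort descending
    sigma2 = eigval[d:].mean()                          # average of discarded eigenvalues
    W = eigvec[:, :d] * np.sqrt(np.maximum(eigval[:d] - sigma2, 0.0))
    return W, b, sigma2

# Sanity check on data generated from the model itself.
rng = np.random.default_rng(0)
D, d, n = 6, 2, 20000
W_true = rng.normal(size=(D, d))
x = rng.normal(size=(n, d)) @ W_true.T + 0.2 * rng.normal(size=(n, D))
W_hat, b_hat, sigma2_hat = probabilistic_pca(x, d)
print(sigma2_hat)                                       # should be close to 0.2**2 = 0.04
```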
What the probabilistic PCA model is basically saying is that the covariance
is mostly captured by the latent variables h, up to some small residual reconstruction
error σ². As shown by Tipping and Bishop (1999), probabilistic PCA
becomes PCA as σ → 0. In that case, the conditional expected value of h given
x becomes an orthogonal projection onto the space spanned by the d columns of
W, like in PCA. See Section 18.1 for a discussion of the “inference” mechanism
associated with PCA (probabilistic or not), i.e., recovering the expected value of
the latent factors h_i given the observed input x. That section also explains the
very insightful geometric and manifold interpretation of PCA.
However, as σ → 0, the density model becomes very sharp around these d
dimensions spanned by the columns of W, as discussed in Section 18.1, which would
not make it a very faithful model of the data in general (not just because the
data may live on a higher-dimensional manifold, but more importantly because
the real data manifold may not be a flat hyperplane; see Chapter 18 for more).
16.5.1 ICA
Independent Component Analysis (ICA) is among the oldest representation learn-
ing algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994;
Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling linear
factors that seeks non-Gaussian projections of the data. Like probabilistic PCA
and factor analysis, it also fits the linear factor model of Eqs. 16.2 and 16.3. What
is particular about ICA is that unlike PCA and factor analysis it does not assume
that the latent variable prior is Gaussian. It only assumes that it is factorized,
i.e.,
P(h) = Π_i P(h_i).    (16.4)
Since there is no parametric assumption behind the prior, we are really in front
of a so-called semi-parametric model, with parts of the model being parametric
(P (x|h)) and parts being non-specified or non-parametric (P (h)). In fact, this
typically leads to non-Gaussian priors: if the priors were Gaussian, then one
could not distinguish between the factors h and a rotation of h. Indeed, note
that if
h = Uz
with U an orthonormal (rotation) square matrix, i.e.,
z = U^T h,

then, although h might have a Normal(0, I) distribution, the z also have a unit
covariance, i.e., they are uncorrelated:

Var[z] = E[zz^T] = E[U^T h h^T U] = U^T Var[h] U = U^T U = I.
In other words, imposing independence among Gaussian factors does not allow one
to disentangle them, and we could as well recover any linear rotation of these fac-
tors. It means that, given the observed x, even though we might assume the right
generative model, PCA cannot recover the original generative factors. However, if
we assume that the latent variables are non-Gaussian, then we can recover them,
and this is what ICA is trying to achieve. In fact, under these generative model
assumptions, the true underlying factors can be recovered (Comon, 1994). Indeed,
many ICA algorithms are looking for projections of the data s = V x such that
they are maximally non-Gaussian. An intuitive explanation for these approaches
is that although the true latent variables h may be non-Gaussian, almost any
linear combination of them will look more Gaussian, because of the central limit
theorem. Since linear combinations of the x_i’s are also linear combinations of
the h_j’s, to recover the h_j’s we just need to find the linear combinations that are
maximally non-Gaussian (while keeping these different projections orthogonal to
each other).
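The following small numerical illustration (a sketch added here, not part of the original text) shows this central-limit intuition: independent Laplace-distributed factors have high excess kurtosis, while a random mixture of them looks more Gaussian (excess kurtosis closer to zero). The mixing matrix and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200000, 4
h = rng.laplace(size=(n, d))                 # non-Gaussian (sparse) independent factors

def excess_kurtosis(v):
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2)**2 - 3.0   # 0 for a Gaussian, 3 for a Laplace

A = rng.normal(size=(d, d))                  # random mixing matrix
x = h @ A.T                                  # observed mixtures x = A h

print(excess_kurtosis(h[:, 0]))              # close to 3 (strongly non-Gaussian)
print(excess_kurtosis(x[:, 0]))              # noticeably closer to 0 (more Gaussian)
```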
There is an interesting connection between ICA and sparsity, since the dom-
inant form of non-Gaussianity in real data is due to sparsity, i.e., concentration
of probability at or near 0. Non-Gaussian distributions typically have more mass
around zero, although one can also get non-Gaussianity by increasing skewness
(asymmetry) or kurtosis.
Just as PCA can be generalized to the non-linear auto-encoders described later in
this chapter, ICA can be generalized to a non-linear generative model, e.g., x =
f(h) + noise. See Hyvärinen and Pajunen (1999) for the initial work on non-linear
ICA and its successful use with ensemble learning by Roberts and Everson (2001);
Lappalainen et al. (2000).
16.5.2 Sparse Coding as a Generative Model
One particularly interesting form of non-Gaussianity arises with distributions that
are sparse. These typically have not just a peak at 0 but also a fat tail.⁵ Like the
other linear factor models (Eq. 16.3), sparse coding corresponds to a linear factor
model, but one with a “sparse” latent variable h, i.e., P (h) puts high probability
at or around 0. Unlike with ICA (previous section), the latent variable prior is
parametric. For example the factorized Laplace density prior is
P(h) = Π_i P(h_i) = Π_i (λ/2) e^{−λ|h_i|}    (16.5)
⁵ With probability going to 0 as the values increase in magnitude at a rate that is slower than
the Gaussian, i.e., less than quadratic in the log-domain.
and the factorized Student-t prior is
P(h) = Π_i P(h_i) ∝ Π_i 1/(1 + h_i²/ν)^{(ν+1)/2}.    (16.6)
Both of these densities have a strong preference for near-zero values but, unlike
the Gaussian, accommodate large values. In the standard sparse coding models,
the reconstruction noise is assumed to be Gaussian, so that the corresponding
reconstruction error is the squared error.
Regarding sparsity, note that the actual value h_i = 0 has zero measure under
both densities, meaning that the posterior distribution P(h|x) will not generate
values h = 0. However, sparse coding is normally considered under a maximum
a posteriori (MAP) inference framework, in which the inferred values of h are
those that maximize the posterior, and these often tend to be zero if the prior
is sufficiently concentrated around 0. The inferred values are those defined in
Eq. 16.1, reproduced here:

h = f(x) = arg min_h L(g(h), x) + λΩ(h)

where L(g(h), x) is interpreted as −log P(x|g(h)) and Ω(h) as −log P(h). This
MAP inference view of sparse coding and an interesting probabilistic interpreta-
tion of sparse coding are further discussed in Section 20.3.
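As a concrete illustration of this MAP inference (a sketch added here, not from the original text), the L1-regularized objective of Eq. 16.1 with squared reconstruction error can be approximately minimized by proximal gradient descent (ISTA), alternating a gradient step on the reconstruction term with a soft-thresholding step for the L1 term; the dictionary, λ and step size below are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, W, lam, n_steps=100):
    """Approximate h = argmin_h ||x - W h||^2 + lam * ||h||_1 via ISTA."""
    step = 0.5 / (np.linalg.norm(W, 2) ** 2)      # step size from the Lipschitz constant
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * W.T @ (W @ h - x)            # gradient of the reconstruction term
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))                     # overcomplete dictionary (decoder)
h_true = np.zeros(50); h_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = W @ h_true + 0.01 * rng.normal(size=20)
h = sparse_code(x, W, lam=0.1)
print(np.count_nonzero(h))                        # typically only a few non-zero coefficients
```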
To relate the generative model of sparse coding to ICA, note how the prior
imposes not just sparsity but also independence of the latent variables h_i under
P(h), which may help to separate different explanatory factors, unlike PCA,
factor analysis or probabilistic PCA, because these rely on a Gaussian prior,
which yields a factorized prior under any rotation of the factors (multiplication
by an orthonormal matrix), as demonstrated in Section 16.5.1.
See Section 18.2 about the manifold interpretation of sparse coding.
A related model is spike-and-slab sparse coding (Goodfellow et al., 2012).
16.6 Reconstruction Error as Log-Likelihood
Although traditional auto-encoders (like traditional neural networks) were introduced
with an associated training loss, that training loss can generally be given a
probabilistic interpretation as a conditional log-likelihood of the original input x,
given the representation h.
We have already covered negative log-likelihood as a loss function in general
for feedforward neural networks in Section 6.2.2. Like prediction error for regular
feedforward neural networks, reconstruction error for auto-encoders does not have
to be squared error. When we view the loss as negative log-likelihood, we interpret
the reconstruction error as
L = −log P(x|h)
where h is the representation, which may generally be obtained through an en-
coder taking x as input.
Figure 16.5: The computational graph of an auto-encoder, which is trained to maximize
the probability assigned by the decoder g to the data point x, given the output of the
encoder h = f(x). The training objective is thus L = −log P(x|g(f(x))), which ends up
being squared reconstruction error if we choose a Gaussian reconstruction distribution
with mean g(f(x)), and cross-entropy if we choose a factorized Bernoulli reconstruction
distribution with means g(f(x)).
An advantage of this view is that it immediately tells us what kind of loss
function one should use depending on the nature of the input. If the input is real-
valued and unbounded, then squared error is a reasonable choice of reconstruction
error, and corresponds to P(x|h) being Normal. If the input is a vector of bits,
then cross-entropy is a more reasonable choice, and corresponds to P(x|h) =
Π_i P(x_i | h) with x_i | h being Bernoulli-distributed. We then view the decoder g(h)
as computing the parameters of the reconstruction distribution, i.e., P(x|h) =
P(x | g(h)).
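A sketch (added here for illustration) of the two most common choices: interpreting the decoder output as the mean of a Gaussian gives squared error, and interpreting it as Bernoulli probabilities gives cross-entropy. The helper names and example values are arbitrary.

```python
import numpy as np

def gaussian_nll(x, mean, sigma2=1.0):
    # -log N(x; mean, sigma2 * I), up to an additive constant: squared error / (2 sigma2)
    return np.sum((x - mean) ** 2) / (2.0 * sigma2)

def bernoulli_nll(x, p, eps=1e-7):
    # -log P(x | h) for factorized Bernoulli outputs: the cross-entropy loss
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

# real-valued input: decoder output g(h) is the Gaussian mean
x_real = np.array([0.3, -1.2, 0.7]); r = np.array([0.25, -1.0, 0.8])
print(gaussian_nll(x_real, r))

# binary input: decoder output g(h) is a vector of Bernoulli probabilities
x_bin = np.array([1.0, 0.0, 1.0]); p = np.array([0.9, 0.2, 0.7])
print(bernoulli_nll(x_bin, p))
```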
Another advantage of this view is that we can think about the training of the
decoder as estimating the conditional distribution P(x|h), which comes in handy in
the probabilistic interpretation of denoising auto-encoders, allowing us to talk
about the distribution P (x) explicitly or implicitly represented by the auto-
encoder (see Sections 16.8, 21.9.2 and 21.10 for more details). In the same spirit,
we can rethink the notion of encoder from a simple function to a conditional
distribution Q(h|x), with a special case being when Q(h|x) is a Dirac at some
particular value. Equivalently, thinking about the encoder as a distribution corre-
sponds to injecting noise inside the auto-encoder. This view is developed further
in Sections 21.9.2 and 21.11.
16.7 Sparse Representations
Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e.,
one whose elements are often either zero or close to zero. Sparse coding was in-
troduced in Section 16.5.2 as a linear factor model in which the prior P (h) on
the representation h = f(x) encourages values at or near 0. In Section 16.7.1, we
see how ordinary auto-encoders can be prevented from learning a useless identity
transformation by using a sparsity penalty rather than a bottleneck. The main
difference between a sparse auto-encoder and sparse coding is that sparse cod-
ing has no explicit parametric encoder, whereas sparse auto-encoders have one.
The “encoder” of sparse coding is the algorithm that performs the approximate
inference, i.e., looks for
h*(x) = arg max_h log P(h|x) = arg min_h ||x − (b + Wh)||²/σ² − log P(h)    (16.7)
where σ² is a reconstruction variance parameter (which should equal the average
squared reconstruction error⁶), and P(h) is a “sparse” prior that puts more probability
mass around h = 0, such as the Laplacian prior, with factorized marginals

P(h_i) = (λ/2) e^{−λ|h_i|}    (16.8)
or the Student-t prior, with factorized marginals
P(h_i) ∝ 1/(1 + h_i²/ν)^{(ν+1)/2}.    (16.9)
The advantages of such a non-parametric encoder and the sparse coding approach
over sparse auto-encoders are that
1. it can in principle minimize the combination of reconstruction error and
log-prior better than any parametric encoder,
2. it performs what is called explaining away (see Figure 14.8), i.e., it can
“choose” some “explanations” (hidden factors) and inhibit the others.
⁶ but it can be lumped into the regularizer λ which controls the strength of the sparsity prior,
defined in Eq. 16.8, for example.
The disadvantages are that
1. computing time for encoding the given input x, i.e., performing inference
(computing the representation h that goes with the given x) can be sub-
stantially larger than with a parametric encoder (because an optimization
must be performed for each example x), and
2. the resulting encoder function could be non-smooth and possibly too non-
linear (with two nearby x’s being associated with very different h’s), po-
tentially making it more difficult for the downstream layers to properly
generalize.
In Section 16.7.2, we describe PSD (Predictive Sparse Decomposition), which
combines a non-parametric encoder (as in sparse coding, with the representation
obtained via an optimization) and a parametric encoder (like in the sparse auto-
encoder). Section 16.8 introduces the Denoising Auto-Encoder (DAE), which
puts pressure on the representation by requiring it to extract information about
the underlying distribution and where it concentrates, so as to be able to denoise
a corrupted input. Section 16.9 describes the Contractive Auto-Encoder (CAE),
which optimizes an explicit regularization penalty that aims at making the rep-
resentation as insensitive as possible to the input, while keeping the information
sufficient to reconstruct the training examples.
16.7.1 Sparse Auto-Encoders
A sparse auto-encoder is simply an auto-encoder whose training criterion involves
a sparsity penalty Ω(h) in addition to the reconstruction error:
L = −log P(x|g(h)) + Ω(h)    (16.10)
where g(h) is the decoder output and typically we have h = f(x), the encoder
output.
We can think of that penalty Ω(h) simply as a regularizer or as a log-prior
on the representations h. For example, the sparsity penalty corresponding to the
Laplace prior ((λ/2) e^{−λ|h_i|}) is the absolute value sparsity penalty (see also Eq. 16.8
above):

Ω(h) = λ Σ_i |h_i|

−log P(h) = Σ_i (−log(λ/2) + λ|h_i|) = const + Ω(h)    (16.11)

where the constant term depends only on λ and not on h (which we typically ignore
in the training criterion because we consider λ as a hyperparameter rather than
a parameter). Similarly (as per Eq. 16.9), the sparsity penalty corresponding to
the Student-t prior (Olshausen and Field, 1997) is
Ω(h) = Σ_i ((ν + 1)/2) log(1 + h_i²/ν)    (16.12)
where ν is considered to be a hyperparameter.
The early work on sparse auto-encoders (Ranzato et al., 2007, 2008) con-
sidered various forms of sparsity and proposed a connection between sparsity
regularization and the partition function gradient in energy-based models (see
Section TODO). The idea is that a regularizer such as sparsity makes it difficult
for an auto-encoder to achieve zero reconstruction error everywhere. If we con-
sider reconstruction error as a proxy for energy (unnormalized log-probability of
the data), then minimizing the training set reconstruction error forces the energy
to be low on training examples, while the regularizer prevents it from being low
everywhere. The same role is played by the gradient of the partition function in
energy-based models such as the RBM (Section TODO).
However, the sparsity penalty of sparse auto-encoders does not need to have
a probabilistic interpretation. For example, Goodfellow et al. (2009) successfully
used the following sparsity penalty, which does not try to bring h_i all the way
down to 0, but only towards some low target value such as ρ = 0.05:

Ω(h) = −Σ_i (ρ log h_i + (1 − ρ) log(1 − h_i))    (16.13)

where 0 < h_i < 1, usually with h_i = sigmoid(a_i). This is just the cross-entropy
between the Bernoulli distribution with probability p = h_i and the target Bernoulli
distribution with probability p = ρ.
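A sketch (an illustration added here, not from the text) of the penalty of Eq. 16.13 applied to sigmoid code activations; the pre-activation values and the target ρ = 0.05 are arbitrary.

```python
import numpy as np

def sparsity_penalty(h, rho=0.05, eps=1e-7):
    """Cross-entropy sparsity penalty of Eq. 16.13, for activations h in (0, 1)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.sum(rho * np.log(h) + (1.0 - rho) * np.log(1.0 - h))

a = np.array([-3.0, 0.5, 2.0, -1.0])        # pre-activations of the code layer
h = 1.0 / (1.0 + np.exp(-a))                # h_i = sigmoid(a_i), so each h_i is in (0, 1)
print(sparsity_penalty(h))                  # large when many h_i are far from rho

h_sparse = np.array([0.05, 0.04, 0.06, 0.05])
print(sparsity_penalty(h_sparse))           # near its minimum when all h_i are close to rho
```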
One way to achieve actual zeros in h for sparse (and denoising) auto-encoders
was introduced in Glorot et al. (2011c). The idea is to use a half-rectifier (a.k.a.
“rectifier”) or ReLU (Rectified Linear Unit, introduced in Glorot et al.
(2011b) for deep supervised networks and earlier in Nair and Hinton (2010a) in
the context of RBMs) as the output non-linearity of the encoder. With a prior
that actually pushes the representations to zero (like the absolute value penalty),
one can thus indirectly control the average number of zeros in the representation.
ReLUs were first successfully used for deep feedforward networks in Glorot et al.
(2011a), achieving for the first time the ability to train fairly deep supervised
networks without the need for unsupervised pre-training, and this turned out to
be an important component in the 2012 object recognition breakthrough with
deep convolutional networks (Krizhevsky et al., 2012b).
Interestingly, the “regularizer” used in sparse auto-encoders does not conform
to the classical interpretation of regularizers as priors on the parameters. That
classical interpretation of the regularizer comes from the MAP (Maximum A
Posteriori) point estimation (see Section 5.6.1) of parameters associated with
the Bayesian view of parameters as random variables and considering the joint
distribution of data x and parameters θ (see Section 5.8):
arg max_θ P(θ|x) = arg max_θ (log P(x|θ) + log P(θ))
where the first term on the right is the usual data log-likelihood term and the
second term, the log-prior over parameters, incorporates the preference over par-
ticular values of θ.
With regularized auto-encoders such as sparse auto-encoders and contractive
auto-encoders, instead, the regularizer corresponds to a log-prior over the repre-
sentation, or over latent variables. In the case of sparse auto-encoders, predictive
sparse decomposition and contractive auto-encoders, the regularizer specifies a
preference over functions of the data, rather than over parameters. This makes
such a regularizer data-dependent, unlike the classical parameter log-prior. Specif-
ically, in the case of the sparse auto-encoder, it says that we prefer an encoder
whose output produces values closer to 0. Indirectly (when we marginalize over
the training distribution), this is also indicating a preference over parameters, of
course.
16.7.2 Predictive Sparse Decomposition
Predictive Sparse Decomposition (PSD) is a variant that combines sparse cod-
ing and a parametric encoder (Kavukcuoglu et al., 2008b), i.e., it has both a
parametric encoder and iterative inference. It has been applied to unsupervised
feature learning for object recognition in images and video (Kavukcuoglu et al.,
2009, 2010; Jarrett et al., 2009a; Farabet et al., 2011), as well as for audio (Henaff
et al., 2011). The representation is considered to be a free variable (possibly a la-
tent variable if we choose a probabilistic interpretation) and the training criterion
combines a sparse coding criterion with a term that encourages the optimized
sparse representation h (after inference) to be close to the output of the encoder
f(x):
L = min_h ||x − g(h)||² + λ|h|_1 + γ||h − f(x)||²    (16.14)
where f is the encoder and g is the decoder. Like in sparse coding, for each exam-
ple x an iterative optimization is performed in order to obtain a representation h.
However, because the iterations can be initialized from the output of the encoder,
i.e., with h = f(x), only a few steps (e.g. 10) are necessary to obtain good results.
Simple gradient descent on h has been used by the authors. After h is settled,
both g and f are updated towards minimizing the above criterion. The first two
terms are the same as in L1 sparse coding while the third one encourages f to
predict the outcome of the sparse coding optimization, making it a better choice
for the initialization of the iterative optimization. Hence f can be used as a para-
metric approximation to the non-parametric encoder implicitly defined by sparse
coding. It is one of the first instances of learned approximate inference (see also
Sec. 20.6). Note that this is different from separately doing sparse coding (i.e.,
training g) and then training an approximate inference mechanism f, since both
the encoder and decoder are trained together to be “compatible” with each other.
Hence the decoder will be learned in such a way that inference will tend to find
solutions that can be well approximated by the approximate inference. A similar
example is the variational auto-encoder, in which the encoder acts as approximate
inference for the decoder, and both are trained jointly (Section 21.9.2). See also
Section 21.9.3 for a probabilistic interpretation of PSD in terms of a variational
lower bound on the log-likelihood.
In practical applications of PSD, the iterative optimization is only used during
training, and f is used to compute the learned features. It makes computation
fast at recognition time and also makes it easy to use the trained features f as
initialization (unsupervised pre-training) for the lower layers of a deep net. Like
other unsupervised feature learning schemes, PSD can be stacked greedily, e.g.,
training a second PSD on top of the features extracted by the first one, etc.
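Below is a schematic sketch (added here; simplified to a single example, a linear encoder and decoder, and plain gradient steps) of the PSD training loop described above: an inner optimization over h initialized from the encoder output, followed by a gradient step on the decoder and encoder parameters. All hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 20, 40
W_g = 0.1 * rng.normal(size=(d_x, d_h))       # decoder weights: g(h) = W_g h
W_f = 0.1 * rng.normal(size=(d_h, d_x))       # encoder weights: f(x) = W_f x
lam, gamma, lr = 0.1, 1.0, 0.01

def psd_step(x, W_f, W_g, n_inner=10, inner_lr=0.05):
    h = W_f @ x                                # initialize h from the encoder output
    for _ in range(n_inner):                   # a few inner gradient steps on h (Eq. 16.14)
        grad_h = (2.0 * W_g.T @ (W_g @ h - x)
                  + lam * np.sign(h)
                  + 2.0 * gamma * (h - W_f @ x))
        h -= inner_lr * grad_h
    # with h fixed, take one gradient step on the decoder and encoder parameters
    W_g = W_g - lr * 2.0 * np.outer(W_g @ h - x, h)
    W_f = W_f - lr * 2.0 * gamma * np.outer(W_f @ x - h, x)
    return h, W_f, W_g

for _ in range(100):                           # toy training loop on random "data"
    x = rng.normal(size=d_x)
    h, W_f, W_g = psd_step(x, W_f, W_g)
```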
16.8 Denoising Auto-Encoders
The Denoising Auto-Encoder (DAE) was first proposed (Vincent et al., 2008,
2010) as a means of forcing an auto-encoder to learn to capture the data distribu-
tion without an explicit constraint on either the dimension or the sparsity of the
learned representation. It was motivated by the idea that in order to fully capture
a complex distribution, an auto-encoder needs to have at least as many hidden
units as needed by the complexity of that distribution. Hence its dimensionality
should not be restricted to the input dimension.
The principle of the denoising auto-encoder is deceptively simple and illus-
trated in Figure 16.6: the encoder sees as input a corrupted version of the input,
but the decoder tries to reconstruct the clean uncorrupted input.
Figure 16.6: The computational graph of a denoising auto-encoder, which is trained to
reconstruct the clean data point x from its corrupted version x̃, i.e., to minimize the loss
L = −log P(x|g(f(x̃))), where x̃ is a corrupted version of the data example x, obtained
through a given corruption process C(x̃|x).
Mathematically, and following the notations used in this chapter, this can be
formalized as follows. We introduce a corruption process C(x̃|x) which represents
a conditional distribution over corrupted samples x̃, given a data sample
x. The auto-encoder then learns a reconstruction distribution P(x|x̃) estimated
from training pairs (x, x̃), as follows:
1. Sample a training example x from the data generating distribution (the
training set).
2. Sample a corrupted version x̃ from the conditional distribution C(x̃|x).
3. Use (x, x̃) as a training example for estimating the auto-encoder reconstruction
distribution P(x|x̃) = P(x|g(h)), with h the output of the encoder f(x̃)
and g(h) the output of the decoder.
Typically we can simply perform gradient-based approximate minimization (such
as minibatch gradient descent) on the negative log-likelihood −log P(x|h), i.e.,
the denoising reconstruction error, using back-propagation to compute gradients,
just like for regular feedforward neural networks (the only difference being the
corruption of the input and the choice of target output).
We can view this training objective as performing stochastic gradient descent
on the denoising reconstruction error, but where the “noise” now has two sources:
1. the choice of training sample x from the data set, and
2. the random corruption applied to x to obtain x̃.
We can therefore consider that the DAE is performing stochastic gradient
descent on the following expectation:

−E_{x∼Q(x)} E_{x̃∼C(x̃|x)} [log P(x|g(f(x̃)))]

where Q(x) is the training distribution.
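A minimal sketch (added for illustration, not the original authors' implementation) of one such stochastic training step, with Gaussian corruption and squared-error reconstruction of the clean input; the ReLU encoder, linear decoder and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, sigma, lr = 10, 30, 0.3, 0.01        # overcomplete code; sigma is the corruption std
W_e = 0.1 * rng.normal(size=(d_h, d_x)); b_e = np.zeros(d_h)
W_d = 0.1 * rng.normal(size=(d_x, d_h)); b_d = np.zeros(d_x)

def dae_step(x):
    global W_e, b_e, W_d, b_d
    x_tilde = x + sigma * rng.normal(size=d_x)  # corruption C(x_tilde | x): additive Gaussian noise
    a = W_e @ x_tilde + b_e
    h = np.maximum(a, 0.0)                      # encoder f(x_tilde), ReLU code
    r = W_d @ h + b_d                           # decoder g(h)
    # squared-error denoising loss: reconstruct the *clean* x, not x_tilde
    dr = 2.0 * (r - x)                          # gradient of ||r - x||^2 w.r.t. r
    dh = W_d.T @ dr
    da = dh * (a > 0)                           # backprop through the ReLU
    W_d -= lr * np.outer(dr, h); b_d -= lr * dr
    W_e -= lr * np.outer(da, x_tilde); b_e -= lr * da
    return np.sum((r - x) ** 2)

for _ in range(1000):                           # toy loop on synthetic examples
    loss = dae_step(rng.normal(size=d_x))
```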
Figure 16.7: A denoising auto-encoder is trained to reconstruct the clean data point x
from its corrupted version x̃. In the figure, we illustrate the corruption process C(x̃|x)
by a grey circle of equiprobable corruptions (and a grey arrow for the corruption process),
acting on examples x (red crosses) lying near a low-dimensional manifold near which
probability concentrates. When the denoising auto-encoder is trained to minimize the
average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E[x|x̃],
which approximately points orthogonally towards the manifold, since it estimates the
center of mass of the clean points x which could have given rise to x̃. The auto-encoder
thus learns a vector field g(f(x)) − x (the green arrows) and it turns out that this
vector field estimates the gradient field ∂ log Q(x)/∂x (up to a multiplicative factor that is the
average root mean square reconstruction error), where Q is the unknown data generating
distribution.
16.8.1 Learning a Vector Field that Estimates a Gradient Field
As illustrated in Figure 16.7, a very important property of DAEs is that their
training criterion makes the auto-encoder learn a vector field (g(f(x)) − x) that
estimates the gradient field (or score) ∂ log Q(x)/∂x, as per Eq. 16.15. A first result in
this direction was proven by Vincent (2011a), showing that minimizing squared
reconstruction error in a denoising auto-encoder with Gaussian noise was related
to score matching (Hyvärinen, 2005a), making the denoising criterion a regular-
ized form of score matching called denoising score matching (Kingma and LeCun,
2010a). Score matching is an alternative to maximum likelihood and provides a
consistent estimator. It is discussed further in Section 19.4. The denoising version
is discussed in Section 19.5.
The connection between denoising auto-encoders and score matching was first
made (Vincent, 2011a) in the case where the denoising auto-encoder has a par-
ticular parametrization (one hidden layer, sigmoid activation functions on hidden
units, linear reconstruction), in which case the denoising criterion actually corre-
sponds to a regularized form of score matching on a Gaussian RBM (with binomial
hidden units and Gaussian visible units). The connection between ordinary auto-
encoders and Gaussian RBMs had previously been made by Bengio and Delalleau
(2009), who showed that contrastive divergence training of RBMs was related to
an associated auto-encoder gradient, and later by Swersky (2010), who showed
that non-denoising reconstruction error corresponded to score matching plus a
regularizer.
The fact that the denoising criterion yields an estimator of the score for gen-
eral encoder/decoder parametrizations has been proven (Alain and Bengio, 2012,
2013) in the case where the corruption and the reconstruction distributions are
Gaussian (and of course x is continuous-valued), i.e., with the squared error
denoising error

||g(f(x̃)) − x||²

and corruption

C(x̃|x) = N(x̃; µ = x, Σ = σ²I)

with noise variance σ².
Figure 16.8: Vector field learned by a denoising auto-encoder around a 1-D curved man-
ifold near which the data (orange circles) concentrates in a 2-D space. Each arrow is
proportional to the reconstruction minus input vector of the auto-encoder and points
towards higher probability according to the implicitly estimated probability distribution.
Note that the vector field has zeros at both peaks of the estimated density function (on
the data manifolds) and at troughs (local minima) of that density function, e.g., on the
curve that separates different arms of the spiral or in the middle of it.
More precisely, the main theorem states that (g(f(x)) − x)/σ² is a consistent estimator
of ∂ log Q(x)/∂x, where Q(x) is the data generating distribution,

(g(f(x)) − x)/σ²  →  ∂ log Q(x)/∂x,    (16.15)
so long as f and g have sufficient capacity to represent the true score (and assum-
ing that the expected training criterion can be minimized, as usual when proving
consistency associated with a training objective).
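The following numerical illustration (added here; a 1-D Gaussian toy case, where the optimal denoiser is available in closed form) checks the spirit of Eq. 16.15: the optimal reconstruction is E[x|x̃], and (E[x|x̃] − x̃)/σ² equals the score of the corrupted (smoothed) density, which approaches the score of the data density as σ² → 0.

```python
import numpy as np

rng = np.random.default_rng(0)
s2, sigma2, n = 4.0, 0.01, 200000                   # data variance, corruption variance, samples

x = np.sqrt(s2) * rng.normal(size=n)                # data x ~ N(0, s2)
x_tilde = x + np.sqrt(sigma2) * rng.normal(size=n)  # corrupted x_tilde

# In this toy case the minimizer of E||r(x_tilde) - x||^2 is known in closed form:
# r*(x_tilde) = E[x | x_tilde] = s2 / (s2 + sigma2) * x_tilde
r = s2 / (s2 + sigma2) * x_tilde

estimated_score = (r - x_tilde) / sigma2            # (g(f(x)) - x) / sigma^2, as in Eq. 16.15
smoothed_score = -x_tilde / (s2 + sigma2)           # score of the corrupted density N(0, s2 + sigma2)
data_score = -x_tilde / s2                          # score of the data density, evaluated at x_tilde

print(np.allclose(estimated_score, smoothed_score))   # exact match in this closed-form case
print(np.mean(np.abs(estimated_score - data_score)))  # small, and goes to 0 as sigma2 -> 0
```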
Note that in general, there is no guarantee that the reconstruction g(f(x))
minus the input x corresponds to the gradient of something (the estimated score
should be the gradient of the estimated log-density with respect to the input
x). That is why the early results (Vincent, 2011a) are specialized to particular
parametrizations where g(f(x)) − x is the derivative of something. See a more
general treatment by Kamyshanska and Memisevic (2015).
Although it was intuitively appealing that in order to denoise correctly one
must capture the training distribution, the above consistency result makes it
mathematically very clear in what sense the DAE is capturing the input distribu-
tion: it is estimating the gradient of its energy function (i.e., of its log-density),
i.e., learning to point towards more probable (lower energy) configurations. Fig-
ure 16.8 (see details of experiment in Alain and Bengio (2013)) illustrates this.
Note how the norm of reconstruction error (i.e. the norm of the vectors shown in
the figure) is related to but different from the energy (unnormalized log-density)
associated with the estimated model. The energy should be low only where the
probability is high. The reconstruction error (norm of the estimated score vector)
is low where probability is near a peak of probability (or a trough of energy), but
it can also be low at maxima of energy (minima of probability).
Section 21.10 continues the discussion of the relationship between denoising
auto-encoders and probabilistic modeling by showing how one can generate from
the distribution implicitly estimated by a denoising auto-encoder. Whereas Alain
and Bengio (2013) generalized the score estimation result of Vincent (2011a) to
arbitrary parametrizations, the result from Bengio et al. (2013b), discussed in
Section 21.10, provides a probabilistic (and in fact generative) interpretation of
every denoising auto-encoder.
16.9 Contractive Auto-Encoders
The Contractive Auto-Encoder or CAE (Rifai et al., 2011a,c) introduces an ex-
plicit regularizer on the code h = f(x), encouraging the derivatives of f to be as
small as possible:
Ω(h) = ||∂f(x)/∂x||²_F    (16.16)
which is the squared Frobenius norm (sum of squared elements) of the Jacobian
matrix of partial derivatives associated with the encoder function. Whereas the
denoising auto-encoder learns to contract the reconstruction function (the com-
position of the encoder and decoder), the CAE learns to specifically contract the
encoder. See Figure 18.13 for a view of how contraction near the data points
makes the auto-encoder capture the manifold structure.
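For a one-hidden-layer sigmoid encoder h = sigmoid(Wx + b), the Jacobian has the simple form diag(h ⊙ (1 − h)) W, so the penalty of Eq. 16.16 is cheap to compute. The sketch below (added for illustration, with arbitrary shapes and values) computes it explicitly and checks it against a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 8
W = rng.normal(size=(d_h, d_x)); b = rng.normal(size=d_h)

def f(x):
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))         # sigmoid encoder

def contractive_penalty(x):
    h = f(x)
    J = (h * (1.0 - h))[:, None] * W                  # Jacobian df/dx = diag(h(1-h)) W
    return np.sum(J ** 2)                             # squared Frobenius norm, Eq. 16.16

# finite-difference check of the Jacobian-based penalty
x = rng.normal(size=d_x)
eps = 1e-5
J_fd = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(d_x)], axis=1)
print(np.allclose(contractive_penalty(x), np.sum(J_fd ** 2), atol=1e-6))
```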
If it weren’t for the opposing force of reconstruction error, which attempts
to make the code h keep all the information necessary to reconstruct training
examples, the CAE penalty would yield a code h that is constant and does not
depend on the input x. The compromise between these two forces yields an auto-
encoder whose derivatives ∂f(x)/∂x are tiny in most directions, except those that are
needed to reconstruct training examples, i.e., the directions that are tangent to
the manifold near which data concentrate. Indeed, in order to distinguish (and
thus, reconstruct correctly) two nearby examples on the manifold, one must assign
them a different code, i.e., f(x) must vary as x moves from one to the other, i.e.,
in the direction of a tangent to the manifold.
Figure 16.9: Average (over test examples) of the singular value spectrum of the Jacobian
matrix ∂f(x)/∂x for the encoder f learned by a regular auto-encoder (AE) versus a contractive
auto-encoder (CAE). This illustrates how the contractive regularizer yields a smaller set
of directions in input space (those corresponding to large singular values of the Jacobian)
which provoke a response in the representation h, while the representation remains almost
insensitive for most directions of change in the input.
What is interesting is that this penalty forces the representation to be invariant
most strongly in directions orthogonal to the manifold. This can be seen clearly
by comparing the singular value spectrum of the Jacobian ∂f(x)/∂x for different auto-
encoders, as shown in Figure 16.9. We see that the CAE manages to concentrate
the sensitivity of the representation in fewer dimensions than a regular (or sparse)
auto-encoder. Figure 18.3 illustrates tangent vectors obtained by a CAE on the
MNIST digits dataset, showing that the leading tangent vectors correspond to
small deformations such as translation. More impressively, Figure 16.10 shows
tangent vectors learned on 32×32 color (RGB) CIFAR-10 images by a CAE,
compared to the tangent vectors estimated by a non-distributed representation learner (a
mixture of local PCAs).
Figure 16.10: Illustration of tangent vectors (bottom) of the manifold estimated by a
contractive auto-encoder (CAE), at some input point (left, CIFAR-10 image of a dog).
See also Fig. 18.3. Each image on the right corresponds to a tangent vector, either
estimated by a local PCA (equivalent to a Gaussian mixture), top, or by a CAE (bottom).
The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix
∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture local
tangents that are different at different points, the local PCA does not have enough training
data to meaningfully capture good tangent directions, whereas the CAE does (because it
exploits parameter sharing across different locations that share a subset of active hidden
units). The CAE tangent directions typically correspond to moving or changing parts of
the object (such as the head or legs), which corresponds to plausible changes in the input
image.
One practical issue with the CAE regularization criterion is that although it
is cheap to compute in the case of a single hidden layer auto-encoder, it becomes
much more expensive in the case of deeper auto-encoders. The strategy followed
by Rifai et al. (2011a) is to separately pre-train each single-layer auto-encoder
stacked to form a deeper auto-encoder. However, a deeper encoder could be
advantageous in spite of the computational overhead, as argued by Schulz and
Behnke (2012).
Another practical issue is that the contraction penalty on the encoder f could
yield useless results if the decoder g were to exactly compensate (e.g. by being
scaled up by exactly the same amount as f is scaled down). In Rifai et al.
(2011a), this is dealt with by tying the weights of f and g, both being of the
form of an affine transformation followed by a non-linearity (e.g. sigmoid), with
the weights of g the transpose of the weights of f.