Chapter 13
The Manifold Perspective on
Auto-Encoders
Manifold learning is an approach to machine learning that is capitalizing on the manifold
hypothesis (Cayton, 2005; Narayanan and Mitter, 2010): the data generating distribution
is assumed to concentrate near regions of low dimensionality. The notion of manifold in
mathematics refers to continuous spaces that locally resemble Euclidean space, and the
term we should be using is really submanifold, which corresponds to a subset which has
a manifold structure. The use of the term manifold in machine learning is much looser
than its use in mathematics, though:
the data may not be strictly on the manifold, but only near it,
the dimensionality many not be the same everywhere,
the notion actually referred to in machine learning naturally extends to discrete
spaces.
Figure 13.1: Top: data sampled from a distribution that is actually concentrated near
a one-dimensional manifold. Bottom: the underlying manifold that the learner should
infer.
Indeed, although the very notions of a manifold or submanifold are defined for con-
tinuous spaces, the more general notion of probability concentration applies equally well
to discrete data. It is a kind of informal prior assumption about the data generating dis-
tribution that seems particularly well-suited for AI tasks such as those involving images,
video, speech, music, text, etc. In all of these cases the natural data has the property
that randomly choosing configurations of the observed variables according to a factored
distribution (e.g., uniformly) is very unlikely to generate the kind of observations we
want to model. What is the probability of generating a natural looking image by choos-
ing pixel intensities independently of each other? What is the probability of generating
a meaningful natural language paragraph by independently choosing each character in
a string? Exponentially tiny. That is because the probability distribution of interest
concentrates in a tiny volume of the total space of configurations. That means that to
first degree, the problem of characterizing the data generating distribution can be re-
duced to a binary classification problem: is this configuration probable or not? Is this a
grammatically and semantically plausible sentence in English? Is this a natural-looking
image? Answering these questions tells us much more about the nature of natural lan-
guage or text than the additional information one would have by being able to assign
a precise probability to each possible sequence of characters or set of pixels. Hence,
simply characterizing where probability concentrates is of fundamental importance, and
this is what manifold learning algorithms attempt to do. Because it is a where question,
it is more about geometry than about probability distributions, although we find both
views useful when designing learning algorithms for AI tasks.
Figure 13.2: A two-dimensional manifold near which training examples are concentrated,
along with a tangent plane and its associated tangent directions, forming a basis that
specifies the directions of small moves one can make to stay on the manifold.
In addition to the property of probability concentration, there is another one that
characterizes the manifold hypothesis: when a configuration is probable it is generally
surrounded (at least in some directions) by other probable configurations. If a configu-
ration of pixels looks like a natural image, then there are tiny changes one can make
to the image (like translating everything by 0.1 pixel to the left) which yield another
natural-looking image. The number of independent ways (each characterized by a num-
ber indicating how much or whether we do it) by which a probable configuration can be
locally transformed into another probable configuration indicates the local dimension of
the manifold. Whereas maximum likelihood procedures tend to concentrate probability
mass on the training examples (which can each become a local maximum of probability
when the model overfits), the manifold hypothesis suggests that good solutions instead
concentrate probability along ridges of high probability (or their high-dimensional gener-
alization) that connect nearby examples to each other. This is illustrated in Figure 13.1.
An important characterization of a manifold is the set of its tangent planes. At a point
x on a d-dimensional manifold, the tangent plane is given by d basis vectors that span
the local directions of variation allowed on the manifold. As illustrated in Figure 13.2,
these local directions specify how one can change x infinitesimally while staying on the
manifold.
Figure 13.3: Non-parametric manifold learning procedures build a nearest neighbor
graph whose nodes are training examples and arcs connect nearest neighbors. Various
procedures can thus obtain the tangent plane associated with a neighborhood of the
graph, and a coordinate system that associates each training example with a real-valued
vector position. It is possible to generalize to new examples by a form of interpolation.
So long as the number of examples is large enough to cover the curvature and twists
of the manifold, these approaches work well. Images from the QMUL Multiview Face
Dataset (Gong et al., 2000).
Manifold learning has mostly focused on unsupervised learning procedures that at-
tempt to capture these manifolds. Most of the initial machine learning research on
learning non-linear manifolds has focused on non-parametric methods based on the
nearest-neighbor graph. This graph has one node per training example and edges con-
necting near neighbors. Basically, these methods (Schölkopf et al., 1998; Roweis and
Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and
Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten
and Hinton, 2008) associate each of these nodes with a tangent plane that spans the
directions of variations associated with the difference vectors between the example and
its neighbors, as illustrated in Figure 13.3.
Figure 13.4: If the tangent planes at each location are known, they can be tiled to
form a global coordinate system or a density function. In the figure, each local patch
can be thought of as a local Euclidean coordinate system or as a local flat Gaussian
(or “pancake”). The average of all these Gaussians would provide an estimated density
function, as in the Manifold Parzen algorithm (Vincent and Bengio, 2003) or its non-local
neural-net based variant (Bengio et al., 2006b).
A global coordinate system can then be obtained through an optimization or by solving
a linear system. Figure 13.4 illustrates how a manifold can be tiled by a large number
of locally linear Gaussian-like patches (or “pancakes”, because the Gaussians are flat
in the tangent directions). However, there is a fundamental difficulty with such non-
parametric neighborhood-based approaches to manifold learning, raised in Bengio and
Monperrus (2005): if the manifolds are not very smooth (they have many ups and
downs and twists), one may need a very large number of training examples to cover
each one of these variations, with no chance to generalize to unseen variations. Indeed,
these methods can only generalize the shape of the manifold by interpolating between
neighboring examples. Unfortunately, the manifolds of interest in AI have many ups and
downs and twists and strong curvature, as illustrated in Figure 13.5. This motivates the
use of distributed representations and deep learning for capturing manifold structure,
which is the subject of this chapter.
Figure 13.5: When the data are images, the tangent vectors can also be visualized like
images. Here we show the tangent vector associated with translation: it corresponds
to the difference between an image and a slightly translated version. This basically
extracts edges, with the sign (shown as red or green color) indicating whether ink is
being added or subtracted when moving to the right. The figure illustrates the fact that
the tangent vectors along the manifolds associated with translation, rotation and such
apparently benign transformations of images, can be highly curved: the tangent vector
at a position x is very different from the tangent vector at a nearby position (slightly
translated image). Note how the two tangent vectors shown have almost no overlap,
i.e., their dot product is nearly 0, and they are as far as two such images can be. If
the tangent vectors of a manifold change quickly, it means that the manifold is highly
curved. This is going to make it very difficult to capture such manifolds (with a huge
number of twists, ups and downs) with local non-parametric methods such as graph-
based non-parametric manifold learning. This point was made in Bengio and Monperrus
(2005).
The hope of many manifold learning algorithms, including those based on deep
learning and auto-encoders, is that one learns an explicit or implicit coordinate system
for the leading factors of variation that explain most of the structure in the unknown
data generating distribution. An example of an explicit coordinate system is one where the
dimensions of the representation (e.g., the outputs of the encoder, i.e., of the hidden
units that compute the “code” associated with the input) are directly the coordinates
that map the unknown manifold. Training examples of a face dataset in which the
images have been arranged visually on a 2-D manifold are shown in Figure 13.6, with
the images laid down so that each of the two axes corresponds to one of the two angles
of rotation of the face.
Figure 13.6: Training examples from a face dataset, the QMUL Multiview Face
Dataset (Gong et al., 2000), for which the subjects were asked to move in such a
way as to cover the two-dimensional manifold corresponding to two angles of rotation.
We would like learning algorithms to be able to discover and disentangle such factors:
the objective is to discover such manifolds rather than being given them. Figure 13.7
illustrates such a feat, showing the images generated by a variational auto-encoder
(Kingma and Welling, 2014) when the two-dimensional auto-encoder code (representation)
is varied on the 2-D plane. Note how the algorithm actually discovered two independent
factors of variation: angle of rotation and emotional expression.
[Figure reproduced from Kingma and Welling (2014): (a) learned Frey Face manifold, (b) learned MNIST manifold; see the caption of Figure 13.7 below.]
Figure 13.7: Two-dimensional representation space (for easier visualization), i.e., a Eu-
clidean coordinate system for Frey faces (left) and MNIST digits (right), learned by a
variational auto-encoder (Kingma and Welling, 2014). Figures reproduced with permis-
sion from the authors. The images shown are not examples from the training set but
actually generated by the model, simply by changing the 2-D “code”. On the left, one
dimension that has been discovered (horizontal) mostly corresponds to a rotation of the
face, while the other (vertical) corresponds to the emotional expression. The decoder
deterministically maps codes (here two numbers) to images. The encoder maps images
to codes (and adds noise, during training).
Another kind of interesting illustration of manifold learning involves the discovery
of distributed representations for words. Neural language models were initiated with the
work of Bengio et al. (2001b, 2003), in which a neural network is trained to predict
the next word in a sequence of natural language text, given the previous words, and
where each word is represented by a real-valued vector, called embedding or neural word
embedding.
Figure 13.8 shows such neural word embeddings reduced to two dimensions (orig-
inally 50 or 100) using the t-SNE non-linear dimensionality reduction algorithm (van
der Maaten and Hinton, 2008). The figure zooms into different areas of the word-space
and illustrates that words that are semantically and syntactically close end up having
nearby embeddings.
Figure 13.8: Two-dimensional representation space (for easier visualization) of English
words, learned by a neural language model as in Bengio et al. (2001b, 2003), with t-SNE
for the non-linear dimensionality reduction from 100 to 2. Different regions are zoomed
to better see the details. At the global level one can identify big clusters correspond-
ing to part-of-speech, while locally one sees mostly semantic similarity explaining the
neighborhood structure.
13.1 Manifold Learning via Regularized Auto-Encoders
Auto-encoders were introduced in Section 10.1. What is their connection to man-
ifold learning? This is what we discuss here.
We denote by f the encoder function, with h = f(x) the representation of x, and by g the
decoding function, with ˆx = g(h) the reconstruction of x, although in some cases the
encoder is a conditional distribution q(h|x) and the decoder is a conditional distribution
P (x|h).
What all auto-encoders have in common, when they are prevented from simply
learning the identity function for all possible input x, is that training them involves a
compromise between two “forces”:
1. Learning a representation h of training examples x such that x can be approxi-
mately recovered from h through a decoder. Note that this need not be true for
any x, only for those that are probable under the data generating distribution.
2. Some constraint or regularization is imposed, either on the code h or on the com-
position of the encoder/decoder, so as to prevent the auto-encoder from achieving
perfect reconstruction everywhere. We can think of these constraints or regulariza-
tion as a preference for solutions in which the representation is as constant as pos-
sible in as many directions as possible. In the case of the bottleneck auto-encoder
a fixed number of representation dimensions is allowed, that is smaller than the
dimension of x. In the case of sparse auto-encoders (Section 13.3) the represen-
tation elements h_i are pushed towards 0. In the case of denoising auto-encoders
(Section 13.4), the encoder/decoder function is encouraged to be contractive (have
small derivatives). In the case of the contractive auto-encoder (Section 13.5), the
encoder function alone is encouraged to be contractive, while the decoder function
is tied (by symmetric weights) to the encoder function. In the case of the vari-
ational auto-encoder (Section 17.7.1), a prior log P (h) is imposed on h to make
its distribution concentrate as much as possible. Note how in the limit, for all of
these cases, the regularization prefers constant representations.
How can these two forces (reconstruction error on one hand, and lack of representa-
tional power on the other hand) be reconciled? The solution of the optimization problem
is that only the variations that are needed to distinguish training examples need to be
represented. If the data generating distribution concentrates near a low-dimensional
manifold, this yields representations that implicitly capture a local coordinate for this
manifold: only the variations tangent to the manifold around x need to correspond to
changes in h = f(x). Hence the encoder learns a mapping from the embedding space x
to a representation space, a mapping that is only sensitive to changes along the manifold
directions, but that is insensitive to changes orthogonal to the manifold. This idea is
illustrated in Figure 13.9.
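As a small numerical illustration of this solution (using the closed-form optimal linear auto-encoder, i.e., the principal subspace, rather than a trained network; the data and the particular directions are made up for the example), the sketch below checks that the learned 1-D code changes when the input moves along the manifold but barely changes when it moves orthogonally to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data concentrated near a 1-D manifold (a line) embedded in 3-D.
t = rng.uniform(-1.0, 1.0, size=(1000, 1))
direction = np.array([[1.0, 2.0, -1.0]]) / np.sqrt(6.0)
X = t @ direction + 0.01 * rng.normal(size=(1000, 3))

# Optimal linear auto-encoder with a 1-D code and squared error:
# encoder f(x) = V^T x, decoder g(h) = V h, with V the top principal direction.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:1].T                                          # (3, 1): learned code direction
f = lambda x: V.T @ x                                 # encoder

x = X[0]
delta = 1e-3
tangent = direction[0]                                # small move along the manifold
orthogonal = np.array([2.0, -1.0, 0.0]) / np.sqrt(5)  # small move off the manifold
print("code change, tangent move:   ", np.abs(f(x + delta * tangent) - f(x)))
print("code change, orthogonal move:", np.abs(f(x + delta * orthogonal) - f(x)))
```

The first printed change is on the order of the step size, while the second is nearly zero: the code responds only to on-manifold variation.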
Figure 13.9: A regularized auto-encoder or a bottleneck auto-encoder has to reconcile
two forces: reconstruction error (which forces it to keep enough information to distin-
guish training examples from each other), and a regularizer or constraint that aims at
reducing its representational ability, to make it as insensitive as possible to the input
in as many directions as possible. The solution is for the learned representation to
be sensitive to changes along the manifold (green arrow going to the right, tangent to
the manifold) but invariant to changes orthogonal to the manifold (blue arrow going
down). This yields a contraction of the representation in the directions orthogonal to
the manifold.
13.2 Probabilistic Interpretation of Reconstruction Error
as Log-Likelihood
Although traditional auto-encoders (like traditional neural networks) were introduced
with an associated training loss, that training loss can generally be given a probabilistic
interpretation as a negative log-likelihood, just as for other neural networks.
We have already covered that in general for feedforward neural networks in Sec-
tion 6.2.2. Like prediction error for regular feedforward neural networks, reconstruction
error for auto-encoders does not have to be squared error. When we view the loss as a
negative log-likelihood, we interpret the reconstruction error as
L = − log P(x|h)
where h is the representation, generally obtained through an encoder taking x as input.
Figure 13.10: The computational graph of an auto-encoder, which is trained to maximize
the probability assigned by the decoder g to the data point x, given the output of the
encoder h = f(x). The training objective is thus L = − log P(x|g(f(x))), which ends up
being the squared reconstruction error if we choose a Gaussian reconstruction distribution
with mean g(f(x)), and the cross-entropy if we choose a factorized Bernoulli reconstruction
distribution with means g(f(x)).
An advantage of this view is that it immediately tells us what kind of loss function
one should use depending on the nature of the input. If the input is real-valued and
unbounded, then squared error is a reasonable choice of reconstruction error, and cor-
responds to P (x|h) being Normal. If the input is a vector of bits, then cross-entropy
is a more reasonable choice, and corresponds to P(x|h) = ∏_i P(x_i|h) with each x_i|h being
Bernoulli-distributed. We then view the decoder g(h) as computing the parameters of
the reconstruction distribution, i.e., P(x|h) = P(x|g(h)).
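As a minimal illustration of this correspondence (the function names are ours, and additive constants of the log-likelihoods are dropped), the Gaussian choice reduces to squared error and the factorized Bernoulli choice reduces to cross-entropy:

```python
import numpy as np

def gaussian_recon_nll(x, x_hat):
    """- log N(x; mean=x_hat, covariance=I), dropping the additive constant.
    Equals 0.5 * squared reconstruction error."""
    return 0.5 * np.sum((x - x_hat) ** 2)

def bernoulli_recon_nll(x, p_hat, eps=1e-12):
    """- log prod_i Bernoulli(x_i; p_hat_i): the cross-entropy between the
    binary target x and the decoder's means p_hat in (0, 1)."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.sum(x * np.log(p_hat) + (1.0 - x) * np.log(1.0 - p_hat))

# Real-valued input -> squared error; binary input -> cross-entropy.
x_real, x_real_hat = np.array([0.3, -1.2]), np.array([0.1, -1.0])
x_bin, x_bin_hat = np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.7])
print(gaussian_recon_nll(x_real, x_real_hat))
print(bernoulli_recon_nll(x_bin, x_bin_hat))
```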
Another advantage of this view is that we can think about the training of the decoder
as estimating the conditional distribution P (x|h), which comes in handy in the probabilistic
interpretation of denoising auto-encoders, allowing us to talk about the distribution P (x)
implicitly represented by the auto-encoder (see Section 13.4 below and Section 17.8
for more details). In the same spirit, we can rethink the notion of encoder from a
simple function to a conditional distribution Q(h|x), with a special case being when
Q(h|x) is a Dirac at some particular value. Equivalently, thinking about the encoder
as a distribution corresponds to injecting noise inside the auto-encoder. This view is
developed further in Sections 17.7.1 and 17.9.
13.3 Sparse Representations
Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e., one
whose elements are often either zero or close to zero. Sparse coding was introduced
in Section 10.2.4 as a linear factors model in which the prior on the representation
encourages values at or near 0. In Section 13.3.1, we see how ordinary auto-encoders
can be prevented from learning a useless identity transformation by using a sparsity
penalty rather than a bottleneck. The main difference between a sparse auto-encoder
and sparse coding is that sparse coding has no explicit parametric encoder, whereas
sparse auto-encoders have one. The “encoder” of sparse coding is the algorithm that
performs the approximate inference, i.e., looks for
h(x) = arg max_h log P(h|x) = arg min_h ||x − (b + W h)||² − log P(h)    (13.1)
where P(h) is a “sparse” prior that puts more probability mass around h = 0, such as
the Laplacian prior, with factorized marginals
P(h_i) = (λ/2) e^(−λ|h_i|)    (13.2)
or the Student-t prior, with factorized marginals
P(h_i) ∝ 1 / (1 + h_i²/ν)^((ν+1)/2).    (13.3)
The advantages of such a non-parametric encoder and the sparse coding approach over
sparse auto-encoders are that
1. it can in principle minimize the combination of reconstruction error and log-prior
better than any parametric encoder,
2. it performs what is called explaining away (see Figure 9.8), i.e., it allows the model to “choose”
some “explanations” (hidden factors) and inhibit the others.
The disadvantages are that
1. the computing time for encoding a given input x, i.e., performing inference (comput-
ing the representation h that goes with the given x), can be substantially larger than
with a parametric encoder (because an optimization must be performed), and
2. the resulting encoder function can be highly non-smooth and non-linear (with two
nearby x’s being associated with very different h’s), potentially making it more
difficult for the downstream layers to properly generalize.
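For concreteness, the sketch below performs the inference of Eq. 13.1 for the Laplace (L1) prior using a standard proximal-gradient (ISTA-style) loop. This is one common way to carry out the minimization, not a prescription from this chapter, and the dictionary, step size, and iteration count are illustrative.

```python
import numpy as np

def sparse_code(x, W, b, lam=0.1, n_steps=200):
    """ISTA-style minimization of ||x - (b + W h)||^2 + lam * ||h||_1 over h."""
    h = np.zeros(W.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(W, ord=2) ** 2)        # safe step size
    for _ in range(n_steps):
        residual = (b + W @ h) - x
        h = h - step * 2.0 * (W.T @ residual)                  # gradient step
        h = np.sign(h) * np.maximum(np.abs(h) - step * lam, 0.0)  # soft-threshold
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))                 # over-complete dictionary: 50 factors
b = np.zeros(20)
x = W[:, :3] @ np.array([1.0, -0.5, 2.0]) + b  # input explained by 3 factors
h_star = sparse_code(x, W, b)
print("non-zero code elements:", np.count_nonzero(h_star))
```

Because the representation is obtained by running this optimization for every input, encoding is much slower than a single pass through a parametric encoder, which is the first disadvantage listed above.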
In Section 13.3.2, we describe PSD (Predictive Sparse Decomposition), which com-
bines a non-parametric encoder (as in sparse coding, with the representation obtained
via an optimization) and a parametric encoder (like in the sparse auto-encoder). Sec-
tion 13.4 introduces the Denoising Auto-Encoder (DAE), which puts pressure on the
representation by requiring it to extract information about the underlying distribution
and where it concentrates, so as to be able to denoise a corrupted input. Section 13.5
describes the Contractive Auto-Encoder (CAE), which optimizes an explicit regulariza-
tion penalty that aims at making the representation as insensitive as possible to the
input, while keeping the information sufficient to reconstruct the training examples.
13.3.1 Sparse Auto-Encoders
A sparse auto-encoder is simply an auto-encoder whose training criterion involves a
sparsity penalty Ω(h) in addition to the reconstruction error:
L = − log P(x|g(h)) + Ω(h)    (13.4)
where g(h) is the decoder output and typically we have h = f(x), the encoder output.
We can think of that penalty Ω(h) simply as a regularizer or as a log-prior on the
representations h. For example, the sparsity penalty corresponding to the Laplace prior
is the absolute value sparsity penalty (Glorot et al., 2011a):
Ω(h) = λ Σ_i |h_i|
− log P(h) = Σ_i (λ|h_i| − log (λ/2)) = const + Ω(h)    (13.5)
where the constant term depends only on λ and not on h (we typically ignore it in the
training criterion because we consider λ as a hyper-parameter rather than a parameter).
Similarly, the sparsity penalty corresponding to the Student-t prior (Olshausen and
Field, 1997) is
Ω(h) = Σ_i ((ν + 1)/2) log(1 + h_i²/ν)    (13.6)
where ν is considered to be a hyper-parameter.
TODO: reference to early LeCun work with sparse coding and sparse auto-encoders
However, the sparsity penalty does not need to have a probabilistic interpretation.
For example, Goodfellow et al. (2009) successfully used the following sparsity penalty,
which does not try to bring h_i all the way down to 0, but only towards some low target
value such as ρ = 0.05:
Ω(h) = − Σ_i [ρ log h_i + (1 − ρ) log(1 − h_i)]    (13.7)
where 0 < h_i < 1, usually with h_i = sigmoid(a_i). This is just the cross-entropy between
the Bernoulli distribution with probability p = h_i and the target Bernoulli distribution
with probability p = ρ.
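The three penalties above are straightforward to compute; here is a minimal numpy sketch, with λ, ν, and ρ set to illustrative values:

```python
import numpy as np

def l1_penalty(h, lam=0.1):
    """Absolute-value penalty, Eq. 13.5 (Laplace prior up to a constant)."""
    return lam * np.sum(np.abs(h))

def student_t_penalty(h, nu=3.0):
    """Student-t penalty, Eq. 13.6."""
    return np.sum((nu + 1.0) / 2.0 * np.log1p(h ** 2 / nu))

def target_sparsity_penalty(h, rho=0.05, eps=1e-12):
    """Cross-entropy penalty, Eq. 13.7: pushes each h_i in (0, 1) towards rho."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.sum(rho * np.log(h) + (1.0 - rho) * np.log(1.0 - h))

h = np.array([0.02, 0.9, 0.05, 0.01])   # e.g., sigmoid outputs of an encoder
print(l1_penalty(h), student_t_penalty(h), target_sparsity_penalty(h))
```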
One way to achieve actual zeros in h for sparse (and denoising) auto-encoders was
introduced in Glorot et al. (2011b). The idea is to use a half-rectifier (a.k.a. simply a
“rectifier”) or ReLU (Rectified Linear Unit) as the output non-linearity of the
encoder. With a prior that actually pushes the representations to zero (like the ab-
solute value penalty), one can thus indirectly control the average number of zeros in
the representation. The same strategy was first successfully used for deep feedforward
networks in Glorot et al. (2011a), achieving for the first time the ability to train fairly
deep supervised networks without the need for unsupervised pre-training.
Interestingly, the “regularizer” used in sparse auto-encoders does not conform to the
classical interpretation of regularizers as priors on the parameters. That interpretation
of the regularizer comes from the MAP (Maximum A Posteriori) point estimation (see
Section 5.5.1) of parameters associated with the Bayesian view of parameters as random
variables and the joint distribution of data x and parameters θ (see Section 5.7):
arg max_θ P(θ|x) = arg max_θ [log P(x|θ) + log P(θ)]
where the first term is the usual data log-likelihood term and the second term, the
log-prior over parameters, incorporates the preference over particular values of θ.
Here, instead, the regularizer corresponds to a log-prior over latent variables, or
equivalently in the case of sparse auto-encoders, predictive sparse decomposition and
contractive auto-encoders, as a preference over a learned function of the data. This
makes such a regularizer data-dependent, unlike the classical parameter log-prior. Specif-
ically, in the case of the sparse auto-encoder, it says that we prefer an encoder whose
output produces values closer to 0. Indirectly (when we marginalize over the training
distribution), this is also indicating a preference over parameters, of course.
13.3.2 Predictive Sparse Decomposition
Predictive Sparse Decomposition (PSD)...
13.4 Denoising Auto-Encoders
The Denoising Auto-Encoder (DAE) was first proposed as a means of forcing an auto-
encoder to learn to capture the data distribution without an explicit constraint on either
the dimension or the sparsity of the learned representation. It was motivated by the
idea that in order to fully capture a complex distribution, an auto-encoder needs to have
at least as many hidden units as needed by the complexity of that distribution. Hence
its dimensionality should not be restricted to the input dimension.
The principle of the denoising auto-encoder is deceptively simple, as illustrated in
Figure 13.11: the encoder is fed a corrupted version of the input, but the decoder
tries to reconstruct the clean, uncorrupted input.
Figure 13.11: The computational graph of a denoising auto-encoder, which is trained to
reconstruct the clean data point x from its corrupted version ˜x, i.e., to minimize the loss
L = − log P(x|g(f(˜x))), where ˜x is a corrupted version of the data example x, obtained
through a given corruption process C(˜x|x).
Mathematically, and following the notations used in this chapter, this can be for-
malized as follows. We introduce a corruption process C(˜X|X), which represents a
conditional distribution over corrupted samples ˜X, given a data sample X. The auto-
encoder then learns a reconstruction distribution P(X|˜X) estimated from training pairs
(x, ˜x), as follows:
1. Sample a training example X = x from the data generating distribution (the
training set).
2. Sample a corrupted version ˜X = ˜x from the conditional distribution C(˜X|X = x).
3. Use (x, ˜x) as a training example for estimating the auto-encoder reconstruction
distribution P(x|˜x) = P (x|g(h)) with h the output of encoder f(˜x) and g(h) the
output of the decoder.
Typically we can simply perform gradient-based approximate minimization (such as
minibatch gradient descent) on the negative log-likelihood − log P(x|h), i.e., the de-
noising reconstruction error, using back-propagation to compute gradients, just like for
regular feedforward neural networks (the only difference being the corruption of the
input and the choice of target output).
We can view this training objective as performing stochastic gradient descent on the
denoising reconstruction error, but where the “noise” now has two sources:
1. the choice of training sample x from the data set, and
2. the random corruption applied to x to obtain ˜x.
We can therefore consider that the DAE is performing stochastic gradient descent
on the following expectation:
E_{x∼Q(X)} E_{˜x∼C(˜x|x)} [− log P(x|g(f(˜x)))]
where Q(X) is the training distribution.
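This procedure, with its two sources of noise, can be sketched as follows for the common special case of Gaussian corruption and squared (Gaussian log-likelihood) reconstruction error. The architecture, sizes, and learning rate are illustrative, and the gradients are written out by hand for a single hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, sigma, lr = 10, 32, 0.1, 0.01
W_enc = 0.1 * rng.normal(size=(n_in, n_hidden)); b_enc = np.zeros(n_hidden)
W_dec = 0.1 * rng.normal(size=(n_hidden, n_in)); b_dec = np.zeros(n_in)

# Stand-in training set concentrated near a 2-D linear manifold in 10-D.
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, n_in)) + 0.05 * rng.normal(size=(200, n_in))

def dae_step(x):
    """One stochastic step of denoising auto-encoder training (squared error)."""
    x_tilde = x + sigma * rng.normal(size=x.shape)   # corruption C(x_tilde | x)
    h = np.tanh(x_tilde @ W_enc + b_enc)             # h = f(x_tilde)
    x_hat = h @ W_dec + b_dec                        # reconstruction g(h)
    err = x_hat - x                                  # the target is the CLEAN x
    # Hand-written gradients of 0.5 * ||x_hat - x||^2.
    grad_W_dec, grad_b_dec = np.outer(h, err), err
    grad_a = (W_dec @ err) * (1.0 - h ** 2)          # back-prop through tanh
    grad_W_enc, grad_b_enc = np.outer(x_tilde, grad_a), grad_a
    for p, g in [(W_enc, grad_W_enc), (b_enc, grad_b_enc),
                 (W_dec, grad_W_dec), (b_dec, grad_b_dec)]:
        p -= lr * g
    return 0.5 * np.sum(err ** 2)

for step in range(5000):
    x = X[rng.integers(len(X))]    # noise source 1: which training example
    loss = dae_step(x)             # noise source 2: the random corruption
print("final denoising reconstruction error:", loss)
```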
Figure 13.12: A denoising auto-encoder is trained to reconstruct the clean data point x
from its corrupted version ˜x. In the figure, we illustrate the corruption process C(˜x|x)
(grey circle of equiprobable corruptions) acting on examples x (red crosses) lying near
a low-dimensional manifold near which probability concentrates. When the denoising
auto-encoder is trained to minimize the average squared error ||g(f(˜x)) − x||², the
reconstruction estimates E[x|˜x], which points approximately orthogonally towards the
manifold. It thus learns a vector field g(f(x)) − x (the green arrows), and it turns out
that this vector field estimates the gradient field ∂ log Q(x)/∂x, where Q is the unknown
data generating distribution.
13.4.1 Learning a Vector Field that Estimates a Gradient Field
As illustrated in Figure 13.12, a very important property of DAEs is that their training
criterion makes the auto-encoder learn a vector field g(f(x)) − x that estimates the
gradient field (or score) ∂ log Q(x)/∂x. This has been proven (Alain and Bengio, 2012,
2013) in the case where the corruption and the reconstruction distributions are Gaussian
(and of course x is continuous-valued), i.e., with the squared denoising error
||g(f(˜x)) − x||²
and corruption
C(˜X = ˜x|x) = N(˜x; µ = x, Σ = σ²I)
with noise variance σ².
Figure 13.13: Vector field learned by a denoising auto-encoder around a 1-D curved
manifold near which the data (orange circles) concentrates in a 2-D space. Each arrow
is proportional to the reconstruction minus input vector of the auto-encoder and points
towards higher probability according to the implicitly estimated probability distribution.
Note that the vector field has zeros at both peaks of the estimated density function (on
the data manifold) and at troughs (local minima) of that density function, e.g., on the
curve that separates different arms of the spiral or in the middle of it.
More precisely, the main theorem states that (g(f(x)) − x)/σ² is a consistent estimator of
the score ∂ log Q(x)/∂x:
(g(f(x)) − x)/σ² → ∂ log Q(x)/∂x,
so long as f and g have sufficient capacity to represent the true score (and assuming
that the expected training criterion can be minimized, as usual when proving consistency
associated with a training objective).
Although it was intuitively appealing that in order to denoise correctly one must
capture the training distribution, this result makes it mathematically very clear. Fig-
ure 13.13 (see details of experiment in Alain and Bengio (2013)) illustrates this. Note
how the norm of reconstruction error (i.e. the norm of the vectors shown in the figure)
is related to but different from the energy (unnormalized log-density) associated with
the estimated model. The energy should be low only where the probability is high. The
reconstruction error (norm of the estimated score vector) is also low where probability
is high, but it can also be low at maxima of energy (minima of probability).
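This consistency result can be checked numerically in a case where the score is known in closed form. The sketch below (not the experiment of Alain and Bengio (2013)) uses 1-D data from Q = N(0, 1), whose score is −x, and fits the optimal denoising function by least squares (a linear reconstruction suffices for Gaussian data) instead of training a neural auto-encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                                    # corruption noise level
x = rng.normal(size=100_000)                   # data from Q = N(0, 1); score is -x
x_tilde = x + sigma * rng.normal(size=x.size)  # Gaussian corruption

# Least-squares linear "denoising auto-encoder": r(x_tilde) = w * x_tilde + c.
A = np.stack([x_tilde, np.ones_like(x_tilde)], axis=1)
w, c = np.linalg.lstsq(A, x, rcond=None)[0]

x_test = np.linspace(-2.0, 2.0, 5)
score_estimate = (w * x_test + c - x_test) / sigma ** 2   # (g(f(x)) - x) / sigma^2
true_score = -x_test                                      # d log Q(x) / dx for N(0, 1)
print(np.round(score_estimate, 3))
print(np.round(true_score, 3))
```

For small σ, the estimated and true scores agree closely, as the theorem predicts.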
13.4.2 Turning the Gradient Field into a Generative Model
Since each application of the encoder-decoder pair brings us towards a more probable
configuration, we can use this to actually sample from the estimated distribution. Most
Markov chain Monte Carlo (MCMC) algorithms have two elements:
1. move from lower probability configurations towards higher probability configura-
tions, and
2. inject randomness so that the chain moves around (and does not stay stuck at
some peak of probability, or mode) and has a chance to visit the whole space, with
visiting probabilities that match the underlying model.
So conceptually all one needs to do is perform encode-decode operations as well as inject
noise, as hinted at in (Mesnil et al., 2012; Rifai et al., 2012). But how much and how?
Figure 13.14: Each step of the Markov chain associated with a trained denoising auto-
encoder, which generates samples from the probabilistic model implicitly trained by
the denoising reconstruction criterion. Each step consists in (a) injecting corruption C
into state x, yielding ˜x, (b) encoding it with f, (c) decoding the result with g, and (d) from
this, sampling a new state from the reconstruction distribution P(X|ω = g(f(˜x))). In
the typical squared reconstruction error case, g(h) = ˆx estimates E[x|˜x], corruption
consists in adding Gaussian noise, and sampling from P(X|ω) consists in adding another
Gaussian noise to the reconstruction ˆx. The latter noise level should correspond to the
mean squared error of reconstructions, whereas the injected noise is a hyper-parameter
that controls the mixing speed as well as the extent to which the estimator smoothes
the empirical distribution (Vincent, 2011). In the figure, only the C and P conditionals
are stochastic steps (f and g are deterministic computations), although noise can also
be injected inside the auto-encoder, as in generative stochastic networks (Bengio et al.,
2014).
The answer was given by Bengio et al. (2013) (Theorem 1), which states that
each step of the Markov chain that samples from the estimated distribution consists
of the following sub-steps, illustrated in Figure 13.14:
1. Starting from the previous state x, inject corruption noise, sampling ˜x from C(˜x|x).
2. Encode ˜x into h = f(˜x).
3. Decode h to obtain the parameters ω = g(h) of P (X|ω = g(h)).
4. Sample the next state x from P (X|ω = g(h)).
The theorem states that if the auto-encoder P(X|˜X) forms a consistent estimator of the
corresponding true conditional distribution, then the stationary distribution of the above
Markov chain forms a consistent estimator (albeit an implicit one) of the data generating
distribution of X.
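These four sub-steps are easy to write down for the Gaussian corruption / Gaussian reconstruction case discussed in the caption of Figure 13.14. In the sketch below, the `encode`/`decode` pair is a hand-coded stand-in for a trained DAE (it pulls states towards the unit circle), and both noise levels are illustrative hyper-parameters:

```python
import numpy as np

def dae_markov_chain(encode, decode, x0, n_steps, sigma_corrupt, sigma_recon,
                     rng=np.random.default_rng(0)):
    """Generate samples by iterating the four sub-steps of the DAE Markov chain."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_tilde = x + sigma_corrupt * rng.normal(size=x.shape)  # 1. corrupt: C(x_tilde | x)
        h = encode(x_tilde)                                     # 2. encode
        omega = decode(h)                                       # 3. decode -> parameters of P(X | omega)
        x = omega + sigma_recon * rng.normal(size=x.shape)      # 4. sample the next state
        samples.append(x.copy())
    return np.array(samples)

# Toy usage: the "denoising" of this hand-coded pair projects states onto the
# unit circle, a stand-in for the manifold a trained DAE would have learned.
encode = lambda x: x / (np.linalg.norm(x) + 1e-8)
decode = lambda h: h
chain = dae_markov_chain(encode, decode, x0=[2.0, 0.0], n_steps=1000,
                         sigma_corrupt=0.3, sigma_recon=0.05)
print("mean distance to the circle:", np.abs(np.linalg.norm(chain, axis=1) - 1).mean())
```

The chain wanders along the circle (thanks to the injected corruption) while the decoding step keeps pulling it back near the manifold.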
Figure 13.15: Illustration of one step of the sampling Markov chain associated with a
denoising auto-encoder (see also Figure 13.14). In the figure, the data (black circles) are
sitting near a low-dimensional manifold (a spiral, here), and the two stochastic steps of
the Markov chain are first to corrupt x (clean car image, green circle) into ˜x (noisy car
image, blue circle) via C(˜x|x) (here an isotropic Gaussian noise in green), and then to
sample a new x via the estimated denoising P(x|˜x). Note how there are many possible
x which could have given rise to ˜x, and these all lie on the manifold in the neighborhood
of ˜x, hence the flattened shape of P (x|˜x) (in blue). Modified from a figure first created
and graciously authorized by Jason Yosinski.
Figure 13.15 illustrates the sampling process of the DAE in a way that complements
Figure 13.14, with a specific imagined example. For a more elaborate discussion of the
probabilistic nature of denoising auto-encoders, and their generalization (Bengio et al.,
2014), Generative Stochastic Networks (GSNs), see Section 17.9. In particular, the noise
does not have to be injected only in the input, and it could be injected anywhere along
the chain. GSNs also generalize DAEs by allowing the state of the Markov chain to be
extended beyond the visible variable x, to include also some latent variable h. Finally,
Section 17.9 discusses training strategies for DAEs that are aimed at making them better
generative models and not just good feature learners.
See also Section 15.5 for the connection to score matching, an induction principle
that is an alternative to the maximum likelihood principle.
13.5 Contractive Auto-Encoders
The Contractive Auto-Encoder (CAE) introduces an explicit regularizer on the code
h = f (x), encouraging the derivatives of f to be as small as possible. If it weren’t for
the opposing force of reconstruction error which attempts to make the code h keep all
the information necessary to reconstruct training examples, the CAE penalty would yield
a code h that is constant and does not depend on the input x. The compromise between
these two forces yields an auto-encoder whose derivatives ∂h/∂x are tiny in most directions,
except those that are needed to reconstruct training examples, i.e., the directions that
are tangent to the manifold near which data concentrate.
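For a one-layer sigmoid encoder h = sigmoid(Wx + b), the Jacobian ∂h/∂x has the closed form diag(h ⊙ (1 − h)) W, so one standard form of the contractive penalty (the squared Frobenius norm of this Jacobian, added to the reconstruction error with some weight λ) can be computed directly. A minimal sketch with illustrative weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for the encoder h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    jacobian = (h * (1.0 - h))[:, None] * W       # dh_i/dx_j = h_i (1 - h_i) W_ij
    return np.sum(jacobian ** 2)

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 3))                 # 4 hidden units, 3 inputs
b = np.zeros(4)
x = rng.normal(size=3)
# A full CAE loss would be: reconstruction error + lambda * contractive_penalty(x, W, b)
print("contractive penalty:", contractive_penalty(x, W, b))
```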