Chapter 13
The Manifold Perspective on
Auto-Encoders
Manifold learning is an approach to machine learning that is capitalizing on the manifold
hypothesis (Cayton, 2005; Narayanan and Mitter, 2010): the data generating distribution
is assumed to concentrate near regions of low dimensionality. The notion of manifold in
mathematics refers to continuous spaces that locally resemble Euclidean space, and the
term we should be using is really submanifold, which corresponds to a subset which has
a manifold structure. The use of the term manifold in machine learning is much looser
than its use in mathematics, though:
• the data may not be strictly on the manifold, but only near it;
• the dimensionality may not be the same everywhere;
• the notion actually referred to in machine learning naturally extends to discrete spaces.
Indeed, although the very notions of a manifold or submanifold are defined for con-
tinuous spaces, the more general notion of probability concentration applies equally well
to discrete data. It is a kind of informal prior assumption about the data generating dis-
tribution that seems particularly well-suited for AI tasks such as those involving images,
video, speech, music, text, etc. In all of these cases the natural data has the property
that randomly choosing configurations of the observed variables according to a factored
distribution (e.g. uniformly) is very unlikely to generate the kind of observations we
want to model. What is the probability of generating a natural looking image by choos-
ing pixel intensities independently of each other? What is the probability of generating
a meaningful natural language paragraph by independently choosing each character in
a string? Doing a thought experiment should give a clear answer: an exponentially tiny
probability. That is because the probability distribution of interest concentrates in a
tiny volume of the total space of configurations. That means that, to a first approximation,
the problem of characterizing the data generating distribution can be reduced to a binary
classification problem: is this configuration probable or not? Is this a grammatically and
Figure 13.1: Top: data sampled from a distribution in a high-dimensional space (only 2
dimensions shown for illustration) that is actually concentrated near a one-dimensional
manifold, which here is like a twisted string. Bottom: the underlying manifold that the
learner should infer.
semantically plausible sentence in English? Is this a natural-looking image? Answering
these questions tells us much more about the nature of natural language or text than
the additional information one would have by being able to assign a precise probability
to each possible sequence of characters or set of pixels. Hence, simply characterizing
where probability concentrates is of fundamental importance, and this is what manifold
learning algorithms attempt to do. Because it is a where question, it is more about
geometry than about probability distributions, although we find both views useful when
designing learning algorithms for AI tasks.
Figure 13.2: A two-dimensional manifold near which training examples are concentrated,
along with a tangent plane and its associated tangent directions, forming a basis that
specifies the directions of small moves one can make to stay on the manifold.
Figure 13.3: Illustration of tangent vectors of the manifold estimated by a contractive
auto-encoder (CAE), at some input point (top left, image of a zero). Each image on the
top right corresponds to a tangent vector. They are obtained by picking the dominant
singular vectors (with largest singular values) of the Jacobian ∂f(x)/∂x (see Section 13.5).
Taking the original image plus a small quantity of any of these tangent vectors yields
another plausible image, as illustrated in the bottom. The leading tangent vectors seem
to correspond to small deformations, such as translation, or shifting ink around locally
in the original image.
In addition to the property of probability concentration, there is another one that
characterizes the manifold hypothesis: when a configuration is probable it is generally
surrounded (at least in some directions) by other probable configurations. If a configu-
ration of pixels looks like a natural image, then there are tiny changes one can make
to the image (like translating everything by 0.1 pixel to the left) which yield another
natural-looking image. The number of independent ways (each characterized by a num-
ber indicating how much or whether we do it) by which a probable configuration can be
locally transformed into another probable configuration indicates the local dimension of
the manifold. Whereas maximum likelihood procedures tend to concentrate probability
mass on the training examples (which can each become a local maximum of probability
when the model overfits), the manifold hypothesis suggests that good solutions instead
concentrate probability along ridges of high probability (or their high-dimensional gener-
alization) that connect nearby examples to each other. This is illustrated in Figure 13.1.
Figure 13.4: Non-parametric manifold learning procedures build a nearest neighbor
graph whose nodes are training examples and arcs connect nearest neighbors. Various
procedures can thus obtain the tangent plane associated with a neighborhood of the
graph, and a coordinate system that associates each training example with a real-valued
vector position, or embedding. It is possible to generalize such a representation to new
examples by a form of interpolation. So long as the number of examples is large enough
to cover the curvature and twists of the manifold, these approaches work well. Images
from the QMUL Multiview Face Dataset (Gong et al., 2000).
What is most commonly learned to characterize a manifold is a representation of the
data points on (or near, i.e. projected on) the manifold. Such a representation for a
particular example is also called its embedding. It is typically given by a low-dimensional
vector, with fewer dimensions than the “ambient” space of which the manifold is a low-
dimensional subset. Some algorithms (non-parametric manifold learning algorithms,
discussed below) directly learn an embedding for each training example, while others
learn a more general mapping, sometimes called an encoder, or representation function,
that maps any point in the ambient space (the input space) to its embedding.
Another important characterization of a manifold is the set of its tangent planes.
At a point x on a d-dimensional manifold, the tangent plane is given by d basis vectors
that span the local directions of variation allowed on the manifold. As illustrated in
Figure 13.2, these local directions specify how one can change x infinitesimally while
staying on the manifold.
Figure 13.5: If the tangent plane at each location is known, then they can be tiled to
form a global coordinate system or a density function. In the figure, each local patch
can be thought of as a local Euclidean coordinate system or as a local flat Gaussian
(or “pancake”). The average of all these Gaussians would provide an estimated density
function, as in the Manifold Parzen algorithm (Vincent and Bengio, 2003) or its non-local
neural-net based variant (Bengio et al., 2006b).
Manifold learning has mostly focused on unsupervised learning procedures that at-
tempt to capture these manifolds. Most of the initial machine learning research on
learning non-linear manifolds has focused on non-parametric methods based on the
nearest-neighbor graph. This graph has one node per training example and edges con-
necting near neighbors. Basically, these methods (Schölkopf et al., 1998; Roweis and
Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and
Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten
and Hinton, 2008) associate each of these nodes with a tangent plane that spans the
directions of variations associated with the difference vectors between the example and
its neighbors, as illustrated in Figure 13.4.
A global coordinate system can then be obtained through an optimization or solving
a linear system. Figure 13.5 illustrates how a manifold can be tiled by a large number
of locally linear Gaussian-like patches (or “pancakes”, because the Gaussians are flat in
the tangent directions).
Figure 13.6: When the data are images, the tangent vectors can also be visualized like
images. Here we show the tangent vector associated with translation: it corresponds
to the difference between an image and a slightly translated version. This basically
extracts edges, with the sign (shown as red or green color) indicating whether ink is
being added or subtracted when moving to the right. The figure illustrates the fact that
the tangent vectors along the manifolds associated with translation, rotation and such
apparently benign transformations of images, can be highly curved: the tangent vector
at a position x is very different from the tangent vector at a nearby position (slightly
translated image). Note how the two tangent vectors shown have almost no overlap,
i.e., their dot product is nearly 0, and they are as far as two such images can be. If
the tangent vectors of a manifold change quickly, it means that the manifold is highly
curved. This is going to make it very difficult to capture such manifolds (with a huge
number of twists, ups and downs) with local non-parametric methods such as graph-
based non-parametric manifold learning. This point was made in Bengio and Monperrus
(2005).
However, there is a fundamental difficulty with such non-parametric neighborhood-
based approaches to manifold learning, raised in Bengio and Monperrus (2005): if the
manifolds are not very smooth (they have many ups and downs and twists), one may
need a very large number of training examples to cover each one of these variations,
with no chance to generalize to unseen variations. Indeed, these methods can only
generalize the shape of the manifold by interpolating between neighboring examples.
Unfortunately, the manifolds of interest in AI have many ups and downs and twists and
strong curvature, as illustrated in Figure 13.6. This motivates the use of distributed
representations and deep learning for capturing manifold structure, which is the subject
of this chapter.
Figure 13.7: Training examples from a face dataset, the QMUL Multiview Face
Dataset (Gong et al., 2000), for which the subjects were asked to move in such a
way as to cover the two-dimensional manifold corresponding to two angles of rotation.
We would like learning algorithms to be able to discover and disentangle such factors.
Figure 13.8 illustrates such a feat.
The hope of many manifold learning algorithms, including those based on deep
learning and auto-encoders, is that one learns an explicit or implicit coordinate system
for the leading factors of variation that explain most of the structure in the unknown
data generating distribution. An example of explicit coordinate system is one where the
dimensions of the representation (e.g., the outputs of the encoder, i.e., of the hidden
units that compute the “code” associated with the input) are directly the coordinates
that map the unknown manifold. Training examples of a face dataset in which the
images have been arranged visually on a 2-D manifold are shown in Figure 13.7, with
the images laid down so that each of the two axes corresponds to one of the two angles
of rotation of the face.
However, the objective is to discover such manifolds, and Figure 13.8 illustrates the
images generated by a variational auto-encoder (Kingma and Welling, 2014) when the
two-dimensional auto-encoder code (representation) is varied on the 2-D plane. Note
how the algorithm actually discovered two independent factors of variation: angle of
rotation and emotional expression.
Another kind of interesting illustration of manifold learning involves the discovery
of distributed representations for words. Neural language models were initiated with the
work of Bengio et al. (2001b, 2003), in which a neural network is trained to predict
the next word in a sequence of natural language text, given the previous words, and
where each word is represented by a real-valued vector, called an embedding or neural word
embedding.
Figure 13.8: Two-dimensional representation space (for easier visualization), i.e., a Eu-
clidean coordinate system for Frey faces (left) and MNIST digits (right), learned by a
variational auto-encoder (Kingma and Welling, 2014). Figures reproduced with permis-
sion from the authors. The images shown are not examples from the training set but
images x actually generated by the model P(x|h), simply by changing the 2-D “code”
h (each image corresponds to a different choice of “code” h on a 2-D uniform grid). On
the left, one dimension that has been discovered (horizontal) mostly corresponds to a
rotation of the face, while the other (vertical) corresponds to the emotional expression.
The decoder deterministically maps codes (here two numbers) to images. The encoder
maps images to codes (and adds noise, during training).
Figure 13.9 shows such neural word embeddings reduced to two dimensions (orig-
inally 50 or 100) using the t-SNE non-linear dimensionality reduction algorithm (van
der Maaten and Hinton, 2008). The figure zooms into different areas of the word-space
and illustrates that words that are semantically and syntactically close end up having
nearby embeddings.
Figure 13.9: Two-dimensional representation space (for easier visualization), of English
words, learned by a neural language model as in Bengio et al. (2001b, 2003), with t-SNE
for the non-linear dimensionality reduction from 100 to 2. Different regions are zoomed
to better see the details. At the global level one can identify big clusters correspond-
ing to part-of-speech, while locally one sees mostly semantic similarity explaining the
neighborhood structure.
13.1 Manifold Learning via Regularized Auto-Encoders
Auto-encoders have been introduced in Section 10.1. What is their connection to man-
ifold learning? This is what we discuss here.
We denote f the encoder function, with h = f(x) the representation of x, and g the
decoding function, with x̂ = g(h) the reconstruction of x, although in some cases the
encoder is a conditional distribution q(h|x) and the decoder is a conditional distribution
P (x|h).
What all auto-encoders have in common, when they are prevented from simply
learning the identity function for all possible input x, is that training them involves a
compromise between two “forces”:
1. Learning a representation h of training examples x such that x can be approxi-
mately recovered from h through a decoder. Note that this need not be true for
any x, only for those that are probable under the data generating distribution.
2. Some constraint or regularization is imposed, either on the code h or on the com-
position of the encoder/decoder, so as to make the transformed data somehow
simpler or to prevent the auto-encoder from achieving perfect reconstruction ev-
erywhere. We can think of these constraints or regularization as a preference for
solutions in which the representation is as simple as possible, e.g., factorized or as
constant as possible, in as many directions as possible. In the case of the bottle-
neck auto-encoder a fixed number of representation dimensions is allowed, that is
smaller than the dimension of x. In the case of sparse auto-encoders (Section 13.3)
the representation elements h_i are pushed towards 0. In the case of denoising
auto-encoders (Section 13.4), the encoder/decoder function is encouraged to be
contractive (have small derivatives). In the case of the contractive auto-encoder
(Section 13.5), the encoder function alone is encouraged to be contractive, while
the decoder function is tied (by symmetric weights) to the encoder function. In the
case of the variational auto-encoder (Section 17.7.1), a prior log P(h) is imposed
on h to make its distribution factorize and concentrate as much as possible. Note
how in the limit, for all of these cases, the regularization prefers representations
that are insensitive to the input.
Figure 13.10: A regularized auto-encoder or a bottleneck auto-encoder has to reconcile
two forces: reconstruction error (which forces it to keep enough information to distin-
guish training examples from each other), and a regularizer or constraint that aims at
reducing its representational ability, to make it as insensitive as possible to the input
in as many directions as possible. The solution is for the learned representation to
be sensitive to changes along the manifold (green arrow going to the right, tangent to
the manifold) but invariant to changes orthogonal to the manifold (blue arrow going
down). This yields a contraction of the representation in the directions orthogonal to
the manifold.
Clearly, the second type of force alone would not make any sense (nor would any
regularizer alone, in general). How can these two forces (reconstruction error on one hand,
and “simplicity” of the representation on the other hand) be reconciled? The solution
of the optimization problem is that only the variations that are needed to distinguish
training examples need to be represented. If the data generating distribution concentrates
near a low-dimensional manifold, this yields representations that implicitly capture a
local coordinate for this manifold: only the variations tangent to the manifold around
x need to correspond to changes in h = f (x). Hence the encoder learns a mapping
from the input space x to a representation space, a mapping that is only sensitive
to changes along the manifold directions, but that is insensitive to changes orthogonal
to the manifold. This idea is illustrated in Figure 13.10. A one-dimensional example
is illustrated in Figure 13.11, showing that by making the auto-encoder contractive
around the data points (and the reconstruction point towards the nearest data point),
we recover the manifold structure (of a set of 0-dimensional manifolds in a 1-dimensional
embedding space, in the figure).
Figure 13.11: If the auto-encoder learns to be contractive around the data points, with
the reconstruction pointing towards the nearest data points, it captures the manifold
structure of the data. This is a 1-dimensional version of Figure 13.10. The denoising
auto-encoder explicitly tries to make the derivative of the reconstruction function r(x)
small around the data points. The contractive auto-encoder does the same thing for the
encoder. Although the derivative of r(x) is asked to be small around the data points,
it can be large between the data points (e.g. in the regions between manifolds), and it
has to be large there so as to reconcile reconstruction error (r(x) ≈ x for data points
x) and contraction (small derivatives of r(x) near data points).
13.2 Probabilistic Interpretation of Reconstruction Error
as Log-Likelihood
Although traditional auto-encoders (like traditional neural networks) were introduced
with an associated training loss, that training loss can generally be given a probabilistic
interpretation as a conditional log-likelihood of the original input x, given the
representation h.
We have already covered negative log-likelihood as a loss function in general for
feedforward neural networks in Section 6.2.2. Like prediction error for regular feed-
forward neural networks, reconstruction error for auto-encoders does not have to be
squared error. When we view the loss as negative log-likelihood, we interpret the
reconstruction error as

L = −log P(x|h)

where h is the representation, which may generally be obtained through an encoder
taking x as input.
Figure 13.12: The computational graph of an auto-encoder, which is trained to maximize
the probability assigned by the decoder g to the data point x, given the output of the
encoder h = f(x). The training objective is thus L = −log P(x|g(f(x))), which ends up
being squared reconstruction error if we choose a Gaussian reconstruction distribution
with mean g(f(x)), and cross-entropy if we choose a factorized Bernoulli reconstruction
distribution with means g(f(x)).
An advantage of this view is that it immediately tells us what kind of loss function
one should use depending on the nature of the input. If the input is real-valued and
unbounded, then squared error is a reasonable choice of reconstruction error, and cor-
responds to P (x|h) being Normal. If the input is a vector of bits, then cross-entropy
is a more reasonable choice, and corresponds to P(x|h) = ∏_i P(x_i|h) with each x_i|h being
Bernoulli-distributed. We then view the decoder g(h) as computing the parameters of
the reconstruction distribution, i.e., P(x|h) = P (x|g(h)).
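As a concrete illustration of this correspondence (a minimal sketch, not from the original
text; the array shapes and values are made up), the two reconstruction negative
log-likelihoods can be computed as follows: the Gaussian case reduces to squared error up
to additive constants, and the factorized Bernoulli case is the usual cross-entropy.

    import numpy as np

    def gaussian_recon_nll(x, x_hat, sigma=1.0):
        # -log N(x; x_hat, sigma^2 I): squared error plus an additive constant.
        d = x.size
        return (0.5 * np.sum((x - x_hat) ** 2) / sigma**2
                + 0.5 * d * np.log(2 * np.pi * sigma**2))

    def bernoulli_recon_nll(x, x_hat, eps=1e-12):
        # -log prod_i Bernoulli(x_i; x_hat_i): the cross-entropy reconstruction loss.
        x_hat = np.clip(x_hat, eps, 1 - eps)
        return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

    # x_hat would be the decoder output g(f(x)); dummy values are used here.
    print(gaussian_recon_nll(np.array([0.5, -1.2, 3.0]), np.array([0.4, -1.0, 2.5])))
    print(bernoulli_recon_nll(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.7])))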
Another advantage of this view is that we can think about the training of the decoder
as estimating the conditional distribution P(x|h), which comes handy in the probabilis-
tic interpretation of denoising auto-encoders, allowing us to talk about the distribution
P (x) explicitly or implicitly represented by the auto-encoder (see Sections 13.4, 17.7.1
and 17.8 for more details). In the same spirit, we can rethink the notion of encoder from
a simple function to a conditional distribution Q(h|x), with a special case being when
Q(h|x) is a Dirac at some particular value. Equivalently, thinking about the encoder
as a distribution corresponds to injecting noise inside the auto-encoder. This view is
developed further in Sections 17.7.1 and 17.9.
13.3 Sparse Representations
Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e., one
whose elements are often either zero or close to zero. Sparse coding was introduced in
Section 10.2.4 as a linear factor model in which the prior P (h) on the representation
h = f(x) encourages values at or near 0. In Section 13.3.1, we see how ordinary auto-
encoders can be prevented from learning a useless identity transformation by using a
sparsity penalty rather than a bottleneck. The main difference between a sparse auto-
encoder and sparse coding is that sparse coding has no explicit parametric encoder,
whereas sparse auto-encoders have one. The “encoder” of sparse coding is the algorithm
that performs the approximate inference, i.e., looks for
h∗(x) = arg max_h log P(h|x) = arg min_h ||x − (b + W h)||²/σ² − log P(h)        (13.1)
where σ² is a reconstruction variance parameter (which should equal the average squared
reconstruction error¹), and P(h) is a “sparse” prior that puts more probability mass
around h = 0, such as the Laplacian prior, with factorized marginals
P(h_i) = (λ/2) e^{−λ|h_i|}        (13.2)
or the Student-t prior, with factorized marginals
P(h_i) ∝ 1 / (1 + h_i²/ν)^{(ν+1)/2}.        (13.3)
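To make the inference of Eq. 13.1 concrete, here is a minimal sketch (illustrative only;
the dictionary W, the offset b and all sizes are made up) that solves the inner
minimization for the Laplace prior by proximal gradient descent (ISTA): a gradient step
on the quadratic reconstruction term followed by soft-thresholding, the proximal
operator of the λ|h|₁ term.

    import numpy as np

    def soft_threshold(h, t):
        # Proximal operator of t * ||h||_1.
        return np.sign(h) * np.maximum(np.abs(h) - t, 0.0)

    def sparse_code(x, W, b, lam=0.1, sigma2=1.0, n_steps=100):
        # Approximately solve h*(x) = argmin_h ||x - (b + W h)||^2 / sigma2 + lam * ||h||_1.
        lr = sigma2 / (2.0 * np.linalg.norm(W, 2) ** 2)  # step below 1/Lipschitz of the smooth part
        h = np.zeros(W.shape[1])
        for _ in range(n_steps):
            grad = -2.0 * W.T @ (x - (b + W @ h)) / sigma2  # gradient of the quadratic term
            h = soft_threshold(h - lr * grad, lr * lam)     # proximal step for the Laplace (L1) prior
        return h

    rng = np.random.default_rng(0)
    W = rng.normal(size=(20, 50))      # decoder dictionary: 20-dim inputs, 50 code units
    x = rng.normal(size=20)
    h_star = sparse_code(x, W, np.zeros(20))
    print("non-zero code elements:", int(np.sum(h_star != 0)))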
The advantages of such a non-parametric encoder and the sparse coding approach over
sparse auto-encoders are that
1. it can in principle minimize the combination of reconstruction error and log-prior
better than any parametric encoder,
¹ but it can be lumped into the regularizer λ, which controls the strength of the sparsity prior, defined
in Eq. 13.2, for example.
2. it performs what is called explaining away (see Figure 9.8), i.e., it allows the model
to “choose” some “explanations” (hidden factors) and inhibit the others.
The disadvantages are that
1. computing time for encoding the given input x, i.e., performing inference (com-
puting the representation h that goes with the given x) can be substantially larger
than with a parametric encoder (because an optimization must be performed), and
2. the resulting encoder function can be highly non-smooth and non-linear (with two
nearby x’s being associated with very different h’s), potentially making it more
difficult for the downstream layers to properly generalize.
In Section 13.3.2, we describe PSD (Predictive Sparse Decomposition), which com-
bines a non-parametric encoder (as in sparse coding, with the representation obtained
via an optimization) and a parametric encoder (like in the sparse auto-encoder). Sec-
tion 13.4 introduces the Denoising Auto-Encoder (DAE), which puts pressure on the
representation by requiring it to extract information about the underlying distribution
and where it concentrates, so as to be able to denoise a corrupted input. Section 13.5
describes the Contractive Auto-Encoder (CAE), which optimizes an explicit regulariza-
tion penalty that aims at making the representation as insensitive as possible to the
input, while keeping the information sufficient to reconstruct the training examples.
13.3.1 Sparse Auto-Encoders
A sparse auto-encoder is simply an auto-encoder whose training criterion involves a
sparsity penalty Ω(h) in addition to the reconstruction error:
L = −log P(x|g(h)) + Ω(h)        (13.4)
where g(h) is the decoder output and typically we have h = f(x), the encoder output.
We can think of that penalty Ω(h) simply as a regularizer or as a log-prior on the
representations h. For example, the sparsity penalty corresponding to the Laplace prior
is the absolute value sparsity penalty (see also Eq. 13.2 above):
Ω(h) = λ Σ_i |h_i|,    −log P(h) = Σ_i ( λ|h_i| − log(λ/2) ) = const + Ω(h)        (13.5)
where the constant term depends only of λ and not h (which we typically ignore in the
training criterion because we consider λ as a hyper-parameter rather than a parame-
ter). Similarly (as per Eq. 13.3), the sparsity penalty corresponding to the Student-t
prior (Olshausen and Field, 1997) is
Ω(h) = Σ_i ((ν + 1)/2) log(1 + h_i²/ν)        (13.6)
where ν is considered to be a hyper-parameter.
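For concreteness, here is a minimal sketch of the resulting training criterion for a
one-hidden-layer sparse auto-encoder with the absolute-value (Laplace-prior) penalty;
this is an illustration under assumed shapes and parameters, not the implementation of
any of the cited papers, and in practice the loss would be minimized over the parameters
by minibatch gradient descent with automatic differentiation.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sparse_ae_loss(x, W_enc, b_enc, W_dec, b_dec, lam=0.1):
        h = sigmoid(W_enc @ x + b_enc)      # encoder output h = f(x)
        x_hat = W_dec @ h + b_dec           # decoder output g(h)
        recon = np.sum((x - x_hat) ** 2)    # -log P(x|g(h)) for a fixed-variance Gaussian, up to constants
        sparsity = lam * np.sum(np.abs(h))  # Omega(h) = lam * sum_i |h_i|  (Eq. 13.5)
        return recon + sparsity

    rng = np.random.default_rng(0)
    x = rng.normal(size=10)
    W_enc = rng.normal(scale=0.1, size=(30, 10))
    W_dec = rng.normal(scale=0.1, size=(10, 30))
    print(sparse_ae_loss(x, W_enc, np.zeros(30), W_dec, np.zeros(10)))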
The early work on sparse auto-encoders (Ranzato et al., 2007, 2008) considered
various forms of sparsity and proposed a connection between sparsity regularization and
the partition function gradient in energy-based models (see Section TODO). The idea
is that a regularizer such as sparsity makes it difficult for an auto-encoder to achieve
zero reconstruction error everywhere. If we consider reconstruction error as a proxy
for energy (unnormalized log-probability of the data), then minimizing the training
set reconstruction error forces the energy to be low on training examples, while the
regularizer prevents it from being low everywhere. The same role is played by the
gradient of the partition function in energy-based models such as the RBM (Section
TODO).
However, the sparsity penalty of sparse auto-encoders does not need to have a prob-
abilistic interpretation. For example, Goodfellow et al. (2009) successfully used the
following sparsity penalty, which does not try to bring h_i all the way down to 0, but
only towards some low target value such as ρ = 0.05:

Ω(h) = −Σ_i [ ρ log h_i + (1 − ρ) log(1 − h_i) ]        (13.7)

where 0 < h_i < 1, usually with h_i = sigmoid(a_i). This is just the cross-entropy between
the Bernoulli distribution with probability p = h_i and the target Bernoulli distribution
with probability p = ρ.
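A short sketch of this target-sparsity penalty (illustrative; only ρ = 0.05 comes from
the text, the example activations are made up):

    import numpy as np

    def target_sparsity_penalty(h, rho=0.05, eps=1e-12):
        # Cross-entropy between Bernoulli(h_i) and the target Bernoulli(rho),
        # summed over units (Eq. 13.7).
        h = np.clip(h, eps, 1 - eps)   # the h_i are sigmoid outputs in (0, 1)
        return -np.sum(rho * np.log(h) + (1 - rho) * np.log(1 - h))

    # Units whose activation is far from rho contribute much more to the penalty.
    print(target_sparsity_penalty(np.array([0.04, 0.06, 0.5, 0.9])))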
One way to achieve actual zeros in h for sparse (and denoising) auto-encoders was
introduced in Glorot et al. (2011c). The idea is to use a half-rectifier (also known simply
as a “rectifier”) or ReLU (Rectified Linear Unit, introduced in Glorot et al. (2011b) for
deep supervised networks and earlier in Nair and Hinton (2010a) in the context of
RBMs) as the output non-linearity of the encoder. With a prior that actually pushes
the representations to zero (like the absolute value penalty), one can thus indirectly
control the average number of zeros in the representation. ReLUs were first successfully
used for deep feedforward networks in Glorot et al. (2011a), achieving for the first time
the ability to train fairly deep supervised networks without the need for unsupervised
pre-training, and this turned out to be an important component in the 2012 object
recognition breakthrough with deep convolutional networks (Krizhevsky et al., 2012a).
Interestingly, the “regularizer” used in sparse auto-encoders does not conform to
the classical interpretation of regularizers as priors on the parameters. That classical
interpretation of the regularizer comes from the MAP (Maximum A Posteriori) point
estimation (see Section 5.5.1) of parameters associated with the Bayesian view of param-
eters as random variables and considering the joint distribution of data x and parameters
θ (see Section 5.7):
arg max_θ P(θ|x) = arg max_θ ( log P(x|θ) + log P(θ) )
where the first term on the right is the usual data log-likelihood term and the second
term, the log-prior over parameters, incorporates the preference over particular values
of θ.
With regularized auto-encoders such as sparse auto-encoders and contractive auto-
encoders, instead, the regularizer corresponds to a log-prior over the representation, or
over latent variables. In the case of sparse auto-encoders, predictive sparse decomposi-
tion and contractive auto-encoders, the regularizer specifies a preference over functions
of the data, rather than over parameters. This makes such a regularizer data-dependent,
unlike the classical parameter log-prior. Specifically, in the case of the sparse auto-
encoder, it says that we prefer an encoder whose output produces values closer to 0.
Indirectly (when we marginalize over the training distribution), this is also indicating a
preference over parameters, of course.
13.3.2 Predictive Sparse Decomposition
Predictive Sparse Decomposition (PSD) is a variant that combines sparse coding and an
auto-encoder (Kavukcuoglu et al., 2008b). It has been applied to unsupervised feature
learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010;
Jarrett et al., 2009a; Farabet et al., 2011), as well as for audio (Henaff et al., 2011).
The representation is considered to be a free variable (possibly a latent variable if we
choose a probabilistic interpretation) and the training criterion combines a sparse coding
criterion with a term that encourages the optimized sparse representation h to be close
to the output of the encoder f (x):
L = min_h ||x − g(h)||² + λ|h|₁ + γ||h − f(x)||²        (13.8)
where f is the encoder and g is the decoder. Like in sparse coding, for each example x
an iterative optimization is performed in order to obtain a representation h. However,
because the iterations can be initialized from the output of the encoder, i.e., with h =
f(x), only a few steps (e.g. 10) are necessary to obtain good results. Simple gradient
descent on h has been used by the authors. After h is settled, both g and f are updated
towards minimizing the above criterion. The first two terms are the same as in L1 sparse
coding, while the third term encourages f to predict the outcome of the sparse coding
optimization, making it a better choice for the initialization of the iterative optimization.
Hence f can be used as a parametric approximation to the non-parametric encoder
implicitly defined by sparse coding. It is one of the first instances of learned approximate
inference (see also Sec. 16.6). Note that this is different from separately doing sparse
coding (i.e., training g) and then training an approximate inference mechanism f, since
both the encoder and decoder are trained together to be “compatible” with each other.
Hence the decoder will be learned in such a way that inference will tend to find solutions
that can be well approximated by the approximate inference. A similar example is the
variational auto-encoder, in which the encoder acts as approximate inference for the
decoder, and both are trained jointly (Section 17.7.1). See also Section 17.7.2 for a
probabilistic interpretation of PSD in terms of a variational lower bound on the log-
likelihood.
In practical applications of PSD, the iterative optimization is only used during train-
ing, and f is used to compute the learned features (e.g., to pre-train a supervised
deep network). It makes computation fast at recognition time and also makes it easy
to use the trained features f as initialization (unsupervised pre-training) for the lower
layers of a deep net.
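As a rough sketch of how a PSD update could be organized (hypothetical code, not the
authors' implementation; linear f and g and plain subgradient steps on h are assumed for
simplicity), the inner loop below refines h by a few gradient steps on Eq. 13.8 starting
from the encoder output f(x):

    import numpy as np

    def psd_objective(x, h, W_dec, W_enc, lam=0.1, gamma=1.0):
        # ||x - g(h)||^2 + lam |h|_1 + gamma ||h - f(x)||^2, with linear f and g (Eq. 13.8).
        return (np.sum((x - W_dec @ h) ** 2)
                + lam * np.sum(np.abs(h))
                + gamma * np.sum((h - W_enc @ x) ** 2))

    def optimize_code(x, W_dec, W_enc, lam=0.1, gamma=1.0, n_steps=10, lr=0.01):
        h = W_enc @ x                # warm start at the encoder output f(x)
        for _ in range(n_steps):     # a few steps (e.g. 10) suffice thanks to the warm start
            grad = (-2.0 * W_dec.T @ (x - W_dec @ h)
                    + lam * np.sign(h)
                    + 2.0 * gamma * (h - W_enc @ x))
            h = h - lr * grad
        return h

    rng = np.random.default_rng(0)
    W_dec = rng.normal(scale=0.1, size=(20, 40))
    W_enc = rng.normal(scale=0.1, size=(40, 20))
    x = rng.normal(size=20)
    h = optimize_code(x, W_dec, W_enc)
    print(psd_objective(x, h, W_dec, W_enc))
    # After h is settled, W_enc and W_dec would each be updated by a gradient step
    # on the same objective, holding h fixed.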
13.4 Denoising Auto-Encoders
The Denoising Auto-Encoder (DAE) was first proposed (Vincent et al., 2008, 2010) as
a means of forcing an auto-encoder to learn to capture the data distribution without an
explicit constraint on either the dimension or the sparsity of the learned representation.
It was motivated by the idea that in order to fully capture a complex distribution, an
auto-encoder needs to have at least as many hidden units as needed by the complexity
of that distribution. Hence its dimensionality should not be restricted to the input
dimension.
The principle of the denoising auto-encoder is deceptively simple and illustrated in
Figure 13.13: the encoder sees as input a corrupted version of the input, but the decoder
tries to reconstruct the clean uncorrupted input.
Figure 13.13: The computational graph of a denoising auto-encoder, which is trained
to reconstruct the clean data point x from its corrupted version x̃, i.e., to minimize
the loss L = −log P(x|g(f(x̃))), where x̃ is a corrupted version of the data example x,
obtained through a given corruption process C(x̃|x).
Mathematically, and following the notation used in this chapter, this can be formalized
as follows. We introduce a corruption process C(x̃|x) which represents a conditional
distribution over corrupted samples x̃, given a data sample x. The auto-encoder then
learns a reconstruction distribution P(x|x̃) estimated from training pairs (x, x̃), as follows:
1. Sample a training example x from the data generating distribution (the
training set).

2. Sample a corrupted version x̃ from the conditional distribution C(x̃|x).

3. Use (x, x̃) as a training example for estimating the auto-encoder reconstruction
distribution P(x|x̃) = P(x|g(h)) with h the output of encoder f(x̃) and g(h) the
output of the decoder.
Typically we can simply perform gradient-based approximate minimization (such as
minibatch gradient descent) on the negative log-likelihood −log P(x|h), i.e., the de-
noising reconstruction error, using back-propagation to compute gradients, just like for
regular feedforward neural networks (the only difference being the corruption of the
input and the choice of target output).
We can view this training objective as performing stochastic gradient descent on the
denoising reconstruction error, but where the “noise” now has two sources:
1. the choice of training sample x from the data set, and
2. the random corruption applied to x to obtain x̃.
We can therefore consider that the DAE is performing stochastic gradient descent
on the following expectation:
E_{x∼Q(x)} E_{x̃∼C(x̃|x)} [ −log P(x|g(f(x̃))) ]

where Q(x) is the training distribution.
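The sketch below (hypothetical, for illustration only; shapes, noise level and tied
weights are all assumptions) computes this stochastic denoising loss for one example in
the Gaussian-corruption, squared-error case; in practice it would be minimized by
minibatch gradient descent with gradients from automatic differentiation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def corrupt(x, sigma=0.5):
        # Corruption process C(x_tilde | x): additive isotropic Gaussian noise.
        return x + sigma * rng.normal(size=x.shape)

    def dae_loss(x, W, b_enc, b_dec, sigma=0.5):
        # Denoising reconstruction error ||g(f(x_tilde)) - x||^2 with tied weights W and W.T.
        x_tilde = corrupt(x, sigma)       # noise source 2: the random corruption
        h = sigmoid(W @ x_tilde + b_enc)  # encoder applied to the *corrupted* input
        x_hat = W.T @ h + b_dec           # decoder tries to reconstruct the *clean* input
        return np.sum((x_hat - x) ** 2)

    # Noise source 1 is the random choice of the training example x from the data set.
    x = rng.normal(size=15)
    W = rng.normal(scale=0.1, size=(40, 15))
    print(dae_loss(x, W, np.zeros(40), np.zeros(15)))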
Figure 13.14: A denoising auto-encoder is trained to reconstruct the clean data point x
from its corrupted version x̃. In the figure, we illustrate the corruption process C(x̃|x)
(grey circle of equiprobable corruptions, and grey arrow for the corruption process)
acting on examples x (red crosses) lying near a low-dimensional manifold near which
probability concentrates. When the denoising auto-encoder is trained to minimize the
average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E[x|x̃],
which approximately points orthogonally towards the manifold, since it estimates the
center of mass of the clean points x which could have given rise to x̃. The auto-encoder
thus learns a vector field g(f(x)) − x (the green arrows), and it turns out that this vector
field estimates the gradient field ∂ log Q(x)/∂x (up to a multiplicative constant related
to the reconstruction error), where Q is the unknown data generating distribution.
13.4.1 Learning a Vector Field that Estimates a Gradient Field
As illustrated in Figure 13.14, a very important property of DAEs is that their training
criterion makes the auto-encoder learn a vector field g(f(x)) − x that estimates the gradient
field (or score) ∂ log Q(x)/∂x. This has been proven (Alain and Bengio, 2012, 2013) in the
case where the corruption and the reconstruction distributions are Gaussian (and of
course x is continuous-valued), i.e., with the squared error denoising criterion

||g(f(x̃)) − x||²

and corruption

C(x̃|x) = N(x̃; µ = x, Σ = σ²I)

with noise variance σ².
Figure 13.15: Vector field learned by a denoising auto-encoder around a 1-D curved
manifold near which the data (orange circles) concentrates in a 2-D space. Each arrow
is proportional to the reconstruction minus input vector of the auto-encoder and points
towards higher probability according to the implicitly estimated probability distribution.
Note that the vector field has zeros at both peaks of the estimated density function (on
the data manifolds) and at troughs (local minima) of that density function, e.g., on the
curve that separates different arms of the spiral or in the middle of it.
More precisely, the main theorem states that
(g(f(x)) − x)/σ² is a consistent estimator of ∂ log Q(x)/∂x, i.e.,

(g(f(x)) − x)/σ²  →  ∂ log Q(x)/∂x,
so long as f and g have sufficient capacity to represent the true score (and assuming
that the expected training criterion can be minimized, as usual when proving consistency
associated with a training objective).
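In code, this result says that the score estimate can simply be read off a trained DAE's
reconstruction function; a minimal sketch, assuming f and g are the trained encoder and
decoder and sigma is the Gaussian corruption noise level used during training:

    def estimated_score(x, f, g, sigma):
        # Estimate of d log Q(x) / dx from a DAE trained with Gaussian corruption of
        # standard deviation sigma and squared reconstruction error.
        return (g(f(x)) - x) / sigma**2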
Although it was intuitively appealing that in order to denoise correctly one must
capture the training distribution, this result makes it mathematically very clear. Fig-
ure 13.15 (see details of experiment in Alain and Bengio (2013)) illustrates this. Note
how the norm of reconstruction error (i.e. the norm of the vectors shown in the figure)
is related to but different from the energy (unnormalized log-density) associated with
the estimated model. The energy should be low only where the probability is high. The
reconstruction error (norm of the estimated score vector) is also low where probability
is high, but it can also be low at maxima of energy (minima of probability).
13.4.2 Turning the Gradient Field into a Generative Model
Since each application of the encoder-decoder pair brings us towards a more probable
configuration, we can use this to actually sample from the estimated distribution. If you
consider most Markov chain Monte Carlo (MCMC) algorithms, they have two elements:
1. move from lower probability configurations towards higher probability configura-
tions, and
2. inject randomness so that the chain moves around (and does not stay stuck at
some peak of probability, or mode) and has a chance to visit the whole space, with
a visiting probability that matches the probability under the underlying model.
So conceptually all one needs to do is to perform encode-decode operations (go towards
more probable configurations) as well as inject noise (to move around the probable
configurations), as hinted at in (Mesnil et al., 2012; Rifai et al., 2012). But how much
and how?
Figure 13.16: Each step of the Markov chain associated with a trained denoising auto-
encoder, which generates samples from the probabilistic model implicitly trained by
the denoising reconstruction criterion. Each step consists in (a) injecting corruption C
in state x, yielding x̃, (b) encoding it with f, (c) decoding the result with g, and (d) from
this, sampling a new state from the reconstruction distribution P(x|ω = g(f(x̃))). In
the typical squared reconstruction error case, g(h) = x̂ estimates E[x|x̃], corruption
consists in adding Gaussian noise, and sampling from P(x|ω) consists in adding another
Gaussian noise to the reconstruction x̂. The latter noise level should correspond to the
mean squared error of reconstructions, whereas the injected noise is a hyper-parameter
that controls the mixing speed as well as the extent to which the estimator smoothes
the empirical distribution (Vincent, 2011). In the figure, only the C and P conditionals
are stochastic steps (f and g are deterministic computations), although noise can also
be injected inside the auto-encoder, as in generative stochastic networks (Bengio et al.,
2014).
An answer has been given by Bengio et al. (2013a), Theorem 1, which states that
each step of the Markov chain that generates from the estimated distribution consists
of the following sub-steps, illustrated in Figure 13.16:
1. Starting from the previous state x, inject corruption noise, sampling x̃ from C(x̃|x).

2. Encode x̃ into h = f(x̃).

3. Decode h to obtain the parameters ω = g(h) of P(x|ω = g(h)) = P(x|x̃).

4. Sample the next state x from P(x|ω = g(h)) = P(x|x̃).
The theorem states that if the auto-encoder P(x|x̃) forms a consistent estimator of the
corresponding true conditional distribution, then the stationary distribution of the above
Markov chain forms a consistent estimator (albeit an implicit one) of the data generating
distribution of x.
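A sketch of the resulting sampling loop (illustrative only; f, g, the corruption level
sigma and the reconstruction noise level sigma_recon are assumed to come from a DAE
already trained with Gaussian corruption and squared error):

    import numpy as np

    def dae_sampling_chain(x0, f, g, sigma, sigma_recon, n_steps=1000, rng=None):
        # Markov chain of Bengio et al. (2013a), Gaussian corruption / Gaussian output case.
        if rng is None:
            rng = np.random.default_rng()
        x = np.array(x0, dtype=float)
        samples = []
        for _ in range(n_steps):
            x_tilde = x + sigma * rng.normal(size=x.shape)      # 1. corrupt the previous state
            h = f(x_tilde)                                      # 2. encode
            omega = g(h)                                        # 3. decode: parameters of P(x | omega)
            x = omega + sigma_recon * rng.normal(size=x.shape)  # 4. sample the next state
            samples.append(x.copy())
        return samples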
Figure 13.17: Illustration of one step of the sampling Markov chain associated with a
denoising auto-encoder (see also Figure 13.16). In the figure, the data (black circles)
are sitting near a low-dimensional manifold (a spiral, here), and the two stochastic steps
of the Markov chain are first to corrupt x (clean image of dog, green circle) into x̃
(noisy image of dog, blue circle) via C(x̃|x) (here an isotropic Gaussian noise in green),
and then to sample a new x via the estimated denoising distribution P(x|x̃). Note how
there are many possible x which could have given rise to x̃, and these all lie on the
manifold in the neighborhood of x̃, hence the flattened shape of P(x|x̃) (in blue).
Modified from a figure first created and graciously authorized by Jason Yosinski.
Figure 13.17 illustrates the sampling process of the DAE in a way that complements
Figure 13.16, with a specific imagined example. For a more elaborate discussion of the
probabilistic nature of denoising auto-encoders, and their generalization (Bengio et al.,
2014), Generative Stochastic Networks (GSNs), see Section 17.9. In particular, the noise
does not have to be injected only in the input, and it could be injected anywhere along
the chain. GSNs also generalize DAEs by allowing the state of the Markov chain to be
extended beyond the visible variable x, to include also some latent variable h. Finally,
Section 17.9 discusses training strategies for DAEs that are aimed at making it a better
generative model and not just a good feature learner.
See also Section 15.5 for the connection to score matching, an induction principle
that is an alternative to the maximum likelihood principle.
13.5 Contractive Auto-Encoders
The Contractive Auto-Encoder or CAE (Rifai et al., 2011a,c) introduces an explicit
regularizer on the code h = f (x), encouraging the derivatives of f to be as small as
possible:
Ω(h) = ||∂f(x)/∂x||²_F        (13.9)

which is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix
associated with the encoder. Whereas the denoising auto-encoder learns to contract the
reconstruction function (the composition of the encoder and decoder), the CAE learns
to specifically contract the encoder. See Figure 13.11 for a view of how contraction near
the data points makes the auto-encoder capture the manifold structure.
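For a single-layer sigmoid encoder h = sigmoid(Wx + b), the i-th row of the Jacobian is
h_i(1 − h_i) W_i, so the penalty of Eq. 13.9 can be computed cheaply without forming the
Jacobian explicitly; a minimal sketch (illustrative code with made-up shapes, not the
authors' implementation):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cae_penalty(x, W, b):
        # ||df(x)/dx||_F^2 for h = sigmoid(W x + b): sum_i (h_i (1 - h_i))^2 ||W_i||^2.
        h = sigmoid(W @ x + b)
        return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

    def cae_loss(x, W, b_enc, b_dec, lam=0.1):
        # Reconstruction error (decoder tied to the encoder by W.T) plus the contractive penalty.
        h = sigmoid(W @ x + b_enc)
        x_hat = W.T @ h + b_dec
        return np.sum((x_hat - x) ** 2) + lam * cae_penalty(x, W, b_enc)

    rng = np.random.default_rng(0)
    x = rng.normal(size=12)
    W = rng.normal(scale=0.1, size=(30, 12))
    print(cae_loss(x, W, np.zeros(30), np.zeros(12)))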
If it weren’t for the opposing force of reconstruction error, which attempts to make
the code h keep all the information necessary to reconstruct training examples, the CAE
penalty would yield a code h that is constant and does not depend on the input x. The
compromise between these two forces yields an auto-encoder whose derivatives ∂f(x)/∂x are
tiny in most directions, except those that are needed to reconstruct training examples,
i.e., the directions that are tangent to the manifold near which data concentrate.
What is interesting is that this penalty more strongly forces the representation to be
invariant in directions orthogonal to the manifold. This can be seen clearly by comparing
the singular value spectrum of the Jacobian ∂f(x)/∂x for different auto-encoders, as shown
in Figure 13.18. We see that the CAE manages to concentrate the sensitivity of the
representation in fewer dimensions than a regular (or sparse) auto-encoder. Figure 13.3
illustrates tangent vectors obtained by a CAE on the MNIST digits dataset, showing
that the leading tangent vectors correspond to small deformations such as translation.
More impressively, Figure 13.19 shows tangent vectors learned on 32×32 color (RGB)
CIFAR-10 images by a CAE, compared to the tangent vectors estimated by a non-distributed
representation learner (a mixture of local PCAs).
One practical issue with the CAE regularization criterion is that although it is cheap
to compute in the case of a single hidden layer auto-encoder, it becomes much more
expensive in the case of deeper auto-encoders. The strategy followed in Rifai et al.
(2011a) is to separately pre-train each single-layer auto-encoder stacked to form a deeper
auto-encoder.
Another practical issue is that the contraction penalty on the encoder f could yield
useless results if the decoder g were to exactly compensate (e.g. by being scaled up by
exactly the same amount as f is scaled down). In Rifai et al. (2011a), this is dealt with
by tying the weights of f and g, both being of the form of an affine transformation followed
by a non-linearity (e.g. sigmoid), with the weights of g being the transpose of the weights
of f.
Figure 13.18: Average (over test examples) of the singular value spectrum of the Jaco-
bian matrix ∂f(x)/∂x for the encoder f learned by a regular auto-encoder (AE) versus a
contractive auto-encoder (CAE). This illustrates how the contractive regularizer yields
a smaller set of directions in input space (those corresponding to large singular value of
the Jacobian) which provoke a response in the representation h while the representation
remains almost insensitive for most directions of change in the input.
13.6 Tangent Distance, Tangent-Prop, and Manifold Tan-
gent Classifier
One of the early attempts to take advantage of the manifold hypothesis is the Tangent
Distance algorithm (Simard et al., 1993, 1998). It is a non-parametric nearest-neighbor
algorithm in which the metric used is not the generic Euclidean distance but one that
is derived from knowledge of the manifolds near which probability concentrates. It
is assumed that we are trying to classify examples and that examples on the same
manifold share the same category. Since the classifier should be invariant to the local
factors of variation that correspond to movement on the manifold, it would make sense
to use as the nearest-neighbor distance between points x_1 and x_2 the distance between
the manifolds M_1 and M_2 to which they respectively belong. Although that may be
computationally difficult (it would require an optimization to find the nearest pair of
points on M_1 and M_2), a cheap alternative that makes sense locally is to approximate
M_i by its tangent plane at x_i and measure the distance between the two tangents, or
between a tangent plane and a point. That can be achieved by solving a low-dimensional linear
Figure 13.19: Illustration of tangent vectors of the manifold estimated by a contractive
auto-encoder (CAE), at some input point (left, CIFAR-10 image of a dog). See also
Fig. 13.3. Each image on the right corresponds to a tangent vector, either estimated by
a local PCA (equivalent to a Gaussian mixture), top, or by a CAE (bottom). Although
both local PCA and CAE can capture local tangents that are different in different points,
the local PCA does not have enough training data to meaningfully capture good tangent
directions, whereas the CAE does (because it exploits parameter sharing across different
locations that share a subset of active hidden units). The CAE tangent directions
typically correspond to moving or changing parts of the object (such as the head or
legs), which corresponds to plausible changes in the input image.
system (in the dimension of the manifolds). Of course, this algorithm requires one to
specify the tangent vectors at any point.
In a related spirit, the Tangent-Prop algorithm (Simard et al., 1992) proposes to
train a neural net classifier with an extra penalty to make the output f(x) of the neural
net locally invariant to known factors of variation. These factors of variation correspond
to movement of the manifold near which examples of the same class concentrate. Local
invariance is achieved by requiring ∂f(x)/∂x to be orthogonal to the known manifold tangent
vectors v_i at x, or equivalently that the directional derivatives of f at x in the directions
v_i be small, via the regularization penalty

regularizer = λ Σ_i ( ∂f(x)/∂x · v_i )²        (13.10)
Like for tangent distance, the tangent vectors are derived a priori, e.g., from the formal
knowledge of the effect of transformations such as translation, rotation, and scaling in
images.
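A sketch of the tangent-prop regularizer of Eq. 13.10, with each directional derivative
(Jacobian-vector product) approximated by a finite difference; f stands for any
differentiable classifier output function and the tangent vectors are assumed given
(everything here is illustrative, not from the original papers):

    import numpy as np

    def tangent_prop_penalty(f, x, tangent_vectors, lam=1.0, eps=1e-4):
        # lam * sum_i || (df(x)/dx) . v_i ||^2, with each directional derivative
        # approximated by a finite difference.
        fx = f(x)
        penalty = 0.0
        for v in tangent_vectors:
            jvp = (f(x + eps * v) - fx) / eps   # approximates (df/dx) . v
            penalty += np.sum(np.square(jvp))
        return lam * penalty

    # Toy example: a linear "classifier" and two tangent directions.
    A = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]])
    f = lambda z: A @ z
    tangents = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
    print(tangent_prop_penalty(f, np.ones(3), tangents))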
A more recent paper introduces the Manifold Tangent Classifier (Rifai et al., 2011d),
which eliminates the need to know the tangent vectors a priori, and instead uses a con-
tractive auto-encoder to estimate them at any point. As we have seen in the previous
Figure 13.20: Illustration of the main idea of the tangent-prop algorithm (Simard et al.,
1992) and manifold tangent classifier (Rifai et al., 2011d), which both regularize the
classifier output function f(x) (e.g. estimating conditional class probabilities given the
input) so as to make it invariant to the local directions of variation ∂h/∂x (manifold
tangent directions). This can be achieved by penalizing the magnitude of the dot product
of all the rows of ∂h/∂x (the tangent directions) with all the rows of ∂f/∂x (the directions
of sensitivity of each output to the input). In the case of the tangent-prop algorithm, the
tangent directions are given a priori, whereas in the case of the manifold tangent classifier,
they are learned, with h(x) being the learned representation of the input x. The figure
illustrates two manifolds, one per class, and we see that the classifier output increases
the most as we move from one manifold to the other, in input space.
section and Figure 13.18, auto-encoders in general, and contractive auto-encoders es-
pecially well, learn a representation h that is most sensitive to the factors of variation
present in the data x, so that the leading singular vectors of ∂h/∂x correspond to the esti-
mated tangent vectors. As illustrated in Figure 13.19, these estimated tangent vectors
go beyond the classical invariants that arise out of the geometry of images (such as
translation, rotation and scaling) and include factors that must be learned because they
are object-specific (such as adding or moving body parts). The algorithm proposed with
the manifold tangent classifier is therefore simple: (1) use a regularized auto-encoder
such as the contractive auto-encoder to learn the manifold structure by unsupervised
learning (2) use these tangents to regularize a neural net classifier as in Tangent Prop
(Eq. 13.10).