Chapter 16
Linear Factor Models and
Auto-Encoders
Linear factor models are generative unsupervised learning models in which we imag-
ine that some unobserved factors h explain the observed variables x through a linear
transformation. Auto-encoders are unsupervised learning methods that learn a representation of the data, typically obtained by a non-linear parametric transformation of the data, i.e., from x to h, most often via a feedforward neural network, but not necessarily. Like the linear factor models, they also learn a transformation going backwards from the representation to the data, from h to x. Linear factor models therefore only specify
a parametric decoder, whereas auto-encoders also specify a parametric encoder. Some
linear factor models, like PCA, actually correspond to an auto-encoder (a linear one),
but for others the encoder is implicitly defined via an inference mechanism that searches
for an h that could have generated the observed x.
The idea of auto-encoders has been part of the historical landscape of neural networks
for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994) but
has really picked up speed in recent years. Auto-encoders remained somewhat marginal for
many years, in part due to what was an incomplete understanding of the mathematical
interpretation and geometrical underpinnings of auto-encoders, which are developed
further in Chapter 18 and Section 21.10.
An auto-encoder is simply a neural network that tries to copy its input
to its output. The architecture of an auto-encoder is typically decomposed into the
following parts, illustrated in Figure 16.1:
• an input, x
• an encoder function f
• a code or internal representation h = f(x)
• a decoder function g
• an output, also called reconstruction, r = g(h) = g(f(x))
Figure 16.1: General schema of an auto-encoder, mapping an input x to an output
(called reconstruction) r through an internal representation or code h. The auto-encoder
has two components: the encoder f (mapping x to h) and the decoder g (mapping h
to r).
• a loss function L computing a scalar L(r, x) measuring how good a reconstruction r is of the given input x. The objective is to minimize the expected value of L over the training set of examples {x}.
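To make the schema above concrete, here is a minimal sketch of an auto-encoder with these parts, written in PyTorch (the framework choice, dimensions and hyper-parameters are illustrative assumptions, not prescribed by the text):

```python
# A minimal sketch of the encoder/decoder/reconstruction-loss structure described
# above. All sizes and hyper-parameters are illustrative only.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # encoder f: maps input x to code h
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.Sigmoid())
        # decoder g: maps code h back to a reconstruction r
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        h = self.encoder(x)          # h = f(x)
        r = self.decoder(h)          # r = g(h) = g(f(x))
        return r, h

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # squared reconstruction error L(r, x)

x = torch.rand(64, 784)              # a stand-in minibatch of training examples
for step in range(100):
    r, h = model(x)
    loss = loss_fn(r, x)             # minimize expected L(r, x) over the data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Without a constraint such as a small code dimension, nothing prevents such a network from learning a trivial copy of its input, which is the issue taken up in the next section.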
16.1 Regularized Auto-Encoders
Predicting the input may sound useless: what could prevent the auto-encoder from simply copying its input into its output? In the 20th century, this was avoided by constraining the architecture of the auto-encoder, forcing the dimension of the code h to be smaller than the dimension of the input x.
Figure 16.2 illustrates the two typical cases of auto-encoders: undercomplete vs over-
complete, i.e., with the dimension of the representation h respectively smaller vs larger
than the input x. Whereas early work with auto-encoders, just like PCA, used undercompleteness – i.e. a bottleneck in the sequence of layers – to avoid learning the identity
function, more recent work allows overcomplete representations. What we have learned
in recent years is that it is possible to make the auto-encoder meaningfully capture the
structure of the input distribution even if the representation is overcomplete, with other
forms of constraint or regularization. In fact, once you realize that auto-encoders can
Figure 16.2: Left: undercomplete representation (dimension of code h is less than di-
mension of input x). Right: overcomplete representation. Overcomplete auto-encoders
require some other form of regularization (instead of the constraint on the dimension of
h) to avoid the trivial solution where r = x for all x.
capture the input distribution (indirectly, not as an explicit probability function), you also realize that it should need more capacity as one increases the complexity of the distribution to be captured (and the amount of data available): it should not be limited by the input dimension. This is a problem in particular with shallow auto-encoders, which have a single hidden layer (for the code). Indeed, that hidden layer size controls both the dimensionality reduction constraint (the code size at the bottleneck) and the capacity (which allows a more complex distribution to be learned).
Besides the bottleneck constraint, alternative constraints or regularization methods have been explored that can guarantee that the auto-encoder does something useful and does not just learn a trivial identity-like function:
• Sparsity of the representation or of its derivative: even if the intermediate representation has a very high dimensionality, the effective local dimensionality (number of degrees of freedom that capture a coordinate system among the probable x's) could be much smaller if most of the elements of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to zero). When ||∂h_i/∂x|| is close to zero, h_i does not participate in encoding local changes in x. There is a geometrical interpretation of this situation in terms of manifold learning that is discussed in more depth in Chapter 18. The discussion in Chapter 17 also explains how an auto-encoder naturally tends towards learning a coordinate system for the actual factors of variation in the data. At least four types of “auto-encoders” clearly fall in this category of sparse representation:
Sparse coding (Olshausen and Field, 1996) has been heavily studied as an unsupervised feature learning and feature inference mechanism. It is a linear factor model rather than an auto-encoder, because it has no explicit parametric encoder, and instead uses an iterative inference procedure to compute the code. Sparse coding looks for representations that are both sparse and explain the input through the decoder. Instead of being a parametric function of the input, the code is considered a free variable that is obtained through an optimization, i.e., a particular form of inference:

h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (16.1)

where L is the reconstruction loss, f the (non-parametric) encoder, g the (parametric) decoder, Ω(h) is a sparsity regularizer, and in practice the minimization can be approximate (a minimal sketch of such an iterative inference procedure is given at the end of this list). Sparse coding has a manifold or geometric interpretation that is discussed in Section 16.7. It also has an interpretation as a directed graphical model, described in more detail in Section 20.3. To achieve sparsity, the objective function to optimize includes a term that is minimized when the representation has many zero or near-zero values, such as the L1 penalty |h|_1 = Σ_i |h_i|.
An interesting variation of sparse coding combines the freedom to choose the
representation through optimization and a parametric encoder. It is called
predictive sparse decomposition (PSD) (Kavukcuoglu et al., 2008a) and
is briefly described in Section 16.7.2.
At the other end of the spectrum are simply sparse auto-encoders, which combine the standard auto-encoder schema with a sparsity penalty that encourages the output of the encoder to be sparse. These are described in Section 16.7.1. Besides the L1 penalty, other sparsity penalties that have been explored include the Student-t penalty (Olshausen and Field, 1996; Bergstra, 2011),

Σ_i log(1 + α²h_i²)

(i.e. where αh_i has a Student-t prior density) and the KL-divergence penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008a),

−Σ_i ( t log h_i + (1 − t) log(1 − h_i) ),

with a target sparsity level t, for h_i ∈ (0, 1), e.g. obtained through a sigmoid non-linearity.
Contractive auto-encoders (Rifai et al., 2011b), covered in Section 16.9, explicitly penalize ||∂h/∂x||²_F, i.e., the sum of the squared norms of the vectors ∂h_i(x)/∂x (each indicating how much hidden unit h_i responds to changes in x and what direction of change in x that unit is most sensitive to, around a particular x). With such a regularization penalty, the auto-encoder is called contractive¹ because the mapping from input x to representation h is encouraged to be contractive, i.e., to have small derivatives in all directions. Note that a sparsity regularization indirectly leads to a contractive mapping as well, when the non-linearity used happens to have a zero derivative at h_i = 0 (which is the case for the sigmoid non-linearity).
¹ A function f(x) is contractive if ||f(x) − f(y)|| < ||x − y|| for nearby x and y, or equivalently if the norm of its derivative satisfies ||f′(x)|| < 1.
• Robustness to injected noise or missing information: if noise is injected
in inputs or hidden units, or if some inputs are missing, while the neural network
is asked to reconstruct the clean and complete input, then it cannot simply learn
the identity function. It has to capture the structure of the data distribution
in order to optimally perform this reconstruction. Such auto-encoders are called
denoising auto-encoders and are discussed in more detail in Section 16.8. There is
a tight connection between the denoising auto-encoders and the contractive auto-
encoders: it can be shown (Alain and Bengio, 2013) that in the limit of small
Gaussian injected input noise, the denoising reconstruction error is equivalent to
a contractive penalty on the reconstruction function that maps x to r = g(f(x)).
In other words, since both x and x + ε (where ε is some small noise vector) must yield the same target output x, the reconstruction function is encouraged to be insensitive to changes in all directions ε. The only thing that prevents
reconstruction r from simply being a constant (completely insensitive to the input
x), is that one also has to reconstruct correctly for different training examples x.
However, the auto-encoder can learn to be approximately constant around training
examples x while producing a different answer for different training examples. As
discussed in Section 18.3, if the examples are near a low-dimensional manifold, this
encourages the representation to vary only on the manifold and be locally constant
in directions orthogonal to the manifold, i.e., the representation locally captures
a (not necessarily Euclidean, not necessarily orthogonal) coordinate system for
the manifold. In addition to the denoising auto-encoder, the variational auto-
encoder (Section 21.8.2) and the generative stochastic networks (Section 21.10)
also involve the injection of noise, but typically in the representation-space itself,
thus introducing the notion of h as a latent variable.
• Pressure of a Prior on the Representation: an interesting way to generalize the notion of regularization applied to the representation is to introduce in the cost function for the auto-encoder a log-prior term

−log P(h)

which captures the assumption that we would like to find a representation that has a simple distribution (if P(h) has a simple form, such as a factorized distribution²), or at least one that is simpler than the original data distribution. Among all the encoding functions f, we would like to pick one that
1. can be inverted (easily), and this is achieved by minimizing some reconstruc-
tion loss, and
2. yields representations h whose distribution is “simpler”, i.e., can be captured
with less capacity than the original training distribution itself.
The sparse variants described above clearly fall in that framework. The varia-
tional auto-encoder (Section 21.8.2) provides a clean mathematical framework for
justifying the above pressure of a top-level prior when the objective is to model
the data generating distribution.
² All the sparse priors we have described correspond to a factorized distribution.
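As mentioned in the sparse coding item above, the minimization in Eq. 16.1 is usually carried out iteratively. The following is a minimal sketch of one standard way to do it, a proximal gradient (ISTA) update, for a squared reconstruction loss with an L1 penalty; the dictionary, step size and number of iterations are illustrative assumptions, not the specific procedure of any paper cited here:

```python
# A minimal sketch of iterative inference for Eq. 16.1 with L(g(h), x) = ||x - W h||^2
# and Omega(h) = |h|_1, using ISTA (gradient step followed by soft-thresholding).
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, lam = 20, 50, 0.5

W = rng.normal(size=(d_x, d_h)) / np.sqrt(d_x)    # decoder ("dictionary") weights
x = rng.normal(size=d_x)                          # an input to encode

step = 1.0 / np.linalg.norm(W, 2) ** 2            # step size from a Lipschitz bound
h = np.zeros(d_h)                                 # the code, a free variable
for _ in range(200):
    grad = W.T @ (W @ h - x)                      # gradient of 0.5 ||x - W h||^2
    z = h - step * grad
    h = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold

print("nonzero code elements:", int(np.sum(h != 0.0)), "out of", d_h)
print("reconstruction error :", float(np.sum((x - W @ h) ** 2)))
```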
From the point of view of regularization (Chapter 7), adding the −log P(h) term to the objective function (e.g. for encouraging sparsity) or adding a contractive penalty does not fit the traditional view of a prior on the parameters. Instead, the prior on the latent variables acts as a data-dependent prior, in the sense that it depends on the
particular values h that are going to be sampled (usually from a posterior or an encoder),
based on the input example x. Of course, indirectly, this is also a regularization on the
parameters, but one that depends on the particular data distribution.
16.2 Representational Power, Layer Size and Depth
Nothing in the above description of auto-encoders restricts the encoder or decoder to be shallow, but in the literature on the subject, most trained auto-encoders have had a single hidden layer which is also the representation layer or code³.
³ As argued in this book, this is probably not a good choice: we would like to independently control the constraints on the representation (e.g., dimension and sparsity of the code) and the capacity of the encoder.
For one, we know from the usual universal approximation abilities of single hidden-layer neural networks that a sufficiently large hidden layer can approximate any function to a given accuracy. This observation justifies overcomplete auto-encoders: in order to represent a rich enough distribution, one probably needs many hidden units in the inter-
mediate representation layer. We also know that Principal Components Analysis (PCA)
corresponds to an undercomplete auto-encoder with no intermediate non-linearity, and
that PCA can only capture a set of directions of variation that are the same every-
where in space. This notion is discussed in more detail in Chapter 18 in the context of
manifold learning.
For two, it has also been reported many times that training a deep neural network,
and in particular a deep auto-encoder (i.e. with a deep encoder and a deep decoder) is
more difficult than training a shallow one. This was actually a motivation for the initial
work on the greedy layerwise unsupervised pre-training procedure, described below in
Section 17.1, by which we only need to train a series of shallow auto-encoders in order to
initialize a deep auto-encoder. It was shown early on (Hinton and Salakhutdinov, 2006)
that, if trained properly, such deep auto-encoders could yield much better compression
than corresponding shallow or linear auto-encoders (which are basically doing the same
as PCA, see Section 16.5 below). As discussed in Section 17.7, deeper architectures
can be in some cases exponentially more efficient (both in terms of computation and
statistically) than shallow ones. However, because we can usefully pre-train a deep net
by training and stacking shallow ones, it remains interesting to consider single-layer (or at least shallow and easy-to-train) auto-encoders, as has been done in most of the
literature discussed in this chapter.
16.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss L) make
sense when the loss L is simply the squared reconstruction error, but there are many
cases where this is not appropriate, e.g., when x is a vector of discrete variables or
when P(x|h) is not well approximated by a Gaussian distribution⁴. Just like in the
case of other types of neural networks (starting with the feedforward neural networks,
Section 6.2.2), it is convenient to define the loss L as a negative log-likelihood over
some target random variables. This probabilistic interpretation is particularly impor-
tant for the discussion in Sections 21.8.2, 21.9 and 21.10 about generative extensions of
auto-encoders and stochastic recurrent networks, where the output of the auto-encoder
is interpreted as a probability distribution P (x|h), for reconstructing x, given hid-
den units h. This distribution captures not just the expected reconstruction but also
the uncertainty about the original x (which gave rise to h, either deterministically or
stochastically, given h). In the simplest and most ordinary cases, this distribution factorizes, i.e., P(x|h) = Π_i P(x_i|h). This covers the usual cases of x_i|h being Gaussian (for unbounded real values) and x_i|h having a Bernoulli distribution (for binary values x_i), but one can readily generalize this to other distributions, such as mixtures (see Sections 3.10.5 and 6.2.2).
Thus we can generalize the notion of decoding function g(h) to decoding distribution
P (x|h). Similarly, we can generalize the notion of encoding function f(x) to encoding
distribution Q(h|x), as illustrated in Figure 16.3. We use this to capture the fact
that noise is injected at the level of the representation h, now considered like a latent
variable. This generalization is crucial in the development of the variational auto-encoder
(Section 21.8.2) and the generalized stochastic networks (Section 21.10).
We also find a stochastic encoder and a stochastic decoder in the RBM, described
in Section 21.1. In that case, the encoding distribution Q(h|x) and P (x|h) “match”, in
the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distribution which has both
Q(h|x) and P (x|h) as conditionals. This is not true in general for two independently
parametrized conditionals like Q(h|x) and P (x|h), although the work on generative
stochastic networks (Alain et al., 2015) shows that learning will tend to make them
compatible asymptotically (with enough capacity and examples).
⁴ See the link between squared error and normal density in Sections 5.6 and 6.2.2.
Figure 16.3: Basic scheme of a stochastic auto-encoder, in which both the encoder and
the decoder are not simple functions but instead involve some noise injection, meaning
that their output can be seen as sampled from a distribution, Q(h|x) for the encoder
and P(x|h) for the decoder. RBMs are a special case where P = Q (in the sense of a unique joint distribution corresponding to both conditionals) but in general these two distributions are not necessarily conditional distributions compatible with a unique joint distribution
P (x, h).
16.4 Linear Factor Models
Now that we have introduced the notion of a probabilistic decoder, let us focus on a
very special case where the latent variable h generates x via a linear transformation plus
noise, i.e., classical linear factor models, which do not necessarily have a corresponding
parametric encoder.
The idea of discovering explanatory factors that have a simple joint distribution among themselves is old, e.g., factor analysis (see below), and was first explored in the context where the relationship between factors and data is linear, i.e., we assume that the data was generated as follows. First, sample the real-valued factors,

h ∼ P(h),    (16.2)

and then sample the real-valued observable variables given the factors:

x = W h + b + noise    (16.3)

where the noise is typically Gaussian and diagonal (independent across dimensions).
This is illustrated in Figure 16.4.
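A minimal sketch of sampling from this generative process (Eqs. 16.2 and 16.3), assuming a standard Gaussian prior over h and diagonal Gaussian noise, is given below; all dimensions and parameter values are illustrative:

```python
# Sampling from the linear factor model x = W h + b + noise, with h ~ N(0, I)
# and diagonal Gaussian noise (the factor-analysis case). Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, n = 3, 5, 1000           # latent dim, observed dim, number of samples

W = rng.normal(size=(d_x, d_h))    # factor loadings
b = rng.normal(size=d_x)           # offset
sigma = 0.1 * np.ones(d_x)         # per-dimension noise standard deviations

h = rng.normal(size=(n, d_h))              # h ~ P(h) = N(0, I)          (16.2)
noise = rng.normal(size=(n, d_x)) * sigma
x = h @ W.T + b + noise                    # x = W h + b + noise         (16.3)

# Empirically, Cov[x] should be close to W W^T + diag(sigma^2).
print(np.round(np.cov(x, rowvar=False) - (W @ W.T + np.diag(sigma**2)), 1))
```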
16.5 Probabilistic PCA and Factor Analysis
Probabilistic PCA (Principal Components Analysis) and factor analysis are both special
cases of the above equations (16.2 and 16.3) and only differ in the choices made for the prior (over latent variables, not parameters) and the noise distribution.
Figure 16.4: Basic scheme of a linear factor model, in which we assume that an observed
data vector x is obtained by a linear combination of latent factors h, plus some noise.
Different models, such as probabilistic PCA, factor analysis or ICA, make different
choices about the form of the noise and of the prior P (h).
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

h ∼ N(0, I)

while the observed variables x_i are assumed to be conditionally independent, given h, i.e., the noise is assumed to come from a diagonal covariance Gaussian distribution, with covariance matrix ψ = diag(σ²), where σ² = (σ_1², σ_2², ...) is a vector of per-variable variances.
The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a Gaussian-distributed (multivariate normal) random variable, with

x ∼ N(b, W W^T + ψ)

where we see that the weights W induce a dependency between two variables x_i and x_j through a kind of auto-encoder path, whereby x_i influences ĥ_k = W_k x via w_ki (for every k) and ĥ_k influences x_j via w_kj.
In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances σ_i² equal to each other. In that case the covariance of x is just W W^T + σ²I, where σ² is now a scalar, i.e.,

x ∼ N(b, W W^T + σ²I)

or equivalently

x = W h + b + σz

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².
What the probabilistic PCA model is basically saying is that the covariance is mostly captured by the latent variables h, up to some small residual reconstruction error σ².
As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In
that case, the conditional expected value of h given x becomes an orthogonal projection
onto the space spanned by the d columns of W , like in PCA. See Section 18.1 for a
discussion of the “inference” mechanism associated with PCA (probabilistic or not),
i.e., recovering the expected value of the latent factors h
i
given the observed input x.
That section also explains the very insightful geometric and manifold interpretation of
PCA.
However, as σ → 0, the density model becomes very sharp around these d dimensions spanned by the columns of W, as discussed in Section 18.1, which would not make it a
very faithful model of the data, in general (not just because the data may live on a
higher-dimensional manifold, but more importantly because the real data manifold may
not be a flat hyperplane - see Chapter 18 for more).
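For illustration, the maximum likelihood parameters of probabilistic PCA can also be obtained in closed form from the eigendecomposition of the sample covariance (Tipping and Bishop, 1999), rather than by EM. The sketch below follows that recipe; the stand-in data and the choice of d are illustrative assumptions:

```python
# A sketch of the closed-form maximum likelihood estimate for probabilistic PCA.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # stand-in data
d = 2                                                      # number of components

b = X.mean(axis=0)
S = np.cov(X - b, rowvar=False)                            # sample covariance
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]         # sort descending

# Residual variance: average of the discarded eigenvalues.
sigma2 = eigvals[d:].mean()
# Loadings: leading eigenvectors scaled by the variance they explain beyond sigma2
# (up to an arbitrary rotation, omitted here).
W = eigvecs[:, :d] * np.sqrt(eigvals[:d] - sigma2)

# The model covariance W W^T + sigma^2 I approximates S in the leading directions.
print(np.round(W @ W.T + sigma2 * np.eye(X.shape[1]) - S, 2))
```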
16.5.1 ICA
Independent Component Analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling linear factors that seeks non-Gaussian projections of the data. Like probabilistic PCA and factor analysis, it also fits the linear factor model of Eqs. 16.2 and 16.3. What is particular about ICA is that unlike PCA and factor analysis it does not assume that the latent variable prior is Gaussian. It only assumes that it is factorized, i.e.,

P(h) = Π_i P(h_i).    (16.4)
Since there is no parametric assumption behind the prior, we really have a so-called semi-parametric model, with parts of the model being parametric (P(x|h)) and parts being non-specified or non-parametric (P(h)). In practice, this typically calls for non-Gaussian priors: if the prior were Gaussian, then one could not distinguish between the factors h and a rotation of h. Indeed, note that if

h = U z

with U an orthonormal (rotation) square matrix, i.e.,

z = U^T h,

then, although h might have a N(0, I) distribution, the z also have unit covariance, i.e., they are uncorrelated:

Var[z] = E[z z^T] = E[U^T h h^T U] = U^T Var[h] U = U^T U = I.
In other words, imposing independence among Gaussian factors does not allow one to disentangle them: we could just as well recover any rotation of these factors. It means that, given the observed x, even though we might assume the right generative model, PCA cannot recover the original generative factors. However, if we assume that the latent variables are non-Gaussian, then we can recover them, and this is what ICA is trying to achieve. Indeed, under these generative model assumptions, the true underlying factors can be recovered (Comon, 1994). Many ICA algorithms therefore look for projections of the data s = V x that are maximally non-Gaussian. An intuitive explanation for these approaches is that although the true latent variables h may be non-Gaussian, almost any linear combination of them will look more Gaussian, because of the central limit theorem. Since linear combinations of the x_i's are also linear combinations of the h_j's, to recover the h_j's we just need to find the linear combinations that are maximally non-Gaussian (while keeping these different projections orthogonal to each other).
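The following small numerical illustration (not an ICA algorithm) shows the central-limit-theorem intuition above: linearly mixing independent non-Gaussian sources yields observations whose marginals look more Gaussian, as measured here by excess kurtosis. All quantities are illustrative assumptions:

```python
# Independent non-Gaussian (Laplace) sources become "more Gaussian" when mixed.
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(v):
    v = (v - v.mean()) / v.std()
    return (v**4).mean() - 3.0          # 0 for a Gaussian, about 3 for Laplace

n, d = 100000, 4
h = rng.laplace(size=(n, d))            # independent, sparse (non-Gaussian) sources
W = rng.normal(size=(d, d))             # arbitrary mixing matrix
x = h @ W.T                             # observed mixtures

print("sources :", np.round([excess_kurtosis(h[:, i]) for i in range(d)], 2))
print("mixtures:", np.round([excess_kurtosis(x[:, i]) for i in range(d)], 2))
# Many ICA algorithms search (after whitening) for projections that maximize
# such a non-Gaussianity measure, thereby undoing the mixing.
```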
There is an interesting connection between ICA and sparsity, since the dominant
form of non-Gaussianity in real data is due to sparsity, i.e., concentration of probability
at or near 0. Non-Gaussian distributions typically have more mass around zero, although one can also get non-Gaussianity by increasing skewness (asymmetry) or kurtosis.
Just as PCA can be generalized to the non-linear auto-encoders described later in this chapter, ICA can be generalized to a non-linear generative model, e.g., x = f(h) + noise. See Hyvärinen and Pajunen (1999) for the initial work on non-linear ICA and its successful use with ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000).
16.5.2 Sparse Coding as a Generative Model
One particularly interesting form of non-Gaussianity arises with distributions that are sparse. These typically have not just a peak at 0 but also a fat tail⁵. Like the other linear factor models (Eq. 16.3), sparse coding corresponds to a linear factor model, but one with a “sparse” latent variable h, i.e., P(h) puts high probability at or around 0. Unlike with ICA (previous section), the latent variable prior is parametric. For example, the factorized Laplace density prior is

P(h) = Π_i P(h_i) = Π_i (λ/2) e^{−λ|h_i|}    (16.5)
and the factorized Student-t prior is

P(h) = Π_i P(h_i) ∝ Π_i 1/(1 + h_i²/ν)^{(ν+1)/2}.    (16.6)
Both of these densities have a strong preference for near-zero values but, unlike the Gaussian, accommodate large values. In the standard sparse coding models, the reconstruction noise is assumed to be Gaussian, so that the corresponding reconstruction error is the squared error.
⁵ With probability going to 0, as the values increase in magnitude, at a rate that is slower than for the Gaussian, i.e., less than quadratic in the log-domain.
Regarding sparsity, note that the actual value h_i = 0 has zero measure under both densities, meaning that the posterior distribution P(h|x) will not generate values h = 0. However, sparse coding is normally considered under a maximum a posteriori (MAP) inference framework, in which the inferred values of h are those that maximize the posterior, and these are often exactly zero if the prior is sufficiently concentrated around 0. The inferred values are those defined in Eq. 16.1, reproduced here,

h = f(x) = arg min_h L(g(h), x) + λΩ(h)

where L(g(h), x) is interpreted as −log P(x|g(h)) and Ω(h) as −log P(h). This MAP inference view of sparse coding and an interesting probabilistic interpretation of sparse coding are further discussed in Section 20.3.
To relate the generative model of sparse coding to ICA, note how the prior imposes not just sparsity but also independence of the latent variables h_i under P(h), which may help to separate different explanatory factors, unlike PCA, factor analysis or probabilistic PCA, which rely on a Gaussian prior that yields a factorized prior under any rotation of the factors (multiplication by an orthonormal matrix), as demonstrated in Section 16.5.1.
See Section 18.2 about the manifold interpretation of sparse coding.
TODO: relate to and point to Spike-and-slab sparse coding (Goodfellow et al., 2012)
(section?)
16.6 Probabilistic Interpretation of Reconstruction Error
as Log-Likelihood
Although traditional auto-encoders (like traditional neural networks) were introduced with an associated training loss, that training loss can generally be given a probabilistic interpretation as a conditional log-likelihood of the original input x, given the representation h.
We have already covered negative log-likelihood as a loss function in general for feedforward neural networks in Section 6.2.2. Like prediction error for regular feedforward neural networks, reconstruction error for auto-encoders does not have to be squared error. When we view the loss as negative log-likelihood, we interpret the reconstruction error as

L = −log P(x|h)

where h is the representation, which may generally be obtained through an encoder taking x as input.
Figure 16.5: The computational graph of an auto-encoder, which is trained to maximize
the probability assigned by the decoder g to the data point x, given the output of the
encoder h = f(x). The training objective is thus L = −log P(x|g(f(x))), which ends up
being squared reconstruction error if we choose a Gaussian reconstruction distribution
with mean g(f (x)), and cross-entropy if we choose a factorized Bernoulli reconstruction
distribution with means g(f(x)).
An advantage of this view is that it immediately tells us what kind of loss function
one should use depending on the nature of the input. If the input is real-valued and
unbounded, then squared error is a reasonable choice of reconstruction error, and cor-
responds to P (x|h) being Normal. If the input is a vector of bits, then cross-entropy
is a more reasonable choice, and corresponds to P(x|h) = Π_i P(x_i|h) with x_i|h being Bernoulli-distributed. We then view the decoder g(h) as computing the parameters of the reconstruction distribution, i.e., P(x|h) = P(x|g(h)).
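The following sketch makes the two cases above explicit, viewing each reconstruction loss as a negative log-likelihood; the shapes and the stand-in decoder outputs are illustrative assumptions:

```python
# Squared error as Gaussian NLL, and cross-entropy as factorized Bernoulli NLL.
import torch
import torch.nn.functional as F

x_real = torch.randn(8, 20)           # real-valued inputs
g_real = torch.randn(8, 20)           # decoder output = mean of Gaussian P(x|h)
# Up to an additive constant, -log N(x; g, I) equals half the squared error:
nll_gaussian = 0.5 * ((x_real - g_real) ** 2).sum(dim=1)

x_bits = torch.randint(0, 2, (8, 20)).float()    # binary inputs
g_logits = torch.randn(8, 20)                    # decoder output = Bernoulli logits
# -log P(x|h) for a factorized Bernoulli is the cross-entropy:
nll_bernoulli = F.binary_cross_entropy_with_logits(
    g_logits, x_bits, reduction="none").sum(dim=1)

print(nll_gaussian.mean().item(), nll_bernoulli.mean().item())
```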
Another advantage of this view is that we can think about the training of the decoder as estimating the conditional distribution P(x|h), which comes in handy in the probabilistic interpretation of denoising auto-encoders, allowing us to talk about the distribution
P (x) explicitly or implicitly represented by the auto-encoder (see Sections 16.8, 21.8.2
and 21.9 for more details). In the same spirit, we can rethink the notion of encoder from
a simple function to a conditional distribution Q(h|x), with a special case being when
Q(h|x) is a Dirac at some particular value. Equivalently, thinking about the encoder
as a distribution corresponds to injecting noise inside the auto-encoder. This view is
developed further in Sections 21.8.2 and 21.10.
16.7 Sparse Representations
Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e., one
whose elements are often either zero or close to zero. Sparse coding was introduced in
Section 16.5.2 as a linear factor model in which the prior P (h) on the representation
h = f(x) encourages values at or near 0. In Section 16.7.1, we see how ordinary auto-
encoders can be prevented from learning a useless identity transformation by using a
sparsity penalty rather than a bottleneck. The main difference between a sparse auto-
encoder and sparse coding is that sparse coding has no explicit parametric encoder,
whereas sparse auto-encoders have one. The “encoder” of sparse coding is the algorithm
that performs the approximate inference, i.e., looks for
h∗(x) = arg max_h log P(h|x) = arg min_h ||x − (b + W h)||²/σ² − log P(h)    (16.7)

where σ² is a reconstruction variance parameter (which should equal the average squared reconstruction error⁶), and P(h) is a “sparse” prior that puts more probability mass around h = 0, such as the Laplacian prior, with factorized marginals

P(h_i) = (λ/2) e^{−λ|h_i|}    (16.8)

or the Student-t prior, with factorized marginals

P(h_i) ∝ 1/(1 + h_i²/ν)^{(ν+1)/2}.    (16.9)
The advantages of such a non-parametric encoder and the sparse coding approach over
sparse auto-encoders are that
1. it can in principle minimize the combination of reconstruction error and log-prior
better than any parametric encoder,
2. it performs what is called explaining away (see Figure 14.8), i.e., it allows the model to “choose” some “explanations” (hidden factors) and inhibit the others.
The disadvantages are that
1. computing time for encoding the given input x, i.e., performing inference (com-
puting the representation h that goes with the given x) can be substantially larger
than with a parametric encoder (because an optimization must be performed for
each example x), and
2. the resulting encoder function could be non-smooth and possibly too non-linear
(with two nearby x’s being associated with very different h’s), potentially making
it more difficult for the downstream layers to properly generalize.
6
but can be lumped into the regularizer λ which controls the strength of the sparsity prior, defined
in Eq. 16.8, for example.
In Section 16.7.2, we describe PSD (Predictive Sparse Decomposition), which com-
bines a non-parametric encoder (as in sparse coding, with the representation obtained
via an optimization) and a parametric encoder (like in the sparse auto-encoder). Sec-
tion 16.8 introduces the Denoising Auto-Encoder (DAE), which puts pressure on the
representation by requiring it to extract information about the underlying distribution
and where it concentrates, so as to be able to denoise a corrupted input. Section 16.9
describes the Contractive Auto-Encoder (CAE), which optimizes an explicit regulariza-
tion penalty that aims at making the representation as insensitive as possible to the
input, while keeping the information sufficient to reconstruct the training examples.
16.7.1 Sparse Auto-Encoders
A sparse auto-encoder is simply an auto-encoder whose training criterion involves a sparsity penalty Ω(h) in addition to the reconstruction error:

L = −log P(x|g(h)) + Ω(h)    (16.10)

where g(h) is the decoder output and typically we have h = f(x), the encoder output. We can think of the penalty Ω(h) simply as a regularizer or as a log-prior on the representations h. For example, the sparsity penalty corresponding to the Laplace prior ((λ/2) e^{−λ|h_i|}) is the absolute value sparsity penalty (see also Eq. 16.8 above):

Ω(h) = λ Σ_i |h_i|
−log P(h) = Σ_i ( −log(λ/2) + λ|h_i| ) = const + Ω(h)    (16.11)

where the constant term depends only on λ and not on h (we typically ignore it in the training criterion because we treat λ as a hyper-parameter rather than a parameter). Similarly (as per Eq. 16.9), the sparsity penalty corresponding to the Student-t prior (Olshausen and Field, 1997) is

Ω(h) = Σ_i ((ν + 1)/2) log(1 + h_i²/ν)    (16.12)
where ν is considered to be a hyper-parameter.
The early work on sparse auto-encoders (Ranzato et al., 2007, 2008) considered
various forms of sparsity and proposed a connection between sparsity regularization and
the partition function gradient in energy-based models (see Section TODO). The idea
is that a regularizer such as sparsity makes it difficult for an auto-encoder to achieve
zero reconstruction error everywhere. If we consider reconstruction error as a proxy
for energy (unnormalized log-probability of the data), then minimizing the training
set reconstruction error forces the energy to be low on training examples, while the
regularizer prevents it from being low everywhere. The same role is played by the
gradient of the partition function in energy-based models such as the RBM (Section
TODO).
However, the sparsity penalty of sparse auto-encoders does not need to have a probabilistic interpretation. For example, Goodfellow et al. (2009) successfully used the following sparsity penalty, which does not try to bring h_i all the way down to 0, but only towards some low target value such as ρ = 0.05:

Ω(h) = −Σ_i ( ρ log h_i + (1 − ρ) log(1 − h_i) )    (16.13)

where 0 < h_i < 1, usually with h_i = sigmoid(a_i). This is just the cross-entropy between the Bernoulli distributions with probability p = h_i and the target Bernoulli distribution with probability p = ρ.
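For concreteness, here is a sketch of how both penalties discussed above (the absolute value penalty of Eq. 16.11 and the cross-entropy penalty of Eq. 16.13) can be computed on the code of an encoder; the encoder, weights and hyper-parameter values are illustrative assumptions:

```python
# Two sparsity penalties applied to the code h of an auto-encoder.
import torch

torch.manual_seed(0)
x = torch.rand(8, 50)
W = 0.1 * torch.randn(50, 20, requires_grad=True)
h = torch.sigmoid(x @ W)                  # code h = f(x), with h_i in (0, 1)

lam, rho = 0.1, 0.05
l1_penalty = lam * h.abs().sum(dim=1).mean()                 # Eq. 16.11
kl_penalty = -(rho * torch.log(h) +                          # Eq. 16.13
               (1 - rho) * torch.log(1 - h)).sum(dim=1).mean()

# Either penalty would be added to the reconstruction error (Eq. 16.10), e.g.:
# loss = reconstruction_nll + l1_penalty    (or + kl_penalty)
print(l1_penalty.item(), kl_penalty.item())
```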
One way to achieve actual zeros in h for sparse (and denoising) auto-encoders was introduced in Glorot et al. (2011c). The idea is to use a half-rectifier (also simply called the “rectifier”) or ReLU (Rectified Linear Unit, introduced in Glorot et al. (2011b) for deep supervised networks and earlier in Nair and Hinton (2010) in the context of RBMs) as the output non-linearity of the encoder. With a prior that actually pushes the representations towards zero (like the absolute value penalty), one can thus indirectly control the
average number of zeros in the representation. ReLUs were first successfully used for
deep feedforward networks in Glorot et al. (2011a), achieving for the first time the ability
to train fairly deep supervised networks without the need for unsupervised pre-training,
and this turned out to be an important component in the 2012 object recognition break-
through with deep convolutional networks (Krizhevsky et al., 2012b).
Interestingly, the “regularizer” used in sparse auto-encoders does not conform to
the classical interpretation of regularizers as priors on the parameters. That classical
interpretation of the regularizer comes from the MAP (Maximum A Posteriori) point
estimation (see Section 5.5.1) of parameters associated with the Bayesian view of param-
eters as random variables and considering the joint distribution of data x and parameters
θ (see Section 5.7):
arg max_θ P(θ|x) = arg max_θ ( log P(x|θ) + log P(θ) )
where the first term on the right is the usual data log-likelihood term and the second
term, the log-prior over parameters, incorporates the preference over particular values
of θ.
With regularized auto-encoders such as sparse auto-encoders and contractive auto-
encoders, instead, the regularizer corresponds to a log-prior over the representation, or
over latent variables. In the case of sparse auto-encoders, predictive sparse decomposi-
tion and contractive auto-encoders, the regularizer specifies a preference over functions
of the data, rather than over parameters. This makes such a regularizer data-dependent,
unlike the classical parameter log-prior. Specifically, in the case of the sparse auto-
encoder, it says that we prefer an encoder whose output produces values closer to 0.
Indirectly (when we marginalize over the training distribution), this is also indicating a
preference over parameters, of course.
16.7.2 Predictive Sparse Decomposition
Predictive Sparse Decomposition (PSD) is a variant that combines sparse coding and a
parametric encoder (Kavukcuoglu et al., 2008b), i.e., it has both a parametric encoder
and iterative inference. It has been applied to unsupervised feature learning for object
recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009b;
Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The representation is
considered to be a free variable (possibly a latent variable if we choose a probabilistic
interpretation) and the training criterion combines a sparse coding criterion with a term
that encourages the optimized sparse representation h (after inference) to be close to
the output of the encoder f(x):
L = ||x − g(h)||² + λ|h|_1 + γ||h − f(x)||²    (16.14)
where f is the encoder and g is the decoder. Like in sparse coding, for each example x
an iterative optimization is performed in order to obtain a representation h. However,
because the iterations can be initialized from the output of the encoder, i.e., with h =
f(x), only a few steps (e.g. 10) are necessary to obtain good results. Simple gradient
descent on h has been used by the authors. After h is settled, both g and f are updated
towards minimizing the above criterion. The first two terms are the same as in L1 sparse
coding while the third one encourages f to predict the outcome of the sparse coding
optimization, making it a better choice for the initialization of the iterative optimization.
Hence f can be used as a parametric approximation to the non-parametric encoder
implicitly defined by sparse coding. It is one of the first instances of learned approximate
inference (see also Sec. 20.6). Note that this is different from separately doing sparse
coding (i.e., training g) and then training an approximate inference mechanism f, since
both the encoder and decoder are trained together to be “compatible” with each other.
Hence the decoder will be learned in such a way that inference will tend to find solutions
that can be well approximated by the approximate inference. A similar example is the
variational auto-encoder, in which the encoder acts as approximate inference for the
decoder, and both are trained jointly (Section 21.8.2). See also Section 21.8.3 for a
probabilistic interpretation of PSD in terms of a variational lower bound on the log-
likelihood.
In practical applications of PSD, the iterative optimization is only used during train-
ing, and f is used to compute the learned features. It makes computation fast at
recognition time and also makes it easy to use the trained features f as initialization
(unsupervised pre-training) for the lower layers of a deep net. Like other unsupervised
feature learning schemes, PSD can be stacked greedily, e.g., training a second PSD on
top of the features extracted by the first one, etc.
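The following is a sketch of a single PSD training step as described above: a few gradient steps on the code h, initialized at the encoder output f(x), followed by an update of the encoder and decoder parameters. The architecture, optimizers and hyper-parameters are illustrative assumptions, not the authors' implementation:

```python
# One PSD training step: iterative inference of h, then a parameter update.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_h = 50, 100
f = nn.Sequential(nn.Linear(d_x, d_h), nn.Tanh())   # parametric encoder
g = nn.Linear(d_h, d_x, bias=False)                 # linear decoder (dictionary)
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
lam, gamma = 0.1, 1.0

def psd_criterion(x, h):
    # Eq. 16.14, summed over the minibatch.
    return (((x - g(h)) ** 2).sum(dim=1)
            + lam * h.abs().sum(dim=1)
            + gamma * ((h - f(x)) ** 2).sum(dim=1)).sum()

x = torch.rand(16, d_x)

# Inference: optimize h for this minibatch, starting from the encoder output.
h = f(x).detach().clone().requires_grad_(True)
h_opt = torch.optim.SGD([h], lr=0.1)
for _ in range(10):                       # a few steps, as described in the text
    h_opt.zero_grad()
    psd_criterion(x, h).backward()
    h_opt.step()

# Learning: with h fixed, update the encoder and decoder parameters.
opt.zero_grad()
psd_criterion(x, h.detach()).backward()
opt.step()
```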
16.8 Denoising Auto-Encoders
The Denoising Auto-Encoder (DAE) was first proposed (Vincent et al., 2008, 2010) as
a means of forcing an auto-encoder to learn to capture the data distribution without an
explicit constraint on either the dimension or the sparsity of the learned representation.
It was motivated by the idea that in order to fully capture a complex distribution, an
auto-encoder needs to have at least as many hidden units as needed by the complexity
of that distribution. Hence its dimensionality should not be restricted to the input
dimension.
The principle of the denoising auto-encoder is deceptively simple and illustrated in
Figure 16.6: the encoder sees as input a corrupted version of the input, but the decoder
tries to reconstruct the clean uncorrupted input.
Figure 16.6: The computational graph of a denoising auto-encoder, which is trained to reconstruct the clean data point x from its corrupted version x̃, i.e., to minimize the loss L = −log P(x|g(f(x̃))), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃|x).
Mathematically, and following the notation used in this chapter, this can be formalized as follows. We introduce a corruption process C(x̃|x) which represents a conditional distribution over corrupted samples x̃, given a data sample x. The auto-encoder then learns a reconstruction distribution P(x|x̃) estimated from training pairs (x, x̃), as follows:
1. Sample a training example x from the data generating distribution (the training set).
2. Sample a corrupted version x̃ from the conditional distribution C(x̃|x).
3. Use (x, x̃) as a training example for estimating the auto-encoder reconstruction distribution P(x|x̃) = P(x|g(h)), with h the output of encoder f(x̃) and g(h) the output of the decoder.
Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log P(x|h), i.e., the denoising reconstruction error, using back-propagation to compute gradients, just like for regular feedforward neural networks (the only difference being the corruption of the input and the choice of target output).
We can view this training objective as performing stochastic gradient descent on the denoising reconstruction error, but where the “noise” now has two sources:
1. the choice of training sample x from the data set, and
2. the random corruption applied to x to obtain x̃.
We can therefore consider that the DAE is performing stochastic gradient descent on the following expectation:

−E_{x∼Q(x)} E_{x̃∼C(x̃|x)} log P(x|g(f(x̃)))

where Q(x) is the training distribution.
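A minimal sketch of this training procedure, assuming Gaussian corruption and squared (Gaussian negative log-likelihood, up to constants) reconstruction error, is given below; the data, architecture and noise level are illustrative:

```python
# Denoising auto-encoder training with Gaussian corruption C(x_tilde | x).
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_h, sigma = 50, 100, 0.3           # an overcomplete code is allowed here
f = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())   # encoder
g = nn.Linear(d_h, d_x)                              # decoder
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

data = torch.rand(512, d_x)              # stand-in training set, Q(x)
for step in range(200):
    x = data[torch.randint(0, len(data), (64,))]     # sample clean examples x
    x_tilde = x + sigma * torch.randn_like(x)        # sample x_tilde ~ C(.|x)
    r = g(f(x_tilde))                                # reconstruct from x_tilde
    loss = ((r - x) ** 2).sum(dim=1).mean()          # denoising error vs clean x
    opt.zero_grad()
    loss.backward()
    opt.step()
```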
Figure 16.7: A denoising auto-encoder is trained to reconstruct the clean data point x from its corrupted version x̃. In the figure, the corruption process C(x̃|x) is illustrated by a grey circle of equiprobable corruptions (and a grey arrow for the corruption process) acting on examples x (red crosses) lying near a low-dimensional manifold near which probability concentrates. When the denoising auto-encoder is trained to minimize the average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E[x|x̃], which approximately points orthogonally towards the manifold, since it estimates the center of mass of the clean points x which could have given rise to x̃. The auto-encoder thus learns a vector field g(f(x)) − x (the green arrows) and it turns out that this vector field estimates the gradient field ∂log Q(x)/∂x (up to a multiplicative factor that is the average root mean square reconstruction error), where Q is the unknown data generating distribution.
16.8.1 Learning a Vector Field that Estimates a Gradient Field
As illustrated in Figure 16.7, a very important property of DAEs is that their training criterion makes the auto-encoder learn a vector field (g(f(x)) − x) that estimates the gradient field (or score) ∂log Q(x)/∂x, as per Eq. 16.15. A first result in this direction was proven by Vincent (2011a), showing that minimizing squared reconstruction error in a denoising auto-encoder with Gaussian noise was related to score matching (Hyvärinen, 2005a), making the denoising criterion a regularized form of score matching called denoising score matching (Kingma and LeCun, 2010a). Score matching is an alternative to maximum likelihood and provides a consistent estimator. It is discussed further in Section 19.4. The denoising version is discussed in Section 19.5.
The connection between denoising auto-encoders and score matching was first made (Vin-
cent, 2011a) in the case where the denoising auto-encoder has a particular parametriza-
tion (one hidden layer, sigmoid activation functions on hidden units, linear reconstruc-
tion), in which case the denoising criterion actually corresponds to a regularized form
of score matching on a Gaussian RBM (with binomial hidden units and Gaussian vis-
ible units). The connection between ordinary auto-encoders and Gaussian RBMs had
previously been made by Bengio and Delalleau (2009), which showed that contrastive di-
vergence training of RBMs was related to an associated auto-encoder gradient, and later
by Swersky (2010), which showed that non-denoising reconstruction error corresponded
to score matching plus a regularizer.
The fact that the denoising criterion yields an estimator of the score for general encoder/decoder parametrizations has been proven (Alain and Bengio, 2012, 2013) in the case where the corruption and the reconstruction distributions are Gaussian (and of course x is continuous-valued), i.e., with the squared error denoising error

||g(f(x̃)) − x||²

and corruption

C(x̃|x) = N(x̃; µ = x, Σ = σ²I)

with noise variance σ².
Figure 16.8: Vector field learned by a denoising auto-encoder around a 1-D curved
manifold near which the data (orange circles) concentrates in a 2-D space. Each arrow
is proportional to the reconstruction minus input vector of the auto-encoder and points
towards higher probability according to the implicitly estimated probability distribution.
Note that the vector field has zeros at both peaks of the estimated density function (on
the data manifolds) and at troughs (local minima) of that density function, e.g., on the
curve that separates different arms of the spiral or in the middle of it.
More precisely, the main theorem states that (g(f(x)) − x)/σ² is a consistent estimator of ∂log Q(x)/∂x, where Q(x) is the data generating distribution:

(g(f(x)) − x)/σ²  ≈  ∂log Q(x)/∂x,    (16.15)
so long as f and g have sufficient capacity to represent the true score (and assuming
that the expected training criterion can be minimized, as usual when proving consistency
associated with a training objective).
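A small closed-form sanity check of this relationship (requiring no training) can be done with 1-D Gaussian data, for which the optimal denoising reconstruction E[x|x̃] is known analytically; all numerical values below are illustrative assumptions:

```python
# Checking Eq. 16.15 for 1-D Gaussian data Q(x) = N(mu, s2) and Gaussian
# corruption of variance sigma2, where E[x | x_tilde] has a closed form.
import numpy as np

mu, s2 = 1.0, 4.0            # data mean and variance (illustrative)
sigma2 = 0.01                # small corruption variance
x = np.linspace(-3.0, 5.0, 9)

# Optimal denoising reconstruction for this corruption level:
r = (s2 * x + sigma2 * mu) / (s2 + sigma2)     # E[x_clean | x_tilde = x]

score_estimate = (r - x) / sigma2              # left-hand side of Eq. 16.15
true_score = (mu - x) / s2                     # d/dx log N(x; mu, s2)

print(np.round(score_estimate, 3))
print(np.round(true_score, 3))                 # nearly identical for small sigma2
```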
Note that in general, there is no guarantee that the reconstruction g(f(x)) minus
the input x corresponds to the gradient of something (the estimated score should be
the gradient of the estimated log-density with respect to the input x). That is why
the early results (Vincent, 2011a) are specialized to particular parametrizations where
g(f(x)) − x is the derivative of something. See a more general treatment by Kamyshanska
and Memisevic (2015).
Although it was intuitively appealing that in order to denoise correctly one must
capture the training distribution, the above consistency result makes it mathematically
very clear in what sense the DAE is capturing the input distribution: it is estimating the gradient of its log-density (i.e., the negative gradient of its energy function), i.e., learning to point towards more probable (lower energy) configurations. Figure 16.8 (see details of the experiment
in Alain and Bengio (2013)) illustrates this. Note how the norm of reconstruction error
(i.e. the norm of the vectors shown in the figure) is related to but different from the
energy (unnormalized log-density) associated with the estimated model. The energy
should be low only where the probability is high. The reconstruction error (norm of
the estimated score vector) is low where probability is near a peak of probability (or a
trough of energy), but it can also be low at maxima of energy (minima of probability).
Section 21.9 continues the discussion of the relationship between denoising auto-
encoders and probabilistic modeling by showing how one can generate from the distribu-
tion implicitly estimated by a denoising auto-encoder. Whereas Alain and Bengio (2013)
generalized the score estimation result of Vincent (2011a) to arbitrary parametrizations,
the result from Bengio et al. (2013b), discussed in Section 21.9, provides a probabilistic
and in fact generative interpretation to every denoising auto-encoder.
16.9 Contractive Auto-Encoders
The Contractive Auto-Encoder or CAE (Rifai et al., 2011a,c) introduces an explicit
regularizer on the code h = f(x), encouraging the derivatives of f to be as small as
possible:
Ω(h) = ||∂f(x)/∂x||²_F    (16.16)
which is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix
of partial derivatives associated with the encoder function. Whereas the denoising auto-
encoder learns to contract the reconstruction function (the composition of the encoder
and decoder), the CAE learns to specifically contract the encoder. See Figure 18.13
for a view of how contraction near the data points makes the auto-encoder capture the
manifold structure.
If it weren’t for the opposing force of reconstruction error, which attempts to make
the code h keep all the information necessary to reconstruct training examples, the CAE
penalty would yield a code h that is constant and does not depend on the input x. The
compromise between these two forces yields an auto-encoder whose derivatives ∂f(x)/∂x are tiny in most directions, except those that are needed to reconstruct training examples,
i.e., the directions that are tangent to the manifold near which data concentrate. Indeed,
in order to distinguish (and thus, reconstruct correctly) two nearby examples on the
manifold, one must assign them a different code, i.e., f(x) must vary as x moves from
one to the other, i.e., in the direction of a tangent to the manifold.
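For a one-hidden-layer sigmoid encoder, the penalty of Eq. 16.16 can be computed in closed form, since ∂h_i/∂x = h_i(1 − h_i)W_i, where W_i is the i-th row of the encoder weights. The sketch below follows this; shapes and weight values are illustrative assumptions:

```python
# Closed-form contractive penalty (Eq. 16.16) for a sigmoid encoder h = sigmoid(W x + b).
import torch

torch.manual_seed(0)
d_x, d_h = 50, 100
W = torch.randn(d_h, d_x) * 0.1
b = torch.zeros(d_h)
x = torch.rand(8, d_x)

h = torch.sigmoid(x @ W.T + b)                       # code h = f(x)
jac_factor = (h * (1 - h)) ** 2                      # per-unit (h_i (1 - h_i))^2
contractive_penalty = (jac_factor * (W ** 2).sum(dim=1)).sum(dim=1).mean()

# This term would be added (with a coefficient) to the reconstruction error,
# encouraging the encoder to be insensitive to most input directions.
print(contractive_penalty.item())
```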
Figure 16.9: Average (over test examples) of the singular value spectrum of the Jacobian matrix ∂f(x)/∂x for the encoder f learned by a regular auto-encoder (AE) versus a contractive auto-encoder (CAE). This illustrates how the contractive regularizer yields a smaller set of directions in input space (those corresponding to large singular values of the Jacobian) which provoke a response in the representation h, while the representation remains almost insensitive to most directions of change in the input.
Interestingly, this penalty forces the representation most strongly to be invariant in directions orthogonal to the manifold. This can be seen clearly by comparing the singular value spectrum of the Jacobian ∂f(x)/∂x for different auto-encoders, as shown
in Figure 16.9. We see that the CAE manages to concentrate the sensitivity of the
representation in fewer dimensions than a regular (or sparse) auto-encoder. Figure 18.3
illustrates tangent vectors obtained by a CAE on the MNIST digits dataset, showing
that the leading tangent vectors correspond to small deformations such as translation.
More impressively, Figure 16.10 shows tangent vectors learned on 32×32 color (RGB)
CIFAR-10 images by a CAE, compared to the tangent vectors obtained by a non-distributed representation learner (a mixture of local PCAs).
One practical issue with the CAE regularization criterion is that although it is cheap
to compute in the case of a single hidden layer auto-encoder, it becomes much more ex-
pensive in the case of deeper auto-encoders. The strategy followed by Rifai et al. (2011a)
is to separately pre-train each single-layer auto-encoder stacked to form a deeper auto-
encoder. However, a deeper encoder could be advantageous in spite of the computational
overhead, as argued by Schulz and Behnke (2012).
Another practical issue is that the contraction penalty on the encoder f could yield
Figure 16.10: Illustration of tangent vectors (bottom) of the manifold estimated by a contractive auto-encoder (CAE), at some input point (left, CIFAR-10 image of a dog). See also Fig. 18.3. Each image on the right corresponds to a tangent vector, either estimated by a local PCA (equivalent to a Gaussian mixture), top, or by a CAE (bottom). The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix ∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture local tangents that differ at different points, the local PCA does not have enough training data to meaningfully capture good tangent directions, whereas the CAE does (because it exploits parameter sharing across different locations that share a subset of active hidden units). The CAE tangent directions typically correspond to moving or changing parts of the object (such as the head or legs), which corresponds to plausible changes in the input image.
useless results if the decoder g were to exactly compensate for it (e.g. by being scaled up by exactly the same amount as f is scaled down). In Rifai et al. (2011a), this is addressed by tying the weights of f and g, both being of the form of an affine transformation followed by a non-linearity (e.g. sigmoid), with the weights of g being the transpose of the weights of f.