Chapter 10
Unsupervised and Transfer Learning
Whereas supervised learning has been the workhorse of recent industrial successes of
deep learning, the authors of this book believe that it is likely that a key element of
future advances will be unsupervised learning of representations.
In Chapter 1, we introduced the notion of representation: the idea that some representations are more helpful (e.g., to classify objects from images or phonemes from speech) than others. As argued there, this suggests learning representations in order to “select” the best ones in a systematic way, i.e., by optimizing a function that maps raw data to its representation, instead of (or in addition to) handcrafting them. This
motivation for learning input features is discussed in Section 6.5, and is one of the major
side-effects of training a feedforward deep network (treated in Chapter 6), typically via
supervised learning, i.e., when one has access to (input, target) pairs (typically obtained by labeling inputs with some target answer that we wish the computer would produce), available for some
task of interest. In the case of supervised learning of deep nets, we learn a representation
with the objective of selecting one that is best suited to the task of predicting targets
given inputs.
But what if we don’t have any labeled examples? Or too few? Pure supervised
learning with few labeled examples can easily overfit. On the other hand, humans (and
other animals) can sometimes learn a task from just one or very few examples. How is
that possible? Clearly they must rely on previously acquired knowledge, either innate
or (more likely the case for humans) via previous learning experience. Can we discover
good representations purely out of unlabeled examples? (this is treated in the first four
sections of this chapter). Can we combine unlabeled examples (which are often easy
to obtain) with labeled examples? (this is semi-supervised learning, Section 7.6). And
what if instead of one task we have many tasks that could share the same representation
or parts of it? (this is multi-task learning, discussed in Section 7.11). What if we
have “training tasks” (on which enough labeled examples are available) as well as “test
tasks” (not known at the time of learning the representation, and for which only very
few labeled examples will be provided)? What if the test task is similar to but different
from the training task? (this is transfer learning and domain adaptation, discussed in
Section 10.5).
These are the questions discussed in this chapter. Along the way we introduce some
of the major building blocks for unsupervised learning of representations (auto-encoders,
Section 10.1 and RBMs, Section 10.3) and a very successful recipe for combining them
to form deep representations (greedy layerwise unsupervised pre-training, Section 10.4).
10.1 Auto-Encoders
The idea of auto-encoders has been part of the historical landscape of neural networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994) but has really picked up speed in recent years. Auto-encoders remained somewhat marginal for many years, in part due to an incomplete understanding of their mathematical interpretation and geometrical underpinnings, which are developed further in Chapter 13 and Section 17.9.
[Figure 10.1: the input x is mapped by the encoder f to the code h, which the decoder g maps to the reconstruction r.]
Figure 10.1: General schema of an auto-encoder, mapping an input x to an output (called
reconstruction) r through an internal representation or code h. The auto-encoder has
two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
An auto-encoder is simply a neural network that tries to copy its input to its output. The architecture of an auto-encoder is typically decomposed into the following parts, illustrated in Figure 10.1:

- an input, x
- an encoder function f
- a code or internal representation h = f(x)
- a decoder function g
- an output, also called the reconstruction, r = g(h) = g(f(x))
- a loss function L computing a scalar L(r, x) measuring how good a reconstruction r is of the given input x. The objective is to minimize the expected value of L over the training set of examples {x}.
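As a concrete illustration of these parts, here is a minimal sketch of a one-hidden-layer auto-encoder in NumPy: a sigmoid encoder f, an affine decoder g, squared reconstruction error as L, and plain gradient descent. The data, layer sizes and learning rate are hypothetical choices for illustration only.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)                  # hypothetical training set: 100 examples in R^20

    n_in, n_code = X.shape[1], 5            # undercomplete code: 5 < 20
    W_enc = 0.1 * rng.randn(n_in, n_code)   # encoder weights
    b_enc = np.zeros(n_code)
    W_dec = 0.1 * rng.randn(n_code, n_in)   # decoder weights
    b_dec = np.zeros(n_in)
    lr = 0.01                               # assumed learning rate

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(200):
        H = sigmoid(X.dot(W_enc) + b_enc)   # encoder: h = f(x)
        R = H.dot(W_dec) + b_dec            # decoder: r = g(h), linear output for real-valued x
        loss = np.mean(np.sum((R - X) ** 2, axis=1))   # L(r, x): squared reconstruction error
        # Backpropagate the loss through decoder and encoder, then take a gradient step.
        dR = 2.0 * (R - X) / X.shape[0]
        dH = dR.dot(W_dec.T) * H * (1 - H)
        W_dec -= lr * H.T.dot(dR); b_dec -= lr * dR.sum(axis=0)
        W_enc -= lr * X.T.dot(dH); b_enc -= lr * dH.sum(axis=0)

    print("final average reconstruction error:", loss)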
10.1.1 Regularized Auto-Encoders
Predicting the input may sound useless: what could prevent the auto-encoder from simply copying its input into its output? In the 20th century, this was avoided by constraining the architecture of the auto-encoder: the dimension of the code h was forced to be smaller than the dimension of the input x.
[Figure 10.2: left, an encoder-decoder pair with a code bottleneck h (undercomplete representation); right, an encoder-decoder pair with a larger code h (overcomplete representation).]
Figure 10.2: Left: undercomplete representation (dimension of code h is less than di-
mension of input x). Right: overcomplete representation. Overcomplete auto-encoders
require some other form of regularization (instead of the constraint on the dimension of
h) to avoid the trivial solution where r = x for all x.
Figure 10.2 illustrates the two typical cases of auto-encoders: undercomplete vs. overcomplete, i.e., with the dimension of the representation h respectively smaller or larger than the dimension of the input x. Whereas early work with auto-encoders, just like PCA, used undercompleteness (a bottleneck in the sequence of layers) to avoid learning the identity function, more recent work allows overcomplete representations. What we have learned in recent years is that it is possible to make the auto-encoder meaningfully capture the structure of the input distribution even if the representation is overcomplete, by imposing other forms of constraint or regularization. In fact, once one realizes that auto-encoders can capture the input distribution (indirectly, not as an explicit probability function), it becomes clear that their capacity should grow with the complexity of the distribution to be captured (and with the amount of data available): it should not be limited by the input dimension. This is a problem in particular for shallow auto-encoders, which have a single hidden layer (for the code).
Besides the bottleneck constraint, alternative constraints or regularization methods have been explored that can guarantee that the auto-encoder does something useful and does not just learn some trivial identity-like function:
- Sparsity of the representation or of its derivative: even if the intermediate representation has a very high dimensionality, the effective local dimensionality (number of degrees of freedom that capture a coordinate system among the probable x’s) could be much smaller if most of the elements of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to zero). When ||∂h_i/∂x|| is close to zero, h_i does not participate in encoding local changes in x. There is a geometrical interpretation of this situation in terms of manifold learning that is discussed in more depth in Chapter 13. That discussion also explains how an auto-encoder naturally tends towards learning a coordinate system for the actual factors of variation in the data. At least four types of “auto-encoders” clearly fall in this category of sparse representation:
  - Sparse coding (Olshausen and Field, 1996) has been heavily studied as an unsupervised feature learning and feature inference mechanism. It looks for representations that are both sparse and explain the input through the decoder. In general there is no parametric encoder and the representation is instead considered a free variable that is obtained through an optimization, i.e., a particular form of inference:

        h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (10.1)

    where L is the reconstruction loss, f the (non-parametric) encoder, g the (parametric) decoder, Ω(h) is a sparsity regularizer, and in practice the minimization can be approximate (a small numerical sketch of this inference is given just after this list). Sparse coding has a manifold or geometric interpretation that is discussed in Section 13.3. It also has an interpretation as a directed graphical model, described in more detail in Section 16.3. To achieve sparsity, the optimized objective function includes a term that is minimized when the representation has many zero or near-zero values, such as the L1 penalty |h|_1 = Σ_i |h_i|.
  - An interesting variation of sparse coding combines the freedom to choose the representation through optimization and a parametric encoder. It is called predictive sparse decomposition (PSD) (Kavukcuoglu et al., 2008) and is briefly described in Section 13.3.2.
  - At the other end of the spectrum are simply sparse auto-encoders, which combine with the standard auto-encoder schema a sparsity penalty that encourages the output of the encoder to be sparse. These are also described in Section 13.3.1. Besides the L1 penalty, other sparsity penalties that have been explored include the Student-t penalty (Olshausen and Field, 1996; Bergstra, 2011),

        Σ_i log(1 + α²h_i²)

    (i.e., where αh_i has a Student-t prior density) and the KL-divergence penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008),

        −Σ_i (t log h_i + (1 − t) log(1 − h_i)),

    with a target sparsity level t, for h_i ∈ (0, 1), e.g. obtained through a sigmoid non-linearity.
  - Contractive auto-encoders (Rifai et al., 2011), covered in Section 13.5, explicitly penalize ||∂h/∂x||²_F, i.e., the sum of the squared norms of the vectors ∂h_i(x)/∂x (each indicating how much hidden unit h_i responds to changes in x and what direction of change in x that unit is most sensitive to, around a particular x). With such a regularization penalty, the auto-encoder is called contractive (a function f is contractive if ||f(x) − f(y)|| < ||x − y|| for nearby x and y, or equivalently if its derivative satisfies ||f′(x)|| < 1) because the mapping from input to representation, h(x), is encouraged to be contractive, i.e., to have small derivatives in all directions. Note that a sparsity regularization indirectly leads to a contractive mapping as well, when the non-linearity used happens to have a zero derivative at h_i = 0 (which is the case for the sigmoid non-linearity).
- Robustness to injected noise: if noise is injected in the inputs or hidden units, or if some inputs are missing, while the neural network is asked to reconstruct the clean and complete input, then it cannot simply learn the identity function. It has to capture the structure of the data distribution in order to optimally perform this reconstruction. Such auto-encoders are called denoising auto-encoders and are discussed in more detail in Section 13.4. There is a tight connection between denoising auto-encoders and contractive auto-encoders: it can be shown that in the limit of small injected Gaussian noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function r(x) = g(f(x)). In other words, since both x and x + ε (where ε is some small noise vector) must yield the same target output x, the reconstruction function is encouraged to be
insensitive to changes in all directions. The only thing that prevents the reconstruction r(x) from simply being a constant (completely insensitive to the input x) is
that one also has to reconstruct correctly for different training examples x. How-
ever, the auto-encoder can learn to be approximately constant around training
examples x while producing a different answer for different training examples. As
discussed in Section 13.1, if the examples are near a low-dimensional manifold,
this encourages the representation to vary only on the manifold and be locally
constant in directions orthogonal to the manifold, i.e., the representation locally
captures a (not necessarily Euclidean, not necessarily orthogonal) coordinate sys-
tem for the manifold. In addition to the denoising auto-encoder, the variational
auto-encoder (Section 17.7.1) and the generative stochastic networks (Section 17.9)
also involve the injection of noise, but typically in the representation-space itself,
thus introducing the notion of h as a latent variable.
- Pressure of a Prior on the Representation: an interesting way to generalize the notion of regularization applied to the representation is to introduce in the cost function of the auto-encoder a log-prior term

      − log P(h)

  which captures the assumption that we would like to find a representation that has a simple distribution, or at least one that is simpler than the original data distribution. Among all the encoding functions f, we would like to pick one that

  1. can be inverted (easily), which is achieved by minimizing some reconstruction loss, and
  2. yields representations h whose distribution is “simpler”, i.e., can be captured with less capacity than the original training distribution itself.

  The sparse variants described above clearly fall in that framework. The variational auto-encoder (Section 17.7.1) provides a clean mathematical framework for justifying such a pressure from a top-level prior when the objective is to model the data-generating distribution.
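To make the sparse coding inference of Eq. 10.1 (first item above) concrete, here is a small sketch of an ISTA-style proximal gradient loop in NumPy, for a linear decoder g(h) = Wh, squared reconstruction loss and an L1 sparsity penalty. The dictionary, λ and iteration count are hypothetical, and this is only one of several ways to carry out the (possibly approximate) minimization.

    import numpy as np

    rng = np.random.RandomState(0)
    d_in, d_code = 20, 50                    # overcomplete code (hypothetical sizes)
    W = rng.randn(d_in, d_code) / np.sqrt(d_in)   # decoder (dictionary); columns are basis vectors
    x = rng.randn(d_in)                      # one input example
    lam = 0.5                                # sparsity penalty weight (assumed)
    step = 1.0 / np.linalg.norm(W, 2) ** 2   # step size <= 1 / Lipschitz constant of the gradient

    def soft_threshold(v, t):
        # Proximal operator of t * |v|_1: shrinks values towards 0, producing exact zeros.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    h = np.zeros(d_code)
    for it in range(200):
        # Gradient of the reconstruction loss 0.5 * ||x - W h||^2 with respect to h
        grad = W.T.dot(W.dot(h) - x)
        # Gradient step on the smooth part, then proximal step on lam * |h|_1
        h = soft_threshold(h - step * grad, step * lam)

    print("non-zero code elements:", np.count_nonzero(h), "out of", d_code)
    print("reconstruction error:", np.sum((x - W.dot(h)) ** 2))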
10.1.2 Representational Power, Layer Size and Depth
Nothing in the above description of auto-encoders restricts the encoder or decoder to
be shallow, but in the literature on the subject, most trained auto-encoders have had a
single hidden layer which is also the representation layer or code.
For one, we know from the usual universal approximation abilities of single hidden-layer neural networks that a sufficiently large hidden layer can approximate any function to a given accuracy. This observation justifies overcomplete auto-encoders: in order to represent a rich enough distribution, one probably needs many hidden units in the intermediate representation layer. We also know that Principal Components Analysis (PCA) corresponds to an undercomplete auto-encoder with no intermediate non-linearity, and that PCA can only capture a set of directions of variation that are the same everywhere in space. This notion is discussed in more detail in Chapter 13 in the context of manifold learning.
For two, it has also been reported many times that training a deep neural network,
and in particular a deep auto-encoder (i.e. with a deep encoder and a deep decoder) is
more difficult than training a shallow one. This was actually a motivation for the initial
work on the greedy layerwise unsupervised pre-training procedure, described below in
Section 10.4, by which we only need to train a series of shallow auto-encoders in order to
initialize a deep auto-encoder. It was shown early on (Hinton and Salakhutdinov, 2006)
that, if trained properly, such deep auto-encoders could yield much better compression
than corresponding shallow or linear auto-encoders (which are basically doing the same
as PCA, see Section 10.2.1 below). As discussed in Section 14.3, deeper architectures
can be in some cases exponentially more efficient (both in terms of computation and
statistically) than shallow ones. However, the fact that we can usefully pre-train a deep net by training and stacking shallow ones makes it interesting to consider single-layer auto-encoders, as has been done in most of the literature discussed in Chapter 13.
10.1.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss L) make sense when the loss L is simply the squared reconstruction error, but there are many cases where this is not appropriate, e.g., when x is a vector of discrete variables or when P(x|h) is not well approximated by a Gaussian distribution (see the link between squared error and the normal density in Sections 5.6 and 6.2.2). Just like in the case of other types of neural networks (starting with the feedforward neural networks, Section 6.2.2), it is convenient to define the loss L as a negative log-likelihood over some target random variables. This probabilistic interpretation is particularly important for the discussion in Section 17.9, where the output of the auto-encoder is interpreted as a probability distribution P(x|h) for reconstructing x, given hidden units h. This distribution captures the uncertainty about the original x which gave rise to h (either deterministically or stochastically). In the simplest and most ordinary cases, this distribution factorizes, i.e., P(x|h) = ∏_i P(x_i|h). This covers the usual cases of x_i|h being Gaussian (for unbounded real values) and x_i|h having a Bernoulli distribution (for binary values x_i), but one can readily generalize this to other distributions, such as mixtures (see Sections 3.10.5 and 6.2.2).
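As a small illustration of this probabilistic view of the loss (a sketch with made-up decoder outputs), the negative log-likelihood reduces, up to constants, to squared reconstruction error for a factorized Gaussian P(x|h) with fixed variance, and to cross-entropy for factorized Bernoulli outputs:

    import numpy as np

    def gaussian_nll(x, mean, sigma2=1.0):
        # -log P(x|h) for factorized Gaussian x_i|h ~ N(mean_i, sigma2):
        # up to a constant, the squared reconstruction error divided by 2*sigma2.
        return np.sum(0.5 * np.log(2 * np.pi * sigma2) + (x - mean) ** 2 / (2 * sigma2))

    def bernoulli_nll(x, p):
        # -log P(x|h) for factorized Bernoulli x_i|h with mean p_i in (0, 1):
        # the usual cross-entropy reconstruction loss for binary x.
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

    x_real = np.array([0.5, -1.2, 2.0])      # real-valued input (hypothetical)
    r_mean = np.array([0.4, -1.0, 1.8])      # decoder output: predicted means
    print(gaussian_nll(x_real, r_mean))

    x_bin = np.array([1.0, 0.0, 1.0])        # binary input (hypothetical)
    r_prob = np.array([0.9, 0.2, 0.7])       # decoder output after a sigmoid
    print(bernoulli_nll(x_bin, r_prob))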
Thus we can generalize the notion of decoding function g(h) to decoding distribution
P (x|h). Similarly, we can generalize the notion of encoding function f(x) to encoding
distribution Q(h|x), as illustrated in Figure 10.3. We use this to capture the fact that
noise is injected at the level of the representation h, now considered like a latent vari-
able. This generalization is crucial in the development of the variational auto-encoder (Section 17.7.1) and the generative stochastic networks (Section 17.9).
We also find a stochastic encoder and a stochastic decoder in the RBM, described be-
low (Section 10.3). In that case, the encoding distribution Q(h|x) and P (x|h) “match”,
in the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distribution which has
[Figure 10.3: x and h linked by the encoder distribution Q(h|x) and the decoder distribution P(x|h).]
Figure 10.3: Basic scheme of a stochastic auto-encoder, in which both the encoder and
the decoder are not simple functions but instead involve some noise injection, meaning
that their output can be seen as sampled from a distribution, Q(h|x) for the encoder
and P (x|h) for the decoder. RBMs are a special case where P = Q but in general these
two distributions are not necessarily conditional distributions compatible with a unique
joint distribution P (x, h).
both Q(h|x) and P (x|h) as conditionals. This is not true in general for two conditionals
like Q(h|x) and P(x|h).
10.2 Linear Factor Models
The idea of discovering explanatory factors that have a simple joint distribution among themselves is old, e.g., see Factor Analysis below, and it was first explored in the context where the relationship between factors and data is linear, i.e., we assume that the data was generated as follows. First, sample the real-valued factors,

    h ∼ P(h),    (10.2)

and then sample the real-valued observable variables given the factors:

    x = Wh + b + noise    (10.3)

where the noise is typically Gaussian and diagonal (independent across dimensions).
This is illustrated in Figure 10.4.
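A generative sketch of Eqs. 10.2 and 10.3 (hypothetical dimensions, a unit Gaussian prior and diagonal Gaussian noise), which also checks empirically that the marginal covariance of x is WW⊤ plus the diagonal noise covariance, as derived for factor analysis in the next subsection:

    import numpy as np

    rng = np.random.RandomState(0)
    d_h, d_x, n = 3, 5, 200000            # hypothetical latent and observed dimensions
    W = rng.randn(d_x, d_h)
    b = rng.randn(d_x)
    sigma2 = 0.1 * np.ones(d_x)           # per-variable noise variances (diagonal)

    # Eq. 10.2: sample the factors h ~ P(h), here a unit Gaussian prior
    H = rng.randn(n, d_h)
    # Eq. 10.3: x = W h + b + noise, with diagonal Gaussian noise
    noise = rng.randn(n, d_x) * np.sqrt(sigma2)
    X = H.dot(W.T) + b + noise

    # The marginal covariance of x should be close to W W^T + diag(sigma2)
    emp_cov = np.cov(X, rowvar=False)
    model_cov = W.dot(W.T) + np.diag(sigma2)
    print(np.max(np.abs(emp_cov - model_cov)))   # small for large n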
10.2.1 Probabilistic PCA and Factor Analysis
Probabilistic PCA (Principal Components Analysis) and factor analysis are both special
cases of the above equations (10.2 and 10.3) and only differ in the choices made for the
prior and noise distributions.
[Figure 10.4: latent factors h ∼ P(h) generate the observed x through P(x|h).]
Figure 10.4: Basic scheme of a linear factors model, in which we assume that an observed
data vector x is obtained by a linear combination of latent factors h, plus some noise.
Different models, such as probabilistic PCA, factor analysis or ICA, make different
choices about the form of the noise and of the prior P (h).
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

    h ∼ N(0, I)

while the observed variables x_i are assumed to be conditionally independent, given h, i.e., the noise is assumed to come from a diagonal covariance Gaussian distribution, with covariance ψ = diag(σ²), where σ² = (σ²_1, σ²_2, . . .) is a vector of per-variable variances.

The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a Gaussian-distributed (multivariate normal) random variable, with

    x ∼ N(b, WW⊤ + ψ)

where we see that the weights W induce a dependency between two variables x_i and x_j through a kind of auto-encoder path, whereby x_i influences ĥ_k = W_k x via W_ik (for every k) and ĥ_k influences x_j via W_kj.
In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances σ²_i equal to each other. In that case the covariance of x is just WW⊤ + σ²I, where σ² is now a scalar, i.e.,

    x ∼ N(b, WW⊤ + σ²I)

or equivalently

    x = Wh + b + σz

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².
What the probabilistic PCA model is basically saying is that the covariance is mostly captured by the latent variables h, up to some small residual reconstruction error σ². As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected value of h given x becomes an orthogonal projection onto the space spanned by the d columns of W, as in PCA. However, as σ → 0, the density model becomes very sharp around these d dimensions spanned by the columns of W, as illustrated in Figure 10.5.
Figure 10.5: Flat Gaussian capturing probability concentration near a low-dimensional
manifold. The variance in the direction orthogonal to the manifold is very small and can be considered as “noise”, whereas the other variances are large and correspond to “signal” and to a coordinate system for the reduced-dimension data.
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders
The above view of probabilistic PCA as a thin “pancake” of high probability is related to the manifold interpretation of PCA and linear auto-encoders, in which we are looking for projections of x into a subspace that preserves as much information as possible about x. This is illustrated in Figure 10.5. Let the encoder be

    h = f(x) = W(x − µ)

computing such a projection, a low-dimensional representation h. With the auto-encoder view, we have a decoder computing the reconstruction

    x̂ = g(h) = b + Vh.

It turns out that the choices of linear encoder and decoder that minimize the reconstruction error

    E[||x − x̂||²]

correspond to V = W⊤, µ = b = E[x], and to the rows of W forming an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix

    C = E[(x − µ)(x − µ)⊤].
In the case of PCA, the rows of W are these eigenvectors, ordered by the magnitude of
the corresponding eigenvalues (which are all real and non-negative). This is illustrated
in Figure 10.6.
Figure 10.6: Manifold view of PCA and linear auto-encoders. The data distribution is concentrated near a manifold aligned with the leading eigenvector (here, just v_1) of the data covariance matrix. The other eigenvectors (here, just v_2) are orthogonal to the manifold. A data point (in red, x) is encoded into a lower-dimensional representation or code h (here a scalar indicating the position on the manifold, starting from h = 0). The decoder (the transpose of the encoder) maps h back to data space and corresponds to a point lying exactly on the manifold (green cross), the orthogonal projection of x onto the manifold. The optimal encoder and decoder minimize the sum of reconstruction errors (difference vector between x and its reconstruction).
One can also show that eigenvalue λ_i of C corresponds to the variance of x in the direction of eigenvector v_i. If x ∈ R^D and h ∈ R^d with d < D, then the optimal reconstruction error (choosing µ, b, V and W as above) is

    min E[||x − x̂||²] = Σ_{i=d+1}^{D} λ_i.

Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and the reconstruction error is 0.

Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of h, under orthonormal W, instead of minimizing reconstruction error.
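A sketch of these claims on synthetic data (hypothetical sizes; NumPy): build W from the leading eigenvectors of the covariance matrix and check that the average squared reconstruction error matches the sum of the discarded eigenvalues.

    import numpy as np

    rng = np.random.RandomState(0)
    D, d, n = 5, 2, 100000                      # x in R^D, code h in R^d, d < D
    A = rng.randn(D, D)
    X = rng.randn(n, D).dot(A)                  # correlated synthetic data

    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)                 # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    W = eigvecs[:, :d].T                        # rows of W: the d leading eigenvectors
    H = (X - mu).dot(W.T)                       # encoder: h = W (x - mu)
    X_hat = mu + H.dot(W)                       # decoder: x_hat = b + V h with V = W^T, b = mu

    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(err, eigvals[d:].sum())               # the two numbers should closely agree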
10.2.3 ICA
Independent Component Analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling linear factors that seeks non-Gaussian projections of the data. Like probabilistic PCA and factor analysis, it also fits the linear factor model of Eqs. 10.2 and 10.3. What is particular about ICA is that unlike PCA and factor analysis it does not assume that the prior is Gaussian. It only assumes that it is factorized, i.e.,

    P(h) = ∏_i P(h_i).    (10.4)
Since there is no parametric assumption behind the prior, we are really in front of a so-called semi-parametric model, with parts of the model being parametric (P(x|h)) and parts being unspecified or non-parametric (P(h)). In practice this leads to non-Gaussian priors: if the prior were Gaussian, then one could not distinguish between the factors h and a rotation of h. Indeed, note that if

    h = Uz

with U an orthonormal (rotation) square matrix, i.e.,

    z = U⊤h,

then, although h might have a N(0, I) distribution,

    Var[z] = E[zz⊤] = E[U⊤hh⊤U] = U⊤Var[h]U = U⊤U = I.
In other words, imposing independence among Gaussian factors does not allow one to disentangle them, and we could just as well recover any linear rotation of those factors. It means that, given the observed x, even though we might assume the right generative model, PCA cannot recover the original signal. However, if we assume that the latent variables are non-Gaussian, then we can recover them, and this is what ICA is trying to achieve; in fact, under these assumptions, the true underlying factors can be recovered (Comon, 1994). Many ICA algorithms are looking for projections of the data, s = Vx, that are maximally non-Gaussian. An intuitive explanation for these approaches is that although the true latent variables h may be non-Gaussian, almost any linear combination of them will look more Gaussian, because of the central limit theorem. Since linear combinations of the x_i’s are also linear combinations of the h_j’s, to recover the h_j’s we just need to find the linear combinations that are maximally non-Gaussian (while keeping these different projections orthogonal to each other).
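The following sketch (NumPy, with excess kurtosis as a crude non-Gaussianity measure; the rotation and sample sizes are arbitrary) illustrates the argument above: rotating independent Gaussian factors leaves them statistically indistinguishable from the originals (identity covariance, near-zero excess kurtosis), whereas rotating independent Laplace factors makes each component look more Gaussian, which is precisely the signal ICA exploits.

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 200000, 2

    def excess_kurtosis(a):
        # Per-column measure of non-Gaussianity (0 for a Gaussian distribution).
        a = a - a.mean(axis=0)
        return np.mean(a ** 4, axis=0) / np.mean(a ** 2, axis=0) ** 2 - 3.0

    U, _ = np.linalg.qr(rng.randn(d, d))          # a random orthonormal (rotation) matrix

    h_gauss = rng.randn(n, d)                     # independent Gaussian factors, unit variance
    h_laplace = rng.laplace(size=(n, d)) / np.sqrt(2.0)   # independent Laplace factors, unit variance

    for name, h in [("Gaussian factors", h_gauss), ("Laplace factors", h_laplace)]:
        z = h.dot(U)                              # rotated factors (z = U^T h in column convention)
        print(name)
        print("  cov(h) close to I:", np.allclose(np.cov(h, rowvar=False), np.eye(d), atol=0.02))
        print("  cov(z) close to I:", np.allclose(np.cov(z, rowvar=False), np.eye(d), atol=0.02))
        print("  excess kurtosis of h:", excess_kurtosis(h))
        print("  excess kurtosis of z:", excess_kurtosis(z))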
There is an interesting connection between ICA and sparsity, since the dominant
form of non-Gaussianity in real data is due to sparsity, i.e., concentration of probability
at or near 0. Non-Gaussian distributions typically have more mass around zero, although one can also get non-Gaussianity by changing the skewness (asymmetry) or the kurtosis.
10.2.4 Sparse Coding as a Generative Model
One particularly interesting form of non-Gaussianity arises with distributions that are
sparse. These typically have not just a peak at 0 but also a fat tail
4
. Like the other
linear factor models (Eq. 10.3), Sparse coding corresponds to a linear factor model, but
one with a “sparse” latent variable h, i.e., P(h) puts high probability at or around 0.
For example the factorized Laplace density prior is
P (h) =
i
P (h
i
) =
i
λ
2
e
λ|h
i
|
(10.5)
and the factorized Student-t prior is
P (h) =
i
P (h
i
)
i
1
1 +
h
2
i
ν
ν+1
2
. (10.6)
Both of these densities have a strong preference for near-zero values but, unlike the
Gaussian, accomodate large values. In the standard sparse coding models, the recon-
struction noise is assumed to be Gaussian, so that the corresponding reconstruction
error is the squared error.
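A quick numerical check of this claim (a sketch with an arbitrary λ): compared with a Gaussian of the same variance, samples from the factorized Laplace prior of Eq. 10.5 put more mass both very close to zero and far out in the tails.

    import numpy as np

    rng = np.random.RandomState(0)
    lam = 4.0                                   # Laplace rate parameter (hypothetical)
    n = 100000

    # Samples from the factorized Laplace prior of Eq. 10.5 (scale = 1/lambda)
    h_laplace = rng.laplace(scale=1.0 / lam, size=n)
    # Gaussian samples with the same variance (2 / lambda^2), for comparison
    h_gauss = rng.randn(n) * np.sqrt(2.0) / lam

    eps = 0.05
    print("P(|h| < 0.05), Laplace :", np.mean(np.abs(h_laplace) < eps))
    print("P(|h| < 0.05), Gaussian:", np.mean(np.abs(h_gauss) < eps))
    print("P(|h| > 1),    Laplace :", np.mean(np.abs(h_laplace) > 1.0))
    print("P(|h| > 1),    Gaussian:", np.mean(np.abs(h_gauss) > 1.0))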
Regarding sparsity, note that the actual value h_i = 0 has zero measure under both densities, meaning that the posterior distribution P(h|x) will not generate values h = 0. However, sparse coding is normally considered under a maximum a posteriori (MAP) framework, in which the inferred values of h are those that maximize the posterior, and these tend to often be zero if the prior is sufficiently concentrated around 0. The inferred values are those defined in Eq. 10.1,

    h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (10.7)

where L(g(h), x) is interpreted as − log P(x|g(h)) and Ω(h) as − log P(h). This MAP inference view of sparse coding and an interesting probabilistic interpretation of sparse coding are further discussed in Section 16.3.
To relate the generative model of sparse coding to ICA, note how the prior imposes not just sparsity but also independence of the latent variables h_i under P(h), which may help to separate different explanatory factors, unlike PCA, factor analysis or probabilistic PCA (because these rely on a Gaussian prior, which yields a factorized prior under any rotation of the factors, i.e., multiplication by an orthonormal matrix).
Sparse coding also has an interesting manifold learning interpretation. The codes h inferred with the above equation do not fill the space in which h lives. Instead, probability mass is concentrated on axis-aligned subspaces: sets of values of h for which most of the axes are set to 0. We can thus decompose h into two pieces of information:

- A binary pattern β which specifies which h_i are non-zero, with N_a = Σ_i β_i the number of “active” (non-zero) dimensions.
- A variable-length real-valued vector α ∈ R^{N_a} which specifies the coordinates for each of the active dimensions.
The pattern β can be viewed as specifying an N_a-dimensional region in input space (the set of x = Wh + b where h_i = 0 if β_i = 0). That region is actually a linear manifold, an N_a-dimensional hyperplane. All those hyperplanes go through a “center” x = b. The vector α then specifies a Euclidean coordinate on that hyperplane.
Because the prior P(h) is concentrated around 0, the probability mass of P(x) is concentrated on the regions of these hyperplanes near x = b. Depending on the amount of reconstruction error (output variance for P(x|g(h))), there is also probability mass bleeding around these hyperplanes, making them look more like pancakes. Each of these hyperplane-aligned manifolds and the associated distribution is just like the ones we associate with probabilistic PCA and factor analysis. The crucial difference is that instead of one hyperplane, we have 2^d hyperplanes if h ∈ R^d. Due to the sparsity prior, however, most of these flat Gaussians are unlikely: only the ones corresponding to a small N_a (with only a few of the axes being active) are likely. For example, if we were to restrict ourselves to only those values of β for which N_a = k, then we would have (d choose k) Gaussians. With this exponentially large number of Gaussians, the interesting thing to observe is that the sparse coding model only has a number of parameters linear in the number of dimensions of h. This property is shared with other distributed representation or sparse representation models, such as the RBM described below as well as other sparse auto-encoders and contractive auto-encoders (see Chapter 13).
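To give a sense of the numbers involved, here is a tiny sketch (with hypothetical dimensions) contrasting the combinatorial count of active patterns with the linear parameter count of the model.

    from math import comb

    d = 1000        # code dimension (hypothetical)
    k = 10          # number of active components per code (hypothetical)
    d_x = 100       # input dimension (hypothetical)

    # Number of distinct k-dimensional hyperplanes, i.e., of active patterns beta with N_a = k
    n_hyperplanes = comb(d, k)
    # Number of parameters of the sparse coding model: dictionary W (d_x by d) plus the bias b
    n_parameters = d_x * d + d_x

    print("hyperplanes with k active components:", n_hyperplanes)   # about 2.6e23
    print("model parameters:", n_parameters)                        # 100100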
10.3 RBMs
A Restricted Boltzmann Machine (RBM)...
10.4 Greedy Layerwise Unsupervised Pre-Training
The unsupervised pre-training protocol has already been introduced in Section 7.8, where it is seen as a form of regularization. It relies on a one-layer representation learning algorithm such as those introduced in this chapter (see also Chapter 13 for more). Each layer is pre-trained by unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution (or relation to other variables such as categories to predict) is hopefully simpler.
Greedy layerwise unsupervised pre-training was introduced in Hinton et al. (2006); Hinton and Salakhutdinov (2006); Bengio et al. (2007); Ranzato et al. (2007). It is called layerwise because it proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm. It is called greedy because the different layers are not jointly trained with respect to a global training objective, which could make the procedure sub-optimal. However, it is also called pre-training, because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together with respect to a criterion of interest. In the context of a supervised learning task, it can be viewed as a regularizer and as a sophisticated form of parameter initialization.
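The following sketch (NumPy, with hypothetical data and layer sizes) illustrates the procedure with shallow auto-encoders as the one-layer learners: each layer is trained on the representation produced by the previously trained layers, and the resulting encoder parameters would then initialize a deep network for joint fine-tuning.

    import numpy as np

    rng = np.random.RandomState(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def train_shallow_autoencoder(X, n_code, lr=0.1, n_epochs=100):
        # One-hidden-layer auto-encoder trained on squared reconstruction error.
        n_in = X.shape[1]
        W = 0.1 * rng.randn(n_in, n_code)
        b = np.zeros(n_code)
        V = 0.1 * rng.randn(n_code, n_in)
        c = np.zeros(n_in)
        for _ in range(n_epochs):
            H = sigmoid(X.dot(W) + b)          # encoder
            R = H.dot(V) + c                   # decoder (linear output)
            dR = 2.0 * (R - X) / X.shape[0]
            dH = dR.dot(V.T) * H * (1 - H)
            V -= lr * H.T.dot(dR); c -= lr * dR.sum(axis=0)
            W -= lr * X.T.dot(dH); b -= lr * dH.sum(axis=0)
        return W, b

    # Greedy layerwise unsupervised pre-training: train one shallow auto-encoder
    # per layer, each on the output of the previously trained layers.
    X = rng.randn(500, 50)                      # hypothetical unlabeled data
    layer_sizes = [40, 30, 20]                  # hypothetical architecture
    pretrained = []
    H = X
    for n_code in layer_sizes:
        W, b = train_shallow_autoencoder(H, n_code)
        pretrained.append((W, b))               # keep encoder weights to initialize the deep net
        H = sigmoid(H.dot(W) + b)               # the output of this layer feeds the next one

    # The parameters in `pretrained` would then initialize a deep network,
    # to be fine-tuned jointly (e.g. with supervised training) on the task of interest.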
However, greedy layerwise unsupervised pre-training can also be used as initialization for other unsupervised learning algorithms, such as deep auto-encoders (Hinton and Salakhutdinov, 2006) (this chapter and Chapter 13), deep belief networks (Hinton et al., 2006) (Section 17.2), or deep Boltzmann machines (Salakhutdinov and Hinton, 2009a) (Section 17.3).
As discussed in Section 8.9.3, it is also possible to have greedy layerwise supervised
pre-training, to help optimize deep supervised networks.
10.5 Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (i.e., distribution P_1) is exploited to improve generalization in another setting (say distribution P_2).

In the case of transfer learning, we consider that the task is different but many of the factors that explain the variations in P_1 are relevant to the variations that need to be captured for learning P_2. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature, e.g., we learn about different visual categories in the first setting and in the second setting. If there is a lot more data in the first setting (sampled from P_1), then that may help to learn representations that are useful to quickly generalize when examples of P_2 are drawn. For example, many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.
In the case of domain adaptation, we consider that the task is the same but the input distribution is different. For example, if we predict sentiment (positive or negative judgement) associated with textual comments posted on the web, the first setting may refer to consumer comments about books, videos and music, while the second setting may refer to televisions or other products. One can imagine that there is an underlying function that tells whether any statement is positive, neutral or negative, but of course the vocabulary, style, accent, etc. may vary from one domain to another, making it more difficult to generalize across domains.
A related problem is that of concept drift, which we can view as a form of trans-
fer learning due to gradual changes in the data distribution over time. Both concept
drift and transfer learning can be viewed as particular forms of multi-task learning
(Section 7.11).
In all of these cases, the objective is to take advantage of data from a first setting to extract information that may be useful when learning or even directly making predictions in the second setting. One of the potential advantages of representation learning for such generalization challenges, and especially of deep representation learning, is that it may considerably help to generalize by extracting and disentangling a set of explanatory factors from data of the first setting, some of which may be relevant to the second setting. In our object recognition example, many of the factors of variation that explain visual categories in natural images remain the same when we move from one set of categories to another. This raises a very interesting and important question which is one of the core questions of this book: what is a good representation? Is it possible to learn representations that disentangle the underlying factors of variation? This theme is further explored in Chapter 14. What we claim is that, in order to maximize our chances of success in transfer learning, domain adaptation, or concept drift, it would be very helpful to learn the most abstract features, because these are the most general, i.e., more likely to be relevant over many domains and many categories, and stable over time.
A good example of unsupervised deep learning applied to transfer learning is its success in two competitions that took place in 2011, with results presented at ICML 2011 (and IJCNN 2011) in one case (Mesnil et al., 2011) (the Transfer Learning Challenge, http://www.causality.inf.ethz.ch/unsupervised-learning.php) and at NIPS 2011 (Goodfellow et al., 2011) in the other case (the Transfer Learning Challenge that was held as part of the NIPS’2011 workshop on Challenges in learning hierarchical models, https://sites.google.com/site/nips2011workshop/transfer-learning-challenge).
Figure 10.7: Results obtained on the Sylvester validation set (Transfer Learning Chal-
lenge). From left to right and top to bottom, respectively 0, 1, 2, 3, and 4 pre-trained
layers. The horizontal axis is the logarithm of the number of labeled training examples in the second setting (test task). The vertical axis is the Area Under the Curve, which reflects classification
accuracy. With deeper representations (learned unsupervised), the learning curves con-
siderably improve, requiring fewer labeled examples to achieve the best generalization.
Figure 10.7 illustrates one of the most striking results of the first of these competitions: as we consider deeper and deeper representations (learned in a purely unsupervised way from data of the first setting), the learning curve on the new categories of the second setting (with a linear classifier) becomes much better, i.e., fewer labeled examples are necessary to achieve the apparently asymptotic generalization performance.