Chapter 10
Unsupervised and Transfer Learning
Whereas supervised learning has been the workhorse of recent industrial successes of
deep learning, the authors of this book believe that it is likely that a key element of
future advances will be unsupervised learning of representations.
In Chapter 1, we introduced the notion of representation: the idea that some representations are more helpful (e.g., to classify objects from images or phonemes from speech) than others. As argued there, this suggests learning representations in order to “select” the best ones in a systematic way, i.e., by optimizing a function that maps raw data to its representation, instead of (or in addition to) handcrafting them. This
motivation for learning input features is discussed in Section 6.5, and is one of the major
side-effects of training a feedforward deep network (treated in Chapter 6), typically via
supervised learning, i.e., when one has access to (input, target) pairs (typically obtained by labeling inputs with some target answer that we wish the computer would produce), available for some
task of interest. In the case of supervised learning of deep nets, we learn a representation
with the objective of selecting one that is best suited to the task of predicting targets
given inputs.
But what if we don’t have any labeled examples? Or too few? Pure supervised
learning with few labeled examples can easily overfit. On the other hand, humans (and
other animals) can sometimes learn a task from just one or very few examples. How is
that possible? Clearly they must rely on previously acquired knowledge, either innate
or (more likely the case for humans) via previous learning experience. Can we discover
good representations purely out of unlabeled examples? (this is treated in the first four
sections of this chapter). Can we combine unlabeled examples (which are often easy
to obtain) with labeled examples? (this is semi-supervised learning, Section 7.6). And
what if instead of one task we have many tasks that could share the same representation
or parts of it? (this is multi-task learning, discussed in Section 7.11). What if we
have “training tasks” (on which enough labeled examples are available) as well as “test
tasks” (not known at the time of learning the representation, and for which only very
few labeled examples will be provided)? What if the test task is similar to but different
from the training task? (this is transfer learning and domain adaptation, discussed in
Section 10.5).
These are the questions discussed in this chapter. Along the way we introduce some
of the major building blocks for unsupervised learning of representations (auto-encoders,
Section 10.1 and RBMs, Section 10.3) and a very successful recipe for combining them
to form deep representations (greedy layerwise unsupervised pre-training, Section 10.4).
10.1 Auto-Encoders
The idea of auto-encoders has been part of the historical landscape of neural networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994) but has really picked up speed in recent years. Auto-encoders remained somewhat marginal for many years, in part due to an incomplete understanding of their mathematical interpretation and geometrical underpinnings, which are developed further in Chapter 13 and Section 17.9.
[Figure 10.1: the input x is mapped by the encoder f to the code h, which the decoder g maps to the reconstruction r.]
Figure 10.1: General schema of an auto-encoder, mapping an input x to an output (called
reconstruction) r through an internal representation or code h. The auto-encoder has
two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
An auto-encoder is simply a neural network that tries to copy its input to its output. The architecture of an auto-encoder is typically decomposed into the following parts, illustrated in Figure 10.1:

- an input, x
- an encoder function f
- a code or internal representation h = f(x)
- a decoder function g
- an output, also called the reconstruction, r = g(h) = g(f(x))
- a loss function L computing a scalar L(r, x) measuring how good a reconstruction r is of the given input x. The objective is to minimize the expected value of L over the training set of examples {x}.
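As a concrete illustration of these parts, here is a minimal sketch of a one-hidden-layer auto-encoder in NumPy: a sigmoid encoder f, an affine decoder g, squared reconstruction error as L, and plain gradient descent. The data, layer sizes and learning rate are hypothetical choices for illustration only.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)                  # hypothetical training set: 100 examples in R^20

    n_in, n_code = X.shape[1], 5            # undercomplete code: 5 < 20
    W_enc = 0.1 * rng.randn(n_in, n_code)   # encoder weights
    b_enc = np.zeros(n_code)
    W_dec = 0.1 * rng.randn(n_code, n_in)   # decoder weights
    b_dec = np.zeros(n_in)
    lr = 0.01                               # assumed learning rate

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    for epoch in range(200):
        H = sigmoid(X.dot(W_enc) + b_enc)   # encoder: h = f(x)
        R = H.dot(W_dec) + b_dec            # decoder: r = g(h), linear output for real-valued x
        loss = np.mean(np.sum((R - X) ** 2, axis=1))   # L(r, x): squared reconstruction error
        # Backpropagate the loss through decoder and encoder, then take a gradient step.
        dR = 2.0 * (R - X) / X.shape[0]
        dH = dR.dot(W_dec.T) * H * (1 - H)
        W_dec -= lr * H.T.dot(dR); b_dec -= lr * dR.sum(axis=0)
        W_enc -= lr * X.T.dot(dH); b_enc -= lr * dH.sum(axis=0)

    print("final average reconstruction error:", loss)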
10.1.1 Regularized Auto-Encoders
Predicting the input may sound useless: what could prevent the auto-encoder from simply copying its input into its output? In the 20th century, this was avoided by constraining the architecture of the auto-encoder: the dimension of the code h was forced to be smaller than the dimension of the input x.
[Figure 10.2: left, an encoder-decoder pair with a code bottleneck h (undercomplete representation); right, an encoder-decoder pair with a larger code h (overcomplete representation).]
Figure 10.2: Left: undercomplete representation (dimension of code h is less than di-
mension of input x). Right: overcomplete representation. Overcomplete auto-encoders
require some other form of regularization (instead of the constraint on the dimension of
h) to avoid the trivial solution where r = x for all x.
Figure 10.2 illustrates the two typical cases of auto-encoders: undercomplete vs. overcomplete, i.e., with the dimension of the representation h respectively smaller or larger than the dimension of the input x. Whereas early work with auto-encoders, just like PCA, used undercompleteness (a bottleneck in the sequence of layers) to avoid learning the identity function, more recent work allows overcomplete representations. What we have learned in recent years is that it is possible to make the auto-encoder meaningfully capture the structure of the input distribution even if the representation is overcomplete, by imposing other forms of constraint or regularization. In fact, once one realizes that auto-encoders can capture the input distribution (indirectly, not as an explicit probability function), it becomes clear that their capacity should grow with the complexity of the distribution to be captured (and with the amount of data available): it should not be limited by the input dimension. This is a problem in particular for shallow auto-encoders, which have a single hidden layer (for the code).
Besides the bottleneck constraint, alternative constraints or regularization methods have been explored that can guarantee that the auto-encoder does something useful and does not just learn some trivial identity-like function:
- Sparsity of the representation or of its derivative: even if the intermediate representation has a very high dimensionality, the effective local dimensionality (number of degrees of freedom that capture a coordinate system among the probable x’s) could be much smaller if most of the elements of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to zero). When ||∂h_i/∂x|| is close to zero, h_i does not participate in encoding local changes in x. There is a geometrical interpretation of this situation in terms of manifold learning that is discussed in more depth in Chapter 13. That discussion also explains how an auto-encoder naturally tends towards learning a coordinate system for the actual factors of variation in the data. At least four types of “auto-encoders” clearly fall in this category of sparse representation:
  - Sparse coding (Olshausen and Field, 1996) has been heavily studied as an unsupervised feature learning and feature inference mechanism. It looks for representations that are both sparse and explain the input through the decoder. In general there is no parametric encoder and the representation is instead considered a free variable that is obtained through an optimization, i.e., a particular form of inference:

        h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (10.1)

    where L is the reconstruction loss, f the (non-parametric) encoder, g the (parametric) decoder, Ω(h) is a sparsity regularizer, and in practice the minimization can be approximate (a small numerical sketch of this inference is given just after this list). Sparse coding has a manifold or geometric interpretation that is discussed in Section 13.3. It also has an interpretation as a directed graphical model, described in more detail in Section 16.3. To achieve sparsity, the optimized objective function includes a term that is minimized when the representation has many zero or near-zero values, such as the L1 penalty |h|_1 = Σ_i |h_i|.
  - An interesting variation of sparse coding combines the freedom to choose the representation through optimization and a parametric encoder. It is called predictive sparse decomposition (PSD) (Kavukcuoglu et al., 2008) and is briefly described in Section 13.3.2.
  - At the other end of the spectrum are simply sparse auto-encoders, which combine with the standard auto-encoder schema a sparsity penalty that encourages the output of the encoder to be sparse. These are also described in Section 13.3.1. Besides the L1 penalty, other sparsity penalties that have been explored include the Student-t penalty (Olshausen and Field, 1996; Bergstra, 2011),

        Σ_i log(1 + α²h_i²)

    (i.e., where αh_i has a Student-t prior density) and the KL-divergence penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008),

        −Σ_i (t log h_i + (1 − t) log(1 − h_i)),

    with a target sparsity level t, for h_i ∈ (0, 1), e.g. obtained through a sigmoid non-linearity.
  - Contractive auto-encoders (Rifai et al., 2011), covered in Section 13.5, explicitly penalize ||∂h/∂x||²_F, i.e., the sum of the squared norms of the vectors ∂h_i(x)/∂x (each indicating how much hidden unit h_i responds to changes in x and what direction of change in x that unit is most sensitive to, around a particular x). With such a regularization penalty, the auto-encoder is called contractive (a function f is contractive if ||f(x) − f(y)|| < ||x − y|| for nearby x and y, or equivalently if its derivative satisfies ||f′(x)|| < 1) because the mapping from input to representation, h(x), is encouraged to be contractive, i.e., to have small derivatives in all directions. Note that a sparsity regularization indirectly leads to a contractive mapping as well, when the non-linearity used happens to have a zero derivative at h_i = 0 (which is the case for the sigmoid non-linearity).
- Robustness to injected noise: if noise is injected in the inputs or hidden units, or if some inputs are missing, while the neural network is asked to reconstruct the clean and complete input, then it cannot simply learn the identity function. It has to capture the structure of the data distribution in order to optimally perform this reconstruction. Such auto-encoders are called denoising auto-encoders and are discussed in more detail in Section 13.4. There is a tight connection between denoising auto-encoders and contractive auto-encoders: it can be shown that in the limit of small injected Gaussian noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function r(x) = g(f(x)). In other words, since both x and x + ε (where ε is some small noise vector) must yield the same target output x, the reconstruction function is encouraged to be
insensitive to changes in all directions. The only thing that prevents the reconstruction r(x) from simply being a constant (completely insensitive to the input x) is
that one also has to reconstruct correctly for different training examples x. How-
ever, the auto-encoder can learn to be approximately constant around training
examples x while producing a different answer for different training examples. As
discussed in Section 13.1, if the examples are near a low-dimensional manifold,
this encourages the representation to vary only on the manifold and be locally
constant in directions orthogonal to the manifold, i.e., the representation locally
captures a (not necessarily Euclidean, not necessarily orthogonal) coordinate sys-
tem for the manifold. In addition to the denoising auto-encoder, the variational
auto-encoder (Section 17.7.1) and the generative stochastic networks (Section 17.9)
also involve the injection of noise, but typically in the representation-space itself,
thus introducing the notion of h as a latent variable.
- Pressure of a Prior on the Representation: an interesting way to generalize the notion of regularization applied to the representation is to introduce in the cost function of the auto-encoder a log-prior term

      − log P(h)

  which captures the assumption that we would like to find a representation that has a simple distribution, or at least one that is simpler than the original data distribution. Among all the encoding functions f, we would like to pick one that

  1. can be inverted (easily), which is achieved by minimizing some reconstruction loss, and
  2. yields representations h whose distribution is “simpler”, i.e., can be captured with less capacity than the original training distribution itself.

  The sparse variants described above clearly fall in that framework. The variational auto-encoder (Section 17.7.1) provides a clean mathematical framework for justifying such a pressure from a top-level prior when the objective is to model the data-generating distribution.
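To make the sparse coding inference of Eq. 10.1 (first item above) concrete, here is a small sketch of an ISTA-style proximal gradient loop in NumPy, for a linear decoder g(h) = Wh, squared reconstruction loss and an L1 sparsity penalty. The dictionary, λ and iteration count are hypothetical, and this is only one of several ways to carry out the (possibly approximate) minimization.

    import numpy as np

    rng = np.random.RandomState(0)
    d_in, d_code = 20, 50                    # overcomplete code (hypothetical sizes)
    W = rng.randn(d_in, d_code) / np.sqrt(d_in)   # decoder (dictionary); columns are basis vectors
    x = rng.randn(d_in)                      # one input example
    lam = 0.5                                # sparsity penalty weight (assumed)
    step = 1.0 / np.linalg.norm(W, 2) ** 2   # step size <= 1 / Lipschitz constant of the gradient

    def soft_threshold(v, t):
        # Proximal operator of t * |v|_1: shrinks values towards 0, producing exact zeros.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    h = np.zeros(d_code)
    for it in range(200):
        # Gradient of the reconstruction loss 0.5 * ||x - W h||^2 with respect to h
        grad = W.T.dot(W.dot(h) - x)
        # Gradient step on the smooth part, then proximal step on lam * |h|_1
        h = soft_threshold(h - step * grad, step * lam)

    print("non-zero code elements:", np.count_nonzero(h), "out of", d_code)
    print("reconstruction error:", np.sum((x - W.dot(h)) ** 2))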
10.1.2 Representational Power, Layer Size and Depth
Nothing in the above description of auto-encoders restricts the encoder or decoder to
be shallow, but in the literature on the subject, most trained auto-encoders have had a
single hidden layer which is also the representation layer or code.
For one, we know from the usual universal approximation abilities of single hidden-layer neural networks that a sufficiently large hidden layer can approximate any function to a given accuracy. This observation justifies overcomplete auto-encoders: in order to represent a rich enough distribution, one probably needs many hidden units in the intermediate representation layer. We also know that Principal Components Analysis (PCA) corresponds to an undercomplete auto-encoder with no intermediate non-linearity, and that PCA can only capture a set of directions of variation that are the same everywhere in space. This notion is discussed in more detail in Chapter 13 in the context of manifold learning.
For two, it has also been reported many times that training a deep neural network,
and in particular a deep auto-encoder (i.e. with a deep encoder and a deep decoder) is
more difficult than training a shallow one. This was actually a motivation for the initial
work on the greedy layerwise unsupervised pre-training procedure, described below in
Section 10.4, by which we only need to train a series of shallow auto-encoders in order to
initialize a deep auto-encoder. It was shown early on (Hinton and Salakhutdinov, 2006)
that, if trained properly, such deep auto-encoders could yield much better compression
than corresponding shallow or linear auto-encoders (which are basically doing the same
as PCA, see Section 10.2.1 below). As discussed in Section 14.3, deeper architectures
can be in some cases exponentially more efficient (both in terms of computation and
statistically) than shallow ones. However, the fact that we can usefully pre-train a deep net by training and stacking shallow ones makes it interesting to consider single-layer auto-encoders, as has been done in most of the literature discussed in Chapter 13.
10.1.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss L) make sense when the loss L is simply the squared reconstruction error, but there are many cases where this is not appropriate, e.g., when x is a vector of discrete variables or when P(x|h) is not well approximated by a Gaussian distribution (see the link between squared error and the normal density in Sections 5.6 and 6.2.2). Just like in the case of other types of neural networks (starting with the feedforward neural networks, Section 6.2.2), it is convenient to define the loss L as a negative log-likelihood over some target random variables. This probabilistic interpretation is particularly important for the discussion in Section 17.9, where the output of the auto-encoder is interpreted as a probability distribution P(x|h) for reconstructing x, given hidden units h. This distribution captures the uncertainty about the original x which gave rise to h (either deterministically or stochastically). In the simplest and most ordinary cases, this distribution factorizes, i.e., P(x|h) = ∏_i P(x_i|h). This covers the usual cases of x_i|h being Gaussian (for unbounded real values) and x_i|h having a Bernoulli distribution (for binary values x_i), but one can readily generalize this to other distributions, such as mixtures (see Sections 3.10.5 and 6.2.2).
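As a small illustration of this probabilistic view of the loss (a sketch with made-up decoder outputs), the negative log-likelihood reduces, up to constants, to squared reconstruction error for a factorized Gaussian P(x|h) with fixed variance, and to cross-entropy for factorized Bernoulli outputs:

    import numpy as np

    def gaussian_nll(x, mean, sigma2=1.0):
        # -log P(x|h) for factorized Gaussian x_i|h ~ N(mean_i, sigma2):
        # up to a constant, the squared reconstruction error divided by 2*sigma2.
        return np.sum(0.5 * np.log(2 * np.pi * sigma2) + (x - mean) ** 2 / (2 * sigma2))

    def bernoulli_nll(x, p):
        # -log P(x|h) for factorized Bernoulli x_i|h with mean p_i in (0, 1):
        # the usual cross-entropy reconstruction loss for binary x.
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

    x_real = np.array([0.5, -1.2, 2.0])      # real-valued input (hypothetical)
    r_mean = np.array([0.4, -1.0, 1.8])      # decoder output: predicted means
    print(gaussian_nll(x_real, r_mean))

    x_bin = np.array([1.0, 0.0, 1.0])        # binary input (hypothetical)
    r_prob = np.array([0.9, 0.2, 0.7])       # decoder output after a sigmoid
    print(bernoulli_nll(x_bin, r_prob))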
Thus we can generalize the notion of decoding function g(h) to decoding distribution
P (x|h). Similarly, we can generalize the notion of encoding function f(x) to encoding
distribution Q(h|x), as illustrated in Figure 10.3. We use this to capture the fact that
noise is injected at the level of the representation h, now considered like a latent vari-
able. This generalization is crucial in the development of the variational auto-encoder (Section 17.7.1) and the generative stochastic networks (Section 17.9).
We also find a stochastic encoder and a stochastic decoder in the RBM, described be-
low (Section 10.3). In that case, the encoding distribution Q(h|x) and P (x|h) “match”,
in the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distribution which has
[Figure 10.3: x and h linked by the encoder distribution Q(h|x) and the decoder distribution P(x|h).]
Figure 10.3: Basic scheme of a stochastic auto-encoder, in which both the encoder and
the decoder are not simple functions but instead involve some noise injection, meaning
that their output can be seen as sampled from a distribution, Q(h|x) for the encoder
and P (x|h) for the decoder. RBMs are a special case where P = Q but in general these
two distributions are not necessarily conditional distributions compatible with a unique
joint distribution P (x, h).
both Q(h|x) and P (x|h) as conditionals. This is not true in general for two conditionals
like Q(h|x) and P(x|h).
10.2 Linear Factor Models
The idea of discovering explanatory factors that have a simple joint distribution among themselves is old, e.g., see Factor Analysis below, and it was first explored in the context where the relationship between factors and data is linear, i.e., we assume that the data was generated as follows. First, sample the real-valued factors,

    h ∼ P(h),    (10.2)

and then sample the real-valued observable variables given the factors:

    x = Wh + b + noise    (10.3)

where the noise is typically Gaussian and diagonal (independent across dimensions).
This is illustrated in Figure 10.4.
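A generative sketch of Eqs. 10.2 and 10.3 (hypothetical dimensions, a unit Gaussian prior and diagonal Gaussian noise), which also checks empirically that the marginal covariance of x is WW⊤ plus the diagonal noise covariance, as derived for factor analysis in the next subsection:

    import numpy as np

    rng = np.random.RandomState(0)
    d_h, d_x, n = 3, 5, 200000            # hypothetical latent and observed dimensions
    W = rng.randn(d_x, d_h)
    b = rng.randn(d_x)
    sigma2 = 0.1 * np.ones(d_x)           # per-variable noise variances (diagonal)

    # Eq. 10.2: sample the factors h ~ P(h), here a unit Gaussian prior
    H = rng.randn(n, d_h)
    # Eq. 10.3: x = W h + b + noise, with diagonal Gaussian noise
    noise = rng.randn(n, d_x) * np.sqrt(sigma2)
    X = H.dot(W.T) + b + noise

    # The marginal covariance of x should be close to W W^T + diag(sigma2)
    emp_cov = np.cov(X, rowvar=False)
    model_cov = W.dot(W.T) + np.diag(sigma2)
    print(np.max(np.abs(emp_cov - model_cov)))   # small for large n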
10.2.1 Probabilistic PCA and Factor Analysis
Probabilistic PCA (Principal Components Analysis) and factor analysis are both special
cases of the above equations (10.2 and 10.3) and only differ in the choices made for the
prior and noise distributions.
[Figure 10.4: latent factors h ∼ P(h) generate the observed x through P(x|h).]
Figure 10.4: Basic scheme of a linear factors model, in which we assume that an observed
data vector x is obtained by a linear combination of latent factors h, plus some noise.
Different models, such as probabilistic PCA, factor analysis or ICA, make different
choices about the form of the noise and of the prior P (h).
In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

    h ∼ N(0, I)

while the observed variables x_i are assumed to be conditionally independent, given h, i.e., the noise is assumed to come from a diagonal covariance Gaussian distribution, with covariance ψ = diag(σ²), where σ² = (σ²_1, σ²_2, . . .) is a vector of per-variable variances.

The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a Gaussian-distributed (multivariate normal) random variable, with

    x ∼ N(b, WW⊤ + ψ)

where we see that the weights W induce a dependency between two variables x_i and x_j through a kind of auto-encoder path, whereby x_i influences ĥ_k = W_k x via W_ik (for every k) and ĥ_k influences x_j via W_kj.
In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances σ²_i equal to each other. In that case the covariance of x is just WW⊤ + σ²I, where σ² is now a scalar, i.e.,

    x ∼ N(b, WW⊤ + σ²I)

or equivalently

    x = Wh + b + σz

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².
What the probabilistic PCA model is basically saying is that the covariance is mostly captured by the latent variables h, up to some small residual reconstruction error σ². As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected value of h given x becomes an orthogonal projection onto the space spanned by the d columns of W, as in PCA. However, as σ → 0, the density model becomes very sharp around these d dimensions spanned by the columns of W, as illustrated in Figure 10.5.
Figure 10.5: Flat Gaussian capturing probability concentration near a low-dimensional
manifold. The variance in the direction orthogonal to the manifold is very small and can be considered as “noise”, whereas the other variances are large and correspond to “signal” and to a coordinate system for the reduced-dimension data.
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders
The above view of probabilistic PCA as a thin “pancake” of high probability is related to the manifold interpretation of PCA and linear auto-encoders, in which we are looking for projections of x into a subspace that preserves as much information as possible about x. This is illustrated in Figure 10.5. Let the encoder be

    h = f(x) = W(x − µ)

computing such a projection, a low-dimensional representation h. With the auto-encoder view, we have a decoder computing the reconstruction

    x̂ = g(h) = b + Vh.

It turns out that the choices of linear encoder and decoder that minimize the reconstruction error

    E[||x − x̂||²]

correspond to V = W⊤, µ = b = E[x], and to the rows of W forming an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix

    C = E[(x − µ)(x − µ)⊤].
In the case of PCA, the rows of W are these eigenvectors, ordered by the magnitude of
the corresponding eigenvalues (which are all real and non-negative). This is illustrated
in Figure 10.6.
Figure 10.6: Manifold view of PCA and linear auto-encoders. The data distribution is concentrated near a manifold aligned with the leading eigenvector (here, just v_1) of the data covariance matrix. The other eigenvectors (here, just v_2) are orthogonal to the manifold. A data point (in red, x) is encoded into a lower-dimensional representation or code h (here a scalar indicating the position on the manifold, starting from h = 0). The decoder (the transpose of the encoder) maps h back to data space and corresponds to a point lying exactly on the manifold (green cross), the orthogonal projection of x onto the manifold. The optimal encoder and decoder minimize the sum of reconstruction errors (difference vector between x and its reconstruction).
One can also show that eigenvalue λ_i of C corresponds to the variance of x in the direction of eigenvector v_i. If x ∈ R^D and h ∈ R^d with d < D, then the optimal reconstruction error (choosing µ, b, V and W as above) is

    min E[||x − x̂||²] = Σ_{i=d+1}^{D} λ_i.

Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and the reconstruction error is 0.

Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of h, under orthonormal W, instead of minimizing reconstruction error.
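A sketch of these claims on synthetic data (hypothetical sizes; NumPy): build W from the leading eigenvectors of the covariance matrix and check that the average squared reconstruction error matches the sum of the discarded eigenvalues.

    import numpy as np

    rng = np.random.RandomState(0)
    D, d, n = 5, 2, 100000                      # x in R^D, code h in R^d, d < D
    A = rng.randn(D, D)
    X = rng.randn(n, D).dot(A)                  # correlated synthetic data

    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)                 # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    W = eigvecs[:, :d].T                        # rows of W: the d leading eigenvectors
    H = (X - mu).dot(W.T)                       # encoder: h = W (x - mu)
    X_hat = mu + H.dot(W)                       # decoder: x_hat = b + V h with V = W^T, b = mu

    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(err, eigvals[d:].sum())               # the two numbers should closely agree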
10.2.3 ICA
Independent Component Analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling linear factors that seeks non-Gaussian projections of the data. Like probabilistic PCA and factor analysis, it also fits the linear factor model of Eqs. 10.2 and 10.3. What is particular about ICA is that unlike PCA and factor analysis it does not assume that the prior is Gaussian. It only assumes that it is factorized, i.e.,

    P(h) = ∏_i P(h_i).    (10.4)
Since there is no parametric assumption behind the prior, we are really in front of a so-called semi-parametric model, with parts of the model being parametric (P(x|h)) and parts being unspecified or non-parametric (P(h)). In practice this leads to non-Gaussian priors: if the prior were Gaussian, then one could not distinguish between the factors h and a rotation of h. Indeed, note that if

    h = Uz

with U an orthonormal (rotation) square matrix, i.e.,

    z = U⊤h,

then, although h might have a N(0, I) distribution,

    Var[z] = E[zz⊤] = E[U⊤hh⊤U] = U⊤Var[h]U = U⊤U = I.
In other words, imposing independence among Gaussian factors does not allow one to disentangle them, and we could just as well recover any linear rotation of those factors. It means that, given the observed x, even though we might assume the right generative model, PCA cannot recover the original signal. However, if we assume that the latent variables are non-Gaussian, then we can recover them, and this is what ICA is trying to achieve; in fact, under these assumptions, the true underlying factors can be recovered (Comon, 1994). Many ICA algorithms are looking for projections of the data, s = Vx, that are maximally non-Gaussian. An intuitive explanation for these approaches is that although the true latent variables h may be non-Gaussian, almost any linear combination of them will look more Gaussian, because of the central limit theorem. Since linear combinations of the x_i’s are also linear combinations of the h_j’s, to recover the h_j’s we just need to find the linear combinations that are maximally non-Gaussian (while keeping these different projections orthogonal to each other).
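The following sketch (NumPy, with excess kurtosis as a crude non-Gaussianity measure; the rotation and sample sizes are arbitrary) illustrates the argument above: rotating independent Gaussian factors leaves them statistically indistinguishable from the originals (identity covariance, near-zero excess kurtosis), whereas rotating independent Laplace factors makes each component look more Gaussian, which is precisely the signal ICA exploits.

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 200000, 2

    def excess_kurtosis(a):
        # Per-column measure of non-Gaussianity (0 for a Gaussian distribution).
        a = a - a.mean(axis=0)
        return np.mean(a ** 4, axis=0) / np.mean(a ** 2, axis=0) ** 2 - 3.0

    U, _ = np.linalg.qr(rng.randn(d, d))          # a random orthonormal (rotation) matrix

    h_gauss = rng.randn(n, d)                     # independent Gaussian factors, unit variance
    h_laplace = rng.laplace(size=(n, d)) / np.sqrt(2.0)   # independent Laplace factors, unit variance

    for name, h in [("Gaussian factors", h_gauss), ("Laplace factors", h_laplace)]:
        z = h.dot(U)                              # rotated factors (z = U^T h in column convention)
        print(name)
        print("  cov(h) close to I:", np.allclose(np.cov(h, rowvar=False), np.eye(d), atol=0.02))
        print("  cov(z) close to I:", np.allclose(np.cov(z, rowvar=False), np.eye(d), atol=0.02))
        print("  excess kurtosis of h:", excess_kurtosis(h))
        print("  excess kurtosis of z:", excess_kurtosis(z))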
There is an interesting connection between ICA and sparsity, since the dominant
form of non-Gaussianity in real data is due to sparsity, i.e., concentration of probability
at or near 0. Non-Gaussian distributions typically have more mass around zero, although one can also get non-Gaussianity by changing the skewness (asymmetry) or the kurtosis.
10.2.4 Sparse Coding as a Generative Model
One particularly interesting form of non-Gaussianity arises with distributions that are
sparse. These typically have not just a peak at 0 but also a fat tail
4
. Like the other
linear factor models (Eq. 10.3), Sparse coding corresponds to a linear factor model, but
one with a “sparse” latent variable h, i.e., P(h) puts high probability at or around 0.
For example the factorized Laplace density prior is
P (h) =
i
P (h
i
) =
i
λ
2
e
λ|h
i
|
(10.5)
and the factorized Student-t prior is
P (h) =
i
P (h
i
)
i
1
1 +
h
2
i
ν
ν+1
2
. (10.6)
Both of these densities have a strong preference for near-zero values but, unlike the
Gaussian, accomodate large values. In the standard sparse coding models, the recon-
struction noise is assumed to be Gaussian, so that the corresponding reconstruction
error is the squared error.
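A quick numerical check of this claim (a sketch with an arbitrary λ): compared with a Gaussian of the same variance, samples from the factorized Laplace prior of Eq. 10.5 put more mass both very close to zero and far out in the tails.

    import numpy as np

    rng = np.random.RandomState(0)
    lam = 4.0                                   # Laplace rate parameter (hypothetical)
    n = 100000

    # Samples from the factorized Laplace prior of Eq. 10.5 (scale = 1/lambda)
    h_laplace = rng.laplace(scale=1.0 / lam, size=n)
    # Gaussian samples with the same variance (2 / lambda^2), for comparison
    h_gauss = rng.randn(n) * np.sqrt(2.0) / lam

    eps = 0.05
    print("P(|h| < 0.05), Laplace :", np.mean(np.abs(h_laplace) < eps))
    print("P(|h| < 0.05), Gaussian:", np.mean(np.abs(h_gauss) < eps))
    print("P(|h| > 1),    Laplace :", np.mean(np.abs(h_laplace) > 1.0))
    print("P(|h| > 1),    Gaussian:", np.mean(np.abs(h_gauss) > 1.0))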
Regarding sparsity, note that the actual value h_i = 0 has zero measure under both densities, meaning that the posterior distribution P(h|x) will not generate values h = 0. However, sparse coding is normally considered under a maximum a posteriori (MAP) framework, in which the inferred values of h are those that maximize the posterior, and these tend to often be zero if the prior is sufficiently concentrated around 0. The inferred values are those defined in Eq. 10.1,

    h = f(x) = arg min_h L(g(h), x) + λΩ(h)    (10.7)

where L(g(h), x) is interpreted as − log P(x|g(h)) and Ω(h) as − log P(h). This MAP inference view of sparse coding and an interesting probabilistic interpretation of sparse coding are further discussed in Section 16.3.
To relate the generative model of sparse coding to ICA, note how the prior imposes not just sparsity but also independence of the latent variables h_i under P(h), which may help to separate different explanatory factors, unlike PCA, factor analysis or probabilistic PCA (because these rely on a Gaussian prior, which yields a factorized prior under any rotation of the factors, i.e., multiplication by an orthonormal matrix).
Sparse coding also has an interesting manifold learning interpretation. The codes h inferred with the above equation do not fill the space in which h lives. Instead, probability mass is concentrated on axis-aligned subspaces: sets of values of h for which most of the axes are set to 0. We can thus decompose h into two pieces of information:

- A binary pattern β which specifies which h_i are non-zero, with N_a = Σ_i β_i the number of “active” (non-zero) dimensions.
- A variable-length real-valued vector α ∈ R^{N_a} which specifies the coordinates for each of the active dimensions.
The pattern β can be viewed as specifying an N_a-dimensional region in input space (the set of x = Wh + b where h_i = 0 if β_i = 0). That region is actually a linear manifold, an N_a-dimensional hyperplane. All those hyperplanes go through a “center” x = b. The vector α then specifies a Euclidean coordinate on that hyperplane.
Because the prior P(h) is concentrated around 0, the probability mass of P(x) is concentrated on the regions of these hyperplanes near x = b. Depending on the amount of reconstruction error (output variance for P(x|g(h))), there is also probability mass bleeding around these hyperplanes, making them look more like pancakes. Each of these hyperplane-aligned manifolds and the associated distribution is just like the ones we associate with probabilistic PCA and factor analysis. The crucial difference is that instead of one hyperplane, we have 2^d hyperplanes if h ∈ R^d. Due to the sparsity prior, however, most of these flat Gaussians are unlikely: only the ones corresponding to a small N_a (with only a few of the axes being active) are likely. For example, if we were to restrict ourselves to only those values of β for which N_a = k, then we would have (d choose k) Gaussians. With this exponentially large number of Gaussians, the interesting thing to observe is that the sparse coding model only has a number of parameters linear in the number of dimensions of h. This property is shared with other distributed representation or sparse representation models, such as the RBM described below as well as other sparse auto-encoders and contractive auto-encoders (see Chapter 13).
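To give a sense of the numbers involved, here is a tiny sketch (with hypothetical dimensions) contrasting the combinatorial count of active patterns with the linear parameter count of the model.

    from math import comb

    d = 1000        # code dimension (hypothetical)
    k = 10          # number of active components per code (hypothetical)
    d_x = 100       # input dimension (hypothetical)

    # Number of distinct k-dimensional hyperplanes, i.e., of active patterns beta with N_a = k
    n_hyperplanes = comb(d, k)
    # Number of parameters of the sparse coding model: dictionary W (d_x by d) plus the bias b
    n_parameters = d_x * d + d_x

    print("hyperplanes with k active components:", n_hyperplanes)   # about 2.6e23
    print("model parameters:", n_parameters)                        # 100100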
10.3 RBMs
A Restricted Boltzmann Machine (RBM)...
10.4 Greedy Layerwise Unsupervised Pre-Training
The unsupervised pre-training protocol has already been introduced in Section 7.8, where it is seen as a form of regularization. It relies on a one-layer representation learning algorithm such as those introduced in this chapter (see also Chapter 13 for more). Each layer is pre-trained by unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution (or relation to other variables such as categories to predict) is hopefully simpler.
Greedy layerwise unsupervised pre-training was introduced in Hinton et al. (2006); Hinton and Salakhutdinov (2006); Bengio et al. (2007); Ranzato et al. (2007). It is called layerwise because it proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm. It is called greedy because the different layers are not jointly trained with respect to a global training objective, which could make the procedure sub-optimal. However, it is also called pre-training, because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together with respect to a criterion of interest. In the context of a supervised learning task, it can be viewed as a regularizer and as a sophisticated form of parameter initialization.
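The following sketch (NumPy, with hypothetical data and layer sizes) illustrates the procedure with shallow auto-encoders as the one-layer learners: each layer is trained on the representation produced by the previously trained layers, and the resulting encoder parameters would then initialize a deep network for joint fine-tuning.

    import numpy as np

    rng = np.random.RandomState(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def train_shallow_autoencoder(X, n_code, lr=0.1, n_epochs=100):
        # One-hidden-layer auto-encoder trained on squared reconstruction error.
        n_in = X.shape[1]
        W = 0.1 * rng.randn(n_in, n_code)
        b = np.zeros(n_code)
        V = 0.1 * rng.randn(n_code, n_in)
        c = np.zeros(n_in)
        for _ in range(n_epochs):
            H = sigmoid(X.dot(W) + b)          # encoder
            R = H.dot(V) + c                   # decoder (linear output)
            dR = 2.0 * (R - X) / X.shape[0]
            dH = dR.dot(V.T) * H * (1 - H)
            V -= lr * H.T.dot(dR); c -= lr * dR.sum(axis=0)
            W -= lr * X.T.dot(dH); b -= lr * dH.sum(axis=0)
        return W, b

    # Greedy layerwise unsupervised pre-training: train one shallow auto-encoder
    # per layer, each on the output of the previously trained layers.
    X = rng.randn(500, 50)                      # hypothetical unlabeled data
    layer_sizes = [40, 30, 20]                  # hypothetical architecture
    pretrained = []
    H = X
    for n_code in layer_sizes:
        W, b = train_shallow_autoencoder(H, n_code)
        pretrained.append((W, b))               # keep encoder weights to initialize the deep net
        H = sigmoid(H.dot(W) + b)               # the output of this layer feeds the next one

    # The parameters in `pretrained` would then initialize a deep network,
    # to be fine-tuned jointly (e.g. with supervised training) on the task of interest.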
However, greedy layerwise unsupervised pre-training can also be used as initialization for other unsupervised learning algorithms, such as deep auto-encoders (Hinton and Salakhutdinov, 2006) (this chapter and Chapter 13), deep belief networks (Hinton et al., 2006) (Section 17.2), or deep Boltzmann machines (Salakhutdinov and Hinton, 2009a) (Section 17.3).
As discussed in Section 8.9.3, it is also possible to have greedy layerwise supervised
pre-training, to help optimize deep supervised networks.
10.5 Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (i.e., distribution P_1) is exploited to improve generalization in another setting (say distribution P_2).

In the case of transfer learning, we consider that the task is different but many of the factors that explain the variations in P_1 are relevant to the variations that need to be captured for learning P_2. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature, e.g., we learn about different visual categories in the first setting and in the second setting. If there is a lot more data in the first setting (sampled from P_1), then that may help to learn representations that are useful to quickly generalize when examples of P_2 are drawn. For example, many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.
In the case of domain adaptation, we consider that the task is the same but the input distribution is different. For example, if we predict sentiment (positive or negative judgement) associated with textual comments posted on the web, the first setting may refer to consumer comments about books, videos and music, while the second setting may refer to televisions or other products. One can imagine that there is an underlying function that tells whether any statement is positive, neutral or negative, but of course the vocabulary, style, accent, etc. may vary from one domain to another, making it more difficult to generalize across domains.
A related problem is that of concept drift, which we can view as a form of trans-
fer learning due to gradual changes in the data distribution over time. Both concept
drift and transfer learning can be viewed as particular forms of multi-task learning
(Section 7.11).
In all of these cases, the objective is to take advantage of data from a first setting to extract information that may be useful when learning or even directly making predictions in the second setting. One of the potential advantages of representation learning for such generalization challenges, and especially of deep representation learning, is that it may considerably help to generalize by extracting and disentangling a set of explanatory factors from data of the first setting, some of which may be relevant to the second setting. In our object recognition example, many of the factors of variation that explain visual categories in natural images remain the same when we move from one set of categories to another. This raises a very interesting and important question which is one of the core questions of this book: what is a good representation? Is it possible to learn representations that disentangle the underlying factors of variation? This theme is further explored in Chapter 14. What we claim is that, in order to maximize our chances of success in transfer learning, domain adaptation, or concept drift, it would be very helpful to learn the most abstract features, because these are the most general, i.e., more likely to be relevant over many domains and many categories, and stable over time.
A good example of unsupervised deep learning applied to transfer learning is its success in two competitions that took place in 2011, with results presented at ICML 2011 (and IJCNN 2011) in one case (Mesnil et al., 2011) (the Transfer Learning Challenge, http://www.causality.inf.ethz.ch/unsupervised-learning.php) and at NIPS 2011 (Goodfellow et al., 2011) in the other case (the Transfer Learning Challenge that was held as part of the NIPS’2011 workshop on Challenges in learning hierarchical models, https://sites.google.com/site/nips2011workshop/transfer-learning-challenge).
Figure 10.7: Results obtained on the Sylvester validation set (Transfer Learning Chal-
lenge). From left to right and top to bottom, respectively 0, 1, 2, 3, and 4 pre-trained
layers. The horizontal axis is the logarithm of the number of labeled training examples in the second setting (test task). The vertical axis is the Area Under the Curve, which reflects classification
accuracy. With deeper representations (learned unsupervised), the learning curves con-
siderably improve, requiring fewer labeled examples to achieve the best generalization.
Figure 10.7 illustrates one of the most striking results of the first of these competitions: as we consider deeper and deeper representations (learned in a purely unsupervised way from data of the first setting), the learning curve on the new categories of the second setting (with a linear classifier) becomes much better, i.e., fewer labeled examples are necessary to achieve the apparently asymptotic generalization performance.