
where in space. This notion is discussed in more details in Chapter 13 in the context of
manifold learning.
For another, it has also been reported many times that training a deep neural network,
and in particular a deep auto-encoder (i.e., one with a deep encoder and a deep decoder), is
more difficult than training a shallow one. This was actually a motivation for the initial
work on the greedy layerwise unsupervised pre-training procedure, described below in
Section 10.4, by which we only need to train a series of shallow auto-encoders in order to
initialize a deep auto-encoder. It was shown early on (Hinton and Salakhutdinov, 2006)
that, if trained properly, such deep auto-encoders could yield much better compression
than corresponding shallow or linear auto-encoders (which are basically doing the same
as PCA, see Section 10.2.1 below). As discussed in Section 14.3, deeper architectures
can be in some cases exponentially more efficient (both in terms of computation and
statistically) than shallow ones. However, because we can usefully pre-train a deep net
by training and stacking shallow ones, it remains interesting to consider single-layer
auto-encoders, as has been done in most of the literature discussed in Chapter 13.
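Since the procedure itself is only described in Section 10.4, the following is just a minimal numpy sketch of the stacking idea, under simplifying assumptions: each shallow auto-encoder here uses a tanh encoder and a linear decoder trained by gradient descent on squared reconstruction error, and all names, layer sizes, and data are hypothetical placeholders rather than a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_shallow_autoencoder(X, n_hidden, n_steps=500, lr=0.1):
    """Fit one shallow auto-encoder (tanh encoder, linear decoder) by
    gradient descent on the mean squared reconstruction error."""
    n_in = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(n_steps):
        H = np.tanh(X @ W_enc)              # code h = f(x)
        X_hat = H @ W_dec                   # reconstruction g(h)
        E = (X_hat - X) / len(X)            # gradient of 0.5 * MSE w.r.t. X_hat
        grad_dec = H.T @ E
        grad_enc = X.T @ (E @ W_dec.T * (1.0 - H ** 2))  # backprop through tanh
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec

# Greedy layer-wise pretraining: each new shallow auto-encoder is trained on
# the codes produced by the previous one.
X = rng.normal(size=(500, 20))              # placeholder data
encoder_stack, codes = [], X
for n_hidden in (10, 5):                    # hypothetical layer sizes
    W_enc, W_dec = train_shallow_autoencoder(codes, n_hidden)
    encoder_stack.append(W_enc)
    codes = np.tanh(codes @ W_enc)
# encoder_stack (with the matching decoders in reverse order) would then be
# used to initialize a deep auto-encoder, fine-tuned end to end.
```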
10.1.3 Reconstruction Distribution
The above “parts” (encoder function f , decoder function g, reconstruction loss L) make
sense when the loss L is simply the squared reconstruction error, but there are many
cases where this is not appropriate, e.g., when x is a vector of discrete variables or
when P(x|h) is not well approximated by a Gaussian distribution.³ Just like in the
case of other types of neural networks (starting with the feedforward neural networks,
Section 6.2.2), it is convenient to define the loss L as a negative log-likelihood over some
target random variables. This probabilistic interpretation is particularly important for
the discussion in Section 17.9, where the output of the auto-encoder is interpreted
as a probability distribution P(x|h), for reconstructing x, given hidden units h. This
distribution captures the uncertainty about the original x that gave rise to h (whether
h was obtained from x deterministically or stochastically). In the simplest and most ordinary cases, this
distribution factorizes, i.e., P(x|h) = ∏_i P(x_i|h). This covers the usual cases of x_i|h
being Gaussian (for unbounded real values) and x_i|h having a Bernoulli distribution
(for binary values x_i), but one can readily generalize this to other distributions, such as
mixtures (see Sections 3.10.5 and 6.2.2).
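As a concrete illustration, here is a minimal numpy sketch of the two factorized reconstruction losses just mentioned; the function names are hypothetical rather than taken from any particular library. For a fixed-variance Gaussian P(x_i|h), the negative log-likelihood reduces to the squared reconstruction error up to an additive constant, while for a Bernoulli P(x_i|h) it is the cross-entropy between x and the decoder's probabilities.

```python
import numpy as np

def gaussian_nll(x, x_mean, sigma=1.0):
    """-log P(x|h) for a factorized Gaussian decoder with fixed variance:
    up to an additive constant, the squared reconstruction error."""
    return np.sum(0.5 * ((x - x_mean) / sigma) ** 2
                  + 0.5 * np.log(2.0 * np.pi * sigma ** 2), axis=-1)

def bernoulli_nll(x, x_prob, eps=1e-7):
    """-log P(x|h) for a factorized Bernoulli decoder over binary x:
    the usual cross-entropy reconstruction loss."""
    p = np.clip(x_prob, eps, 1.0 - eps)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p), axis=-1)
```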
Thus we can generalize the notion of decoding function g(h) to decoding distribution
P (x|h). Similarly, we can generalize the notion of encoding function f(x) to encoding
distribution Q(h|x), as illustrated in Figure 10.3. We use this to capture the fact that
noise is injected at the level of the representation h, now considered as a latent vari-
able. This generalization is crucial in the development of the variational auto-encoder
(Section 17.7.1) and of generative stochastic networks (Section 17.9).
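To make the distinction concrete, the sketch below draws a sample h from a Gaussian encoding distribution Q(h|x) and then computes the parameters of a Bernoulli decoding distribution P(x|h); the injected noise is what makes h behave as a latent variable. The names are hypothetical and the linear parametrizations are chosen only for brevity, not as a statement of how these distributions must be parametrized.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_encoding(x, W_mu, W_logvar):
    """Draw h ~ Q(h|x): a factorized Gaussian whose mean and log-variance
    are (for brevity) linear functions of x."""
    mu, logvar = x @ W_mu, x @ W_logvar
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def decode_params(h, W_dec):
    """Parameters of P(x|h): a factorized Bernoulli whose means are the
    sigmoid of a linear function of h."""
    return 1.0 / (1.0 + np.exp(-(h @ W_dec)))

# One stochastic encode/decode pass on placeholder data and weights.
x = rng.integers(0, 2, size=(4, 20)).astype(float)
W_mu = rng.normal(size=(20, 5))
W_logvar = rng.normal(scale=0.1, size=(20, 5))
W_dec = rng.normal(size=(5, 20))
h = sample_encoding(x, W_mu, W_logvar)    # noise injected into h
x_probs = decode_params(h, W_dec)         # P(x|h) expresses reconstruction uncertainty
```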
We also find a stochastic encoder and a stochastic decoder in the RBM, described below
(Section 10.3). In that case, the encoding distribution Q(h|x) and the decoding distribution P(x|h) “match”,
in the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distribution which has
³ See the link between squared error and normal density in Sections 5.6 and 6.2.2.