
that, if trained properly, such deep auto-encoders could yield much better compression
than corresponding shallow or linear auto-encoders (which are basically doing the same
as PCA, see Section 16.5 below). As discussed in Section 17.7, deeper architectures
can in some cases be exponentially more efficient (both computationally and statistically) than shallow ones. However, because a deep network can usefully be pre-trained by training and stacking shallow ones, it is interesting to consider single-layer (or at least shallow and easy-to-train) auto-encoders, as has been done in most of the
literature discussed in this chapter.
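As a rough illustration of this stacking idea, the following is a minimal NumPy sketch of greedy layer-wise pre-training with tied-weight sigmoid auto-encoders; the helper names (train_shallow_autoencoder, greedy_pretrain) and the squared-error training loop are illustrative assumptions rather than a reference implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_shallow_autoencoder(data, n_hidden, n_epochs=50, lr=0.1, seed=0):
    # Single-layer, tied-weight sigmoid auto-encoder trained with
    # squared reconstruction error by plain gradient descent.
    n_visible = data.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(n_epochs):
        h = sigmoid(data @ W + b_h)        # encoder f(x)
        r = sigmoid(h @ W.T + b_v)         # decoder g(h), tied weights
        d_r = (r - data) * r * (1 - r)     # gradient at decoder pre-activation
        d_h = (d_r @ W) * h * (1 - h)      # gradient at encoder pre-activation
        W -= lr * (data.T @ d_h + (h.T @ d_r).T) / len(data)
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_r.mean(axis=0)
    return W, b_h

def greedy_pretrain(data, layer_sizes):
    # Train one shallow auto-encoder per level, each on the codes produced
    # by the level below; the weights can then initialize a deep auto-encoder.
    weights, codes = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_shallow_autoencoder(codes, n_hidden)
        weights.append((W, b_h))
        codes = sigmoid(codes @ W + b_h)
    return weights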
16.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss L) make
sense when the loss L is simply the squared reconstruction error, but there are many
cases where this is not appropriate, e.g., when x is a vector of discrete variables or
when P(x|h) is not well approximated by a Gaussian distribution.⁴ Just like in the
case of other types of neural networks (starting with the feedforward neural networks,
Section 6.2.2), it is convenient to define the loss L as a negative log-likelihood over
some target random variables. This probabilistic interpretation is particularly impor-
tant for the discussion in Sections 21.8.2, 21.9 and 21.10 about generative extensions of
auto-encoders and stochastic recurrent networks, where the output of the auto-encoder
is interpreted as a probability distribution P (x|h), for reconstructing x, given hid-
den units h. This distribution captures not just the expected reconstruction but also
the uncertainty, given h, about the original x (which gave rise to h, either deterministically or stochastically). In the simplest and most ordinary cases, this distribution factorizes, i.e., P(x|h) = ∏_i P(x_i|h). This covers the usual cases of x_i|h being Gaussian (for unbounded real values) and x_i|h having a Bernoulli distribution (for binary values x_i), but one can readily generalize this to other distributions, such as mixtures (see Sections 3.10.5 and 6.2.2).
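For instance (a minimal NumPy sketch; the decoder parameters W_dec, b_dec and the linear-plus-sigmoid form are assumptions made for illustration), taking L = -log P(x|h) with a factorized Bernoulli P(x|h) gives the familiar cross-entropy reconstruction loss, while a factorized Gaussian with fixed variance recovers squared error up to a constant.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_reconstruction_nll(x, h, W_dec, b_dec, eps=1e-7):
    # L = -log P(x|h) with P(x|h) = prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}
    # and p = sigmoid(W_dec h + b_dec): the binary cross-entropy.
    p = np.clip(sigmoid(h @ W_dec.T + b_dec), eps, 1 - eps)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=-1)

def gaussian_reconstruction_nll(x, h, W_dec, b_dec, sigma=1.0):
    # For a factorized Gaussian P(x_i|h) = N(mu_i(h), sigma^2), the negative
    # log-likelihood is the squared error up to scaling and an additive constant.
    mu = h @ W_dec.T + b_dec
    return (0.5 / sigma**2) * np.sum((x - mu) ** 2, axis=-1) \
           + 0.5 * x.shape[-1] * np.log(2 * np.pi * sigma**2)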
Thus we can generalize the notion of decoding function g(h) to decoding distribution
P (x|h). Similarly, we can generalize the notion of encoding function f(x) to encoding
distribution Q(h|x), as illustrated in Figure 16.3. We use this to capture the fact
that noise is injected at the level of the representation h, which is now treated as a latent
variable. This generalization is crucial in the development of the variational auto-encoder
(Section 21.8.2) and the generative stochastic networks (Section 21.10).
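As an illustration of such an encoding distribution (a sketch only; the diagonal-Gaussian form and the parameter names W_mu, W_logvar are assumptions for this example, of the kind used in the variational auto-encoder of Section 21.8.2), Q(h|x) can be a Gaussian whose mean and variance are computed from x, from which a noisy representation h is sampled.

import numpy as np

def sample_gaussian_encoder(x, W_mu, b_mu, W_logvar, b_logvar, rng=None):
    # Encoding distribution Q(h|x) = N(mu(x), diag(sigma^2(x))): noise is
    # injected at the level of h, which is treated as a latent variable.
    rng = np.random.default_rng() if rng is None else rng
    mu = x @ W_mu.T + b_mu                 # mean of Q(h|x)
    log_var = x @ W_logvar.T + b_logvar    # log-variance of Q(h|x)
    eps = rng.standard_normal(mu.shape)    # auxiliary noise
    h = mu + np.exp(0.5 * log_var) * eps   # sample h ~ Q(h|x)
    return h, mu, log_var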
We also find a stochastic encoder and a stochastic decoder in the RBM, described
in Section 21.1. In that case, the encoding distribution Q(h|x) and the decoding distribution P(x|h) “match”, in
the sense that Q(h|x) = P (h|x), i.e., there is a unique joint distribution which has both
Q(h|x) and P (x|h) as conditionals. This is not true in general for two independently
parametrized conditionals like Q(h|x) and P (x|h), although the work on generative
stochastic networks (Alain et al., 2015) shows that learning will tend to make them
compatible asymptotically (with enough capacity and examples).
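For concreteness, in a binary RBM with energy E(x, h) = -b^T x - c^T h - x^T W h (a generic parametrization that may differ in notation from Section 21.1), both conditionals follow from the same joint P(x, h) ∝ exp(-E(x, h)) and factorize into sigmoids:

P(h_j = 1 | x) = sigmoid(c_j + Σ_i W_ij x_i),    P(x_i = 1 | h) = sigmoid(b_i + Σ_j W_ij h_j),

so the encoding distribution is exactly the posterior P(h|x) rather than a separately parametrized Q(h|x).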
⁴ See the link between squared error and normal density in Sections 5.6 and 6.2.2.