
is also a regularization on the parameters, but one that depends on the particular
data distribution.
16.2 Representational Power, Layer Size and Depth
Nothing in the above description of auto-encoders restricts the encoder or decoder
to be shallow, but in the literature on the subject, most trained auto-encoders
have had a single hidden layer which is also the representation layer or code.³
First, we know from the usual universal approximation abilities of single
hidden-layer neural networks that a sufficiently large hidden layer can approximate
any function to a given accuracy. This observation justifies overcomplete auto-
encoders: in order to represent a rich enough distribution, one probably needs
many hidden units in the intermediate representation layer. We also know that
Principal Components Analysis (PCA) corresponds to an undercomplete auto-
encoder with no intermediate non-linearity, and that PCA can only capture a set
of directions of variation that are the same everywhere in space. This notion is
discussed in more detail in Chapter 18 in the context of manifold learning.
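To make the connection with PCA concrete, the following is a minimal NumPy sketch (not taken from the text; the data, dimensions and learning rate are arbitrary choices for illustration) of an undercomplete linear auto-encoder trained by gradient descent on the squared reconstruction error. Because both the encoder and the decoder are linear, the best it can do is reconstruct within the principal subspace, so its reconstruction error should approach that of projecting onto the top principal components.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in 10 dimensions with 3 dominant directions of variation.
n, d, k = 500, 10, 3
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d)) + 0.1 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)                         # center the data, as PCA does

# PCA reconstruction using the top-k principal directions (right singular vectors).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                   # d x k
pca_err = np.mean((X - X @ V @ V.T) ** 2)

# Undercomplete linear auto-encoder: code h = x W_e, reconstruction x_hat = h W_d.
W_e = 0.01 * rng.normal(size=(d, k))
W_d = 0.01 * rng.normal(size=(k, d))
lr = 0.01
for _ in range(5000):
    H = X @ W_e
    R = H @ W_d - X                            # reconstruction residual
    grad_Wd = (H.T @ R) / n                    # gradients of the mean squared error
    grad_We = (X.T @ (R @ W_d.T)) / n
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We

ae_err = np.mean((X @ W_e @ W_d - X) ** 2)
print(f"PCA reconstruction error:                 {pca_err:.5f}")
print(f"linear auto-encoder reconstruction error: {ae_err:.5f}")  # should be close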
Second, it has also been reported many times that training a deep neural
network, and in particular a deep auto-encoder (i.e. with a deep encoder and a
deep decoder) is more difficult than training a shallow one. This was actually a
motivation for the initial work on the greedy layerwise unsupervised pre-training
procedure, described below in Section 17.1, by which we only need to train a series
of shallow auto-encoders in order to initialize a deep auto-encoder. It was shown
early on (Hinton and Salakhutdinov, 2006) that, if trained properly, such deep
auto-encoders could yield much better compression than corresponding shallow or
linear auto-encoders (which are basically doing the same as PCA, see Section 16.5
below). As discussed in Section 17.7, deeper architectures can in some cases be
exponentially more efficient (both computationally and statistically) than
shallow ones. However, because we can usefully pre-train a deep net by training
and stacking shallow ones, it remains interesting to consider single-layer (or at
least shallow and easy-to-train) auto-encoders, as has been done in most of the
literature discussed in this chapter.
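As a rough illustration of this idea, here is a simplified sketch of greedy layerwise pre-training with shallow auto-encoders (not the exact procedure of Section 17.1; the sigmoid units, layer sizes, learning rate and random data are assumptions made only for the example). Each shallow auto-encoder is trained on the codes produced by the previously trained encoders, and the resulting stack of encoders provides the initialization of a deep encoder that would then be fine-tuned.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_shallow_autoencoder(X, n_hidden, lr=0.05, n_steps=2000):
    """Train one shallow auto-encoder h = sigmoid(X W + b), x_hat = h W_d + c
    by gradient descent on the mean squared reconstruction error; return the
    encoder parameters (W, b), which is all that is kept for stacking."""
    n, d = X.shape
    W = 0.1 * rng.normal(size=(d, n_hidden))
    b = np.zeros(n_hidden)
    W_d = 0.1 * rng.normal(size=(n_hidden, d))
    c = np.zeros(d)
    for _ in range(n_steps):
        H = sigmoid(X @ W + b)                 # code
        R = H @ W_d + c - X                    # reconstruction residual
        dH = (R @ W_d.T) * H * (1.0 - H)       # back-propagate through the sigmoid
        W_d -= lr * (H.T @ R) / n
        c -= lr * R.mean(axis=0)
        W -= lr * (X.T @ dH) / n
        b -= lr * dH.mean(axis=0)
    return W, b

# Hypothetical input data and layer sizes, chosen only for illustration.
X = rng.normal(size=(1000, 50))
layer_sizes = [30, 20, 10]

# Greedy layerwise pre-training: each new shallow auto-encoder reconstructs the
# codes produced by the encoders trained so far.
encoder_params = []
H = X
for n_hidden in layer_sizes:
    W, b = train_shallow_autoencoder(H, n_hidden)
    encoder_params.append((W, b))
    H = sigmoid(H @ W + b)                     # codes fed to the next layer

# The stacked encoders initialize a deep encoder (a matching deep decoder would
# be built similarly, and the whole model fine-tuned afterwards).
def deep_encode(X, encoder_params):
    H = X
    for W, b in encoder_params:
        H = sigmoid(H @ W + b)
    return H

print(deep_encode(X, encoder_params).shape)    # (1000, 10)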
16.3 Reconstruction Distribution
The above “parts” (encoder function f, decoder function g, reconstruction loss
L) make sense when the loss L is simply the squared reconstruction error, but
³ As argued in this book, this is probably not a good choice, and we would like to independently
control the constraints on the representation, e.g. dimension and sparsity of the code, and the
capacity of the encoder.