1. Learning a representation h of training examples x such that x can be approximately recovered from h through a decoder. Note that this need not be true for any x, only for those that are probable under the data-generating distribution.
2. Imposing some constraint or regularization, either on the code h or on the composition of the encoder/decoder, so as to make the transformed data somehow simpler or to prevent the auto-encoder from achieving perfect reconstruction everywhere. We can think of these constraints or regularizers as a preference for solutions in which the representation is as simple as possible, e.g., factorized or as constant as possible, in as many directions as possible. In the case of the bottleneck auto-encoder, a fixed number of representation dimensions is allowed, smaller than the dimension of x. In the case of sparse auto-encoders (Section 16.7), the representation elements h_i are pushed towards 0. In the case of denoising auto-encoders (Section 16.8), the encoder/decoder function is encouraged to be contractive (have small derivatives). In the case of the contractive auto-encoder (Section 16.9), the encoder function alone is encouraged to be contractive, while the decoder function is tied (by symmetric weights) to the encoder function. In the case of the variational auto-encoder (Section 21.8.2), a prior log P(h) is imposed on h to make its distribution factorize and concentrate as much as possible. Note how, in the limit, all of these regularizers prefer representations that are insensitive to the input (a concrete sketch of one such penalty follows this list).
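To make the second force concrete, here is a minimal NumPy sketch of a contractive penalty for a tied-weight auto-encoder with a sigmoid encoder. The function cae_loss, the linear decoder, and the weight lam are illustrative assumptions introduced here, not notation from this book; for a sigmoid encoder the Jacobian has a closed form, which the sketch exploits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_loss(x, W, b_enc, b_dec, lam=0.1):
    """Squared reconstruction error plus contractive penalty for a
    tied-weight auto-encoder with sigmoid encoder h = sigmoid(W x + b_enc)."""
    h = sigmoid(W @ x + b_enc)      # encoder: code h = f(x)
    r = W.T @ h + b_dec             # decoder tied to the encoder by W^T
    recon = np.sum((x - r) ** 2)
    # Encoder Jacobian: J = diag(h * (1 - h)) @ W, so the Frobenius
    # penalty ||J||_F^2 has this closed form (no autodiff needed).
    penalty = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
    return recon + lam * penalty

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 5))   # 5-dim input, 3-dim code
x = rng.normal(size=5)
print(cae_loss(x, W, b_enc=np.zeros(3), b_dec=np.zeros(5)))
```

Minimizing the penalty alone would drive all derivatives to zero, which is exactly the degenerate "insensitive to the input" solution discussed above; it is the reconstruction term that keeps the representation informative.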
Clearly, the second force alone would make no sense (nor would any regularizer on its own, in general). How can these two forces (reconstruction error on the one hand, and “simplicity” of the representation on the other) be reconciled? The solution of the optimization problem is that only the variations needed to distinguish training examples end up being represented. If the data-generating distribution concentrates near a low-dimensional manifold, this yields representations that implicitly capture a local coordinate system for this manifold: only the variations tangent to the manifold around x need to correspond to changes in h = f(x). Hence the encoder learns a mapping from the input space, in which x is embedded, to a representation space, a mapping that is sensitive only to changes along the manifold directions and insensitive to changes orthogonal to the manifold. This idea is illustrated in Figure 18.12. A one-dimensional example is illustrated in Figure 18.13, showing that by making the auto-encoder contractive around the data points (and making the reconstruction point towards the nearest data point), we recover the manifold structure (here, a set of 0-dimensional manifolds in a 1-dimensional embedding space).
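The claim that the learned representation is sensitive only along manifold directions can be checked numerically: estimate the Jacobian of the encoder f at a data point and inspect its singular value decomposition. The sketch below uses a hypothetical hand-built encoder as a stand-in for a trained one, and jacobian_fd is a helper introduced here purely for illustration.

```python
import numpy as np

def encoder(x):
    # Hypothetical stand-in for a trained encoder f(x); by construction
    # it ignores the third input direction (as if orthogonal to the manifold).
    W = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
    return np.tanh(W @ x)

def jacobian_fd(f, x, eps=1e-5):
    """Finite-difference estimate of the Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = np.array([0.1, -0.2, 0.3])
J = jacobian_fd(encoder, x)
U, s, Vt = np.linalg.svd(J)
print("singular values:", s)             # large values: directions h responds to
print("insensitive direction:", Vt[-1])  # null-space direction, ~[0, 0, 1]
```

For an encoder trained with one of the regularizers above on data concentrated near a manifold, the same diagnostic would show the singular values dropping off sharply beyond the manifold dimension.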
18.4 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier
One of the early attempts to take advantage of the manifold hypothesis is the Tangent
Distance algorithm (Simard et al., 1993, 1998). It is a non-parametric nearest-neighbor
algorithm in which the metric used is not the generic Euclidean distance but one that