
that characterizes the manifold hypothesis: when a configuration is probable it
is generally surrounded (at least in some directions) by other probable configura-
tions. If a configuration of pixels looks like a natural image, then there are tiny
changes one can make to the image (like translating everything by 0.1 pixel to the
left) that yield another natural-looking image. The number of independent ways
in which a probable configuration can be locally transformed into another probable
configuration (each way characterized by a number indicating how far to move along
it) gives the local dimension of the manifold. Whereas maximum
likelihood procedures tend to concentrate probability mass on the training ex-
amples (which can each become a local maximum of probability when the model
overfits), the manifold hypothesis suggests that good solutions instead concen-
trate probability along ridges of high probability (or their high-dimensional gen-
eralization) that connect nearby examples to each other. This is illustrated in
Figure 17.1.
What is most commonly learned to characterize a manifold is a representation
of the data points on (or near, i.e., projected onto) the manifold. Such a representa-
tion for a particular example is also called its embedding. It is typically given by a
low-dimensional vector, with fewer dimensions than the “ambient” space of which
the manifold is a low-dimensional subset. Some algorithms (non-parametric man-
ifold learning algorithms, discussed below) directly learn an embedding for each
training example, while others learn a more general mapping, sometimes called
an encoder, or representation function, that maps any point in the ambient space
(the input space) to its embedding.
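To make this distinction concrete, the following minimal sketch (an illustration
only; it uses NumPy and a linear, PCA-style model, whereas the methods discussed
in this chapter are generally nonlinear, and the variable names are ours) fits an
encoder that maps any point of a 10-dimensional ambient space to a 2-dimensional
embedding:

import numpy as np

# Fit a linear encoder that maps points from the D-dimensional ambient
# space to a d-dimensional embedding. Because the encoder is a function
# of its input, it can embed new points, unlike non-parametric methods
# that only assign an embedding to each training example.
rng = np.random.default_rng(0)

# Synthetic data lying near a 2-D plane embedded in a 10-D ambient space.
latent = rng.normal(size=(500, 2))            # true 2-D coordinates
plane = rng.normal(size=(2, 10))              # how the plane sits in 10-D
X = latent @ plane + 0.01 * rng.normal(size=(500, 10))

d = 2
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:d]                                    # rows span the estimated manifold

def encoder(x):
    """Map an ambient-space point to its d-dimensional embedding."""
    return (x - mean) @ W.T

print(encoder(X[0]))                          # embedding of the first example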
Another important characterization of a manifold is the set of its tangent
planes. At a point x on a d-dimensional manifold, the tangent plane is given by d
basis vectors that span the local directions of variation allowed on the manifold.
As illustrated in Figure 17.2, these local directions specify how one can change x
infinitesimally while staying on the manifold.
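As a concrete worked example (chosen here for illustration; it does not appear in
the text), consider the one-dimensional manifold formed by a circle of radius r in
the two-dimensional ambient plane, parametrized by the angle θ:

\[
  x(\theta) = \begin{pmatrix} r\cos\theta \\ r\sin\theta \end{pmatrix},
  \qquad
  \frac{dx}{d\theta} = \begin{pmatrix} -r\sin\theta \\ r\cos\theta \end{pmatrix}.
\]

Here d = 1, the tangent plane at x(θ) is spanned by the single basis vector dx/dθ,
and a small step from x(θ) to x(θ) + ε dx/dθ changes x infinitesimally while
staying on the manifold to first order in ε.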
Manifold learning has mostly focused on unsupervised learning procedures
that attempt to capture these manifolds. Most of the initial machine learning re-
search on learning non-linear manifolds has focused on non-parametric methods
based on the nearest-neighbor graph. This graph has one node per training ex-
ample and edges connecting near neighbors. Basically, these methods (Schölkopf
et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin
and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton
and Roweis, 2003; van der Maaten and Hinton, 2008a) associate each of these
nodes with a tangent plane that spans the directions of variation associated with
the difference vectors between the example and its neighbors, as illustrated in
Figure 17.4.
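A minimal sketch of this recipe (illustrative only, not a reproduction of any
particular algorithm cited above) estimates a tangent basis at each training
example from the leading singular vectors of the difference vectors to its
nearest neighbors:

import numpy as np

# Data sampled near a 1-D manifold (a circle) embedded in 3-D, with a
# small amount of noise in the third coordinate.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.stack([np.cos(theta), np.sin(theta),
              0.01 * rng.normal(size=200)], axis=1)

def local_tangent(X, i, k=10, d=1):
    """Estimate a d-dimensional tangent basis at training example i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]    # skip the point itself
    diffs = X[neighbors] - X[i]               # difference vectors to neighbors
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:d]                             # rows span the estimated tangent plane

# For the circle, the estimate is close (up to sign) to the analytic
# tangent direction (-sin(theta[0]), cos(theta[0]), 0).
print(local_tangent(X, 0))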
A global coordinate system can then be obtained through an optimization or