Chapter 14
Distributed Representations:
Disentangling the Underlying
Factors
What is a good representation? Many answers are possible, and this remains a question
to be further explored in future research. What we propose as an answer in this book is that
in general, a good representation is one that makes further learning tasks easy. In an
unsupervised learning setting, this could mean that the joint distribution of the different
elements of the representation (e.g., elements of the representation vector h) is one that
is easy to model (e.g., in the extreme, these elements are marginally independent of
each other). But that would not be enough: a representation that throws away all
information (e.g., h = 0 for all inputs x) is very easy to model but is also useless. Hence
we want to learn a representation that keeps the information (or at least all the relevant
information, in the supervised case) and makes it easy to learn functions of interest from
this representation.
What we put forward as a hypothesis, going a bit further, is that an ideal represen-
tation is one that disentangles the underlying causal factors of variation that generated
the observed data. Note that this may be different from “easy to model”, but we further
assume that for most problems of interest, these two properties coincide: once we “un-
derstand” the underlying explanations for what we observe, it generally becomes easy
to predict one thing from others.
14.1 Causality and Semi-Supervised Learning
A very basic question is whether unsupervised learning on input variables x can yield
representations that are useful when later trying to learn to predict some target vari-
able y, given x. More generally, when does semi-supervised learning work? See also
Section 7.10 for an earlier discussion.
It turns out that the answer to this question depends strongly on the underlying relationship between x and y. Put differently, the question is whether P(y|x), seen as a function of x, has anything to do with P(x). If not, then unsupervised learning
of P (x) can be of no help to learn P (y|x). Consider for example the case where P (x) is
uniformly distributed and E[y|x] is some function of interest. Clearly, observing x alone
gives us no information about P (y|x). As a better case, consider the situation where x
arises from a mixture, with one component per value of y. If the mixture components
are well separated, then modeling P (x) tells us precisely where each component is, and
a single labeled example of each class will then be enough to perfectly learn P(y|x).
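To make the mixture intuition concrete, here is a minimal sketch (with invented one-dimensional data, using scikit-learn's GaussianMixture purely for illustration): fitting P(x) without labels recovers the two well-separated components, and a single labeled example per class then suffices to attach class labels to them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated classes: x | y=0 ~ N(-5, 1), x | y=1 ~ N(+5, 1).
x_unlabeled = np.concatenate([rng.normal(-5, 1, 500),
                              rng.normal(+5, 1, 500)])[:, None]

# Unsupervised step: model P(x) only.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x_unlabeled)

# One labeled example per class is enough to name the components.
x_labeled = np.array([[-5.0], [5.0]])
y_labeled = [0, 1]
component_to_class = {gmm.predict(x_labeled[i:i + 1])[0]: y_labeled[i] for i in range(2)}

# Predict y for new inputs by mapping each inferred component to its class.
x_test = np.array([[-4.2], [5.3]])
print([component_to_class[c] for c in gmm.predict(x_test)])  # expected: [0, 1]
```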
But more generally, what could make P (y|x) and P (x) tied together?
If y is closely associated with one of the causal factors of x, then, as first argued
by Janzing et al. (2012), P (x) and P (y|x) will be strongly tied, and unsupervised
representation learning that tries to disentangle the underlying factors of variation is
likely to be useful as a semi-supervised learning strategy.
Consider the assumption that y is one of the causal factors of x, and let h represent
all those factors. Then the true generative process is structured according to this directed
graphical model, with h as the parent of x:
P(h, x) = P(x | h) P(h).

As a consequence, the data has marginal probability

P(x) = ∫_h P(x | h) p(h) dh,

or, in the discrete case (like in the mixture example above):

P(x) = ∑_h P(x | h) P(h).
From this straightforward observation, we conclude that the best possible model of x
(from a generalization point of view) is the one that uncovers the above “true” structure,
with h as a latent variable that explains the observed variations in x. The “ideal”
representation learning discussed above should thus recover these latent factors. If y is
one of them (or closely related to one of them), then it will be very easy to learn to
predict y from such a representation. We also see that the conditional distribution of y
given x is tied by Bayes rule to the components in the above equation:
P(y | x) = P(x | y) P(y) / P(x).
Thus the marginal P(x) is intimately tied to the conditional P(y|x), and knowledge of the structure of the former should be helpful in learning the latter, i.e., semi-supervised learning works. Furthermore, since we do not know in advance which of the factors in h will be the one of interest, say y = h_i, an unsupervised learner should learn a representation that disentangles all the generative factors h_i from each other, making it easy to predict y from h.
In addition, as pointed out by Janzing et al. (2012), if the true generative process
has x as an effect and y as a cause, then modeling P (x|y) is robust to changes in P (y).
If the cause-effect relationship were reversed, this would not be true, since by Bayes' rule, P(x|y) would then be sensitive to changes in P(y). Very often, when we consider changes
in distribution due to different domains, temporal non-stationarity, or changes in the
nature of the task, the causal mechanisms remain invariant (“the laws of the universe
are constant”) whereas what changes are the marginal distribution over the underlying
causes (or what factors are linked to our particular task). Hence, better generalization
and robustness to all kinds of changes can be expected via learning a generative model
that attempts to recover the causal factors h and P (x|h).
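The following small numeric sketch (with made-up probability tables) illustrates this point: when y causes x, the mechanism P(x|y) can stay fixed while the marginal P(y) shifts across domains, and the conditional P(y|x) obtained by Bayes' rule changes accordingly.

```python
import numpy as np

# P(x | y): rows index y in {0, 1}, columns index x in {0, 1}; this is the
# causal mechanism, assumed to stay fixed across domains.
p_x_given_y = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

def posterior(p_y):
    """P(y | x) from Bayes' rule: P(x | y) P(y) / P(x)."""
    joint = p_x_given_y * p_y[:, None]               # P(x, y), indexed [y, x]
    return joint / joint.sum(axis=0, keepdims=True)  # normalize over y

print(posterior(np.array([0.5, 0.5])))  # original domain
print(posterior(np.array([0.1, 0.9])))  # shifted P(y): P(x|y) unchanged, P(y|x) differs
```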
14.2 Assumption of Underlying Factors and Distributed
Representation
A very basic idea that comes out of the above discussion and of the notion of “disentangled factors” is that there are underlying factors that generate the observed data. This is a core assumption behind most neural network and deep learning research, which relies more precisely on the notion of distributed representation.
What we call a distributed representation is one that can express an exponentially large number of concepts by composing the activations of many features. An example of a distributed representation is a vector of n binary features, which can take 2^n configurations, each potentially corresponding to a different region in input space. This
can be compared with a symbolic representation, where the input is associated with a
single symbol or category. If there are n symbols in the dictionary, one can imagine n
feature detectors, each corresponding to the detection of the presence of the associated
category. In that case only n different configurations of the representation-space are
possible, carving n different regions in input space. Such a symbolic representation is
also called a one-hot representation, since it can be captured by a binary vector with
n bits that are mutually exclusive (only one of them can be active). These ideas are
developed further in the next section.
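As a toy illustration of this contrast (with invented symbols and attributes), compare a one-hot code, which distinguishes only as many configurations as it has symbols, with a small set of shared binary attributes, which can in principle address exponentially many configurations.

```python
import itertools

# Symbolic (one-hot) code: one exclusive bit per concept, no shared structure.
symbols = ["cat", "dog", "car", "truck"]
one_hot = {s: [int(i == j) for j in range(len(symbols))]
           for i, s in enumerate(symbols)}

# Distributed code: concepts described by shared binary attributes.
attributes = ["is_animal", "has_wheels", "is_large"]
distributed = {
    "cat":   [1, 0, 0],
    "dog":   [1, 0, 1],   # shares "is_animal" with "cat", so knowledge can transfer
    "car":   [0, 1, 0],
    "truck": [0, 1, 1],
}
print(one_hot["dog"], distributed["dog"])

# The one-hot code distinguishes exactly len(symbols) configurations, while the
# distributed code can in principle address 2**len(attributes) of them.
print(len(symbols), 2 ** len(attributes))
print(list(itertools.product([0, 1], repeat=len(attributes))))  # all 8 configurations
```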
Examples of learning algorithms based on non-distributed representations include:
• Clustering methods, including the k-means algorithm: only one cluster “wins” the competition (see the sketch after this list).
• K-nearest neighbors algorithms: only one template or prototype example is asso-
ciated with a given input.
• Decision trees: only one leaf (and the nodes on the path from root to leaf) is
activated when an input is given.
• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or
experts are now associated with a degree of activation, which makes the posterior
probability of components (or experts) given input look more like a distributed
representation. However, as discussed in the next section, these models still suffer
from a poor statistical scaling behavior compared to those based on distributed
representations (such as products of experts and RBMs).
• Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each “support vector” or template example is now continuous-valued, the same issue arises as with Gaussian mixtures.
• Language or translation models based on N-grams: the set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes (e.g., a leaf may correspond to the last two words being w_1 and w_2), and separate parameters are estimated for each leaf of the tree (with some sharing possible of the parameters associated with internal nodes, between the leaves of the sub-tree rooted at the same internal node).
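A minimal sketch of the “one winner” behavior shared by the methods above, using k-means (via scikit-learn, chosen only for brevity) on toy data: each input activates exactly one template, which yields a one-hot, non-distributed code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)
assignments = kmeans.predict(x)        # exactly one winning cluster per input
one_hot_code = np.eye(4)[assignments]  # the induced non-distributed (one-hot) code
print(one_hot_code[:3])                # each row has a single active entry
```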
Figure 14.1: Illustration of how a learning algorithm based on a non-distributed rep-
resentation breaks up the input space into regions, with a separate set of parameters
for each region. For example, a clustering algorithm or a 1-nearest-neighbor algorithm
associates one template (colored X) to each region. This is also true of decision trees,
mixture models, and kernel machines with a local (e.g., Gaussian) kernel. In the latter
algorithms, the output is not piecewise constant but instead interpolates between neigh-
boring regions, but the relationship between the number of parameters (or examples)
and the number of regions they can define remains linear. The advantage is that a differ-
ent answer (e.g., density function, predicted output, etc.) can be independently chosen
for each region. The disadvantage is that there is no generalization to new regions,
except by extending the answer for which there is data, exploiting solely a smoothness
prior. This makes it difficult to learn a complicated function, with more ups and downs
than the available number of examples. Contrast this with a distributed representation,
Figure 14.2.
An important related concept that distinguishes a distributed representation from a
symbolic one is that generalization arises due to shared attributes between different con-
cepts. As pure symbols, “cat” and “dog” are as far from each other as any other two
symbols. However, if one associates them with a meaningful distributed representation,
then many of the things that can be said about cats can generalize to dogs and vice-
versa. This is what allows neural language models to generalize so well (Section 20.3).
Distributed representations induce a rich similarity space, in which semantically close
concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations. Of course, one would obtain a distributed representation if one associated multiple symbolic attributes with each symbol.
Figure 14.2: Illustration of how a learning algorithm based on a distributed represen-
tation breaks up the input space into regions, with exponentially more regions than
parameters. Instead of a single partition (as in the non-distributed case, Figure 14.1),
we have many partitions, one per “feature”, and all their possible intersections. In the
example of the figure, there are 3 binary features C1, C2, and C3, each corresponding
to partitioning the input space in two regions according to a hyperplane, i.e., each is
a linear classifier. Each possible intersection of these half-planes forms a region, i.e., each region corresponds to a configuration of the bits specifying whether each feature is 0 or 1, that is, on which side of its hyperplane the input falls. If the input space is large enough, the number of regions grows exponentially with the number of features, i.e., with the number of parameters. However, the way these regions carve up the input space still depends on only a small number of parameters: this huge number of regions is not placed independently. We can thus represent a function that looks complicated but actually has structure. Basically, the assumption is that one can learn about each feature without having to see examples of all the configurations of all the other features, i.e., these features correspond to underlying factors explaining the data.
Note that a sparse representation is a distributed representation where the number of attributes that are active together is small compared to the total number of attributes. For example, in the case of binary representations, one might have only k ≪ n of the n bits that are non-zero. The power of the representation grows exponentially with the number of active attributes, e.g., O(n^k) in the above example of binary vectors. At the extreme, a symbolic representation is a very sparse representation where only one attribute at a time can be active.
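A quick numeric check of this claim (with arbitrary choices of n and k): the number of configurations expressible with exactly k active bits out of n, the binomial coefficient C(n, k), grows on the order of n^k for fixed k, compared with only n configurations for a one-hot code.

```python
from math import comb

n = 1000
for k in (1, 2, 3, 5):
    # Number of sparse codes with exactly k of the n bits active.
    print(k, comb(n, k))
```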
14.3 Exponential Gain in Representational Efficiency from
Distributed Representations
When and why can there be a statistical advantage from using a distributed represen-
tation as part of a learning algorithm?
Figures 14.1 and 14.2 explain that advantage in intuitive terms. The argument is
that a function that “looks complicated” can be compactly represented using a small
number of parameters, if some “structure” is uncovered by the learner. Traditional “non-
distributed” learning algorithms generalize only thanks to the smoothness assumption,
which states that if u ≈ v, then the target function f to be learned has the property that f(u) ≈ f(v), in general. There are many ways of formalizing such an assumption, but the end result is that if we have an example (x, y) for which we know that f(x) ≈ y, then we choose an estimator f̂ that approximately satisfies these constraints while changing as little as possible. This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that takes many different values (e.g., many ups and downs) in a large number of regions (possibly exponentially many: in a d-dimensional space with at least two values to distinguish per dimension, we might want f to differ across 2^d different regions, requiring O(2^d) training examples), we may need a number of examples that is at least as large as the number of distinguishable regions. One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary mapping from symbol to value. However, this does not allow us to generalize to new symbols, new regions.
If we are lucky, there may be some regularity in the target function, besides being
smooth. For example, the same pattern of variation may repeat itself many times (e.g.,
as in a periodic function or a checkerboard). If we only use the smoothness prior, we
will need additional examples for each repetition of that pattern. However, as discussed
by Montufar et al. (2014), a deep architecture could represent and discover such a repe-
tition pattern and generalize to new instances of it. Thus a small number of parameters
(and therefore, a small number of examples) could suffice to represent a function that
looks complicated (in the sense that it would be expensive to represent with a non-
distributed architecture). Figure 14.2 shows a simple example, where we have n binary
features in a d-dimensional space, and where each binary feature corresponds to a linear
classifier that splits the input space in two parts. The exponentially large number of in-
tersections of n of the corresponding half-spaces corresponds to as many distinguishable
regions that a distributed representation learner could capture. How many regions are
generated by an arrangement of n hyperplanes in R^d? This corresponds to the number of regions that a shallow neural network (one hidden layer) can distinguish (Pascanu et al., 2014c), which is

∑_{j=0}^{d} C(n, j) = O(n^d),

following a more general result from Zaslavsky (1975), known as Zaslavsky's theorem, one of the central results from the theory of hyperplane arrangements. Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.
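The bound above is easy to evaluate numerically. The following snippet (with illustrative values of n and d) computes the maximal number of regions cut out by n hyperplanes in general position in R^d.

```python
from math import comb

def max_regions(n_hyperplanes: int, d: int) -> int:
    """Maximal number of regions cut out by n hyperplanes in general position
    in R^d, i.e. sum_{j=0}^{d} C(n, j)."""
    return sum(comb(n_hyperplanes, j) for j in range(d + 1))

for n, d in [(10, 2), (10, 3), (100, 3), (100, 10)]:
    print(n, d, max_regions(n, d))  # grows roughly like n**d when n >> d
```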
Although a distributed representation (e.g. a shallow neural net) can represent a
richer function with a smaller number of parameters, there is no free lunch: to construct
an arbitrary partition (say with 2^d different regions), one will need a correspondingly large number of hidden units, i.e., of parameters and of examples. The use of a distributed representation therefore also corresponds to a prior, which comes on top of the smoothness prior. To return to the hyperplane example of Figure 14.2, we see that we are able to get this generalization because we can learn about the location of each hyperplane with only O(d) examples: we do not need to see examples corresponding to all O(n^d) regions.
Let us consider a concrete example. Imagine that the input is the image of a person,
and that we have a classifier that detects whether the person is a child or not, another
that detects if that person is a male or a female, another that detects whether that person
wears glasses or not, etc. Keep in mind that these features are discovered automatically,
not fixed a priori. We can learn about the male vs female distinction, or about the
glasses vs no-glasses case, without having to consider all of the configurations of the
n features. This form of statistical separability is what allows one to generalize to
new configurations of a person’s features that have never been seen during training. It
corresponds to the prior discussed above regarding the existence of multiple underlying
explanatory factors. This prior is very plausible for most of the data distributions
on which human intelligence would be useful, but it may not apply to every possible
distribution. However, this apparently innocuous assumption buys us a lot, statistically
speaking, because it allows the learner to discover structure with a reasonably small
number of examples that would otherwise require exponentially more training data.
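The following schematic sketch (with invented attributes, random data, and scikit-learn's LogisticRegression standing in for learned feature detectors) illustrates the separability argument: each binary factor gets its own classifier with O(d) parameters, trained without ever enumerating the joint configurations, and the resulting code can describe combinations never seen during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_examples = 50, 2000
x = rng.normal(size=(n_examples, d))          # stand-in for input features

# Hypothetical ground-truth binary factors (e.g., is_child, wears_glasses,
# is_male), each depending on the input through its own linear rule.
true_w = rng.normal(size=(3, d))
y = (x @ true_w.T > 0).astype(int)

# One independent detector per factor: no need to observe every one of the
# 2**3 joint attribute configurations during training.
detectors = [LogisticRegression(max_iter=1000).fit(x, y[:, i]) for i in range(3)]

x_new = rng.normal(size=(5, d))
codes = np.stack([det.predict(x_new) for det in detectors], axis=1)
print(codes)  # a distributed code for new inputs, possibly in unseen combinations
```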
Another interesting result illustrating the statistical effect of a distributed representation versus a non-distributed one is the mathematical analysis (Montufar and Morton, 2014) of products of mixtures (which include the RBM as a special case) versus mixtures of products (such as the mixture of Gaussians). The analysis shows that a mixture of
products can require an exponentially larger number of parameters in order to represent
the probability distributions arising out of a product of mixtures.
14.4 Exponential Gain in Representational Efficiency from
Depth
In the above example with the input being an image of a person, it would not be
reasonable to expect factors such as gender, age, and the presence of glasses to be
detected simply from a linear classifier, i.e., a shallow neural network. The kinds of
factors that can be chosen almost independently in order to generate data are more likely
to be very high-level and related in highly non-linear ways to the input. This demands
deep distributed representations, where the higher level features (seen as functions of
the input) or factors (seen as generative causes) are obtained through the composition
of many non-linearities.
It turns out that organizing computation through the composition of many non-
linearities and a hierarchy of re-used features can give another exponential boost to
statistical efficiency. Although 2-layer networks (e.g., with saturating non-linearities, boolean gates, sum/products, or RBF units) can generally be shown to be universal approximators (meaning that, with enough hidden units, they can approximate a large class of functions, e.g., continuous functions, up to some given tolerance level), the required number of hidden units may be very large. The main
results on the expressive power of deep architectures state that there are families of
functions that can be represented efficiently with a deep architecture (say depth k) but
would require an exponential number of components (with respect to the input size)
with insufficient depth (depth 2 or depth k - 1).
Figure 14.3: A sum-product network (Poon and Domingos, 2011) composes summing
units and product units, so that each node computes a polynomial. Consider the product
node computing x_2 x_3: its value is re-used in its two immediate children, and indirectly incorporated in its grand-children. In particular, in the top node shown, the product x_2 x_3 would arise 4 times if that node's polynomial were expanded as a sum of products.
That number could double with each additional layer. In general, a deep sum-product network can represent polynomials with a number of min-terms that is exponential in depth,
and some families of polynomials are represented efficiently with a deep sum-product
network but not efficiently representable with a simple sum of products, i.e., a 2-layer
network (Delalleau and Bengio, 2011).
More precisely, a feedforward neural network with a single hidden layer is a universal
approximator (of Borel measurable functions) (Hornik et al., 1989; Cybenko, 1989).
Other works have investigated universal approximation of probability distributions by
deep belief networks (Le Roux and Bengio, 2010; Montufar and Ay, 2011), as well as their approximation properties (Montúfar, 2014; Krause et al., 2013).
Regarding the advantage of depth, early theoretical results have focused on circuit
operations (neural net unit computations) that are substantially different from those be-
ing used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986) and linear threshold units with non-negative weights (Håstad and Goldmann, 1991). More recently, Delalleau and Bengio (2011) showed that a shallow network re-
quires exponentially many more sum-product hidden units (where a single sum-product hidden layer summarizes a layer of product units followed by a layer of sum units) than a deep sum-product
network (Poon and Domingos, 2011) in order to compute certain families of polyno-
mials. Figure 14.3 illustrates a sum-product network for representing polynomials, and
how a deeper network can be exponentially more efficient because the same computation
can be re-used exponentially (in depth) many times. Note however that Martens and
Medabalimi (2014) showed that sum-product networks were somewhat limited in their
expressive power, in the sense that there are distributions that can easily be represented
by other generative models but that cannot be efficiently represented under the decom-
posability and completeness conditions associated with the probabilistic interpretation
of sum-product networks (Poon and Domingos, 2011).
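A small symbolic sketch of the re-use argument (using sympy, chosen here only for convenient polynomial expansion): a factored, “deep” expression whose size is linear in L expands into a flat sum of products with 2^L monomials.

```python
import sympy as sp

L = 5
xs = sp.symbols(f"x1:{2 * L + 1}")     # x1 .. x10
deep = sp.Integer(1)
for i in range(L):
    deep *= xs[2 * i] + xs[2 * i + 1]  # a product of L sums: O(L) nodes when shared

flat = sp.expand(deep)                 # the equivalent "2-layer" sum of products
print(len(flat.args))                  # 2**L = 32 monomials
```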
Figure 14.4: An absolute value rectification unit has the same output for every pair
of mirror points in its input. The mirror axis of symmetry is given by the hyperplane
defined by the weights and bias of the unit. If one considers a function computed on
top of that unit (the green decision surface), it will be formed of a mirror image of a
simpler pattern, across that axis of symmetry. The middle image shows how it can be
obtained by folding the space around that axis of symmetry, and the right image shows
how another repeating pattern can be folded on top of it (by another downstream unit)
to obtain another symmetry (which is now repeated four times, with two hidden layers).
This is an intuitive explanation of the exponential advantage of deeper rectifier networks
formally shown in Pascanu et al. (2014b); Montufar et al. (2014).
Closer to the kinds of deep networks actually used in practice, Pascanu et al. (2014b) and Montufar et al. (2014) showed that piecewise linear networks (e.g., obtained from rectifier non-linearities or maxout units) could represent functions with exponentially more
piecewise-linear regions, as a function of depth, compared to shallow neural networks.
Figure 14.4 illustrates how a network with absolute value rectification creates mirror
images of the function computed on top of some hidden unit, with respect to the input
of that hidden unit. Each hidden unit specifies where to fold the input space in order
to create mirror responses (on both sides of the absolute value non-linearity). By com-
posing these folding operations, we obtain an exponentially large number of piecewise
linear regions which can capture all kinds of regular (e.g. repeating) patterns.
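Here is a one-dimensional numeric sketch of the folding picture (with arbitrary fold centers, not the exact construction analyzed in the cited papers): each absolute-value unit maps mirror-image inputs to the same value, so composing a few of them makes any downstream function repeat across many mirrored segments.

```python
import numpy as np

def fold(h, center):
    """Absolute-value rectification: folds the line around `center`."""
    return np.abs(h - center)

x = np.linspace(0.0, 1.0, 9)
h = x
for _ in range(2):                    # two layers of folding units
    h = fold(h, center=h.max() / 2)

# Inputs that were far apart in the original space now share the same folded
# value, so any function computed downstream is replicated across the
# 4 mirrored segments of [0, 1].
print(np.round(h, 3))
```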
More precisely, the main theorem in Montufar et al. (2014) states that the number
of linear regions carved out by a deep rectifier network with d inputs, depth L, and n
units per hidden layer, is
O((n/d)^{d(L-1)} n^d),

i.e., exponential in the depth L. In the case of maxout networks with k filters per unit, the number of linear regions is

O(k^{(L-1)+d}).
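These bounds are easy to compare numerically. The snippet below (with illustrative values of n, d, and L) evaluates the deep rectifier bound; at L = 1 it reduces to the shallow O(n^d) count, and each extra layer multiplies it by (n/d)^d.

```python
d, n = 10, 64  # input size and units per hidden layer (illustrative values)

def deep_rectifier_bound(L):
    """O((n/d)**(d*(L-1)) * n**d); at L = 1 this is the shallow n**d count."""
    return (n / d) ** (d * (L - 1)) * n ** d

for L in (1, 2, 3, 5):
    print(L, f"{deep_rectifier_bound(L):.3e}")  # exponential growth with depth L
```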
14.5 Priors Regarding The Underlying Factors
To close this chapter, we come back to the original question: what is a good representa-
tion? We proposed that an ideal representation is one that disentangles the underlying
causal factors of variation that generated the data, especially those factors that we care
about in our applications. It seems clear that if we have direct clues about these factors
(like if a factor y = h_i, a label, is observed at the same time as an input x), then this
can help the learner separate these observed factors from the others. This is already
what supervised learning does. But in general, we may have a lot more unlabeled data
than labeled data: can we use other clues, other hints about the underlying factors, in
order to disentangle them more easily?
What we propose here is that we can indeed provide all kinds of broad priors, which serve as so many hints that can help the learner discover, identify, and disentangle these factors. The list of such priors is clearly not exhaustive, but it is a starting point, and yet most learning algorithms in the machine learning literature exploit only a small subset of these priors. With absolutely no priors, we know that it is not possible to generalize: this is the essence of the no-free-lunch theorem for machine learning. In the space of all functions, which is huge, no general-purpose learning recipe trained on a finite training set can dominate all other learning algorithms. Since some assumptions are therefore required, when our goal is to build AI or understand human intelligence, it is tempting to focus our attention on the most general and broad priors, those that are relevant for most of the tasks that humans are able to learn successfully.
This list was introduced in section 3.1 of Bengio et al. (2013b).
• Smoothness: we want to learn functions f such that x ≈ y generally implies f(x) ≈ f(y). This is the most basic prior and is present in most of machine learning, but it is insufficient to get around the curse of dimensionality, as discussed above and in Bengio et al. (2013b).
• Multiple explanatory factors: the data generating distribution is generated
by different underlying factors, and for the most part what one learns about one
factor generalizes in many configurations of the other factors. This assumption is
behind the idea of distributed representations, discussed in Section 14.2 above.
• Depth, or a hierarchical organization of explanatory factors: the concepts that are useful for describing the world around us can be defined in terms of other
concepts, in a hierarchy, with more abstract concepts higher in the hierarchy,
being defined in terms of less abstract ones. This is the assumption exploited by
having deep representations.
• Causal factors: the input variables x are consequences, effects, while the explana-
tory factors are causes, and not vice-versa. As discussed above, this enables the
semi-supervised learning assumption, i.e., that P (x) is tied to P(y|x), mak-
ing it possible to improve the learning of P (y|x) via the learning of P (x). More
precisely, this entails that representations that are useful for P (x) are useful when
learning P (y|x), allowing sharing of statistical strength between the unsupervised
and supervised learning tasks.
• Shared factors across tasks: in the context where we have many tasks, corresponding to different y_i's sharing the same input x, or where each task is associated with a subset or a function f_i(x) of a global input x, the assumption is that each y_i is associated with a different subset from a common pool of relevant factors h. Because these subsets overlap, learning all the P(y_i | x) via a shared intermediate representation P(h | x) allows sharing of statistical strength between the tasks.
• Manifolds: probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous case, these regions can be approximated by low-dimensional manifolds of much smaller dimensionality than the original space where the data lives. This is the
manifold hypothesis and is covered in Chapter 13, especially with algorithms re-
lated to auto-encoders.
• Natural clustering: different values of categorical variables such as object classes (it is often the case that the y of interest is a category)
are associated with separate manifolds. More precisely, the local variations on the
manifold tend to preserve the value of a category, and a linear interpolation be-
tween examples of different classes in general involves going through a low density
region, i.e., P (x|y = i) for different i tend to be well separated and not overlap
much. For example, this is exploited explicitly in the Manifold Tangent Classifier
discussed in Section 13.6. This hypothesis is consistent with the idea that humans
have named categories and classes because of such statistical structure (discovered
by their brain and propagated by their culture), and machine learning tasks often involve predicting such categorical variables.
• Temporal and spatial coherence: this is similar to the cluster assumption but concerns sequences or tuples of observations; consecutive or spatially nearby observations tend to be associated with the same value of relevant categorical concepts, or result in a small move on the surface of the high-density manifold. More generally, different factors change at different temporal and spatial scales, and many categorical concepts of interest change slowly. When attempting to capture such categorical variables, this prior can be enforced by making the associated representations change slowly, i.e., by penalizing changes in their values over time or space (see the sketch after this list). This prior was introduced in Becker and Hinton (1992).
• Sparsity: for any given observation x, only a small fraction of the possible factors
are relevant. In terms of representation, this could be represented by features that
are often zero (as initially proposed by Olshausen and Field (1996)), or by the
fact that most of the extracted features are insensitive to small variations of x.
This can be achieved with certain forms of priors on latent variables (peaked at
0), or by using a non-linearity whose value is often flat at 0 (i.e., 0 and with a
0 derivative), or simply by penalizing the magnitude of the Jacobian matrix (of
derivatives) of the function mapping input to representation. This is discussed in
Section 13.3.
• Simplicity of Factor Dependencies: in good high-level representations, the factors are related to each other through simple dependencies. The simplest possibility is marginal independence, P(h) = ∏_i P(h_i), but linear dependencies or those captured by a shallow auto-encoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.
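As an example of how the temporal-coherence prior can be enforced in practice, here is a minimal sketch (with toy shapes, a made-up penalty weight, and PyTorch chosen only for convenience) of a “slowness” penalty that discourages the representation from changing quickly between consecutive observations:

```python
import torch

def slowness_penalty(h):
    """h: (time, features) codes for a sequence; penalize changes between steps."""
    return ((h[1:] - h[:-1]) ** 2).mean()

encoder = torch.nn.Sequential(torch.nn.Linear(20, 8), torch.nn.Tanh())
x_sequence = torch.randn(50, 20)      # 50 consecutive observations (toy data)
h = encoder(x_sequence)

task_loss = torch.tensor(0.0)         # placeholder for the supervised/reconstruction loss
loss = task_loss + 0.1 * slowness_penalty(h)
loss.backward()                       # gradients now favor slowly changing representations
```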