Chapter 14
Distributed Representations:
Disentangling the Underlying
Factors
What is a good representation? Many answers are possible, and this remains a question
to be further explored in future research. What we propose as an answer in this book is that
in general, a good representation is one that makes further learning tasks easy. In an
unsupervised learning setting, this could mean that the joint distribution of the different
elements of the representation (e.g., elements of the representation vector h) is one that
is easy to model (e.g., in the extreme, these elements are marginally independent of
each other). But that would not be enough: a representation that throws away all
information (e.g., h = 0 for all inputs x) is very easy to model but is also useless. Hence
we want to learn a representation that keeps the information (or at least all the relevant
information, in the supervised case) and makes it easy to learn functions of interest from
this representation.
What we put forward as a hypothesis, going a bit further, is that an ideal represen-
tation is one that disentangles the underlying causal factors of variation that generated
the observed data. Note that this may be different from “easy to model”, but we further
assume that for most problems of interest, these two properties coincide: once we “un-
derstand” the underlying explanations for what we observe, it generally becomes easy
to predict one thing from others.
14.1 Causality and Semi-Supervised Learning
A very basic question is whether unsupervised learning on input variables x can yield
representations that are useful when later trying to learn to predict some target vari-
able y, given x. More generally, when does semi-supervised learning work? See also
Section 7.10 for an earlier discussion.
It turns out that the answer to this question depends strongly on the underlying relationship between x and y. Put differently, the question is whether P(y|x), seen as a function of x, has anything to do with P(x). If not, then unsupervised learning
of P (x) can be of no help to learn P (y|x). Consider for example the case where P (x) is
uniformly distributed and E[y|x] is some function of interest. Clearly, observing x alone
gives us no information about P (y|x). As a better case, consider the situation where x
arises from a mixture, with one component per value of y. If the mixture components
are well separated, then modeling P (x) tells us precisely where each component is, and
a single labeled example of each class will then be enough to perfectly learn P(y|x).
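To make the mixture intuition concrete, here is a minimal sketch (with invented one-dimensional data, using scikit-learn's GaussianMixture purely for illustration): fitting P(x) without labels recovers the two well-separated components, and a single labeled example per class then suffices to attach class labels to them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated classes: x | y=0 ~ N(-5, 1), x | y=1 ~ N(+5, 1).
x_unlabeled = np.concatenate([rng.normal(-5, 1, 500),
                              rng.normal(+5, 1, 500)])[:, None]

# Unsupervised step: model P(x) only.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x_unlabeled)

# One labeled example per class is enough to name the components.
x_labeled = np.array([[-5.0], [5.0]])
y_labeled = [0, 1]
component_to_class = {gmm.predict(x_labeled[i:i + 1])[0]: y_labeled[i] for i in range(2)}

# Predict y for new inputs by mapping each inferred component to its class.
x_test = np.array([[-4.2], [5.3]])
print([component_to_class[c] for c in gmm.predict(x_test)])  # expected: [0, 1]
```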
But more generally, what could make P (y|x) and P (x) tied together?
If y is closely associated with one of the causal factors of x, then, as first argued
by Janzing et al. (2012), P (x) and P (y|x) will be strongly tied, and unsupervised
representation learning that tries to disentangle the underlying factors of variation is
likely to be useful as a semi-supervised learning strategy.
Consider the assumption that y is one of the causal factors of x, and let h represent
all those factors. Then the true generative process is structured according to this directed
graphical model, with h as the parent of x:
P(h, x) = P(x | h) P(h).

As a consequence, the data has marginal probability

P(x) = ∫_h P(x | h) p(h) dh,

or, in the discrete case (like in the mixture example above):

P(x) = ∑_h P(x | h) P(h).
From this straightforward observation, we conclude that the best possible model of x
(from a generalization point of view) is the one that uncovers the above “true” structure,
with h as a latent variable that explains the observed variations in x. The “ideal”
representation learning discussed above should thus recover these latent factors. If y is
one of them (or closely related to one of them), then it will be very easy to learn to
predict y from such a representation. We also see that the conditional distribution of y
given x is tied by Bayes rule to the components in the above equation:
P(y | x) = P(x | y) P(y) / P(x).
Thus the marginal P(x) is intimately tied to the conditional P(y|x), and knowledge of the structure of the former should be helpful in learning the latter, i.e., semi-supervised learning works. Furthermore, since we do not know in advance which of the factors in h will be the one of interest, say y = h_i, an unsupervised learner should learn a representation that disentangles all the generative factors h_i from each other, making it easy to predict y from h.
In addition, as pointed out by Janzing et al. (2012), if the true generative process
has x as an effect and y as a cause, then modeling P (x|y) is robust to changes in P (y).
If the cause-effect relationship were reversed, this would not be true, since by Bayes' rule, P(x|y) would then be sensitive to changes in P(y). Very often, when we consider changes
in distribution due to different domains, temporal non-stationarity, or changes in the
nature of the task, the causal mechanisms remain invariant (“the laws of the universe
are constant”) whereas what changes are the marginal distribution over the underlying
causes (or what factors are linked to our particular task). Hence, better generalization
and robustness to all kinds of changes can be expected via learning a generative model
that attempts to recover the causal factors h and P (x|h).
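The following small numeric sketch (with made-up probability tables) illustrates this point: when y causes x, the mechanism P(x|y) can stay fixed while the marginal P(y) shifts across domains, and the conditional P(y|x) obtained by Bayes' rule changes accordingly.

```python
import numpy as np

# P(x | y): rows index y in {0, 1}, columns index x in {0, 1}; this is the
# causal mechanism, assumed to stay fixed across domains.
p_x_given_y = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

def posterior(p_y):
    """P(y | x) from Bayes' rule: P(x | y) P(y) / P(x)."""
    joint = p_x_given_y * p_y[:, None]               # P(x, y), indexed [y, x]
    return joint / joint.sum(axis=0, keepdims=True)  # normalize over y

print(posterior(np.array([0.5, 0.5])))  # original domain
print(posterior(np.array([0.1, 0.9])))  # shifted P(y): P(x|y) unchanged, P(y|x) differs
```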
14.2 Assumption of Underlying Factors and Distributed
Representation
A very basic idea that comes out of the above discussion and of the notion of “disentangled factors” is that there are underlying factors that generate the observed data. This is a core assumption behind most neural network and deep learning research, which relies more precisely on the notion of distributed representation.
What we call a distributed representation is one that can express an exponentially large number of concepts by composing the activations of many features. An example of a distributed representation is a vector of n binary features, which can take 2^n configurations, each potentially corresponding to a different region in input space. This
can be compared with a symbolic representation, where the input is associated with a
single symbol or category. If there are n symbols in the dictionary, one can imagine n
feature detectors, each corresponding to the detection of the presence of the associated
category. In that case only n different configurations of the representation-space are
possible, carving n different regions in input space. Such a symbolic representation is
also called a one-hot representation, since it can be captured by a binary vector with
n bits that are mutually exclusive (only one of them can be active). These ideas are
developed further in the next section.
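As a toy illustration of this contrast (with invented symbols and attributes), compare a one-hot code, which distinguishes only as many configurations as it has symbols, with a small set of shared binary attributes, which can in principle address exponentially many configurations.

```python
import itertools

# Symbolic (one-hot) code: one exclusive bit per concept, no shared structure.
symbols = ["cat", "dog", "car", "truck"]
one_hot = {s: [int(i == j) for j in range(len(symbols))]
           for i, s in enumerate(symbols)}

# Distributed code: concepts described by shared binary attributes.
attributes = ["is_animal", "has_wheels", "is_large"]
distributed = {
    "cat":   [1, 0, 0],
    "dog":   [1, 0, 1],   # shares "is_animal" with "cat", so knowledge can transfer
    "car":   [0, 1, 0],
    "truck": [0, 1, 1],
}
print(one_hot["dog"], distributed["dog"])

# The one-hot code distinguishes exactly len(symbols) configurations, while the
# distributed code can in principle address 2**len(attributes) of them.
print(len(symbols), 2 ** len(attributes))
print(list(itertools.product([0, 1], repeat=len(attributes))))  # all 8 configurations
```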
Examples of learning algorithms based on non-distributed representations include:
• Clustering methods, including the k-means algorithm: only one cluster “wins” the competition (see the sketch after this list).
• K-nearest neighbors algorithms: only one template or prototype example is asso-
ciated with a given input.
• Decision trees: only one leaf (and the nodes on the path from root to leaf) is
activated when an input is given.
• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or
experts are now associated with a degree of activation, which makes the posterior
probability of components (or experts) given input look more like a distributed
representation. However, as discussed in the next section, these models still suffer
from a poor statistical scaling behavior compared to those based on distributed
representations (such as products of experts and RBMs).
• Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each “support vector” or template example is now continuous-valued, the same issue arises as with Gaussian mixtures.
• Language or translation models based on N-grams: the set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes (e.g., a leaf may correspond to the last two words being w_1 and w_2), and separate parameters are estimated for each leaf of the tree (with some sharing possible of the parameters associated with internal nodes, between the leaves of the sub-tree rooted at the same internal node).
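A minimal sketch of the “one winner” behavior shared by the methods above, using k-means (via scikit-learn, chosen only for brevity) on toy data: each input activates exactly one template, which yields a one-hot, non-distributed code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)
assignments = kmeans.predict(x)        # exactly one winning cluster per input
one_hot_code = np.eye(4)[assignments]  # the induced non-distributed (one-hot) code
print(one_hot_code[:3])                # each row has a single active entry
```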
Figure 14.1: Illustration of how a learning algorithm based on a non-distributed rep-
resentation breaks up the input space into regions, with a separate set of parameters
for each region. For example, a clustering algorithm or a 1-nearest-neighbor algorithm
associates one template (colored X) to each region. This is also true of decision trees,
mixture models, and kernel machines with a local (e.g., Gaussian) kernel. In the latter
algorithms, the output is not piecewise constant but instead interpolates between neigh-
boring regions, but the relationship between the number of parameters (or examples)
and the number of regions they can define remains linear. The advantage is that a differ-
ent answer (e.g., density function, predicted output, etc.) can be independently chosen
for each region. The disadvantage is that there is no generalization to new regions,
except by extending the answer for which there is data, exploiting solely a smoothness
prior. This makes it difficult to learn a complicated function, with more ups and downs
than the available number of examples. Contrast this with a distributed representation,
Figure 14.2.
An important related concept that distinguishes a distributed representation from a
symbolic one is that generalization arises due to shared attributes between different con-
cepts. As pure symbols, “cat” and “dog” are as far from each other as any other two
symbols. However, if one associates them with a meaningful distributed representation,
then many of the things that can be said about cats can generalize to dogs and vice-
versa. This is what allows neural language models to generalize so well (Section 20.3).
Distributed representations induce a rich similarity space, in which semantically close
concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations. Of course, one would obtain a distributed representation if one associated multiple symbolic attributes with each symbol.
Figure 14.2: Illustration of how a learning algorithm based on a distributed represen-
tation breaks up the input space into regions, with exponentially more regions than
parameters. Instead of a single partition (as in the non-distributed case, Figure 14.1),
we have many partitions, one per “feature”, and all their possible intersections. In the
example of the figure, there are 3 binary features C1, C2, and C3, each corresponding
to partitioning the input space in two regions according to a hyperplane, i.e., each is
a linear classifier. Each possible intersection of these half-planes forms a region, i.e., each region corresponds to a configuration of the bits specifying whether each feature is 0 or 1, that is, on which side of its hyperplane the input falls. If the input space is large enough, the number of regions grows exponentially with the number of features, i.e., with the number of parameters. However, the way these regions carve up the input space still depends on only a small number of parameters: this huge number of regions is not placed independently. We can thus represent a function that looks complicated but actually has structure. Basically, the assumption is that one can learn about each feature without having to see examples of all the configurations of all the other features, i.e., these features correspond to underlying factors explaining the data.
Note that a sparse representation is a distributed representation where the number of attributes that are active together is small compared to the total number of attributes. For example, in the case of binary representations, one might have only k ≪ n of the n bits that are non-zero. The power of the representation grows exponentially with the number of active attributes, e.g., O(n^k) in the above example of binary vectors. At the extreme, a symbolic representation is a very sparse representation where only one attribute at a time can be active.
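A quick numeric check of this claim (with arbitrary choices of n and k): the number of configurations expressible with exactly k active bits out of n, the binomial coefficient C(n, k), grows on the order of n^k for fixed k, compared with only n configurations for a one-hot code.

```python
from math import comb

n = 1000
for k in (1, 2, 3, 5):
    # Number of sparse codes with exactly k of the n bits active.
    print(k, comb(n, k))
```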
14.3 Exponential Gain in Representational Efficiency from
Distributed Representations
When and why can there be a statistical advantage from using a distributed represen-
tation as part of a learning algorithm?
Figures 14.1 and 14.2 explain that advantage in intuitive terms. The argument is
that a function that “looks complicated” can be compactly represented using a small
number of parameters, if some “structure” is uncovered by the learner. Traditional “non-
distributed” learning algorithms generalize only thanks to the smoothness assumption,
which states that if u ≈ v, then the target function f to be learned has the property that f(u) ≈ f(v), in general. There are many ways of formalizing such an assumption, but the end result is that if we have an example (x, y) for which we know that f(x) ≈ y, then we choose an estimator f̂ that approximately satisfies these constraints while changing as little as possible. This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that takes many different values (e.g., many ups and downs) in a large number of regions (possibly exponentially many: in a d-dimensional space with at least two values to distinguish per dimension, we might want f to differ across 2^d different regions, requiring O(2^d) training examples), we may need a number of examples that is at least as large as the number of distinguishable regions. One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary mapping from symbol to value. However, this does not allow us to generalize to new symbols, new regions.
If we are lucky, there may be some regularity in the target function, besides being
smooth. For example, the same pattern of variation may repeat itself many times (e.g.,
as in a periodic function or a checkerboard). If we only use the smoothness prior, we
will need additional examples for each repetition of that pattern. However, as discussed
by Montufar et al. (2014), a deep architecture could represent and discover such a repe-
tition pattern and generalize to new instances of it. Thus a small number of parameters
(and therefore, a small number of examples) could suffice to represent a function that
looks complicated (in the sense that it would be expensive to represent with a non-
distributed architecture). Figure 14.2 shows a simple example, where we have n binary
features in a d-dimensional space, and where each binary feature corresponds to a linear
classifier that splits the input space in two parts. The exponentially large number of in-
tersections of n of the corresponding half-spaces corresponds to as many distinguishable
regions that a distributed representation learner could capture. How many regions are
generated by an arrangement of n hyperplanes in R^d? This corresponds to the number of regions that a shallow neural network (one hidden layer) can distinguish (Pascanu et al., 2014c), which is

∑_{j=0}^{d} C(n, j) = O(n^d),

following a more general result from Zaslavsky (1975), known as Zaslavsky's theorem, one of the central results from the theory of hyperplane arrangements. Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.
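The bound above is easy to evaluate numerically. The following snippet (with illustrative values of n and d) computes the maximal number of regions cut out by n hyperplanes in general position in R^d.

```python
from math import comb

def max_regions(n_hyperplanes: int, d: int) -> int:
    """Maximal number of regions cut out by n hyperplanes in general position
    in R^d, i.e. sum_{j=0}^{d} C(n, j)."""
    return sum(comb(n_hyperplanes, j) for j in range(d + 1))

for n, d in [(10, 2), (10, 3), (100, 3), (100, 10)]:
    print(n, d, max_regions(n, d))  # grows roughly like n**d when n >> d
```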
Although a distributed representation (e.g. a shallow neural net) can represent a
richer function with a smaller number of parameters, there is no free lunch: to construct
an arbitrary partition (say with 2^d different regions), one will need a correspondingly large number of hidden units, i.e., of parameters and of examples. The use of a distributed representation therefore also corresponds to a prior, which comes on top of the smoothness prior. To return to the hyperplane example of Figure 14.2, we see that we are able to get this generalization because we can learn about the location of each hyperplane with only O(d) examples: we do not need to see examples corresponding to all O(n^d) regions.
Let us consider a concrete example. Imagine that the input is the image of a person,
and that we have a classifier that detects whether the person is a child or not, another
that detects if that person is a male or a female, another that detects whether that person
wears glasses or not, etc. Keep in mind that these features are discovered automatically,
not fixed a priori. We can learn about the male vs female distinction, or about the
glasses vs no-glasses case, without having to consider all of the configurations of the
n features. This form of statistical separability is what allows one to generalize to
new configurations of a person’s features that have never been seen during training. It
corresponds to the prior discussed above regarding the existence of multiple underlying
explanatory factors. This prior is very plausible for most of the data distributions
on which human intelligence would be useful, but it may not apply to every possible
distribution. However, this apparently innocuous assumption buys us a lot, statistically
speaking, because it allows the learner to discover structure with a reasonably small
number of examples that would otherwise require exponentially more training data.
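The following schematic sketch (with invented attributes, random data, and scikit-learn's LogisticRegression standing in for learned feature detectors) illustrates the separability argument: each binary factor gets its own classifier with O(d) parameters, trained without ever enumerating the joint configurations, and the resulting code can describe combinations never seen during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_examples = 50, 2000
x = rng.normal(size=(n_examples, d))          # stand-in for input features

# Hypothetical ground-truth binary factors (e.g., is_child, wears_glasses,
# is_male), each depending on the input through its own linear rule.
true_w = rng.normal(size=(3, d))
y = (x @ true_w.T > 0).astype(int)

# One independent detector per factor: no need to observe every one of the
# 2**3 joint attribute configurations during training.
detectors = [LogisticRegression(max_iter=1000).fit(x, y[:, i]) for i in range(3)]

x_new = rng.normal(size=(5, d))
codes = np.stack([det.predict(x_new) for det in detectors], axis=1)
print(codes)  # a distributed code for new inputs, possibly in unseen combinations
```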
Another interesting result illustrating the statistical effect of a distributed representation versus a non-distributed one is the mathematical analysis (Montufar and Morton, 2014) of products of mixtures (which include the RBM as a special case) versus mixtures of products (such as the mixture of Gaussians). The analysis shows that a mixture of
products can require an exponentially larger number of parameters in order to represent
the probability distributions arising out of a product of mixtures.
14.4 Exponential Gain in Representational Efficiency from
Depth
In the above example with the input being an image of a person, it would not be
reasonable to expect factors such as gender, age, and the presence of glasses to be
detected simply from a linear classifier, i.e., a shallow neural network. The kinds of
factors that can be chosen almost independently in order to generate data are more likely
to be very high-level and related in highly non-linear ways to the input. This demands
deep distributed representations, where the higher level features (seen as functions of
the input) or factors (seen as generative causes) are obtained through the composition
of many non-linearities.
It turns out that organizing computation through the composition of many non-
linearities and a hierarchy of re-used features can give another exponential boost to
statistical efficiency. Although 2-layer networks (e.g., with saturating non-linearities, boolean gates, sum/products, or RBF units) can generally be shown to be universal approximators (meaning that, with enough hidden units, they can approximate a large class of functions, e.g., continuous functions, up to some given tolerance level), the required number of hidden units may be very large. The main
results on the expressive power of deep architectures state that there are families of
functions that can be represented efficiently with a deep architecture (say depth k) but
would require an exponential number of components (with respect to the input size)
with insufficient depth (depth 2 or depth k - 1).
Figure 14.3: A sum-product network (Poon and Domingos, 2011) composes summing
units and product units, so that each node computes a polynomial. Consider the product
node computing x_2 x_3: its value is re-used in its two immediate children, and indirectly incorporated in its grand-children. In particular, in the top node shown, the product x_2 x_3 would arise 4 times if that node's polynomial were expanded as a sum of products.
That number could double with each additional layer. In general, a deep sum-product network can represent polynomials with a number of min-terms that is exponential in depth,
and some families of polynomials are represented efficiently with a deep sum-product
network but not efficiently representable with a simple sum of products, i.e., a 2-layer
network (Delalleau and Bengio, 2011).
More precisely, a feedforward neural network with a single hidden layer is a universal
approximator (of Borel measurable functions) (Hornik et al., 1989; Cybenko, 1989).
Other works have investigated universal approximation of probability distributions by
deep belief networks (Le Roux and Bengio, 2010; Montufar and Ay, 2011), as well as their approximation properties (Montúfar, 2014; Krause et al., 2013).
Regarding the advantage of depth, early theoretical results have focused on circuit
operations (neural net unit computations) that are substantially different from those be-
ing used in real state-of-the-art deep learning applications, such as logic gates (Håstad, 1986) and linear threshold units with non-negative weights (Håstad and Goldmann, 1991). More recently, Delalleau and Bengio (2011) showed that a shallow network re-
quires exponentially many more sum-product hidden units (where a single sum-product hidden layer summarizes a layer of product units followed by a layer of sum units) than a deep sum-product
network (Poon and Domingos, 2011) in order to compute certain families of polyno-
mials. Figure 14.3 illustrates a sum-product network for representing polynomials, and
how a deeper network can be exponentially more efficient because the same computation
can be re-used exponentially (in depth) many times. Note however that Martens and
Medabalimi (2014) showed that sum-product networks were somewhat limited in their
expressive power, in the sense that there are distributions that can easily be represented
by other generative models but that cannot be efficiently represented under the decom-
posability and completeness conditions associated with the probabilistic interpretation
of sum-product networks (Poon and Domingos, 2011).
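A small symbolic sketch of the re-use argument (using sympy, chosen here only for convenient polynomial expansion): a factored, “deep” expression whose size is linear in L expands into a flat sum of products with 2^L monomials.

```python
import sympy as sp

L = 5
xs = sp.symbols(f"x1:{2 * L + 1}")     # x1 .. x10
deep = sp.Integer(1)
for i in range(L):
    deep *= xs[2 * i] + xs[2 * i + 1]  # a product of L sums: O(L) nodes when shared

flat = sp.expand(deep)                 # the equivalent "2-layer" sum of products
print(len(flat.args))                  # 2**L = 32 monomials
```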
Figure 14.4: An absolute value rectification unit has the same output for every pair
of mirror points in its input. The mirror axis of symmetry is given by the hyperplane
defined by the weights and bias of the unit. If one considers a function computed on
top of that unit (the green decision surface), it will be formed of a mirror image of a
simpler pattern, across that axis of symmetry. The middle image shows how it can be
obtained by folding the space around that axis of symmetry, and the right image shows
how another repeating pattern can be folded on top of it (by another downstream unit)
to obtain another symmetry (which is now repeated four times, with two hidden layers).
This is an intuitive explanation of the exponential advantage of deeper rectifier networks
formally shown in Pascanu et al. (2014b); Montufar et al. (2014).
Closer to the kinds of deep networks actually used in practice, Pascanu et al. (2014b) and Montufar et al. (2014) showed that piecewise linear networks (e.g., obtained from rectifier non-linearities or maxout units) could represent functions with exponentially more
piecewise-linear regions, as a function of depth, compared to shallow neural networks.
Figure 14.4 illustrates how a network with absolute value rectification creates mirror
images of the function computed on top of some hidden unit, with respect to the input
of that hidden unit. Each hidden unit specifies where to fold the input space in order
to create mirror responses (on both sides of the absolute value non-linearity). By com-
posing these folding operations, we obtain an exponentially large number of piecewise
linear regions which can capture all kinds of regular (e.g. repeating) patterns.
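Here is a one-dimensional numeric sketch of the folding picture (with arbitrary fold centers, not the exact construction analyzed in the cited papers): each absolute-value unit maps mirror-image inputs to the same value, so composing a few of them makes any downstream function repeat across many mirrored segments.

```python
import numpy as np

def fold(h, center):
    """Absolute-value rectification: folds the line around `center`."""
    return np.abs(h - center)

x = np.linspace(0.0, 1.0, 9)
h = x
for _ in range(2):                    # two layers of folding units
    h = fold(h, center=h.max() / 2)

# Inputs that were far apart in the original space now share the same folded
# value, so any function computed downstream is replicated across the
# 4 mirrored segments of [0, 1].
print(np.round(h, 3))
```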
More precisely, the main theorem in Montufar et al. (2014) states that the number
of linear regions carved out by a deep rectifier network with d inputs, depth L, and n
units per hidden layer, is
O((n/d)^{d(L-1)} n^d),

i.e., exponential in the depth L. In the case of maxout networks with k filters per unit, the number of linear regions is

O(k^{(L-1)+d}).
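These bounds are easy to compare numerically. The snippet below (with illustrative values of n, d, and L) evaluates the deep rectifier bound; at L = 1 it reduces to the shallow O(n^d) count, and each extra layer multiplies it by (n/d)^d.

```python
d, n = 10, 64  # input size and units per hidden layer (illustrative values)

def deep_rectifier_bound(L):
    """O((n/d)**(d*(L-1)) * n**d); at L = 1 this is the shallow n**d count."""
    return (n / d) ** (d * (L - 1)) * n ** d

for L in (1, 2, 3, 5):
    print(L, f"{deep_rectifier_bound(L):.3e}")  # exponential growth with depth L
```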
14.5 Priors Regarding The Underlying Factors
To close this chapter, we come back to the original question: what is a good representa-
tion? We proposed that an ideal representation is one that disentangles the underlying
causal factors of variation that generated the data, especially those factors that we care
about in our applications. It seems clear that if we have direct clues about these factors
(like if a factor y = h_i, a label, is observed at the same time as an input x), then this
can help the learner separate these observed factors from the others. This is already
what supervised learning does. But in general, we may have a lot more unlabeled data
than labeled data: can we use other clues, other hints about the underlying factors, in
order to disentangle them more easily?
What we propose here is that we can indeed provide all kinds of broad priors, which serve as so many hints that can help the learner discover, identify, and disentangle these factors. The list of such priors is clearly not exhaustive, but it is a starting point, and yet most learning algorithms in the machine learning literature exploit only a small subset of these priors. With absolutely no priors, we know that it is not possible to generalize: this is the essence of the no-free-lunch theorem for machine learning. In the space of all functions, which is huge, no general-purpose learning recipe trained on a finite training set can dominate all other learning algorithms. Since some assumptions are therefore required, when our goal is to build AI or understand human intelligence, it is tempting to focus our attention on the most general and broad priors, those that are relevant for most of the tasks that humans are able to learn successfully.
This list was introduced in section 3.1 of Bengio et al. (2013b).
• Smoothness: we want to learn functions f such that x ≈ y generally implies f(x) ≈ f(y). This is the most basic prior and is present in most of machine learning, but it is insufficient to get around the curse of dimensionality, as discussed above and in Bengio et al. (2013b).
• Multiple explanatory factors: the data generating distribution is generated
by different underlying factors, and for the most part what one learns about one
factor generalizes in many configurations of the other factors. This assumption is
behind the idea of distributed representations, discussed in Section 14.2 above.
• Depth, or a hierarchical organization of explanatory factors: the concepts that are useful for describing the world around us can be defined in terms of other
concepts, in a hierarchy, with more abstract concepts higher in the hierarchy,
being defined in terms of less abstract ones. This is the assumption exploited by
having deep representations.
• Causal factors: the input variables x are consequences, effects, while the explana-
tory factors are causes, and not vice-versa. As discussed above, this enables the
semi-supervised learning assumption, i.e., that P (x) is tied to P(y|x), mak-
ing it possible to improve the learning of P (y|x) via the learning of P (x). More
precisely, this entails that representations that are useful for P (x) are useful when
learning P (y|x), allowing sharing of statistical strength between the unsupervised
and supervised learning tasks.
• Shared factors across tasks: in the context where we have many tasks, corresponding to different y_i's sharing the same input x, or where each task is associated with a subset or a function f_i(x) of a global input x, the assumption is that each y_i is associated with a different subset from a common pool of relevant factors h. Because these subsets overlap, learning all the P(y_i | x) via a shared intermediate representation P(h | x) allows sharing of statistical strength between the tasks.
• Manifolds: probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous case, these regions can be approximated by low-dimensional manifolds of much smaller dimensionality than the original space where the data lives. This is the
manifold hypothesis and is covered in Chapter 13, especially with algorithms re-
lated to auto-encoders.
• Natural clustering: different values of categorical variables such as object classes (it is often the case that the y of interest is a category)
are associated with separate manifolds. More precisely, the local variations on the
manifold tend to preserve the value of a category, and a linear interpolation be-
tween examples of different classes in general involves going through a low density
region, i.e., P (x|y = i) for different i tend to be well separated and not overlap
much. For example, this is exploited explicitly in the Manifold Tangent Classifier
discussed in Section 13.6. This hypothesis is consistent with the idea that humans
have named categories and classes because of such statistical structure (discovered
by their brain and propagated by their culture), and machine learning tasks often involve predicting such categorical variables.
• Temporal and spatial coherence: this is similar to the cluster assumption but concerns sequences or tuples of observations; consecutive or spatially nearby observations tend to be associated with the same value of relevant categorical concepts, or result in a small move on the surface of the high-density manifold. More generally, different factors change at different temporal and spatial scales, and many categorical concepts of interest change slowly. When attempting to capture such categorical variables, this prior can be enforced by making the associated representations change slowly, i.e., by penalizing changes in their values over time or space (see the sketch after this list). This prior was introduced in Becker and Hinton (1992).
• Sparsity: for any given observation x, only a small fraction of the possible factors
are relevant. In terms of representation, this could be represented by features that
are often zero (as initially proposed by Olshausen and Field (1996)), or by the
fact that most of the extracted features are insensitive to small variations of x.
This can be achieved with certain forms of priors on latent variables (peaked at
0), or by using a non-linearity whose value is often flat at 0 (i.e., 0 and with a
0 derivative), or simply by penalizing the magnitude of the Jacobian matrix (of
derivatives) of the function mapping input to representation. This is discussed in
Section 13.3.
• Simplicity of Factor Dependencies: in good high-level representations, the factors are related to each other through simple dependencies. The simplest possibility is marginal independence, P(h) = ∏_i P(h_i), but linear dependencies or those captured by a shallow auto-encoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.
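As an example of how the temporal-coherence prior can be enforced in practice, here is a minimal sketch (with toy shapes, a made-up penalty weight, and PyTorch chosen only for convenience) of a “slowness” penalty that discourages the representation from changing quickly between consecutive observations:

```python
import torch

def slowness_penalty(h):
    """h: (time, features) codes for a sequence; penalize changes between steps."""
    return ((h[1:] - h[:-1]) ** 2).mean()

encoder = torch.nn.Sequential(torch.nn.Linear(20, 8), torch.nn.Tanh())
x_sequence = torch.randn(50, 20)      # 50 consecutive observations (toy data)
h = encoder(x_sequence)

task_loss = torch.tensor(0.0)         # placeholder for the supervised/reconstruction loss
loss = task_loss + 0.1 * slowness_penalty(h)
loss.backward()                       # gradients now favor slowly changing representations
```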