
n bits that are non-zero. The power of representation grows exponentially with the
number of active attributes, e.g., O(n^k) in the above example of binary vectors. At
the extreme, a symbolic representation is a very sparse representation where only one
attribute at a time can be active.
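A quick way to see that count: the number of distinct binary vectors with exactly k of the n bits active is

    \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \leq \frac{n^k}{k!} = O(n^k) \quad \text{for fixed } k,

which grows polynomially in n with degree k, i.e., exponentially in the number of active attributes.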
14.3 Exponential Gain in Representational Efficiency from
Distributed Representations
When and why can there be a statistical advantage from using a distributed represen-
tation as part of a learning algorithm?
Figures 14.1 and 14.2 explain that advantage in intuitive terms. The argument is
that a function that “looks complicated” can be compactly represented using a small
number of parameters, if some “structure” is uncovered by the learner. Traditional “non-
distributed” learning algorithms generalize only thanks to the smoothness assumption,
which states that if u ≈ v, then the target function f to be learned has the property that
f(u) ≈ f(v), in general. There are many ways of formalizing such an assumption, but
the end result is that if we have an example (x, y) for which we know that f(x) ≈ y, then
we choose an estimator f̂ that approximately satisfies these constraints while changing
as little as possible. This assumption is clearly very useful, but it suffers from the curse
of dimensionality: in order to learn a target function that takes many different values
(e.g., many ups and downs) in a large number of regions¹, we may need a number of
examples that is at least as large as the number of distinguishable regions. One can
think of each of these regions as a category or symbol: by having a separate degree of
freedom for each symbol (or region), we can learn an arbitrary mapping from symbol to
value. However, this does not allow us to generalize to new symbols, new regions.
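To make this concrete, here is a minimal sketch of a purely local, smoothness-based estimator, a 1-nearest-neighbour predictor, applied to a checkerboard-like target whose value alternates across the 2^d orthants of the input space. The particular target and sample sizes are only illustrative choices. With far fewer training examples than regions, accuracy stays near chance:

    import numpy as np

    rng = np.random.default_rng(0)

    def target(x):
        # Checkerboard-like target: the parity of the sign pattern of x.
        # Its value flips between neighbouring orthants, so each of the
        # 2^d orthants effectively has to be learned on its own.
        return np.prod(np.sign(x), axis=-1)

    d = 10                                         # 2**d = 1024 regions
    X_train = rng.uniform(-1, 1, size=(200, d))    # far fewer examples than regions
    y_train = target(X_train)
    X_test = rng.uniform(-1, 1, size=(1000, d))
    y_test = target(X_test)

    # 1-nearest-neighbour prediction: copy the label of the closest training
    # point -- the purest "smoothness only" estimator.
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=-1)
    y_pred = y_train[dists.argmin(axis=1)]

    print("1-NN accuracy with 200 examples for 1024 regions:",
          (y_pred == y_test).mean())               # typically close to chance (0.5)

The point is not about nearest neighbours specifically: any learner relying only on local smoothness needs examples in essentially every region it must distinguish.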
If we are lucky, there may be some regularity in the target function, besides being
smooth. For example, the same pattern of variation may repeat itself many times (e.g.,
as in a periodic function or a checkerboard). If we only use the smoothness prior, we
will need additional examples for each repetition of that pattern. However, as discussed
by Montufar et al. (2014), a deep architecture could represent and discover such a repe-
tition pattern and generalize to new instances of it. Thus a small number of parameters
(and therefore, a small number of examples) could suffice to represent a function that
looks complicated (in the sense that it would be expensive to represent with a non-
distributed architecture). Figure 14.2 shows a simple example, where we have n binary
features in a d-dimensional space, and where each binary feature corresponds to a linear
classifier that splits the input space into two parts. The exponentially large number of
intersections of the n corresponding half-spaces yields an equally large number of distinguishable
regions that a distributed representation learner could capture. How many regions are
generated by an arrangement of n hyperplanes in R^d? This corresponds to the number
of regions that a shallow neural network (one hidden layer) can distinguish (Pascanu
¹ e.g., exponentially many regions: in a d-dimensional space with at least 2 different values to distinguish per dimension, we might want f to differ in 2^d different regions, requiring O(2^d) training examples.
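The question above, of how many regions an arrangement of n hyperplanes creates in R^d, has a classical answer when the hyperplanes are in general position: \sum_{j=0}^{d} \binom{n}{j} regions. A minimal Python sketch of that count (the helper num_regions and the (n, d) values are only illustrative):

    from math import comb

    def num_regions(n, d):
        # Number of regions that n hyperplanes in general position cut R^d into:
        # sum_{j=0}^{d} C(n, j)  (classical hyperplane-arrangement count).
        return sum(comb(n, j) for j in range(d + 1))

    # For fixed d this count grows like O(n^d), while a one-hidden-layer network
    # with n hidden units has only O(n * d) parameters, so the number of regions
    # it can carve out can be exponentially larger than its parameter count.
    for n, d in [(20, 2), (20, 5), (100, 10)]:
        print(f"n={n:3d} hyperplanes in R^{d}: {num_regions(n, d)} regions")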