ease: we mostly care about the kinds of functions that are needed to represent the
world around us or the task of interest. The no-free-lunch theorems for machine
learning (Wolpert, 1996) essentially say that no learning algorithm is “universal”, i.e.,
dominates all the others against all possible ground truths (data generating distribution
or target function). However, for a given family of tasks, some learning algorithms are
definitely better. Choosing a learning algorithm is equivalent to choosing a prior, i.e.,
making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only be possible if there
were nothing left to learn, i.e., if we already knew the target function. Having a
universal approximation property just means that this prior is broad enough to cover or
come close to any possible target function – and this is a useful property – but is, in itself,
clearly not sufficient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
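To make this concrete, here is a small illustrative sketch (not from the text; it assumes
only NumPy and uses made-up toy targets). It fits two learners with different implicit
priors, a straight-line model and a 1-nearest-neighbour predictor, to data drawn from two
different ground truths; each learner obtains the lower test error on the task whose target
function its prior favours, and neither dominates on both.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_task(target_fn, n_train=100, n_test=100, noise=0.1):
        # Sample a regression task: noisy observations of target_fn on [-1, 1].
        x = rng.uniform(-1, 1, size=n_train + n_test)
        y = target_fn(x) + noise * rng.standard_normal(x.size)
        return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])

    def fit_linear(x_tr, y_tr):
        # Prior: the target is (close to) an affine function.
        a, b = np.polyfit(x_tr, y_tr, deg=1)
        return lambda x: a * x + b

    def fit_1nn(x_tr, y_tr):
        # Prior: the target is locally smooth; copy the nearest training label.
        return lambda x: y_tr[np.abs(x[:, None] - x_tr[None, :]).argmin(axis=1)]

    def test_mse(predict, x_te, y_te):
        return float(np.mean((predict(x_te) - y_te) ** 2))

    for name, target in [("affine target", lambda x: 2 * x + 1),
                         ("wiggly target", lambda x: np.sin(12 * x))]:
        (x_tr, y_tr), (x_te, y_te) = make_task(target)
        print(name,
              "| linear fit MSE: %.3f" % test_mse(fit_linear(x_tr, y_tr), x_te, y_te),
              "| 1-NN MSE: %.3f" % test_mse(fit_1nn(x_tr, y_tr), x_te, y_te))

Which learner wins depends entirely on how well its prior matches the data generating
process, which is the point of the no-free-lunch result.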
One of the central priors that is exploited in deep learning is that the target function
to be learned can be efficiently represented as a deep composition of simpler functions
(“features”), where features at one level can be re-used to define many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a flow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can define the operations performed
in each node in different ways. For example, do we count the affine transformation of a
neural network layer and the non-linear activation that follows it as two separate nodes,
or as a single node, i.e., a single level of the graph? The notion of depth therefore
depends on the set of operations allowed at each node, and a flow graph of a given depth
can be converted into an equivalent flow graph of a different depth by redefining which
operations may be performed at each node. However, for neural networks, we typically
talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
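To make the dependence of depth on node granularity concrete, the following minimal
sketch (plain Python; the dict-of-parents encoding and node names are just illustrative
choices) measures depth as the number of edges on the longest input-to-output path, and
counts the same one-hidden-layer network as depth 3 or depth 2 depending on whether the
affine transformation and the non-linearity are treated as two nodes or fused into one.

    def graph_depth(parents, inputs, outputs):
        # Depth of an acyclic flow graph: number of edges on the longest
        # path from any input node to any output node.
        memo = {}

        def longest(node):
            if node in memo:
                return memo[node]
            if node in inputs:
                memo[node] = 0
            else:
                memo[node] = 1 + max(longest(p) for p in parents[node])
            return memo[node]

        return max(longest(o) for o in outputs)

    # One hidden layer, affine map and non-linearity counted as separate nodes:
    fine = {"affine1": ["x"], "relu1": ["affine1"], "affine2": ["relu1"]}
    print(graph_depth(fine, inputs={"x"}, outputs={"affine2"}))    # 3

    # Same network with each (affine + non-linearity) layer fused into one node:
    coarse = {"h": ["x"], "y": ["h"]}
    print(graph_depth(coarse, inputs={"x"}, outputs={"y"}))        # 2

Under the coarser convention this matches the counting above: one hidden layer gives
depth 2.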
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a
particular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with
non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b).
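The parity function is the classic example behind the logic-gate result: with unbounded
depth it can be computed by a chain (or balanced tree) of two-input XOR gates whose size
grows only linearly in the number of inputs, whereas a depth-2 circuit written as an OR of
ANDs must devote one AND term to every odd-weight input pattern, i.e., exponentially many.
The sketch below (plain Python, purely illustrative and not a reproduction of the cited
proofs) contrasts the two sizes.

    import itertools

    def parity_deep(bits):
        # Parity via a chain of two-input XOR gates: len(bits) - 1 gates,
        # with depth growing with the number of inputs.
        result = 0
        for b in bits:
            result ^= b
        return result

    def parity_flat_terms(n):
        # Size of a depth-2 OR-of-ANDs (DNF) circuit for parity: each of the
        # 2**(n - 1) odd-weight input patterns needs its own AND term.
        return sum(1 for bits in itertools.product((0, 1), repeat=n)
                   if sum(bits) % 2 == 1)

    for n in (4, 8, 12):
        sample = [1, 0] * (n // 2)                    # an example input
        print("n=%d: parity(sample)=%d, deep circuit uses %d XOR gates, "
              "a flat DNF needs %d AND terms"
              % (n, parity_deep(sample), n - 1, parity_flat_terms(n)))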
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine