making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only happen if there
is nothing to be learned any more because we already know the target function. Having
a universal approximation property just means that this prior is broad enough to cover
or come close to any possible target function, and this is a useful property, but in itself
clearly not sufficient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
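To make this slightly more explicit (the notation below is ours, a sketch rather than a statement from the text): if the learning algorithm corresponds to a prior p(f) over candidate target functions, and generalization on the true target f* improves as more prior mass is placed near f*, then the ideal but unattainable choice is
\[
p(f) = \mathbf{1}\{f = f^*\},
\]
whereas a universal approximation property only guarantees that some mass can be placed arbitrarily close to any f, not that much of it lands near the particular f* we actually face.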
One of the central priors that is exploited in deep learning is that the target function
to be learned can be efficiently represented as a deep composition of simpler functions
(“features”), where features at one level can be re-used to define many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a flow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can define the operations performed
in each node in different ways. For example, do we consider the affine transformation
computed by a neural network layer and the non-linear activation that follows it as two
separate nodes, or as a single node, i.e., one level of the graph? The notion of depth
therefore depends on the set of operations allowed at each node, and a flow graph of a
given depth can be converted into an equivalent flow graph of a different depth by
redefining which operations may be performed at each node. However, for neural
networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
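As a concrete, purely illustrative sketch of these two counting conventions, the short Python/NumPy snippet below implements a depth 2 network in the sense used here, i.e., a single hidden layer; the shapes, the tanh activation, and the variable names are our own choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer, i.e. a "depth 2" network in the convention above.
# All shapes are illustrative only.
W1, b1 = rng.standard_normal((10, 4)), np.zeros(10)   # input (4) -> hidden (10)
W2, b2 = rng.standard_normal((1, 10)), np.zeros(1)    # hidden (10) -> output (1)

def depth2_net(x):
    # Counting one level per (affine + non-linearity) pair:
    h = np.tanh(W1 @ x + b1)   # level 1: hidden features, re-usable by any output
    return W2 @ h + b2         # level 2: output

# Seen as a finer-grained flow graph (affine -> tanh -> affine), the longest
# input-to-output path has length 3; the function computed is unchanged,
# only the accounting of "depth" differs.
x = rng.standard_normal(4)
print(depth2_net(x))
```

Adding a second hidden layer would make this a depth 3 network under the coarser convention, while the corresponding fine-grained graph would have depth 5.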
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with
non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012a; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al.,
2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior and that
in order to take advantage of it, the learner’s family of functions needs to allow multiple