
The computation of these features can therefore be laid down in a flow graph
or circuit, whose depth is the length of the longest path from an input node to
an output node. Note that we can define the operations performed in each node
in different ways. For example, should we count the affine transformation computed
by a neural network layer and the non-linear activation that follows it as two
separate nodes, or as a single node (one level of the graph)? The notion of depth
therefore depends on the set of operations allowed at each node, and a flow graph
of a given depth can be converted into an equivalent flow graph of a different
depth by redefining which operations may be performed at each node. However,
for neural networks, we typically speak of a depth-1 network if there is no hidden
layer, a depth-2 network if there is one hidden layer, etc. The universal
approximation results for neural networks tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
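
As a concrete illustration of this counting convention, the following sketch (in
NumPy, with made-up layer sizes chosen only for the example) implements a
one-hidden-layer network: under the usual convention it is a depth-2 network, but
if each affine transformation and each non-linearity is counted as its own node,
the longest input-to-output path of the same computation visits four operation
nodes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up sizes, for illustration only.
    n_in, n_hidden, n_out = 3, 5, 2
    W1, b1 = rng.standard_normal((n_hidden, n_in)), np.zeros(n_hidden)
    W2, b2 = rng.standard_normal((n_out, n_hidden)), np.zeros(n_out)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def forward(x):
        # Depth-2 view: one level per (affine transformation + non-linearity).
        h = np.tanh(W1 @ x + b1)     # level 1: hidden layer
        y = softmax(W2 @ h + b2)     # level 2: output layer
        return y

    # Finer-grained view: if each affine map and each non-linearity is its
    # own node, the longest input-to-output path visits 4 nodes:
    #   affine(W1, b1) -> tanh -> affine(W2, b2) -> softmax
    x = rng.standard_normal(n_in)
    print(forward(x))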
From the point of view of approximation properties, the important result
is that one can find families of functions which can be approximated very effi-
ciently when a particular depth is allowed, but which might require a much larger
(typically exponentially larger) model (e.g. more hidden units) if depth is insuf-
ficient (or is limited to 2). Such results have been proven for logic gates (Håstad,
1986), linear threshold units with non-negative weights (Håstad and Goldmann,
1991), polynomials (Delalleau and Bengio, 2011) organized as deep sum-product
networks (Poon and Domingos, 2011), and more recently, for deep networks of
rectifier units (Pascanu et al., 2013b). Of course, there is no guarantee that the
kinds of functions we want to learn in applications of machine learning (and in
particular for AI) share such a property. However, a lot of experimental evidence
suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan
et al., 2012; Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a;
Couprie et al., 2013; Ebrahimi et al., 2013). This suggests that indeed depth is
a useful prior and that in order to take advantage of it, the learner's family of
functions needs to allow multiple levels of representation.
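
To make the flavor of these results concrete, consider the parity function, the
canonical example behind the logic-gate results cited above: a deep circuit of
two-input XOR gates computes the parity of n bits with only n - 1 gates, whereas
any depth-2 formula (an OR of ANDs) must enumerate all 2^(n-1) odd-weight input
patterns. The sketch below (plain Python, written only for this illustration)
contrasts the two representations.

    from itertools import product

    def parity_deep(bits):
        # Deep view: a chain of two-input XOR nodes, n - 1 gates in total.
        acc = bits[0]
        for b in bits[1:]:
            acc ^= b
        return acc

    def parity_shallow_terms(n):
        # Depth-2 view (OR of ANDs): one AND term per odd-weight input
        # pattern, i.e. 2**(n - 1) terms -- exponential in n.
        return [p for p in product([0, 1], repeat=n) if sum(p) % 2 == 1]

    for n in (4, 8, 16):
        print(n, "inputs:", n - 1, "XOR gates vs",
              len(parity_shallow_terms(n)), "AND terms at depth 2")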
6.5 Feature / Representation Learning
Let us consider again the single layer networks such as the perceptron, linear
regression and logistic regression: such linear models are appealing because
training them involves a convex optimization problem (or even one for which an
analytic solution can be computed, as with linear regression or some Gaussian
process regression models), i.e., an optimization problem with some convergence
guarantees towards a global optimum, irrespective of initialization.
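
As a minimal illustration of this point (using synthetic data and NumPy written
for this example, not code from the text), the sketch below solves linear
regression analytically via the normal equations and fits logistic regression by
plain gradient descent from two different initializations; because the loss is
convex, both runs reach essentially the same solution.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data, made up for the illustration.
    X = rng.standard_normal((200, 3))
    w_true = np.array([1.5, -2.0, 0.5])
    y_lin = X @ w_true + 0.1 * rng.standard_normal(200)
    y_log = (X @ w_true + 0.5 * rng.standard_normal(200) > 0).astype(float)

    # Linear regression: convex, and even solvable analytically
    # via the normal equations, w = (X^T X)^{-1} X^T y.
    w_analytic = np.linalg.solve(X.T @ X, X.T @ y_lin)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(w0, lr=0.1, steps=2000):
        # Gradient descent on the (convex) negative log-likelihood.
        w = w0.copy()
        for _ in range(steps):
            p = sigmoid(X @ w)
            w -= lr * X.T @ (p - y_log) / len(y_log)
        return w

    # Two very different initializations converge to (nearly) the same
    # solution because the objective is convex.
    w_a = fit_logistic(np.zeros(3))
    w_b = fit_logistic(10.0 * rng.standard_normal(3))
    print("linear regression (analytic):", np.round(w_analytic, 3))
    print("logistic regression, init A: ", np.round(w_a, 3))
    print("logistic regression, init B: ", np.round(w_b, 3))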