
flow graph of a different depth by redefining which operations can be performed at each
node. However, for neural networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with non-
negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012a; Sermanet et al., 2013; Farabet et al., 2013a; Couprie
et al., 2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior
and that in order to take advantage of it, the learner’s family of functions needs to allow
multiple levels of representation.
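To make the notion of depth used above concrete, the following sketch (in Python with NumPy, purely as an illustration; the layer sizes and random weights are arbitrary choices, not taken from the text) builds a depth-2 rectifier network, i.e. one hidden layer, next to a deeper, narrower rectifier network with the same input and output dimensions. The gist of the results cited above is that some families of functions require exponentially many hidden units in the former but admit a compact parameterization in the latter.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Forward pass through a stack of fully connected rectifier layers.

    Each hidden layer re-represents the output of the previous one, so the
    number of (W, b) pairs is the depth of the network.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)          # one level of representation
    W_out, b_out = weights[-1], biases[-1]
    return h @ W_out + b_out         # linear output layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))         # a batch of 4 inputs with 10 raw features

# Depth 2: a single (wide) hidden layer, the universal-approximation setting.
shallow_W = [rng.normal(size=(10, 256)), rng.normal(size=(256, 1))]
shallow_b = [np.zeros(256), np.zeros(1)]

# A deeper, narrower network with the same input and output dimensions.
deep_W = [rng.normal(size=(10, 32)), rng.normal(size=(32, 32)),
          rng.normal(size=(32, 32)), rng.normal(size=(32, 1))]
deep_b = [np.zeros(32), np.zeros(32), np.zeros(32), np.zeros(1)]

print(mlp_forward(x, shallow_W, shallow_b).shape)  # (4, 1)
print(mlp_forward(x, deep_W, deep_b).shape)        # (4, 1)
```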
6.5 Feature / Representation Learning
Let us consider again single-layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves
a convex optimization problem¹², i.e., an optimization problem with some convergence
guarantees towards a global optimum, irrespective of initial conditions. Simple and well-
understood optimization algorithms are available in this case. However, this limits the
representational capacity too much: many tasks, for a given choice of input representa-
tion x (the raw input features), cannot be solved by using only a linear predictor. What
are our options to avoid that limitation?
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Schölkopf
et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of much
higher dimension. In this case, f_θ(x) = b + w · φ(x) can be linear in the parameters
(and in φ(x)) and optimization remains convex (or even analytic). Thanks to the
kernel trick, we can computationally handle a high-dimensional φ(x) (or even an
infinite-dimensional one) so long as the kernel K(u, v) = φ(u) · φ(v) (where · is
the appropriate dot product for the space of φ(·)) can be computed efficiently (see the sketch at the end of this section).
If φ(x) is of high enough dimension, we can always have enough capacity to fit
the training set, but generalization is not at all guaranteed: it will depend on the
¹² or even one for which an analytic solution can be computed, as with linear regression or some Gaussian process regression models.
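As the sketch promised in point 1 above, the NumPy code below fits a toy XOR-like problem with kernel ridge regression and a Gaussian (RBF) kernel; the kernel bandwidth γ, the regularization constant λ, and the four data points are arbitrary choices for this illustration and are not taken from the text. No linear predictor on the raw x can fit these four targets exactly, but the implicit high-dimensional φ(x) of the RBF kernel can.

```python
import numpy as np

def rbf_kernel(U, V, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2): the dot product φ(u)·φ(v)
    for an infinite-dimensional feature map φ, computed efficiently."""
    sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# Toy XOR data: no linear predictor on the raw x fits these targets exactly.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Kernel ridge regression: convex in the dual coefficients alpha,
# with the analytic solution alpha = (K + lambda I)^{-1} y.
lam = 1e-3                              # arbitrary regularization constant
K = rbf_kernel(X, X, gamma=2.0)         # arbitrary bandwidth
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_new):
    # f(x) = sum_i alpha_i K(x_i, x), linear in the (implicit) φ(x).
    return rbf_kernel(X_new, X, gamma=2.0) @ alpha

print(np.round(predict(X), 2))          # close to [0, 1, 1, 0]
```

Note that b and w never appear explicitly: the predictor is written in terms of the dual coefficients α obtained from a convex (here even analytic) problem, which is what makes an infinite-dimensional φ(x) tractable.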