
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with non-
negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012a; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al.,
2013; Ebrahimi et al., 2013). This suggests that depth is indeed a useful prior and that,
in order to take advantage of it, the learner’s family of functions needs to allow multiple
levels of representation.
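To give some intuition for such a depth/size separation in the logic-gate case, consider the
classical example of the n-bit parity function: a balanced tree of 2-input XOR gates computes
it with only n − 1 gates and depth about log₂ n, whereas a depth-2 circuit written as a DNF
needs one AND term for every odd-weight input, i.e., 2^{n−1} terms. The following minimal
Python sketch, purely as an illustration, makes the two counts concrete:

import itertools

# Illustrative sketch only: the n-bit parity function as an example of a
# depth/size trade-off for logic gates.

def parity_deep(bits):
    """Parity via a balanced tree of 2-input XOR gates: n - 1 gates, depth ~log2(n)."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [a ^ b for a, b in zip(layer[0::2], layer[1::2])]
        if len(layer) % 2:               # odd layer: carry the unpaired bit upward
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

def dnf_terms_for_parity(n):
    """A depth-2 (DNF) circuit needs one AND term per odd-weight input: 2**(n - 1)."""
    return sum(1 for bits in itertools.product([0, 1], repeat=n)
               if sum(bits) % 2 == 1)

n = 8
assert all(parity_deep(b) == sum(b) % 2
           for b in itertools.product([0, 1], repeat=n))
print("XOR-tree gates (deep): ", n - 1)                    # 7 gates, depth 3
print("DNF terms (depth 2):   ", dnf_terms_for_parity(n))  # 128 = 2**(n - 1)

The exponential blow-up at depth 2 is exactly the kind of behavior formalized by the
circuit-complexity results cited above.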
6.5 Feature / Representation Learning
Let us consider again single-layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves a
convex optimization problem⁹, i.e., one with convergence guarantees towards a global
optimum, irrespective of initial conditions. Simple and well-understood optimization
algorithms are available in this case. However, this limits the representational capacity
too much: for a given choice of input representation x (the raw input features), many
tasks cannot be solved using only a linear predictor. What are our options for avoiding
this limitation?
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Schölkopf
et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of much
higher dimension. In this case, f_θ(x) = b + w · φ(x) can be linear in the parameters
(and in φ(x)) and optimization remains convex (or even analytic). Thanks to the
kernel trick, we can computationally handle a high-dimensional φ(x) (or even an
infinite-dimensional one) so long as the kernel K(u, v) = φ(u) · φ(v) (where · is
the appropriate dot product for the space of φ(·)) can be computed efficiently.
If φ(x) is of high enough dimension, we can always have enough capacity to fit
the training set, but generalization is not at all guaranteed: it will depend on
the appropriateness of the choice of φ as a feature space for our task. Kernel
machine theory clearly identifies the choice of φ with the choice of a prior. This
leads to kernel engineering, which is equivalent to feature engineering, discussed
next. The other type of kernel (the kind that is very commonly used) embodies a very
broad prior, such as smoothness, e.g., the Gaussian (or RBF) kernel
K(u, v) = e^{−||u−v||²/σ²}.
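As a minimal illustration of both points (a sketch assuming NumPy and scikit-learn, which
are not otherwise used in this text): a linear predictor on the raw features x cannot solve
an XOR-style task, while a kernel machine with the Gaussian (RBF) kernel, i.e., a linear
model in an implicit high-dimensional φ(x), fits it easily via the kernel trick. Note that
scikit-learn writes the RBF kernel as K(u, v) = e^{−γ||u−v||²}, with γ playing the role of 1/σ².

# Illustrative sketch (assumes NumPy and scikit-learn are available).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))                     # raw input features x
y = (np.sign(X[:, 0]) != np.sign(X[:, 1])).astype(int)    # XOR of the two signs

linear = LogisticRegression().fit(X, y)                   # linear in x: convex, but too limited
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)              # linear in an implicit phi(x)

print("linear predictor accuracy:  ", linear.score(X, y))   # around 0.5, i.e., chance level
print("RBF kernel machine accuracy:", rbf.score(X, y))      # close to 1.0 on this training set

The RBF kernel succeeds here because its smoothness prior happens to match the task; as
emphasized above, generalization still hinges on whether φ (equivalently, the kernel) is
appropriate for the problem at hand.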
⁹ Or even one for which an analytic solution can be computed, as with linear regression or
some Gaussian process regression models.