
its internal processing. Empirically, greater depth does seem to result in better
generalization for a wide variety of tasks (Bengio et al., 2007a; Erhan et al., 2009;
Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al., 2012;
Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a; Couprie
et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a).
See Fig. 6.8 for an example of some of these empirical results. This suggests that
using deep architectures does indeed express a useful prior over the space of
functions the model learns.
6.6 Feature / Representation Learning
Let us consider again single-layer networks such as the perceptron, linear
regression and logistic regression: such linear models are appealing because train-
ing them involves a convex optimization problem^14, i.e., an optimization problem
with some convergence guarantees towards a global optimum, irrespective of ini-
tial conditions. Simple and well-understood optimization algorithms are available
in this case. However, this limits the representational capacity too much: many
tasks, for a given choice of input representation x (the raw input features), cannot
be solved by using only a linear predictor. What are our options to avoid that
limitation?
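Before turning to the options, a minimal sketch (in NumPy, which this text does not itself use) makes the limitation concrete: logistic regression on the XOR task has its global optimum at chance-level accuracy on the raw features, while a single hand-chosen extra feature makes the task solvable by a linear predictor. The learning rate, step count and the product feature x1*x2 below are illustrative choices, not prescriptions from the text.

import numpy as np

# XOR: a task that no linear predictor on the raw features x can solve.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def train_logistic(features, targets, steps=5000, lr=0.5):
    # Plain logistic regression trained by gradient descent (a convex problem).
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid(w . x + b)
        grad = p - targets                              # gradient of the cross-entropy
        w -= lr * features.T @ grad / len(targets)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, features, targets):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return ((p > 0.5) == targets).mean()

# On the raw representation x, the global optimum leaves w = 0, b = 0: every
# example gets probability 0.5 and accuracy stays at chance level (0.5).
w, b = train_logistic(X, y)
print("raw features x    :", accuracy(w, b, X, y))

# A hand-engineered representation phi(x) = (x1, x2, x1*x2) makes XOR linearly
# separable, and the same convex training procedure now fits it (accuracy 1.0).
Phi = np.hstack([X, X[:, :1] * X[:, 1:]])
w, b = train_logistic(Phi, y)
print("engineered phi(x) :", accuracy(w, b, Phi, y))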
1. One option is to use a kernel machine (Williams and Rasmussen, 1996;
Schölkopf et al., 1999), i.e., to consider a fixed mapping from x to φ(x),
where φ(x) is of much higher dimension. In this case, f_θ(x) = b + w · φ(x)
can be linear in the parameters (and in φ(x)) and optimization remains
convex (or even analytic). By exploiting the kernel trick, we can compu-
tationally handle a high-dimensional φ(x) (or even an infinite-dimensional
one) so long as the kernel k(u, v) = φ(u) · φ(v) (where · is the appropriate
dot product for the space of φ(·)) can be computed efficiently. If φ(x) is
of high enough dimension, we can always have enough capacity to fit the
training set, but generalization is not at all guaranteed: it will depend on
the appropriateness of the choice of φ as a feature space for our task. Kernel
machine theory clearly identifies the choice of φ with the choice of a prior.
This leads to kernel engineering, which is equivalent to feature engineer-
ing, discussed next. The other type of kernel (that is very commonly used)
embodies a very broad prior, such as smoothness, e.g., the Gaussian (or
RBF) kernel k(u, v) = exp(−||u − v||²/σ²). Unfortunately, this prior may
be insufficient, i.e., too broad and sensitive to the curse of dimensionality,
as introduced in Section 5.13.1 and developed in more detail in Chapter 16.
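To illustrate the kernel trick concretely, here is a minimal NumPy sketch (not part of the text) of kernel ridge regression with the Gaussian kernel above: the predictor is never written out as b + w · φ(x); training and prediction touch only the kernel values k(u, v), even though the implicit φ(x) for the RBF kernel is infinite-dimensional. The toy sin(2x) data, the bandwidth sigma and the ridge coefficient lam are arbitrary illustrative choices.

import numpy as np

def rbf_kernel(U, V, sigma=1.0):
    # Gaussian (RBF) kernel k(u, v) = exp(-||u - v||^2 / sigma^2), evaluated pairwise.
    sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma ** 2)

def fit_kernel_ridge(X, y, sigma=1.0, lam=1e-3):
    # Kernel ridge regression: an analytic (convex) solution that needs only the
    # n x n Gram matrix K, never an explicit phi(x).
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lam*I)^-1 y

def predict(alpha, X_train, X_new, sigma=1.0):
    # f(x) = sum_i alpha_i k(x, x_i)
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# Toy 1-D regression: no linear predictor in x fits sin(2x), but the kernel machine does.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.randn(50)

alpha = fit_kernel_ridge(X, y, sigma=0.5)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(predict(alpha, X, X_test, sigma=0.5))   # approximately sin(2 * X_test)
print(np.sin(2 * X_test[:, 0]))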
^14 or even one for which an analytic solution can be computed, as with linear regression or
some Gaussian process regression models