Chapter 6
Feedforward Deep Networks
6.1 Formalizing and Generalizing Neural Networks
Feedforward supervised neural networks were among the first and most successful learning algorithms (Rumelhart et al., 1986d,c). They are also called deep networks, multi-layer perceptrons (MLPs), or simply neural networks. The vanilla architecture, with a single hidden layer, is illustrated in Figure 6.1.
Figure 6.1: Vanilla (shallow) MLP, with one sigmoid hidden layer, computing the vector-valued hidden unit vector h = sigmoid(c + W x) with weight matrix W and bias (offset) vector c. The output vector is obtained via another learned affine transformation ŷ = b + V h, with weight matrix V and output bias (offset) vector b.
MLPs can learn powerful non-linear transformations: in fact, with enough hidden
units they can represent arbitrarily complex but smooth functions (see Section 6.4).
This is achieved by composing simpler but still non-linear learned transformations. By
transforming the data non-linearly into a new space, a classification problem that was
not linearly separable (not solvable by a linear classifier) can become separable, as
illustrated in Figure 6.2.
Figure 6.2: Each layer of a trained neural network non-linearly transforms its input,
distorting the space so that the task becomes easier to perform, e.g., linear classification
in the above figures. Top: a vanilla neural network with 2 hidden units in its single
hidden layer can transform the 2-D input space so that the two classes (in red/pink vs
blue) become linearly separable (the red and blue curves correspond to the “manifold”
near which examples concentrate). Bottom: with a larger hidden layer (100 here),
the MNIST digit images (with 28 × 28 = 784 pixels) can be transformed so that the
classes (each shown with a different color) can be much more easily classified by the
output linear+softmax layer (over 10 digit categories). Both figures are reproduced with
permission by Chris Olah from http://colah.github.io/, where many more insightful
visualizations can be found.
In addition to covering the basics of such networks, this chapter introduces a general
formalism for gradient-based optimization of parametrized families of functions, often
in the context of conditional maximum likelihood criteria (Section 6.2).
They bring together a number of important machine learning concepts already in-
troduced in the previous chapters:
• Define a parametrized family of functions f_θ describing how the learner will behave on new examples, i.e., what output f_θ(x) it will produce given some input x. Training consists in choosing the parameter θ (usually represented by a vector) given some training examples (x, y) sampled from an unknown data generating distribution P(X, Y).
• Define a loss function L describing what scalar loss L(ŷ, y) is associated with each supervised example (x, y), as a function of the learner's output ŷ = f_θ(x) and the target output y.
• Define a training criterion and a regularizer. The objective of training is ideally to minimize the expected loss E_{X,Y}[L(f_θ(X), Y)] over (X, Y) sampled from the unknown data generating distribution P(X, Y). However this is not possible because the expectation makes use of the true underlying P(X, Y), while we only have access to a finite number of training examples, i.e., of pairs (x, y). Instead, one defines a training criterion which usually includes an empirical average of the loss over the training set, plus some additional terms (called regularizers) which enforce some preferences over the choices of θ.
• Define an optimization procedure to approximately minimize the training criterion¹. The simplest such optimization procedure is stochastic gradient descent, described in Section 4.3.
Example 6.1.1 illustrates these concepts for the case of a vanilla neural network for
regression.
In Chapter 17, we consider generalizations of the above framework to the unsupervised and semi-supervised cases, where Y may not exist or may not always be present. An unsupervised loss can then be defined solely as a function of the input x and some function f_θ(x) that one wishes to learn.
¹ It is generally not possible to analytically obtain a global minimum of the training criterion, and iterative numerical optimization methods are used instead.
Example 6.1.1. Vanilla (Shallow) Multi-Layer Neural Network for Regression

Based on the above definitions, we could pick the family of input-output functions to be

    f_θ(x) = b + V sigmoid(c + W x),

illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + e^{−a}) is applied element-wise, the input is the vector x ∈ R^{n_i}, the hidden layer outputs are the elements of the vector h = sigmoid(c + W x) with n_h entries, and the parameters are θ = (b, c, V, W) (viewed as the flattened vectorized version of the tuple), with b ∈ R^{n_o} a vector of the same dimension as the output (n_o), c ∈ R^{n_h} of the same dimension as h (the number of hidden units), and V ∈ R^{n_o × n_h} and W ∈ R^{n_h × n_i} being weight matrices.
The loss function for this classical example could be the squared error L(ŷ, y) = ||ŷ − y||² (see Section 5.8.1, showing that it makes ŷ an estimator of E[Y|x]). The regularizer could be the ordinary L2 weight decay ||ω||² = Σ_{ij} W_{ij}² + Σ_{ki} V_{ki}², where we define the set of weights ω as the concatenation of the elements of matrices W and V. The L2 weight decay thus penalizes the squared norm of the weights, with λ a scalar that is chosen larger to penalize the weights more strongly and yield smaller weights. Combining the loss function and the regularizer gives the training criterion, which is the objective function during training:

    C(θ) = λ||ω||² + (1/n) Σ_{t=1}^{n} ||y^{(t)} − (b + V sigmoid(c + W x^{(t)}))||².
where (x^{(t)}, y^{(t)}) is the t-th training example, an (input, target) pair. Finally, the classical training procedure in this example is stochastic gradient descent, which iteratively updates θ according to

    ω ← ω − ε (2λω + ∇_ω L(f_θ(x^{(t)}), y^{(t)}))
    β ← β − ε ∇_β L(f_θ(x^{(t)}), y^{(t)}),

where β = (b, c) contains the offset (bias) parameters, ω = (W, V) the weight matrices, ε is a learning rate, and t is incremented after each training example, modulo n.
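As a concrete illustration, here is a minimal numpy sketch of this example; the toy data, the dimensions n_i, n_h, n_o, the learning rate eps and the weight decay coefficient lam are arbitrary choices made for illustration, not part of the example above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, params):
    """f_theta(x) = b + V sigmoid(c + W x), returning also the hidden vector h."""
    b, c, V, W = params
    h = sigmoid(c + W @ x)
    return b + V @ h, h

def sgd_step(x, y, params, lam=1e-4, eps=0.01):
    """One stochastic gradient step on the L2-regularized squared error."""
    b, c, V, W = params
    y_hat, h = forward(x, params)
    d_yhat = 2.0 * (y_hat - y)              # gradient of ||y_hat - y||^2 w.r.t. y_hat
    d_V = np.outer(d_yhat, h)
    d_a = (V.T @ d_yhat) * h * (1.0 - h)    # back through the sigmoid
    d_W = np.outer(d_a, x)
    W -= eps * (2 * lam * W + d_W)          # omega update: 2*lambda*omega + gradient
    V -= eps * (2 * lam * V + d_V)
    b -= eps * d_yhat                       # beta = (b, c): no weight decay on offsets
    c -= eps * d_a
    return params

# toy usage: t runs over the examples, modulo n, as in the text
rng = np.random.RandomState(0)
n_i, n_h, n_o, n = 3, 10, 2, 50
X = rng.randn(n, n_i)
Y = np.stack([X.sum(axis=1), X[:, 0] - X[:, 1]], axis=1)   # arbitrary toy targets
params = [np.zeros(n_o), np.zeros(n_h),
          0.1 * rng.randn(n_o, n_h), 0.1 * rng.randn(n_h, n_i)]
for t in range(2000):
    params = sgd_step(X[t % n], Y[t % n], params)
```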
6.2 Parametrizing a Learned Predictor
There are many ways to define the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described below,
while more advanced ones are left to later chapters, in particular Chapters 10 and 16.
6.2.1 Family of Functions
A motivation for the family of functions defined by multi-layer neural networks is to
compose affine transformations and non-linearities, where the choice of parameter values
controls the degree and location of these non-linearities. As discussed in Section 6.4
below, with the appropriate choice of parameters, multi-layer neural networks can in
principle approximate smooth functions, with more hidden units allowing one to achieve
better approximations.
A multi-layer neural network with more than one hidden layer can be defined by generalizing the above structure, e.g., as follows, where we chose to use hyperbolic tangent² activation functions instead of sigmoid activation functions:

    h^k = tanh(b^k + W^k h^{k−1})

where h^0 = x is the input of the neural net, h^k (for k > 0) is the output of the k-th hidden layer, which has weight matrix W^k and offset (or bias) vector b^k. If we want the output f_θ(x) to lie in some desired range, then we typically define an output non-linearity (which we did not have in the above Example 6.1.1). The non-linearity for the output layer is of course generally different from the tanh, depending on the type of output to be predicted and the associated loss function (see below).
As discussed in Chapter 12 (Section 12.5), there are several other non-linearities besides the sigmoid and the hyperbolic tangent which have been successfully used with neural networks. In particular, alternative non-linear units such as the rectified linear unit (max(0, b + w·x)) and the maxout unit (max_i(b_i + W_{:,i}·x)) have been particularly successful in the case of deep feedforward or convolutional networks. These and other non-linear neural network activation functions commonly found in the literature are summarized below.
• Rectifier or rectified linear unit (ReLU) or positive part: h(x) = max(0, b + w·x), also written (b + w·x)_+.
• Hyperbolic tangent: h(x) = tanh(b + w·x).
• Sigmoid: h(x) = 1/(1 + e^{−(b+w·x)}).
• Softmax: h(x) = softmax(b + W x). It is mostly used as an output non-linearity for predicting discrete probabilities. See definition and discussion below, Eq. 6.1.
² which is linearly related to the sigmoid as follows: tanh(x) = 2 × sigmoid(2x) − 1
• Radial basis function or RBF unit: h(x) = e^{−||w−x||²₂}. This is heavily used in kernel SVMs (Boser et al., 1992; Schölkopf et al., 1999) and has the advantage that such units can be easily initialized as a random subset of the input examples (Powell, 1987; Niranjan and Fallside, 1990).
• Softplus: this is a smooth version of the rectifier, introduced in Dugas et al. (2001) for function approximation and in Nair and Hinton (2010) in RBMs. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter, in spite of their very similar shape and in spite of the softplus being differentiable everywhere with a non-zero derivative, contrary to the rectifier.
• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded: h(x) = max(−1, min(1, x)). It was introduced by Collobert (2004).
• Absolute value rectification: h(x) = |x| (may be applied on the dot product output or on the output of a tanh unit). It is also related to the rectifier and has been used for object recognition from images (Jarrett et al., 2009b), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination.
• Maxout: this generalizes the rectifier but introduces multiple weight vectors w_i (called filters) for each unit: h(x) = max_i(b_i + w_i·x). It is discussed in more detail later in the book.
This is not an exhaustive list but covers most of the non-linearities and unit computa-
tions seen in the deep learning and neural nets literature. Many variants are possible.
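For concreteness, the units above can be sketched in a few lines of numpy; this is only an illustrative sketch, with the pre-activation a = b + w·x assumed to have been computed already (except for maxout, which keeps its own filters):

```python
import numpy as np

def relu(a):               # rectifier / positive part
    return np.maximum(0.0, a)

def softplus(a):           # smooth version of the rectifier, written in a stable form
    return np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)

def hard_tanh(a):
    return np.clip(a, -1.0, 1.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))   # additive shift-invariance makes this numerically stable
    return e / e.sum()

def maxout(x, B, W):
    # W has one filter per row (shape: num_filters x dim_x), with offsets in B
    return np.max(B + W @ x, axis=0)
```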
As discussed in Section 6.3, the structure (also called architecture) of the family of
input-output functions can be varied in many ways, which calls for a generic principle
for efficiently computing gradients, described in that section. For example, a common
variation is to connect layers that are not adjacent, with so-called skip connections,
which are found in the visual cortex (for which the word “layer” should be replaced by
the word “area”). Other common variations depart from a full connectivity between
adjacent layers. For example, each unit at layer k may be connected to only a subset of
units at layer k − 1. A particular case of such a form of sparse connectivity is discussed in Chapter 9 with convolutional networks. In general, the set of connections between
units of the whole network only needs to form a directed acyclic graph in order to define
a meaningful computation (see the flow graph formalism below, Section 6.3). When
units of the network are connected to themselves through a cycle, one has to properly
define what computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 10.
6.2.2 Loss Function and Conditional Log-Likelihood
In the 80's and 90's the most commonly used loss function was the squared error L(f_θ(x), y) = ||f_θ(x) − y||². As discussed in Section 5.8.1, if f is unrestricted (non-parametric), minimizing the expected value of the loss function over the data-generating distribution P(X, Y) yields f(x) = E[Y|X], the true conditional expectation of Y given X³. See also Section 5.8.1 for more details. This tells us what the neural network
is trying to learn. Replacing the squared error by an absolute value makes the neural
network try to estimate not the conditional expectation but the conditional median.
However, when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood⁴ have been found to be more appropriate than the squared error. In the case where y ∈ {0, 1} is binary this gives L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)), and it can be shown that the optimal (non-parametric) f minimizing this criterion is f(x) = P(Y = 1|x). Note that in order for the above expression of the criterion to make sense, f_θ(x) must be strictly between 0 and 1 (an undefined or infinite value would otherwise arise). To achieve this, it is common to use the sigmoid as non-linearity for the output layer, which matches well with the Binomial negative log-likelihood criterion⁵. As explained below (Section 6.2.2), the cross-entropy criterion allows gradients to pass through the output non-linearity even when the neural network produces a confidently wrong answer, unlike the squared error criterion coupled with a sigmoid or softmax non-linearity.
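A brief numpy sketch of this loss, written directly in terms of an assumed pre-sigmoid score a = b + w·x (the names here are for illustration only), avoids the undefined values mentioned above because it never takes the log of 0:

```python
import numpy as np

def bernoulli_nll(a, y):
    """-y log sigmoid(a) - (1-y) log(1 - sigmoid(a)),
    rewritten as softplus(a) - y*a for numerical stability."""
    softplus = np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)
    return softplus - y * a
```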
Learning a Conditional Probability Model
More generally, one can define a loss function as corresponding to a conditional log-likelihood, i.e., the negative log-likelihood (NLL) criterion

    L_NLL(f_θ(x), y) = − log P_θ(Y = y | X = x).

See Section 5.8.2, which shows that this criterion yields an estimator of the true conditional probability of Y given X. For example, if Y is a continuous random variable and we assume that, given X, it has a Gaussian distribution with mean f_θ(X) and variance σ², then

    − log P_θ(Y|X) = (f_θ(X) − Y)² / (2σ²) + (1/2) log(2πσ²).
Up to an additive and multiplicative constant (which would give the same choice of θ),
this is therefore equivalent to the squared error loss. Once we understand this principle,
we can readily generalize it to other distributions, as appropriate. For example, it
is straightforward to generalize the univariate Gaussian to the multivariate case, and
under appropriate parametrization consider the variance to be a parameter or even
parametrized function of x (for example with output units that are guaranteed to be
positive, or forming a positive definite matrix).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion corresponds to the conditional log-likelihood associated with the Bernoulli distribution with probability p = f_θ(x) of generating Y = 1 given X = x (and probability 1 − p of generating Y = 0):

    − log P_θ(Y|X) = −1_{Y=1} log p − 1_{Y=0} log(1 − p) = −Y log f_θ(X) − (1 − Y) log(1 − f_θ(X)).

³ Proving this is not difficult and is very instructive.
⁴ This is often called cross entropy in the literature, even though the term cross entropy should also apply to many other losses that can be viewed as negative log-likelihood, discussed below in more detail.
⁵ In reference to statistical models, this "match" between the loss function and the output non-linearity is similar to the choice of a link function in generalized linear models (McCullagh and Nelder, 1989).
Softmax
When y is discrete (in some finite set, say {1, . . . , N}) but not binary, the Bernoulli distribution is extended to the Multinoulli (Murphy, 2012, Section 3.10.2)⁶. This distribution is specified by a vector of N − 1 probabilities whose sum is less than or equal to 1, each element of which provides the probability p_i = P(y = i|x). We thus need a vector-valued output non-linearity that produces such a vector of probabilities, and a commonly used non-linearity for this purpose is the softmax non-linearity introduced earlier:

    p = softmax(a),    p_i = e^{a_i} / Σ_j e^{a_j},    (6.1)

where typically a = b + W h is the vector of scores whose elements a_i are associated with each category i. The corresponding loss function is therefore L_NLL(p, y) = − log p_y.
Note how minimizing this loss will push a_y up (increase the score a_y associated with the correct label y) while pushing a_i (for i ≠ y) down (decreasing the scores of the other labels, in the context x). The first effect comes from the numerator of the softmax while the second effect comes from the normalizing denominator. These forces cancel on a specific example only if p_y = 1, and they cancel on average over examples (say sharing the same x) if p_i equals the fraction of times that y = i for this value of x. The same principles and the role of the normalization constant (or "partition function") can be seen at play in the training of Markov random fields, Boltzmann machines and RBMs, in Chapter 14.

Note other interesting properties of the softmax. First of all, the gradient of the Multinoulli negative log-likelihood with respect to the softmax argument a is

    ∇_a L_NLL(p(a), y) = (∂(− log p_y(a))/∂p_y) · ∇_a p_y(a) = −(1/p_y(a)) · p_y(a)(1_y − p) = p − 1_y,    (6.2)

where 1_y denotes the one-hot vector with a 1 in position y and zeros elsewhere. This gradient does not saturate: it does not vanish when the output probabilities approach 0 or 1, except when the model is providing the correct answer. Specifically, let us consider the case where the correct label is i, i.e. y = i. The element of the gradient associated with an erroneous label, say j ≠ i, is

    ∂L_NLL(p(a), y)/∂a_j = p_j.    (6.3)
⁶ The Multinoulli is also known as the categorical or generalized Bernoulli distribution, which is a special case of the Multinomial distribution with one trial, and the term multinomial is often used even in this case.
So if the model correctly predicts a low probability for the class y = j, i.e. p_j ≈ 0, then the gradient on a_j is also close to zero.
There are other loss functions, such as the squared error applied to softmax (or sigmoid) outputs (which was popular in the 80's and 90's), which have a vanishing gradient when an output unit saturates (when the derivative of the non-linearity is near 0), even if the output is completely wrong (Solla et al., 1988). This may be a problem because it means that the parameters will basically not change, even though the output is wrong.

To see how the squared error interacts with the softmax output, we need to introduce a one-hot encoding of the label, y = [0, . . . , 0, 1, 0, . . . , 0], i.e., for the label y = i, we have y_i = 1 and y_j = 0 for j ≠ i. We will again consider the output of the network to be p = softmax(a), where, as before, a is the input to the softmax function (i.e. a = b + W h, with h the output of the last hidden layer).

For the squared error loss L_2(p(a), y) = ||p(a) − y||², the gradient of the loss with respect to the input vector to the softmax, a, is given by

    ∇_a L_2(p(a), y) = (∂L_2/∂p)(∂p/∂a) = 2(p(a) − y) ⊙ p ⊙ (1 − p),

where ⊙ indicates an element-wise multiplication. Let us again consider the case where the correct label is i. The component of the gradient associated with label j is

    ∂L_2(p(a), y)/∂a_j = 2(p_j(a) − y_j) p_j (1 − p_j).    (6.4)

So if the model correctly predicts a low probability p_j ≈ 0 for an erroneous label j ≠ i, the gradient is again close to zero, but unlike for the negative log-likelihood, the gradient also vanishes whenever p_j saturates near 0 or 1 even when the prediction is wrong: in particular, the component for the correct label, 2(p_i − 1) p_i (1 − p_i), goes to zero when p_i ≈ 0, i.e., precisely when the model is confidently wrong.
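The contrast can be checked numerically; the following sketch evaluates both gradients at a confidently wrong prediction (the particular scores are arbitrary, chosen only for illustration):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([5.0, -5.0, -5.0])   # model is very confident in class 0
y = 2                              # but the correct class is 2
p = softmax(a)
onehot = np.eye(3)[y]

grad_nll = p - onehot                       # Eq. 6.2: stays large when wrong
grad_l2 = 2 * (p - onehot) * p * (1 - p)    # Eq. 6.4: vanishes as p saturates

print(grad_nll)   # roughly [ 1.0, 0.0, -1.0 ]: a strong corrective signal
print(grad_l2)    # all entries near 0: almost no learning signal
```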
Another property that could potentially be interesting is that the softmax output is invariant under additive changes of its input vector: softmax(a) = softmax(a + b) when b is a scalar added to all the elements of vector a. Finally, it is interesting to think of the softmax as a way to create a form of competition between the units (typically output units, but not necessarily) that participate in it: the strongest filter output a_i gets reinforced (because the exponential increases faster for larger values) and the other units get inhibited. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex, and at the extreme (when the a_i's are large in magnitude) becomes a form of winner-take-all (one of the outputs is nearly 1 and the others are nearly 0). A more computationally expensive form of competition is found with sparse coding, described in Section 20.3.
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution P(y|ω) with parameters ω, we can construct a conditional distribution P(y|x) by making ω a parametrized function of x, and learning that function:

    P(y | ω = f_θ(x))

where f_θ(x) is the output of a predictor, x is its input, and y can be thought of as a target. This established choice of words comes from the common cases of classification and regression, where f_θ(X) is really a prediction associated with Y, or with its expected value. However, in general ω = f_θ(X) may contain parameters of the distribution of Y other than its expected value. For example, it could contain its variance or covariance, in the case where Y is conditionally Gaussian. In the above examples, with the squared error loss, ω is the mean of the Gaussian which captures the conditional distribution of Y (which means that the variance is considered fixed, not a function of X). In the common classification case, ω contains the probabilities associated with the various events of interest.

Once we view things in this way, we automatically get as the natural cost function the negative log-likelihood L(X, Y) = − log P(Y | ω = f_θ(X)). Besides θ, there could be other parameters that control the distribution of Y. For example, we may wish to learn the variance of a conditional Gaussian, even when the latter is not a function of X. In this case, the solution can be computed analytically because the maximum likelihood estimator of the variance is simply the empirical mean of the squared difference between the observations Y and their expected value (here f_θ(X)). In other words, the conditional variance can be estimated from the mean squared error.
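A short numpy sketch of that last point, assuming predictions f and targets y are given as arrays (these names are chosen here only for illustration):

```python
import numpy as np

def mle_variance(f, y):
    """Maximum likelihood estimate of a fixed conditional variance:
    the mean squared difference between targets and predictions."""
    return np.mean((y - f) ** 2)
```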
Besides the Gaussian, a simple and common example is the case where Y is binary
(i.e. Bernoulli distributed), where it is enough to specify ω = P (Y = 1|X). In the
Multinoulli case, ω is generally specified by a vector of probabilities summing to 1, e.g.,
via the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural networks
is the mixture model, and in particular the Gaussian mixture model, introduced in Sec-
tion 3.10.5. Neural networks that compute the parameters of a mixture model were
introduced in Jacobs et al. (1991); Bishop (1994). In the case of the Gaussian mixture
model with N components,
    P(y|x) = Σ_{i=1}^{N} P(C = i|x) N(y; µ_i(x), Σ_i(x)),

the neural network must have three kinds of outputs, P(C = i|x), µ_i(x), and Σ_i(x), which must satisfy different constraints:

1. Mixture components P(C = i|x): these form a Multinoulli distribution over the N different components C, and can typically be obtained by a softmax over an N-vector, to guarantee that these outputs are positive and sum to 1.

2. Means µ_i(x): these indicate the center or mean associated with each Gaussian component, and are unconstrained (typically with no non-linearity at all for these output units). If Y is a d-vector, then the network must output an N × d matrix containing all these N d-vectors.

3. Covariances Σ_i(x): these specify the covariance matrix for each component i. In many models the variance is unconditional (does not depend on x) and diagonal or even scalar. If it is diagonal or scalar, only positivity must be enforced (e.g., using the softplus non-linearity). If it is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ_i(x) = B_i(x) B_i(x)^⊤, where B_i is an unconstrained square matrix. One practical issue when the matrix is full is that computing the likelihood is expensive, requiring O(d³) computation for the determinant and inverse of Σ_i(x) (or equivalently, and more commonly done, its eigendecomposition or that of B_i(x)).
It has been reported that gradient-based optimization of conditional Gaussian mixtures
(on the output of neural networks) can be finicky, in part because one gets divisions (by
the variance) which can be numerically unstable (when some variance gets to be small
for a particular example, yielding very large gradients). One solution is to clip gradients
(see Section 10.6.7 and Mikolov (2012); Pascanu and Bengio (2012); Graves (2013);
Pascanu et al. (2013a)), while another is to scale the gradients heuristically (Murray
and Larochelle, 2014).
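As an illustrative sketch (not the formulation of any specific paper), the three kinds of outputs can be produced from a last hidden vector h as follows, assuming diagonal covariances and hypothetical dimensions N and d:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)

def mixture_params(h, Wc, Wmu, Wsig, N, d):
    """Map the last hidden layer h to Gaussian mixture parameters.
    Wc: (N, dim_h), Wmu: (N*d, dim_h), Wsig: (N*d, dim_h) are weight
    matrices (biases omitted for brevity)."""
    weights = softmax(Wc @ h)                      # P(C = i | x), positive, sums to 1
    means = (Wmu @ h).reshape(N, d)                # unconstrained means
    variances = softplus(Wsig @ h).reshape(N, d)   # positive diagonal variances
    return weights, means, variances
```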
Multiple Output Variables
When Y is actually a tuple formed by multiple random variables Y_1, Y_2, . . . , Y_k, then one has to choose an appropriate form for their joint distribution, conditional on X = x. The simplest and most common choice is to assume that the Y_i are conditionally independent, i.e.,

    P(Y_1, Y_2, . . . , Y_k | x) = ∏_{i=1}^{k} P(Y_i | x).

This brings us back to the single variable case, especially since the log-likelihood now decomposes into a sum of terms log P(Y_i|x). If each P(Y_i|x) is separately parametrized (e.g. by a different neural network), then we can train these neural networks independently. However, a more common and powerful choice assumes that the different variables Y_i share some common factors that can be represented in some hidden layer of the network
(such as the top hidden layer). See Sections 6.5 and 7.12 for a deeper treatment of the
notion of underlying factors of variation and multi-task training. See also Figure 7.6
illustrating these concepts.
If the conditional independence assumption is considered too strong, what can we do?
At this point it is useful to step back and consider everything we know about learning
a joint probability distribution. Any probability distribution P_ω(Y) parametrized by parameters ω can be turned into a conditional distribution P_θ(Y|x) by making ω a function ω = f_θ(x) parametrized by θ. This is not only true of the simple parametric
distributions we have seen above (Gaussian, Bernoulli, Multinoulli) but also of more
complex joint distributions typically represented by a graphical model. Much of this
book is devoted to such graphical models (see Chapters 14, 19, 20, 21). For more
concrete examples of such structured output models with deep learning, see Section 13.4.
6.2.3 Training Criterion and Regularizer
The loss function (often interpretable as a negative log-likelihood) tells us what we would
like the learner to capture. Maximizing the conditional log-likelihood over the true distribution, i.e., minimizing the expected loss E_{X,Y}[− log P_θ(Y|X)], makes P_θ(Y|X) estimate the true P(Y|X) associated with the unknown data generating distribution, within the boundaries of what the chosen family of functions allows. See Section 5.8.2 for a proof. In practice we cannot minimize this expectation because we do not know P(X, Y) and because computing and minimizing this integral exactly would generally be intractable. Instead we are going to approximately minimize a training criterion C(θ) based on the empirical average of the loss (over the training set). The simplest such criterion is the average training loss (1/n) Σ_{t=1}^{n} L(f_θ(x^{(t)}), y^{(t)}), where the training set is a set of n examples (x^{(t)}, y^{(t)}). However, better results can often be achieved by
crippling the learner and preventing it from simply finding the best θ that minimizes the
average training loss. This means that we combine the evidence coming from the data
(the training set average loss) with some a priori preference on the different values that
θ can take (the regularizer). If this concept (and the related concepts of generalization,
overfitting and underfitting) are not clear, please return to Sections 5.3 and 5.6.2 for a
refresher.
The most common regularizer is simply an additive term equal to a regularization coefficient λ times the squared norm of the parameters⁷, ||θ||²₂. This regularizer is often called weight decay or L2 weight decay or shrinkage, because adding the squared L2 norm to the training criterion pushes the weights towards zero, in proportion to their magnitude. For example, when using stochastic gradient descent, each step of the update with regularizer term (λ/2)||θ||² would look like

    θ ← θ − ε (∇_θ L(f_θ(x), y) + λθ)

where ε is the learning rate (see Section 4.3). This can be rewritten

    θ ← θ(1 − ελ) − ε ∇_θ L(f_θ(x), y)

where we see that the first term systematically shaves off a small proportion of θ before adding the gradient.
⁷ In principle, one could have different priors on different parameters, e.g., it is common to treat the output weights with a separate regularization coefficient, but the more hyper-parameters, the more difficult is their optimization, as discussed in Chapter 12.

Another commonly used regularizer is the so-called L1 regularizer, which adds a term proportional to the L1 (absolute value) norm, ||θ||₁ = Σ_i |θ_i|. The L1 regularizer also prefers values that are small, but whereas the L2 weight decay has only a weak preference for 0 rather than small values, the L1 regularizer continues pushing the parameters towards 0 with a constant gradient even when they get very small. As a consequence, it tends to bring some of the parameters to exactly 0 (how many depends on how large the regularization coefficient λ is chosen). When optimizing with an approximate, iterative and noisy method such as stochastic gradient descent, no exact 0 is obtained. On the other hand, the L1 regularizer tolerates large values of some parameters (only additively removing a small quantity compared to the situation without regularization), whereas the L2 weight decay aggressively punishes and prevents large parameter values. The L1 and L2 regularizers can be combined, as in the so-called elastic net (Zou and Hastie, 2005), and this is commonly done in deep networks⁸.
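The two penalties translate into slightly different parameter updates; here is a small numpy sketch of a single SGD step with each, assuming grad is the gradient of the unregularized loss, eps the learning rate and lam the regularization coefficient (names chosen for illustration):

```python
import numpy as np

def sgd_step_l2(theta, grad, eps=0.01, lam=1e-4):
    # L2 / weight decay: shrink theta by a fraction proportional to its value
    return theta * (1 - eps * lam) - eps * grad

def sgd_step_l1(theta, grad, eps=0.01, lam=1e-4):
    # L1: push each parameter towards 0 with a constant-magnitude step
    return theta - eps * (grad + lam * np.sign(theta))
```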
Note how in the context of maximum likelihood, regularizers can generally be in-
terpreted as Bayesian priors on the parameters P (θ) or on the learned function, as
discussed in Sections 5.6.2 and 5.7. Later in this book we discuss regularizers that are
data-dependent (i.e., cannot be expressed easily as a pure prior), such as the contractive
penalty (Chapter 16) as well as regularization methods that are difficult to interpret
simply as added terms in the training criterion, such as dropout (Section 7.11).
6.2.4 Optimization Procedure
Once a training criterion is defined, we enter the realm of iterative optimization proce-
dures, with the slight twist that if possible we want to monitor held-out performance
(for example the average loss on a validation set, which usually does not include the
regularizing terms) and not just the value of the training criterion. Monitoring the val-
idation set performance allows one to stop training at a point where generalization is
estimated to be the best among the training iterations. This is called early stopping and
is a standard machine learning technique, discussed in Sec. 7.8.
Section 4.3 has already covered the basics of gradient-based optimization. The sim-
plest such technique is stochastic gradient descent (with minibatches), in which the
parameters are updated after computing the gradient of the average loss over a mini-
batch of examples (e.g. 128 examples) and making an update in the direction opposite
to that gradient (or opposite to some accumulated average of such gradients, i.e., the
momentum technique, reviewed in Section 12.5.3). The most commonly used optimiza-
tion procedures for multi-layer neural networks and deep learning in general are either
variants of stochastic gradient descent (typically with minibatches), second-order meth-
ods (the most commonly used being L-BFGS and nonlinear conjugate gradients) applied
on large minibatches (e.g. of size 10000) or on the whole training set at once, or nat-
ural gradient methods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu
and Bengio, 2013). Exact second-order and natural gradient methods are computation-
ally too expensive for large neural networks because they involve matrices of dimension
equal to the square of the number of parameters. Approximate methods are discussed
in Section 8.4.
On smaller datasets or when computing can be parallelized, second-order methods
have a computational advantage because they are easy to parallelize and can still perform
many updates per unit of time. On larger datasets (and in the limit of an infinite dataset,
i.e., a stream of training examples) one cannot afford to wait to see the whole dataset before making an update, which favors either stochastic gradient descent variants (possibly with minibatches, to take advantage of fast or parallelized implementations of matrix-matrix products) or second-order methods with minibatches.

⁸ See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.html#l1-and-l2-regularization
A more detailed discussion of issues arising with optimization methods in deep learn-
ing can be found in Chapter 8. In particular, many design choices in the construction
of the family of functions, loss function and regularizer can have a major impact on
the difficulty of optimizing the training criterion. Furthermore, instead of using generic
optimization techniques, one can design optimization procedures that are specific to
the learning problem and chosen architecture of the family of functions, for example by
initializing the parameters of the final optimization routine from the result of a differ-
ent optimization (or a series of optimizations, each initialized from the previous one).
Because of the non-convexity of the training criterion, the initial conditions can make
a very important difference, and this is what is exploited in the various pre-training
strategies, Sections 8.6.4 and 17.1, as well as with curriculum learning (Bengio et al.,
2009), Section 8.7.
6.3 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning algo-
rithm for multi-layer neural networks. Actually it just means the method for computing
gradients in such networks. Furthermore, it is generally understood as something very
specific to multi-layer neural networks, but once its derivation is understood, it can eas-
ily be generalized to arbitrary functions (for which computing a gradient is meaningful),
and we describe this generalization here, focusing on the case of interest in machine
learning where the output of the function to differentiate (e.g., the loss L or the training
criterion C) is a scalar and we are interested in its derivative with respect to a set of pa-
rameters (considered to be the elements of a vector θ). The partial derivative of C with
respect to θ (called the gradient) tells us whether θ should be increased or decreased
in order to decrease C, and is a crucial tool in optimizing the training objective. It
can be readily proven that the back-propagation algorithm has optimal computational
complexity in the sense that there is no algorithm that can compute the gradient faster
(in the O(·) sense, i.e., up to an additive and multiplicative constant).
The basic idea of the back-propagation algorithm is that the partial derivative of
the cost C with respect to parameters θ can be decomposed recursively by taking into
consideration the composition of functions that relate θ to C, via intermediate quantities
that mediate that influence, e.g., the activations of hidden units in a deep neural network.
In this section we call these intermediate values u_j (indexed by j) and consider the general case in which they form a directed acyclic graph that has C as its final node u_N, which depends on all the other nodes u_j. The back-propagation algorithm exploits the chain rule for derivatives to compute ∂C/∂u_j when ∂C/∂u_i has already been computed for the successors u_i of u_j in the graph, e.g., the hidden units in the next layer downstream. This recursion can be initialized by noting that ∂C/∂u_N = 1, and at each step it only requires the partial derivatives associated with each arc of the graph, ∂u_i/∂u_j, when u_i is a successor of u_j.
Figure 6.3: The chain rule, illustrated in the simplest possible case, with z a scalar function of y, which is itself a scalar function of x. A small change Δx in x gets turned into a small change Δy in y through the partial derivative ∂y/∂x, from the first-order Taylor approximation of y(x), and similarly for z(y). Plugging the equation for Δy into the equation for Δz yields the chain rule.
6.3.1 Chain Rule
In order to apply the back-propagation algorithm, we take advantage of the chain rule:

    ∂C(g(θ))/∂θ = (∂C(g(θ))/∂g(θ)) (∂g(θ)/∂θ)    (6.5)

which works also when C, g or θ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in θ will propagate into a small change in g(θ) by getting multiplied by ∂g(θ)/∂θ. Similarly, a small change in g(θ) will propagate into a small change in C(g(θ)) by getting multiplied by ∂C(g(θ))/∂g(θ). Hence a small change in θ first gets multiplied by ∂g(θ)/∂θ to obtain the change in g(θ), and this then gets multiplied by ∂C(g(θ))/∂g(θ) to obtain the change in C(g(θ)). Hence the ratio of the change in C(g(θ)) to the change in θ is the product of these partial derivatives. The partial derivative measures the locally linear influence of a variable on another. This is illustrated in Figure 6.3, with x = θ, y = g(θ), and z = C(g(θ)).

Now, if g is a vector, we can rewrite the above as follows:

    ∂C(g(θ))/∂θ = Σ_i (∂C(g(θ))/∂g_i(θ)) (∂g_i(θ)/∂θ)

which sums over the influences of θ on C(g(θ)) through all the intermediate variables g_i(θ). This is illustrated in Figure 6.4 with x = θ, y_i = g_i(θ), and z = C(g(θ)).
Figure 6.4: Top: The chain rule, when there are two intermediate variables y_1 and y_2 between x and z, creating two paths for changes in x to propagate and yield changes in z. Bottom: the more general case, with n intermediate variables y_1 to y_n.
6.3.2 Back-Propagation in an MLP
Whereas Example 6.1.1 illustrated the case of an MLP with a single hidden layer, let us consider in this section back-propagation for an ordinary but deep MLP, i.e., like the above vanilla MLP but with several hidden layers. For this purpose, we will recursively apply the chain rule illustrated in Figure 6.4. The algorithm proceeds by first computing the gradient of the cost C with respect to output units, which is used to compute the gradient of C with respect to the top hidden layer activations. We can then continue computing the gradients of lower-level hidden units one layer at a time in the same way. The gradients on hidden and output units can be used to compute the gradient of C with respect to the parameters (e.g. weights and biases) of each layer (i.e., those that directly contribute to the output of that layer).
Algorithm 6.1 describes in matrix-vector form the forward propagation computation for a classical multi-layer network with M layers.
Algorithm 6.1 Forward computation associated with input x for a deep neural network with ordinary affine layers composed with an arbitrary elementwise differentiable (almost everywhere) non-linearity f. There are M such layers, each mapping their vector-valued input h^{(k−1)} to a pre-activation vector a^{(k)} via a weight matrix W^{(k)}, which is then transformed via f into h^{(k)}. The input vector x corresponds to h^{(0)} and the predicted output ŷ corresponds to h^{(M)}. The cost function L(ŷ, y) depends on the output ŷ and on a target y (see Section 6.2.2 for examples of loss functions). The loss may be added to a regularizer Ω (see Section 6.2.3 and Chapter 7) to obtain the example-wise cost C. Algorithm 6.2 shows how to compute gradients of C with respect to parameters W and b.

  h^{(0)} = x
  for k = 1 . . . , M do
    a^{(k)} = b^{(k)} + W^{(k)} h^{(k−1)}
    h^{(k)} = f(a^{(k)})
  end for
  ŷ = h^{(M)}
  C = L(ŷ, y) + λΩ
Each layer computes an affine transformation (defined by a bias vector b^{(k)} and a weight matrix W^{(k)}) followed by a non-linearity f. In general, the non-linearity may be different on different layers, and this is typically the case for the output layer (see Section 6.2.1). Hence each unit at layer k computes an output h_i^{(k)} as follows:

    a_i^{(k)} = b_i^{(k)} + Σ_j W_{ij}^{(k)} h_j^{(k−1)}
    h_i^{(k)} = f(a_i^{(k)})    (6.6)

where we separate the affine transformation from the non-linear activation operations for ease of exposition of the back-propagation computations.
These are described in matrix-vector form by Algorithm 6.2 and proceed from the
output layer towards the first hidden layer, as outlined above.
6.3.3 Back-Propagation in a General Flow Graph
More generally, we can think about decomposing a function C(θ) into a more complicated graph of computations. This graph is called a flow graph. Each node u_i of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values u_j of other nodes, with j < i. The nodes satisfy a partial order which dictates in what order the computation can proceed. In practical implementations of such functions (e.g. with the criterion C(θ) or its value estimated on a minibatch), the final computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpy library can perform on arrays of numbers).
Algorithm 6.2 Backward computation for the deep neural network of Algorithm 6.1, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^{(k)} for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

  After the forward computation, compute the gradient on the output layer:
    g ← ∇_ŷ C = ∇_ŷ L(ŷ, y) + λ∇_ŷ Ω
    (typically Ω is only a function of the parameters, not of the activations, so the last term would be zero)
  for k = M down to 1 do
    Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
      g ← ∇_{a^{(k)}} C = g ⊙ f'(a^{(k)})
    Compute gradients on weights and biases (including the regularization term, where needed):
      ∇_{b^{(k)}} C = g + λ∇_{b^{(k)}} Ω
      ∇_{W^{(k)}} C = g h^{(k−1)⊤} + λ∇_{W^{(k)}} Ω
    Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
      g ← ∇_{h^{(k−1)}} C = W^{(k)⊤} ∇_{a^{(k)}} C
  end for
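To make the two algorithms concrete, here is a minimal numpy sketch of the forward and backward passes for a tanh network with squared error loss and no regularizer (these particular choices of f, L and Ω = 0, and the representation of the parameters as lists of per-layer arrays, are illustrative assumptions, not requirements of the algorithms):

```python
import numpy as np

def forward(x, weights, biases):
    """Algorithm 6.1: returns the pre-activations a^(k) and outputs h^(k) of each layer."""
    h, activations, outputs = x, [], [x]
    for W, b in zip(weights, biases):
        a = b + W @ h
        h = np.tanh(a)
        activations.append(a)
        outputs.append(h)
    return activations, outputs

def backward(y, activations, outputs, weights):
    """Algorithm 6.2: gradients of C = ||y_hat - y||^2 w.r.t. every W^(k) and b^(k)."""
    grads_W, grads_b = [], []
    g = 2.0 * (outputs[-1] - y)                        # gradient on the output y_hat
    for k in reversed(range(len(weights))):
        g = g * (1.0 - np.tanh(activations[k]) ** 2)   # through f' (here, tanh')
        grads_b.insert(0, g)
        grads_W.insert(0, np.outer(g, outputs[k]))     # g h^(k-1)^T
        g = weights[k].T @ g                           # propagate to the layer below
    return grads_W, grads_b
```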
We will define back-propagation in a general flow graph, using the following generic notation: u_i = f_i(a_i), where a_i is the list of arguments for the application of f_i to the values u_j for the parents of i in the graph: a_i = (u_j)_{j ∈ parents(i)}. This is illustrated in Figure 6.5.
Figure 6.5: Illustration of recursive forward computation, where at each node u_i we compute a value u_i = f_i(a_i), with a_i being the list of values from the parents u_j of node u_i. Following Algorithm 6.3, the overall inputs to the graph are u_1 . . . , u_M (e.g., the parameters we may want to tune during training), and there is a single scalar output u_N (e.g., the loss which we want to minimize).
The overall computation of the function represented by the flow graph can thus be
summarized by the forward computation algorithm, Algorithm 6.3.
In addition to having some code that tells us how to compute f_i(a_i) for some values in the vector a_i, we also need some code that tells us how to compute its partial derivatives, ∂f_i(a_i)/∂a_{ik}, with respect to any immediate argument a_{ik}. Let k = π(i, j) denote the index of u_j in the list a_i. Note that u_j could influence u_i through multiple paths. Whereas ∂u_i/∂u_j would denote the total gradient adding up all of these influences, ∂f_i(a_i)/∂a_{ik} only denotes the derivative of f_i with respect to its specific k-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from u_j to u_i. In general, when manipulating partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument on the function output, with the other arguments considered fixed.
Algorithm 6.3 Flow graph forward computation. Each node computes a numerical value u_i by applying a function f_i to its argument list a_i, which comprises the values of previous nodes u_j, j < i, with j ∈ parents(i). The input to the flow graph is the vector x, and is set into the first M nodes u_1 to u_M. The output of the flow graph is read off the last (output) node u_N.

  for i = 1 . . . , M do
    u_i ← x_i
  end for
  for i = M + 1 . . . , N do
    a_i ← (u_j)_{j ∈ parents(i)}
    u_i ← f_i(a_i)
  end for
  return u_N
For example, consider f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}} and f_2(a_{2,1}) = a_{2,1}², while u_3 = f_3(u_2, u_1) and u_2 = f_2(u_1), as illustrated in Figure 6.6. The direct derivative of f_3 with respect to its argument a_{3,2} is ∂f_3/∂a_{3,2} = e^{a_{3,1} + a_{3,2}}, while if we consider the variables u_3 and u_1 to which these correspond, there are two paths from u_1 to u_3, and we obtain as the derivative the sum of partial derivatives over these two paths, ∂u_3/∂u_1 = e^{u_1 + u_2}(1 + 2u_1). The results are different because ∂u_3/∂u_1 involves not just the direct dependency of u_3 on u_1 but also the indirect dependency through u_2.
Figure 6.6: Illustration of the indirect effect and direct effect of variable u_1 on variable u_3 in a flow graph, which means that the derivative of u_3 with respect to u_1 must include the sum of two terms: one for the direct effect (derivative of u_3 with respect to its first argument) and one for the indirect effect through u_2 (involving the product of the derivative of u_3 with respect to u_2 times the derivative of u_2 with respect to u_1). Forward computation of the u_i's (as in Figure 6.5) is indicated with upward full arrows, while backward computation (of derivatives with respect to the u_i's, as in Figure 6.7) is indicated with downward dashed arrows.
Figure 6.7: Illustration of recursive backward computation, where we associate to each node j not just the values u_j computed in the forward pass (Figure 6.5, bold upward arrows) but also the gradient ∂u_N/∂u_j with respect to the output scalar node u_N. These gradients are recursively computed in exactly the opposite order, as described in Algorithm 6.4, by using the already computed ∂u_N/∂u_i of the children i of j (dashed downward arrows).
Armed with this understanding, we can define the back-propagation algorithm as follows, in Algorithm 6.4, which would be computed after the forward propagation (Algorithm 6.3) has been performed. Note the recursive nature of the application of the chain rule in Algorithm 6.4: we compute the gradient on node j by re-using the already computed gradients for children nodes i, starting the recurrence from the trivial ∂u_N/∂u_N = 1 that sets the gradient for the output node. This is illustrated in Figure 6.7.
Algorithm 6.4 Back-propagation computation of a flow graph (full, upward arrows), which itself produces an additional flow graph (dashed, backward arrows). See the forward propagation in a flow graph (Algorithm 6.3, to be performed first) and the required data structure. In addition, a quantity ∂u_N/∂u_i needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation π(i, j) is the index of u_j as an argument to f_i. The back-propagation algorithm efficiently computes ∂u_N/∂u_i for all i's (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node u_N with respect to the "inputs" u_1 . . . , u_M (which could be the parameters, in a learning setup). The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

  ∂u_N/∂u_N ← 1
  for j = N − 1 down to 1 do
    ∂u_N/∂u_j ← Σ_{i: j ∈ parents(i)} (∂u_N/∂u_i) (∂f_i(a_i)/∂a_{i,π(i,j)})
  end for
  return (∂u_N/∂u_i)_{i=1}^{M}
This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming⁹. Indeed, the derivative of the output node with respect to any node can also be written down in the following intractable form:

    ∂u_N/∂u_i = Σ_{paths u_{k_1} . . . , u_{k_n}: k_1 = i, k_n = N} ∏_{j=2}^{n} ∂u_{k_j}/∂u_{k_{j−1}}

where the paths u_{k_1} . . . , u_{k_n} go from the node k_1 = i to the final node k_n = N in the flow graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because it employs a dynamic programming strategy to re-use rather than re-compute partial sums associated with the gradients on intermediate nodes.
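The following sketch implements Algorithms 6.3 and 6.4 for a tiny flow graph; it is only a minimal illustration of the idea, and the chosen graph representation (each node storing its function, its per-argument partial derivative functions, and its parent indices) is one possible design among many:

```python
import math

# Each non-input node: (function, partial-derivative functions, parent indices).
# Example graph for u3 = exp(u1 + u2) with u2 = u1**2 (the graph of Figure 6.6):
graph = [
    None,                                                               # u1: input node
    (lambda a: a[0] ** 2, [lambda a: 2 * a[0]], [0]),                   # u2 = u1^2
    (lambda a: math.exp(a[0] + a[1]),                                   # u3 = e^(u2 + u1)
     [lambda a: math.exp(a[0] + a[1]), lambda a: math.exp(a[0] + a[1])],
     [1, 0]),
]

def forward(graph, inputs):                      # Algorithm 6.3
    u = list(inputs) + [None] * (len(graph) - len(inputs))
    for i in range(len(inputs), len(graph)):
        f, _, parents = graph[i]
        u[i] = f([u[j] for j in parents])
    return u

def backward(graph, u):                          # Algorithm 6.4
    grad = [0.0] * len(graph)
    grad[-1] = 1.0                               # d u_N / d u_N = 1
    for i in reversed(range(1, len(graph))):
        if graph[i] is None:
            continue
        f, partials, parents = graph[i]
        a = [u[j] for j in parents]
        for k, j in enumerate(parents):          # accumulate along each arc j -> i
            grad[j] += grad[i] * partials[k](a)
    return grad

u = forward(graph, [0.5])
print(backward(graph, u)[0])   # matches e^(u1 + u1^2) * (1 + 2*u1)
```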
⁹ Here we refer to "dynamic programming" in the sense of table-filling algorithms that avoid re-computing frequently used subexpressions. In the context of machine learning, "dynamic programming" can also refer to iterating Bellman's equations. That is not the kind of dynamic programming we refer to here.

Although the above was stated as if the u_i's were scalars, exactly the same procedure can be run with the u_i's being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients ∂u_N/∂u_i with a Jacobian of partial derivatives associated with the j → i arc of the graph, ∂f_i(a_i)/∂a_{i,π(i,j)}. In the case where minibatches are used during training, u_i would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the
former can be computed much more efficiently than a sequence of matrix-vector products
(e.g. with the BLAS library), especially so on modern computers and GPUs, that rely
more and more on parallelization through many cores.
More implementation issues regarding the back-propagation algorithm are discussed
in Chapters 11 and 12, regarding respectively GPU implementations (Section 11.2)
and debugging tricks (Section 12.5.1). Section 12.5.2 also discusses a natural general-
ization of the back-propagation algorithm in which one manipulates not numbers but
symbolic expressions, i.e., turning a program that performs a computation (decomposed
as a flow graph expressed in the forward propagation algorithm) into another program
(another flow graph) that performs gradient computation (i.e. automatically generat-
ing the back-propagation program, given the forward propagation graph). Using such
symbolic automatic differentiation procedures (implemented for example in the Theano library¹⁰), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.
6.4 Universal Approximation Properties and Depth
When there is no hidden layer, and for convex loss functions and regularizers (which is
typically the case), we obtain a convex training criterion. We end up with linear regres-
sion, logistic regression or other log-linear models that are used in many applications of
machine learning. This is appealing (no local minima, no saddle point) and convenient
(no strong dependency on initial conditions) but comes with a major disadvantage: such
learners are very limited in their ability to represent more complex functions. In the
case of classification tasks, we are limited to linear decision surfaces. In practice, this
limitation can be somewhat circumvented by handcrafting a large enough or discrimi-
nant enough set of hardwired features extracted from the raw input and on which the
linear predictor can be easily trained. See next section for a longer discussion about
feature learning.
To make the family of functions rich enough, the other option is to introduce one
or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi, 1994)
that with a single hidden layer of a sufficient size and a reasonable choice of non-
linearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent
any smooth function to any desired accuracy (the greater the required accuracy, the
more hidden units are required). However, these theorems generally do not tell us
how many hidden units will be required to achieve a given accuracy for particular data
distributions: in fact the worst-case scenario is generally going to require an exponential number of hidden units (to basically record every input configuration that needs to be distinguished). It is easier to understand how bad things can be by considering the case of binary input vectors: the number of possible binary functions on vectors v ∈ {0, 1}^d is 2^{2^d}, and selecting one such function requires 2^d bits, which will in general require O(2^d) degrees of freedom.
However, machine learning is not about learning any possible function with equal
¹⁰ http://deeplearning.net/software/theano/
ease: we mostly care about the kinds of functions that are needed to represent the
world around us or the task of interest. The no-free-lunch theorems for machine
learning (Wolpert, 1996) essentially say that no learning algorithm is “universal”, i.e.,
dominates all the others against all possible ground truths (data generating distribution
or target function). However, for a given family of tasks, some learning algorithms are
definitely better. Choosing a learning algorithm is equivalent to choosing a prior, i.e.,
making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only happen if there is
nothing to be learned any more because we already know the target function. Having a
universal approximation property just means that this prior is broad enough to cover or
come close to any possible target function – and this is a useful property – but is, in itself,
clearly not sufficient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
One of the central priors that is exploited in deep learning is that the target function
to be learned can be efficiently represented as a deep composition of simpler functions
(“features”), where features at one level can be re-used to define many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a flow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can define the operations performed
in each node in different ways. For example, do we consider a node that computes
the affine operations of a neural net followed by a node that computes the non-linear
neuron activation, or do we consider both of these operations as one node or one level
of the graph? Hence the notion of depth really depends on the allowed operations at
each node and one flow graph of a given depth can be converted into an equivalent
flow graph of a different depth by redefining which operations can be performed at each
node. However, for neural networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a; Couprie
et al., 2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior
and that in order to take advantage of it, the learner's family of functions needs to allow
multiple levels of representation.
6.5 Feature / Representation Learning
Let us consider again the single layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves
a convex optimization problem (or, as with linear regression and some Gaussian process
regression models, even one with an analytic solution), i.e., an optimization problem with some convergence
guarantees towards a global optimum, irrespective of initial conditions. Simple and well-
understood optimization algorithms are available in this case. However, this limits the
representational capacity too much: many tasks, for a given choice of input representa-
tion x (the raw input features), cannot be solved by using only a linear predictor. What
are our options to avoid that limitation?
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Schölkopf
et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of
much higher dimension. In this case, f_θ(x) = b + w · φ(x) can be linear in the
parameters (and in φ(x)) and optimization remains convex (or even analytic). By
exploiting the kernel trick, we can computationally handle a high-dimensional
φ(x) (or even an infinite-dimensional one) so long as the kernel K(u, v) = φ(u)·φ(v)
(where · is the appropriate dot product for the space of φ(·)) can be computed
efficiently. If φ(x) is of high enough dimension, we can always have enough capacity
to fit the training set, but generalization is not at all guaranteed: it will depend
on the appropriateness of the choice of φ as a feature space for our task. Kernel
machines theory clearly identifies the choice of φ with the choice of a prior. This leads
to kernel engineering, which is equivalent to feature engineering, discussed next.
The other, very commonly used type of kernel embodies a very broad prior such as
smoothness, e.g., the Gaussian (or RBF) kernel K(u, v) = e^{−||u−v||^2} (a minimal
sketch of this kernel-based approach appears after this list).
Unfortunately, this prior may be insufficient, i.e., too broad and sensitive to the
curse of dimensionality, as introduced in Section 5.12 and developed in more detail
in Chapter 17.
2. Another option is to manually engineer the representation or features φ(x). Most
industrial applications of machine learning rely on hand-crafted features and most
of the research and development effort (as well as a very large fraction of the
scientific literature in machine learning and its applications) goes into designing
new features that are most appropriate to the task at hand. Clearly, faced with
a problem to solve and some prior knowledge in the form of representations that
are believed to be relevant, the prior knowledge can be very useful. This approach
is therefore common in practice, but is not completely satisfying because it involves
very task-specific engineering work and a laborious, never-ending effort to
improve systems by designing better features. If there were some more general
feature learning approaches that could be applied to a large set of related tasks
(such as those involved in AI), we would certainly like to take advantage of them.
Since humans seem to be able to learn a lot of new tasks (for which they were
not programmed by evolution), it seems that such broad priors do exist. This
whole question is discussed in a lot more detail in Bengio and LeCun (2007b), and
motivates the third option.
3. The third option is to learn the features, or learn the representation. In a sense,
it allows one to interpolate between the almost agnostic approach of a kernel ma-
chine with a general-purpose smoothness kernel (such as RBF SVMs and other
non-parametric statistical models) and full designer-provided knowledge in the
form of a fixed representation that is perfectly tailored to the task. This is equiv-
alent to the idea of learning the kernel, except that whereas most kernel learning
methods only allow very few degrees of freedom in the learned kernel, representa-
tion learning methods such as those discussed in this book (including multi-layer
neural networks) allow the feature function φ(·) to be very rich (with a num-
ber of parameters that can be in the millions or more, depending on the amount
of data available). This is equivalent to learning the hidden layers, in the case
of a multi-layer neural network. Besides smoothness (which comes for example
from regularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed above
(Section 6.4). Other priors are discussed in Chapter 17.
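As a purely illustrative instance of the first option above, the following sketch (Python with NumPy; the function names, bandwidth and ridge penalty are arbitrary choices made for this example) fits kernel ridge regression with a Gaussian (RBF) kernel on a toy one-dimensional problem. The implicit feature map φ is never computed explicitly: both training and prediction only require kernel evaluations K(u, v), which is the essence of the kernel trick.

import numpy as np

def rbf_kernel(U, V, gamma=1.0):
    # K[i, j] = exp(-gamma * ||U[i] - V[j]||^2), the Gaussian (RBF) kernel.
    sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit(X, y, lam=1e-2, gamma=1.0):
    # Dual solution alpha = (K + lam I)^{-1} y; the predictor is linear in
    # phi(x) but is expressed entirely through kernel evaluations.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy 1-D regression problem that no linear predictor on the raw input can
# fit, but that the fixed, very high-dimensional RBF feature space handles.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(50)
alpha = fit(X, y)
print(predict(X, alpha, np.array([[0.0], [1.5]])))

Whether such a predictor generalizes well depends entirely on whether the smoothness prior embodied by the fixed kernel is appropriate for the task, which is precisely the limitation that motivates learning the representation instead.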
This whole discussion is clearly not specific to neural networks and supervised learn-
ing, and is one of the central motivations for this book.
6.6 Piecewise Linear Hidden Units
Most of the recent improvement in the performance of deep neural networks can be
attributed to increases in computational power and the size of datasets. The machine
learning algorithms involved in recent state-of-the-art systems have mostly existed since
the 1980s, with very few recent conceptual advances contributing significantly to in-
creased performance.
One of the main algorithmic improvements that has had a significant impact is the
use of piecewise linear units, such as absolute value rectifiers and rectified linear units. Such
units consist of two linear pieces and their behavior is driven by a single weight vector.
Jarrett et al. (2009a) observed that “using a rectifying non-linearity is the single most
important factor in improving the performance of a recognition system” among several
different factors of neural network architecture design.
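Concretely, a minimal sketch of such activations (Python with NumPy; the names are chosen here for illustration) is given below: each unit applies a two-piece linear function to a single affine pre-activation z = w · x + b, and the corresponding derivative is piecewise constant, so gradients pass through unchanged wherever the unit is active.

import numpy as np

def relu(z):
    # Rectified linear unit: identity for z > 0, zero otherwise.
    return np.maximum(0.0, z)

def abs_rectifier(z):
    # Absolute value rectifier: two linear pieces with slopes -1 and +1.
    return np.abs(z)

def relu_grad(z):
    # Piecewise constant derivative (0 or 1) used during back-propagation.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), abs_rectifier(z), relu_grad(z))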
For small datasets, Jarrett et al. (2009a) observed that using rectifying non-linearities
is even more important than learning the weights of the hidden layers. Random weights are
sufficient to propagate useful information through a rectified linear network, allowing the
classifier layer at the top to learn how to map different feature vectors to class identities.
When more data is available, learning becomes relevant because it can extract more
knowledge from that data, and learning then typically beats fixed or random settings of parameters.
Glorot et al. (2011b) showed that learning is far easier in deep rectified linear networks
than in deep networks that have curvature or two-sided saturation in their activation
functions. Because the behavior of the unit is linear over half of its domain, it is easy for
an optimization algorithm to tell how to improve the behavior of a unit, even when the
unit’s activations are far from optimal. Just as piecewise linear networks are good at
propagating information forward, back-propagation in such a network is also piecewise
linear and propagates information about the error derivatives to all of the gradients in
the network. Each piecewise linear function can be decomposed into different regions
corresponding to different linear pieces. When we change a parameter of the network,
the resulting change in the network’s activity is linear until the point that it causes some
unit to go from one linear piece to another. Traditional units such as sigmoids are more
prone to discarding information due to saturation both in forward propagation and in
back-propagation, and the response of such a network to a change in a single parameter
may be highly nonlinear even in a small neighborhood.
One drawback to rectified linear units is that they cannot learn via gradient-based
methods on examples for which their activation is zero. This problem can be mitigated
by initializing the biases to a small positive number, but it is still possible for a rectified
linear unit to learn to de-activate and then never be activated again. Goodfellow et al.
(2013a) introduced maxout units and showed that maxout units can successfully learn
in conditions where rectified linear units become stuck. Maxout units are also piecewise
linear, but unlike rectified linear units, each piece of the linear function has its own
weight vector, so whichever piece is active can always learn.
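A minimal, illustrative sketch of a maxout layer follows (Python with NumPy; the shapes and names are chosen only for this example): each output unit takes the maximum over k affine functions of its input, each with its own weight vector and bias, so whichever piece is currently active can receive gradient and keep learning.

import numpy as np

def maxout(x, W, b):
    # x: (input_dim,), W: (num_units, k, input_dim), b: (num_units, k).
    z = np.einsum('ukd,d->uk', W, x) + b   # pre-activation of every linear piece
    return z.max(axis=1)                    # each unit keeps its largest piece

rng = np.random.RandomState(0)
x = rng.randn(5)
W = rng.randn(3, 4, 5)   # 3 maxout units, 4 linear pieces each
b = rng.randn(3, 4)
print(maxout(x, W, b))   # shape (3,)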
This same general principle of using linear behavior to obtain easier optimization also
applies in other contexts besides deep feedforward networks. Recurrent networks, which need
to propagate information through several time steps, are much easier to train when they
are more linear. One of the best-performing recurrent network architectures, the LSTM,
propagates information through time via summation, a particularly straightforward kind
of linear activation. This is discussed further in Section 10.6.4.
In addition to helping to propagate information and making optimization easier,
piecewise linear units also have some nice properties that can make them easier to
regularize. This is discussed further in Section 7.11.
Sigmoidal non-linearities still perform well in some contexts, but piecewise linear
units are now by far the most popular kind of hidden units.
6.7 Historical Notes