Chapter 6
Feedforward Deep Networks
6.1 From Fixed Features to Learned Features
In Chapter 5 we considered linear regression, linear classifiers and logistic regres-
sion, and introduced kernel machines, all of which involve a fixed set of features
on which a linear predictor is trained. These models perform non-linear trans-
formations of data, but the non-linear part is pre-defined. How can we learn
non-linear transformations of the data that create a new feature space? How can
we automate feature learning? This is what neural networks with hidden layers
allow us to do.
Feedforward supervised neural networks were among the first and most suc-
cessful learning algorithms (Rumelhart et al., 1986e,c). They are also called
deep networks, multi-layer perceptrons (MLPs), or simply neural networks, and
the vanilla architecture with a single hidden layer is illustrated in Figure 6.1. A
deeper version is obtained by simply having more hidden layers. Each hidden
layer corresponds to a new learned representation of the input vector, trained
towards some objective, such as making it easier to produce desired answers in
output.
MLPs can learn powerful non-linear transformations: in fact, with enough
hidden units they can represent arbitrarily complex but smooth functions (see
Section 6.5). This is achieved by composing simpler but still non-linear learned
transformations. By transforming the data non-linearly into a new space, a classi-
fication problem that was not linearly separable (not solvable by a linear classifier)
can become separable, as illustrated in Figure 6.2.
6.1.1 Estimating Conditional Statistics
To gently move from linear predictors to non-linear ones, let us consider the
squared error loss function studied in the previous chapter, where the learning
task is the estimation of the expected value of y given x. In the context of linear
regression, the conditional expectation of y is used as the mean of a Gaussian
distribution that we fit with maximum likelihood. We can generalize linear
regression to regression via any function f by defining the mean squared error of f:

E[||y − f(x)||²]

where the expectation is over the training set during training, and over the data
generating distribution to obtain generalization error.

Figure 6.1: Vanilla (shallow) MLP, with one sigmoid hidden layer, computing a vector
of hidden unit values h = sigmoid(c + W x) with weight matrix W and offset vector c.
The output vector is obtained via another learned affine transformation ŷ = b + V h,
with weight matrix V and output offset vector b. The vector of hidden unit values h
provides a new set of features, i.e., a new representation, derived from the raw input x.
We can generalize its interpretation beyond the case where f is linear or
affine, uncovering an interesting property: minimizing it yields an estimator of
the conditional expectation of the output variable y given the input variable x,
i.e.,
arg min_{f∈H} E_{p(x,y)} [||y − f(x)||²] = E_{p(x,y)} [y | x],   (6.1)

provided that our set of functions H contains E_{p(x,y)} [y | x]. (If you would like to
work out the proof yourself, it is easy to do using calculus of variations, which we
describe in Section 19.4.2.)
Similarly, we can generalize conditional maximum likelihood (introduced in
Section 5.8.1) to other distributions than the Gaussian, as discussed below when
defining the objective function for MLPs.
Figure 6.2: Each layer of a trained neural network non-linearly transforms its input,
distorting the space so that the task becomes easier to perform, e.g., linear classification
in the new feature space, in the above figures. Top: a vanilla neural network with 2 hidden
units in its single hidden layer can transform the 2-D input space (shown in the blue and
pink square figure) so that the examples from the two classes (shown as the points on the
red and blue curves) become linearly separable (the red and blue curves correspond to the
“manifold” near which examples concentrate, while the red and pink areas correspond to
the regions where the neural network classifies an input as either blue or red). Bottom:
with a larger hidden layer (100 here), the MNIST digit images (with 28 × 28 = 784
pixels) can be transformed so that the classes (each shown with a different color) can be
much more easily classified by the output layer (over 10 digit categories). Both figures
are reproduced with permission by Chris Olah from http://colah.github.io/, where
many more insightful visualizations can be found.
6.2 Formalizing and Generalizing Neural Networks
In addition to covering the basics of such networks, this chapter introduces a
general formalism for gradient-based optimization of parametrized families of
functions, often in the context of conditional maximum likelihood criteria (Sec-
tion 6.3).
MLPs bring together a number of important machine learning concepts al-
ready introduced in the previous chapters:
Define a parametrized family of functions f_θ describing how the learner
will behave on new examples, i.e., what output the learned function f_θ(x)
will produce given some input x. Training consists in choosing the parameter
θ (usually represented by a vector) given some training examples (x, y)
sampled from an unknown data generating distribution P(X, Y).
Define a loss function L describing what scalar loss L(ŷ, y) is associated
with each supervised example (x, y), as a function of the learner's output
ŷ = f_θ(x) and the target output y.
Define a training criterion and a regularizer. The objective of training
is ideally to minimize the expected loss E_{X,Y}[L(f_θ(X), Y)] over X, Y
sampled from the unknown data generating distribution P(X, Y). However,
this is not possible because the expectation makes use of the true underlying
P(X, Y), while we only have access to a finite number of training examples,
i.e., of pairs (x, y). Instead, one defines a training criterion which usually
includes an empirical average of the loss over the training set, plus some
additional terms (called regularizers) which enforce some preferences over
the choices of θ.
Define an optimization procedure to approximately minimize the training
criterion¹. The simplest such optimization procedure is a variant of gradient
descent (gradient descent was introduced in Section 4.3) called stochastic
gradient descent, described in Section 8.3.2.
Example 6.2.1 illustrates these concepts for the case of a vanilla neural network
for regression.
In Chapter 16, we consider generalizations of the above framework to the
unsupervised and semi-supervised cases, where Y may not exist or may not always
be present. An unsupervised loss can then be defined solely as a function of the
input x and some function f_θ(x) that one wishes to learn.
¹ It is generally not possible to analytically obtain a global minimum of the training criterion,
so iterative numerical optimization methods are used instead.
Example 6.2.1. Vanilla (Shallow) Multi-Layer Neural Network
for Regression
Based on the above definitions, we could pick the family of input-output
functions to be

f_θ(x) = b + V sigmoid(c + W x),

illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + e^{−a}) is applied
element-wise, the input is the vector x ∈ R^{n_i}, the hidden layer outputs
are the elements of the vector h = sigmoid(c + W x) with n_h entries, and the
parameters are θ = (b, c, V, W) (with θ also viewed as the flattened vectorized
version of the tuple), with b ∈ R^{n_o} a vector of the same dimension as
the output (n_o), c ∈ R^{n_h} of the same dimension as h (the number of hidden
units), and V ∈ R^{n_o × n_h} and W ∈ R^{n_h × n_i} being weight matrices.
The loss function for this classical example could be the squared error
L(ŷ, y) = ||ŷ − y||² (see Section 6.1.1 discussing how it makes ŷ an estimator
of E[Y | x]). The regularizer could be the ordinary L2 weight decay
||ω||² = (Σ_{ij} W_{ij}² + Σ_{ki} V_{ki}²), where we define the set of weights ω as the
concatenation of the elements of matrices W and V. The L2 weight decay thus
penalizes the squared norm of the weights, with a scalar coefficient λ that is
chosen larger to penalize the weights more strongly, thus yielding smaller weights.
Combining the loss function and the regularizer gives the training criterion,
which is the objective function during training:

J(θ) = λ||ω||² + (1/n) Σ_{t=1}^{n} ||y^{(t)} − (b + V sigmoid(c + W x^{(t)}))||²,

where (x^{(t)}, y^{(t)}) is the t-th training example, an (input, target) pair. Finally,
the classical training procedure in this example is stochastic gradient descent,
which iteratively updates θ according to

ω ← ω − ε (2λω + ∇_ω L(f_θ(x^{(t)}), y^{(t)}))
β ← β − ε ∇_β L(f_θ(x^{(t)}), y^{(t)}),

where β = (b, c) contains the offset parameters, ω = (W, V) the weight
matrices, ε is a learning rate, and t is incremented after each training
example, modulo n. Section 6.4 shows how gradients can be computed
efficiently thanks to backpropagation.
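To make this concrete, here is a minimal numpy sketch of the model and update rule of this example; the layer sizes, learning rate ε (epsilon) and weight decay λ (lam) below are arbitrary illustrative values, and the gradients are written out by hand rather than obtained by the general back-propagation machinery of Section 6.4.

import numpy as np

rng = np.random.RandomState(0)
n_i, n_h, n_o = 3, 4, 2            # input, hidden and output sizes (arbitrary)
W = 0.1 * rng.randn(n_h, n_i)      # hidden weights
c = np.zeros(n_h)                  # hidden offsets
V = 0.1 * rng.randn(n_o, n_h)      # output weights
b = np.zeros(n_o)                  # output offsets
epsilon, lam = 0.01, 1e-4          # learning rate and weight decay (illustrative)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    h = sigmoid(c + W.dot(x))      # hidden representation h = sigmoid(c + W x)
    return b + V.dot(h), h         # y_hat = b + V h

# one stochastic gradient step on a single (x, y) pair with squared error loss
x, y = rng.randn(n_i), rng.randn(n_o)
y_hat, h = forward(x)
grad_y_hat = 2.0 * (y_hat - y)                 # d L / d y_hat for L = ||y_hat - y||^2
grad_b = grad_y_hat
grad_V = np.outer(grad_y_hat, h)
grad_h = V.T.dot(grad_y_hat)
grad_a = grad_h * h * (1.0 - h)                # back through the sigmoid
grad_c = grad_a
grad_W = np.outer(grad_a, x)
# weight update includes the 2*lambda*omega term of the L2 regularizer; offsets are not decayed
W -= epsilon * (grad_W + 2 * lam * W)
V -= epsilon * (grad_V + 2 * lam * V)
b -= epsilon * grad_b
c -= epsilon * grad_c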
6.3 Parametrizing a Learned Predictor
There are many ways to define the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described
below, while more advanced ones are left to later chapters, in particular Chap-
ters 10 and 15.
6.3.1 Family of Functions
A motivation for the family of functions defined by multi-layer neural networks
is to compose simple transformations in order to obtain highly non-linear ones. In
particular, MLPs compose affine transformations and element-wise non-linearities.
As discussed in Section 6.5 below, with the appropriate choice of parameters,
multi-layer neural networks can in principle approximate any smooth function,
with more hidden units allowing one to achieve better approximations.
A multi-layer neural network with more than one hidden layer can be defined
by generalizing the above structure, e.g., as follows, where we chose to use
hyperbolic tangent³ activation functions instead of sigmoid activation functions:

h^k = tanh(b^k + W^k h^{k−1}),

where h^0 = x is the input of the neural net, h^k (for k > 0) is the output of the
k-th hidden layer, which has weight matrix W^k and offset (or bias) vector b^k. If
we want the output f_θ(x) to lie in some desired range, then we typically define
an output non-linearity (which we did not have in the above Example 6.2.1). The
non-linearity for the output layer is generally different from the tanh, depending
on the type of output to be predicted and the associated loss function (see below).
There are several other non-linearities besides the sigmoid and the hyperbolic
tangent which have been successfully used with neural networks. In particular,
we introduce below some piece-wise linear units such as the rectified linear
unit (max(0, b + w · x)) and the maxout unit (max_i (b_i + W_{:,i} · x)), which have been
particularly successful in the case of deep feedforward or convolutional networks.
A longer discussion of these can be found in Section 6.7.
These and other non-linear neural network activation functions commonly
found in the literature are summarized below. Most of them are typically combined
with an affine transformation a = b + W x and applied element-wise:

h = φ(a),   h_i = φ(a_i) = φ(b_i + W_{i,:} x).   (6.2)
³ which is linearly related to the sigmoid via tanh(x) = 2 × sigmoid(2x) − 1 and typically
yields easier optimization with stochastic gradient descent (Glorot and Bengio, 2010).
Rectifier or rectified linear unit (ReLU) or positive part: transformation
of the output of the previous layer: φ(a) = max(0, a), also written φ(a) = (a)_+.
Hyperbolic tangent: φ(a) = tanh(a).
Sigmoid: φ(a) = 1/(1 + e^{−a}).
Softmax: This is a vector-to-vector transformation φ(a) = softmax(a) with
φ_i(a) = e^{a_i} / Σ_j e^{a_j}, such that Σ_i φ_i(a) = 1 and φ_i(a) > 0, i.e., the softmax output
can be considered as a probability distribution over a finite set of outcomes.
Note that it is not applied element-wise but on a whole vector of “scores”.
It is mostly used as output non-linearity for predicting discrete probabilities
over output categories. See definition and discussion below, around Eq. 6.4.
Radial basis function or RBF unit: this one is not applied after a general
affine transformation but acts on x using a different form that corresponds
to a template matching, i.e., h_i = exp(−(1/σ_i²) ||w_i − x||²) (or typically with
all the σ_i set to the same value). This is heavily used in kernel SVMs (Boser
et al., 1992; Schölkopf et al., 1999) and has the advantage that such units
can be easily initialized (Powell, 1987; Niranjan and Fallside, 1990) as a
random (or selected) subset of the input examples, i.e., w_i = x^{(t)} for some
assignment of examples t to hidden unit templates i.
Softplus: φ(a) = log(1 + e^a). This is a smooth version of the rectifier,
introduced in Dugas et al. (2001) for function approximation and in Nair
and Hinton (2010a) for RBMs. Glorot et al. (2011a) compared the softplus
and rectifier and found better results with the latter, in spite of the very
similar shape and the differentiability and non-zero derivative of the softplus
everywhere, contrary to the rectifier.
Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike
the latter, it is bounded, φ(a) = max(−1, min(1, a)). It was introduced
by Collobert (2004).
Absolute value rectification: φ(a) = |a| (may be applied on the affine
dot product or on the output of a tanh unit). It is also a rectifier and has
been used for object recognition from images (Jarrett et al., 2009a), where
it makes sense to seek features that are invariant under a polarity reversal
of the input illumination.
Maxout: this is discussed in more detail in Section 6.7. It generalizes the
rectifier but introduces multiple weight vectors w_j (called filters) for each
hidden unit: h_i = max_j (b_j + w_j · x).
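To make the definitions above concrete, here is a small numpy sketch of several of these activation functions; the example input, maxout filters and sizes are arbitrary illustrative choices.

import numpy as np

def relu(a):      return np.maximum(0.0, a)        # rectifier: max(0, a)
def softplus(a):  return np.log1p(np.exp(a))       # smooth rectifier (not safe for very large a)
def hard_tanh(a): return np.clip(a, -1.0, 1.0)     # max(-1, min(1, a))
def sigmoid(a):   return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())                         # subtract max(a) for numerical stability
    return e / e.sum()

def maxout(x, W, b):
    # one maxout hidden unit with several filters: h = max_j (b_j + w_j . x)
    return np.max(b + W.dot(x))

x = np.array([0.5, -1.0, 2.0])
W = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]])   # two filters (arbitrary)
b = np.array([0.0, 0.1])
print(relu(x), softmax(x), maxout(x, W, b))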
This is not an exhaustive list but covers most of the non-linearities and unit
computations seen in the deep learning and neural nets literature. Many variants
are possible.
As discussed in Section 6.4, the structure (also called architecture) of the
family of input-output functions can be varied in many ways, which calls for a
generic principle for efficiently computing gradients, described in that section.
For example, a common variation is to connect layers that are not adjacent, with
so-called skip connections, which are found in the visual cortex (where the word
“layer” should be replaced by the word “area”). Other common variations depart
from a full connectivity between adjacent layers. For example, each unit at layer k
may be connected to only a subset of units at layer k − 1. A particular case of such
a form of sparse connectivity is discussed in Chapter 9 with convolutional networks.
In general, the set of connections between units of the whole network only needs
to form a directed acyclic graph in order to define a meaningful computation
(see the flow graph formalism below, Section 6.4). When units of the network
are connected to themselves through a cycle, one has to properly define what
computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 10. Another example of non-full connectivity is the
deep recurrent network, Section 10.4.
6.3.2 Loss Function and Conditional Log-Likelihood
In the 80's and 90's the most commonly used loss function was the squared error
L(f_θ(x), y) = ||f_θ(x) − y||². As discussed in Section 6.1.1, if f is unrestricted
(non-parametric), minimizing the expected value of the loss function over some
data-generating distribution P(x, y) yields f(x) = E[y | x = x], the true conditional
expectation of y given x. This tells us what the neural network is trying to
learn. Replacing the squared error by an absolute value makes the neural network
try to estimate not the conditional expectation but the conditional median⁴.
However, when y is a discrete label, i.e., for classification problems, other loss
functions such as the Bernoulli negative log-likelihood⁵ have been found to be
more appropriate than the squared error. In the case where y ∈ {0, 1} is binary
this gives

L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)),   (6.3)
also known as the cross entropy objective function. It can be shown that the optimal
(non-parametric) f minimizing this criterion is f(x) = P(y = 1 | x). In other
words, when minimizing the conditional log-likelihood objective function, we are
⁴ Showing this is another interesting exercise.
⁵ This is often called cross entropy in the literature, even though the term cross entropy,
defined at Eq. 5.23, should also apply to many other losses that can be viewed as negative
log-likelihood, discussed below in more detail.
training the neural net output to estimate conditional probabilities as well as
possible in the sense of the KL divergence (see Section 3.9, Eq. 3.3). Note that
in order for the above expression of the criterion to make sense, f_θ(x) must be
strictly between 0 and 1 (an undefined or infinite value would otherwise arise).
To achieve this, it is common to use the sigmoid as non-linearity for the output
layer, which matches well with the Binomial negative log-likelihood criterion⁶. As
explained below (Softmax subsection), the cross entropy criterion allows gradients
to pass through the output non-linearity even when the neural network produces
a confidently wrong answer, unlike the squared error criterion coupled with a
sigmoid or softmax non-linearity.
Learning a Conditional Probability Model
More generally, one can define a loss function as corresponding to a conditional
log-likelihood, i.e., the negative log-likelihood (NLL) criterion

L_NLL(f_θ(x), y) = −log P(y = y | x = x; θ).
See Section 5.8.1 (and the one before) which shows that this criterion corresponds
to minimizing the KL divergence between the model P of the conditional prob-
ability of y given x and the data generating distribution Q, approximated here
by the finite training set, i.e., the empirical distribution of pairs (x, y). Hence,
minimizing this objective, as the amount of data increases, yields an estimator of
the true conditional probability of y given x.
For example, if y is a continuous random variable and we assume that, given
x, it has a Gaussian distribution with mean f_θ(x) and variance σ², then

−log P(y | x; θ) = (1/2)(f_θ(x) − y)²/σ² + (1/2) log(2πσ²).
Up to an additive and multiplicative constant (which would give the same choice
of θ), minimizing this negative log-likelihood is therefore equivalent to minimizing
the squared error loss. Once we understand this principle, we can readily general-
ize it to other distributions, as appropriate. For example, it is straightforward to
generalize the univariate Gaussian to the multivariate case, and under appropriate
parametrization consider the variance to be a parameter or even a parametrized
function of x (for example with output units that are guaranteed to be positive,
or forming a positive definite matrix, as outlined below, Section 6.3.2).
⁶ In reference to statistical models, this “match” between the loss function and the output
non-linearity is similar to the choice of a link function in generalized linear models (McCullagh
and Nelder, 1989).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion
corresponds to the conditional log-likelihood associated with the Bernoulli
distribution (also known as cross entropy) with probability p = f_θ(x) of generating
y = 1 given x = x (and probability 1 − p of generating y = 0):

L_NLL = −log P(y | x; θ) = −1_{y=1} log p − 1_{y=0} log(1 − p)
      = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)),

where 1_{y=1} is the usual binary indicator.
Softmax
When y is discrete and has a finite domain (say {1, . . . , N}) but is not binary,
the Bernoulli distribution is extended to the multinoulli distribution (defined in
Section 3.10.2). This distribution is specified by a vector of N − 1 probabilities
whose sum is less than or equal to 1, each element of which provides the probability
p_i = P(y = i | x). Equivalently, one can (more conveniently) specify a vector of
N probabilities whose sum is exactly 1. The softmax non-linearity was designed
for this purpose (Bridle, 1990):

p = softmax(a),   p_i = e^{a_i} / Σ_j e^{a_j},   (6.4)

where typically a = b + W h is the vector of scores whose elements a_i are associated
with each category i, with larger relative scores yielding exponentially larger
probabilities. The corresponding loss function is therefore L_NLL(p, y) = −log p_y.
Note how minimizing this loss will push a_y up (increase the score a_y associated
with the correct label y) while pushing down a_i for i ≠ y (decreasing the scores
of the other labels, in the context x). The first effect comes from the numerator
of the softmax while the second effect comes from the normalizing denominator.
These forces cancel on a specific example only if p_y = 1, and they cancel on average
over examples (say sharing the same x) if p_i equals the fraction of times that y = i
for this value of x. To see this, consider the gradient with respect to the scores a:
∂L_NLL(p, y)/∂a_k = ∂(−log p_y)/∂a_k = ∂(−a_y + log Σ_j e^{a_j})/∂a_k
                  = −1_{y=k} + e^{a_k} / Σ_j e^{a_j} = p_k − 1_{y=k}

or

∇_a L_NLL(p, y) = (p − e_y)   (6.5)
where e_y = [0, . . . , 0, 1, 0, . . . , 0] is the one-hot vector with a 1 at position y.
Examples that share the same x share the same a, so the average gradient on a
over these examples is 0 when the average of the above expression cancels out,
i.e., p = E_y[e_y | x], where the expectation is over these examples. Thus the
optimal p_i for these examples is the fraction of times that y = i among
those examples. Over an infinite number of examples, we would obtain that the
gradient is 0 when p_i perfectly estimates the true P(y = i | x). What the above
gradient decomposition also teaches us is the division of the total gradient into
(1) a term due to the numerator (the e_y) and dependent on the actually observed
target y, and (2) a term independent of y but which corresponds to the gradient of
the softmax denominator. The same principles and the role of the normalization
constant (or “partition function”) can be seen at play in the training of Markov
Random Fields, Boltzmann machines and RBMs, in Chapter 13.
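As a concrete check of Eq. 6.5, the following numpy sketch computes the softmax of Eq. 6.4, the negative log-likelihood, and compares the analytic gradient p − e_y with a finite-difference estimate; the scores and the label below are arbitrary.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # shift by max(a) for numerical stability
    return e / e.sum()

a = np.array([2.0, -1.0, 0.5])         # scores for 3 categories (arbitrary)
y = 2                                   # observed label

p = softmax(a)
nll = -np.log(p[y])                     # L_NLL(p, y) = -log p_y

# analytic gradient from Eq. 6.5: p - e_y
e_y = np.zeros_like(a); e_y[y] = 1.0
grad = p - e_y

# finite-difference check of d L_NLL / d a_k
eps = 1e-6
num = np.array([(-np.log(softmax(a + eps * d)[y]) - nll) / eps for d in np.eye(3)])
print(np.allclose(grad, num, atol=1e-4))   # True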
Note other interesting properties of the softmax. First of all, the gradient
with respect to a does not saturate, i.e., the gradient does not vanish when the
output probabilities approach 0 or 1 (a very confident model), except when the
model is providing the correct answer. Specifically, let us consider the case where
the correct label is i, i.e., y = i. The element of the gradient associated with an
erroneous label, say j ≠ i, is

∂L_NLL(p, y)/∂a_j = p_j.   (6.6)
So if the model correctly predicts a low probability that y = j, i.e., that
p_j ≈ 0, then the gradient is also close to zero. But if the model incorrectly
and confidently predicts that j is the correct class, i.e., p_j ≈ 1, there will be a
strong push to reduce a_j. Conversely, if the model incorrectly and confidently
predicts that the correct class y should have a low probability, i.e., p_y ≈ 0, there
will be a strong push (a gradient of about −1) to push a_y up. One way to see
these effects is to imagine doing gradient descent on the a_j's themselves (that is what
backprop is really based on): the update on a_j would be proportional to minus one
times the gradient on a_j, so a positive gradient on a_j (e.g., incorrectly confident
that p_j ≈ 1) pushes a_j down, while a negative gradient on a_y (e.g., incorrectly
confident that p_y ≈ 0) pushes a_y up. In fact, note how a_y is always pushed up,
because its gradient is p_y − 1_{y=y} = p_y − 1 < 0, and the other scores a_j (for j ≠ y) are always
pushed down, because their gradient is p_j > 0.
There are other loss functions such as the squared error applied to softmax (or
sigmoid) outputs (which was popular in the 80’s and 90’s) which have vanishing
gradient when an output unit saturates (when the derivative of the non-linearity
is near 0), even if the output is completely wrong (Solla et al., 1988). This may be
a problem because it means that the parameters will basically not change, even
though the output is wrong.
To see how the squared error interacts with the softmax output, we need to
introduce a one-hot encoding of the label, y = e_i = [0, . . . , 0, 1, 0, . . . , 0], i.e., for
the label y = i, we have y_i = 1 and y_j = 0 for j ≠ i. We will again consider the
output of the network to be p = softmax(a), where, as before, a is
the input to the softmax function (e.g., a = b + W h with h the output of the
last hidden layer).
For the squared error loss L_2(p(a), y) = ||p(a) − y||², the gradient of the loss
with respect to the input vector to the softmax, a, is given by:

∂L_2(p(a), y)/∂a_i = Σ_j ∂L(p(a), y)/∂p_j(a) × ∂p_j(a)/∂a_i = Σ_j 2(p_j(a) − y_j) p_j (1_{i=j} − p_i).   (6.7)

So if the model incorrectly predicts a low probability for the correct class y = i,
i.e., if p_y = p_i ≈ 0, then the score for the correct class, a_y, does not get pushed
up in spite of a large error, i.e., ∂L_2(p(a), y)/∂a_y ≈ 0. For this reason, practitioners
prefer to use the negative log-likelihood (cross entropy) criterion with the softmax
non-linearity (as well as with the sigmoid non-linearity), rather than applying the
squared error criterion to these probabilities.
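This contrast can be seen numerically. In the sketch below (with arbitrary scores for which the model is confidently wrong), the cross entropy gradient on the correct score is close to −1, while the squared error gradient on that same score is nearly zero.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([10.0, 0.0, 0.0])          # the model is very confident in class 0
y = 1                                    # but the correct class is 1, so p_y is nearly 0
p = softmax(a)
e_y = np.zeros_like(a); e_y[y] = 1.0

grad_nll = p - e_y                       # Eq. 6.5: strong push (about -1) on a_y
# Eq. 6.7: gradient of ||p - e_y||^2 w.r.t. a, using dp_j/da_i = p_j(1_{i=j} - p_i)
jac = np.diag(p) - np.outer(p, p)        # Jacobian of the softmax (symmetric)
grad_l2 = jac.T.dot(2 * (p - e_y))

print(grad_nll[y], grad_l2[y])           # about -1.0 versus about -0.0002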
Another property that could potentially be interesting is that the softmax output
is invariant under additive changes of its input vector: softmax(a) = softmax(a + b)
when b is a scalar added to all the elements of vector a. Finally, it is interesting
to think of the softmax as a way to create a form of competition between the units
(typically output units, but not necessarily) that participate in it: the softmax
function reinforces the strongest filter output a_i (because the exponential increases
faster for larger values) while the other units get inhibited. This is analogous
to the lateral inhibition that is believed to exist between nearby neurons in cortex,
and at the extreme (when the a_i's are large in magnitude) it becomes a form of
winner-take-all (one of the outputs is nearly 1 and the others are nearly 0). A
more computationally expensive form of competition is found with sparse coding,
described in Section 19.3.
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution p(y | ω) with parameters
ω, we can construct a conditional distribution p(y | x) by making ω a
parametrized function of x, and learning that function:

p(y | ω = f_θ(x)),

where f_θ(x) is the output of a predictor, x is its input, and y can be thought
of as a “target”. The use of the word “target” comes from the common cases of
classification and regression, where f_θ(x) is really a prediction associated with
random variable y, or with its expected value. However, in general ω = f_θ(x)
may contain parameters of the distribution of y other than its expected value.
For example, it could contain its variance or covariance, in the case where y is
conditionally Gaussian. In the above examples, with the squared error loss, ω is
the mean of the Gaussian which captures the conditional distribution of y (which
means that the variance is considered fixed, not a function of x). In the common
classification case, ω contains the probabilities associated with the various events
of interest.
Once we view things in this way, if we apply the principle of maximum likelihood
in the conditional case (Section 5.8.1), we automatically get as the natural
cost function the negative log-likelihood L(x, y) = −log p(y | ω = f_θ(x)). Besides
the expected value of y, there could be other parameters of the conditional
distribution of y that control the distribution of y, given x. For example, we
may wish to learn the variance of a conditional Gaussian for y, given x, and that
variance could be a function that varies with x or that is a constant with respect
to x. If the variance σ² of y given x is not a function of x, its maximum likelihood
value can be computed analytically because the maximum likelihood estimator of
variance is simply the empirical mean of the squared difference between observations
y and their expected value (here estimated by f_θ(x)). In the scalar case,
we could estimate σ² as follows:

σ² ← (1/n) Σ_{t=1}^{n} (y^{(t)} − f_θ(x^{(t)}))²   (6.8)
where ^{(t)} indicates the t-th training example (x^{(t)}, y^{(t)}). In other words, the
conditional variance can simply be estimated from the mean squared error. If
y is a d-vector and the conditional covariance is σ² times the identity, then the
above formula should be modified as follows, again by setting the gradient of the
log-likelihood with respect to σ to zero:

σ² ← (1/(nd)) Σ_{t=1}^{n} ||y^{(t)} − f_θ(x^{(t)})||².   (6.9)
In the multivariate case with a diagonal covariance matrix with entries σ_i², we
obtain

σ_i² ← (1/n) Σ_{t=1}^{n} (y_i^{(t)} − f_{θ,i}(x^{(t)}))².   (6.10)
In the multivariate case with a full covariance matrix, we have

Σ ← (1/n) Σ_{t=1}^{n} (y^{(t)} − f_θ(x^{(t)})) (y^{(t)} − f_θ(x^{(t)}))^⊤.   (6.11)
If the variance Σ(x) is a function of x, there is in general no analytic solution
for maximizing the likelihood, but we can compute the gradient of the objective
function with respect to the parameters θ that contribute to defining the mapping
from the input x to Σ(x). If Σ(x) is diagonal or scalar, only positivity must
be enforced, e.g., using the softplus non-linearity:

σ_i(x) = softplus(g_θ(x)),

where g_θ(x) may be a neural network that takes x as input. A positive non-linearity
may also be useful in the case where σ is a function of x, if we do not
seek the maximum likelihood solution (for example we do not have immediate
observed targets associated with that Gaussian distribution, because the samples
from the Gaussian are used as input for further computation). Then we can
make the free parameter ω defining the variance the argument of the positive
non-linearity, e.g., σ_i(x) = softplus(ω_i). If the covariance is full and conditional, then
a parametrization must be chosen that guarantees positive-definiteness of the predicted
covariance matrix. This can be achieved by writing Σ(x) = B(x)B^⊤(x),
where B is an unconstrained square matrix. One practical issue if the matrix
is full is that computing the likelihood is expensive, requiring O(d³) computation
for the determinant and inverse of Σ(x) (or equivalently, and more commonly
done, its eigendecomposition or that of B(x)).
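As a minimal sketch of this kind of parametrization (the two affine "networks" for the mean and the pre-variance, and all the shapes and values, are arbitrary illustrative choices), the following computes the negative log-likelihood of a conditional Gaussian whose diagonal standard deviations are kept positive with a softplus.

import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

def gaussian_nll(y, mu, var):
    # -log N(y | mu, diag(var)), summed over dimensions
    return 0.5 * np.sum((y - mu) ** 2 / var + np.log(2 * np.pi * var))

rng = np.random.RandomState(0)
d_in, d_out = 4, 2
W_mu, b_mu = 0.1 * rng.randn(d_out, d_in), np.zeros(d_out)   # affine layer for the mean
W_s,  b_s  = 0.1 * rng.randn(d_out, d_in), np.zeros(d_out)   # affine layer for the pre-variance

x, y = rng.randn(d_in), rng.randn(d_out)
mu    = b_mu + W_mu.dot(x)             # unconstrained mean
sigma = softplus(b_s + W_s.dot(x))     # sigma_i(x) = softplus(g_theta(x)) > 0
print(gaussian_nll(y, mu, sigma ** 2))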
Besides the Gaussian, a simple and common example is the case where y is
binary (i.e. Bernoulli distributed), where it is enough to specify ω = p(y = 1 | x).
In the multinoulli case (multiple discrete values), ω is generally specified by a
vector of probabilities (one per possible discrete value) summing to 1, e.g., via
the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural
networks is the mixture model, and in particular the Gaussian mixture model,
introduced in Section 3.10.5. Neural networks that compute the parameters of
a mixture model were introduced in Jacobs et al. (1991); Bishop (1994). In the
case of the Gaussian mixture model with N components,
p(y | x) = Σ_{i=1}^{N} p(c = i | x) N(y | µ_i(x), Σ_i(x)).

The neural network must have three kinds of outputs, p(c = i | x), µ_i(x), and
Σ_i(x), which must satisfy different constraints:
1. Mixture components p(c = i | x): these form a multinoulli distribution
over the N different components associated with the latent⁷ variable c, and can
typically be obtained by a softmax over an N-vector, to guarantee that
these outputs are positive and sum to 1.
⁷ c is called latent because we do not observe it in the data: given input x and target y, it
is not 100% clear which Gaussian component was responsible for y, but we can imagine that y
was generated by picking one of them, and make that unobserved choice a random variable.
2. Means µ_i(x): these indicate the center or mean associated with the i-th
Gaussian component, and are unconstrained (typically with no non-linearity
at all for these output units). If y is a d-vector, then the network must
output an N × d matrix containing all these N d-vectors.
3. Covariances Σ_i(x): these specify the covariance matrix for each component
i. For the general case of an unconditional (does not depend on x) but full
covariance matrix, see Eq. 6.11 to set it by maximum likelihood. In many
models the variance is both unconditional and diagonal (as assumed in
Eq. 6.10) or even scalar (as assumed in Eq. 6.8 or 6.9).
It has been reported that gradient-based optimization of conditional Gaussian
mixtures (on the output of neural networks) can be finicky, in part because one
gets divisions (by the variance) which can be numerically unstable (when some
variance gets to be small for a particular example, yielding very large gradients).
One solution is to clip gradients (see Section 10.7.6 and Mikolov (2012); Pascanu
and Bengio (2012); Graves (2013); Pascanu et al. (2013a)), while another is to
scale the gradients heuristically (Murray and Larochelle, 2014).
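A sketch of how such a mixture output could compute its negative log-likelihood, with diagonal covariances and a log-sum-exp for numerical stability; the number of components, the dimensions and the way the three kinds of outputs are produced here are illustrative assumptions rather than a prescribed architecture.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(a))

def logsumexp(v):
    m = v.max()
    return m + np.log(np.sum(np.exp(v - m)))

def mixture_nll(y, logits, mus, sigmas):
    # y: (d,); logits: (N,); mus: (N, d); sigmas: (N, d) standard deviations
    log_prior = np.log(softmax(logits))                 # log p(c = i | x), from a softmax over N scores
    log_comp = -0.5 * np.sum(((y - mus) / sigmas) ** 2
                             + np.log(2 * np.pi * sigmas ** 2), axis=1)   # log N(y | mu_i, Sigma_i)
    return -logsumexp(log_prior + log_comp)             # -log sum_i p(c = i | x) N(y | mu_i, Sigma_i)

rng = np.random.RandomState(0)
N, d = 3, 2
y = rng.randn(d)
logits = rng.randn(N)                # an N-vector of scores output by the network (illustrative)
mus = rng.randn(N, d)                # an N x d matrix of means (illustrative)
sigmas = softplus(rng.randn(N, d))   # positive standard deviations, e.g. via a softplus
print(mixture_nll(y, logits, mus, sigmas))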
Multiple Output Variables
When y is actually a tuple formed by multiple random variables y = (y_1, y_2, . . . , y_k),
then one has to choose an appropriate form for their joint distribution, conditional
on x = x. The simplest and most common choice is to assume that the y_i are
conditionally independent, i.e.,

p(y_1, y_2, . . . , y_k | x) = Π_{i=1}^{k} p(y_i | x).
This brings us back to the single variable case, especially since the log-likelihood
now decomposes into a sum of terms log p(y_i | x). If each p(y_i | x) is separately
parametrized (e.g., a different neural network), then we can train these neural
networks independently. However, a more common and powerful choice assumes
that the different variables y_i share some common factors, given x, that can be
represented in some hidden layer of the network (such as the top hidden layer).
See Sections 6.6 and 7.12 for a deeper treatment of the notion of underlying factors
of variation and multi-task training: each (x, y_i) pair of random variables can be
associated with a different learning task, but it might be possible to exploit what
these tasks have in common. See also Figure 7.6 illustrating these concepts.
If the conditional independence assumption is considered too strong, what can
we do? At this point it is useful to step back and consider everything we know
about learning a joint probability distribution. Since any probability distribution
p(y; ω) parametrized by parameters ω can be turned into a conditional distribution
p(y | x; θ) (by making ω a function ω = f_θ(x) parametrized by θ), we
can go beyond the simple parametric distributions we have seen above (Gaussian,
Bernoulli, multinoulli), and use more complex joint distributions. If the set of
values that y can take is small enough (e.g., we have 8 binary variables y_i, i.e., a
joint distribution involving 2^8 = 256 possible values), then we can simply model
all these joint occurrences as separate values, e.g., with a softmax and multinoulli
over all these configurations. However, when the set of values that y_i can take
cannot be easily enumerated and the joint distribution is not unimodal or
factorized, we need other tools. The third part of this book is about the frontier of
research in deep learning, and much of it is devoted to modeling such complex
joint distributions, also called graphical models: see Chapters 13, 18, 19, 20. In
particular, Section 12.5 discusses how sophisticated joint probability models with
parameters ω can be coupled with neural networks that compute ω as a function
of inputs x, yielding structured output models conditioned with deep learning.
6.3.3 Training Criterion and Regularizer
The loss function (often interpretable as a negative log-likelihood) tells us
what we would like the learner to capture. Maximizing the conditional log-likelihood
of model P over the true distribution Q, i.e., minimizing the expected
loss E_{Q(x,y)}[−log p(y | x; θ)] = E_{Q(x,y)}[L(f_θ(x), y)], makes p(y | x; θ) estimate
the true Q(y | x) associated with the unknown data generating distribution,
within the boundaries of what the chosen family of functions allows. See the end
of Section 5.8 and Section 5.8.1 for a longer discussion. In practice we cannot
minimize this expectation because we do not know Q(x, y) and because computing
and minimizing this integral exactly would generally be intractable. Instead
we are going to approximately minimize a training criterion J(θ) based on the
empirical average of the loss (over the training set). The simplest such criterion
is the average training loss (1/n) Σ_{t=1}^{n} L(f_θ(x^{(t)}), y^{(t)}), where the training set is a
set of n examples (x^{(t)}, y^{(t)}). However, better results can often be achieved by
crippling the learner and preventing it from simply finding the best θ that mini-
mizes the average training loss. This means that we combine the evidence coming
from the data (the training set average loss) with some a priori preference on
the different values that θ or f
θ
can take (the regularizer). If this concept (and
the related concepts of generalization, overfitting and underfitting) are not clear,
please return to Sections 5.3 and 5.5 for a refresher.
The most common regularizer is simply an additive term equal to a regularization
coefficient λ times the squared norm of the parameters⁸, ||θ||₂². This
regularizer is often called the weight decay or L2 weight decay or shrinkage,
because adding the squared L2 norm to the training criterion pushes the weights
towards zero, in proportion to their magnitude. For example, when using stochastic
gradient descent, each step of the update with regularizer term (λ/2)||θ||² would
look like

θ ← θ − ε ∇_θ L(f_θ(x), y) − ελθ,

where ε is the learning rate (see Section 4.3). This can be rewritten

θ ← θ(1 − ελ) − ε ∇_θ L(f_θ(x), y),

where we see that the first term systematically shaves off a small proportion of θ
before adding the gradient.
Another commonly used regularizer is the so-called L1 regularizer, which adds
to the training criterion a term proportional to the L1 (absolute value) norm,
|θ|₁ = Σ_i |θ_i|. The L1 regularizer also prefers values that are small, but whereas
the L2 weight decay has only a weak preference for 0 compared to small but
non-zero values, the L1 regularizer continues pushing the parameters towards
0 with a constant gradient even when they get very small. As a consequence,
it tends to bring some of the parameters to exactly 0 (how many depends on
how large the regularization coefficient λ is chosen). When optimizing with an
approximate, iterative and noisy method such as stochastic gradient descent, no
exact 0 is obtained, but near-zero values are observed. On the other hand, the L1
regularizer tolerates large values of some parameters (only additively removing
a small quantity compared to the situation without regularization) whereas the
L2 weight decay aggressively punishes and prevents large parameter values. The
L1 and L2 regularizers can be combined, as in the so-called elastic net (Zou and
Hastie, 2005), and this is commonly done in deep networks⁹.
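The different behaviour of the two penalties is easy to see directly in the stochastic gradient update. In the sketch below, grad stands for ∇_θ L on the current example (set to zero here to isolate the effect of the penalty), and the learning rate and regularization coefficient are placeholder values.

import numpy as np

epsilon, lam = 0.1, 0.01           # learning rate and regularization coefficient (illustrative)
theta = np.array([2.0, -0.3, 0.001])
grad = np.zeros_like(theta)        # pretend the data gradient is zero, to isolate the penalty

# L2 weight decay: multiplicative shrinkage, proportional to the parameter value
theta_l2 = theta * (1 - epsilon * lam) - epsilon * grad

# L1 penalty: constant-magnitude push towards 0, via the sign of the parameter
theta_l1 = theta - epsilon * (grad + lam * np.sign(theta))

print(theta_l2)   # every entry shrinks by 0.1%
print(theta_l1)   # large entries barely move, while the 0.001 entry is driven all the way to 0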
Note how in the context of maximum likelihood, regularizers can generally be
interpreted as Bayesian priors p(θ) on the parameters θ or on the learned function,
⁸ In principle, one could have different priors on different parameters, e.g., it is common to treat
the output weights with a separate regularization coefficient, but the more hyperparameters, the
more difficult is their optimization, discussed in Chapter 11.
⁹ See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.html#l1-and-l2-regularization
as discussed in Section 5.9. Later in this book we discuss regularizers that are
data-dependent (i.e., cannot be expressed easily as a pure prior on parameters),
such as the contractive penalty (Chapter 15) as well as regularization methods
that are difficult to interpret simply as added terms in the training criterion, such
as dropout (Section 7.11).
6.3.4 Optimization Procedure
Once a training criterion is defined, we enter the realm of iterative optimization
procedures, with the slight twist that if possible we want to monitor held-out
performance (for example the average loss on a validation set, which usually does
not include the regularizing terms) and not just the value of the training crite-
rion. Monitoring the validation set performance allows one to stop training at a
point where generalization is estimated to be the best among the training itera-
tions. This is called early stopping and is a standard machine learning technique,
discussed in Sec. 7.7.
Section 4.3 has already covered the basics of gradient-based optimization.
The simplest such technique is stochastic gradient descent (with minibatches), in
which the parameters are updated after computing the gradient of the average
loss over a minibatch of examples (e.g. 128 examples) and making an update in
the direction opposite to that gradient (or opposite to some accumulated average
of such gradients, i.e., the momentum technique, reviewed in Section 8.3.3). The
most commonly used optimization procedures for multi-layer neural networks and
deep learning in general are either variants of stochastic gradient descent (typi-
cally with minibatches), second-order methods (the most commonly used being
L-BFGS and nonlinear conjugate gradients) applied on large minibatches (e.g.
of size 10000) or on the whole training set at once, or natural gradient meth-
ods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu and Bengio,
2013). Exact second-order and natural gradient methods are computationally
too expensive for large neural networks because they involve matrices of dimen-
sion equal to the square of the number of parameters. Approximate methods are
discussed in Section 8.4.
On smaller datasets or when computing can be parallelized, second-order
methods have a computational advantage because they are easy to parallelize
and can still perform many updates per unit of time. On larger datasets (and in
the limit of an infinite dataset, i.e., a stream of training examples) one cannot
afford to wait to see the whole dataset before making an update, and this favors
either stochastic gradient descent variants (possibly with minibatches to take
advantage of fast or parallelized implementations of matrix-matrix products) or
second-order methods with minibatches.
A more detailed discussion of issues arising with optimization methods in deep
learning can be found in Chapter 8. Note that black-box optimization techniques
are not the only tools to improve the optimization of deep networks. Many design
choices in the construction of the family of functions, loss function and regularizer
can have a major impact on the difficulty of optimizing the training criterion.
Furthermore, instead of using generic optimization techniques, one can design
optimization procedures that are specific to the learning problem and chosen
architecture of the family of functions, for example by initializing the parameters
of the final optimization routine from the result of a different optimization (or a
series of optimizations, each initialized from the previous one). Because of the
non-convexity of the training criterion, the initial conditions can make a very
important difference, and this is what is exploited in the various pre-training
strategies, Sections 8.6.4 and 16.1, as well as with curriculum learning (Bengio
et al., 2009), Section 8.7.
6.4 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning
algorithm for multi-layer neural networks. Actually it just means the method for
computing gradients in such networks. Furthermore, it is generally understood as
something very specific to multi-layer neural networks, but once its derivation is
understood, it can easily be generalized to arbitrary functions (for which comput-
ing a gradient is meaningful), and we describe this generalization here, focusing
on the case of interest in machine learning where the output of the function to
differentiate (e.g., the loss L or the training criterion J) is a scalar and we are in-
terested in its derivative with respect to a set of parameters (considered to be the
elements of a vector θ), or equivalently, a set of inputs¹⁰. The partial derivative
of J with respect to θ (called the gradient) tells us whether θ should be increased
or decreased in order to decrease J, and is a crucial tool in optimizing the train-
ing objective. It can be readily proven that the back-propagation algorithm has
optimal computational complexity in the sense that there is no algorithm that
can compute the gradient faster (in the O(·) sense, i.e., up to an additive and
multiplicative constant).
The basic idea of the back-propagation algorithm is that the partial derivative
of the cost J with respect to parameters θ can be decomposed recursively by taking
into consideration the composition of functions that relate θ to J, via intermediate
quantities that mediate that influence, e.g., the activations of hidden units in a
deep neural network.
¹⁰ It is useful to know which inputs contributed most to the output or error made, and the sign
of the derivative is also interesting in that context.
6.4.1 Chain Rule
The basic mathematical tool for considering derivatives through compositions of
functions is the chain rule, illustrated in Figure 6.3. The partial derivative ∂y/∂x
measures the locally linear influence of a variable x on another one y, while we
denote ∇_θ J for the gradient vector of a scalar J with respect to some vector of
variables θ. If x influences y which influences z, we are interested in how a tiny
change in x propagates into a tiny change in z via a tiny change in y. In our case
of interest, the “output” is the cost, or objective function z = J(g(θ)), we want
the gradient with respect to some parameters x = θ, and there are intermediate
quantities y = g(θ) such as neural net activations. The gradient of interest can
then be decomposed, according to the chain rule, into

∇_θ J(g(θ)) = ∇_{g(θ)} J(g(θ)) × ∂g(θ)/∂θ,   (6.12)

which works also when J, g or θ are vectors rather than scalars (in which case
the corresponding partial derivatives are understood as Jacobian matrices of the
appropriate dimensions). In the purely scalar case we can understand the chain
rule as follows: a small change in θ will propagate into a small change in g(θ) by
getting multiplied by ∂g(θ)/∂θ. Similarly, a small change in g(θ) will propagate into
a small change in J(g(θ)) by getting multiplied by ∇_{g(θ)} J(g(θ)). Hence a small
change in θ first gets multiplied by ∂g(θ)/∂θ to obtain the change in g(θ) and this
then gets multiplied by ∇_{g(θ)} J(g(θ)) to obtain the change in J(g(θ)). Hence the
ratio of the change in J(g(θ)) to the change in θ is the product of these partial
derivatives.
Figure 6.3: The chain rule, illustrated in the simplest possible case, with z a scalar
function of y, which is itself a scalar function of x. A small change ∆x in x gets turned
into a small change ∆y in y through the partial derivative ∂y/∂x, from the first-order Taylor
approximation of y(x), and similarly for z(y). Plugging the equation for ∆y into the
equation for ∆z yields the chain rule.
Now, if g is a vector, we can rewrite the above as follows:

∇_θ J(g(θ)) = Σ_i ∂J(g(θ))/∂g_i(θ) × ∂g_i(θ)/∂θ,
which sums over the influences of θ on J(g(θ)) through all the intermediate
variables g_i(θ). This is illustrated in Figure 6.4 with x = θ, y_i = g_i(θ), and
z = J(g(θ)).

Figure 6.4: Top: The chain rule, when there are two intermediate variables y_1 and y_2
between x and z, creating two paths for changes in x to propagate and yield changes in
z. Bottom: more general case, with n intermediate variables y_1 to y_n.
6.4.2 Back-Propagation in an MLP
Whereas Example 6.2.1 illustrated the case of an MLP with a single hidden
layer, let us consider in this section back-propagation for an ordinary but deep
MLP, i.e., like the above vanilla MLP but with several hidden layers. For this
purpose, we will recursively apply the chain rule illustrated in Figure 6.4. The
algorithm proceeds by first computing the gradient of the cost J with respect to
output units, and these are used to compute the gradient of J with respect to the
top hidden layer activations, which directly influence the outputs. We can then
Algorithm 6.1 Forward computation associated with input x for a deep neural
network with ordinary affine layers composed with an arbitrary elementwise
differentiable (almost everywhere) non-linearity f. There are M such layers, each
mapping their vector-valued input h^{(k−1)} to a pre-activation vector a^{(k)} via a weight
matrix W^{(k)}, which is then transformed via f into h^{(k)}. The input vector x
corresponds to h^{(0)} and the predicted outputs ŷ correspond to h^{(M)}. The cost
function L(ŷ, y) depends on the output ŷ and on a target y (see Section 6.3.2
for examples of loss functions). The loss may be added to a regularizer Ω (see
Section 6.3.3 and Chapter 7) to obtain the example-wise cost J. Algorithm 6.2
shows how to compute gradients of J with respect to parameters W and b. For
computational efficiency on modern computers (especially GPUs), it is important
to implement these equations minibatch-wise, i.e., h^{(k)} (and similarly a^{(k)}) should
really be a matrix whose second dimension is the example index in the minibatch.
Accordingly, y and ŷ should have an additional dimension for the example index
in the minibatch, while J remains a scalar, i.e., the average of the costs over all
the minibatch examples.

h^{(0)} = x
for k = 1 . . . , M do
  a^{(k)} = b^{(k)} + W^{(k)} h^{(k−1)}
  h^{(k)} = f(a^{(k)})
end for
ŷ = h^{(M)}
J = L(ŷ, y) + λΩ
continue computing the gradients of lower-level hidden units one at a time in the
same way. The gradients on hidden and output units can be used to compute the
gradient of J with respect to the parameters (e.g., weights and biases) of each
layer (i.e., those that directly contribute to the output of that layer).
Algorithm 6.1 describes in matrix-vector form the forward propagation computation
for a classical multi-layer network with M layers, where each layer computes
an affine transformation (defined by a bias vector b^{(k)} and a weight matrix
W^{(k)}) followed by a non-linearity f. In general, the non-linearity may be different
on different layers, and this is typically the case for the output layer (see
Section 6.3.1). Hence each unit at layer k computes an output h_i^{(k)} as follows:

a_i^{(k)} = b_i^{(k)} + Σ_j W_{ij}^{(k)} h_j^{(k−1)}
h_i^{(k)} = f(a_i^{(k)}),   (6.13)

where we separate the affine transformation from the non-linear activation
operations for ease of exposition of the back-propagation computations.
Algorithm 6.2 Backward computation for the deep neural network of Algorithm
6.1, which uses, in addition to the input x, a target y. This computation
yields the gradients on the activations a^{(k)} for each layer k, starting from the
output layer and going backwards to the first hidden layer. From these gradients,
which can be interpreted as an indication of how each layer's output should
change to reduce error, one can obtain the gradient on the parameters of each
layer. The gradients on weights and biases can be immediately used as part of a
stochastic gradient update (performing the update right after the gradients have
been computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
  g ← ∇_ŷ J = ∇_ŷ L(ŷ, y) + λ∇_ŷ Ω
  (typically Ω is only a function of parameters, not activations, so the last term
  would be zero)
for k = M down to 1 do
  Convert the gradient on the layer's output into a gradient on the pre-nonlinearity
  activation (element-wise multiplication if f is element-wise):
    g ← ∇_{a^{(k)}} J = g ⊙ f′(a^{(k)})
  Compute gradients on weights and biases (including the regularization term,
  where needed):
    ∇_{b^{(k)}} J = g + λ∇_{b^{(k)}} Ω
    ∇_{W^{(k)}} J = g h^{(k−1)⊤} + λ∇_{W^{(k)}} Ω
  Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
    g ← ∇_{h^{(k−1)}} J = W^{(k)⊤} g
end for
The back-propagation computations are described in matrix-vector form by Algorithm
6.2 and proceed from the output layer towards the first hidden layer, as outlined above.
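A compact numpy sketch of Algorithms 6.1 and 6.2 follows, for a tanh network with a linear output layer and squared error loss; the regularizer is omitted here (Ω = 0) and the layer sizes are arbitrary.

import numpy as np

rng = np.random.RandomState(0)
sizes = [3, 5, 4, 2]                        # input, two hidden layers, output (arbitrary)
M = len(sizes) - 1
W = [0.1 * rng.randn(sizes[k + 1], sizes[k]) for k in range(M)]
b = [np.zeros(sizes[k + 1]) for k in range(M)]

def forward(x):
    # Algorithm 6.1: a[k] = b[k] + W[k] h[k-1]; h[k] = f(a[k]); linear output layer
    h, a = [x], []
    for k in range(M):
        a.append(b[k] + W[k].dot(h[-1]))
        h.append(np.tanh(a[-1]) if k < M - 1 else a[-1])
    return h, a

def backward(h, a, y):
    # Algorithm 6.2: start from the gradient on y_hat and walk back through the layers
    y_hat = h[-1]
    g = 2.0 * (y_hat - y)                          # grad of ||y_hat - y||^2 w.r.t. y_hat
    grads_W, grads_b = [None] * M, [None] * M
    for k in reversed(range(M)):
        if k < M - 1:
            g = g * (1.0 - np.tanh(a[k]) ** 2)     # through the element-wise tanh
        grads_b[k] = g
        grads_W[k] = np.outer(g, h[k])             # grad_W = g h^{(k-1) T}
        g = W[k].T.dot(g)                          # propagate to the layer below
    return grads_W, grads_b

x, y = rng.randn(3), rng.randn(2)
h, a = forward(x)
gW, gb = backward(h, a, y)
print(gW[0].shape, gb[-1].shape)                   # (5, 3) (2,)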
6.4.3 Back-Propagation in a General Flow Graph
In this section we call the intermediate quantities between inputs (parameters
θ) and output (cost J) of the graph nodes u_j (indexed by j) and consider the
general case in which they form a directed acyclic graph that has J as its final
node u_N, which depends on all the other nodes u_j. The back-propagation algorithm
exploits the chain rule for derivatives to compute ∂J/∂u_j when ∂J/∂u_i has already been
computed for successors u_i of u_j in the graph, e.g., the hidden units in the next
layer downstream. This recursion can be initialized by noting that ∂J/∂u_N = ∂J/∂J = 1
and at each step only requires the use of the partial derivatives associated with each
arc of the graph, ∂u_i/∂u_j, when u_i is a successor of u_j.
Figure 6.5: Illustration of recursive forward computation, where at each node u_i we
compute a value u_i = f_i(a_i), with a_i being the list of values from parents u_j of node
u_i. Following Algorithm 6.3, the overall inputs to the graph are u_1 . . . , u_M (e.g., the
parameters we may want to tune during training), and there is a single scalar output u_N
(e.g., the loss which we want to minimize).
Algorithm 6.3 Flow graph forward computation. Each node computes numerical
value u_i by applying a function f_i to its argument list a_i that comprises the values
of previous nodes u_j, j < i, with j ∈ parents(i). The input to the flow graph is
the vector x, and is set into the first M nodes u_1 to u_M. The output of the flow
graph is read off the last (output) node u_N.

for i = 1 . . . , M do
  u_i ← x_i
end for
for i = M + 1 . . . , N do
  a_i ← (u_j)_{j∈parents(i)}
  u_i ← f_i(a_i)
end for
return u_N
More generally than multi-layered networks, we can think about decomposing
a function J(θ) into a more complicated graph of computations. This graph is
called a flow graph. Each node u_i of the graph denotes a numerical quantity
that is obtained by performing a computation requiring the values u_j of other
nodes, with j < i. The nodes satisfy a partial order which dictates in what order
the computation can proceed. In practical implementations of such functions
(e.g., with the criterion J(θ) or its value estimated on a minibatch), the final
computation is obtained as the composition of simple functions taken from a
given set (such as the set of numerical operations that the numpy library can
perform on arrays of numbers).
Figure 6.6: Illustration of the indirect effect and direct effect of variable u_1 on variable u_3 in a
flow graph, which means that the derivative of u_3 with respect to u_1 must include the sum
of two terms, one for the direct effect (derivative of u_3 with respect to its first argument)
and one for the indirect effect through u_2 (involving the product of the derivative of u_3
with respect to u_2 times the derivative of u_2 with respect to u_1). Forward computation of
u_i's (as in Figure 6.5) is indicated with upward full arrows, while backward computation
(of derivatives with respect to u_i's, as in Figure 6.7) is indicated with downward dashed
arrows.
We will define the back-propagation in a general flow-graph, using the following
generic notation: u_i = f_i(a_i), where a_i is a list of arguments for the application
of f_i to the values u_j for the parents of i in the graph: a_i = (u_j)_{j∈parents(i)}. This
is illustrated in Figure 6.5.
The overall computation of the function represented by the flow graph can
thus be summarized by the forward computation algorithm, Algorithm 6.3.
In addition to having some code that tells us how to compute $f_i(a_i)$ for some values in the vector $a_i$, we also need some code that tells us how to compute its partial derivatives, $\frac{\partial f_i(a_i)}{\partial a_{ik}}$, with respect to any immediate argument $a_{ik}$. Let $k = \pi(i, j)$ denote the index of $u_j$ in the list $a_i$. Note that $u_j$ could influence $u_i$ through multiple paths. Whereas $\frac{\partial u_i}{\partial u_j}$ would denote the total gradient adding up all of these influences, $\frac{\partial f_i(a_i)}{\partial a_{ik}}$ only denotes the derivative of $f_i$ with respect to its specific k-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from $u_j$ to $u_i$. In general, when manipulating
partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument on the function output, with the other arguments considered fixed. For example, consider $f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}}$ and $f_2(a_{2,1}) = a_{2,1}^2$, while $u_3 = f_3(u_2, u_1)$ and $u_2 = f_2(u_1)$, illustrated in Figure 6.6. The direct derivative of $f_3$ with respect to its argument $a_{3,2}$ is
\[
\frac{\partial f_3}{\partial a_{3,2}} = e^{a_{3,1} + a_{3,2}},
\]
while if we consider the variables $u_3$ and $u_1$ to which these correspond, there are two paths from $u_1$ to $u_3$, and we obtain as derivative the sum of partial derivatives over these two paths,
\[
\frac{\partial u_3}{\partial u_1} = e^{u_1 + u_2}(1 + 2u_1).
\]
The results are different because $\frac{\partial u_3}{\partial u_1}$ involves not just the direct dependency of $u_3$ on $u_1$ but also the indirect dependency through $u_2$.
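The distinction can be checked numerically for this example; the following is a small sketch using finite differences (the step size 1e-6 is arbitrary):

```python
import numpy as np

def u3_of_u1(u1):
    """Total function u_3(u_1) = exp(u_2 + u_1), with u_2 = u_1 ** 2."""
    u2 = u1 ** 2
    return np.exp(u2 + u1)

u1 = 0.5
u2 = u1 ** 2
eps = 1e-6

# Direct (immediate) derivative: perturb only the argument slot of f_3 that
# receives u_1, keeping the argument that receives u_2 fixed.
direct = (np.exp(u2 + (u1 + eps)) - np.exp(u2 + u1)) / eps

# Total derivative: perturb u_1 and let the change also flow through u_2.
total = (u3_of_u1(u1 + eps) - u3_of_u1(u1)) / eps

print(direct)                            # ~ exp(u1**2 + u1)
print(total)                             # ~ exp(u1**2 + u1) * (1 + 2 * u1)
print(np.exp(u2 + u1) * (1 + 2 * u1))    # analytic total derivative
```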
Figure 6.7: Illustration of recursive backward computation, where we associate to each node j not just the value $u_j$ computed in the forward pass (Figure 6.5, bold upward arrows) but also the gradient $\frac{\partial u_N}{\partial u_j}$ with respect to the output scalar node $u_N$. These gradients are recursively computed in exactly the opposite order, as described in Algorithm 6.4, by using the already computed $\frac{\partial u_N}{\partial u_i}$ of the children i of j (dashed downward arrows).
Armed with this understanding, we can define the back-propagation algorithm as Algorithm 6.4, which is run after the forward propagation (Algorithm 6.3) has been performed. Note the recursive nature of the application of the chain rule in Algorithm 6.4: we compute the gradient on node j by re-using the already computed gradients of the children nodes i, starting the recurrence from the trivial $\frac{\partial u_N}{\partial u_N} = 1$ that sets the gradient for the output node. This is illustrated in Figure 6.7.
Algorithm 6.4 Back-propagation computation of a flow graph (full, upward arrows, Figs. 6.7 and 6.5), which itself produces an additional flow graph (dashed, backward arrows). See the forward propagation in a flow graph (Algorithm 6.3, to be performed first) and the required data structure. In addition, a quantity $\frac{\partial u_N}{\partial u_i}$ needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation $\pi(i, j)$ is the index of $u_j$ as an argument to $f_i$. The back-propagation algorithm efficiently computes $\frac{\partial u_N}{\partial u_i}$ for all i's (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node $u_N$ with respect to the “inputs” $u_1, \ldots, u_M$ (which could be the parameters, in a learning setup). The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

$\frac{\partial u_N}{\partial u_N} \leftarrow 1$
for j = N − 1 down to 1 do
  $\frac{\partial u_N}{\partial u_j} \leftarrow \sum_{i:\, j \in \mathrm{parents}(i)} \frac{\partial u_N}{\partial u_i} \frac{\partial f_i(a_i)}{\partial a_{i,\pi(i,j)}}$
end for
return $\left(\frac{\partial u_N}{\partial u_i}\right)_{i=1}^{M}$
This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming^11. Indeed, the derivative of the output node with respect to any node can also be written down in the following intractable form:
\[
\frac{\partial u_N}{\partial u_i} = \sum_{\substack{\text{paths } u_{k_1}, \ldots, u_{k_n}:\\ k_1 = i,\ k_n = N}} \ \prod_{j=2}^{n} \frac{\partial u_{k_j}}{\partial u_{k_{j-1}}}
\]
where the paths $u_{k_1}, \ldots, u_{k_n}$ go from the node $k_1 = i$ to the final node $k_n = N$ in the flow graph, and $\frac{\partial u_{k_j}}{\partial u_{k_{j-1}}}$ refers only to the immediate derivative, considering $u_{k_{j-1}}$ as argument number $\pi(k_j, k_{j-1})$ in the list $a_{k_j}$ of $f_{k_j}$, i.e.,
\[
\frac{\partial u_{k_j}}{\partial u_{k_{j-1}}} = \frac{\partial f_{k_j}(a_{k_j})}{\partial a_{k_j,\, \pi(k_j, k_{j-1})}}.
\]
Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because it employs a dynamic programming strategy to reuse rather than re-compute the partial sums associated with the gradients on intermediate nodes.

^11 Here we refer to “dynamic programming” in the sense of table-filling algorithms that avoid re-computing frequently used subexpressions. In the context of machine learning, “dynamic programming” can also refer to iterating Bellman's equations. That is not the kind of dynamic programming we refer to here.
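The following is a minimal Python sketch of Algorithm 6.4 (reverse accumulation), built on the same toy flow-graph representation used in the forward sketch above; the dictionary df of per-argument derivative functions is an illustrative name, not part of the algorithm statement.

```python
import numpy as np

def flow_graph_backward(u, parents, df):
    """Sketch of Algorithm 6.4: backward computation in a flow graph.

    u       : list of node values from the forward pass, u[-1] being u_N.
    parents : dict mapping each non-input node index i to its parent indices.
    df      : dict mapping each non-input node index i to a tuple of
              functions, one per argument, each returning the immediate
              derivative d f_i(a_i) / d a_{i,k} at the parent values a_i.
    Returns grad with grad[j] = d u_N / d u_j for every node j.
    """
    N = len(u)
    grad = [0.0] * N
    grad[N - 1] = 1.0                          # d u_N / d u_N = 1
    for j in range(N - 2, -1, -1):             # traverse the graph backwards
        for i, pa in parents.items():          # children i of node j
            if j in pa:
                k = pa.index(j)                # k = pi(i, j)
                a_i = tuple(u[p] for p in pa)
                grad[j] += grad[i] * df[i][k](a_i)
    return grad

# Same tiny graph as before: u_2 = u_1 ** 2 and u_3 = exp(u_1 + u_2).
parents = {1: (0,), 2: (0, 1)}
df = {1: (lambda a: 2 * a[0],),                # d u_2 / d u_1
      2: (lambda a: np.exp(a[0] + a[1]),       # direct effect of u_1 on u_3
          lambda a: np.exp(a[0] + a[1]))}      # effect of u_2 on u_3
u = [0.5, 0.25, np.exp(0.75)]                  # forward values
print(flow_graph_backward(u, parents, df)[0])  # exp(0.75) * (1 + 2 * 0.5)
```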
Although the above was stated as if the $u_i$'s were scalars, exactly the same procedure can be run with the $u_i$'s being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients $\frac{\partial u_N}{\partial u_i}$ with a Jacobian of partial derivatives associated with the $j \to i$ arc of the graph, $\frac{\partial f_i(a_i)}{\partial a_{i,\pi(i,j)}}$. In the case where minibatches are used during training, $u_i$ would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the former can be computed much more efficiently than a sequence of matrix-vector products (e.g., with the BLAS library), especially on modern computers and GPUs, which rely more and more on parallelization through many cores (the processing for each example in the minibatch can essentially be done in parallel).
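As a small illustration of this point (a sketch with arbitrary sizes, not a benchmark), the per-example matrix-vector products of an affine layer's backward pass collapse into a single matrix-matrix product over the minibatch:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 128, 256, 64
W = rng.standard_normal((n_out, n_in))          # layer weight matrix
G_out = rng.standard_normal((batch, n_out))     # gradients on the layer outputs

# Per-example backward pass: one matrix-vector product per example.
G_in_loop = np.stack([W.T @ g for g in G_out])

# Minibatch backward pass: one matrix-matrix product for the whole batch.
G_in_batch = G_out @ W

print(np.allclose(G_in_loop, G_in_batch))       # True
```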
6.4.4 Symbolic Back-propagation and Automatic Differentiation
The algorithm for generalized back-propagation (Alg. 6.4) was presented with the
interpretation that actual computations take place at each step of the algorithm.
This generalized form of back-propagation is just a particular way to perform
automatic differentiation (Rall, 1981) in computational flow graphs defined by
Algorithm 6.3. Automatic differentiation automatically obtains derivatives of a
given expression and has numerous uses in machine learning (Baydin et al., 2015).
As an alternative (and often as a debugging tool) derivatives could be obtained
by numerical methods based on measuring the effects of small changes, called
numerical differentiation (Lyness and Moler, 1967). For example, a finite difference approximation of the gradient follows from the definition of the derivative as the change in output that results from a change in input, divided by the change in input. Methods based on random perturbations also exist which randomly jiggle all the input variables (e.g., parameters) and associate these random input changes with the resulting overall change in the output variable in order to estimate the gradient (Spall, 1992). However, for obtaining a gradient (i.e., with respect to many variables, e.g., parameters of a neural network), back-propagation has two advantages over numerical differentiation: (1) it performs exact computation (up to machine precision), and (2) it is computationally much more efficient, obtaining all the required derivatives in one go. Instead, numerical differentiation methods either require redoing the forward propagation separately for each parameter (keeping the other ones fixed) or they yield stochastic estimators (from a random perturbation of all parameters) whose variance grows linearly with
the number of parameters. Automatic differentiation of a function with d inputs
and m outputs can be done either by carrying derivatives forward, or carrying
them backwards. The former is more efficient when d < m and the latter is more
efficient when d > m. In our use case, the output is a scalar (the cost), and the
backwards approach, called reverse accumulation, i.e., back-propagation, is much
more efficient than the approach of propagating derivatives forward in the graph.
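A finite-difference gradient check is easy to sketch; the helper below is illustrative and uses a central difference with an arbitrary step size, which is why it costs two forward passes per parameter:

```python
import numpy as np

def finite_difference_grad(J, theta, eps=1e-6):
    """Central finite-difference estimate of the gradient of a scalar cost J."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        grad[k] = (J(theta + e) - J(theta - e)) / (2 * eps)   # two forward passes
    return grad

# Example: J(theta) = 0.5 * ||theta||^2 has exact gradient theta itself.
theta = np.array([0.3, -1.2, 2.0])
J = lambda t: 0.5 * np.sum(t ** 2)
print(finite_difference_grad(J, theta))   # close to [0.3, -1.2, 2.0]
```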
Although Algorithm 6.4 can be seen as a form of automatic differentiation, it has another interpretation: each step symbolically specifies how gradient computation could be done, given a symbolic specification of the function to differentiate, i.e., it can be used to perform symbolic differentiation. Whereas automatic differentiation manipulates and outputs numbers, given a symbolic expression (the program specifying the function to be computed and differentiated), symbolic differentiation manipulates and outputs symbolic expressions, i.e., pieces of program, producing a symbolic expression for computing the derivatives. The popular Torch library^12 for deep learning, as well as most other open source deep learning libraries, implements a limited form of automatic differentiation restricted to the “programs” obtained by composing a predefined set of operations, each corresponding to a “module”. The set of these modules is designed such that many neural network architectures and computations can be performed by composing the building blocks represented by each of these modules. Each module is defined by two main functions: (1) one that computes the outputs y of the module given its inputs x, e.g., with an fprop function
y = module.fprop(x),
and (2) one that computes the gradient $\frac{\partial J}{\partial x}$ of a scalar (typically the minibatch cost J) with respect to the inputs x, given the gradient $\frac{\partial J}{\partial y}$ with respect to the outputs, e.g., with a bprop function
\[
\frac{\partial J}{\partial x} = \texttt{module.bprop}\left(\frac{\partial J}{\partial y}\right).
\]
The bprop function thus implicitly knows the Jacobian of the x to y mapping, $\frac{\partial y}{\partial x}$, at x, and knows how to multiply it with a given vector, which for the back-propagation computation will be the gradient on the output, $\frac{\partial J}{\partial y}$, i.e., it computes
\[
\frac{\partial J}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{\top} \frac{\partial J}{\partial y}
\]
if we take the convention that gradient vectors are column vectors. In practice,
implementations work in parallel over a whole minibatch (transforming matrix-
vector operations into matrix-matrix operations) and may operate on objects
^12 See torch.ch.
which are not vectors (maybe higher-order tensors like those involved with images or sequences of vectors). Furthermore, the bprop function does not have to explicitly compute the Jacobian matrix $\frac{\partial y}{\partial x}$ and perform an actual matrix multiplication: it can do that matrix multiplication implicitly, which is often more efficient. For example, if the true Jacobian is diagonal, then the actual number of computations required is much less than the size of the Jacobian matrix.
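A minimal sketch of such a module interface is given below; the class and method names are illustrative and not Torch's actual API. The example uses an element-wise sigmoid, whose Jacobian is diagonal and therefore never needs to be formed explicitly:

```python
import numpy as np

class SigmoidModule:
    """Element-wise sigmoid module with fprop/bprop methods (illustrative)."""

    def fprop(self, x):
        # Compute and cache the outputs y = sigmoid(x).
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def bprop(self, dJ_dy):
        # The Jacobian dy/dx is diagonal with entries y * (1 - y), so its
        # product with dJ/dy reduces to an element-wise multiplication and
        # no full Jacobian matrix is ever built.
        return dJ_dy * self.y * (1.0 - self.y)

module = SigmoidModule()
y = module.fprop(np.array([0.0, 1.0, -2.0]))
dJ_dy = np.ones(3)                      # pretend gradient from the layer above
print(module.bprop(dJ_dy))              # gradient of J with respect to x
```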
To keep computations efficient and avoid the overhead of the glue required to
compose modules together, neural net packages such as Torch define modules that
perform coarse-grained operations such as the cross-entropy loss, a convolution,
the affine operation associated with a fully-connected neural network layer, or a
softmax. This means that if one wants to write differentiable code for some computation that is not covered by the existing set of modules, one has to write one's own code for a new module, providing both the code for fprop and the code for
bprop. This is in contrast with standard automatic differentiation systems, which
know how to compute derivatives through all the operations in a general-purpose
programming language such as C.
Instead of interpreting Algorithm 6.4 as a recipe for backwards automatic differentiation, it can be interpreted as a recipe for backwards symbolic differentiation, and this is what the Theano (Bergstra et al., 2010b; Bastien et al., 2012) library^13 is doing. Like Torch, it only covers a predefined set of operations (i.e., a language that is a subset of usual programming languages), but it is a much larger and finer-grained set of operations, covering most of the operations on tensors and linear algebra defined in Python's numpy library of numerical computation. It is thus very rare that a user would need to write a new module for Theano, except to provide an alternative implementation (say, more efficient or numerically stable in some cases). An immediate advantage of symbolic differentiation is that because it maps symbolic expressions to symbolic expressions, it can be applied multiple times to yield higher-order derivatives. Another immediate advantage is that it can take advantage of the other tools of symbolic computation (Buchberger et al., 1983), such as simplification (to make computation faster and more memory-efficient) and transformations that make the computation more numerically stable (Bergstra et al., 2010b). These simplification operations keep it very efficient in terms of computation and memory usage even with a set of fine-grained operations such as individual tensor additions and multiplications. Theano also provides a compiler of the resulting expressions into C for CPUs and GPUs, i.e., the same high-level expression can be implemented in different ways depending on the underlying hardware.
^13 See http://deeplearning.net/software/theano/.
6.4.5 Back-propagation Through Random Operations and Graph-
ical Models
Whereas traditional neural networks perform deterministic computation, they
can be extended to perform stochastic computation. In this case, we can think of
the network as defining a sampling process that deterministically transforms some
random values. We can then apply backpropagation as usual, with the underlying
random values as inputs to the network.
As an example, let us consider the operation consisting of drawing samples z from a Gaussian distribution with mean µ and variance $\sigma^2$:
\[
z \sim \mathcal{N}(\mu, \sigma^2).
\]
Because an individual sample of z is not produced by a function, but rather by a sampling process whose output changes every time we query it, it may seem counterintuitive to take the derivatives of z with respect to the parameters of its distribution, µ and $\sigma^2$. However, we can rewrite the sampling process as transforming an underlying random value $\eta \sim \mathcal{N}(0, 1)$ to obtain a sample from the desired distribution:
\[
z = \mu + \sigma \eta. \tag{6.14}
\]
We are now able to backpropagate through the sampling operation, by regard-
ing it as a deterministic operation with an extra input. Crucially, the extra input
is a random variable whose distribution is not a function of any of the variables
whose derivatives we want to calculate. The result tells us how an infinitesi-
mal change in µ or σ would change the output if we could repeat the sampling
operation again with the same value of η.
Being able to backpropagate through this sampling operation allows us to incorporate it into a larger graph; e.g., we can compute the derivatives of some loss function J(z). Moreover, we can introduce functions that shape the distribution, e.g., $\mu = f(x; \theta)$ and $\sigma = g(x; \theta)$, and use back-propagation through these functions to derive $\nabla_\theta J(z)$.
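A tiny numpy sketch of this reparameterization (Equation 6.14) with a made-up loss J(z) = z² illustrates the resulting gradients; the chain-rule steps are written out by hand rather than by an automatic differentiation library:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8            # distribution parameters we want gradients for
eta = rng.standard_normal()     # underlying noise, independent of mu and sigma

# Reparameterized sample (Equation 6.14) and an example loss J(z) = z ** 2.
z = mu + sigma * eta
J = z ** 2

# Back-propagating through the deterministic transformation:
# dJ/dz = 2z, dz/dmu = 1, dz/dsigma = eta.
dJ_dz = 2 * z
dJ_dmu = dJ_dz * 1.0
dJ_dsigma = dJ_dz * eta
print(J, dJ_dmu, dJ_dsigma)
```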
The principle used in this Gaussian sampling example holds in general: given a value z sampled from a distribution $p(z \mid \omega)$ whose parameters ω may depend on other quantities of interest, we can rewrite
\[
z \sim p(z \mid \omega)
\]
as
\[
z = f(\omega, \eta)
\]
where η is a source of randomness that is independent of any of the variables that influence ω. While the generation process of any random variable can in principle be expressed this way, for discrete random variables it is of little use, because the gradient of the thresholding operation involved is zero or undefined everywhere; in that case one can fall back to REINFORCE and gradients of the expected loss (Williams, 1992).
In neural network applications, we typically choose η to be drawn from some
simple distribution, such as a unit uniform or unit Gaussian distribution, and
achieve more complex distributions by allowing the deterministic portion of the
network to reshape its input. This is actually how the random generators for
parametric distributions are implemented in software, i.e., by performing opera-
tions on approximately independent sources of noise (such as random bits). So
long as the function f in the above equation is differentiable with respect to ω,
we can back-propagate through the sampling operation.
The idea of propagating gradients or optimizing through stochastic operations
is old (Price, 1958; Bonnet, 1964), first used for machine learning in the context
of reinforcement learning (Williams, 1992), variational approximations (Opper
and Archambeau, 2009), and more recently, stochastic or generative neural net-
works (Bengio et al., 2013a; Kingma, 2013; Kingma and Welling, 2014b,a; Rezende
et al., 2014; Goodfellow et al., 2014c). Many networks, such as denoising autoen-
coders or networks regularized with dropout, are also naturally designed to take
noise as an input without requiring any special reparameterization to make the
noise independent from the model.
6.5 Universal Approximation Properties and Depth
A linear model, mapping from features to outputs via matrix multiplication, can
by definition represent only linear functions. It has the advantage of being easy
to train because many loss functions result in a convex optimization problem
when applied to linear models. Unfortunately, we often want to learn non-linear
functions.
At first glance, we might presume that learning a non-linear function requires
designing a specialized model family for the kind of non-linearity we want to
learn. However, it turns out that feedforward networks with hidden layers provide
a universal approximation framework. Specifically, the universal approximation
theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network
with a linear output layer and at least one hidden layer with any “squashing”
activation function (such as the logistic sigmoid activation function) can approxi-
mate any Borel measurable function from one finite-dimensional space to another
with any desired non-zero amount of error, provided that the network is given
enough hidden units. The derivatives of the feedforward network can also approx-
imate the derivatives of the function arbitrarily well (Hornik et al., 1990). The
concept of Borel measurability is beyond the scope of this book; for our purposes
it suffices to say that any continuous function on a closed and bounded subset of
$\mathbb{R}^n$
is Borel measurable and therefore may be approximated by a neural network.
A neural network may also approximate any function mapping from any finite
dimensional discrete space to another.
The universal approximation theorem means that regardless of what function
we are trying to learn, we know that a large MLP will be able to represent this
function. However, we are not guaranteed that the training algorithm will be
able to learn that function. Even if the MLP is able to represent the function,
learning can fail for two different reasons. First, the optimization algorithm used
for training may not be able to find the value of the parameters that corresponds
to the desired function. Second, the training algorithm might choose the wrong
function due to overfitting. Recall from Chapter 5.4 that the “no free lunch”
theorem shows that there is no universal machine learning algorithm. Even though
feedforward networks provide a universal system for representing functions, there
is no universal procedure for examining a training set and choosing the right set
of functions among the family of functions our approximator can represent: there
could be many functions within our family that fit the data well, and we need to choose one (this is basically the overfitting scenario).
Another but related problem facing our universal approximation scheme is
the size of the model needed to represent a given function. The universal ap-
proximation theorem says that there exists a network large enough to achieve any
degree of accuracy we desire, but it does not say how large this network will be.
Barron (1993) provides some bounds on the size of a single-layer network needed
to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (to basically record every input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors $v \in \{0, 1\}^n$ is $2^{2^n}$, and selecting one such function requires $2^n$ bits, which will in general require $O(2^n)$ degrees of freedom.
In summary, a feedforward network with a single layer is sufficient to represent
any function, but it may be infeasibly large and may fail to learn and generalize
correctly. Both of these failure modes suggest that we may want to use deeper
models.
First, we may want to choose a model with more than one hidden layer in
order to avoid needing to make the model infeasibly large. There exist families
of functions which can be approximated efficiently by an architecture with depth
greater than some value d, but require a much larger model if depth is restricted
to be less than or equal to d. In many cases, the number of hidden units required
by the shallow model is exponential in n. Such results have been proven for logic
gates (Håstad, 1986), linear threshold units with non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio, 2011) organized as
deep sum-product networks (Poon and Domingos, 2011), and more recently, for
deep networks of rectifier units (Pascanu et al., 2013b). Of course, there is no
guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property.
We may also want to choose a deep model for statistical reasons. Any time
we choose a specific machine learning algorithm, we are implicitly stating some
set of prior beliefs we have about what kind of function the algorithm should
learn. Choosing a deep model encodes a very general belief that the function
we want to learn should involve composition of several simpler functions. This
can be interpreted from a representation learning point of view as saying that
we believe the learning problem consists of discovering a set of underlying factors
of variation that can in turn be described in terms of other, simpler underlying
factors of variation. Alternately, we can interpret the use of a deep architecture
as expressing a belief that the function we want to learn is a computer program
consisting of multiple steps, where each step makes use of the previous step’s
output. These intermediate outputs are not necessarily factors of variation, but
can instead be analogous to counters or pointers that the network uses to organize
its internal processing. Empirically, greater depth does seem to result in better
generalization for a wide variety of tasks (Bengio et al., 2007b; Erhan et al., 2009;
Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al., 2012;
Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a; Couprie
et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a).
See Fig. 6.8 for an example of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.
Figure 6.8: Empirical results showing that deeper networks generalize better when used
to transcribe multi-digit numbers from photographs of addresses. Reproduced with per-
mission from Goodfellow et al. (2014d). Left) The test set accuracy consistently increases
with increasing depth. Right) This effect cannot be explained simply by the model being
larger; one can also increase the model size by increasing the width of each layer. The test
accuracy cannot be increased nearly as well by increasing the width, only by increasing
the depth. This suggests that using a deep model expresses a useful preference over the
space of functions the model can learn. Specifically, it expresses a belief that the function
should consist of many simpler functions composed together. This could result either
in learning a representation that is composed in turn of simpler representations (e.g.,
corners defined in terms of edges) or in learning a program with sequentially dependent
steps (e.g., first locate a set of objects, then segment them from each other, then recognize
them).
6.6 Feature / Representation Learning
Let us consider again the single layer networks such as the perceptron, linear
regression and logistic regression: such linear models are appealing because train-
ing them involves a convex optimization problem^14, i.e., an optimization problem
with some convergence guarantees towards a global optimum, irrespective of ini-
tial conditions. Simple and well-understood optimization algorithms are available
in this case. However, this limits the representational capacity too much: many
tasks, for a given choice of input representation x (the raw input features), cannot
be solved by using only a linear predictor. What are our options to avoid that
limitation?
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Schölkopf et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of much higher dimension. In this case, $f_\theta(x) = b + w \cdot \phi(x)$ can be linear in the parameters (and in φ(x)) and optimization remains convex (or even analytic). By exploiting the kernel trick, we can computationally handle a high-dimensional φ(x) (or even an infinite-dimensional one) so long as the kernel $k(u, v) = \phi(u) \cdot \phi(v)$ (where · is the appropriate dot product for the space of φ(·)) can be computed efficiently. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization is not at all guaranteed: it will depend on the appropriateness of the choice of φ as a feature space for our task. Kernel machines theory clearly identifies the choice of φ with the choice of a prior. This leads to kernel engineering, which is equivalent to feature engineering, discussed next. The other, very commonly used, type of kernel embodies a very broad prior, such as smoothness, e.g., the Gaussian (or RBF) kernel $k(u, v) = \exp\left(-\|u - v\|^2\right)$. Unfortunately, this prior may be insufficient, i.e., too broad and sensitive to the curse of dimensionality, as introduced in Section 5.13.1 and developed in more detail in Chapter 16.
^14 Or even one for which an analytic solution can be computed, as with linear regression or some Gaussian process regression models.
2. Another option is to manually engineer the representation or features φ(x).
Most industrial applications of machine learning rely on hand-crafted fea-
tures and most of the research and development effort (as well as a very
large fraction of the scientific literature in machine learning and its applica-
tions) goes into designing new features that are most appropriate to the task
at hand. Clearly, faced with a problem to solve and some prior knowledge
in the form of representations that are believed to be relevant, the prior
knowledge can be very useful. This approach is therefore common in prac-
tice, but is not completely satisfying because it involves very task-specific engineering work and a laborious, never-ending effort to improve systems by
designing better features. If there were some more general feature learning
approaches that could be applied to a large set of related tasks (such as
those involved in AI), we would certainly like to take advantage of them.
Since humans seem to be able to learn a lot of new tasks (for which they
were not programmed by evolution), it seems that such broad priors do
exist. This whole question is discussed in a lot more detail in Bengio and
LeCun (2007a), and motivates the third option.
3. The third option is to learn the features, or learn the representation. In a
sense, it allows one to interpolate between the almost agnostic approach
of a kernel machine with a general-purpose smoothness kernel (such as
RBF SVMs and other non-parametric statistical models) and full designer-
provided knowledge in the form of a fixed representation that is perfectly
tailored to the task. This is equivalent to the idea of learning the kernel, ex-
cept that whereas most kernel learning methods only allow very few degrees
of freedom in the learned kernel, representation learning methods such as
those discussed in this book (including multi-layer neural networks) allow
the feature function φ(·) to be very rich (with a number of parameters that
can be in the millions or more, depending on the amount of data available).
This is equivalent to learning the hidden layers, in the case of a multi-layer
neural network. Besides smoothness (which comes for example from reg-
ularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed
above (Section 6.5). Other priors are discussed in Chapter 16.
This whole discussion is clearly not specific to neural networks and supervised
learning, and is one of the central motivations for this book.
6.7 Piecewise Linear Hidden Units
Most of the recent improvement in the performance of deep neural networks can be
attributed to increases in computational power and the size of datasets. The ma-
chine learning algorithms involved in recent state-of-the-art systems have mostly
existed since the 1980s, with a few recent conceptual advances contributing sig-
nificantly to increased performance.
One of the main algorithmic improvements that has had a significant impact
is the use of piecewise linear units, such as absolute value rectifiers and rectified
linear units. Such units consist of two linear pieces and their behavior is driven
by a single weight vector. Jarrett et al. (2009b) observed that “using a rectifying
non-linearity is the single most important factor in improving the performance of a
recognition system” among several different factors of neural network architecture
design.
For small datasets, Jarrett et al. (2009b) observed that using a rectifying non-linearity is even more important than learning the weights of the hidden layers. Random
weights are sufficient to propagate useful information through a rectified linear
network, allowing the classifier layer at the top to learn how to map different
feature vectors to class identities.
When more data is available, learning can extract more knowledge from it, and learned parameters typically beat fixed or random settings. Glorot et al. (2011b) showed that learning is far easier in deep
rectified linear networks than in deep networks that have curvature or two-sided
saturation in their activation functions. Because the behavior of the unit is linear
over half of its domain, it is easy for an optimization algorithm to tell how to im-
prove the behavior of a unit, even when the unit’s activations are far from optimal.
Just as piecewise linear networks are good at propagating information forward,
back-propagation in such a network is also piecewise linear and propagates infor-
mation about the error derivatives to all of the gradients in the network. Each
piecewise linear function can be decomposed into different regions corresponding
to different linear pieces. When we change a parameter of the network, the result-
ing change in the network’s activity is linear until the point that it causes some
unit to go from one linear piece to another. Traditional units such as sigmoids
are more prone to discarding information due to saturation both in forward prop-
agation and in back-propagation, and the response of such a network to a change
in a single parameter may be highly nonlinear even in a small neighborhood.
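A tiny numpy illustration of this contrast (a sketch only): the sigmoid's derivative shrinks toward zero as a unit saturates, while a rectified linear unit's derivative is exactly 0 or exactly 1.

```python
import numpy as np

a = np.array([-6.0, -0.5, 0.5, 6.0])        # pre-activations, two far from zero

sigmoid = 1.0 / (1.0 + np.exp(-a))
sigmoid_grad = sigmoid * (1.0 - sigmoid)     # saturates: tiny for large |a|
relu_grad = (a > 0).astype(float)            # always exactly 0 or 1

print(sigmoid_grad)   # roughly [0.0025, 0.235, 0.235, 0.0025]
print(relu_grad)      # [0., 0., 1., 1.]
```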
Glorot et al. (2011b) motivate rectified linear units from biological considera-
tions. The half-rectifying non-linearity was intended to capture these properties
of biological neurons: 1) For some inputs, biological neurons are completely in-
active. 2) For some inputs, a biological neuron’s output is proportional to its
input. 3) Most of the time, biological neurons operate in the regime where they
are inactive (e.g., they should have sparse activations).
One drawback to rectified linear units is that they cannot learn via gradient-
based methods on examples for which their activation is zero. This problem
can be mitigated by initializing the biases to a small positive number, but it is
still possible for a rectified linear unit to learn to de-activate and then never be
activated again. Goodfellow et al. (2013a) introduced maxout units and showed
that maxout units can successfully learn in conditions where rectified linear units
become stuck. Maxout units are also piecewise linear, but unlike rectified linear
units, each piece of the linear function has its own weight vector, so whichever
piece is active can always learn. Due to the greater number of weight vectors,
maxout units typically need extra regularization such as dropout, though they can
work satisfactorily if the training set is large and the number of pieces per unit
is kept low (Cai et al., 2013). Maxout units have a few other benefits. In some
cases, one can gain some statistical and computational advantages by requiring
fewer parameters. Specifically, if the features captured by n different linear filters
can be summarized without losing information by taking the max over each group
of k features, then the next layer can get by with k times fewer weights. Because
each unit is driven by multiple filters, maxout units have some redundancy that
helps them to resist forgetting how to perform tasks that they were trained on in
the past. Neural networks trained with stochastic gradient descent are generally
believed to suffer from a phenomenon called catastrophic forgetting but maxout
units tend to exhibit only mild forgetting (Goodfellow et al., 2014a). Maxout
units can also be seen as learning the activation function itself rather than just
the relationship between units. With large enough k, a maxout unit can learn to
approximate any convex function with arbitrary fidelity. In particular, maxout
with two pieces can learn to implement the rectified linear activation function or
the absolute value rectification function.
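Below is a small numpy sketch of these piecewise linear activations; the shapes and the choice of a scalar maxout unit with k pieces are illustrative.

```python
import numpy as np

def relu(a):
    """Rectified linear unit: linear on half of its domain, zero elsewhere."""
    return np.maximum(0.0, a)

def abs_rectifier(a):
    """Absolute value rectification: two linear pieces, one weight vector."""
    return np.abs(a)

def maxout(x, W, b):
    """Maxout unit: max over k linear pieces, each with its own weight vector.

    W has shape (k, n_in) and b has shape (k,); returns a scalar activation.
    """
    return np.max(W @ x + b)

x = np.array([1.0, -2.0])
w = np.array([0.5, 0.3])
a = w @ x
print(relu(a), abs_rectifier(a))
# A maxout unit with the two pieces {w, 0} reproduces relu(w @ x):
print(maxout(x, np.stack([w, np.zeros_like(w)]), np.zeros(2)))
```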
This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep feedforward networks. Recurrent
networks can learn from sequences and produce a sequence of states and outputs.
When training them, one needs to propagate information through several time
steps, which is much easier when some linear computations (with some directional
derivatives being of magnitude near 1) are involved. One of the best-performing
recurrent network architectures, the LSTM, propagates information through time
via summation, a particularly straightforward kind of such linear activation. This
is discussed further in Section 10.7.4.
In addition to helping to propagate information and making optimization
easier, piecewise linear units also have some nice properties that can make them
easier to regularize. This is discussed further in Section 7.11.
Sigmoidal non-linearities still perform well in some contexts and are required
when a hidden unit must compute a number guaranteed to be in a bounded
interval (like in the (0,1) interval), but piecewise linear units are now by far the
most popular kind of hidden units.
6.8 Historical Notes
Section 1.2 already gave an overview of the history of neural networks and deep
learning. Here we focus on historical notes regarding back-propagation and the
connectionist ideas that are still at the heart of today’s research in deep learning.
The chain rule was invented in the 17th century (Leibniz, 1676; L'Hôpital, 1696) and gradient descent in the 19th century (Cauchy, 1847b). Efficient applica-
tions of the chain rule which exploit the dynamic programming structure described
in this chapter are found already in the 1960’s and 1970’s, mostly for control ap-
plications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and
Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976).
Bringing these ideas to the optimization of weights of artificial neural networks
with continuous-valued outputs was introduced by Werbos (1981) and rediscov-
ered independently in different ways as well as actually simulated successfully
by LeCun (1985); Parker (1985); Rumelhart et al. (1986a). In particular, Rumel-
hart et al. (1986a) and a corresponding chapter (Rumelhart et al., 1986b) in the
PDP book (Rumelhart et al., 1986d) greatly contributed to popularize the idea
of back-propagation and initiated a very active period of research in multi-layer
neural networks. However, the ideas put forward by the authors of that book and
in particular by Rumelhart and Hinton go much beyond back-propagation. They
include crucial ideas about the possible computational implementation of several
central aspects of cognition and learning, which came under the name of “connectionism” because of the importance given to the connections between neurons as the locus of learning and memory. In particular, these ideas include the notion of
distributed representation, introduced in Chapter 1 and developed a lot more in
part III of this book, with Chapter 16, which is at the heart of the generalization
ability of neural networks. As discussed in the historical survey in Section 1.2, the boom of AI and machine learning research which followed from the connectionist ideas reached a peak in the early 1990's, as far as neural networks are concerned, while other machine learning techniques became more popular in the late 1990's and remained so for the first decade of this century. Neural networks
research in the AI and machine learning community almost vanished then, only
to be reborn ten years later (starting in 2006) with a novel focus on the depth of
representation and the current wave of research on deep learning. In addition to
back-propagation and distributed representations, the connectionists brought the
idea of iterative inference (they used different words), viewing neural computation
in the brain as a way to look for a configuration of neurons that best satisfy all
the relevant pieces of knowledge implicitly captured in the weights of the neural
network. This view turns out to be central in the topics covered in part III of
this book regarding probabilistic models and inference.