Chapter 6
Feedforward Deep Networks
6.1 Formalizing and Generalizing Neural Networks
Feedforward supervised neural networks were among the first and most successful learn-
ing algorithms (Rumelhart et al., 1986c,b). They are also called deep networks, multi-layer Perceptrons (MLPs), or simply neural networks. In addition to covering the basics of
such networks, this chapter introduces a general formalism for gradient-based optimiza-
tion of parametrized families of functions, often in the context of conditional maximum
likelihood criteria (Section 6.2).
They bring together a number of important machine learning concepts already in-
troduced in the previous chapters:
• Define a parametrized family of functions f_θ describing how the learner will behave on new examples, i.e., what output f_θ(x) it will produce given some input x. Training consists in choosing the parameter θ (usually represented by a vector) given some training examples (x, y) sampled from an unknown data generating distribution P(X, Y).
• Define a loss function L describing what scalar loss L(ŷ, y) is associated with each supervised example (x, y), as a function of the learner's output ŷ = f_θ(x) and the target output y.
• Define a training criterion and a regularizer. The objective of training is ideally to minimize the expected loss E_{X,Y}[L(f_θ(X), Y)] over X, Y sampled from the unknown data generating distribution P(X, Y). However, this is not possible because the expectation makes use of the true underlying P(X, Y), but we only have access to a finite number of training examples, i.e., of pairs (X, Y). Instead, one defines a training criterion which usually includes an empirical average of the loss over the training set, plus some additional terms (called regularizers) which enforce some preferences over the choices of θ.
• Define an optimization procedure to approximately minimize the training criterion¹. The simplest such optimization procedure is stochastic gradient descent, described in Section 4.3.
Example 6.1.1 illustrates these concepts for the case of a vanilla neural network for
regression.
In chapter 10 we consider generalizations of the above framework to the unsupervised and semi-supervised cases, where Y may not exist or may not always be present. An unsupervised loss can then be defined solely as a function of the input x and some function f_θ(x) that one wishes to learn.
Figure 6.1: Vanilla MLP, with one sigmoid hidden layer, computing the vector of hidden unit values h = sigmoid(c + Wx) with weight matrix W and bias (offset) vector c. The output vector is obtained via another learned affine transformation ŷ = b + V h, with weight matrix V and output bias (offset) vector b.
¹ It is generally not possible to analytically obtain a global minimum of the training criterion, and iterative numerical optimization methods are used instead.
Example 6.1.1. Vanilla Multi-Layer Neural Network for Regression
Based on the above definitions, we could pick the family of input-output functions to be

f_θ(x) = b + V sigmoid(c + Wx),

illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + e^{−a}) is applied element-wise, the input is the vector x ∈ ℝ^{n_i}, the hidden layer outputs are the elements of the vector h = sigmoid(c + Wx) with n_h entries, and the parameters are θ = (b, c, V, W), with b ∈ ℝ^{n_o} a vector of the same dimension as the output (n_o), c ∈ ℝ^{n_h} of the same dimension as h (the number of hidden units), and V ∈ ℝ^{n_o × n_h} and W ∈ ℝ^{n_h × n_i} being weight matrices.
The loss function for this classical example could be the squared error L(ŷ, y) = ||ŷ − y||² (see Section 5.8.1 showing that it makes ŷ an estimator of E[Y|x]). The regularizer could be the ordinary L2 weight decay λ||ω||² = λ(Σ_{ij} W_{ij}² + Σ_{ki} V_{ki}²), where we define the set of weights ω as the concatenation of the elements of matrices W and V. The L2 weight decay thus penalizes the squared norm of the weights, with λ a scalar that is chosen larger to penalize more strongly and yield smaller weights. Combining the loss function and the regularizer gives the training criterion, which is the objective function during training:

C(θ) = λ||ω||² + (1/n) Σ_{i=1}^{n} ||y_i − (b + V sigmoid(c + Wx_i))||².
Finally, the classical training procedure in this example is stochastic gradient descent, which iteratively updates θ according to

ω ← ω − ε (2λω + ∂L(f_θ(x_i), y_i)/∂ω)
β ← β − ε ∂L(f_θ(x_i), y_i)/∂β,

where ε is the learning rate (see Section 4.3), β = (b, c) contains the offset (bias) parameters, and i is incremented after each training example is shown, modulo n.
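To make Example 6.1.1 concrete, here is a minimal numpy sketch of this vanilla MLP for regression, with the squared error loss, the L2 weight decay λ||ω||², and the stochastic gradient updates written above. The layer sizes, initialization scale, learning rate and decay coefficient are arbitrary illustrative choices, not values prescribed by the text.

import numpy as np

rng = np.random.RandomState(0)
n_i, n_h, n_o = 5, 10, 2             # input, hidden and output sizes (arbitrary)
W = 0.1 * rng.randn(n_h, n_i)        # hidden weights
c = np.zeros(n_h)                    # hidden biases
V = 0.1 * rng.randn(n_o, n_h)        # output weights
b = np.zeros(n_o)                    # output biases
lam, eps = 1e-4, 0.01                # weight decay coefficient and learning rate

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(x, y):
    """One stochastic gradient step on example (x, y)."""
    h = sigmoid(c + W.dot(x))                 # hidden layer
    y_hat = b + V.dot(h)                      # network output f_theta(x)
    g_yhat = 2.0 * (y_hat - y)                # gradient of ||y_hat - y||^2 w.r.t. y_hat
    g_V, g_b = np.outer(g_yhat, h), g_yhat
    g_a = V.T.dot(g_yhat) * h * (1.0 - h)     # back through the sigmoid
    g_W, g_c = np.outer(g_a, x), g_a
    V -= eps * (2 * lam * V + g_V)            # weights also feel the decay term
    W -= eps * (2 * lam * W + g_W)
    b -= eps * g_b                            # offsets are not decayed
    c -= eps * g_c

# toy usage: a few passes over random data
X, Y = rng.randn(100, n_i), rng.randn(100, n_o)
for epoch in range(5):
    for x, y in zip(X, Y):
        sgd_step(x, y)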
6.2 Parametrizing a Learned Predictor
There are many ways to define the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described below,
while more advanced ones are left to later chapters, in particular Chapters 10 and 12.
6.2.1 Family of Functions
A motivation for the family of functions defined by multi-layer neural networks is to
compose affine transformations and non-linearities, where the choice of parameter values
controls the degree and location of these non-linearities. As discussed in Section 6.4
below, with the appropriate choice of parameters, multi-layer neural networks can in
principle approximate smooth functions, with more hidden units allowing one to achieve better approximations.
A multi-layer neural network with more than one hidden layer can be defined by generalizing the above structure, e.g., as follows, where we chose to use hyperbolic tangent² activation functions instead of sigmoid activation functions:
h^k = tanh(b^k + W^k h^{k−1})

where h^0 = x is the input of the neural net, h^k (for k > 0) is the output of the k-th hidden layer, which has weight matrix W^k and offset (or bias) vector b^k. If we want the output f_θ(x) to lie in some desired range, then we typically define an output non-linearity (which we did not have in the above Example 6.1.1). The non-linearity for the output layer is of course generally different from the tanh, depending on the type of output to be predicted and the associated loss function (see below).
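As an illustration, a forward pass through such a stack of tanh layers takes only a few lines of numpy; the layer sizes below and the purely affine output layer are arbitrary choices made for this sketch.

import numpy as np

def forward(x, weights, biases):
    """Compute h^k = tanh(b^k + W^k h^{k-1}) with h^0 = x; the last layer is kept
    affine here, since the output non-linearity depends on the loss (see below)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(b + W.dot(h))
    return biases[-1] + weights[-1].dot(h)

# toy usage with layer sizes 4 -> 8 -> 8 -> 3 (arbitrary)
rng = np.random.RandomState(0)
sizes = [4, 8, 8, 3]
weights = [0.1 * rng.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = forward(rng.randn(4), weights, biases)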
As discussed in Chapter 19 (Section 19.5) there are several other non-linearities
besides the sigmoid and the hyperbolic tangent which have been successfully used with
neural networks. In particular, TODO discusses alternative non-linear units, such as the rectified linear unit (max(0, b + w · x)) and the maxout unit (max_i(b_i + W_i · x)), which have been particularly successful in the case of deep feedforward or convolutional networks. These and other non-linearities and neural network computations commonly found in the literature are summarized below.
• Rectifier or rectified linear unit (ReLU) or positive part: h(x) = max(0, b + w · x).

• Hyperbolic tangent: h(x) = tanh(b + w · x).

• Sigmoid: h(x) = 1/(1 + e^{−(b+w·x)}).

• Softmax: h(x) = softmax(b + Wx). It is mostly used as output non-linearity for predicting discrete probabilities. See definition and discussion below, Eq. 6.1.

• Radial basis function or RBF unit: h(x) = e^{−||w−x||²}. This is heavily used in kernel SVMs (Boser et al., 1992; Schölkopf et al., 1999) and has the advantage that such units can be easily initialized as a random subset of the input examples (Powell, 1987; Niranjan and Fallside, 1990).

• Softplus: this is a smooth version of the rectifier, introduced in Dugas et al. (2001) for function approximation and in Nair and Hinton (2010a) for RBMs. When they introduced the rectifier for deep supervised nets, Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter, in spite of the very similar shape and of the softplus being differentiable everywhere with a non-zero derivative, contrary to the rectifier.

• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded: h(x) = max(−1, min(1, x)). It was introduced by Collobert (2004).

• Absolute value rectification: h(x) = |x| (may be applied on the dot product output or on the output of a tanh unit). It is also related to the rectifier and has been used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination.

• Maxout: this is discussed in more detail in TODO. It generalizes the rectifier but introduces multiple weight vectors w_i (called filters) for each unit: h(x) = max_i(b_i + w_i · x).

² The hyperbolic tangent is linearly related to the sigmoid as follows: tanh(x) = 2 × sigmoid(2x) − 1.
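The following numpy one-liners sketch these unit computations, written either on a precomputed pre-activation a = b + w · x or directly on the relevant inputs; they are illustrative definitions for this list, not a library API.

import numpy as np

def relu(a):                  # rectifier / positive part
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):               # subtracting max(a) uses the additive invariance discussed with Eq. 6.1
    e = np.exp(a - np.max(a))
    return e / e.sum()

def rbf(w, x):                # radial basis function unit
    return np.exp(-np.sum((w - x) ** 2))

def softplus(a):              # smooth version of the rectifier: log(1 + e^a)
    return np.log1p(np.exp(a))

def hard_tanh(a):
    return np.clip(a, -1.0, 1.0)

def abs_rectification(a):
    return np.abs(a)

def maxout(bs, Ws, x):        # max over several affine filters b_i + w_i . x
    return max(b + np.dot(w, x) for b, w in zip(bs, Ws))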
This is not an exhaustive list but covers most of the non-linearities and unit computa-
tions seen in the deep learning and neural nets literature. Many variants are possible.
As discussed in Section 6.3, the structure (also called architecture) of the family of
input-output functions can be varied in many ways, which calls for a generic principle
for efficiently computing gradients, described in that section. For example, a common
variation is to connect layers that are not adjacent, with so-called skip connections,
which are found in the visual cortex (for which the word “layer” should be replaced by
the word “area”). Other common variations depart from a full connectivity between
adjacent layers. For example, each unit at layer k may be connected to only a subset of
units at layer k 1. A particular case of such form of sparse connectivity is discussed
in chapter 11 with convolutional networks. In general, the set of connections between
units of the whole network only needs to form a directed acyclic graph in order to define
a meaningful computation (see the flow graph formalism below, Section 6.3). When
units of the network are connected to themselves through a cycle, one has to properly
define what computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 12.
6.2.2 Loss Function and Conditional Log-Likelihood
In the 80's and 90's the most commonly used loss function was the squared error L(f_θ(x), y) = ||f_θ(x) − y||². As discussed in Section 5.8.1, if f is unrestricted (non-parametric), minimizing the expected value of this loss function over some data-generating distribution P(X, Y) yields f(x) = E[Y|X], the true conditional expectation of Y given X³. See also Section 5.8.1 for more details. This tells us what the neural network is trying to learn. Replacing the squared error by an absolute value makes the neural network try to estimate not the conditional expectation but the conditional median.

³ Proving this is not difficult and is very instructive.
However, when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood⁴ have been found to be more appropriate than the squared error. In the case where y ∈ {0, 1} is binary this gives L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)), and it can be shown that the optimal (non-parametric) f minimizing this criterion is f(x) = P(Y = 1|x). Note that in order for the above expression of the criterion to make sense, f_θ(x) must be strictly between 0 and 1 (an undefined or infinite value would otherwise arise). To achieve this, it is common to use the sigmoid as non-linearity for the output layer, which matches well with the Binomial negative log-likelihood criterion⁵. As explained below (Section 6.2.2), the cross-entropy criterion allows gradients to pass through the output non-linearity even when the neural network produces a confidently wrong answer, unlike the squared error criterion coupled with a sigmoid or softmax non-linearity.
Learning a Conditional Probability Model
More generally, one can define a loss function as corresponding to a conditional log-likelihood, i.e., the negative log-likelihood (NLL) criterion

L(f_θ(x), y) = −log P_θ(Y|X).

See Section 5.8.2, which shows that this criterion yields an estimator of the true conditional probability of Y given X. For example, if Y is a continuous random variable and we assume that, given X, it has a Gaussian distribution with mean f_θ(X) and variance σ², then

−log P_θ(Y|X) = (1/(2σ²)) (f_θ(X) − Y)² + (1/2) log(2πσ²).
Up to an additive and multiplicative constant (which would give the same choice of θ),
this is therefore equivalent to the squared error loss. Once we understand this principle,
we can readily generalize it to other distributions, as appropriate. For example, it
is straightforward to generalize the univariate Gaussian to the multivariate case, and
under appropriate parametrization consider the variance to be a parameter or even
parametrized function of x (for example with output units that are guaranteed to be
positive, or forming a positive definite matrix).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion corresponds to the conditional log-likelihood associated with the Bernoulli distribution with probability p = f_θ(x) of generating Y = 1 given X = x (and probability 1 − p of generating Y = 0):

−log P_θ(Y|X) = −1_{Y=1} log p − 1_{Y=0} log(1 − p) = −Y log f_θ(X) − (1 − Y) log(1 − f_θ(X)).
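A small numpy sketch of this Bernoulli negative log-likelihood, computed from the pre-sigmoid score (logit) a so that the logarithms stay numerically stable; the logit-based rewriting log(1 + e^a) − y·a is an equivalent algebraic form, not a formula introduced in the text.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_nll(a, y):
    """-log P(Y = y | x) for p = sigmoid(a) and y in {0, 1}.
    Uses log(1 + e^a) = max(a, 0) + log(1 + e^-|a|) to avoid overflow."""
    return np.maximum(a, 0) - y * a + np.log1p(np.exp(-np.abs(a)))

# sanity check against the direct formula on a moderate value
a, y = 1.3, 1.0
p = sigmoid(a)
direct = -y * np.log(p) - (1 - y) * np.log(1 - p)
assert np.isclose(bernoulli_nll(a, y), direct)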
⁴ often called cross entropy in the literature, even though the term cross entropy should also apply to many other losses that can be viewed as negative log-likelihood, discussed below in more detail. TODO: where do we actually discuss this? may as well discuss it in maximum likelihood section
⁵ In reference to statistical models, this "match" between the loss function and the output non-linearity is similar to the choice of a link function in generalized linear models (McCullagh and Nelder, 1989).
Softmax
When Y is discrete (in some finite set, say {1, . . . , N}) but not binary, the Bernoulli distribution is extended to the Multinoulli (Murphy, 2012)⁶. This distribution is specified by a vector of N probabilities that sum to 1, each element of which provides the probability p_i = P(Y = i|X). We thus need a vector-valued output non-linearity that produces such a vector of probabilities, and a commonly used non-linearity for this purpose is the softmax non-linearity introduced earlier:

p = softmax(a),   with   p_i = e^{a_i} / Σ_j e^{a_j},     (6.1)

where typically a = b + Wh is the vector of scores whose elements a_i are associated with each category i. The corresponding loss function is therefore L(p, y) = −log p_y.
Note how minimizing this loss will push a_y up (increase the score a_y associated with the correct label y) while pushing a_i (for i ≠ y) down (decreasing the scores of the other labels, in the context x). The first effect comes from the numerator of the softmax while the second effect comes from the normalizing denominator. These forces cancel on a specific example only if p_y = 1, and they cancel on average over examples (say sharing the same x) if p_i equals the fraction of times that Y = i for this value of x. The same principles and the role of the normalization constant (or "partition function") can be seen at play in the training of Markov Random Fields, Boltzmann machines and RBMs, in Chapter 9. Note other interesting properties of the softmax. First of all, the gradient of the Multinoulli negative log-likelihood with respect to the softmax argument a is

∂(−log p_y)/∂a_i = (p_i − δ_{iy})     (6.2)
which does not saturate, i.e., the gradient does not vanish when the output probabilities approach 0 or 1, except when the model is providing the correct answer. On the other hand, other loss functions such as the squared error applied to sigmoid outputs (which was popular in the 80's and 90's) will have vanishing gradient when an output unit saturates, even if the output is completely wrong (Solla et al., 1988). Another property that could potentially be interesting is that the softmax output is invariant under additive changes of its input vector: softmax(a) = softmax(a + b) when b is a scalar added to all the elements of vector a. Finally, it is interesting to think of the softmax as a way to create a form of competition between the units (typically output units, but not necessarily) that participate in it: the strongest filter output a_i gets reinforced (because the exponential increases faster for larger values) and the other units get inhibited. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in cortex, and at the extreme (when the a_i's are large in magnitude) becomes a form of winner-take-all (one of the outputs is 1 and the others are 0). A more computationally expensive form of competition is found with sparse coding, described in Section 16.3.
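The following sketch checks Eq. 6.2 numerically: the gradient of the Multinoulli negative log-likelihood with respect to the scores a is simply the softmax output minus the one-hot target, and shifting all scores by a scalar leaves the softmax unchanged (which is also why implementations subtract the maximum score before exponentiating).

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))      # uses the additive invariance for numerical stability
    return e / e.sum()

def nll(a, y):
    return -np.log(softmax(a)[y])

rng = np.random.RandomState(0)
a, y = rng.randn(5), 2

# analytic gradient from Eq. 6.2: p_i - delta_{iy}
p = softmax(a)
grad = p.copy()
grad[y] -= 1.0

# finite-difference check of the same gradient
eps = 1e-6
num = np.zeros_like(a)
for i in range(len(a)):
    d = np.zeros_like(a)
    d[i] = eps
    num[i] = (nll(a + d, y) - nll(a - d, y)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
assert np.allclose(softmax(a), softmax(a + 3.7))   # softmax(a) = softmax(a + b)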
⁶ The Multinoulli is also known as the categorical or generalized Bernoulli distribution, which is a special case of the Multinomial distribution with one trial, and the term multinomial is often used even in this case.
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution P(Y|ω) with parameters ω, we can construct a conditional distribution P(Y|X) by making ω a parametrized function of X, and learning that function:

P(Y | ω = f_θ(X))

where f_θ(X) is the output of a predictor, X is its input, and Y can be thought of as a target. This established choice of words comes from the common cases of classification and regression, where f_θ(X) is really a prediction associated with Y, or with its expected value. However, in general ω = f_θ(X) may contain parameters of the distribution of Y other than its expected value. For example, it could contain its variance or covariance, in the case where Y is conditionally Gaussian. In the above examples, with the squared error loss, ω is the mean of the Gaussian which captures the conditional distribution of Y (which means that the variance is considered fixed, not a function of X). In the common classification case, ω contains the probabilities associated with the various events of interest.
Once we view things in this way, we automatically get as the natural cost function the negative log-likelihood L(X, Y) = −log P(Y | ω = f_θ(X)). Besides θ, there could be other parameters that control the distribution of Y. For example, we may wish to learn the variance of a conditional Gaussian, even when the latter is not a function of X. In this case, the solution can be computed analytically because the maximum likelihood estimator of the variance is simply the empirical mean of the squared difference between the observations Y and their expected value (here f_θ(X)). In other words, the conditional variance can be estimated from the mean squared error.

Besides the Gaussian, a simple and common example is the case where Y is binary (i.e. Bernoulli distributed), where it is enough to specify ω = P(Y = 1|X). In the Multinoulli case, ω is generally specified by a vector of probabilities summing to 1, e.g., via the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural networks is the mixture model, and in particular the Gaussian mixture model, introduced in Section 3.10.5. Neural networks that compute the parameters of a mixture model were introduced in Jacobs et al. (1991); Bishop (1994). In the case of the Gaussian mixture model with N components,

P(y|x) = Σ_{i=1}^{N} P(C = i|x) N(y | µ_i(x), Σ_i(x)),

the neural network must have three kinds of outputs, P(C = i|x), µ_i(x), and Σ_i(x), which must satisfy different constraints:

1. Mixture components P(C = i|x): these form a Multinoulli distribution over the N different components C, and can typically be obtained by a softmax over an N-vector, to guarantee that these outputs are positive and sum to 1.

2. Means µ_i(x): these indicate the center or mean associated with each Gaussian component, and are unconstrained (typically with no non-linearity at all for these output units). If Y is a d-vector, then the network must output an N × d matrix containing all these N d-vectors.

3. Covariances Σ_i(x): these specify the covariance matrix for each component i. In many models the variance is unconditional (does not depend on x) and diagonal or even scalar. If it is diagonal or scalar, only positivity must be enforced (e.g., using the softplus non-linearity). If it is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ_i(x) = B_i(x) B_i(x)ᵀ, where B_i is an unconstrained square matrix. One practical issue if the matrix is full is that computing the likelihood is expensive, requiring O(d³) computation for the determinant and inverse of Σ_i(x) (or equivalently, and more commonly done, its eigendecomposition or that of B_i(x)).
It has been reported that gradient-based optimization of conditional Gaussian mixtures
(on the output of neural networks) can be finicky, in part because one gets divisions
(by the variance) which can be numerically unstable (when some variance gets to be
small for a particular example, yielding very large gradients). One solution is to clip
gradients (see Section ?? and Mikolov (2012); Pascanu and Bengio (2012); Graves (2013);
Pascanu et al. (2013a)), while another is to scale the gradients heuristically (Murray and
Larochelle, 2014).
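A minimal sketch of how a network's output vector could be split into the three kinds of quantities above, assuming diagonal covariances: a softmax for the mixture weights, unconstrained means, and a softplus to keep the variances positive. The layout of the raw output vector is an illustrative assumption, not a layout prescribed by the text.

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(a))

def mixture_params(raw, N, d):
    """Split a raw output vector of size N + N*d + N*d into
    (mixture weights, means, diagonal variances) for N components over a d-vector y."""
    logits = raw[:N]
    means = raw[N:N + N * d].reshape(N, d)
    var = softplus(raw[N + N * d:]).reshape(N, d)   # positivity via softplus
    return softmax(logits), means, var

def mixture_nll(y, weights, means, var):
    """-log sum_i P(C = i|x) N(y | mu_i(x), diag(var_i(x))), computed via log-sum-exp."""
    log_comp = (-0.5 * np.sum((y - means) ** 2 / var, axis=1)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))
    s = log_comp + np.log(weights)
    m = np.max(s)
    return -(m + np.log(np.sum(np.exp(s - m))))

# toy usage: N = 3 components over a 2-dimensional y
rng = np.random.RandomState(0)
N, d = 3, 2
raw = rng.randn(N + 2 * N * d)            # stands in for the network's output layer
w, mu, v = mixture_params(raw, N, d)
print(mixture_nll(rng.randn(d), w, mu, v))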
Multiple Output Variables
When Y is actually a tuple formed by multiple random variables Y_1, Y_2, . . . , Y_k, then one has to choose an appropriate form for their joint distribution, conditional on X = x. The simplest and most common choice is to assume that the Y_i are conditionally independent, i.e.,

P(Y_1, Y_2, . . . , Y_k | x) = Π_{i=1}^{k} P(Y_i | x).
This brings us back to the single variable case, especially since the log-likelihood now decomposes into a sum of terms log P(Y_i|x). If each P(Y_i|x) is separately parametrized (e.g. a different neural network), then we can train these neural networks independently. However, a more common and powerful choice assumes that the different variables Y_i share some common factors that can be represented in some hidden layer of the network (such as the top hidden layer). See Sections 6.5 and 7.11 for a deeper treatment of the notion of underlying factors of variation and multi-task training.
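As a small illustration of this shared-factors idea, a single hidden layer can feed several output heads whose Bernoulli negative log-likelihoods are simply summed, here for two conditionally independent binary targets; the sizes and the choice of sigmoid output units are assumptions made for the sketch.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
n_i, n_h = 6, 12
W, c = 0.1 * rng.randn(n_h, n_i), np.zeros(n_h)           # shared hidden layer
heads = [(0.1 * rng.randn(n_h), 0.0) for _ in range(2)]   # one (v, b) pair per target Y_i

def loss(x, ys):
    """Sum of per-target Bernoulli NLLs, all reading the same hidden layer h."""
    h = np.tanh(c + W.dot(x))
    total = 0.0
    for (v, b), y in zip(heads, ys):
        p = sigmoid(b + v.dot(h))
        total += -y * np.log(p) - (1 - y) * np.log(1 - p)
    return total

print(loss(rng.randn(n_i), [1.0, 0.0]))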
If the conditional independence assumption is considered too strong, what can we do? At this point it is useful to step back and consider everything we know about learning a joint probability distribution. Any probability distribution P_ω(Y) parametrized by parameters ω can be turned into a conditional distribution P_θ(Y|x) by making ω a function ω = f_θ(x) parametrized by θ. This is not only true of the simple parametric distributions we have seen above (Gaussian, Bernoulli, Multinoulli) but also of more complex joint distributions typically represented by a graphical model. Much of this book is devoted to such graphical models (see Chapters 9, 15, 16, 17). For more concrete examples of such structured output models with deep learning, see Section 20.4.
6.2.3 Training Criterion and Regularizer
The loss function (often interpretable as a negative log-likelihood) tells us what we would like the learner to capture. Maximizing the conditional log-likelihood over the true distribution, i.e., minimizing the expected loss E_{X,Y}[−log P_θ(Y|X)], makes P_θ(Y|X) estimate the true P(Y|X) associated with the unknown data generating distribution, within the boundaries of what the chosen family of functions allows. See Section 5.8.2 for a proof. In practice we cannot minimize this expectation because we do not know P(X, Y) and because computing and minimizing this integral exactly would generally be intractable. Instead we are going to approximately minimize a training criterion C(θ) based on the empirical average of the loss (over the training set). The simplest such criterion is the average training loss (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i), where the training set is a set of n examples (x_i, y_i). However, better results can often be achieved by crippling the learner and preventing it from simply finding the best θ that minimizes the average training loss. This means that we combine the evidence coming from the data (the training set average loss) with some a priori preference on the different values that θ can take (the regularizer). If these concepts (and the related concepts of generalization, overfitting and underfitting) are not clear, please return to Sections 5.3 and 5.6.2 for a refresher.
The most common regularizer is simply an additive term equal to a regularization coefficient λ times the squared norm of the parameters, ||θ||²₂. This regularizer is often called the weight decay or L2 weight decay or shrinkage, because adding the squared L2 norm to the training criterion pushes the weights towards zero, in proportion to their magnitude. For example, when using stochastic gradient descent, each step of the update with regularizer term λ||θ||² would look like

θ ← θ − ε (∂L(f_θ(x), y)/∂θ + λθ)

where ε is the learning rate (see Section 4.3). This can be rewritten

θ ← θ(1 − ελ) − ε ∂L(f_θ(x), y)/∂θ

where we see that the first term systematically shaves off a small proportion of θ before adding the gradient.
Another commonly used regularizer is the so-called L1 regularizer, which adds a term proportional to the L1 (absolute value) norm, ||θ||₁ = Σ_i |θ_i|. The L1 regularizer also prefers values that are small, but whereas the L2 weight decay has only a weak preference for 0 rather than small values, the L1 regularizer continues pushing the parameters towards 0 with a constant gradient even when they get very small. As a consequence, it tends to bring some of the parameters to exactly 0 (how many depends on how large the regularization coefficient λ is chosen). When optimizing with an approximate, iterative and noisy method such as stochastic gradient descent, no actual 0 is obtained. On the other hand, the L1 regularizer tolerates large values of some parameters (only additively removing a small quantity compared to the situation without regularization), whereas the L2 weight decay aggressively punishes and prevents large parameter values. The L1 and L2 regularizers can be combined, as in the so-called elastic net (Zou and Hastie, 2005), and this is commonly done in deep networks⁷.

⁷ See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.html#l1-and-l2-regularization
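The difference between the two penalties is easy to see in the update step itself. Below is a sketch of one stochastic gradient step with L2 weight decay (multiplicative shrinkage of θ) versus L1 regularization (a constant-magnitude push toward zero through the sign of θ), given some gradient grad of the unregularized loss; the learning rate and coefficients are arbitrary illustrative values.

import numpy as np

def sgd_step_l2(theta, grad, eps=0.01, lam=1e-3):
    # theta <- theta (1 - eps*lam) - eps * dL/dtheta
    return theta * (1 - eps * lam) - eps * grad

def sgd_step_l1(theta, grad, eps=0.01, lam=1e-3):
    # the L1 term contributes a constant-magnitude (sub)gradient lam * sign(theta)
    return theta - eps * (grad + lam * np.sign(theta))

theta = np.array([2.0, -0.001, 0.5])
grad = np.zeros_like(theta)          # pretend the data gradient is zero for illustration
print(sgd_step_l2(theta, grad))      # every entry shrinks in proportion to its magnitude
print(sgd_step_l1(theta, grad))      # every non-zero entry is pushed by the same amount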
Note how in the context of maximum likelihood, regularizers can generally be in-
terpreted as Bayesian priors on the parameters P(θ) or on the learned function, as
discussed in Sections 5.6.2 and 5.7. Later in this book we discuss regularizers that are
data-dependent (i.e., cannot be expressed easily as a pure prior), such as the contractive
penalty (Section 10.1) as well as regularization methods that are difficult to interpret
simply as added terms in the training criterion, such as dropout (Section 7.10).
6.2.4 Optimization Procedure
Once a training criterion is defined, we enter the realm of iterative optimization proce-
dures, with the slight twist that if possible we want to monitor held-out performance
(for example the average loss on a validation set, which usually does not include the
regularizing terms) and not just the value of the training criterion. Monitoring vali-
dation set performance allows one to stop training at a point where generalization is
estimated to be the best among the training iterations. This is called early stopping and
is a standard machine learning technique, discussed in Sections 7.7 and 8.1.3.
Section 4.3 has already covered the basics of gradient-based optimization. The sim-
plest such technique is stochastic gradient descent (with minibatches), in which the
parameters are updated after computing the gradient of the average loss over a mini-
batch of examples (e.g. 128 examples) and making an update in the direction opposite
to that gradient (or opposite to some accumulated average of such gradients, i.e., the
momentum technique, reviewed in Section 19.5.3). The most commonly used optimiza-
tion procedures for multi-layer neural networks and deep learning in general are either
variants of stochastic gradient descent (typically with minibatches), second-order meth-
ods (the most commonly used being LBFGS and nonlinear conjugate gradients) applied
on large minibatches (e.g. of size 10000) or on the whole training set at once, or nat-
ural gradient methods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu
and Bengio, 2013). Exact second-order and natural gradient methods are computation-
ally too expensive for large neural networks because they involve matrices of dimension
equal to the square of the number of parameters. Approximate methods are discussed
in Section 8.8.1.
On smaller datasets or when computing can be parallelized, second-order methods
have a computational advantage because they are easy to parallelize and can still perform
many updates per unit of time. On larger datasets (and in the limit of an infinite dataset,
i.e., a stream of training examples) one cannot afford to wait for seeing the whole dataset
before making an update, which favors either stochastic gradient descent variants
(possibly with minibatches to take advantage of fast or parallelized implementations of
matrix-matrix products) or second-order methods with minibatches.
A more detailed discussion of issues arising with optimization methods in deep learn-
ing can be found in Chapter 8. In particular, many design choices in the construction
of the family of functions, loss function and regularizer can have a major impact on
the difficulty of optimizing the training criterion. Furthermore, instead of using generic
optimization techniques, one can design optimization procedures that are specific to
the learning problem and chosen architecture of the family of functions, for example by
initializing the parameters of the final optimization routine from the result of a differ-
ent optimization (or a series of optimizations, each initialized from the previous one).
Because of the non-convexity of the training criterion, the initial conditions can make
a very important difference, and this is what is exploited in the various pre-training
strategies, Sections 8.9.3 and 10.4, as well as with curriculum learning (Bengio et al.,
2009), Section 8.10.
6.3 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning algo-
rithm for multi-layer neural networks. Actually it just means the method for computing
gradients in such networks. Furthermore, it is generally understood as something very
specific to multi-layer neural networks, but once its derivation is understood, it can eas-
ily be generalized to arbitrary functions (for which computing a gradient is meaningful),
and we describe this generalization here, focusing on the case of interest in machine
learning where the output of the function to differentiate (e.g., the loss L or the train-
ing criterion C) is a scalar and we are interested in its derivative with respect to a set
of parameters (considered to be the elements of a vector θ). It can be readily proven
that the back-propagation algorithm has optimal computational complexity in the sense
that there is no algorithm that can compute the gradient faster (in the O(·) sense, i.e., up to an additive and multiplicative constant).
6.3.1 Chain Rule
In order to apply the back-propagation algorithm, we take advantage of the chain rule:
∂C(g(θ))/∂θ = ∂C(g(θ))/∂g(θ) · ∂g(θ)/∂θ     (6.3)
which also works when C, g or θ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in θ will propagate into a small change in g(θ) by getting multiplied by ∂g(θ)/∂θ. Similarly, a small change in g(θ) will propagate into a small change in C(g(θ)) by getting multiplied by ∂C(g(θ))/∂g(θ). Hence a small change in θ first gets multiplied by ∂g(θ)/∂θ to obtain the change in g(θ), and this then gets multiplied by ∂C(g(θ))/∂g(θ) to obtain the change in C(g(θ)). Hence the ratio of the change in C(g(θ)) to the change in θ is the product of these partial derivatives. The partial derivative measures the locally linear influence of a variable on another. This is illustrated in Figure 6.2, with x = θ, y = g(θ), and z = C(g(θ)).
Figure 6.2: The chain rule, illustrated in the simplest possible case, with z a scalar function of y, which is itself a scalar function of x. A small change ∆x in x gets turned into a small change ∆y in y through the partial derivative ∂y/∂x, from the first-order Taylor approximation of y(x), and similarly for z(y). Plugging the equation for ∆y into the equation for ∆z yields the chain rule.
Now, if g is a vector, we can rewrite the above as follows:

∂C(g(θ))/∂θ = Σ_i ∂C(g(θ))/∂g_i(θ) · ∂g_i(θ)/∂θ

which sums over the influences of θ on C(g(θ)) through all the intermediate variables g_i(θ). This is illustrated with a simple case of two paths in Figure 6.3, with x = θ, y_i = g_i(θ), and z = C(g(θ)).
Figure 6.3: Top: The chain rule, when there are two intermediate variables y_1 and y_2 between x and z, creating two paths for changes in x to propagate and yield changes in z. Bottom: more general case, with n intermediate variables y_1 to y_n.
6.3.2 Back-Propagation in a General Flow Graph
More generally, we can think about decomposing a function C(θ) into a more complicated
graph of computations. This graph is called a flow graph. Each node u_i of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values u_j of other nodes, with j < i. The nodes satisfy a partial order which dictates in what order the computation can proceed. In practical implementations of such functions (e.g. with the criterion C(θ) or its value estimated on a minibatch), the final computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpy library can perform on arrays of numbers).
We will define the back-propagation in a general flow-graph, using the following
generic notation: u_i = f_i(a_i), where a_i is the list of arguments for the application of f_i to the values u_j for the parents of i in the graph: a_i = (u_j)_{j ∈ parents(i)}. This is illustrated in Figure 6.4.
Figure 6.4: Illustration of recursive forward computation, where at each node u_i we compute a value u_i = f_i(a_i), with a_i being the list of values from the parents u_j of node u_i. Following Algorithm 6.1, the overall inputs to the graph are u_1 . . . u_M (e.g., the parameters we may want to tune during training), and there is a single scalar output u_N (e.g., the loss which we want to minimize).
The overall computation of the function represented by the flow graph can thus be
summarized by the forward computation algorithm, Algorithm 6.1.
In addition to having some code that tells us how to compute f_i(a_i) for some values in the vector a_i, we also need some code that tells us how to compute its partial derivatives, ∂f_i(a_i)/∂a_{ik}, with respect to any immediate argument a_{ik}. Let k = π(i, j) denote the index of u_j in the list a_i. Note that u_j could influence u_i through multiple paths. Whereas ∂u_i/∂u_j would denote the total gradient adding up all of these influences, ∂f_i(a_i)/∂a_{ik} only denotes the derivative of f_i with respect to its specific k-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from u_j to u_i. In general, when manipulating partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument on the function output, with the other arguments considered fixed. For example consider f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}} and f_2(a_{2,1}) = a_{2,1}², while u_3 = f_3(u_2, u_1)
Algorithm 6.1 Flow graph forward computation. Each node computes numerical value u_i by applying a function f_i to its argument list a_i that comprises the values of previous nodes u_j, j < i, with j ∈ parents(i). The input to the flow graph is the vector x, and is set into the first M nodes u_1 to u_M. The output of the flow graph is read off the last (output) node u_N.

for i = 1 . . . M do
  u_i ← x_i
end for
for i = M + 1 . . . N do
  a_i ← (u_j)_{j ∈ parents(i)}
  u_i ← f_i(a_i)
end for
return u_N
and u_2 = f_2(u_1), illustrated in Figure 6.5. The direct derivative of f_3 with respect to its argument a_{3,2} is ∂f_3/∂a_{3,2} = e^{a_{3,1} + a_{3,2}}, while if we consider the variables u_3 and u_1 to which these correspond, there are two paths from u_1 to u_3, and we obtain as derivative the sum of partial derivatives over these two paths, ∂u_3/∂u_1 = e^{u_1 + u_2}(1 + 2u_1). The results are different because ∂u_3/∂u_1 involves not just the direct dependency of u_3 on u_1 but also the indirect dependency through u_2.
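A tiny numeric check of this example: differentiating u_3 = e^{u_1 + u_2} while treating u_2 = u_1² as a function of u_1 gives the total derivative e^{u_1 + u_2}(1 + 2u_1), which a finite-difference estimate confirms, whereas the direct partial derivative e^{u_1 + u_2} (with u_2 held fixed) does not match.

import numpy as np

def u3_of_u1(u1):
    u2 = u1 ** 2               # u2 = f2(u1)
    return np.exp(u1 + u2)     # u3 = f3(u2, u1)

u1 = 0.3
u2 = u1 ** 2
direct = np.exp(u1 + u2)                   # partial of f3 w.r.t. a_{3,2}, the other argument held fixed
total = np.exp(u1 + u2) * (1 + 2 * u1)     # sum of contributions over both paths (chain rule)

eps = 1e-6
numeric = (u3_of_u1(u1 + eps) - u3_of_u1(u1 - eps)) / (2 * eps)
assert np.isclose(total, numeric)          # the total derivative matches finite differences
assert not np.isclose(direct, numeric)     # the direct partial alone does not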
Figure 6.5: Illustration of the indirect effect and direct effect of variable u_1 on variable u_3 in a flow graph, which means that the derivative of u_3 with respect to u_1 must include the sum of two terms, one for the direct effect (derivative of u_3 with respect to its first argument) and one for the indirect effect through u_2 (involving the product of the derivative of u_3 with respect to u_2 times the derivative of u_2 with respect to u_1). Forward computation of the u_i's (as in Figure 6.4) is indicated with upward full arrows, while backward computation (of derivatives with respect to the u_i's, as in Figure 6.6) is indicated with downward dashed arrows.
Figure 6.6: Illustration of recursive backward computation, where we associate to each node j not just the values u_j computed in the forward pass (Figure 6.4, bold upward arrows) but also the gradient ∂u_N/∂u_j with respect to the output scalar node u_N. These gradients are recursively computed in exactly the opposite order, as described in Algorithm 6.2, by using the already computed ∂u_N/∂u_i of the children i of j (dashed downward arrows).
Armed with this understanding, we can define the back-propagation algorithm as
follows, in Algorithm 6.2, which would be computed after the forward propagation
(Algorithm 6.1) has been performed. Note the recursive nature of the application of the
chain rule, in Algorithm 6.2: we compute the gradient on node j by re-using the already
computed gradient for children nodes i, starting the recurrence from the trivial ∂u_N/∂u_N = 1 that sets the gradient for the output node. This is illustrated in Figure 6.6.
Algorithm 6.2 Back-propagation computation of a flow graph (full, upward arrows), which itself produces an additional flow graph (dashed, backward arrows). See the forward propagation in a flow-graph (Algorithm 6.1, to be performed first) and the required data structure. In addition, a quantity ∂u_N/∂u_i needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation π(i, j) is the index of u_j as an argument to f_i. The back-propagation algorithm efficiently computes ∂u_N/∂u_i for all i's (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node u_N with respect to the "inputs" u_1 . . . u_M (which could be the parameters, in a learning setup). The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

∂u_N/∂u_N ← 1
for j = N − 1 down to 1 do
  ∂u_N/∂u_j ← Σ_{i: j ∈ parents(i)} (∂u_N/∂u_i) (∂f_i(a_i)/∂a_{i,π(i,j)})
end for
return ∂u_N/∂u_i for i = 1 . . . M
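Below is a small Python sketch of Algorithms 6.1 and 6.2 for scalar-valued nodes: each node stores its function, its parents, and the partial derivative of its function with respect to each argument, and gradients are accumulated in reverse topological order. This Node representation is an illustrative choice for the sketch, not an implementation taken from the text.

import math

class Node:
    def __init__(self, f=None, df=None, parents=()):
        self.f = f              # function of the list of parent values (None for input nodes)
        self.df = df            # df(args, k): partial derivative w.r.t. the k-th argument
        self.parents = parents
        self.value = None
        self.grad = 0.0         # will hold d u_N / d u_i after backward()

def forward(nodes, inputs):
    """Algorithm 6.1: the inputs fill the first M nodes, the rest are computed in order."""
    for node, x in zip(nodes, inputs):
        node.value = x
    for node in nodes[len(inputs):]:
        node.value = node.f([p.value for p in node.parents])
    return nodes[-1].value

def backward(nodes):
    """Algorithm 6.2: traverse in reverse order, accumulating gradients into the parents."""
    for node in nodes:
        node.grad = 0.0
    nodes[-1].grad = 1.0
    for node in reversed(nodes):
        if node.f is None:
            continue
        args = [p.value for p in node.parents]
        for k, parent in enumerate(node.parents):
            parent.grad += node.grad * node.df(args, k)

# the example of Figure 6.5: u2 = u1^2 and u3 = exp(u1 + u2)
u1 = Node()
u2 = Node(f=lambda a: a[0] ** 2, df=lambda a, k: 2 * a[0], parents=(u1,))
u3 = Node(f=lambda a: math.exp(a[0] + a[1]),
          df=lambda a, k: math.exp(a[0] + a[1]), parents=(u2, u1))
nodes = [u1, u2, u3]
forward(nodes, [0.3])
backward(nodes)
print(u1.grad)    # equals e^{u1 + u2} (1 + 2 u1), as in the example above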
This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming. Indeed, the derivative of the output node with respect to any node can also be written down in this intractable form:

∂u_N/∂u_i = Σ_{paths (u_{k_1}, . . . , u_{k_n}): k_1 = i, k_n = N} Π_{j=2}^{n} ∂u_{k_j}/∂u_{k_{j−1}}
where the paths (u_{k_1}, . . . , u_{k_n}) go from the node k_1 = i to the final node k_n = N in the flow graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because, like dynamic programming, it re-uses partial sums associated with the gradients on intermediate nodes.
Although the above was stated as if the u_i's were scalars, exactly the same procedure can be run with the u_i's being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients ∂u_N/∂u_i with a Jacobian of partial derivatives associated with the j → i arc of the graph, ∂f_i(a_i)/∂a_{i,π(i,j)}. In the case where minibatches are used during training, u_i would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the former can be computed much more efficiently than a sequence of matrix-vector products (e.g. with the BLAS library), especially so on modern computers and GPUs, which rely more and more on parallelization through many cores.
More implementation issues regarding the back-propagation algorithm are discussed
in Chapters 18 and 19, regarding respectively GPU implementations (Section 18.2)
and debugging tricks (Section 19.5.1). Section 19.5.2 also discusses a natural general-
ization of the back-propagation algorithm in which one manipulates not numbers but
symbolic expressions, i.e., turning a program that performs a computation (decomposed
as a flow graph expressed in the forward propagation algorithm) into another program
(another flow graph) that performs gradient computation (i.e. automatically generat-
ing the back-propagation program, given the forward propagation graph). Using such
symbolic automatic differentiation procedures (implemented for example in the Theano library⁸), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.

⁸ http://deeplearning.net/software/theano/
6.4 Universal Approximation Properties and Depth
Without a hidden layer, and for convex loss functions and regularizers (which is typically the case), we obtain a convex training criterion. We end up with linear regression, logistic regression and other log-linear models that are used in many applications of machine learning. This is appealing (no local minima, no saddle points) and convenient (no strong
dependency on initial conditions) but comes with a major disadvantage: such learners
are very limited in their ability to represent more complex functions. In the case of
classification tasks, we are limited to linear decision surfaces. In practice, this limitation
can be somewhat circumvented by handcrafting a large enough or discriminative enough set of hardwired features extracted from the raw input, on which the linear predictor can be easily trained. See the next section for a longer discussion about feature learning.
To make the family of functions rich enough, the other option is to introduce one
or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi, 1994)
that with a single hidden layer of a sufficient size and a reasonable choice of non-
linearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent
any smooth function to some desired accuracy (the greater the required accuracy, the
more hidden units are required). However, these theorems generally do not tell us
how many hidden units will be required to achieve a given accuracy for particular data
distributions: in fact the worst case scenario is generally going to require an exponential number of hidden units (to basically record every input configuration that needs to be distinguished). It is easier to understand how bad things can be by considering the case of binary input vectors: the number of possible binary functions on vectors v ∈ {0, 1}^d is 2^{2^d}, and selecting one such function requires 2^d bits, which will in general require O(2^d) degrees of freedom.
However, machine learning is not about learning equally easily any possible func-
tion: we mostly care about the kinds of functions that are needed to represent the
world around us or the task of interest. The no-free-lunch theorems for machine
learning (Wolpert, 1996) essentially say that no learning algorithm is “universal”, i.e.,
dominates all the others against all possible ground truths (data generating distribution
or target function). However, for a given family of tasks, some learning algorithms are
definitely better. Choosing a learning algorithm is equivalent to choosing a prior, i.e.,
making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only happen if there
is nothing to be learned any more because we already know the target function. Having
a universal approximation property just means that this prior is broad enough to cover
or come close to any possible target function, and this is a useful property, but in itself
clearly not sufficient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
One of the central priors that is exploited in deep learning is that the target function
to be learned can be efficiently represented as a deep composition of simpler functions
(“features”), where features at one level can be re-used to define many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a flow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can define the operations performed
in each node in different ways. For example, do we consider a node that computes
the affine operations of a neural net followed by a node that computes the non-linear
neuron activation, or do we consider both of these operations as one node or one level
of the graph? Hence the notion of depth really depends on the allowed operations at
each node and one flow graph of a given depth can be converted into an equivalent
flow graph of a different depth by redefining which operations can be performed at each
node. However, for neural networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012a; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al.,
2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior and that
in order to take advantage of it, the learner's family of functions needs to allow multiple
levels of representation.
6.5 Feature / Representation Learning
Let us consider again the single layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves a
convex optimization problem⁹, i.e., with some convergence guarantees towards a global
optimum, irrespective of initial conditions. Simple and well-understood optimization
algorithms are available in this case. However, this limits the representational capacity
too much and many tasks, for a given choice of input representation x (the raw input
features), cannot be solved by using only a linear predictor. What are our options to
avoid that limitation:
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Scolkopf
et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of much
higher dimension. In this case, f_θ(x) = b + w · φ(x) can be linear in the parameters
(and in φ(x)) and optimization remains convex (or even analytic). Thanks to the
kernel trick, we can computationally handle a high-dimensional φ(x) (or even an
infinite-dimensional one) so long as the kernel K(u, v) = φ(u) · φ(v) (where · is
the appropriate dot product for the space of φ(·)) can be computed efficiently.
If φ(x) is of high enough dimension, we can always have enough capacity to fit
the training set, but generalization is not at all guaranteed: it will depend on
the appropriateness of the choice of φ as a feature space for our task. Kernel
machines theory clearly identifies the choice of φ to the choice of a prior. This
leads to kernel engineering, which is equivalent to feature engineering, discussed
next. The other type of kernel (that is very commonly used) embodies a very broad
prior, such as smoothness, e.g., the Gaussian (or RBF) kernel K(u, v) = e^{−||u−v||²}.
⁹ or even one for which an analytic solution can be computed, as with linear regression or the case of some Gaussian process regression models
Unfortunately, this prior may be insufficient, i.e., too broad and sensitive to the
curse of dimensionality, as introduced in Section 5.11 and developed in more detail
in Chapter 14.
2. Another option is to manually engineer the representation or features φ(x). Most
industrial applications of machine learning rely on hand-crafted features, and most of the research and development effort (as well as a very large fraction of the scientific literature in machine learning and its applications) goes into designing new features that are most appropriate to the task at hand. Clearly, faced with a problem to solve and some prior knowledge in the form of representations that are believed to be relevant, the prior knowledge can be very useful. This approach is
therefore common in practice, but is not completely satisfying because it involves
a very task-specific effort and a laborious never-ending effort to improve systems
by designing better features. If there were some more general feature learning
approaches that could be applied to a large set of related tasks (such as those
involved in AI), we would certainly like to take advantage of them. Since humans
seem to be able to learn a lot of new tasks (for which they were not programmed
by evolution), it seems that such broad priors do exist. This whole question is
discussed in a lot more detail in Bengio and LeCun (2007b), and motivates the
third option.
3. The third option is to learn the features, or learn the representation. In a sense,
it allows one to interpolate between the almost agnostic approach of a kernel machine with a general-purpose smoothness kernel (such as RBF SVMs and other non-parametric statistical models) and full designer-provided knowledge in the form of a fixed representation that is perfectly tailored to the task. This is equiva-
lent to the idea of learning the kernel, except that whereas most kernel learning
methods only allow very few degrees of freedom in the learned kernel, representa-
tion learning methods such as those discussed in this book (including multi-layer
neural networks) allow the feature function φ(·) to be very rich (with a num-
ber of parameters that can be in the millions or more, depending on the amount
of data available). This is equivalent to learning the hidden layers, in the case
of a multi-layer neural network. Besides smoothness (which comes for example
from regularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed above
(Section 6.4). Other priors are discussed in Chapter 14.
This whole discussion is clearly not specific to neural networks and supervised learn-
ing, and is one of the central motivations for this book.
6.6 Historical Notes
TODO describe ReLUs and maxout, include reference to dropout section