Chapter 6
Feedforward Deep Networks
6.1 Formalizing and Generalizing Neural Networks
Feedforward supervised neural networks were among the first and most successful learn-
ing algorithms (Rumelhart et al., 1986c,b). They are also called deep networks, multi-layer Perceptrons (MLPs), or simply neural networks. In addition to covering the basics of
such networks, this chapter introduces a general formalism for gradient-based optimiza-
tion of parametrized families of functions, often in the context of conditional maximum
likelihood criteria (Section 6.2).
They bring together a number of important machine learning concepts already in-
troduced in the previous chapters:
• Define a parametrized family of functions f_θ describing how the learner will behave on new examples, i.e., what output f_θ(x) it will produce given some input x. Training consists in choosing the parameter θ (usually represented by a vector) given some training examples (x, y) sampled from an unknown data generating distribution P(X, Y).
• Define a loss function L describing what scalar loss L(ŷ, y) is associated with each supervised example (x, y), as a function of the learner's output ŷ = f_θ(x) and the target output y.
• Define a training criterion and a regularizer. The objective of training is ideally to minimize the expected loss E_{X,Y}[L(f_θ(X), Y)] over X, Y sampled from the unknown data generating distribution P(X, Y). However, this is not possible because the expectation makes use of the true underlying P(X, Y), but we only have access to a finite number of training examples, i.e., of pairs (X, Y). Instead, one defines a training criterion which usually includes an empirical average of the loss over the training set, plus some additional terms (called regularizers) which enforce some preferences over the choices of θ.
• Define an optimization procedure to approximately minimize the training criterion¹. The simplest such optimization procedure is stochastic gradient descent, described in Section 4.3.
Example 6.1.1 illustrates these concepts for the case of a vanilla neural network for
regression.
In chapter 10 we consider generalizations of the above framework to the unsupervised and semi-supervised cases, where Y may not exist or may not always be present. An unsupervised loss can then be defined solely as a function of the input x and some function f_θ(x) that one wishes to learn.
Figure 6.1: Vanilla MLP, with one sigmoid hidden layer, computing the vector of hidden unit values h = sigmoid(c + Wx) with weight matrix W and bias (offset) vector c. The output vector is obtained via another learned affine transformation ŷ = b + V h, with weight matrix V and output bias (offset) vector b.
¹ It is generally not possible to analytically obtain a global minimum of the training criterion, and iterative numerical optimization methods are used instead.
Example 6.1.1. Vanilla Multi-Layer Neural Network for Regression
Based on the above definitions, we could pick the family of input-output functions to be

f_θ(x) = b + V sigmoid(c + Wx),

illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + e^{−a}) is applied element-wise, the input is the vector x ∈ ℝ^{n_i}, the hidden layer outputs are the elements of the vector h = sigmoid(c + Wx) with n_h entries, and the parameters are θ = (b, c, V, W), with b ∈ ℝ^{n_o} a vector of the same dimension as the output (n_o), c ∈ ℝ^{n_h} of the same dimension as h (the number of hidden units), and V ∈ ℝ^{n_o × n_h} and W ∈ ℝ^{n_h × n_i} being weight matrices.
The loss function for this classical example could be the squared error L(ŷ, y) = ||ŷ − y||² (see Section 5.8.1 showing that it makes ŷ an estimator of E[Y|x]). The regularizer could be the ordinary L2 weight decay λ||ω||² = λ(Σ_{ij} W_{ij}² + Σ_{ki} V_{ki}²), where we define the set of weights ω as the concatenation of the elements of matrices W and V. The L2 weight decay thus penalizes the squared norm of the weights, with λ a scalar that is chosen larger to penalize more strongly and yield smaller weights. Combining the loss function and the regularizer gives the training criterion, which is the objective function during training:

C(θ) = λ||ω||² + (1/n) Σ_{i=1}^{n} ||y_i − (b + V sigmoid(c + Wx_i))||².
Finally, the classical training procedure in this example is stochastic gradient descent, which iteratively updates θ according to

ω ← ω − ε (2λω + ∂L(f_θ(x_i), y_i)/∂ω)
β ← β − ε ∂L(f_θ(x_i), y_i)/∂β,

where ε is the learning rate (see Section 4.3), β = (b, c) contains the offset (bias) parameters, and i is incremented after each training example is shown, modulo n.
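To make Example 6.1.1 concrete, here is a minimal numpy sketch of this vanilla MLP for regression, with the squared error loss, the L2 weight decay λ||ω||², and the stochastic gradient updates written above. The layer sizes, initialization scale, learning rate and decay coefficient are arbitrary illustrative choices, not values prescribed by the text.

import numpy as np

rng = np.random.RandomState(0)
n_i, n_h, n_o = 5, 10, 2             # input, hidden and output sizes (arbitrary)
W = 0.1 * rng.randn(n_h, n_i)        # hidden weights
c = np.zeros(n_h)                    # hidden biases
V = 0.1 * rng.randn(n_o, n_h)        # output weights
b = np.zeros(n_o)                    # output biases
lam, eps = 1e-4, 0.01                # weight decay coefficient and learning rate

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(x, y):
    """One stochastic gradient step on example (x, y)."""
    h = sigmoid(c + W.dot(x))                 # hidden layer
    y_hat = b + V.dot(h)                      # network output f_theta(x)
    g_yhat = 2.0 * (y_hat - y)                # gradient of ||y_hat - y||^2 w.r.t. y_hat
    g_V, g_b = np.outer(g_yhat, h), g_yhat
    g_a = V.T.dot(g_yhat) * h * (1.0 - h)     # back through the sigmoid
    g_W, g_c = np.outer(g_a, x), g_a
    V -= eps * (2 * lam * V + g_V)            # weights also feel the decay term
    W -= eps * (2 * lam * W + g_W)
    b -= eps * g_b                            # offsets are not decayed
    c -= eps * g_c

# toy usage: a few passes over random data
X, Y = rng.randn(100, n_i), rng.randn(100, n_o)
for epoch in range(5):
    for x, y in zip(X, Y):
        sgd_step(x, y)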
6.2 Parametrizing a Learned Predictor
There are many ways to define the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described below,
while more advanced ones are left to later chapters, in particular Chapters 10 and 12.
6.2.1 Family of Functions
A motivation for the family of functions defined by multi-layer neural networks is to
compose affine transformations and non-linearities, where the choice of parameter values
controls the degree and location of these non-linearities. As discussed in Section 6.4
below, with the appropriate choice of parameters, multi-layer neural networks can in
principle approximate smooth functions, with more hidden units allowing one to achieve better approximations.
A multi-layer neural network with more than one hidden layer can be defined by generalizing the above structure, e.g., as follows, where we chose to use hyperbolic tangent² activation functions instead of sigmoid activation functions:
h^k = tanh(b^k + W^k h^{k−1})

where h^0 = x is the input of the neural net, h^k (for k > 0) is the output of the k-th hidden layer, which has weight matrix W^k and offset (or bias) vector b^k. If we want the output f_θ(x) to lie in some desired range, then we typically define an output non-linearity (which we did not have in the above Example 6.1.1). The non-linearity for the output layer is of course generally different from the tanh, depending on the type of output to be predicted and the associated loss function (see below).
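As an illustration, a forward pass through such a stack of tanh layers takes only a few lines of numpy; the layer sizes below and the purely affine output layer are arbitrary choices made for this sketch.

import numpy as np

def forward(x, weights, biases):
    """Compute h^k = tanh(b^k + W^k h^{k-1}) with h^0 = x; the last layer is kept
    affine here, since the output non-linearity depends on the loss (see below)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(b + W.dot(h))
    return biases[-1] + weights[-1].dot(h)

# toy usage with layer sizes 4 -> 8 -> 8 -> 3 (arbitrary)
rng = np.random.RandomState(0)
sizes = [4, 8, 8, 3]
weights = [0.1 * rng.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = forward(rng.randn(4), weights, biases)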
As discussed in Chapter 19 (Section 19.5) there are several other non-linearities
besides the sigmoid and the hyperbolic tangent which have been successfully used with
neural networks. In particular, TODO discusses alternative non-linear units, such as the rectified linear unit (max(0, b + w · x)) and the maxout unit (max_i(b_i + W_i · x)), which have been particularly successful in the case of deep feedforward or convolutional networks. These and other non-linearities and neural network computations commonly found in the literature are summarized below.
• Rectifier or rectified linear unit (ReLU) or positive part: h(x) = max(0, b + w · x).

• Hyperbolic tangent: h(x) = tanh(b + w · x).

• Sigmoid: h(x) = 1/(1 + e^{−(b+w·x)}).

• Softmax: h(x) = softmax(b + Wx). It is mostly used as output non-linearity for predicting discrete probabilities. See definition and discussion below, Eq. 6.1.

• Radial basis function or RBF unit: h(x) = e^{−||w−x||²}. This is heavily used in kernel SVMs (Boser et al., 1992; Schölkopf et al., 1999) and has the advantage that such units can be easily initialized as a random subset of the input examples (Powell, 1987; Niranjan and Fallside, 1990).

• Softplus: this is a smooth version of the rectifier, introduced in Dugas et al. (2001) for function approximation and in Nair and Hinton (2010a) for RBMs. When they introduced the rectifier for deep supervised nets, Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter, in spite of the very similar shape and of the softplus being differentiable everywhere with a non-zero derivative, contrary to the rectifier.

• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded: h(x) = max(−1, min(1, x)). It was introduced by Collobert (2004).

• Absolute value rectification: h(x) = |x| (may be applied on the dot product output or on the output of a tanh unit). It is also related to the rectifier and has been used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination.

• Maxout: this is discussed in more detail in TODO. It generalizes the rectifier but introduces multiple weight vectors w_i (called filters) for each unit: h(x) = max_i(b_i + w_i · x).

² The hyperbolic tangent is linearly related to the sigmoid as follows: tanh(x) = 2 × sigmoid(2x) − 1.
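The following numpy one-liners sketch these unit computations, written either on a precomputed pre-activation a = b + w · x or directly on the relevant inputs; they are illustrative definitions for this list, not a library API.

import numpy as np

def relu(a):                  # rectifier / positive part
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):               # subtracting max(a) uses the additive invariance discussed with Eq. 6.1
    e = np.exp(a - np.max(a))
    return e / e.sum()

def rbf(w, x):                # radial basis function unit
    return np.exp(-np.sum((w - x) ** 2))

def softplus(a):              # smooth version of the rectifier: log(1 + e^a)
    return np.log1p(np.exp(a))

def hard_tanh(a):
    return np.clip(a, -1.0, 1.0)

def abs_rectification(a):
    return np.abs(a)

def maxout(bs, Ws, x):        # max over several affine filters b_i + w_i . x
    return max(b + np.dot(w, x) for b, w in zip(bs, Ws))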
This is not an exhaustive list but covers most of the non-linearities and unit computa-
tions seen in the deep learning and neural nets literature. Many variants are possible.
As discussed in Section 6.3, the structure (also called architecture) of the family of
input-output functions can be varied in many ways, which calls for a generic principle
for efficiently computing gradients, described in that section. For example, a common
variation is to connect layers that are not adjacent, with so-called skip connections,
which are found in the visual cortex (for which the word “layer” should be replaced by
the word “area”). Other common variations depart from a full connectivity between
adjacent layers. For example, each unit at layer k may be connected to only a subset of
units at layer k 1. A particular case of such form of sparse connectivity is discussed
in chapter 11 with convolutional networks. In general, the set of connections between
units of the whole network only needs to form a directed acyclic graph in order to define
a meaningful computation (see the flow graph formalism below, Section 6.3). When
units of the network are connected to themselves through a cycle, one has to properly
define what computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 12.
6.2.2 Loss Function and Conditional Log-Likelihood
In the 80's and 90's the most commonly used loss function was the squared error L(f_θ(x), y) = ||f_θ(x) − y||². As discussed in Section 5.8.1, if f is unrestricted (non-parametric), minimizing the expected value of this loss function over some data-generating distribution P(X, Y) yields f(x) = E[Y|X], the true conditional expectation of Y given X³. See also Section 5.8.1 for more details. This tells us what the neural network is trying to learn. Replacing the squared error by an absolute value makes the neural network try to estimate not the conditional expectation but the conditional median.

³ Proving this is not difficult and is very instructive.
However, when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood⁴ have been found to be more appropriate than the squared error. In the case where y ∈ {0, 1} is binary this gives L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)), and it can be shown that the optimal (non-parametric) f minimizing this criterion is f(x) = P(Y = 1|x). Note that in order for the above expression of the criterion to make sense, f_θ(x) must be strictly between 0 and 1 (an undefined or infinite value would otherwise arise). To achieve this, it is common to use the sigmoid as non-linearity for the output layer, which matches well with the Binomial negative log-likelihood criterion⁵. As explained below (Section 6.2.2), the cross-entropy criterion allows gradients to pass through the output non-linearity even when the neural network produces a confidently wrong answer, unlike the squared error criterion coupled with a sigmoid or softmax non-linearity.
Learning a Conditional Probability Model
More generally, one can define a loss function as corresponding to a conditional log-likelihood, i.e., the negative log-likelihood (NLL) criterion

L(f_θ(x), y) = −log P_θ(Y|X).

See Section 5.8.2, which shows that this criterion yields an estimator of the true conditional probability of Y given X. For example, if Y is a continuous random variable and we assume that, given X, it has a Gaussian distribution with mean f_θ(X) and variance σ², then

−log P_θ(Y|X) = (1/(2σ²)) (f_θ(X) − Y)² + (1/2) log(2πσ²).
Up to an additive and multiplicative constant (which would give the same choice of θ),
this is therefore equivalent to the squared error loss. Once we understand this principle,
we can readily generalize it to other distributions, as appropriate. For example, it
is straightforward to generalize the univariate Gaussian to the multivariate case, and
under appropriate parametrization consider the variance to be a parameter or even
parametrized function of x (for example with output units that are guaranteed to be
positive, or forming a positive definite matrix).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion corresponds to the conditional log-likelihood associated with the Bernoulli distribution with probability p = f_θ(x) of generating Y = 1 given X = x (and probability 1 − p of generating Y = 0):

−log P_θ(Y|X) = −1_{Y=1} log p − 1_{Y=0} log(1 − p) = −Y log f_θ(X) − (1 − Y) log(1 − f_θ(X)).
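A small numpy sketch of this Bernoulli negative log-likelihood, computed from the pre-sigmoid score (logit) a so that the logarithms stay numerically stable; the logit-based rewriting log(1 + e^a) − y·a is an equivalent algebraic form, not a formula introduced in the text.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_nll(a, y):
    """-log P(Y = y | x) for p = sigmoid(a) and y in {0, 1}.
    Uses log(1 + e^a) = max(a, 0) + log(1 + e^-|a|) to avoid overflow."""
    return np.maximum(a, 0) - y * a + np.log1p(np.exp(-np.abs(a)))

# sanity check against the direct formula on a moderate value
a, y = 1.3, 1.0
p = sigmoid(a)
direct = -y * np.log(p) - (1 - y) * np.log(1 - p)
assert np.isclose(bernoulli_nll(a, y), direct)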
⁴ often called cross entropy in the literature, even though the term cross entropy should also apply to many other losses that can be viewed as negative log-likelihood, discussed below in more detail. TODO: where do we actually discuss this? may as well discuss it in maximum likelihood section
⁵ In reference to statistical models, this "match" between the loss function and the output non-linearity is similar to the choice of a link function in generalized linear models (McCullagh and Nelder, 1989).
Softmax
When Y is discrete (in some finite set, say {1, . . . , N}) but not binary, the Bernoulli distribution is extended to the Multinoulli (Murphy, 2012)⁶. This distribution is specified by a vector of N probabilities that sum to 1, each element of which provides the probability p_i = P(Y = i|X). We thus need a vector-valued output non-linearity that produces such a vector of probabilities, and a commonly used non-linearity for this purpose is the softmax non-linearity introduced earlier:

p = softmax(a),   with   p_i = e^{a_i} / Σ_j e^{a_j},     (6.1)

where typically a = b + Wh is the vector of scores whose elements a_i are associated with each category i. The corresponding loss function is therefore L(p, y) = −log p_y.
Note how minimizing this loss will push a_y up (increase the score a_y associated with the correct label y) while pushing a_i (for i ≠ y) down (decreasing the scores of the other labels, in the context x). The first effect comes from the numerator of the softmax while the second effect comes from the normalizing denominator. These forces cancel on a specific example only if p_y = 1, and they cancel on average over examples (say sharing the same x) if p_i equals the fraction of times that Y = i for this value of x. The same principles and the role of the normalization constant (or "partition function") can be seen at play in the training of Markov Random Fields, Boltzmann machines and RBMs, in Chapter 9. Note other interesting properties of the softmax. First of all, the gradient of the Multinoulli negative log-likelihood with respect to the softmax argument a is

∂(−log p_y)/∂a_i = (p_i − δ_{iy})     (6.2)
which does not saturate, i.e., the gradient does not vanish when the output probabilities approach 0 or 1, except when the model is providing the correct answer. On the other hand, other loss functions such as the squared error applied to sigmoid outputs (which was popular in the 80's and 90's) will have vanishing gradient when an output unit saturates, even if the output is completely wrong (Solla et al., 1988). Another property that could potentially be interesting is that the softmax output is invariant under additive changes of its input vector: softmax(a) = softmax(a + b) when b is a scalar added to all the elements of vector a. Finally, it is interesting to think of the softmax as a way to create a form of competition between the units (typically output units, but not necessarily) that participate in it: the strongest filter output a_i gets reinforced (because the exponential increases faster for larger values) and the other units get inhibited. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in cortex, and at the extreme (when the a_i's are large in magnitude) becomes a form of winner-take-all (one of the outputs is 1 and the others are 0). A more computationally expensive form of competition is found with sparse coding, described in Section 16.3.
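The following sketch checks Eq. 6.2 numerically: the gradient of the Multinoulli negative log-likelihood with respect to the scores a is simply the softmax output minus the one-hot target, and shifting all scores by a scalar leaves the softmax unchanged (which is also why implementations subtract the maximum score before exponentiating).

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))      # uses the additive invariance for numerical stability
    return e / e.sum()

def nll(a, y):
    return -np.log(softmax(a)[y])

rng = np.random.RandomState(0)
a, y = rng.randn(5), 2

# analytic gradient from Eq. 6.2: p_i - delta_{iy}
p = softmax(a)
grad = p.copy()
grad[y] -= 1.0

# finite-difference check of the same gradient
eps = 1e-6
num = np.zeros_like(a)
for i in range(len(a)):
    d = np.zeros_like(a)
    d[i] = eps
    num[i] = (nll(a + d, y) - nll(a - d, y)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
assert np.allclose(softmax(a), softmax(a + 3.7))   # softmax(a) = softmax(a + b)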
⁶ The Multinoulli is also known as the categorical or generalized Bernoulli distribution, which is a special case of the Multinomial distribution with one trial, and the term multinomial is often used even in this case.
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution P(Y|ω) with parameters ω, we can construct a conditional distribution P(Y|X) by making ω a parametrized function of X, and learning that function:

P(Y | ω = f_θ(X))

where f_θ(X) is the output of a predictor, X is its input, and Y can be thought of as a target. This established choice of words comes from the common cases of classification and regression, where f_θ(X) is really a prediction associated with Y, or with its expected value. However, in general ω = f_θ(X) may contain parameters of the distribution of Y other than its expected value. For example, it could contain its variance or covariance, in the case where Y is conditionally Gaussian. In the above examples, with the squared error loss, ω is the mean of the Gaussian which captures the conditional distribution of Y (which means that the variance is considered fixed, not a function of X). In the common classification case, ω contains the probabilities associated with the various events of interest.
Once we view things in this way, we automatically get as the natural cost function the negative log-likelihood L(X, Y) = −log P(Y | ω = f_θ(X)). Besides θ, there could be other parameters that control the distribution of Y. For example, we may wish to learn the variance of a conditional Gaussian, even when the latter is not a function of X. In this case, the solution can be computed analytically because the maximum likelihood estimator of the variance is simply the empirical mean of the squared difference between the observations Y and their expected value (here f_θ(X)). In other words, the conditional variance can be estimated from the mean squared error.

Besides the Gaussian, a simple and common example is the case where Y is binary (i.e. Bernoulli distributed), where it is enough to specify ω = P(Y = 1|X). In the Multinoulli case, ω is generally specified by a vector of probabilities summing to 1, e.g., via the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural networks is the mixture model, and in particular the Gaussian mixture model, introduced in Section 3.10.5. Neural networks that compute the parameters of a mixture model were introduced in Jacobs et al. (1991); Bishop (1994). In the case of the Gaussian mixture model with N components,

P(y|x) = Σ_{i=1}^{N} P(C = i|x) N(y | µ_i(x), Σ_i(x)),

the neural network must have three kinds of outputs, P(C = i|x), µ_i(x), and Σ_i(x), which must satisfy different constraints:

1. Mixture components P(C = i|x): these form a Multinoulli distribution over the N different components C, and can typically be obtained by a softmax over an N-vector, to guarantee that these outputs are positive and sum to 1.

2. Means µ_i(x): these indicate the center or mean associated with each Gaussian component, and are unconstrained (typically with no non-linearity at all for these output units). If Y is a d-vector, then the network must output an N × d matrix containing all these N d-vectors.

3. Covariances Σ_i(x): these specify the covariance matrix for each component i. In many models the variance is unconditional (does not depend on x) and diagonal or even scalar. If it is diagonal or scalar, only positivity must be enforced (e.g., using the softplus non-linearity). If it is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ_i(x) = B_i(x) B_i(x)ᵀ, where B_i is an unconstrained square matrix. One practical issue if the matrix is full is that computing the likelihood is expensive, requiring O(d³) computation for the determinant and inverse of Σ_i(x) (or equivalently, and more commonly done, its eigendecomposition or that of B_i(x)).
It has been reported that gradient-based optimization of conditional Gaussian mixtures
(on the output of neural networks) can be finicky, in part because one gets divisions
(by the variance) which can be numerically unstable (when some variance gets to be
small for a particular example, yielding very large gradients). One solution is to clip
gradients (see Section ?? and Mikolov (2012); Pascanu and Bengio (2012); Graves (2013);
Pascanu et al. (2013a)), while another is to scale the gradients heuristically (Murray and
Larochelle, 2014).
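A minimal sketch of how a network's output vector could be split into the three kinds of quantities above, assuming diagonal covariances: a softmax for the mixture weights, unconstrained means, and a softplus to keep the variances positive. The layout of the raw output vector is an illustrative assumption, not a layout prescribed by the text.

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(a))

def mixture_params(raw, N, d):
    """Split a raw output vector of size N + N*d + N*d into
    (mixture weights, means, diagonal variances) for N components over a d-vector y."""
    logits = raw[:N]
    means = raw[N:N + N * d].reshape(N, d)
    var = softplus(raw[N + N * d:]).reshape(N, d)   # positivity via softplus
    return softmax(logits), means, var

def mixture_nll(y, weights, means, var):
    """-log sum_i P(C = i|x) N(y | mu_i(x), diag(var_i(x))), computed via log-sum-exp."""
    log_comp = (-0.5 * np.sum((y - means) ** 2 / var, axis=1)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1))
    s = log_comp + np.log(weights)
    m = np.max(s)
    return -(m + np.log(np.sum(np.exp(s - m))))

# toy usage: N = 3 components over a 2-dimensional y
rng = np.random.RandomState(0)
N, d = 3, 2
raw = rng.randn(N + 2 * N * d)            # stands in for the network's output layer
w, mu, v = mixture_params(raw, N, d)
print(mixture_nll(rng.randn(d), w, mu, v))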
Multiple Output Variables
When Y is actually a tuple formed by multiple random variables Y_1, Y_2, . . . , Y_k, then one has to choose an appropriate form for their joint distribution, conditional on X = x. The simplest and most common choice is to assume that the Y_i are conditionally independent, i.e.,

P(Y_1, Y_2, . . . , Y_k | x) = Π_{i=1}^{k} P(Y_i | x).
This brings us back to the single variable case, especially since the log-likelihood now decomposes into a sum of terms log P(Y_i|x). If each P(Y_i|x) is separately parametrized (e.g. a different neural network), then we can train these neural networks independently. However, a more common and powerful choice assumes that the different variables Y_i share some common factors that can be represented in some hidden layer of the network (such as the top hidden layer). See Sections 6.5 and 7.11 for a deeper treatment of the notion of underlying factors of variation and multi-task training.
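As a small illustration of this shared-factors idea, a single hidden layer can feed several output heads whose Bernoulli negative log-likelihoods are simply summed, here for two conditionally independent binary targets; the sizes and the choice of sigmoid output units are assumptions made for the sketch.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
n_i, n_h = 6, 12
W, c = 0.1 * rng.randn(n_h, n_i), np.zeros(n_h)           # shared hidden layer
heads = [(0.1 * rng.randn(n_h), 0.0) for _ in range(2)]   # one (v, b) pair per target Y_i

def loss(x, ys):
    """Sum of per-target Bernoulli NLLs, all reading the same hidden layer h."""
    h = np.tanh(c + W.dot(x))
    total = 0.0
    for (v, b), y in zip(heads, ys):
        p = sigmoid(b + v.dot(h))
        total += -y * np.log(p) - (1 - y) * np.log(1 - p)
    return total

print(loss(rng.randn(n_i), [1.0, 0.0]))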
If the conditional independence assumption is considered too strong, what can we do? At this point it is useful to step back and consider everything we know about learning a joint probability distribution. Any probability distribution P_ω(Y) parametrized by parameters ω can be turned into a conditional distribution P_θ(Y|x) by making ω a function ω = f_θ(x) parametrized by θ. This is not only true of the simple parametric distributions we have seen above (Gaussian, Bernoulli, Multinoulli) but also of more complex joint distributions typically represented by a graphical model. Much of this book is devoted to such graphical models (see Chapters 9, 15, 16, 17). For more concrete examples of such structured output models with deep learning, see Section 20.4.
6.2.3 Training Criterion and Regularizer
The loss function (often interpretable as a negative log-likelihood) tells us what we would like the learner to capture. Maximizing the conditional log-likelihood over the true distribution, i.e., minimizing the expected loss E_{X,Y}[−log P_θ(Y|X)], makes P_θ(Y|X) estimate the true P(Y|X) associated with the unknown data generating distribution, within the boundaries of what the chosen family of functions allows. See Section 5.8.2 for a proof. In practice we cannot minimize this expectation because we do not know P(X, Y) and because computing and minimizing this integral exactly would generally be intractable. Instead we are going to approximately minimize a training criterion C(θ) based on the empirical average of the loss (over the training set). The simplest such criterion is the average training loss (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i), where the training set is a set of n examples (x_i, y_i). However, better results can often be achieved by crippling the learner and preventing it from simply finding the best θ that minimizes the average training loss. This means that we combine the evidence coming from the data (the training set average loss) with some a priori preference on the different values that θ can take (the regularizer). If these concepts (and the related concepts of generalization, overfitting and underfitting) are not clear, please return to Sections 5.3 and 5.6.2 for a refresher.
The most common regularizer is simply an additive term equal to a regularization coefficient λ times the squared norm of the parameters, ||θ||²₂. This regularizer is often called the weight decay or L2 weight decay or shrinkage, because adding the squared L2 norm to the training criterion pushes the weights towards zero, in proportion to their magnitude. For example, when using stochastic gradient descent, each step of the update with regularizer term λ||θ||² would look like

θ ← θ − ε (∂L(f_θ(x), y)/∂θ + λθ)

where ε is the learning rate (see Section 4.3). This can be rewritten

θ ← θ(1 − ελ) − ε ∂L(f_θ(x), y)/∂θ

where we see that the first term systematically shaves off a small proportion of θ before adding the gradient.
Another commonly used regularizer is the so-called L1 regularizer, which adds a term proportional to the L1 (absolute value) norm, ||θ||₁ = Σ_i |θ_i|. The L1 regularizer also prefers values that are small, but whereas the L2 weight decay has only a weak preference for 0 rather than small values, the L1 regularizer continues pushing the parameters towards 0 with a constant gradient even when they get very small. As a consequence, it tends to bring some of the parameters to exactly 0 (how many depends on how large the regularization coefficient λ is chosen). When optimizing with an approximate, iterative and noisy method such as stochastic gradient descent, no actual 0 is obtained. On the other hand, the L1 regularizer tolerates large values of some parameters (only additively removing a small quantity compared to the situation without regularization), whereas the L2 weight decay aggressively punishes and prevents large parameter values. The L1 and L2 regularizers can be combined, as in the so-called elastic net (Zou and Hastie, 2005), and this is commonly done in deep networks⁷.

⁷ See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.html#l1-and-l2-regularization
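The difference between the two penalties is easy to see in the update step itself. Below is a sketch of one stochastic gradient step with L2 weight decay (multiplicative shrinkage of θ) versus L1 regularization (a constant-magnitude push toward zero through the sign of θ), given some gradient grad of the unregularized loss; the learning rate and coefficients are arbitrary illustrative values.

import numpy as np

def sgd_step_l2(theta, grad, eps=0.01, lam=1e-3):
    # theta <- theta (1 - eps*lam) - eps * dL/dtheta
    return theta * (1 - eps * lam) - eps * grad

def sgd_step_l1(theta, grad, eps=0.01, lam=1e-3):
    # the L1 term contributes a constant-magnitude (sub)gradient lam * sign(theta)
    return theta - eps * (grad + lam * np.sign(theta))

theta = np.array([2.0, -0.001, 0.5])
grad = np.zeros_like(theta)          # pretend the data gradient is zero for illustration
print(sgd_step_l2(theta, grad))      # every entry shrinks in proportion to its magnitude
print(sgd_step_l1(theta, grad))      # every non-zero entry is pushed by the same amount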
Note how in the context of maximum likelihood, regularizers can generally be in-
terpreted as Bayesian priors on the parameters P(θ) or on the learned function, as
discussed in Sections 5.6.2 and 5.7. Later in this book we discuss regularizers that are
data-dependent (i.e., cannot be expressed easily as a pure prior), such as the contractive
penalty (Section 10.1) as well as regularization methods that are difficult to interpret
simply as added terms in the training criterion, such as dropout (Section 7.10).
6.2.4 Optimization Procedure
Once a training criterion is defined, we enter the realm of iterative optimization proce-
dures, with the slight twist that if possible we want to monitor held-out performance
(for example the average loss on a validation set, which usually does not include the
regularizing terms) and not just the value of the training criterion. Monitoring vali-
dation set performance allows one to stop training at a point where generalization is
estimated to be the best among the training iterations. This is called early stopping and
is a standard machine learning technique, discussed in Sections 7.7 and 8.1.3.
Section 4.3 has already covered the basics of gradient-based optimization. The sim-
plest such technique is stochastic gradient descent (with minibatches), in which the
parameters are updated after computing the gradient of the average loss over a mini-
batch of examples (e.g. 128 examples) and making an update in the direction opposite
to that gradient (or opposite to some accumulated average of such gradients, i.e., the
momentum technique, reviewed in Section 19.5.3). The most commonly used optimiza-
tion procedures for multi-layer neural networks and deep learning in general are either
variants of stochastic gradient descent (typically with minibatches), second-order meth-
ods (the most commonly used being LBFGS and nonlinear conjugate gradients) applied
on large minibatches (e.g. of size 10000) or on the whole training set at once, or nat-
ural gradient methods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu
and Bengio, 2013). Exact second-order and natural gradient methods are computation-
ally too expensive for large neural networks because they involve matrices of dimension
equal to the square of the number of parameters. Approximate methods are discussed
in Section 8.8.1.
On smaller datasets or when computing can be parallelized, second-order methods
have a computational advantage because they are easy to parallelize and can still perform
many updates per unit of time. On larger datasets (and in the limit of an infinite dataset,
i.e., a stream of training examples) one cannot afford to wait for seeing the whole dataset
before making an update, which favors either stochastic gradient descent variants
(possibly with minibatches to take advantage of fast or parallelized implementations of
matrix-matrix products) or second-order methods with minibatches.
A more detailed discussion of issues arising with optimization methods in deep learn-
ing can be found in Chapter 8. In particular, many design choices in the construction
of the family of functions, loss function and regularizer can have a major impact on
the difficulty of optimizing the training criterion. Furthermore, instead of using generic
optimization techniques, one can design optimization procedures that are specific to
the learning problem and chosen architecture of the family of functions, for example by
initializing the parameters of the final optimization routine from the result of a differ-
ent optimization (or a series of optimizations, each initialized from the previous one).
Because of the non-convexity of the training criterion, the initial conditions can make
a very important difference, and this is what is exploited in the various pre-training
strategies, Sections 8.9.3 and 10.4, as well as with curriculum learning (Bengio et al.,
2009), Section 8.10.
6.3 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning algo-
rithm for multi-layer neural networks. Actually it just means the method for computing
gradients in such networks. Furthermore, it is generally understood as something very
specific to multi-layer neural networks, but once its derivation is understood, it can eas-
ily be generalized to arbitrary functions (for which computing a gradient is meaningful),
and we describe this generalization here, focusing on the case of interest in machine
learning where the output of the function to differentiate (e.g., the loss L or the train-
ing criterion C) is a scalar and we are interested in its derivative with respect to a set
of parameters (considered to be the elements of a vector θ). It can be readily proven
that the back-propagation algorithm has optimal computational complexity in the sense
that there is no algorithm that can compute the gradient faster (in the O(·) sense, i.e., up to an additive and multiplicative constant).
6.3.1 Chain Rule
In order to apply the back-propagation algorithm, we take advantage of the chain rule:
∂C(g(θ))/∂θ = ∂C(g(θ))/∂g(θ) · ∂g(θ)/∂θ     (6.3)
which also works when C, g or θ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in θ will propagate into a small change in g(θ) by getting multiplied by ∂g(θ)/∂θ. Similarly, a small change in g(θ) will propagate into a small change in C(g(θ)) by getting multiplied by ∂C(g(θ))/∂g(θ). Hence a small change in θ first gets multiplied by ∂g(θ)/∂θ to obtain the change in g(θ), and this then gets multiplied by ∂C(g(θ))/∂g(θ) to obtain the change in C(g(θ)). Hence the ratio of the change in C(g(θ)) to the change in θ is the product of these partial derivatives. The partial derivative measures the locally linear influence of a variable on another. This is illustrated in Figure 6.2, with x = θ, y = g(θ), and z = C(g(θ)).
Figure 6.2: The chain rule, illustrated in the simplest possible case, with z a scalar function of y, which is itself a scalar function of x. A small change ∆x in x gets turned into a small change ∆y in y through the partial derivative ∂y/∂x, from the first-order Taylor approximation of y(x), and similarly for z(y). Plugging the equation for ∆y into the equation for ∆z yields the chain rule.
Now, if g is a vector, we can rewrite the above as follows:

∂C(g(θ))/∂θ = Σ_i ∂C(g(θ))/∂g_i(θ) · ∂g_i(θ)/∂θ

which sums over the influences of θ on C(g(θ)) through all the intermediate variables g_i(θ). This is illustrated with a simple case of two paths in Figure 6.3, with x = θ, y_i = g_i(θ), and z = C(g(θ)).
Figure 6.3: Top: The chain rule, when there are two intermediate variables y_1 and y_2 between x and z, creating two paths for changes in x to propagate and yield changes in z. Bottom: more general case, with n intermediate variables y_1 to y_n.
6.3.2 Back-Propagation in a General Flow Graph
More generally, we can think about decomposing a function C(θ) into a more complicated
graph of computations. This graph is called a flow graph. Each node u_i of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values u_j of other nodes, with j < i. The nodes satisfy a partial order which dictates in what order the computation can proceed. In practical implementations of such functions (e.g. with the criterion C(θ) or its value estimated on a minibatch), the final computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpy library can perform on arrays of numbers).
We will define the back-propagation in a general flow-graph, using the following
generic notation: u_i = f_i(a_i), where a_i is the list of arguments for the application of f_i to the values u_j for the parents of i in the graph: a_i = (u_j)_{j ∈ parents(i)}. This is illustrated in Figure 6.4.
Figure 6.4: Illustration of recursive forward computation, where at each node u_i we compute a value u_i = f_i(a_i), with a_i being the list of values from the parents u_j of node u_i. Following Algorithm 6.1, the overall inputs to the graph are u_1 . . . u_M (e.g., the parameters we may want to tune during training), and there is a single scalar output u_N (e.g., the loss which we want to minimize).
The overall computation of the function represented by the flow graph can thus be
summarized by the forward computation algorithm, Algorithm 6.1.
In addition to having some code that tells us how to compute f_i(a_i) for some values in the vector a_i, we also need some code that tells us how to compute its partial derivatives, ∂f_i(a_i)/∂a_{ik}, with respect to any immediate argument a_{ik}. Let k = π(i, j) denote the index of u_j in the list a_i. Note that u_j could influence u_i through multiple paths. Whereas ∂u_i/∂u_j would denote the total gradient adding up all of these influences, ∂f_i(a_i)/∂a_{ik} only denotes the derivative of f_i with respect to its specific k-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from u_j to u_i. In general, when manipulating partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument on the function output, with the other arguments considered fixed. For example consider f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}} and f_2(a_{2,1}) = a_{2,1}², while u_3 = f_3(u_2, u_1)
Algorithm 6.1 Flow graph forward computation. Each node computes numerical value u_i by applying a function f_i to its argument list a_i that comprises the values of previous nodes u_j, j < i, with j ∈ parents(i). The input to the flow graph is the vector x, and is set into the first M nodes u_1 to u_M. The output of the flow graph is read off the last (output) node u_N.

for i = 1 . . . M do
  u_i ← x_i
end for
for i = M + 1 . . . N do
  a_i ← (u_j)_{j ∈ parents(i)}
  u_i ← f_i(a_i)
end for
return u_N
and u_2 = f_2(u_1), illustrated in Figure 6.5. The direct derivative of f_3 with respect to its argument a_{3,2} is ∂f_3/∂a_{3,2} = e^{a_{3,1} + a_{3,2}}, while if we consider the variables u_3 and u_1 to which these correspond, there are two paths from u_1 to u_3, and we obtain as derivative the sum of partial derivatives over these two paths, ∂u_3/∂u_1 = e^{u_1 + u_2}(1 + 2u_1). The results are different because ∂u_3/∂u_1 involves not just the direct dependency of u_3 on u_1 but also the indirect dependency through u_2.
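A tiny numeric check of this example: differentiating u_3 = e^{u_1 + u_2} while treating u_2 = u_1² as a function of u_1 gives the total derivative e^{u_1 + u_2}(1 + 2u_1), which a finite-difference estimate confirms, whereas the direct partial derivative e^{u_1 + u_2} (with u_2 held fixed) does not match.

import numpy as np

def u3_of_u1(u1):
    u2 = u1 ** 2               # u2 = f2(u1)
    return np.exp(u1 + u2)     # u3 = f3(u2, u1)

u1 = 0.3
u2 = u1 ** 2
direct = np.exp(u1 + u2)                   # partial of f3 w.r.t. a_{3,2}, the other argument held fixed
total = np.exp(u1 + u2) * (1 + 2 * u1)     # sum of contributions over both paths (chain rule)

eps = 1e-6
numeric = (u3_of_u1(u1 + eps) - u3_of_u1(u1 - eps)) / (2 * eps)
assert np.isclose(total, numeric)          # the total derivative matches finite differences
assert not np.isclose(direct, numeric)     # the direct partial alone does not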
Figure 6.5: Illustration of the indirect effect and direct effect of variable u_1 on variable u_3 in a flow graph, which means that the derivative of u_3 with respect to u_1 must include the sum of two terms, one for the direct effect (derivative of u_3 with respect to its first argument) and one for the indirect effect through u_2 (involving the product of the derivative of u_3 with respect to u_2 times the derivative of u_2 with respect to u_1). Forward computation of the u_i's (as in Figure 6.4) is indicated with upward full arrows, while backward computation (of derivatives with respect to the u_i's, as in Figure 6.6) is indicated with downward dashed arrows.
Figure 6.6: Illustration of recursive backward computation, where we associate to each node j not just the values u_j computed in the forward pass (Figure 6.4, bold upward arrows) but also the gradient ∂u_N/∂u_j with respect to the output scalar node u_N. These gradients are recursively computed in exactly the opposite order, as described in Algorithm 6.2, by using the already computed ∂u_N/∂u_i of the children i of j (dashed downward arrows).
Armed with this understanding, we can define the back-propagation algorithm as
follows, in Algorithm 6.2, which would be computed after the forward propagation
(Algorithm 6.1) has been performed. Note the recursive nature of the application of the
chain rule, in Algorithm 6.2: we compute the gradient on node j by re-using the already
computed gradient for children nodes i, starting the recurrence from the trivial ∂u_N/∂u_N = 1 that sets the gradient for the output node. This is illustrated in Figure 6.6.
Algorithm 6.2 Back-propagation computation of a flow graph (full, upward arrows), which itself produces an additional flow graph (dashed, backward arrows). See the forward propagation in a flow-graph (Algorithm 6.1, to be performed first) and the required data structure. In addition, a quantity ∂u_N/∂u_i needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation π(i, j) is the index of u_j as an argument to f_i. The back-propagation algorithm efficiently computes ∂u_N/∂u_i for all i's (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node u_N with respect to the "inputs" u_1 . . . u_M (which could be the parameters, in a learning setup). The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

∂u_N/∂u_N ← 1
for j = N − 1 down to 1 do
  ∂u_N/∂u_j ← Σ_{i: j ∈ parents(i)} (∂u_N/∂u_i) (∂f_i(a_i)/∂a_{i,π(i,j)})
end for
return ∂u_N/∂u_i for i = 1 . . . M
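Below is a small Python sketch of Algorithms 6.1 and 6.2 for scalar-valued nodes: each node stores its function, its parents, and the partial derivative of its function with respect to each argument, and gradients are accumulated in reverse topological order. This Node representation is an illustrative choice for the sketch, not an implementation taken from the text.

import math

class Node:
    def __init__(self, f=None, df=None, parents=()):
        self.f = f              # function of the list of parent values (None for input nodes)
        self.df = df            # df(args, k): partial derivative w.r.t. the k-th argument
        self.parents = parents
        self.value = None
        self.grad = 0.0         # will hold d u_N / d u_i after backward()

def forward(nodes, inputs):
    """Algorithm 6.1: the inputs fill the first M nodes, the rest are computed in order."""
    for node, x in zip(nodes, inputs):
        node.value = x
    for node in nodes[len(inputs):]:
        node.value = node.f([p.value for p in node.parents])
    return nodes[-1].value

def backward(nodes):
    """Algorithm 6.2: traverse in reverse order, accumulating gradients into the parents."""
    for node in nodes:
        node.grad = 0.0
    nodes[-1].grad = 1.0
    for node in reversed(nodes):
        if node.f is None:
            continue
        args = [p.value for p in node.parents]
        for k, parent in enumerate(node.parents):
            parent.grad += node.grad * node.df(args, k)

# the example of Figure 6.5: u2 = u1^2 and u3 = exp(u1 + u2)
u1 = Node()
u2 = Node(f=lambda a: a[0] ** 2, df=lambda a, k: 2 * a[0], parents=(u1,))
u3 = Node(f=lambda a: math.exp(a[0] + a[1]),
          df=lambda a, k: math.exp(a[0] + a[1]), parents=(u2, u1))
nodes = [u1, u2, u3]
forward(nodes, [0.3])
backward(nodes)
print(u1.grad)    # equals e^{u1 + u2} (1 + 2 u1), as in the example above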
This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming. Indeed, the derivative of the output node with respect to any node can also be written down in this intractable form:

∂u_N/∂u_i = Σ_{paths (u_{k_1}, . . . , u_{k_n}): k_1 = i, k_n = N} Π_{j=2}^{n} ∂u_{k_j}/∂u_{k_{j−1}}
where the paths (u_{k_1}, . . . , u_{k_n}) go from the node k_1 = i to the final node k_n = N in the flow graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because, like dynamic programming, it re-uses partial sums associated with the gradients on intermediate nodes.
Although the above was stated as if the u_i's were scalars, exactly the same procedure can be run with the u_i's being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients ∂u_N/∂u_i with a Jacobian of partial derivatives associated with the j → i arc of the graph, ∂f_i(a_i)/∂a_{i,π(i,j)}. In the case where minibatches are used during training, u_i would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the former can be computed much more efficiently than a sequence of matrix-vector products (e.g. with the BLAS library), especially so on modern computers and GPUs, which rely more and more on parallelization through many cores.
More implementation issues regarding the back-propagation algorithm are discussed
in Chapters 18 and 19, regarding respectively GPU implementations (Section 18.2)
and debugging tricks (Section 19.5.1). Section 19.5.2 also discusses a natural general-
ization of the back-propagation algorithm in which one manipulates not numbers but
symbolic expressions, i.e., turning a program that performs a computation (decomposed
as a flow graph expressed in the forward propagation algorithm) into another program
(another flow graph) that performs gradient computation (i.e. automatically generat-
ing the back-propagation program, given the forward propagation graph). Using such
symbolic automatic differentiation procedures (implemented for example in the Theano library⁸), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.

⁸ http://deeplearning.net/software/theano/
6.4 Universal Approximation Properties and Depth
Without a hidden layer, and for convex loss functions and regularizers (which is typically the case), we obtain a convex training criterion. We end up with linear regression, logistic regression and other log-linear models that are used in many applications of machine learning. This is appealing (no local minima, no saddle points) and convenient (no strong
dependency on initial conditions) but comes with a major disadvantage: such learners
are very limited in their ability to represent more complex functions. In the case of
classification tasks, we are limited to linear decision surfaces. In practice, this limitation
can be somewhat circumvented by handcrafting a large enough or discriminative enough set of hardwired features extracted from the raw input, on which the linear predictor can be easily trained. See the next section for a longer discussion about feature learning.
To make the family of functions rich enough, the other option is to introduce one
or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi, 1994)
that with a single hidden layer of a sufficient size and a reasonable choice of non-
linearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent
any smooth function to some desired accuracy (the greater the required accuracy, the
more hidden units are required). However, these theorems generally do not tell us
how many hidden units will be required to achieve a given accuracy for particular data
distributions: in fact the worst case scenario is generally going to require an exponential number of hidden units (to basically record every input configuration that needs to be distinguished). It is easier to understand how bad things can be by considering the case of binary input vectors: the number of possible binary functions on vectors v ∈ {0, 1}^d is 2^{2^d}, and selecting one such function requires 2^d bits, which will in general require O(2^d) degrees of freedom.
However, machine learning is not about learning equally easily any possible func-
tion: we mostly care about the kinds of functions that are needed to represent the
world around us or the task of interest. The no-free-lunch theorems for machine
learning (Wolpert, 1996) essentially say that no learning algorithm is “universal”, i.e.,
dominates all the others against all possible ground truths (data generating distribution
or target function). However, for a given family of tasks, some learning algorithms are
definitely better. Choosing a learning algorithm is equivalent to choosing a prior, i.e.,
making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only happen if there
is nothing to be learned any more because we already know the target function. Having
a universal approximation property just means that this prior is broad enough to cover
or come close to any possible target function, and this is a useful property, but in itself
clearly not sufficient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
One of the central priors that is exploited in deep learning is that the target function
to be learned can be efficiently represented as a deep composition of simpler functions
(“features”), where features at one level can be re-used to define many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a flow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can define the operations performed
in each node in different ways. For example, do we consider a node that computes
the affine operations of a neural net followed by a node that computes the non-linear
neuron activation, or do we consider both of these operations as one node or one level
of the graph? Hence the notion of depth really depends on the allowed operations at
each node and one flow graph of a given depth can be converted into an equivalent
flow graph of a different depth by redefining which operations can be performed at each
node. However, for neural networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012a; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al.,
2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior and that
in order to take advantage of it, the learner's family of functions needs to allow multiple
levels of representation.
6.5 Feature / Representation Learning
Let us consider again the single layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves a
convex optimization problem⁹, i.e., with some convergence guarantees towards a global
optimum, irrespective of initial conditions. Simple and well-understood optimization
algorithms are available in this case. However, this limits the representational capacity
too much and many tasks, for a given choice of input representation x (the raw input
features), cannot be solved by using only a linear predictor. What are our options to
avoid that limitation:
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Scolkopf
et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of much
higher dimension. In this case, f_θ(x) = b + w · φ(x) can be linear in the parameters
(and in φ(x)) and optimization remains convex (or even analytic). Thanks to the
kernel trick, we can computationally handle a high-dimensional φ(x) (or even an
infinite-dimensional one) so long as the kernel K(u, v) = φ(u) · φ(v) (where · is
the appropriate dot product for the space of φ(·)) can be computed efficiently.
If φ(x) is of high enough dimension, we can always have enough capacity to fit
the training set, but generalization is not at all guaranteed: it will depend on
the appropriateness of the choice of φ as a feature space for our task. Kernel
machines theory clearly identifies the choice of φ to the choice of a prior. This
leads to kernel engineering, which is equivalent to feature engineering, discussed
next. The other type of kernel (that is very commonly used) embodies a very broad
prior, such as smoothness, e.g., the Gaussian (or RBF) kernel K(u, v) = e^{−||u−v||²}.
⁹ or even one for which an analytic solution can be computed, as with linear regression or the case of some Gaussian process regression models
Unfortunately, this prior may be insufficient, i.e., too broad and sensitive to the
curse of dimensionality, as introduced in Section 5.11 and developed in more detail
in Chapter 14.
2. Another option is to manually engineer the representation or features φ(x). Most
industrial applications of machine learning rely on hand-crafted features, and most of the research and development effort (as well as a very large fraction of the scientific literature in machine learning and its applications) goes into designing new features that are most appropriate to the task at hand. Clearly, faced with a problem to solve and some prior knowledge in the form of representations that are believed to be relevant, the prior knowledge can be very useful. This approach is
therefore common in practice, but is not completely satisfying because it involves
a very task-specific effort and a laborious never-ending effort to improve systems
by designing better features. If there were some more general feature learning
approaches that could be applied to a large set of related tasks (such as those
involved in AI), we would certainly like to take advantage of them. Since humans
seem to be able to learn a lot of new tasks (for which they were not programmed
by evolution), it seems that such broad priors do exist. This whole question is
discussed in a lot more detail in Bengio and LeCun (2007b), and motivates the
third option.
3. The third option is to learn the features, or learn the representation. In a sense,
it allows one to interpolate between the almost agnostic approach of a kernel machine with a general-purpose smoothness kernel (such as RBF SVMs and other non-parametric statistical models) and full designer-provided knowledge in the form of a fixed representation that is perfectly tailored to the task. This is equiva-
lent to the idea of learning the kernel, except that whereas most kernel learning
methods only allow very few degrees of freedom in the learned kernel, representa-
tion learning methods such as those discussed in this book (including multi-layer
neural networks) allow the feature function φ(·) to be very rich (with a num-
ber of parameters that can be in the millions or more, depending on the amount
of data available). This is equivalent to learning the hidden layers, in the case
of a multi-layer neural network. Besides smoothness (which comes for example
from regularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed above
(Section 6.4). Other priors are discussed in Chapter 14.
This whole discussion is clearly not specific to neural networks and supervised learn-
ing, and is one of the central motivations for this book.
6.6 Historical Notes
TODO describe ReLUs and maxout, include reference to dropout section