Chapter 6
Feedforward Deep Networks
6.1 Formalizing and Generalizing Neural Networks
Feedforward supervised neural networks were among the first and most successful learning algorithms (Rumelhart et al., 1986d,c). They are also called deep networks, multi-layer perceptrons (MLPs), or simply neural networks. The vanilla architecture, with a single hidden layer, is illustrated in Figure 6.1.
Figure 6.1: Vanilla (shallow) MLP, with one sigmoid hidden layer, computing the vector-valued hidden unit vector h = sigmoid(c + W x) with weight matrix W and bias (offset) vector c. The output vector is obtained via another learned affine transformation ŷ = b + V h, with weight matrix V and output bias (offset) vector b.
MLPs can learn powerful non-linear transformations: in fact, with enough hidden
units they can represent arbitrarily complex but smooth functions (see Section 6.4).
This is achieved by composing simpler but still non-linear learned transformations. By
transforming the data non-linearly into a new space, a classification problem that was
not linearly separable (not solvable by a linear classifier) can become separable, as
illustrated in Figure 6.2.
Figure 6.2: Each layer of a trained neural network non-linearly transforms its input,
distorting the space so that the task becomes easier to perform, e.g., linear classification
in the above figures. Top: a vanilla neural network with 2 hidden units in its single
hidden layer can transform the 2-D input space so that the two classes (in red/pink vs
blue) become linearly separable (the red and blue curves correspond to the “manifold”
near which examples concentrate). Bottom: with a larger hidden layer (100 here),
the MNIST digit images (with 28 × 28 = 784 pixels) can be transformed so that the
classes (each shown with a different color) can be much more easily classified by the
output linear+softmax layer (over 10 digit categories). Both figures are reproduced with
permission by Chris Olah from http://colah.github.io/, where many more insightful
visualizations can be found.
In addition to covering the basics of such networks, this chapter introduces a general
formalism for gradient-based optimization of parametrized families of functions, often
in the context of conditional maximum likelihood criteria (Section 6.2).
They bring together a number of important machine learning concepts already in-
troduced in the previous chapters:
• Define a parametrized family of functions f_θ describing how the learner will behave on new examples, i.e., what output f_θ(x) it will produce given some input x. Training consists in choosing the parameter θ (usually represented by a vector) given some training examples (x, y) sampled from an unknown data generating distribution P(X, Y).
• Define a loss function L describing what scalar loss L(ŷ, y) is associated with each supervised example (x, y), as a function of the learner's output ŷ = f_θ(x) and the target output y.
• Define a training criterion and a regularizer. The objective of training is ideally to minimize the expected loss E_{X,Y}[L(f_θ(X), Y)] over (X, Y) sampled from the unknown data generating distribution P(X, Y). However this is not possible because the expectation makes use of the true underlying P(X, Y), while we only have access to a finite number of training examples, i.e., of pairs (x, y). Instead, one defines a training criterion which usually includes an empirical average of the loss over the training set, plus some additional terms (called regularizers) which enforce some preferences over the choices of θ.
• Define an optimization procedure to approximately minimize the training criterion¹. The simplest such optimization procedure is stochastic gradient descent, described in Section 4.3.
Example 6.1.1 illustrates these concepts for the case of a vanilla neural network for
regression.
In Chapter 17, we consider generalizations of the above framework to the unsupervised and semi-supervised cases, where Y may not exist or may not always be present. An unsupervised loss can then be defined solely as a function of the input x and some function f_θ(x) that one wishes to learn.
¹ It is generally not possible to analytically obtain a global minimum of the training criterion, and iterative numerical optimization methods are used instead.
Example 6.1.1. Vanilla (Shallow) Multi-Layer Neural Network for Regression

Based on the above definitions, we could pick the family of input-output functions to be

    f_θ(x) = b + V sigmoid(c + W x),

illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + e^{−a}) is applied element-wise, the input is the vector x ∈ R^{n_i}, the hidden layer outputs are the elements of the vector h = sigmoid(c + W x) with n_h entries, and the parameters are θ = (b, c, V, W) (viewed as the flattened vectorized version of the tuple), with b ∈ R^{n_o} a vector of the same dimension as the output (n_o), c ∈ R^{n_h} of the same dimension as h (the number of hidden units), and V ∈ R^{n_o × n_h} and W ∈ R^{n_h × n_i} being weight matrices.
The loss function for this classical example could be the squared error L(ŷ, y) = ||ŷ − y||² (see Section 5.8.1, showing that it makes ŷ an estimator of E[Y|x]). The regularizer could be the ordinary L2 weight decay ||ω||² = Σ_{ij} W_{ij}² + Σ_{ki} V_{ki}², where we define the set of weights ω as the concatenation of the elements of matrices W and V. The L2 weight decay thus penalizes the squared norm of the weights, with λ a scalar that is chosen larger to penalize the weights more strongly and yield smaller weights. Combining the loss function and the regularizer gives the training criterion, which is the objective function during training:

    C(θ) = λ||ω||² + (1/n) Σ_{t=1}^{n} ||y^{(t)} − (b + V sigmoid(c + W x^{(t)}))||².
where (x^{(t)}, y^{(t)}) is the t-th training example, an (input, target) pair. Finally, the classical training procedure in this example is stochastic gradient descent, which iteratively updates θ according to

    ω ← ω − ε (2λω + ∇_ω L(f_θ(x^{(t)}), y^{(t)}))
    β ← β − ε ∇_β L(f_θ(x^{(t)}), y^{(t)}),

where β = (b, c) contains the offset (bias) parameters, ω = (W, V) the weight matrices, ε is a learning rate, and t is incremented after each training example, modulo n.
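As a concrete illustration, here is a minimal numpy sketch of this example; the toy data, the dimensions n_i, n_h, n_o, the learning rate eps and the weight decay coefficient lam are arbitrary choices made for illustration, not part of the example above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, params):
    """f_theta(x) = b + V sigmoid(c + W x), returning also the hidden vector h."""
    b, c, V, W = params
    h = sigmoid(c + W @ x)
    return b + V @ h, h

def sgd_step(x, y, params, lam=1e-4, eps=0.01):
    """One stochastic gradient step on the L2-regularized squared error."""
    b, c, V, W = params
    y_hat, h = forward(x, params)
    d_yhat = 2.0 * (y_hat - y)              # gradient of ||y_hat - y||^2 w.r.t. y_hat
    d_V = np.outer(d_yhat, h)
    d_a = (V.T @ d_yhat) * h * (1.0 - h)    # back through the sigmoid
    d_W = np.outer(d_a, x)
    W -= eps * (2 * lam * W + d_W)          # omega update: 2*lambda*omega + gradient
    V -= eps * (2 * lam * V + d_V)
    b -= eps * d_yhat                       # beta = (b, c): no weight decay on offsets
    c -= eps * d_a
    return params

# toy usage: t runs over the examples, modulo n, as in the text
rng = np.random.RandomState(0)
n_i, n_h, n_o, n = 3, 10, 2, 50
X = rng.randn(n, n_i)
Y = np.stack([X.sum(axis=1), X[:, 0] - X[:, 1]], axis=1)   # arbitrary toy targets
params = [np.zeros(n_o), np.zeros(n_h),
          0.1 * rng.randn(n_o, n_h), 0.1 * rng.randn(n_h, n_i)]
for t in range(2000):
    params = sgd_step(X[t % n], Y[t % n], params)
```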
6.2 Parametrizing a Learned Predictor
There are many ways to define the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described below,
while more advanced ones are left to later chapters, in particular Chapters 10 and 16.
6.2.1 Family of Functions
A motivation for the family of functions defined by multi-layer neural networks is to
compose affine transformations and non-linearities, where the choice of parameter values
controls the degree and location of these non-linearities. As discussed in Section 6.4
below, with the appropriate choice of parameters, multi-layer neural networks can in
principle approximate smooth functions, with more hidden units allowing one to achieve
better approximations.
A multi-layer neural network with more than one hidden layer can be defined by generalizing the above structure, e.g., as follows, where we chose to use hyperbolic tangent² activation functions instead of sigmoid activation functions:

    h^k = tanh(b^k + W^k h^{k−1})

where h^0 = x is the input of the neural net, h^k (for k > 0) is the output of the k-th hidden layer, which has weight matrix W^k and offset (or bias) vector b^k. If we want the output f_θ(x) to lie in some desired range, then we typically define an output non-linearity (which we did not have in the above Example 6.1.1). The non-linearity for the output layer is of course generally different from the tanh, depending on the type of output to be predicted and the associated loss function (see below).
As discussed in Chapter 12 (Section 12.5), there are several other non-linearities besides the sigmoid and the hyperbolic tangent which have been successfully used with neural networks. In particular, alternative non-linear units such as the rectified linear unit (max(0, b + w·x)) and the maxout unit (max_i(b_i + W_{:,i}·x)) have been particularly successful in the case of deep feedforward or convolutional networks. These and other non-linear neural network activation functions commonly found in the literature are summarized below.
• Rectifier or rectified linear unit (ReLU) or positive part: h(x) = max(0, b + w·x), also written (b + w·x)_+.
• Hyperbolic tangent: h(x) = tanh(b + w·x).
• Sigmoid: h(x) = 1/(1 + e^{−(b+w·x)}).
• Softmax: h(x) = softmax(b + W x). It is mostly used as an output non-linearity for predicting discrete probabilities. See definition and discussion below, Eq. 6.1.
² which is linearly related to the sigmoid as follows: tanh(x) = 2 × sigmoid(2x) − 1
• Radial basis function or RBF unit: h(x) = e^{−||w−x||²₂}. This is heavily used in kernel SVMs (Boser et al., 1992; Schölkopf et al., 1999) and has the advantage that such units can be easily initialized as a random subset of the input examples (Powell, 1987; Niranjan and Fallside, 1990).
• Softplus: this is a smooth version of the rectifier, introduced in Dugas et al. (2001) for function approximation and in Nair and Hinton (2010) in RBMs. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter, in spite of their very similar shape and in spite of the softplus being differentiable everywhere with a non-zero derivative, contrary to the rectifier.
• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded: h(x) = max(−1, min(1, x)). It was introduced by Collobert (2004).
• Absolute value rectification: h(x) = |x| (may be applied on the dot product output or on the output of a tanh unit). It is also related to the rectifier and has been used for object recognition from images (Jarrett et al., 2009b), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination.
• Maxout: this generalizes the rectifier but introduces multiple weight vectors w_i (called filters) for each unit: h(x) = max_i(b_i + w_i·x). It is discussed in more detail later in the book.
This is not an exhaustive list but covers most of the non-linearities and unit computa-
tions seen in the deep learning and neural nets literature. Many variants are possible.
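For concreteness, the units above can be sketched in a few lines of numpy; this is only an illustrative sketch, with the pre-activation a = b + w·x assumed to have been computed already (except for maxout, which keeps its own filters):

```python
import numpy as np

def relu(a):               # rectifier / positive part
    return np.maximum(0.0, a)

def softplus(a):           # smooth version of the rectifier, written in a stable form
    return np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)

def hard_tanh(a):
    return np.clip(a, -1.0, 1.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))   # additive shift-invariance makes this numerically stable
    return e / e.sum()

def maxout(x, B, W):
    # W has one filter per row (shape: num_filters x dim_x), with offsets in B
    return np.max(B + W @ x, axis=0)
```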
As discussed in Section 6.3, the structure (also called architecture) of the family of
input-output functions can be varied in many ways, which calls for a generic principle
for efficiently computing gradients, described in that section. For example, a common
variation is to connect layers that are not adjacent, with so-called skip connections,
which are found in the visual cortex (for which the word “layer” should be replaced by
the word “area”). Other common variations depart from a full connectivity between
adjacent layers. For example, each unit at layer k may be connected to only a subset of
units at layer k − 1. A particular case of such a form of sparse connectivity is discussed in Chapter 9 with convolutional networks. In general, the set of connections between
units of the whole network only needs to form a directed acyclic graph in order to define
a meaningful computation (see the flow graph formalism below, Section 6.3). When
units of the network are connected to themselves through a cycle, one has to properly
define what computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 10.
6.2.2 Loss Function and Conditional Log-Likelihood
In the 80's and 90's the most commonly used loss function was the squared error L(f_θ(x), y) = ||f_θ(x) − y||². As discussed in Section 5.8.1, if f is unrestricted (non-parametric), minimizing the expected value of the loss function over the data-generating distribution P(X, Y) yields f(x) = E[Y|X], the true conditional expectation of Y given X³. See also Section 5.8.1 for more details. This tells us what the neural network
is trying to learn. Replacing the squared error by an absolute value makes the neural
network try to estimate not the conditional expectation but the conditional median.
However, when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood⁴ have been found to be more appropriate than the squared error. In the case where y ∈ {0, 1} is binary this gives L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)), and it can be shown that the optimal (non-parametric) f minimizing this criterion is f(x) = P(Y = 1|x). Note that in order for the above expression of the criterion to make sense, f_θ(x) must be strictly between 0 and 1 (an undefined or infinite value would otherwise arise). To achieve this, it is common to use the sigmoid as non-linearity for the output layer, which matches well with the Binomial negative log-likelihood criterion⁵. As explained below (Section 6.2.2), the cross-entropy criterion allows gradients to pass through the output non-linearity even when the neural network produces a confidently wrong answer, unlike the squared error criterion coupled with a sigmoid or softmax non-linearity.
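A brief numpy sketch of this loss, written directly in terms of an assumed pre-sigmoid score a = b + w·x (the names here are for illustration only), avoids the undefined values mentioned above because it never takes the log of 0:

```python
import numpy as np

def bernoulli_nll(a, y):
    """-y log sigmoid(a) - (1-y) log(1 - sigmoid(a)),
    rewritten as softplus(a) - y*a for numerical stability."""
    softplus = np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)
    return softplus - y * a
```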
Learning a Conditional Probability Model
More generally, one can define a loss function as corresponding to a conditional log-likelihood, i.e., the negative log-likelihood (NLL) criterion

    L_NLL(f_θ(x), y) = − log P_θ(Y = y | X = x).

See Section 5.8.2, which shows that this criterion yields an estimator of the true conditional probability of Y given X. For example, if Y is a continuous random variable and we assume that, given X, it has a Gaussian distribution with mean f_θ(X) and variance σ², then

    − log P_θ(Y|X) = (f_θ(X) − Y)² / (2σ²) + (1/2) log(2πσ²).
Up to an additive and multiplicative constant (which would give the same choice of θ),
this is therefore equivalent to the squared error loss. Once we understand this principle,
we can readily generalize it to other distributions, as appropriate. For example, it
is straightforward to generalize the univariate Gaussian to the multivariate case, and
under appropriate parametrization consider the variance to be a parameter or even
parametrized function of x (for example with output units that are guaranteed to be
positive, or forming a positive definite matrix).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion corresponds to the conditional log-likelihood associated with the Bernoulli distribution with probability p = f_θ(x) of generating Y = 1 given X = x (and probability 1 − p of generating Y = 0):

    − log P_θ(Y|X) = −1_{Y=1} log p − 1_{Y=0} log(1 − p) = −Y log f_θ(X) − (1 − Y) log(1 − f_θ(X)).

³ Proving this is not difficult and is very instructive.
⁴ This is often called cross entropy in the literature, even though the term cross entropy should also apply to many other losses that can be viewed as negative log-likelihood, discussed below in more detail.
⁵ In reference to statistical models, this "match" between the loss function and the output non-linearity is similar to the choice of a link function in generalized linear models (McCullagh and Nelder, 1989).
Softmax
When y is discrete (in some finite set, say {1, . . . , N}) but not binary, the Bernoulli distribution is extended to the Multinoulli (Murphy, 2012, Section 3.10.2)⁶. This distribution is specified by a vector of N − 1 probabilities whose sum is less than or equal to 1, each element of which provides the probability p_i = P(y = i|x). We thus need a vector-valued output non-linearity that produces such a vector of probabilities, and a commonly used non-linearity for this purpose is the softmax non-linearity introduced earlier:

    p = softmax(a),    p_i = e^{a_i} / Σ_j e^{a_j},    (6.1)

where typically a = b + W h is the vector of scores whose elements a_i are associated with each category i. The corresponding loss function is therefore L_NLL(p, y) = − log p_y.
Note how minimizing this loss will push a_y up (increase the score a_y associated with the correct label y) while pushing a_i (for i ≠ y) down (decreasing the scores of the other labels, in the context x). The first effect comes from the numerator of the softmax while the second effect comes from the normalizing denominator. These forces cancel on a specific example only if p_y = 1, and they cancel on average over examples (say sharing the same x) if p_i equals the fraction of times that y = i for this value of x. The same principles and the role of the normalization constant (or "partition function") can be seen at play in the training of Markov random fields, Boltzmann machines and RBMs, in Chapter 14.

Note other interesting properties of the softmax. First of all, the gradient of the Multinoulli negative log-likelihood with respect to the softmax argument a is

    ∇_a L_NLL(p(a), y) = (∂(− log p_y(a))/∂p_y) · ∇_a p_y(a) = −(1/p_y(a)) · p_y(a)(1_y − p) = p − 1_y,    (6.2)

where 1_y denotes the one-hot vector with a 1 in position y and zeros elsewhere. This gradient does not saturate: it does not vanish when the output probabilities approach 0 or 1, except when the model is providing the correct answer. Specifically, let us consider the case where the correct label is i, i.e. y = i. The element of the gradient associated with an erroneous label, say j ≠ i, is

    ∂L_NLL(p(a), y)/∂a_j = p_j.    (6.3)
⁶ The Multinoulli is also known as the categorical or generalized Bernoulli distribution, which is a special case of the Multinomial distribution with one trial, and the term multinomial is often used even in this case.
So if the model correctly predicts a low probability for the class y = j, i.e. p_j ≈ 0, then the gradient on a_j is also close to zero.
There are other loss functions, such as the squared error applied to softmax (or sigmoid) outputs (which was popular in the 80's and 90's), which have a vanishing gradient when an output unit saturates (when the derivative of the non-linearity is near 0), even if the output is completely wrong (Solla et al., 1988). This may be a problem because it means that the parameters will basically not change, even though the output is wrong.

To see how the squared error interacts with the softmax output, we need to introduce a one-hot encoding of the label, y = [0, . . . , 0, 1, 0, . . . , 0], i.e., for the label y = i, we have y_i = 1 and y_j = 0 for j ≠ i. We will again consider the output of the network to be p = softmax(a), where, as before, a is the input to the softmax function (i.e. a = b + W h, with h the output of the last hidden layer).

For the squared error loss L_2(p(a), y) = ||p(a) − y||², the gradient of the loss with respect to the input vector to the softmax, a, is given by

    ∇_a L_2(p(a), y) = (∂L_2/∂p)(∂p/∂a) = 2(p(a) − y) ⊙ p ⊙ (1 − p),

where ⊙ indicates an element-wise multiplication. Let us again consider the case where the correct label is i. The component of the gradient associated with label j is

    ∂L_2(p(a), y)/∂a_j = 2(p_j(a) − y_j) p_j (1 − p_j).    (6.4)

So if the model correctly predicts a low probability p_j ≈ 0 for an erroneous label j ≠ i, the gradient is again close to zero, but unlike for the negative log-likelihood, the gradient also vanishes whenever p_j saturates near 0 or 1 even when the prediction is wrong: in particular, the component for the correct label, 2(p_i − 1) p_i (1 − p_i), goes to zero when p_i ≈ 0, i.e., precisely when the model is confidently wrong.
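The contrast can be checked numerically; the following sketch evaluates both gradients at a confidently wrong prediction (the particular scores are arbitrary, chosen only for illustration):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([5.0, -5.0, -5.0])   # model is very confident in class 0
y = 2                              # but the correct class is 2
p = softmax(a)
onehot = np.eye(3)[y]

grad_nll = p - onehot                       # Eq. 6.2: stays large when wrong
grad_l2 = 2 * (p - onehot) * p * (1 - p)    # Eq. 6.4: vanishes as p saturates

print(grad_nll)   # roughly [ 1.0, 0.0, -1.0 ]: a strong corrective signal
print(grad_l2)    # all entries near 0: almost no learning signal
```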
Another property that could potentially be interesting is that the softmax output is invariant under additive changes of its input vector: softmax(a) = softmax(a + b) when b is a scalar added to all the elements of vector a. Finally, it is interesting to think of the softmax as a way to create a form of competition between the units (typically output units, but not necessarily) that participate in it: the strongest filter output a_i gets reinforced (because the exponential increases faster for larger values) and the other units get inhibited. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex, and at the extreme (when the a_i's are large in magnitude) becomes a form of winner-take-all (one of the outputs is nearly 1 and the others are nearly 0). A more computationally expensive form of competition is found with sparse coding, described in Section 20.3.
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution P(y|ω) with parameters ω, we can construct a conditional distribution P(y|x) by making ω a parametrized function of x, and learning that function:

    P(y | ω = f_θ(x))

where f_θ(x) is the output of a predictor, x is its input, and y can be thought of as a target. This established choice of words comes from the common cases of classification and regression, where f_θ(X) is really a prediction associated with Y, or with its expected value. However, in general ω = f_θ(X) may contain parameters of the distribution of Y other than its expected value. For example, it could contain its variance or covariance, in the case where Y is conditionally Gaussian. In the above examples, with the squared error loss, ω is the mean of the Gaussian which captures the conditional distribution of Y (which means that the variance is considered fixed, not a function of X). In the common classification case, ω contains the probabilities associated with the various events of interest.

Once we view things in this way, we automatically get as the natural cost function the negative log-likelihood L(X, Y) = − log P(Y | ω = f_θ(X)). Besides θ, there could be other parameters that control the distribution of Y. For example, we may wish to learn the variance of a conditional Gaussian, even when the latter is not a function of X. In this case, the solution can be computed analytically because the maximum likelihood estimator of the variance is simply the empirical mean of the squared difference between the observations Y and their expected value (here f_θ(X)). In other words, the conditional variance can be estimated from the mean squared error.
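A short numpy sketch of that last point, assuming predictions f and targets y are given as arrays (these names are chosen here only for illustration):

```python
import numpy as np

def mle_variance(f, y):
    """Maximum likelihood estimate of a fixed conditional variance:
    the mean squared difference between targets and predictions."""
    return np.mean((y - f) ** 2)
```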
Besides the Gaussian, a simple and common example is the case where Y is binary
(i.e. Bernoulli distributed), where it is enough to specify ω = P (Y = 1|X). In the
Multinoulli case, ω is generally specified by a vector of probabilities summing to 1, e.g.,
via the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural networks
is the mixture model, and in particular the Gaussian mixture model, introduced in Sec-
tion 3.10.5. Neural networks that compute the parameters of a mixture model were
introduced in Jacobs et al. (1991); Bishop (1994). In the case of the Gaussian mixture
model with N components,
    P(y|x) = Σ_{i=1}^{N} P(C = i|x) N(y; µ_i(x), Σ_i(x)),

the neural network must have three kinds of outputs, P(C = i|x), µ_i(x), and Σ_i(x), which must satisfy different constraints:

1. Mixture components P(C = i|x): these form a Multinoulli distribution over the N different components C, and can typically be obtained by a softmax over an N-vector, to guarantee that these outputs are positive and sum to 1.

2. Means µ_i(x): these indicate the center or mean associated with each Gaussian component, and are unconstrained (typically with no non-linearity at all for these output units). If Y is a d-vector, then the network must output an N × d matrix containing all these N d-vectors.

3. Covariances Σ_i(x): these specify the covariance matrix for each component i. In many models the variance is unconditional (does not depend on x) and diagonal or even scalar. If it is diagonal or scalar, only positivity must be enforced (e.g., using the softplus non-linearity). If it is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ_i(x) = B_i(x) B_i(x)^⊤, where B_i is an unconstrained square matrix. One practical issue when the matrix is full is that computing the likelihood is expensive, requiring O(d³) computation for the determinant and inverse of Σ_i(x) (or equivalently, and more commonly done, its eigendecomposition or that of B_i(x)).
It has been reported that gradient-based optimization of conditional Gaussian mixtures
(on the output of neural networks) can be finicky, in part because one gets divisions (by
the variance) which can be numerically unstable (when some variance gets to be small
for a particular example, yielding very large gradients). One solution is to clip gradients
(see Section 10.6.7 and Mikolov (2012); Pascanu and Bengio (2012); Graves (2013);
Pascanu et al. (2013a)), while another is to scale the gradients heuristically (Murray
and Larochelle, 2014).
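As an illustrative sketch (not the formulation of any specific paper), the three kinds of outputs can be produced from a last hidden vector h as follows, assuming diagonal covariances and hypothetical dimensions N and d:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(-np.abs(a))) + np.maximum(a, 0.0)

def mixture_params(h, Wc, Wmu, Wsig, N, d):
    """Map the last hidden layer h to Gaussian mixture parameters.
    Wc: (N, dim_h), Wmu: (N*d, dim_h), Wsig: (N*d, dim_h) are weight
    matrices (biases omitted for brevity)."""
    weights = softmax(Wc @ h)                      # P(C = i | x), positive, sums to 1
    means = (Wmu @ h).reshape(N, d)                # unconstrained means
    variances = softplus(Wsig @ h).reshape(N, d)   # positive diagonal variances
    return weights, means, variances
```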
Multiple Output Variables
When Y is actually a tuple formed by multiple random variables Y_1, Y_2, . . . , Y_k, then one has to choose an appropriate form for their joint distribution, conditional on X = x. The simplest and most common choice is to assume that the Y_i are conditionally independent, i.e.,

    P(Y_1, Y_2, . . . , Y_k | x) = ∏_{i=1}^{k} P(Y_i | x).

This brings us back to the single variable case, especially since the log-likelihood now decomposes into a sum of terms log P(Y_i|x). If each P(Y_i|x) is separately parametrized (e.g. by a different neural network), then we can train these neural networks independently. However, a more common and powerful choice assumes that the different variables Y_i share some common factors that can be represented in some hidden layer of the network
(such as the top hidden layer). See Sections 6.5 and 7.12 for a deeper treatment of the
notion of underlying factors of variation and multi-task training. See also Figure 7.6
illustrating these concepts.
If the conditional independence assumption is considered too strong, what can we do?
At this point it is useful to step back and consider everything we know about learning
a joint probability distribution. Any probability distribution P_ω(Y) parametrized by parameters ω can be turned into a conditional distribution P_θ(Y|x) by making ω a function ω = f_θ(x) parametrized by θ. This is not only true of the simple parametric
distributions we have seen above (Gaussian, Bernoulli, Multinoulli) but also of more
complex joint distributions typically represented by a graphical model. Much of this
book is devoted to such graphical models (see Chapters 14, 19, 20, 21). For more
concrete examples of such structured output models with deep learning, see Section 13.4.
6.2.3 Training Criterion and Regularizer
The loss function (often interpretable as a negative log-likelihood) tells us what we would
like the learner to capture. Maximizing the conditional log-likelihood over the true distribution, i.e., minimizing the expected loss E_{X,Y}[− log P_θ(Y|X)], makes P_θ(Y|X) estimate the true P(Y|X) associated with the unknown data generating distribution, within the boundaries of what the chosen family of functions allows. See Section 5.8.2 for a proof. In practice we cannot minimize this expectation because we do not know P(X, Y) and because computing and minimizing this integral exactly would generally be intractable. Instead we are going to approximately minimize a training criterion C(θ) based on the empirical average of the loss (over the training set). The simplest such criterion is the average training loss (1/n) Σ_{t=1}^{n} L(f_θ(x^{(t)}), y^{(t)}), where the training set is a set of n examples (x^{(t)}, y^{(t)}). However, better results can often be achieved by
crippling the learner and preventing it from simply finding the best θ that minimizes the
average training loss. This means that we combine the evidence coming from the data
(the training set average loss) with some a priori preference on the different values that
θ can take (the regularizer). If this concept (and the related concepts of generalization,
overfitting and underfitting) are not clear, please return to Sections 5.3 and 5.6.2 for a
refresher.
The most common regularizer is simply an additive term equal to a regularization coefficient λ times the squared norm of the parameters⁷, ||θ||²₂. This regularizer is often called weight decay or L2 weight decay or shrinkage, because adding the squared L2 norm to the training criterion pushes the weights towards zero, in proportion to their magnitude. For example, when using stochastic gradient descent, each step of the update with regularizer term (λ/2)||θ||² would look like

    θ ← θ − ε (∇_θ L(f_θ(x), y) + λθ)

where ε is the learning rate (see Section 4.3). This can be rewritten

    θ ← θ(1 − ελ) − ε ∇_θ L(f_θ(x), y)

where we see that the first term systematically shaves off a small proportion of θ before adding the gradient.
⁷ In principle, one could have different priors on different parameters, e.g., it is common to treat the output weights with a separate regularization coefficient, but the more hyper-parameters, the more difficult is their optimization, as discussed in Chapter 12.

Another commonly used regularizer is the so-called L1 regularizer, which adds a term proportional to the L1 (absolute value) norm, ||θ||₁ = Σ_i |θ_i|. The L1 regularizer also prefers values that are small, but whereas the L2 weight decay has only a weak preference for 0 rather than small values, the L1 regularizer continues pushing the parameters towards 0 with a constant gradient even when they get very small. As a consequence, it tends to bring some of the parameters to exactly 0 (how many depends on how large the regularization coefficient λ is chosen). When optimizing with an approximate, iterative and noisy method such as stochastic gradient descent, no exact 0 is obtained. On the other hand, the L1 regularizer tolerates large values of some parameters (only additively removing a small quantity compared to the situation without regularization), whereas the L2 weight decay aggressively punishes and prevents large parameter values. The L1 and L2 regularizers can be combined, as in the so-called elastic net (Zou and Hastie, 2005), and this is commonly done in deep networks⁸.
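The two penalties translate into slightly different parameter updates; here is a small numpy sketch of a single SGD step with each, assuming grad is the gradient of the unregularized loss, eps the learning rate and lam the regularization coefficient (names chosen for illustration):

```python
import numpy as np

def sgd_step_l2(theta, grad, eps=0.01, lam=1e-4):
    # L2 / weight decay: shrink theta by a fraction proportional to its value
    return theta * (1 - eps * lam) - eps * grad

def sgd_step_l1(theta, grad, eps=0.01, lam=1e-4):
    # L1: push each parameter towards 0 with a constant-magnitude step
    return theta - eps * (grad + lam * np.sign(theta))
```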
Note how in the context of maximum likelihood, regularizers can generally be in-
terpreted as Bayesian priors on the parameters P (θ) or on the learned function, as
discussed in Sections 5.6.2 and 5.7. Later in this book we discuss regularizers that are
data-dependent (i.e., cannot be expressed easily as a pure prior), such as the contractive
penalty (Chapter 16) as well as regularization methods that are difficult to interpret
simply as added terms in the training criterion, such as dropout (Section 7.11).
6.2.4 Optimization Procedure
Once a training criterion is defined, we enter the realm of iterative optimization proce-
dures, with the slight twist that if possible we want to monitor held-out performance
(for example the average loss on a validation set, which usually does not include the
regularizing terms) and not just the value of the training criterion. Monitoring the val-
idation set performance allows one to stop training at a point where generalization is
estimated to be the best among the training iterations. This is called early stopping and
is a standard machine learning technique, discussed in Sec. 7.8.
Section 4.3 has already covered the basics of gradient-based optimization. The sim-
plest such technique is stochastic gradient descent (with minibatches), in which the
parameters are updated after computing the gradient of the average loss over a mini-
batch of examples (e.g. 128 examples) and making an update in the direction opposite
to that gradient (or opposite to some accumulated average of such gradients, i.e., the
momentum technique, reviewed in Section 12.5.3). The most commonly used optimiza-
tion procedures for multi-layer neural networks and deep learning in general are either
variants of stochastic gradient descent (typically with minibatches), second-order meth-
ods (the most commonly used being L-BFGS and nonlinear conjugate gradients) applied
on large minibatches (e.g. of size 10000) or on the whole training set at once, or nat-
ural gradient methods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu
and Bengio, 2013). Exact second-order and natural gradient methods are computation-
ally too expensive for large neural networks because they involve matrices of dimension
equal to the square of the number of parameters. Approximate methods are discussed
in Section 8.4.
On smaller datasets or when computing can be parallelized, second-order methods
have a computational advantage because they are easy to parallelize and can still perform
many updates per unit of time. On larger datasets (and in the limit of an infinite dataset,
i.e., a stream of training examples) one cannot afford to wait to see the whole dataset before making an update, which favors either stochastic gradient descent variants (possibly with minibatches, to take advantage of fast or parallelized implementations of matrix-matrix products) or second-order methods with minibatches.

⁸ See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.html#l1-and-l2-regularization
A more detailed discussion of issues arising with optimization methods in deep learn-
ing can be found in Chapter 8. In particular, many design choices in the construction
of the family of functions, loss function and regularizer can have a major impact on
the difficulty of optimizing the training criterion. Furthermore, instead of using generic
optimization techniques, one can design optimization procedures that are specific to
the learning problem and chosen architecture of the family of functions, for example by
initializing the parameters of the final optimization routine from the result of a differ-
ent optimization (or a series of optimizations, each initialized from the previous one).
Because of the non-convexity of the training criterion, the initial conditions can make
a very important difference, and this is what is exploited in the various pre-training
strategies, Sections 8.6.4 and 17.1, as well as with curriculum learning (Bengio et al.,
2009), Section 8.7.
6.3 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning algo-
rithm for multi-layer neural networks. Actually it just means the method for computing
gradients in such networks. Furthermore, it is generally understood as something very
specific to multi-layer neural networks, but once its derivation is understood, it can eas-
ily be generalized to arbitrary functions (for which computing a gradient is meaningful),
and we describe this generalization here, focusing on the case of interest in machine
learning where the output of the function to differentiate (e.g., the loss L or the training
criterion C) is a scalar and we are interested in its derivative with respect to a set of pa-
rameters (considered to be the elements of a vector θ). The partial derivative of C with
respect to θ (called the gradient) tells us whether θ should be increased or decreased
in order to decrease C, and is a crucial tool in optimizing the training objective. It
can be readily proven that the back-propagation algorithm has optimal computational
complexity in the sense that there is no algorithm that can compute the gradient faster
(in the O(·) sense, i.e., up to an additive and multiplicative constant).
The basic idea of the back-propagation algorithm is that the partial derivative of
the cost C with respect to parameters θ can be decomposed recursively by taking into
consideration the composition of functions that relate θ to C, via intermediate quantities
that mediate that influence, e.g., the activations of hidden units in a deep neural network.
In this section we call these intermediate values u_j (indexed by j) and consider the general case in which they form a directed acyclic graph that has C as its final node u_N, which depends on all the other nodes u_j. The back-propagation algorithm exploits the chain rule for derivatives to compute ∂C/∂u_j when ∂C/∂u_i has already been computed for the successors u_i of u_j in the graph, e.g., the hidden units in the next layer downstream. This recursion can be initialized by noting that ∂C/∂u_N = 1, and at each step it only requires the partial derivatives associated with each arc of the graph, ∂u_i/∂u_j, when u_i is a successor of u_j.
Figure 6.3: The chain rule, illustrated in the simplest possible case, with z a scalar function of y, which is itself a scalar function of x. A small change Δx in x gets turned into a small change Δy in y through the partial derivative ∂y/∂x, from the first-order Taylor approximation of y(x), and similarly for z(y). Plugging the equation for Δy into the equation for Δz yields the chain rule.
6.3.1 Chain Rule
In order to apply the back-propagation algorithm, we take advantage of the chain rule:

    ∂C(g(θ))/∂θ = (∂C(g(θ))/∂g(θ)) (∂g(θ)/∂θ)    (6.5)

which works also when C, g or θ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in θ will propagate into a small change in g(θ) by getting multiplied by ∂g(θ)/∂θ. Similarly, a small change in g(θ) will propagate into a small change in C(g(θ)) by getting multiplied by ∂C(g(θ))/∂g(θ). Hence a small change in θ first gets multiplied by ∂g(θ)/∂θ to obtain the change in g(θ), and this then gets multiplied by ∂C(g(θ))/∂g(θ) to obtain the change in C(g(θ)). Hence the ratio of the change in C(g(θ)) to the change in θ is the product of these partial derivatives. The partial derivative measures the locally linear influence of a variable on another. This is illustrated in Figure 6.3, with x = θ, y = g(θ), and z = C(g(θ)).

Now, if g is a vector, we can rewrite the above as follows:

    ∂C(g(θ))/∂θ = Σ_i (∂C(g(θ))/∂g_i(θ)) (∂g_i(θ)/∂θ)

which sums over the influences of θ on C(g(θ)) through all the intermediate variables g_i(θ). This is illustrated in Figure 6.4 with x = θ, y_i = g_i(θ), and z = C(g(θ)).
Figure 6.4: Top: The chain rule, when there are two intermediate variables y_1 and y_2 between x and z, creating two paths for changes in x to propagate and yield changes in z. Bottom: the more general case, with n intermediate variables y_1 to y_n.
6.3.2 Back-Propagation in an MLP
Whereas Example 6.1.1 illustrated the case of an MLP with a single hidden layer, let us consider in this section back-propagation for an ordinary but deep MLP, i.e., like the above vanilla MLP but with several hidden layers. For this purpose, we will recursively apply the chain rule illustrated in Figure 6.4. The algorithm proceeds by first computing the gradient of the cost C with respect to output units, which is used to compute the gradient of C with respect to the top hidden layer activations. We can then continue computing the gradients of lower-level hidden units one layer at a time in the same way. The gradients on hidden and output units can be used to compute the gradient of C with respect to the parameters (e.g. weights and biases) of each layer (i.e., those that directly contribute to the output of that layer).
Algorithm 6.1 describes in matrix-vector form the forward propagation computation for a classical multi-layer network with M layers.
Algorithm 6.1 Forward computation associated with input x for a deep neural network with ordinary affine layers composed with an arbitrary elementwise differentiable (almost everywhere) non-linearity f. There are M such layers, each mapping their vector-valued input h^{(k−1)} to a pre-activation vector a^{(k)} via a weight matrix W^{(k)}, which is then transformed via f into h^{(k)}. The input vector x corresponds to h^{(0)} and the predicted output ŷ corresponds to h^{(M)}. The cost function L(ŷ, y) depends on the output ŷ and on a target y (see Section 6.2.2 for examples of loss functions). The loss may be added to a regularizer Ω (see Section 6.2.3 and Chapter 7) to obtain the example-wise cost C. Algorithm 6.2 shows how to compute gradients of C with respect to parameters W and b.

  h^{(0)} = x
  for k = 1 . . . , M do
    a^{(k)} = b^{(k)} + W^{(k)} h^{(k−1)}
    h^{(k)} = f(a^{(k)})
  end for
  ŷ = h^{(M)}
  C = L(ŷ, y) + λΩ
Each layer computes an affine transformation (defined by a bias vector b^{(k)} and a weight matrix W^{(k)}) followed by a non-linearity f. In general, the non-linearity may be different on different layers, and this is typically the case for the output layer (see Section 6.2.1). Hence each unit at layer k computes an output h_i^{(k)} as follows:

    a_i^{(k)} = b_i^{(k)} + Σ_j W_{ij}^{(k)} h_j^{(k−1)}
    h_i^{(k)} = f(a_i^{(k)})    (6.6)

where we separate the affine transformation from the non-linear activation operations for ease of exposition of the back-propagation computations.
These are described in matrix-vector form by Algorithm 6.2 and proceed from the
output layer towards the first hidden layer, as outlined above.
6.3.3 Back-Propagation in a General Flow Graph
More generally, we can think about decomposing a function C(θ) into a more complicated graph of computations. This graph is called a flow graph. Each node u_i of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values u_j of other nodes, with j < i. The nodes satisfy a partial order which dictates in what order the computation can proceed. In practical implementations of such functions (e.g. with the criterion C(θ) or its value estimated on a minibatch), the final computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpy library can perform on arrays of numbers).
Algorithm 6.2 Backward computation for the deep neural network of Algorithm 6.1, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^{(k)} for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

  After the forward computation, compute the gradient on the output layer:
    g ← ∇_ŷ C = ∇_ŷ L(ŷ, y) + λ∇_ŷ Ω
    (typically Ω is only a function of the parameters, not of the activations, so the last term would be zero)
  for k = M down to 1 do
    Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
      g ← ∇_{a^{(k)}} C = g ⊙ f'(a^{(k)})
    Compute gradients on weights and biases (including the regularization term, where needed):
      ∇_{b^{(k)}} C = g + λ∇_{b^{(k)}} Ω
      ∇_{W^{(k)}} C = g h^{(k−1)⊤} + λ∇_{W^{(k)}} Ω
    Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
      g ← ∇_{h^{(k−1)}} C = W^{(k)⊤} ∇_{a^{(k)}} C
  end for
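To make the two algorithms concrete, here is a minimal numpy sketch of the forward and backward passes for a tanh network with squared error loss and no regularizer (these particular choices of f, L and Ω = 0, and the representation of the parameters as lists of per-layer arrays, are illustrative assumptions, not requirements of the algorithms):

```python
import numpy as np

def forward(x, weights, biases):
    """Algorithm 6.1: returns the pre-activations a^(k) and outputs h^(k) of each layer."""
    h, activations, outputs = x, [], [x]
    for W, b in zip(weights, biases):
        a = b + W @ h
        h = np.tanh(a)
        activations.append(a)
        outputs.append(h)
    return activations, outputs

def backward(y, activations, outputs, weights):
    """Algorithm 6.2: gradients of C = ||y_hat - y||^2 w.r.t. every W^(k) and b^(k)."""
    grads_W, grads_b = [], []
    g = 2.0 * (outputs[-1] - y)                        # gradient on the output y_hat
    for k in reversed(range(len(weights))):
        g = g * (1.0 - np.tanh(activations[k]) ** 2)   # through f' (here, tanh')
        grads_b.insert(0, g)
        grads_W.insert(0, np.outer(g, outputs[k]))     # g h^(k-1)^T
        g = weights[k].T @ g                           # propagate to the layer below
    return grads_W, grads_b
```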
We will define back-propagation in a general flow graph, using the following generic notation: u_i = f_i(a_i), where a_i is the list of arguments for the application of f_i to the values u_j for the parents of i in the graph: a_i = (u_j)_{j ∈ parents(i)}. This is illustrated in Figure 6.5.
Figure 6.5: Illustration of recursive forward computation, where at each node u_i we compute a value u_i = f_i(a_i), with a_i being the list of values from the parents u_j of node u_i. Following Algorithm 6.3, the overall inputs to the graph are u_1 . . . , u_M (e.g., the parameters we may want to tune during training), and there is a single scalar output u_N (e.g., the loss which we want to minimize).
The overall computation of the function represented by the flow graph can thus be
summarized by the forward computation algorithm, Algorithm 6.3.
In addition to having some code that tells us how to compute f_i(a_i) for some values in the vector a_i, we also need some code that tells us how to compute its partial derivatives, ∂f_i(a_i)/∂a_{ik}, with respect to any immediate argument a_{ik}. Let k = π(i, j) denote the index of u_j in the list a_i. Note that u_j could influence u_i through multiple paths. Whereas ∂u_i/∂u_j would denote the total gradient adding up all of these influences, ∂f_i(a_i)/∂a_{ik} only denotes the derivative of f_i with respect to its specific k-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from u_j to u_i. In general, when manipulating partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument on the function output, with the other arguments considered fixed.
Algorithm 6.3 Flow graph forward computation. Each node computes a numerical value u_i by applying a function f_i to its argument list a_i, which comprises the values of previous nodes u_j, j < i, with j ∈ parents(i). The input to the flow graph is the vector x, and is set into the first M nodes u_1 to u_M. The output of the flow graph is read off the last (output) node u_N.

  for i = 1 . . . , M do
    u_i ← x_i
  end for
  for i = M + 1 . . . , N do
    a_i ← (u_j)_{j ∈ parents(i)}
    u_i ← f_i(a_i)
  end for
  return u_N
For example, consider f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}} and f_2(a_{2,1}) = a_{2,1}², while u_3 = f_3(u_2, u_1) and u_2 = f_2(u_1), as illustrated in Figure 6.6. The direct derivative of f_3 with respect to its argument a_{3,2} is ∂f_3/∂a_{3,2} = e^{a_{3,1} + a_{3,2}}, while if we consider the variables u_3 and u_1 to which these correspond, there are two paths from u_1 to u_3, and we obtain as the derivative the sum of partial derivatives over these two paths, ∂u_3/∂u_1 = e^{u_1 + u_2}(1 + 2u_1). The results are different because ∂u_3/∂u_1 involves not just the direct dependency of u_3 on u_1 but also the indirect dependency through u_2.
Figure 6.6: Illustration of the indirect effect and direct effect of variable u_1 on variable u_3 in a flow graph, which means that the derivative of u_3 with respect to u_1 must include the sum of two terms: one for the direct effect (derivative of u_3 with respect to its first argument) and one for the indirect effect through u_2 (involving the product of the derivative of u_3 with respect to u_2 times the derivative of u_2 with respect to u_1). Forward computation of the u_i's (as in Figure 6.5) is indicated with upward full arrows, while backward computation (of derivatives with respect to the u_i's, as in Figure 6.7) is indicated with downward dashed arrows.
Figure 6.7: Illustration of recursive backward computation, where we associate to each node j not just the values u_j computed in the forward pass (Figure 6.5, bold upward arrows) but also the gradient ∂u_N/∂u_j with respect to the output scalar node u_N. These gradients are recursively computed in exactly the opposite order, as described in Algorithm 6.4, by using the already computed ∂u_N/∂u_i of the children i of j (dashed downward arrows).
Armed with this understanding, we can define the back-propagation algorithm as follows, in Algorithm 6.4, which would be computed after the forward propagation (Algorithm 6.3) has been performed. Note the recursive nature of the application of the chain rule in Algorithm 6.4: we compute the gradient on node j by re-using the already computed gradients for children nodes i, starting the recurrence from the trivial ∂u_N/∂u_N = 1 that sets the gradient for the output node. This is illustrated in Figure 6.7.
Algorithm 6.4 Back-propagation computation of a flow graph (full, upward arrows), which itself produces an additional flow graph (dashed, backward arrows). See the forward propagation in a flow graph (Algorithm 6.3, to be performed first) and the required data structure. In addition, a quantity ∂u_N/∂u_i needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation π(i, j) is the index of u_j as an argument to f_i. The back-propagation algorithm efficiently computes ∂u_N/∂u_i for all i's (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node u_N with respect to the "inputs" u_1 . . . , u_M (which could be the parameters, in a learning setup). The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

  ∂u_N/∂u_N ← 1
  for j = N − 1 down to 1 do
    ∂u_N/∂u_j ← Σ_{i: j ∈ parents(i)} (∂u_N/∂u_i) (∂f_i(a_i)/∂a_{i,π(i,j)})
  end for
  return (∂u_N/∂u_i)_{i=1}^{M}
This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming⁹. Indeed, the derivative of the output node with respect to any node can also be written down in the following intractable form:

    ∂u_N/∂u_i = Σ_{paths u_{k_1} . . . , u_{k_n}: k_1 = i, k_n = N} ∏_{j=2}^{n} ∂u_{k_j}/∂u_{k_{j−1}}

where the paths u_{k_1} . . . , u_{k_n} go from the node k_1 = i to the final node k_n = N in the flow graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because it employs a dynamic programming strategy to re-use rather than re-compute partial sums associated with the gradients on intermediate nodes.
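The following sketch implements Algorithms 6.3 and 6.4 for a tiny flow graph; it is only a minimal illustration of the idea, and the chosen graph representation (each node storing its function, its per-argument partial derivative functions, and its parent indices) is one possible design among many:

```python
import math

# Each non-input node: (function, partial-derivative functions, parent indices).
# Example graph for u3 = exp(u1 + u2) with u2 = u1**2 (the graph of Figure 6.6):
graph = [
    None,                                                               # u1: input node
    (lambda a: a[0] ** 2, [lambda a: 2 * a[0]], [0]),                   # u2 = u1^2
    (lambda a: math.exp(a[0] + a[1]),                                   # u3 = e^(u2 + u1)
     [lambda a: math.exp(a[0] + a[1]), lambda a: math.exp(a[0] + a[1])],
     [1, 0]),
]

def forward(graph, inputs):                      # Algorithm 6.3
    u = list(inputs) + [None] * (len(graph) - len(inputs))
    for i in range(len(inputs), len(graph)):
        f, _, parents = graph[i]
        u[i] = f([u[j] for j in parents])
    return u

def backward(graph, u):                          # Algorithm 6.4
    grad = [0.0] * len(graph)
    grad[-1] = 1.0                               # d u_N / d u_N = 1
    for i in reversed(range(1, len(graph))):
        if graph[i] is None:
            continue
        f, partials, parents = graph[i]
        a = [u[j] for j in parents]
        for k, j in enumerate(parents):          # accumulate along each arc j -> i
            grad[j] += grad[i] * partials[k](a)
    return grad

u = forward(graph, [0.5])
print(backward(graph, u)[0])   # matches e^(u1 + u1^2) * (1 + 2*u1)
```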
⁹ Here we refer to "dynamic programming" in the sense of table-filling algorithms that avoid re-computing frequently used subexpressions. In the context of machine learning, "dynamic programming" can also refer to iterating Bellman's equations. That is not the kind of dynamic programming we refer to here.

Although the above was stated as if the u_i's were scalars, exactly the same procedure can be run with the u_i's being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients ∂u_N/∂u_i with a Jacobian of partial derivatives associated with the j → i arc of the graph, ∂f_i(a_i)/∂a_{i,π(i,j)}. In the case where minibatches are used during training, u_i would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the
former can be computed much more efficiently than a sequence of matrix-vector products
(e.g. with the BLAS library), especially so on modern computers and GPUs, that rely
more and more on parallelization through many cores.
More implementation issues regarding the back-propagation algorithm are discussed
in Chapters 11 and 12, regarding respectively GPU implementations (Section 11.2)
and debugging tricks (Section 12.5.1). Section 12.5.2 also discusses a natural general-
ization of the back-propagation algorithm in which one manipulates not numbers but
symbolic expressions, i.e., turning a program that performs a computation (decomposed
as a flow graph expressed in the forward propagation algorithm) into another program
(another flow graph) that performs gradient computation (i.e. automatically generat-
ing the back-propagation program, given the forward propagation graph). Using such
symbolic automatic differentiation procedures (implemented for example in the Theano library¹⁰), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.
6.4 Universal Approximation Properties and Depth
When there is no hidden layer, and for convex loss functions and regularizers (which is
typically the case), we obtain a convex training criterion. We end up with linear regres-
sion, logistic regression or other log-linear models that are used in many applications of
machine learning. This is appealing (no local minima, no saddle point) and convenient
(no strong dependency on initial conditions) but comes with a major disadvantage: such
learners are very limited in their ability to represent more complex functions. In the
case of classification tasks, we are limited to linear decision surfaces. In practice, this
limitation can be somewhat circumvented by handcrafting a large enough or discrimi-
nant enough set of hardwired features extracted from the raw input and on which the
linear predictor can be easily trained. See next section for a longer discussion about
feature learning.
To make the family of functions rich enough, the other option is to introduce one
or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi, 1994)
that with a single hidden layer of a sufficient size and a reasonable choice of non-
linearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent
any smooth function to any desired accuracy (the greater the required accuracy, the
more hidden units are required). However, these theorems generally do not tell us
how many hidden units will be required to achieve a given accuracy for particular data
distributions: in fact the worst-case scenario is generally going to require an exponential number of hidden units (to basically record every input configuration that needs to be distinguished). It is easier to understand how bad things can be by considering the case of binary input vectors: the number of possible binary functions on vectors v ∈ {0, 1}^d is 2^{2^d}, and selecting one such function requires 2^d bits, which will in general require O(2^d) degrees of freedom.
However, machine learning is not about learning any possible function with equal
¹⁰ http://deeplearning.net/software/theano/
ease: we mostly care about the kinds of functions that are needed to represent the
world around us or the task of interest. The no-free-lunch theorems for machine
learning (Wolpert, 1996) essentially say that no learning algorithm is “universal”, i.e.,
dominates all the others against all possible ground truths (data generating distribution
or target function). However, for a given family of tasks, some learning algorithms are
definitely better. Choosing a learning algorithm is equivalent to choosing a prior, i.e.,
making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only happen if there is
nothing to be learned any more because we already know the target function. Having a
universal approximation property just means that this prior is broad enough to cover or
come close to any possible target function – and this is a useful property – but is, in itself,
clearly not sufficient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
One of the central priors that is exploited in deep learning is that the target function
to be learned can be efficiently represented as a deep composition of simpler functions
(“features”), where features at one level can be re-used to define many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a flow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can define the operations performed
in each node in different ways. For example, do we consider a node that computes
the affine operations of a neural net followed by a node that computes the non-linear
neuron activation, or do we consider both of these operations as one node or one level
of the graph? Hence the notion of depth really depends on the allowed operations at
each node and one flow graph of a given depth can be converted into an equivalent
flow graph of a different depth by redefining which operations can be performed at each
node. However, for neural networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sufficient to
approximate any reasonable function to any desired finite accuracy.
From the point of view of approximation properties, the important result is that one
can find families of functions which can be approximated very efficiently when a partic-
ular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insufficient (or is limited to 2). Such
results have been proven for logic gates (Håstad, 1986), linear threshold units with non-negative weights (Håstad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectifier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a; Couprie
et al., 2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior
and that in order to take advantage of it, the learner's family of functions needs to allow
multiple levels of representation.
6.5 Feature / Representation Learning
Let us consider again the single layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves
a convex optimization problem (or, as with linear regression and some Gaussian process
regression models, even one with an analytic solution), i.e., an optimization problem with some convergence
guarantees towards a global optimum, irrespective of initial conditions. Simple and well-
understood optimization algorithms are available in this case. However, this limits the
representational capacity too much: many tasks, for a given choice of input representa-
tion x (the raw input features), cannot be solved by using only a linear predictor. What
are our options to avoid that limitation?
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Schölkopf
et al., 1999), i.e., to consider a fixed mapping from x to φ(x), where φ(x) is of
much higher dimension. In this case, f_θ(x) = b + w · φ(x) can be linear in the
parameters (and in φ(x)) and optimization remains convex (or even analytic). By
exploiting the kernel trick, we can computationally handle a high-dimensional
φ(x) (or even an infinite-dimensional one) so long as the kernel K(u, v) = φ(u)·φ(v)
(where · is the appropriate dot product for the space of φ(·)) can be computed
efficiently. If φ(x) is of high enough dimension, we can always have enough capacity
to fit the training set, but generalization is not at all guaranteed: it will depend
on the appropriateness of the choice of φ as a feature space for our task. Kernel
machines theory clearly identifies the choice of φ with the choice of a prior. This leads
to kernel engineering, which is equivalent to feature engineering, discussed next.
The other, very commonly used type of kernel embodies a very broad prior such as
smoothness, e.g., the Gaussian (or RBF) kernel K(u, v) = e^{−||u−v||^2} (a minimal
sketch of this kernel-based approach appears after this list).
Unfortunately, this prior may be insufficient, i.e., too broad and sensitive to the
curse of dimensionality, as introduced in Section 5.12 and developed in more detail
in Chapter 17.
2. Another option is to manually engineer the representation or features φ(x). Most
industrial applications of machine learning rely on hand-crafted features and most
of the research and development effort (as well as a very large fraction of the
scientific literature in machine learning and its applications) goes into designing
new features that are most appropriate to the task at hand. Clearly, faced with
a problem to solve and some prior knowledge in the form of representations that
are believed to be relevant, the prior knowledge can be very useful. This approach
is therefore common in practice, but is not completely satisfying because it involves
very task-specific engineering work and a laborious, never-ending effort to
improve systems by designing better features. If there were some more general
feature learning approaches that could be applied to a large set of related tasks
(such as those involved in AI), we would certainly like to take advantage of them.
Since humans seem to be able to learn a lot of new tasks (for which they were
not programmed by evolution), it seems that such broad priors do exist. This
whole question is discussed in a lot more detail in Bengio and LeCun (2007b), and
motivates the third option.
3. The third option is to learn the features, or learn the representation. In a sense,
it allows one to interpolate between the almost agnostic approach of a kernel ma-
chine with a general-purpose smoothness kernel (such as RBF SVMs and other
non-parametric statistical models) and full designer-provided knowledge in the
form of a fixed representation that is perfectly tailored to the task. This is equiv-
alent to the idea of learning the kernel, except that whereas most kernel learning
methods only allow very few degrees of freedom in the learned kernel, representa-
tion learning methods such as those discussed in this book (including multi-layer
neural networks) allow the feature function φ(·) to be very rich (with a num-
ber of parameters that can be in the millions or more, depending on the amount
of data available). This is equivalent to learning the hidden layers, in the case
of a multi-layer neural network. Besides smoothness (which comes for example
from regularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed above
(Section 6.4). Other priors are discussed in Chapter 17.
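As a purely illustrative instance of the first option above, the following sketch (Python with NumPy; the function names, bandwidth and ridge penalty are arbitrary choices made for this example) fits kernel ridge regression with a Gaussian (RBF) kernel on a toy one-dimensional problem. The implicit feature map φ is never computed explicitly: both training and prediction only require kernel evaluations K(u, v), which is the essence of the kernel trick.

import numpy as np

def rbf_kernel(U, V, gamma=1.0):
    # K[i, j] = exp(-gamma * ||U[i] - V[j]||^2), the Gaussian (RBF) kernel.
    sq_dists = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit(X, y, lam=1e-2, gamma=1.0):
    # Dual solution alpha = (K + lam I)^{-1} y; the predictor is linear in
    # phi(x) but is expressed entirely through kernel evaluations.
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy 1-D regression problem that no linear predictor on the raw input can
# fit, but that the fixed, very high-dimensional RBF feature space handles.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(50)
alpha = fit(X, y)
print(predict(X, alpha, np.array([[0.0], [1.5]])))

Whether such a predictor generalizes well depends entirely on whether the smoothness prior embodied by the fixed kernel is appropriate for the task, which is precisely the limitation that motivates learning the representation instead.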
This whole discussion is clearly not specific to neural networks and supervised learn-
ing, and is one of the central motivations for this book.
6.6 Piecewise Linear Hidden Units
Most of the recent improvement in the performance of deep neural networks can be
attributed to increases in computational power and the size of datasets. The machine
learning algorithms involved in recent state-of-the-art systems have mostly existed since
the 1980s, with very few recent conceptual advances contributing significantly to in-
creased performance.
One of the main algorithmic improvements that has had a significant impact is the
use of piecewise linear units, such as absolute value rectifiers and rectified linear units. Such
units consist of two linear pieces and their behavior is driven by a single weight vector.
Jarrett et al. (2009a) observed that “using a rectifying non-linearity is the single most
important factor in improving the performance of a recognition system” among several
different factors of neural network architecture design.
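Concretely, a minimal sketch of such activations (Python with NumPy; the names are chosen here for illustration) is given below: each unit applies a two-piece linear function to a single affine pre-activation z = w · x + b, and the corresponding derivative is piecewise constant, so gradients pass through unchanged wherever the unit is active.

import numpy as np

def relu(z):
    # Rectified linear unit: identity for z > 0, zero otherwise.
    return np.maximum(0.0, z)

def abs_rectifier(z):
    # Absolute value rectifier: two linear pieces with slopes -1 and +1.
    return np.abs(z)

def relu_grad(z):
    # Piecewise constant derivative (0 or 1) used during back-propagation.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z), abs_rectifier(z), relu_grad(z))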
For small datasets, Jarrett et al. (2009a) observed that using rectifying non-linearities
is even more important than learning the weights of the hidden layers. Random weights are
sufficient to propagate useful information through a rectified linear network, allowing the
classifier layer at the top to learn how to map different feature vectors to class identities.
When more data is available, learning becomes relevant because it can extract more
knowledge from that data, and learning then typically beats fixed or random settings of parameters.
Glorot et al. (2011b) showed that learning is far easier in deep rectified linear networks
than in deep networks that have curvature or two-sided saturation in their activation
functions. Because the behavior of the unit is linear over half of its domain, it is easy for
an optimization algorithm to tell how to improve the behavior of a unit, even when the
unit’s activations are far from optimal. Just as piecewise linear networks are good at
propagating information forward, back-propagation in such a network is also piecewise
linear and propagates information about the error derivatives to all of the gradients in
the network. Each piecewise linear function can be decomposed into different regions
corresponding to different linear pieces. When we change a parameter of the network,
the resulting change in the network’s activity is linear until the point that it causes some
unit to go from one linear piece to another. Traditional units such as sigmoids are more
prone to discarding information due to saturation both in forward propagation and in
back-propagation, and the response of such a network to a change in a single parameter
may be highly nonlinear even in a small neighborhood.
One drawback to rectified linear units is that they cannot learn via gradient-based
methods on examples for which their activation is zero. This problem can be mitigated
by initializing the biases to a small positive number, but it is still possible for a rectified
linear unit to learn to de-activate and then never be activated again. Goodfellow et al.
(2013a) introduced maxout units and showed that maxout units can successfully learn
in conditions where rectified linear units become stuck. Maxout units are also piecewise
linear, but unlike rectified linear units, each piece of the linear function has its own
weight vector, so whichever piece is active can always learn.
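A minimal, illustrative sketch of a maxout layer follows (Python with NumPy; the shapes and names are chosen only for this example): each output unit takes the maximum over k affine functions of its input, each with its own weight vector and bias, so whichever piece is currently active can receive gradient and keep learning.

import numpy as np

def maxout(x, W, b):
    # x: (input_dim,), W: (num_units, k, input_dim), b: (num_units, k).
    z = np.einsum('ukd,d->uk', W, x) + b   # pre-activation of every linear piece
    return z.max(axis=1)                    # each unit keeps its largest piece

rng = np.random.RandomState(0)
x = rng.randn(5)
W = rng.randn(3, 4, 5)   # 3 maxout units, 4 linear pieces each
b = rng.randn(3, 4)
print(maxout(x, W, b))   # shape (3,)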
This same general principle of using linear behavior to obtain easier optimization also
applies in other contexts besides deep feedforward networks. Recurrent networks, which need
to propagate information through several time steps, are much easier to train when they
are more linear. One of the best-performing recurrent network architectures, the LSTM,
propagates information through time via summation, a particularly straightforward kind
of linear activation. This is discussed further in Section 10.6.4.
In addition to helping to propagate information and making optimization easier,
piecewise linear units also have some nice properties that can make them easier to
regularize. This is discussed further in Section 7.11.
Sigmoidal non-linearities still perform well in some contexts, but piecewise linear
units are now by far the most popular kind of hidden units.
6.7 Historical Notes