Chapter 6
Feedforward Deep Networks
6.0.4 Estimating Conditional Statistics
One very general supervised learning task is to estimate the expected value of y
given x. We have already seen how to do this in the context of linear regression,
where the conditional expectation of y is used as the mean of a Gaussian distri-
bution that we fit with maximum likelihood. We can generalize linear regression
to regression via any function f by defining the mean squared error of f:
    E[||y − f(x)||^2]
where the expectation is over the training set during training, and over the data
generating distribution to obtain generalization error.
We can generalize its interpretation beyond the case where f is linear or
affine, uncovering an interesting property: minimizing it yields an estimator of
the conditional expectation of the output variable y given the input variable x,
i.e.,
    arg min_{f ∈ H} E_{p(x,y)}[||y − f(x)||^2] = E_{p(x,y)}[y | x],        (6.1)

provided that our set of functions H contains E_{p(x,y)}[y | x]. (If you would like to
work out the proof yourself, it is easy to do using calculus of variations, which we
describe in Chapter 20.4.2).
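As a quick sanity check of Equation 6.1 (a minimal NumPy sketch; the choice of distribution and sample size is arbitrary and only for illustration), one can verify that, among all constant predictions for a fixed x, the empirical mean of y minimizes the average squared error:

import numpy as np

rng = np.random.RandomState(0)
# Samples of y for one fixed value of x (any distribution would do here).
y = rng.gamma(shape=2.0, scale=1.5, size=10000)

# Average squared error of each constant prediction c over a grid of candidates.
candidates = np.linspace(0.0, 10.0, 501)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)

print(candidates[np.argmin(mse)], y.mean())   # the minimizer is close to the mean of y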
6.1 Formalizing and Generalizing Neural Networks
Feedforward supervised neural networks were among the first and most successful
learning algorithms (Rumelhart et al., 1986e,c). They are also called deep net-
works, multi-layer perceptron (MLP), or simply neural networks and the vanilla
architecture with a single hidden layer is illustrated in Figure 6.1.
Figure 6.1: Vanilla (shallow) MLP, with one sigmoid hidden layer, computing a vector-valued hidden unit vector h = sigmoid(c + W x) with weight matrix W and offset vector c. The output vector is obtained via another learned affine transformation ŷ = b + V h, with weight matrix V and output offset vector b.
MLPs can learn powerful non-linear transformations: in fact, with enough
hidden units they can represent arbitrarily complex but smooth functions (see
Section 6.4). This is achieved by composing simpler but still non-linear learned
transformations. By transforming the data non-linearly into a new space, a classi-
fication problem that was not linearly separable (not solvable by a linear classifier)
can become separable, as illustrated in Figure 6.2.
In addition to covering the basics of such networks, this chapter introduces
a general formalism for gradient-based optimization of parametrized families of
functions, often in the context of conditional maximum likelihood criteria (Sec-
tion 6.2).
They bring together a number of important machine learning concepts already
introduced in the previous chapters:
• Define a parametrized family of functions f_θ describing how the learner
will behave on new examples, i.e., what output f_θ(x) will produce given
Figure 6.2: Each layer of a trained neural network non-linearly transforms its input, distorting the space so that the task becomes easier to perform, e.g., linear classification in the above figures. Top: a vanilla neural network with 2 hidden units in its single hidden layer can transform the 2-D input space (shown in the blue and pink square figure) so that the examples from the two classes (shown as the points on the red and blue curves) become linearly separable (the red and blue curves correspond to the "manifold" near which examples concentrate, while the red and pink areas correspond to the regions where the neural network classifies an input as either blue or red). Bottom: with a larger hidden layer (100 here), the MNIST digit images (with 28 × 28 = 784 pixels) can be transformed so that the classes (each shown with a different color) can be much more easily classified by the output linear+softmax layer (over 10 digit categories). Both figures are reproduced with permission by Chris Olah from http://colah.github.io/, where many more insightful visualizations can be found.
some input x. Training consists in choosing the parameter θ (usually rep-
resented by a vector) given some training examples (x, y) sampled from an
unknown data generating distribution P (X, Y ).
• Define a loss function L describing what scalar loss L(ŷ, y) is associated
with each supervised example (x, y), as a function of the learner's output
ŷ = f_θ(x) and the target output y.
• Define a training criterion and a regularizer. The objective of training
is ideally to minimize the expected loss E_{X,Y}[L(f_θ(X), Y)] over X, Y
sampled from the unknown data generating distribution P(X, Y). However
this is not possible because the expectation makes use of the true underlying
P(X, Y) but we only have access to a finite number of training examples,
i.e. of pairs (X, Y). Instead, one defines a training criterion which usually
includes an empirical average of the loss over the training set, plus some
additional terms (called regularizers) which enforce some preferences over
the choices of θ.
• Define an optimization procedure to approximately minimize the training
criterion^1. The simplest such optimization procedure is a variant of gradient
descent (gradient descent was introduced in Section 4.3) called stochastic
gradient descent, described in Section 8.3.2.
Example 6.1.1 illustrates these concepts for the case of a vanilla neural network
for regression.
In Chapter 17, we consider generalizations of the above framework to the
unsupervised and semi-supervised cases, where Y may not exist or may not always
be present. An unsupervised loss can then be defined solely as a function of the
input x and some function f_θ(x) that one wishes to learn.
1. It is generally not possible to analytically obtain a global minimum of the training criterion, so iterative numerical optimization methods are used instead.
Example 6.1.1. Vanilla (Shallow) Multi-Layer Neural Network for Regression

Based on the above definitions, we could pick the family of input-output functions to be

    f_θ(x) = b + V sigmoid(c + W x),

illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + e^{−a}) is applied element-wise, the input is the vector x ∈ R^{n_i}, the hidden layer outputs are the elements of the vector h = sigmoid(c + W x) with n_h entries, and the parameters are θ = (b, c, V, W) (viewed as the flattened vectorized version of the tuple) with b ∈ R^{n_o} a vector of the same dimension as the output (n_o), c ∈ R^{n_h} of the same dimension as h (number of hidden units), and V ∈ R^{n_o × n_h} and W ∈ R^{n_h × n_i} being weight matrices.

The loss function for this classical example could be the squared error L(ŷ, y) = ||ŷ − y||^2 (see Section 6.0.4 showing that it makes ŷ an estimator of E[Y | x]). The regularizer could be the ordinary L2 weight decay ||ω||^2 = Σ_{ij} W_{ij}^2 + Σ_{ki} V_{ki}^2, where we define the set of weights ω as the concatenation of the elements of matrices W and V. The L2 weight decay thus penalizes the squared norm of the weights, with λ a scalar that is larger to penalize stronger and yield smaller weights. Combining the loss function and the regularizer gives the training criterion, which is the objective function during training:

    C(θ) = λ||ω||^2 + (1/n) Σ_{t=1}^{n} ||y^{(t)} − (b + V sigmoid(c + W x^{(t)}))||^2,

where (x^{(t)}, y^{(t)}) is the t-th training example, an (input, target) pair. Finally, the classical training procedure in this example is stochastic gradient descent, which iteratively updates θ according to

    ω ← ω − ε (2λω + ∇_ω L(f_θ(x^{(t)}), y^{(t)}))
    β ← β − ε ∇_β L(f_θ(x^{(t)}), y^{(t)}),

where β = (b, c) contains the offset parameters, ω = (W, V) the weight matrices, ε is a learning rate and t is incremented after each training example, modulo n.
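The following is a minimal NumPy sketch of Example 6.1.1; the synthetic data, layer sizes, learning rate ε and weight decay λ below are arbitrary illustrative choices, not values prescribed by the text.

import numpy as np

rng = np.random.RandomState(0)
n, n_i, n_h, n_o = 200, 3, 10, 2                    # examples, input, hidden, output sizes
X = rng.randn(n, n_i)
Y = np.hstack([np.sin(X[:, :1]), X[:, 1:2] * X[:, 2:3]])   # an arbitrary smooth target

W = 0.1 * rng.randn(n_h, n_i); c = np.zeros(n_h)    # hidden layer parameters
V = 0.1 * rng.randn(n_o, n_h); b = np.zeros(n_o)    # output layer parameters
eps, lam = 0.05, 1e-4                               # learning rate, weight decay coefficient

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(100):
    for t in range(n):                              # stochastic gradient descent, one example at a time
        x, y = X[t], Y[t]
        h = sigmoid(c + W.dot(x))                   # hidden layer
        y_hat = b + V.dot(h)                        # output layer
        d_yhat = 2.0 * (y_hat - y)                  # gradient of ||y_hat - y||^2 w.r.t. y_hat
        d_a = V.T.dot(d_yhat) * h * (1.0 - h)       # gradient w.r.t. the pre-sigmoid activations
        V -= eps * (np.outer(d_yhat, h) + 2 * lam * V)   # omega <- omega - eps (2 lam omega + grad)
        b -= eps * d_yhat
        W -= eps * (np.outer(d_a, x) + 2 * lam * W)
        c -= eps * d_a

print(np.mean((b + sigmoid(X.dot(W.T) + c).dot(V.T) - Y) ** 2))   # final training MSE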
6.2 Parametrizing a Learned Predictor
There are many ways to define the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described
below, while more advanced ones are left to later chapters, in particular Chap-
ters 10 and 16.
6.2.1 Family of Functions
A motivation for the family of functions defined by multi-layer neural networks is
to compose affine transformations and non-linearities, where the choice of param-
eter values controls the degree and location of these non-linearities. As discussed
in Section 6.4 below, with the appropriate choice of parameters, multi-layer neural
networks can in principle approximate smooth functions, with more hidden units
allowing one to achieve better approximations.
A multi-layer neural network with more than one hidden layer can be defined by generalizing the above structure, e.g., as follows, where we chose to use hyperbolic tangent^3 activation functions instead of sigmoid activation functions:

    h^{(k)} = tanh(b^{(k)} + W^{(k)} h^{(k−1)})

where h^{(0)} = x is the input of the neural net, h^{(k)} (for k > 0) is the output of the k-th hidden layer, which has weight matrix W^{(k)} and offset (or bias) vector b^{(k)}. If we want the output f_θ(x) to lie in some desired range, then we typically define an output non-linearity (which we did not have in the above Example 6.1.1). The non-linearity for the output layer is of course generally different from the tanh, depending on the type of output to be predicted and the associated loss function (see below).
There are several other non-linearities besides the sigmoid and the hyperbolic tangent which have been successfully used with neural networks. In particular, alternative non-linear units such as the rectified linear unit (max(0, b + w · x)) and the maxout unit (max_i(b_i + W_{:,i} · x)) have been particularly successful in the case of deep feedforward or convolutional networks. These and other non-linear neural network activation functions commonly found in the literature are summarized below.
• Rectifier or rectified linear unit (ReLU) or positive part: h(x) = max(0, b + w · x), also written (b + w · x)_+.
• Hyperbolic tangent: h(x) = tanh(b + w · x).
3. The tanh is linearly related to the sigmoid as follows: tanh(x) = 2 × sigmoid(2x) − 1.
• Sigmoid: h(x) = 1/(1 + e^{−(b+w·x)}).

• Softmax: h(x) = softmax(b + W x). It is mostly used as output non-linearity for predicting discrete probabilities. See definition and discussion below, Eq. 6.2.

• Radial basis function or RBF unit: h(x) = e^{−||w−x||_2^2}. This is heavily used in kernel SVMs (Boser et al., 1992; Schölkopf et al., 1999) and has the advantage that such units can be easily initialized as a random subset of the input examples (Powell, 1987; Niranjan and Fallside, 1990).
• Softplus: h(x) = log(1 + e^{b+w·x}). This is a smooth version of the rectifier, introduced in Dugas et al. (2001) for function approximation and in Nair and Hinton (2010a) in RBMs. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter, in spite of the very similar shape and the fact that the softplus is differentiable everywhere with a non-zero derivative, contrary to the rectifier.

• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded, h(x) = max(−1, min(1, x)). It was introduced by Collobert (2004).
• Absolute value rectification: h(x) = |x| (may be applied on the dot product output or on the output of a tanh unit). It is also related to the rectifier and has been used for object recognition from images (Jarrett et al., 2009a), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination.
• Maxout: this generalizes the rectifier but introduces multiple weight vectors w_i (called filters) for each unit: h(x) = max_i(b_i + w_i · x).
This is not an exhaustive list but covers most of the non-linearities and unit
computations seen in the deep learning and neural nets literature. Many variants
are possible.
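For reference, here is a minimal NumPy sketch of most of the unit computations above; the pre-activation z = b + w · x is taken as given, and the exact signatures chosen for the RBF and maxout units are illustrative assumptions rather than a standard API.

import numpy as np

def relu(z):        return np.maximum(0.0, z)               # max(0, z)
def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def softplus(z):    return np.log1p(np.exp(z))              # smooth version of the rectifier
def hard_tanh(z):   return np.clip(z, -1.0, 1.0)            # max(-1, min(1, z))
def abs_rectify(z): return np.abs(z)

def softmax(a):
    e = np.exp(a - a.max())                                  # shift by max(a) for numerical stability
    return e / e.sum()

def rbf(x, w):
    return np.exp(-np.sum((w - x) ** 2))                     # e^{-||w - x||^2}

def maxout(x, W, b):
    return np.max(b + W.dot(x))                              # max_i (b_i + w_i . x)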
As discussed in Section 6.3, the structure (also called architecture) of the
family of input-output functions can be varied in many ways, which calls for a
generic principle for efficiently computing gradients, described in that section. For
example, a common variation is to connect layers that are not adjacent, with so-
called skip connections, which are found in the visual cortex (for which the word
“layer” should be replaced by the word “area”). Other common variations depart
from a full connectivity between adjacent layers. For example, each unit at layer k
may be connected to only a subset of units at layer k − 1. A particular case of such
form of sparse connectivity is discussed in chapter 9 with convolutional networks.
In general, the set of connections between units of the whole network only needs
to form a directed acyclic graph in order to define a meaningful computation
(see the flow graph formalism below, Section 6.3). When units of the network
are connected to themselves through a cycle, one has to properly define what
computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 10.
6.2.2 Loss Function and Conditional Log-Likelihood
In the 80's and 90's the most commonly used loss function was the squared error L(f_θ(x), y) = ||f_θ(x) − y||^2. As discussed in Section 6.0.4, if f is unrestricted (non-parametric), minimizing the expected value of the loss function over some data-generating distribution P(X, Y) yields f(x) = E[Y | X = x], the true conditional expectation of Y given X. This tells us what the neural network is trying to learn. Replacing the squared error by an absolute value makes the neural network try to estimate not the conditional expectation but the conditional median.
However, when y is a discrete label, i.e., for classification problems, other loss functions such as the Bernoulli negative log-likelihood^4 have been found to be more appropriate than the squared error. In the case where y ∈ {0, 1} is binary this gives L(f_θ(x), y) = −y log f_θ(x) − (1 − y) log(1 − f_θ(x)) and it can be shown that the optimal (non-parametric) f minimizing this criterion is f(x) = P(Y = 1 | x). Note that in order for the above expression of the criterion to make sense, f_θ(x) must be strictly between 0 and 1 (an undefined or infinite value would otherwise arise). To achieve this, it is common to use the sigmoid as non-linearity for the output layer, which matches well with the Binomial negative log-likelihood criterion^5. As explained below (Section 6.2.2), the cross-entropy criterion allows gradients to pass through the output non-linearity even when the neural network produces a confidently wrong answer, unlike the squared error criterion coupled with a sigmoid or softmax non-linearity.
4. This is often called cross-entropy in the literature, even though the term cross-entropy should also apply to many other losses that can be viewed as negative log-likelihood, discussed below in more detail.
5. In reference to statistical models, this "match" between the loss function and the output non-linearity is similar to the choice of a link function in generalized linear models (McCullagh and Nelder, 1989).
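To make the point about gradients concrete, here is a small NumPy sketch (an illustration, not code from the text) comparing the gradient of the Bernoulli negative log-likelihood and of the squared error with respect to the pre-sigmoid activation a; the stable form of the NLL below is a standard rewriting of −y log sigmoid(a) − (1 − y) log(1 − sigmoid(a)).

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_nll(a, y):
    # equals -y*log(sigmoid(a)) - (1-y)*log(1-sigmoid(a)), written to avoid overflow
    return np.maximum(a, 0) - a * y + np.log1p(np.exp(-np.abs(a)))

y = 1.0
for a in (-10.0, 0.0, 10.0):
    grad_nll = sigmoid(a) - y                                       # gradient of the NLL w.r.t. a
    grad_sq = 2 * (sigmoid(a) - y) * sigmoid(a) * (1 - sigmoid(a))  # squared error case
    print(a, bernoulli_nll(a, y), grad_nll, grad_sq)
# For a = -10 (a confidently wrong answer when y = 1) the NLL gradient stays near -1,
# while the squared-error gradient has almost vanished.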
Learning a Conditional Probability Model
More generally, one can define a loss function as corresponding to a conditional log-likelihood, i.e., the negative log-likelihood (NLL) criterion

    L_NLL(f_θ(x), y) = − log P_θ(Y = y | X = x).
Minimizing this criterion yields an estimator of the true conditional probability of Y given X. For example, if Y is a continuous random variable and we assume that, given X, it has a Gaussian distribution with mean f_θ(X) and variance σ^2, then

    − log P_θ(Y | X) = (1/2)(f_θ(X) − Y)^2 / σ^2 + (1/2) log(2πσ^2).
Up to an additive and multiplicative constant (which would give the same choice
of θ), this is therefore equivalent to the squared error loss. Once we understand
this principle, we can readily generalize it to other distributions, as appropriate.
For example, it is straightforward to generalize the univariate Gaussian to the
multivariate case, and under appropriate parametrization consider the variance
to be a parameter or even parametrized function of x (for example with output
units that are guaranteed to be positive, or forming a positive definite matrix).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion corresponds to the conditional log-likelihood associated with the Bernoulli distribution (also known as cross-entropy) with probability p = f_θ(x) of generating Y = 1 given X = x (and probability 1 − p of generating Y = 0):

    − log P_θ(Y | X) = −1_{Y=1} log p − 1_{Y=0} log(1 − p) = −Y log f_θ(X) − (1 − Y) log(1 − f_θ(X)).
Softmax
When y is discrete and has a finite domain (say {1, . . . , N}) but is not binary, the Bernoulli distribution is extended to the multinoulli distribution (defined in Section 3.10.2). This distribution is specified by a vector of N − 1 probabilities whose sum is less or equal to 1, each element of which provides the probability p_i = P(y = i | x). We thus need a vector-valued output non-linearity that produces such a vector of probabilities, and a commonly used non-linearity for this purpose is the softmax non-linearity introduced earlier:

    p = softmax(a)   with   p_i = e^{a_i} / Σ_j e^{a_j}.        (6.2)
where typically a = b + W h is the vector of scores whose elements a_i are associated with each category i. The corresponding loss function is therefore L_NLL(p, y) = − log p_y. Note how minimizing this loss will push a_y up (increase the score a_y associated with the correct label y) while pushing a_i (for i ≠ y) down (decreasing the score of the other labels, in the context x). The first effect comes from the numerator of the softmax while the second effect comes from the normalizing denominator. These forces cancel on a specific example only if p_y = 1 and they cancel on average over examples (say sharing the same x) if p_i equals the fraction of times that y = i for this value of x. The same principles and the role of the normalization constant (or "partition function") can be seen at play in the training of Markov Random Fields, Boltzmann machines and RBMs, in Chapter 14. Note other interesting properties of the softmax. First of all, the gradient of the multinoulli negative log-likelihood with respect to the softmax argument a is
    ∇_a L_NLL(p(a), y) = ∂(− log p_y(a))/∂p_y · ∂p_y/∂a
                       = −(1/p_y(a)) · p_y(a)(1_{i=y} − p)
                       = (p − 1_{i=y})        (6.3)
which does not saturate: the gradient does not vanish when the output probabilities approach 0 or 1, except when the model is providing the correct answer. Specifically, let us consider the case where the correct label is i, i.e. y = i. The element of the gradient associated with an erroneous label, say j ≠ i, is

    ∇_{a_j} L_NLL(p(a), y) = p_j.        (6.4)

So if the model correctly predicts a low probability that y = j, i.e. that p_j ≈ 0, then the gradient is also close to zero.
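The gradient of Eq. 6.3 is easy to check numerically; the following is a small NumPy sketch (illustrative, with an arbitrary score vector) comparing p − 1_{i=y} to a finite-difference estimate.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, y):
    return -np.log(softmax(a)[y])

rng = np.random.RandomState(0)
a, y = rng.randn(5), 2

grad = softmax(a).copy()       # analytic gradient from Eq. 6.3: p - 1_{i=y}
grad[y] -= 1.0

num = np.zeros_like(a)         # centered finite differences
eps = 1e-6
for i in range(len(a)):
    d = np.zeros_like(a); d[i] = eps
    num[i] = (nll(a + d, y) - nll(a - d, y)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-5))   # True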
There are other loss functions such as the squared error applied to softmax (or
sigmoid) outputs (which was popular in the 80’s and 90’s) which have vanishing
gradient when an output unit saturates (when the derivative of the non-linearity
is near 0), even if the output is completely wrong (Solla et al., 1988). This may be
a problem because it means that the parameters will basically not change, even
though the output is wrong.
To see how the squared error interacts with the softmax output, we need to introduce a one-hot encoding of the label, y = [0, . . . , 0, 1, 0, . . . , 0], i.e. for the label y = i, we have y_i = 1 and y_j = 0, j ≠ i. We will again consider the output of the network to be p = softmax(a), where, as before, a is the input to the softmax function (i.e. a = b + W h with h the output of the last hidden layer).
For the squared error loss L_2(p(a), y) = ||p(a) − y||^2, the gradient of the loss with respect to the input vector to the softmax, a, is given by:

    ∇_a L(p(a), y) = (∂L/∂p)(∂p/∂a) = 2(p(a) − y) ⊙ p ⊙ (1 − p),

where ⊙ indicates an element-wise multiplication. Let us again consider the case where y = i; the gradient associated with an erroneous label, say j ≠ i, is

    ∇_{a_j} L_2(p(a), y) = 2(p_j(a) − y_j) p_j (1 − p_j).        (6.5)

So if the model correctly predicts a low probability that y = j, i.e. that p_j ≈ 0, then the gradient is also close to zero. Unlike Eq. 6.4, however, Eq. 6.5 also yields a near-zero gradient when p_j saturates near 1, even though the prediction is then wrong.
Another property that could potentially be interesting is that the softmax output is invariant under additive changes of its input vector: softmax(a) = softmax(a + b) when b is a scalar added to all the elements of vector a. Finally, it is interesting to think of the softmax as a way to create a form of competition between the units (typically output units, but not necessarily) that participate in it: the strongest filter output a_i gets reinforced (because the exponential increases faster for larger values) and the other units get inhibited. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in cortex, and at the extreme (when the a_i's are large in magnitude) becomes a form of winner-take-all (one of the outputs is 1 and the others are 0). A more computationally expensive form of competition is found with sparse coding, described in Section 20.3.
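This additive invariance is also what makes a numerically stable softmax implementation possible: one can subtract max_i a_i from every score before exponentiating without changing the result. A minimal NumPy sketch:

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # subtracting any scalar from all scores changes nothing
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(a), softmax(a + 100.0)))   # True: softmax(a) = softmax(a + b)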
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution P (y|ω) with parameters ω,
we can construct a conditional distribution P (y|x) by making ω a parametrized
function of x, and learning that function:
    P(y | ω = f_θ(x))

where f_θ(x) is the output of a predictor, x is its input, and y can be thought of as a target. This established choice of words comes from the common cases of classification and regression, where f_θ(X) is really a prediction associated with Y, or its expected value. However, in general ω = f_θ(X) may contain parameters of the distribution of Y other than its expected value. For example, it could contain its variance or covariance, in the case where Y is conditionally Gaussian.
In the above examples, with the squared error loss, ω is the mean of the Gaussian
which captures the conditional distribution of Y (which means that the variance
is considered fixed, not a function of X). In the common classification case, ω
contains the probabilities associated with the various events of interest.
Once we view things in this way, we automatically get as the natural cost function the negative log-likelihood L(X, Y) = − log P(Y | ω = f_θ(X)). Besides θ, there could be other parameters that control the distribution of Y. For example, we may wish to learn the variance of a conditional Gaussian, even when the latter is not a function of X. In this case, the solution can be computed analytically because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations Y and their expected value (here f_θ(X)). In other words, the conditional variance can be estimated from the mean squared error.
Besides the Gaussian, a simple and common example is the case where Y is
binary (i.e. Bernoulli distributed), where it is enough to specify ω = P (Y =
1|X). In the multinoulli case, ω is generally specified by a vector of probabilities
summing to 1, e.g., via the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural
networks is the mixture model, and in particular the Gaussian mixture model,
introduced in Section 3.10.5. Neural networks that compute the parameters of
a mixture model were introduced in Jacobs et al. (1991); Bishop (1994). In the
case of the Gaussian mixture model with N components,

    P(y | x) = Σ_{i=1}^{N} P(C = i | x) N(y | μ_i(x), Σ_i(x)),

the neural network must have three kinds of outputs, P(C = i | x), μ_i(x), and Σ_i(x), which must satisfy different constraints:

1. Mixture components P(C = i | x): these form a multinoulli distribution over the N different components C, and can typically be obtained by a softmax over an N-vector, to guarantee that these outputs are positive and sum to 1.

2. Means μ_i(x): these indicate the center or mean associated with each Gaussian component, and are unconstrained (typically with no non-linearity at all for these output units). If Y is a d-vector, then the network must output an N × d matrix containing all these N d-vectors.
3. Covariances Σ_i(x): these specify the covariance matrix for each component i. In many models the variance is unconditional (does not depend on x) and diagonal or even scalar. If it is diagonal or scalar, only positivity must be enforced (e.g., using the softplus non-linearity). If it is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ_i(x) = B_i(x) B_i^⊤(x), where B_i is an unconstrained square matrix. One practical issue if the matrix is full is that computing the likelihood is expensive, requiring O(d^3) computation for the determinant and inverse of Σ_i(x) (or equivalently, and more commonly done, its eigendecomposition or that of B_i(x)).
It has been reported that gradient-based optimization of conditional Gaussian
mixtures (on the output of neural networks) can be finicky, in part because one
gets divisions (by the variance) which can be numerically unstable (when some
variance gets to be small for a particular example, yielding very large gradients).
One solution is to clip gradients (see Section 10.6.7 and Mikolov (2012); Pascanu
and Bengio (2012); Graves (2013); Pascanu et al. (2013a)), while another is to
scale the gradients heuristically (Murray and Larochelle, 2014).
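Below is a minimal NumPy sketch of such an output layer with diagonal covariances; the use of a softmax for the mixture weights and of a softplus for the variances follows the constraints listed above, while the layer sizes and weight matrices are arbitrary illustrative choices.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(a))

def mixture_params(h, Wc, Wmu, Wsig, N, d):
    # Map the last hidden layer h to the parameters of an N-component
    # diagonal Gaussian mixture over a d-dimensional y.
    weights = softmax(h.dot(Wc))                     # P(C = i | x): positive, sums to 1
    means = h.dot(Wmu).reshape(N, d)                 # mu_i(x): unconstrained
    variances = softplus(h.dot(Wsig)).reshape(N, d)  # diagonal of Sigma_i(x): positive
    return weights, means, variances

def mixture_nll(y, weights, means, variances):
    # -log sum_i P(C=i|x) N(y | mu_i, diag(var_i)), computed with log-sum-exp
    log_comp = -0.5 * np.sum((y - means) ** 2 / variances
                             + np.log(2 * np.pi * variances), axis=1)
    log_comp += np.log(weights)
    m = log_comp.max()
    return -(m + np.log(np.exp(log_comp - m).sum()))

rng = np.random.RandomState(0)
n_h, N, d = 8, 3, 2
h = rng.randn(n_h)
Wc, Wmu, Wsig = rng.randn(n_h, N), rng.randn(n_h, N * d), rng.randn(n_h, N * d)
w, mu, var = mixture_params(h, Wc, Wmu, Wsig, N, d)
print(mixture_nll(rng.randn(d), w, mu, var))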
Multiple Output Variables
When Y is actually a tuple formed by multiple random variables Y_1, Y_2, . . . , Y_k, then one has to choose an appropriate form for their joint distribution, conditional on X = x. The simplest and most common choice is to assume that the Y_i are conditionally independent, i.e.,

    P(Y_1, Y_2, . . . , Y_k | x) = Π_{i=1}^{k} P(Y_i | x).
This brings us back to the single variable case, especially since the log-likelihood now decomposes into a sum of terms log P(Y_i | x). If each P(Y_i | x) is separately parametrized (e.g. a different neural network), then we can train these neural networks independently. However, a more common and powerful choice assumes that the different variables Y_i share some common factors that can be represented in some hidden layer of the network (such as the top hidden layer). See Sections 6.5 and 7.12 for a deeper treatment of the notion of underlying factors of variation and multi-task training. See also Figure 7.6 illustrating these concepts.
If the conditional independence assumption is considered too strong, what
can we do? At this point it is useful to step back and consider everything we
know about learning a joint probability distribution. Any probability distribu-
tion P
ω
(Y ) parametrized by parameters ω can be turned into a conditional dis-
tribution P
θ
(Y |x) by making ω a function ω = f
θ
(x) parametrized by θ. This is
not only true of the simple parametric distributions we have seen above (Gaus-
sian, Bernoulli, multinoulli) but also of more complex joint distributions typically
represented by a graphical model. Much of this book is devoted to such graph-
ical models (see Chapters 14, 19, 20, 21). For more concrete examples of such
structured output models with deep learning, see Section 13.4.
6.2.3 Training Criterion and Regularizer
The loss function (often interpretable as a negative log-likelihood) tells us what we would like the learner to capture. Maximizing the conditional log-likelihood over the true distribution, i.e., minimizing the expected loss E_{X,Y}[− log P_θ(Y | X)], makes P_θ(Y | X) estimate the true P(Y | X) associated with the unknown data generating distribution, within the boundaries of what the chosen family of functions allows. In practice we cannot minimize this expectation because we do not know P(X, Y) and because computing and minimizing this integral exactly would generally be intractable. Instead we are going to approximately minimize a training criterion C(θ) based on the empirical average of the loss (over the training set). The simplest such criterion is the
average training loss (1/n) Σ_{t=1}^{n} L(f_θ(x^{(t)}), y^{(t)}), where the training set is a set of n examples (x^{(t)}, y^{(t)}). However, better results can often be achieved by crippling
the learner and preventing it from simply finding the best θ that minimizes the
average training loss. This means that we combine the evidence coming from the
data (the training set average loss) with some a priori preference on the different
values that θ can take (the regularizer). If this concept (and the related con-
cepts of generalization, overfitting and underfitting) are not clear, please return
to Sections 5.3 and 5.8.1 for a refresher.
The most common regularizer is simply an additive term equal to a regularization coefficient λ times the squared norm of the parameters^6, ||θ||_2^2. This regularizer is often called the weight decay or L2 weight decay or shrinkage because adding the squared L2 norm to the training criterion pushes the weights towards zero, in proportion to their magnitude. For example, when using stochastic gradient descent, each step of the update with regularizer term (λ/2)||θ||^2 would look like

    θ ← θ − ε ∇_θ L(f_θ(x), y) − ελθ

where ε is the learning rate (see Section 4.3). This can be rewritten

    θ ← θ(1 − ελ) − ε ∇_θ L(f_θ(x), y)

where we see that the first term systematically shaves off a small proportion of θ before adding the gradient.
6. In principle, one could have different priors on different parameters, e.g., it is common to treat the output weights with a separate regularization coefficient, but the more hyperparameters, the more difficult their optimization becomes, as discussed in Chapter 12.
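As a tiny numeric illustration of the equivalence between the two forms of the update above (with arbitrary values for ε, λ, θ and the loss gradient):

import numpy as np

eps, lam = 0.1, 0.01
theta = np.array([1.0, -2.0, 0.5])
grad_L = np.array([0.3, -0.1, 0.2])               # gradient of the loss on the current example

step1 = theta - eps * grad_L - eps * lam * theta  # theta <- theta - eps grad_L - eps lam theta
step2 = theta * (1 - eps * lam) - eps * grad_L    # theta <- theta(1 - eps lam) - eps grad_L
print(np.allclose(step1, step2))                  # True: the two forms are identical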
Another commonly used regularizer is the so-called L1 regularizer, which adds a term proportional to the L1 (absolute value) norm, |θ|_1 = Σ_i |θ_i|. The L1
regularizer also prefers values that are small, but whereas the L2 weight decay
regularizer also prefers values that are small, but whereas the L2 weight decay
has only a weak preference for 0 rather than small values, the L1 regularizer
continues pushing the parameters towards 0 with a constant gradient even when
they get very small. As a consequence, it tends to bring some of the parameters
to exactly 0 (how many depends on how large the regularization coefficient λ is
chosen). When optimizing with an approximate iterative and noisy method such
as stochastic gradient descent, no actual 0 is obtained. On the other hand, the L1
regularizer tolerates large values of some parameters (only additively removing
a small quantity compared to the situation without regularization) whereas the
L2 weight decay aggressively punishes and prevents large parameter values. The
L1 and L2 regularizers can be combined, as in the so-called elastic net (Zou and
Hastie, 2005), and this is commonly done in deep networks^7.
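The different behaviour of the two penalties near zero can be seen directly from their gradients; a minimal sketch (using the penalty λ||θ||^2 for L2, whose gradient is 2λθ, and λ|θ|_1 for L1):

import numpy as np

lam = 0.1
for w in (1.0, 0.1, 0.01, 0.001):
    grad_l2 = 2 * lam * w           # shrinks in proportion to the weight: weak push near 0
    grad_l1 = lam * np.sign(w)      # constant-magnitude push toward 0, however small w is
    print(w, grad_l2, grad_l1)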
Note how in the context of maximum likelihood, regularizers can generally be
interpreted as Bayesian priors on the parameters P (θ) or on the learned function,
as discussed in Sections 5.8.1 and 5.8. Later in this book we discuss regularizers
that are data-dependent (i.e., cannot be expressed easily as a pure prior), such
as the contractive penalty (Chapter 16) as well as regularization methods that
are difficult to interpret simply as added terms in the training criterion, such as
dropout (Section 7.11).
6.2.4 Optimization Procedure
Once a training criterion is defined, we enter the realm of iterative optimization
procedures, with the slight twist that if possible we want to monitor held-out
performance (for example the average loss on a validation set, which usually does
not include the regularizing terms) and not just the value of the training crite-
rion. Monitoring the validation set performance allows one to stop training at a
point where generalization is estimated to be the best among the training itera-
tions. This is called early stopping and is a standard machine learning technique,
discussed in Sec. 7.8.
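A minimal sketch of such an early stopping loop (the callables, the patience criterion and the fake validation curve used in the example are illustrative assumptions):

import numpy as np

def train_with_early_stopping(update_step, valid_loss, max_epochs=200, patience=10):
    # update_step() performs one epoch of training; valid_loss() returns the
    # current average loss on a held-out validation set.
    best, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        update_step()
        loss = valid_loss()
        if loss < best:
            best, best_epoch = loss, epoch   # in practice, also save a copy of the parameters here
        elif epoch - best_epoch >= patience:
            break                            # no improvement for `patience` epochs: stop
    return best, best_epoch

# Toy usage with a validation loss that first decreases and then increases:
fake = iter(np.concatenate([np.linspace(1.0, 0.2, 50), np.linspace(0.2, 0.5, 50)]))
print(train_with_early_stopping(lambda: None, lambda: next(fake), max_epochs=100))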
Section 4.3 has already covered the basics of gradient-based optimization.
The simplest such technique is stochastic gradient descent (with minibatches), in
which the parameters are updated after computing the gradient of the average
loss over a minibatch of examples (e.g. 128 examples) and making an update in
the direction opposite to that gradient (or opposite to some accumulated average
of such gradients, i.e., the momentum technique, reviewed in Section 8.3.3). The
most commonly used optimization procedures for multi-layer neural networks and
7. See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.html#l1-and-l2-regularization
deep learning in general are either variants of stochastic gradient descent (typi-
cally with minibatches), second-order methods (the most commonly used being
L-BFGS and nonlinear conjugate gradients) applied on large minibatches (e.g.
of size 10000) or on the whole training set at once, or natural gradient meth-
ods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu and Bengio,
2013). Exact second-order and natural gradient methods are computationally
too expensive for large neural networks because they involve matrices of dimen-
sion equal to the square of the number of parameters. Approximate methods are
discussed in Section 8.4.
On smaller datasets or when computing can be parallelized, second-order
methods have a computational advantage because they are easy to parallelize
and can still perform many updates per unit of time. On larger datasets (and in the limit of an infinite dataset, i.e., a stream of training examples) one cannot afford to wait to see the whole dataset before making an update, which favors either stochastic gradient descent variants (possibly with minibatches to take advantage of fast or parallelized implementations of matrix-matrix products) or second-order methods with minibatches.
A more detailed discussion of issues arising with optimization methods in deep
learning can be found in Chapter 8. In particular, many design choices in the con-
struction of the family of functions, loss function and regularizer can have a major
impact on the difficulty of optimizing the training criterion. Furthermore, instead
of using generic optimization techniques, one can design optimization procedures
that are specific to the learning problem and chosen architecture of the family of
functions, for example by initializing the parameters of the final optimization rou-
tine from the result of a different optimization (or a series of optimizations, each
initialized from the previous one). Because of the non-convexity of the training
criterion, the initial conditions can make a very important difference, and this is
what is exploited in the various pre-training strategies, Sections 8.6.4 and 17.1,
as well as with curriculum learning (Bengio et al., 2009), Section 8.7.
6.3 Flow Graphs and Back-Propagation
The term back-propagation is often misunderstood as meaning the whole learning
algorithm for multi-layer neural networks. Actually it just means the method for
computing gradients in such networks. Furthermore, it is generally understood as
something very specific to multi-layer neural networks, but once its derivation is
understood, it can easily be generalized to arbitrary functions (for which comput-
ing a gradient is meaningful), and we describe this generalization here, focusing
on the case of interest in machine learning where the output of the function to
differentiate (e.g., the loss L or the training criterion C) is a scalar and we are
interested in its derivative with respect to a set of parameters (considered to be
the elements of a vector θ). The partial derivative of C with respect to θ (called
the gradient) tells us whether θ should be increased or decreased in order to de-
crease C, and is a crucial tool in optimizing the training objective. It can be
readily proven that the back-propagation algorithm has optimal computational
complexity in the sense that there is no algorithm that can compute the gradient
faster (in the O(·) sense, i.e., up to an additive and multiplicative constant).
Figure 6.3: The chain rule, illustrated in the simplest possible case, with z a scalar function of y, which is itself a scalar function of x. A small change ∆x in x gets turned into a small change ∆y in y through the partial derivative ∂y/∂x, from the first-order Taylor approximation of y(x), and similarly for z(y). Plugging the equation for ∆y into the equation for ∆z yields the chain rule.
The basic idea of the back-propagation algorithm is that the partial derivative
of the cost C with respect to parameters θ can be decomposed recursively by taking
into consideration the composition of functions that relate θ to C, via intermediate
quantities that mediate that influence, e.g., the activations of hidden units in a
deep neural network. In this section we call these intermediate values u_j (indexed by j) and consider the general case in which they form a directed acyclic graph that has C as its final node u_N, which depends on all the other nodes u_j. The back-propagation algorithm exploits the chain rule for derivatives to compute ∂C/∂u_j when ∂C/∂u_i has already been computed for successors u_i of u_j in the graph, e.g., the hidden units in the next layer downstream. This recursion can be initialized by noting that ∂C/∂u_N = 1 and at each step only requires using the partial derivatives associated with each arc of the graph, ∂u_i/∂u_j, when u_i is a successor of u_j.
Figure 6.4: Top: The chain rule, when there are two intermediate variables y_1 and y_2 between x and z, creating two paths for changes in x to propagate and yield changes in z. Bottom: more general case, with n intermediate variables y_1 to y_n.
6.3.1 Chain Rule
In order to apply the back-propagation algorithm, we take advantage of the chain
rule:
    ∂C(g(θ))/∂θ = ∂C(g(θ))/∂g(θ) · ∂g(θ)/∂θ        (6.6)
which works also when C, g or θ are vectors rather than scalars (in which case the corresponding partial derivatives are understood as Jacobian matrices of the appropriate dimensions). In the purely scalar case we can understand the chain rule as follows: a small change in θ will propagate into a small change in g(θ) by getting multiplied by ∂g(θ)/∂θ. Similarly, a small change in g(θ) will propagate into a small change in C(g(θ)) by getting multiplied by ∂C(g(θ))/∂g(θ). Hence a small change in θ first gets multiplied by ∂g(θ)/∂θ to obtain the change in g(θ) and this then gets multiplied by ∂C(g(θ))/∂g(θ) to obtain the change in C(g(θ)). Hence the ratio of the change in C(g(θ)) to the change in θ is the product of these partial derivatives. The partial derivative measures the locally linear influence of a variable on another. This is illustrated in Figure 6.3, with x = θ, y = g(θ), and z = C(g(θ)).
Now, if g is a vector, we can rewrite the above as follows:

    ∂C(g(θ))/∂θ = Σ_i (∂C(g(θ))/∂g_i(θ)) (∂g_i(θ)/∂θ)

which sums over the influences of θ on C(g(θ)) through all the intermediate variables g_i(θ). This is illustrated in Figure 6.4 with x = θ, y_i = g_i(θ), and z = C(g(θ)).
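A small numeric check of this vector chain rule (with an arbitrary choice of g and C):

import numpy as np

def g(theta):  return np.array([theta ** 2, np.sin(theta)])      # two intermediate variables
def C(gv):     return gv[0] * gv[1] + gv[1] ** 3

theta = 0.7
gv = g(theta)
dC_dg = np.array([gv[1], gv[0] + 3 * gv[1] ** 2])      # partials of C w.r.t. g_1 and g_2
dg_dtheta = np.array([2 * theta, np.cos(theta)])       # partials of g_1 and g_2 w.r.t. theta
analytic = np.dot(dC_dg, dg_dtheta)                    # sum over the intermediate variables

eps = 1e-6
numeric = (C(g(theta + eps)) - C(g(theta - eps))) / (2 * eps)
print(np.allclose(analytic, numeric))                  # True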
Algorithm 6.1 Forward computation associated with input x for a deep neural network with ordinary affine layers composed with an arbitrary elementwise differentiable (almost everywhere) non-linearity f. There are M such layers, each mapping their vector-valued input h^{(k)} to a pre-activation vector a^{(k)} via a weight matrix W^{(k)} which is then transformed via f into h^{(k+1)}. The input vector x corresponds to h^{(0)} and the predicted output ŷ corresponds to h^{(M)}. The cost function L(ŷ, y) depends on the output ŷ and on a target y (see Section 6.2.2 for examples of loss functions). The loss may be added to a regularizer Ω (see Section 6.2.3 and Chapter 7) to obtain the example-wise cost C. Algorithm 6.2 shows how to compute gradients of C with respect to parameters W and b. For computational efficiency on modern computers (especially GPUs), it is important to implement these equations minibatch-wise, i.e., h^{(k)} (and similarly a^{(k)}) should really be a matrix whose second dimension is the example index in the minibatch. Accordingly, y and ŷ should have an additional dimension for the example index in the minibatch, while C remains a scalar, i.e., the average of the costs over all the minibatch examples.
h^{(0)} = x
for k = 1, . . . , M do
    a^{(k)} = b^{(k)} + W^{(k)} h^{(k−1)}
    h^{(k)} = f(a^{(k)})
end for
ŷ = h^{(M)}
C = L(ŷ, y) + λΩ
Algorithm 6.2 Backward computation for the deep neural network of Algorithm 6.1, which uses in addition to the input x a target y. This computation yields the gradients on the activations a^{(k)} for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.
After the forward computation, compute the gradient on the output layer:
    g ← ∇_ŷ C = ∇_ŷ L(ŷ, y) + λ ∇_ŷ Ω
    (typically Ω is only a function of parameters, not activations, so the last term would be zero)
for k = M down to 1 do
    Convert the gradient on the layer's output into a gradient into the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
        g ← ∇_{a^{(k)}} C = g ⊙ f'(a^{(k)})
    Compute gradients on weights and biases (including the regularization term, where needed):
        ∇_{b^{(k)}} C = g + λ ∇_{b^{(k)}} Ω
        ∇_{W^{(k)}} C = g h^{(k−1)⊤} + λ ∇_{W^{(k)}} Ω
    Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
        g ← ∇_{h^{(k−1)}} C = W^{(k)⊤} ∇_{a^{(k)}} C
end for
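The two algorithms translate almost line by line into NumPy. The sketch below assumes a sigmoid non-linearity on every layer and the squared error loss, with no regularizer; these choices, and the finite-difference check at the end, are only for illustration.

import numpy as np

def sigmoid(a):  return 1.0 / (1.0 + np.exp(-a))

def forward(x, Ws, bs):
    # Algorithm 6.1: returns the pre-activations a^{(k)} and outputs h^{(k)} of every layer.
    h, a = [x], [None]
    for W, b in zip(Ws, bs):
        a.append(b + W.dot(h[-1]))
        h.append(sigmoid(a[-1]))
    return a, h

def backward(a, h, y, Ws):
    # Algorithm 6.2 for L = ||y_hat - y||^2 and an element-wise sigmoid on every layer.
    M = len(Ws)
    grads_W, grads_b = [None] * M, [None] * M
    g = 2.0 * (h[M] - y)                          # gradient on the output y_hat
    for k in range(M, 0, -1):
        g = g * h[k] * (1.0 - h[k])               # through f' (sigmoid derivative)
        grads_W[k - 1] = np.outer(g, h[k - 1])    # gradient on W^{(k)}
        grads_b[k - 1] = g                        # gradient on b^{(k)}
        g = Ws[k - 1].T.dot(g)                    # propagate to h^{(k-1)}
    return grads_W, grads_b

# Finite-difference check of one weight
rng = np.random.RandomState(0)
sizes = [4, 5, 3]
Ws = [0.5 * rng.randn(sizes[i + 1], sizes[i]) for i in range(2)]
bs = [np.zeros(sizes[i + 1]) for i in range(2)]
x, y = rng.randn(4), rng.randn(3)
a, h = forward(x, Ws, bs)
gW, gb = backward(a, h, y, Ws)

eps = 1e-6
Ws[0][1, 2] += eps;     lp = np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2)
Ws[0][1, 2] -= 2 * eps; lm = np.sum((forward(x, Ws, bs)[1][-1] - y) ** 2)
Ws[0][1, 2] += eps
print(np.allclose(gW[0][1, 2], (lp - lm) / (2 * eps)))    # True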
6.3.2 Back-Propagation in an MLP
Whereas Example 6.1.1 illustrated the case of an MLP with a single hidden layer, let us consider in this section back-propagation for an ordinary but deep MLP, i.e., like the above vanilla MLP but with several hidden layers. For this purpose, we will recursively apply the chain rule illustrated in Figure 6.4. The algorithm proceeds by first computing the gradient of the cost C with respect to output units, and these are used to compute the gradient of C with respect to the top hidden layer activations. We can then continue computing the gradients of lower level hidden units one at a time in the same way. The gradients on hidden and output units can be used to compute the gradient of C with respect to the parameters (e.g. weights and biases) of each layer (i.e., that directly contribute to the output of that layer).
Algorithm 6.1 describes in matrix-vector form the forward propagation computation for a classical multi-layer network with M layers, where each layer computes an affine transformation (defined by a bias vector b^{(k)} and a weight matrix W^{(k)}) followed by a non-linearity f. In general, the non-linearity may be different on different layers, and this is typically the case for the output layer (see Section 6.2.1). Hence each unit at layer k computes an output h^{(k)}_i as follows:
    a^{(k)}_i = b^{(k)}_i + Σ_j W^{(k)}_{ij} h^{(k−1)}_j
    h^{(k)}_i = f(a^{(k)}_i)        (6.7)
where we separate the affine transformation from the non-linear activation oper-
ations for ease of exposition of the back-propagation computations.
These are described in matrix-vector form by Algorithm 6.2 and proceed from
the output layer towards the first hidden layer, as outlined above.
6.3.3 Back-Propagation in a General Flow Graph
More generally, we can think about decomposing a function C(θ) into a more complicated graph of computations. This graph is called a flow graph. Each node u_i of the graph denotes a numerical quantity that is obtained by performing a computation requiring the values u_j of other nodes, with j < i. The nodes satisfy a partial order which dictates in what order the computation can proceed. In practical implementations of such functions (e.g. with the criterion C(θ) or its value estimated on a minibatch), the final computation is obtained as the composition of simple functions taken from a given set (such as the set of numerical operations that the numpy library can perform on arrays of numbers).
We will define the back-propagation in a general flow graph, using the following generic notation: u_i = f_i(a_i), where a_i is a list of arguments for the application of f_i to the values u_j for the parents of i in the graph: a_i = (u_j)_{j∈parents(i)}. This is illustrated in Figure 6.5.
Figure 6.5: Illustration of recursive forward computation, where at each node u_i we compute a value u_i = f_i(a_i), with a_i being the list of values from parents u_j of node u_i. Following Algorithm 6.3, the overall inputs to the graph are u_1 . . . , u_M (e.g., the parameters we may want to tune during training), and there is a single scalar output u_N (e.g., the loss which we want to minimize).
The overall computation of the function represented by the flow graph can
thus be summarized by the forward computation algorithm, Algorithm 6.3.
In addition to having some code that tells us how to compute f_i(a_i) for some values in the vector a_i, we also need some code that tells us how to compute its partial derivatives, ∂f_i(a_i)/∂a_{ik}, with respect to any immediate argument a_{ik}. Let k = π(i, j) denote the index of u_j in the list a_i. Note that u_j could influence u_i through multiple paths. Whereas ∂u_i/∂u_j would denote the total gradient adding up all of these influences, ∂f_i(a_i)/∂a_{ik} only denotes the derivative of f_i with respect to its specific k-th argument, keeping the other arguments fixed, i.e., only considering the influence through the arc from u_j to u_i. In general, when manipulating partial derivatives, one should keep clear in one's mind (and implementation) the notational and semantic distinction between a partial derivative that includes all paths and one that includes only the immediate effect of a function's argument
Algorithm 6.3 Flow graph forward computation. Each node computes numerical value u_i by applying a function f_i to its argument list a_i that comprises the values of previous nodes u_j, j < i, with j ∈ parents(i). The input to the flow graph is the vector x, and is set into the first M nodes u_1 to u_M. The output of the flow graph is read off the last (output) node u_N.

for i = 1, . . . , M do
    u_i ← x_i
end for
for i = M + 1, . . . , N do
    a_i ← (u_j)_{j∈parents(i)}
    u_i ← f_i(a_i)
end for
return u_N
on the function output, with the other arguments considered fixed. For example, consider f_3(a_{3,1}, a_{3,2}) = e^{a_{3,1} + a_{3,2}} and f_2(a_{2,1}) = a_{2,1}^2, while u_3 = f_3(u_2, u_1) and u_2 = f_2(u_1), illustrated in Figure 6.6. The direct derivative of f_3 with respect to its argument a_{3,2} is ∂f_3/∂a_{3,2} = e^{a_{3,1} + a_{3,2}}, while if we consider the variables u_3 and u_1 to which these correspond, there are two paths from u_1 to u_3, and we obtain as derivative the sum of partial derivatives over these two paths, ∂u_3/∂u_1 = e^{u_1 + u_2}(1 + 2u_1). The results are different because ∂u_3/∂u_1 involves not just the direct dependency of u_3 on u_1 but also the indirect dependency through u_2.
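A quick numeric check of this two-path computation (values chosen arbitrarily):

import numpy as np

def f2(u1):       return u1 ** 2               # u2 = f_2(u1)
def f3(u1, u2):   return np.exp(u1 + u2)       # u3 = f_3(u2, u1) = e^{u1 + u2}

u1 = 0.3
analytic = np.exp(u1 + f2(u1)) * (1 + 2 * u1)  # sum over the two paths from u1 to u3

eps = 1e-7
numeric = (f3(u1 + eps, f2(u1 + eps)) - f3(u1 - eps, f2(u1 - eps))) / (2 * eps)
print(np.allclose(analytic, numeric))          # True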
Figure 6.6: Illustration of indirect effect and direct effect of variable u_1 on variable u_3 in a flow graph, which means that the derivative of u_3 with respect to u_1 must include the sum of two terms, one for the direct effect (derivative of u_3 with respect to its first argument) and one for the indirect effect through u_2 (involving the product of the derivative of u_3 with respect to u_2 times the derivative of u_2 with respect to u_1). Forward computation of u_i's (as in Figure 6.5) is indicated with upward full arrows, while backward computation (of derivatives with respect to u_i's, as in Figure 6.7) is indicated with downward dashed arrows.
Figure 6.7: Illustration of recursive backward computation, where we associate to each node j not just the values u_j computed in the forward pass (Figure 6.5, bold upward arrows) but also the gradient ∂u_N/∂u_j with respect to the output scalar node u_N. These gradients are recursively computed in exactly the opposite order, as described in Algorithm 6.4, by using the already computed ∂u_N/∂u_i of the children i of j (dashed downward arrows).
Armed with this understanding, we can define the back-propagation algorithm as follows, in Algorithm 6.4, which would be computed after the forward propagation (Algorithm 6.3) has been performed. Note the recursive nature of the application of the chain rule, in Algorithm 6.4: we compute the gradient on node j by re-using the already computed gradient for children nodes i, starting the recurrence from the trivial ∂u_N/∂u_N = 1 that sets the gradient for the output node.
This is illustrated in Figure 6.7.
Algorithm 6.4 Back-propagation computation of a flow graph (full, upward arrows), which itself produces an additional flow graph (dashed, backward arrows). See the forward propagation in a flow graph (Algorithm 6.3, to be performed first) and the required data structure. In addition, a quantity ∂u_N/∂u_i needs to be stored (and computed) at each node, for the purpose of gradient back-propagation. Below, the notation π(i, j) is the index of u_j as an argument to f_i. The back-propagation algorithm efficiently computes ∂u_N/∂u_i for all i's (traversing the graph backwards this time), and in particular we are interested in the derivatives of the output node u_N with respect to the "inputs" u_1 . . . , u_M (which could be the parameters, in a learning setup). The cost of the overall computation is proportional to the number of arcs in the graph, assuming that the partial derivative associated with each arc requires a constant time. This is of the same order as the number of computations for the forward propagation.

∂u_N/∂u_N ← 1
for j = N − 1 down to 1 do
    ∂u_N/∂u_j ← Σ_{i : j∈parents(i)} (∂u_N/∂u_i) (∂f_i(a_i)/∂a_{i,π(i,j)})
end for
return (∂u_N/∂u_i)_{i=1}^{M}
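Below is a minimal Python sketch of Algorithm 6.4 on the two-node example of Figure 6.6 (u_1 is the input, u_2 = u_1^2, u_3 = e^{u_1 + u_2}); the dictionary-based graph representation and the per-arc derivative functions are illustrative assumptions, not a prescribed data structure.

import numpy as np

# Each node stores its parents, a function f_i, and the partial derivative of f_i
# with respect to each of its arguments (the per-arc derivatives used above).
nodes = {
    2: {"parents": [1], "f": lambda a: a[0] ** 2,
        "df": lambda a: [2 * a[0]]},
    3: {"parents": [1, 2], "f": lambda a: np.exp(a[0] + a[1]),
        "df": lambda a: [np.exp(a[0] + a[1]), np.exp(a[0] + a[1])]},
}

def forward(x, N=3, M=1):
    u = {1: x}
    for i in range(M + 1, N + 1):
        args = [u[j] for j in nodes[i]["parents"]]
        u[i] = nodes[i]["f"](args)
    return u

def backward(u, N=3):
    du = {N: 1.0}                                   # du_N / du_N = 1
    for j in range(N - 1, 0, -1):
        total = 0.0
        for i, spec in nodes.items():               # children i such that j is in parents(i)
            if j in spec["parents"]:
                args = [u[p] for p in spec["parents"]]
                k = spec["parents"].index(j)        # k = pi(i, j)
                total += du[i] * spec["df"](args)[k]
        du[j] = total
    return du

u = forward(0.3)
print(backward(u)[1], np.exp(0.3 + 0.09) * (1 + 2 * 0.3))   # both give the same derivative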
This recursion is a form of efficient factorization of the total gradient, i.e., it is an application of the principles of dynamic programming^8. Indeed, the derivative of the output node with respect to any node can also be written down in this intractable form:

    ∂u_N/∂u_i = Σ_{paths u_{k_1} . . . , u_{k_n} : k_1 = i, k_n = N}  Π_{j=2}^{n}  ∂u_{k_j}/∂u_{k_{j−1}}

where the paths u_{k_1} . . . , u_{k_n} go from the node k_1 = i to the final node k_n = N in the flow graph. Computing the sum as above would be intractable because the number of possible paths can be exponential in the depth of the graph. The back-propagation algorithm is efficient because it employs a dynamic programming strategy to reuse rather than re-compute partial sums associated with the gradients on intermediate nodes.
8. Here we refer to "dynamic programming" in the sense of table-filling algorithms that avoid re-computing frequently used subexpressions. In the context of machine learning, "dynamic programming" can also refer to iterating Bellman's equations. That is not the kind of dynamic programming we refer to here.
Although the above was stated as if the u_i's were scalars, exactly the same procedure can be run with u_i's being tuples of numbers (more easily represented by vectors). In that case the equations remain valid, and the multiplication of scalar partial derivatives becomes the multiplication of a row vector of gradients ∂u_N/∂u_i with a Jacobian of partial derivatives associated with the j → i arc of the graph, ∂f_i(a_i)/∂a_{i,π(i,j)}. In the case where minibatches are used during training, u_i would actually be a whole matrix (the extra dimension being for the examples in the minibatch). This would then turn the basic computation into matrix-matrix products rather than matrix-vector products, and the former can be computed much more efficiently than a sequence of matrix-vector products (e.g. with the BLAS library), especially so on modern computers and GPUs, that rely more and more on parallelization through many cores.
More implementation issues regarding the back-propagation algorithm are discussed in Chapters 11 and 12, regarding respectively GPU implementations (Section 11.2) and debugging tricks (Section 12.4). There is also a natural generalization of the back-propagation algorithm in which one manipulates not numbers but symbolic expressions, i.e., turning a program that performs a computation (decomposed as a flow graph expressed in the forward propagation algorithm) into another program (another flow graph) that performs gradient computation (i.e. automatically generating the back-propagation program, given the forward propagation graph). Using such symbolic automatic differentiation procedures (implemented for example in the Theano library^9), one can compute second derivatives and other kinds of derivatives, as well as apply numerical stabilization and simplifications on a flow graph.
6.4 Universal Approximation Properties and Depth
When there is no hidden layer, and for convex loss functions and regularizers
(which is typically the case), we obtain a convex training criterion. We end up
with linear regression, logistic regression or other log-linear models that are used
in many applications of machine learning. This is appealing (no local minima,
no saddle point) and convenient (no strong dependency on initial conditions)
but comes with a major disadvantage: such learners are very limited in their
ability to represent more complex functions. In the case of classification tasks,
we are limited to linear decision surfaces. In practice, this limitation can be
somewhat circumvented by handcrafting a large enough or discriminant enough
set of hardwired features extracted from the raw input and on which the linear
9. http://deeplearning.net/software/theano/
predictor can be easily trained. See next section for a longer discussion about
feature learning.
To make the family of functions rich enough, the other option is to introduce
one or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi,
1994) that with a single hidden layer of a sufficient size and a reasonable choice of
non-linearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can
represent any smooth function to any desired accuracy (the greater the required
accuracy, the more hidden units are required). However, these theorems generally
do not tell us how many hidden units will be required to achieve a given accuracy
for particular data distributions: in fact the worst-case scenario is generally going to require an exponential number of hidden units (to basically record every input configuration that needs to be distinguished). It is easier to understand how bad things can be by considering the case of binary input vectors: the number of possible binary functions on vectors v ∈ {0, 1}^d is 2^{2^d} and selecting one such function requires 2^d bits, which will in general require O(2^d) degrees of freedom.
However, machine learning is not about learning any possible function with
equal ease: we mostly care about the kinds of functions that are needed to repre-
sent the world around us or the task of interest. The no-free-lunch theorems
for machine learning (Wolpert, 1996) essentially say that no learning algorithm is
“universal”, i.e., dominates all the others against all possible ground truths (data
generating distribution or target function). However, for a given family of tasks,
some learning algorithms are definitely better. Choosing a learning algorithm is
equivalent to choosing a prior, i.e., making a larger bet for some guesses of the
underlying target function than for others. If the target function is likely under
the chosen prior, then this prior (i.e., this learning algorithm) turns out to be
a good choice. The best choice is the unrealistic prior that puts all probability
mass on the true target function. This would only happen if there is nothing
to be learned any more because we already know the target function. Having a
universal approximation property just means that this prior is broad enough to
cover or come close to any possible target function – and this is a useful property
but is, in itself, clearly not sufficient to provide good generalization (not to
speak of the optimization issues and other computational concerns that may be
associated with a choice of prior and learning algorithm).
One of the central priors that is exploited in deep learning is that the target
function to be learned can be efficiently represented as a deep composition of
simpler functions (“features”), where features at one level can be reused to define
many features at the next level. This is connected to the notion of underlying
factors described in the next section. One therefore assumes that these factors
or features are organized at multiple levels, corresponding to multiple levels of
representation. The number of such levels is what we call depth in this context.
The computation of these features can therefore be laid down in a flow graph
or circuit, whose depth is the length of the longest path from an input node to
an output node. Note that we can define the operations performed in each node
in different ways. For example, do we consider a node that computes the affine
operations of a neural net followed by a node that computes the non-linear neuron
activation, or do we consider both of these operations as one node or one level of
the graph? Hence the notion of depth really depends on the allowed operations at
each node and one flow graph of a given depth can be converted into an equivalent
flow graph of a different depth by redefining which operations can be performed
at each node. However, for neural networks, we typically talk about a depth 1
network if there is no hidden layer, a depth 2 network if there is one hidden layer,
etc. The universal approximation properties for neural nets basically tell us that
depth 2 is sufficient to approximate any reasonable function to any desired finite
accuracy.
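The following minimal sketch (graph and node names are our own illustration) makes the dependence of depth on node granularity explicit, measuring depth as the longest path from an input node to an output node:

# Illustrative only: depth of a flow graph as the longest input-to-output path,
# under two choices of what counts as a single node.
def depth(graph, node):
    preds = graph.get(node, [])
    if not preds:                 # an input node
        return 0
    return 1 + max(depth(graph, p) for p in preds)

# One-hidden-layer net, counting affine and non-linearity as separate nodes:
fine = {'affine1': ['x'], 'h': ['affine1'], 'affine2': ['h'], 'y': ['affine2']}
# The same net, counting each layer (affine + non-linearity) as one node:
coarse = {'h': ['x'], 'y': ['h']}

print(depth(fine, 'y'))    # 4
print(depth(coarse, 'y'))  # 2, the usual convention for a one-hidden-layer net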
From the point of view of approximation properties, the important result
is that one can find families of functions which can be approximated very effi-
ciently when a particular depth is allowed, but which might require a much larger
(typically exponentially larger) model (e.g. more hidden units) if depth is insuf-
ficient (or is limited to 2). Such results have been proven for logic gates (Håstad,
1986), linear threshold units with non-negative weights (Håstad and Goldmann,
1991), polynomials (Delalleau and Bengio, 2011) organized as deep sum-product
networks (Poon and Domingos, 2011), and more recently, for deep networks of
rectifier units (Pascanu et al., 2013b). Of course, there is no guarantee that the
kinds of functions we want to learn in applications of machine learning (and in
particular for AI) share such a property. However, a lot of experimental evidence
suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan
et al., 2012; Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a;
Couprie et al., 2013; Ebrahimi et al., 2013). This suggests that indeed depth is
a useful prior and that in order to take advantage of it, the learner’s family of
functions needs to allow multiple levels of representation.
6.5 Feature / Representation Learning
Let us consider again the single layer networks such as the perceptron, linear
regression and logistic regression: such linear models are appealing because train-
ing them involves a convex optimization problem^10, i.e., an optimization problem
with some convergence guarantees towards a global optimum, irrespective of initial
conditions. Simple and well-understood optimization algorithms are available
in this case. However, this limits the representational capacity too much: many
tasks, for a given choice of input representation x (the raw input features), cannot
be solved by using only a linear predictor. What are our options to avoid that
limitation?

^10 or even one for which an analytic solution can be computed, as with linear regression or
in the case of some Gaussian process regression models
1. One option is to use a kernel machine (Williams and Rasmussen, 1996;
Schölkopf et al., 1999), i.e., to consider a fixed mapping from x to φ(x),
where φ(x) is of much higher dimension. In this case, f_θ(x) = b + w · φ(x)
can be linear in the parameters (and in φ(x)) and optimization remains
convex (or even analytic). By exploiting the kernel trick, we can com-
putationally handle a high-dimensional φ(x) (or even an infinite-dimensional
one) so long as the kernel K(u, v) = φ(u) · φ(v) (where · is the appropriate
dot product for the space of φ(·)) can be computed efficiently. If φ(x) is
of high enough dimension, we can always have enough capacity to fit the
training set, but generalization is not at all guaranteed: it will depend on
the appropriateness of the choice of φ as a feature space for our task. Kernel
machine theory clearly identifies the choice of φ with the choice of a prior.
This leads to kernel engineering, which is equivalent to feature engineering,
discussed next. The other commonly used type of kernel embodies a very
broad prior, such as smoothness, e.g., the Gaussian (or RBF) kernel
K(u, v) = e^(−||u−v||^2) (see the sketch after this list). Unfortunately, this
prior may be insufficient, i.e., too broad and sensitive to the curse of
dimensionality, as introduced in Section 5.12.1 and developed in more detail
in Chapter 17.
2. Another option is to manually engineer the representation or features φ(x).
Most industrial applications of machine learning rely on hand-crafted fea-
tures and most of the research and development effort (as well as a very
large fraction of the scientific literature in machine learning and its applica-
tions) goes into designing new features that are most appropriate to the task
at hand. Clearly, faced with a problem to solve and some prior knowledge
in the form of representations that are believed to be relevant, the prior
knowledge can be very useful. This approach is therefore common in prac-
tice, but is not completely satisfying because it involves very task-specific
engineering work and a laborious, never-ending effort to improve systems by
designing better features. If there were some more general feature learning
approaches that could be applied to a large set of related tasks (such as
those involved in AI), we would certainly like to take advantage of them.
Since humans seem to be able to learn a lot of new tasks (for which they
were not programmed by evolution), it seems that such broad priors do
exist. This whole question is discussed in a lot more detail in Bengio and
LeCun (2007a), and motivates the third option.
3. The third option is to learn the features, or learn the representation. In a
sense, it allows one to interpolate between the almost agnostic approach
of a kernel machine with a general-purpose smoothness kernel (such as
RBF SVMs and other non-parametric statistical models) and full designer-
provided knowledge in the form of a fixed representation that is perfectly
tailored to the task. This is equivalent to the idea of learning the kernel, ex-
cept that whereas most kernel learning methods only allow very few degrees
of freedom in the learned kernel, representation learning methods such as
those discussed in this book (including multi-layer neural networks) allow
the feature function φ(·) to be very rich (with a number of parameters that
can be in the millions or more, depending on the amount of data available).
This is equivalent to learning the hidden layers, in the case of a multi-layer
neural network. Besides smoothness (which comes for example from reg-
ularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed
above (Section 6.4). Other priors are discussed in Chapter 17.
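To make option 1 above concrete, here is a minimal numpy sketch (our own illustration, not a reference implementation) of a kernel machine with the Gaussian kernel, fit by kernel ridge regression: the predictor is linear in the dual coefficients even though it is non-linear in x.

# A minimal illustrative kernel machine: Gaussian (RBF) kernel
# K(u, v) = exp(-||u - v||^2), trained by kernel ridge regression.
import numpy as np

def rbf_kernel(A, B):
    # Pairwise squared Euclidean distances, then the Gaussian kernel.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))                  # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)   # noisy targets

lam = 1e-2                                            # ridge regularization
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # dual coefficients

X_test = np.linspace(-3, 3, 5)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha                # predictions
print(y_pred)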
This whole discussion is clearly not specific to neural networks and supervised
learning, and is one of the central motivations for this book.
6.6 Piecewise Linear Hidden Units
Most of the recent improvement in the performance of deep neural networks can be
attributed to increases in computational power and the size of datasets. The ma-
chine learning algorithms involved in recent state-of-the-art systems have mostly
existed since the 1980s, with very few recent conceptual advances contributing
significantly to increased performance.
One of the main algorithmic improvements that has had a significant impact
is the use of piecewise linear units, such as absolute value rectifiers and rectified
linear units. Such units consist of two linear pieces and their behavior is driven
by a single weight vector. Jarrett et al. (2009b) observed that “using a rectifying
non-linearity is the single most important factor in improving the performance of a
recognition system” among several different factors of neural network architecture
design.
For small datasets, Jarrett et al. (2009b) observed that using rectifying non-linearities
is even more important than learning the weights of the hidden layers. Random
weights are sufficient to propagate useful information through a rectified linear
network, allowing the classifier layer at the top to learn how to map different
feature vectors to class identities.
When more data is available, learning becomes more relevant because it can extract
more knowledge from the data; learning then typically beats fixed or random settings of
parameters. Glorot et al. (2011b) showed that learning is far easier in deep
rectified linear networks than in deep networks that have curvature or two-sided
saturation in their activation functions. Because the behavior of the unit is linear
over half of its domain, it is easy for an optimization algorithm to tell how to im-
prove the behavior of a unit, even when the unit’s activations are far from optimal.
Just as piecewise linear networks are good at propagating information forward,
back-propagation in such a network is also piecewise linear and propagates infor-
mation about the error derivatives to all of the gradients in the network. Each
piecewise linear function can be decomposed into different regions corresponding
to different linear pieces. When we change a parameter of the network, the result-
ing change in the network’s activity is linear until the point that it causes some
unit to go from one linear piece to another. Traditional units such as sigmoids
are more prone to discarding information due to saturation both in forward prop-
agation and in back-propagation, and the response of such a network to a change
in a single parameter may be highly nonlinear even in a small neighborhood.
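The contrast can be seen directly in the derivatives that back-propagation multiplies by (a purely illustrative sketch):

# Illustrative: a rectified linear unit contributes a derivative of exactly 1
# wherever it is active, while a sigmoid unit's derivative shrinks towards 0
# as the unit saturates, discarding gradient information.
import numpy as np

def relu_grad(a):
    return (a > 0).astype(float)   # 1 on the active (linear) piece, 0 otherwise

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)           # at most 0.25, and near 0 when |a| is large

a = np.array([-5.0, -0.5, 0.5, 5.0])
print(relu_grad(a))      # [0. 0. 1. 1.]
print(sigmoid_grad(a))   # approximately [0.007, 0.235, 0.235, 0.007]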
Glorot et al. (2011b) actually motivate rectified linear units from biological
considerations. The half-rectifying non-linearity was intended to capture these
properties of biological neurons: 1) For some inputs, biological neurons are com-
pletely inactive. 2) For some inputs, a biological neuron’s output is proportional
to its input. 3) Most of the time, biological neurons operate in the regime where
they are inactive (i.e., they should have sparse activations).
One drawback to rectified linear units is that they cannot learn via gradient-
based methods on examples for which their activation is zero. This problem
can be mitigated by initializing the biases to a small positive number, but it is
still possible for a rectified linear unit to learn to de-activate and then never be
activated again. Goodfellow et al. (2013a) introduced maxout units and showed
that maxout units can successfully learn in conditions where rectified linear units
become stuck. Maxout units are also piecewise linear, but unlike rectified linear
units, each piece of the linear function has its own weight vector, so whichever
piece is active can always learn.
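For concreteness, a minimal sketch of a single maxout unit (our own illustration), in which each of the k linear pieces has its own weight vector and bias:

# Illustrative maxout unit: the output is the maximum over k affine pieces,
# so whichever piece is active has its own parameters that can receive gradient
# (unlike a rectified linear unit, whose two pieces share one weight vector).
import numpy as np

rng = np.random.default_rng(0)
n_in, k = 10, 4
W = rng.standard_normal((k, n_in))   # one weight vector per linear piece
b = rng.standard_normal(k)           # one bias per piece

def maxout(x):
    z = W @ x + b                    # the k affine pieces
    return z.max(), int(z.argmax())  # activation value and the active piece

x = rng.standard_normal(n_in)
activation, active_piece = maxout(x)
print(activation, active_piece)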
This same general principle of using linear behavior to obtain easier opti-
mization also applies in other contexts besides deep feedforward networks. Recurrent
networks, which need to propagate information through several time steps, are
much easier to train when they are more linear. One of the best-performing re-
current network architectures, the LSTM, propagates information through time
via summation, a particularly straightforward kind of linear activation. This is
discussed further in Section 10.6.4.
In addition to helping to propagate information and making optimization
easier, piecewise linear units also have some nice properties that can make them
easier to regularize. This is discussed further in Section 7.11.
Sigmoidal non-linearities still perform well in some contexts, but piecewise
linear units are now by far the most popular kind of hidden units.
6.7 Historical Notes