Machine learning algorithms can fail in two ways: underfitting and overfitting. Un-
derfitting is simple to understand: it happens when the learner cannot find a solution
that fits the observed data well enough. If we try to fit with a linear regression (x, y)
pairs for which y is actually a quadratic function of x, then we are bound to have
underfitting. There will be important aspects of the data that our learner cannot cap-
ture. If the learner needs a richer model in order to capture the data, it is underfitting.
Sometimes, underfitting is due not just to the limitations of the family of functions but
also to limitations on computational resources. For example, deep neural
networks or recurrent neural networks can be very difficult to optimize (and in theory
finding the global optimum of the training objective can be NP-hard). With a limited
computational budget (e.g., a limited number of training iterations of an iterative opti-
mization procedure), the learner may not be able to fit the training data well enough.
It could also be that the optimization procedure easily gets stuck in local minima of
the objective function, and that vastly more expensive optimization procedures would be
needed to fully exploit the data and the theoretical richness of the family of functions
in which the learner searches for a solution.
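As a concrete illustration of the first failure mode, the short sketch below (hypothetical code
using NumPy; the sample size, noise level and polynomial degrees are arbitrary choices, not
taken from this chapter) fits a straight line to data whose targets are quadratic in the input:
even on the training set the error stays large, while a quadratic fit of the same data succeeds.

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(-3, 3, size=50)
    y = x ** 2 + 0.1 * rng.randn(50)        # targets are quadratic in x, plus noise

    # Least-squares fit of a degree-1 polynomial (a line): the family is too poor and underfits.
    line = np.polyfit(x, y, deg=1)
    train_mse_line = np.mean((np.polyval(line, x) - y) ** 2)

    # Least-squares fit of a degree-2 polynomial: matches the data-generating family.
    quad = np.polyfit(x, y, deg=2)
    train_mse_quad = np.mean((np.polyval(quad, x) - y) ** 2)

    print(train_mse_line, train_mse_quad)   # the linear fit has much larger training error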
Overfitting is more subtle and not so easy to wrap one’s mind around. What would
be wrong with picking a very rich family of functions that is large enough to always fit
our training data? We could certainly fit the training data all right, but we might lose
the ability to generalize well. Why? The fundamental reason is that if the family of
functions accessible to the learner (with its limitations on computational resources) is
too large, then this family of functions may actually contain many functions which all
fit the training data well. Without sufficient data, we would therefore not be able to
distinguish which one was most appropriate, and the learner would make an arbitrary
choice among these apparently good solutions. Yet, these functions are all different from
each other, yielding different answers in at least some cases. Even if one of these functions
were the correct one, we might pick the wrong one, and the expected generalization
error would thus be at least proportional to the average discrepancy between these
functions, which is related to the variance of the estimator, discussed below (Section 5.5).
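To make this concrete, the sketch below (hypothetical code using NumPy; the degree-9
polynomial, training sets of ten points and sine-shaped target are arbitrary illustrative
choices) repeatedly fits a very rich model to small, freshly sampled training sets: each fit
reaches near-zero training error, yet the fitted functions disagree strongly away from their
training points, and their test error is far larger than their training error.

    import numpy as np

    rng = np.random.RandomState(0)
    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(2 * np.pi * x_test)

    predictions = []
    for _ in range(5):
        # A fresh, small training set drawn from the same distribution.
        x_train = rng.uniform(-1, 1, size=10)
        y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.randn(10)

        # Degree-9 polynomial: enough capacity to fit ten points almost exactly.
        coeffs = np.polyfit(x_train, y_train, deg=9)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        predictions.append(np.polyval(coeffs, x_test))
        print(train_mse, test_mse)          # near-zero training error, much larger test error

    # The fitted functions differ from run to run: their spread (variance) is what
    # drives the gap between training and generalization error.
    print(np.mean(np.var(np.stack(predictions), axis=0)))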
Occam’s razor suggests picking the family of functions just large enough to leave only one
choice that fits the data well. Statistical learning theory has quantified this notion more
precisely, and shown a mathematical relationship between capacity (e.g. measured by the
VC-dimension) and the difference between generalization error and training error (also
known as optimism). Although generalization error cannot generally be characterized
exactly because it depends on the specifics of the training distribution, error bounds can
be proven, showing that the discrepancy between training error and generalization error
is bounded by a quantity that grows with the ratio of capacity to number of training
examples (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik,
1995).
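One representative bound of this kind, in the form popularized by Vapnik (the exact constants
vary across the results cited above), states that with probability at least 1 - η over the draw
of N training examples, every function f in a family of VC-dimension h satisfies

    generalization error(f)  <=  training error(f) + sqrt( (h (ln(2N/h) + 1) - ln(η/4)) / N ),

so the gap between the two errors shrinks as the number of examples N grows relative to the
capacity h.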
Since training error decreases as a function of capacity while the difference between
generalization error and training error increases with capacity, their sum, the generalization
error, typically first decreases and then increases: it is a U-shaped function of capacity.
This is illustrated
in Figure 5.3, showing the underfitting regime (left) and overfitting regime (right).
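The same U-shape can be reproduced numerically with a sketch along the following lines
(hypothetical code using NumPy; the sine-shaped target, sample sizes and range of degrees
are arbitrary choices): as the polynomial degree, standing in for capacity, increases, training
error decreases while test error first decreases and then increases.

    import numpy as np

    rng = np.random.RandomState(0)
    x_train = rng.uniform(-1, 1, size=30)
    y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.randn(30)
    x_test = rng.uniform(-1, 1, size=1000)
    y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.randn(1000)

    for degree in range(1, 16):             # degree plays the role of capacity
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, train_mse, test_mse)  # training error falls; test error traces a U-shape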