Chapter 5
Machine Learning Basics
Deep learning is a specific kind of machine learning. In order to understand deep
learning well, one must have a solid understanding of the basic principles of ma-
chine learning. This chapter provides a brief course in the most important general
principles that will be applied throughout the rest of the book. Novice readers or
those who want a wider perspective are encouraged to consider machine learning
textbooks with a more comprehensive coverage of the fundamentals, such as Mur-
phy (2012) or Bishop (2006). If you are already familiar with machine learning
basics, feel free to skip ahead to Section 5.12. That section covers some perspec-
tives on traditional machine learning techniques that have strongly influenced the
development of deep learning algorithms.
5.1 Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But
what do we mean by learning? A popular definition of learning in the context of
computer programs is “A computer program is said to learn from experience E
with respect to some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E” (Mitchell, 1997).
One can imagine a very wide variety of experiences E, tasks T, and performance
measures P, and we do not make any attempt in this book to provide a formal
definition of what may be used for each of these entities. Instead, the following
sections provide intuitive descriptions and examples of the different kinds of tasks,
performance measures, and experiences that can be used to construct machine
learning algorithms.
5.1.1 The Task, T
Machine learning is mostly interesting because of the tasks we can accomplish
with it. From an engineering point of view, machine learning allows us to tackle
tasks that are too difficult to solve with fixed programs written and designed by
human beings. From a scientific and philosophical point of view, machine learning
is interesting because understanding it allows us to understand the principles that
underlie intelligent behavior, and intelligent behavior is defined as being able to
accomplish certain tasks.
In this relatively formal definition of the word “task,” the process of learning
itself is not the task. Learning is our means of attaining the ability to perform
the task. For example, if we want a robot to be able to walk, then walking is
the task. We could program the robot to learn to walk, or we could attempt to
directly write a program that specifies how to walk manually.
Many kinds of tasks can be solved with machine learning. Some of the most
common machine learning tasks include the following:
Classification: In this type of task, the computer program is asked to specify
which of $k$ categories some input belongs to. To solve this task, the learning
algorithm is usually asked to output a function $f : \mathbb{R}^n \to \{1, \dots, k\}$, which
may then be applied to any input. Here the output of $f(x)$ can be interpreted
as an estimate of the category that $x$ belongs to. There are other
variants of the classification task, for example, where f outputs a proba-
bility distribution over classes. An example of this type of task is object
recognition, where the input is an image (usually described as a set of pixel
brightness values), and the output is a numeric code identifying the object
in the image. For example, the Willow Garage PR2 robot is able to act
as a waiter that can recognize different kinds of drinks and deliver them
to people on command (Goodfellow et al., 2010). Modern object recogni-
tion is best accomplished with deep learning (Krizhevsky et al., 2012a; Ioffe
and Szegedy, 2015). Object recognition is the same basic technology that
allows computers to recognize faces (Taigman et al., 2014), which can be
used to automatically tag people in photo collections and allow computers
to interact more naturally with their users.
Classification with missing inputs: Classification becomes more challenging
if the computer program is not guaranteed that every measurement in its input
vector will always be provided. To solve the classification task when all inputs
are available, the learning algorithm needs to define only a single function mapping from
a vector input to a categorical output. When some of the inputs may be
missing, rather than providing a single classification function, the learning
algorithm must learn a set of functions. Each function corresponds to clas-
sifying x with a different subset of its inputs missing. This kind of situation
arises frequently in medical diagnosis, because many kinds of medical tests
are expensive or invasive. Deep probabilistic models can solve these prob-
lems by marginalizing out the missing variables (Goodfellow et al., 2013b).
Many of the other tasks described in this section can also be generalized
to work with missing inputs; classification with missing inputs is just one
example of what machine learning can do.
Regression: In this type of task, the computer program is asked to predict
a numerical value given some input. To solve this task, the learning algorithm
is asked to output a function $f : \mathbb{R}^n \to \mathbb{R}$. This type of task is similar to
classification, except that the type of output is different. An example of a
regression task is the prediction of the expected claim amount
that an insured person will make (used to set insurance premia), or the
prediction of future prices of securities. These kinds of predictions are used
for algorithmic trading.
Transcription : This type of task is similar to classification, except that
the output is a sequence of symbols, rather than a category code (which
can be thought of as a single symbol). For example, in speech recognition,
machine translation or image captioning, the target output is a sequence of
words. For machine translation, the computer program observes a sentence
in one language (i.e., a variable length input) and outputs a translation
of that sentence in another language (Sutskever et al., 2014a; Bahdanau
et al., 2014). For image captioning, the computer program observes an
image and outputs a natural language sentence describing the image (Kiros
et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015; Donahue et al., 2014;
Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015a). Such tasks fall
under the more general umbrella of structured output tasks, where the target
output is a complex componential object involving many random variables
(such as the individual words in the translated sentence) that have a non-
trivial joint distribution, given the input. Examples of this include speech
recognition, where the machine learning model receives an audio waveform
and must output a complete sequence of words that were recorded in this
waveform. Deep learning has greatly improved the performance of speech
recognition in commercial applications, such as speech recognition on mobile
phones (Hinton et al., 2012b).
Structured output tasks involve any task where the output is a vector con-
taining important relationships between the different elements. This is a
broad category, and includes the transcription tasks described above. Struc-
tured modeling is more general and includes other tasks like parsing (mapping
a natural language sentence into a tree that describes its grammatical structure
and the relative roles of its constituents) and pixel-wise segmentation of images,
where the output is an image whose pixel-wise elements are categories (the pixel
belongs to road, sky, beach, etc., or, for medical images, is cancerous or not).
In all these cases the elements of the output are random variables that are highly
interdependent: if a voxel in a 3-D brain image is categorized as not cancerous,
then it is very likely that a neighboring voxel should also be categorized this way.
Anomaly detection: In this type of task, the computer program sifts through
a set of events or objects, and flags some of them as being unusual or atypi-
cal. An example of an anomaly detection task is credit card fraud detection.
By modeling your purchasing habits, a credit card company can detect mis-
use of your cards. If a thief steals your credit card or credit card information,
the thief’s purchases will often come from a different probability distribution
over purchase types than your own. The credit card company can prevent
fraud by placing a hold on an account as soon as that card has been used
for an uncharacteristic purchase.
Synthesis and sampling: In this type of task, the machine learning algo-
rithm is asked to generate new examples that are similar to those in the
training data. This can be useful for media applications where it can be
expensive or boring for an artist to generate large volumes of content by
hand. For example, video games can automatically generate textures for
large objects or landscapes, rather than requiring an artist to manually la-
bel each pixel (Luo et al., 2013). Other examples include the generation of
sounds and text (for speech synthesis, special effects, dialogue generation,
etc.). Often the object should be synthesized in a context (e.g., generating
the reply of a non-player character in a video game, or of an automated
help system, in some appropriate context that includes the previous dialogue
turns). These then become structured output problems, but ones where the
output should be sampled from an estimated conditional distribution (of the
output variable given the context information).
Imputation of missing values: In this type of task, the machine learning
algorithm is given a new example $x \in \mathbb{R}^n$, but with some entries $x_i$ of $x$
missing. The algorithm must provide a prediction of the values of the missing
entries.
Denoising: In this type of task, the machine learning algorithm is given
as input a corrupted example $\tilde{x} \in \mathbb{R}^n$ obtained by an unknown corruption
process from a clean example $x \in \mathbb{R}^n$. The learner must predict the clean
example $x$ from its corrupted version $\tilde{x}$, or more generally predict the
conditional probability distribution $P(x \mid \tilde{x})$.
Density or probability function estimation: In the density estimation problem,
the machine learning algorithm is asked to learn a function $p_{\text{model}} : \mathbb{R}^n \to \mathbb{R}$,
where $p_{\text{model}}(x)$ can be interpreted as a probability density function
(if $x$ is continuous) or a probability function (if $x$ is discrete) on the
space that the examples were drawn from. To do such a task well (we will
specify exactly what that means when we discuss performance measures P ),
the algorithm needs to learn the structure of the data it has seen. It must
know where examples cluster tightly and where they are unlikely to occur.
In order to do well on some tasks such as density estimation, sampling and
denoising, one must actually fully capture the structure of the data distri-
bution. In the case of density estimation, we can even write formulae that
allow us to solve most other tasks in principle, given the estimated density
(the actual computations involved may be intractable, though).
Of course, many other tasks and types of tasks are possible. The types of tasks
we listed here are only intended to provide examples of what machine learning
can do, not to define a rigid taxonomy of tasks.
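To make one of these tasks concrete in code, the following is a minimal
density-estimation sketch (not part of the original text): it fits a univariate
Gaussian to samples by maximum likelihood and evaluates the learned density at
new points; the data generating parameters are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.5, size=1000)  # samples from an assumed data generating distribution

    # Maximum likelihood estimates of the Gaussian parameters.
    mu_hat = data.mean()
    sigma2_hat = data.var()

    def p_model(x):
        # Learned density p_model : R -> R, evaluated at x.
        return np.exp(-(x - mu_hat) ** 2 / (2 * sigma2_hat)) / np.sqrt(2 * np.pi * sigma2_hat)

    print(p_model(2.0))   # high density where examples cluster tightly
    print(p_model(10.0))  # low density where examples are unlikely to occur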
5.1.2 The Performance Measure, P
In order to evaluate the abilities of a machine learning algorithm, we must design
a quantitative measure of its performance. Usually this performance measure P
is specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and tran-
scription, we often measure the accuracy of the model. In the simplest case this is
just the proportion of examples for which the model produces the correct output.
For tasks such as density estimation, we can measure the probability the model
assigns to some examples.
Usually we are interested in how well the machine learning algorithm performs
on data that it has not seen before, since this determines how well it will work
when deployed in the real world. We therefore evaluate these performance mea-
sures using a test set of data that is separate from the data used for training the
machine learning system.
The choice of performance measure may seem straightforward and objective,
but it is often difficult to choose a performance measure that corresponds well to
the desired behavior of the system.
In some cases, this is because it is difficult to decide what should be mea-
sured. For example, when performing a transcription task, should we measure
the accuracy of the system at transcribing entire sequences, or should we use a
more fine-grained performance measure that gives partial credit for getting some
elements of the sequence correct? When performing a regression task, should we
penalize the system more if it frequently makes medium-sized mistakes or if it
rarely makes very large mistakes? These kinds of design choices depend on the
application.
In other cases, we know what quantity we would ideally like to measure, but
measuring it is impractical. For example, this arises frequently in the context of
density estimation. Many of the best probabilistic models represent probability
distributions only implicitly. Computing the actual probability value assigned
to a specific point in space is intractable. In these cases, one must design an
alternative criterion that still corresponds to the design objectives, or design a
good approximation to the desired criterion.
5.1.3 The Experience, E
Machine learning algorithms can be broadly categorized by what kind of experi-
ence they are allowed to have during the learning process.
Most of the learning algorithms in this book can be understood as being
allowed to experience an entire dataset. A dataset is a collection of many objects
called examples, with each example containing many features that have been
objectively measured. Sometimes we will also call examples data points.
One of the oldest datasets studied by statisticians and machine learning re-
searchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of
different parts of iris plants. In total, 150 individual plants were measured. Each
individual plant corresponds to one example. The features within each example
are the measurements of each of the parts of the plant: the sepal length, sepal
width, petal length, and petal width. The dataset also records which species each
plant belonged to. Three different species are represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many fea-
tures, then learn useful properties of the structure of this dataset. From the
probabilistic point of view the quintessential unsupervised learning algorithm is
density estimation, in which the learning algorithm attempts to discover the prob-
ability distribution that generated the examples in the dataset. However, other
unsupervised learning tasks (such as sampling and denoising) share the property
that they force the learner to fully capture the statistical structure of the data.
Supervised learning algorithms experience a dataset containing features, but
each example is also associated with a label or target. For example, the Iris
dataset is annotated with the species of each iris plant. A supervised learning
algorithm can study the Iris dataset and learn to classify iris plants into three
different species based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples
of a random vector x, and attempting to implicitly or explicitly learn the proba-
bility distribution p(x), or some interesting properties of that distribution, while
supervised learning involves observing several examples of a random vector x and
an associated value or vector y, and learning to predict y from x, e.g. estimating
p(y | x). The term supervised learning originates from the view of the target
y being provided by an instructor or teacher that shows the machine learning
system what to do. In unsupervised learning, there is no instructor or teacher,
and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms,
and the lines between them are often blurred. Many machine learning technologies
can be used to perform both tasks. For example, the chain rule of probability
states that for a vector $x \in \mathbb{R}^n$, the joint distribution can be decomposed as
\[ p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}). \]
This decomposition means that we can solve the ostensibly unsupervised problem
of modeling p(x) by splitting it into n supervised learning problems. Alternatively,
we can solve the supervised learning problem of learning p(y | x) by using tra-
ditional unsupervised learning technologies to learn the joint distribution p(x, y)
and inferring
\[ p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}. \]
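To make the normalization concrete, here is a small sketch (not from the text)
that recovers $p(y \mid x)$ from a tabulated joint distribution over two discrete
variables; the joint table is invented for the example.

    import numpy as np

    # Hypothetical joint distribution p(x, y), with x in {0, 1, 2} (rows) and y in {0, 1} (columns).
    p_xy = np.array([[0.10, 0.20],
                     [0.25, 0.05],
                     [0.30, 0.10]])

    # p(y | x) = p(x, y) / sum over y' of p(x, y'): normalize each row of the joint table.
    p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
    print(p_y_given_x)  # each row now sums to 1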
Though unsupervised learning and supervised learning are not completely formal
or distinct concepts, they do help to roughly categorize some of the things we
do with machine learning algorithms. Traditionally, people refer to regression,
classification, and structured output problems as supervised learning. Density
estimation in support of other tasks is usually considered unsupervised learning.
Some machine learning algorithms do not just experience a fixed dataset.
For example, reinforcement learning algorithms interact with an environment, so
there is a feedback loop between the learning system and its experiences. Such
algorithms are beyond the scope of this book.
A dataset can be described in many ways. In all cases, a dataset is a collection
of examples. Each example is a collection of observations called features collected
from a different time or place. If we wish to make a system for recognizing
objects from photographs, we might use a machine learning algorithm where each
example is a photograph, and the features within the example are the brightness
values of each of the pixels within the photograph. If we wish to perform speech
recognition, we might collect a dataset where each example is a recording of a
person saying a word or sentence, and each of the features is the amplitude of the
sound wave at a particular moment in time.
One common way of describing a dataset is with a design matrix. A design
matrix is a matrix containing a different example in each row. Each column of the
matrix corresponds to a different feature. For instance, the Iris dataset contains
150 examples with four features for each example. This means we can represent
the dataset with a design matrix $X \in \mathbb{R}^{150 \times 4}$, where $X_{i,1}$ is the sepal length
of plant $i$, $X_{i,2}$ is the sepal width of plant $i$, etc. We will describe most of the
learning algorithms in this book in terms of how they operate on design matrix
datasets.
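As a small illustration (not from the text), a design matrix can be built directly
as a numpy array; the measurements below are hypothetical values for three of the
150 plants.

    import numpy as np

    # Each row is one example; columns are sepal length, sepal width, petal length, petal width.
    X = np.array([[5.1, 3.5, 1.4, 0.2],
                  [4.9, 3.0, 1.4, 0.2],
                  [6.3, 3.3, 6.0, 2.5]])
    print(X.shape)   # (3, 4): one example per row, one feature per column
    print(X[0, 0])   # sepal length of the first plant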
Of course, to describe a dataset as a design matrix, it must be possible to
describe each example as a vector, and each of these vectors must be the same size.
This is not always possible. For example, if you have a collection of photographs
with different widths and heights, then different photographs will contain different
numbers of pixels, so not all of the photographs may be described with the same
length of vector. Different sections of this book describe how to handle different
types of heterogeneous data.
In cases like these, rather than describing the dataset as a matrix with $m$ rows,
we will describe it as a set containing $m$ elements, e.g. $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$.
This notation does not imply that any two example vectors $x^{(i)}$ and $x^{(j)}$ have
the same size.
In the case of supervised learning, the example contains a label or target as
well as a collection of features. For example, if we want to use a learning algorithm
to perform object recognition from photographs, we need to specify which object
appears in each of the photos. We might do this with a numeric code, with 0
signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working
with a dataset containing a design matrix of feature observations $X$, we also
provide a vector of labels $y$, with $y_i$ providing the label for example $i$.
Of course, sometimes the label may be more than just a single number. For
example, if we want to train a speech recognition system to transcribe entire
sentences, then the label for each example sentence is a sequence of words. These
cases require more sophisticated data structures.
Just as there is no formal definition of supervised and unsupervised learning,
there is no rigid taxonomy of datasets or experiences. The structures described
here cover most cases, but it is always possible to design new ones for new appli-
cations.
5.2 Example: Linear Regression
In the previous section, we saw that a machine learning algorithm is an algorithm
that is capable of improving a computer program’s performance at some task via
experience. Now it is time to define some specific machine learning algorithms.
Let’s begin with an example of a simple machine learning algorithm. One of
the simplest machine learning algorithms is linear regression. In this section, we
will only describe what the linear regression algorithm does. We wait until later
sections of this chapter to justify the algorithm and show more formally that it
actually works.
As the name implies, linear regression solves a regression problem. In other
words, the goal is to build a system that can take a vector $x \in \mathbb{R}^n$ as input and
predict the value of a scalar $y \in \mathbb{R}$ as its output. In the case of linear regression,
the output is a linear function of the input. Let $\hat{y}$ be the value that our model
predicts $y$ should take on. We define the output to be
\[ \hat{y} = w^\top x, \]
where $w \in \mathbb{R}^n$ is a vector of parameters.
Parameters are values that control the behavior of the system. In this case, $w_i$
is the coefficient that we multiply by feature $x_i$ before summing up the contributions
from all the features. We can think of $w$ as a set of weights that determine
how each feature affects the prediction. If a feature $x_i$ receives a positive weight
$w_i$, then increasing the value of that feature increases the value of our prediction
$\hat{y}$. If a feature receives a negative weight, then increasing the value of that feature
decreases the value of our prediction. If a feature's weight is large in magnitude,
then it has a large effect on the prediction. If a feature's weight is zero, it has no
effect on the prediction.
We thus have a definition of our task $T$: to predict $y$ from $x$ by outputting
$\hat{y} = w^\top x$. Next we need a definition of our performance measure, $P$.
Let's suppose that we have a design matrix of $m$ example inputs that we will
not use for training, only for evaluating how well the model performs. We also
have a vector of regression targets providing the correct value of $y$ for each of
these examples. Because this dataset will only be used for evaluation, we call it
the test set. Let's refer to the design matrix of inputs as $X^{(\text{test})}$ and the vector
of regression targets as $y^{(\text{test})}$.
One way of measuring the performance of the model is to compute the mean
squared error of the model on the test set. If $\hat{y}^{(\text{test})}$ gives the predictions of the
model on the test set, then the mean squared error is given by
\[ \text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{y}^{(\text{test})} - y^{(\text{test})} \right)_i^2. \]
Intuitively, one can see that this error measure decreases to 0 when $\hat{y}^{(\text{test})} = y^{(\text{test})}$.
We can also see that
\[ \text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{y}^{(\text{test})} - y^{(\text{test})} \right\|_2^2, \]
so the error increases whenever the Euclidean distance between the predictions
and the targets increases.
To make a machine learning algorithm, we need to design an algorithm that
will improve the weights $w$ in a way that reduces $\text{MSE}_{\text{test}}$ when the algorithm
is allowed to gain experience by observing a training set $(X^{(\text{train})}, y^{(\text{train})})$. One
intuitive way of doing this (which we will justify later) is just to minimize the
mean squared error on the training set, $\text{MSE}_{\text{train}}$.
To minimize $\text{MSE}_{\text{train}}$, we can simply solve for where its gradient is 0:
\[
\begin{aligned}
&\nabla_w \text{MSE}_{\text{train}} = 0 \\
\Rightarrow\ &\nabla_w \frac{1}{m} \left\| \hat{y}^{(\text{train})} - y^{(\text{train})} \right\|_2^2 = 0 \\
\Rightarrow\ &\frac{1}{m} \nabla_w \left\| X^{(\text{train})} w - y^{(\text{train})} \right\|_2^2 = 0 \\
\Rightarrow\ &\nabla_w \left( X^{(\text{train})} w - y^{(\text{train})} \right)^\top \left( X^{(\text{train})} w - y^{(\text{train})} \right) = 0 \\
\Rightarrow\ &\nabla_w \left( w^\top X^{(\text{train})\top} X^{(\text{train})} w - 2 w^\top X^{(\text{train})\top} y^{(\text{train})} + y^{(\text{train})\top} y^{(\text{train})} \right) = 0 \\
\Rightarrow\ &2 X^{(\text{train})\top} X^{(\text{train})} w - 2 X^{(\text{train})\top} y^{(\text{train})} = 0 \\
\Rightarrow\ &w = \left( X^{(\text{train})\top} X^{(\text{train})} \right)^{-1} X^{(\text{train})\top} y^{(\text{train})} \qquad (5.1)
\end{aligned}
\]
The system of equations defined by Eq. 5.1 is known as the normal equations.
Solving these equations constitutes a simple learning algorithm. For an example
of the linear regression learning algorithm in action, see Fig. 5.1.
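Here is a minimal sketch (not from the text) of this learning algorithm: it
solves Eq. 5.1 with numpy on invented training data. In practice one would
prefer a numerically stabler routine such as np.linalg.lstsq over forming the
inverse explicitly.

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 3))                      # 100 examples, 3 features
    w_true = np.array([0.5, -2.0, 1.0])                      # assumed true weights
    y_train = X_train @ w_true + 0.1 * rng.normal(size=100)  # noisy regression targets

    # Normal equations: w = (X^T X)^{-1} X^T y, solved as a linear system.
    w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

    mse_train = np.mean((X_train @ w - y_train) ** 2)
    print(w, mse_train)  # w should be close to w_true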
It's worth noting that the term linear regression is often used to refer to a
slightly more sophisticated model with one additional parameter, an intercept
term $b$. In this model
\[ \hat{y} = w^\top x + b, \]
so the mapping from parameters to predictions is still a linear function, but the
mapping from features to predictions is now an affine function. This extension
to affine functions means that the plot of the model's predictions still looks like
a line, but it need not pass through the origin. We will frequently use the term
“linear” when referring to affine functions throughout this book.
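One common way to implement the intercept is to append a constant feature of
1 to every input, so that the corresponding weight plays the role of $b$ and the
normal equations above apply unchanged; the sketch below (continuing the previous
one, with an invented intercept) illustrates the trick.

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 3))
    y_train = X_train @ np.array([0.5, -2.0, 1.0]) + 3.0 + 0.1 * rng.normal(size=100)

    # Append a column of ones; the last learned weight then acts as the intercept b.
    X_aug = np.hstack([X_train, np.ones((100, 1))])
    w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_train)
    w, b = w_aug[:-1], w_aug[-1]
    print(b)  # should be close to the assumed intercept 3.0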
Linear regression is of course an extremely simple and limited learning al-
gorithm, but it provides an example of how a learning algorithm can work. In
the subsequent sections we will describe some of the basic principles underlying
learning algorithm design and demonstrate how these principles can be used to
build more complicated learning algorithms.
Figure 5.1: Consider this example linear regression problem, with a training set consisting
of 5 data points, each containing one feature. This means that the weight vector $w$
contains only a single parameter to learn, $w_1$. (Left) Observe that linear regression learns
to set $w_1$ such that the line $y = w_1 x$ comes as close as possible to passing through all the
training points. (Right) The plotted point indicates the value of $w_1$ found by the normal
equations, which we can see minimizes the mean squared error on the training set.
5.3 Generalization, Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a train-
ing set, we can compute some error measure on the training set called the training
error, and we reduce this training error. So far, what we have described is simply
an optimization problem. What separates machine learning from optimization is
that we want the generalization error to be low as well. The generalization error
is defined as the expected value of the error on a new input. Here the expectation
is taken across different possible inputs, drawn from the distribution of inputs we
expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by
measuring its performance on a test set of examples that were collected separately
from the training set.
In our linear regression example, we trained the model by minimizing the
training error,
\[ \frac{1}{m^{(\text{train})}} \left\| X^{(\text{train})} w - y^{(\text{train})} \right\|_2^2, \]
but we actually care about the test error,
\[ \frac{1}{m^{(\text{test})}} \left\| X^{(\text{test})} w - y^{(\text{test})} \right\|_2^2. \]
How can we affect performance on the test set when we only get to observe
the training set? The field of statistical learning theory provides some answers.
If the training and the test set were collected arbitrarily, there is indeed little we
can do. If we are allowed to make some assumptions about how the training and
test set are collected, then we can make some progress.
We typically make a set of assumptions known collectively as the i.i.d. as-
sumptions. These assumptions are that the examples in each dataset are indepen-
dent from each other, and that the train set and test set are identically distributed,
drawn from the same probability distribution as each other. We call that shared
underlying distribution the data generating distribution, or data generating pro-
cess (which is particularly relevant if the examples are not independent). This
probabilistic framework allows us to mathematically study the relationship be-
tween training error and test error.
One immediate connection we can observe between the training and test error
is that for a randomly selected model, the two have the same expected value.
Suppose we have a probability distribution p(x, y) and we sample from it repeatedly
to generate the train set and the test set. For some fixed value w, the
expected training set error under this sampling process is exactly the same as the
expected test set error under this sampling process. The only difference between the two
conditions is the name we assign to the dataset we sample. From this observation,
we can see that it is natural for there to be some relationship between training
and test error under these assumptions.
Of course, when we use a machine learning algorithm, we do not fix the
parameters ahead of time, then sample both datasets. We sample the training
set, then use it to choose the parameters, then sample the test set. Under this
process, the expected test error is greater than or equal to the expected value of
training error, if the learner is really trying to minimize its training error. The
two main questions in determining how a machine learning algorithm will perform
are
1. How low is the machine learning algorithm expected to drive the training
error?
2. How big is the gap between training and test error expected to be?
These two factors correspond to the two central challenges in machine learning:
underfitting and overfitting. Underfitting occurs when the model is not able to
obtain a sufficiently low error value on the training set. Overfitting occurs when
the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering
its capacity. Informally, a model’s capacity is its ability to fit a wide variety of
functions. Models with low capacity may struggle to fit the training set. Models
with high capacity can overfit, i.e., memorize properties of the training set that
do not serve them well on the test set.
For example, if we solve a regression problem by using a polynomial function of
a scalar input $x$, polynomials of higher degree have higher capacity. A polynomial
of degree one gives us the linear regression model with which we are already
familiar, with prediction
\[ \hat{y} = b + wx. \]
By introducing $x^2$ as another feature provided to the linear regression model, we
can learn a model that is quadratic as a function of $x$:
\[ \hat{y} = b + w_1 x + w_2 x^2. \]
We can continue to add more powers of $x$ as additional features, for example to
obtain a polynomial of degree 10:
\[ \hat{y} = b + \sum_{i=1}^{10} w_i x^i. \]
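A hedged sketch of this capacity knob: expand a scalar input into polynomial
features and reuse the least-squares fit from Section 5.2; the synthetic training
data (a noisy quadratic) is invented for the example.

    import numpy as np

    def poly_features(x, degree):
        # Map a vector of scalar inputs into the features [x, x^2, ..., x^degree, 1].
        return np.hstack([x[:, None] ** np.arange(1, degree + 1), np.ones((len(x), 1))])

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=20)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.normal(size=20)  # quadratic plus noise

    for degree in (1, 2, 10):
        X = poly_features(x, degree)
        w = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit
        print(degree, np.mean((X @ w - y) ** 2))  # training error shrinks as capacity grows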
Machine learning algorithms will generally perform best when their capacity
is appropriate for the true complexity of the task they need to perform and
the amount of training data they are provided with. Models with too little capacity
are unable to solve complex tasks. Models with high capacity can solve complex
tasks, but when their capacity is too high they may overfit.
Fig. 5.2 shows this principle in action. We compare a linear, quadratic, and
degree-10 predictor attempting to fit a problem where the true underlying func-
tion is quadratic. The linear function is unable to capture the curvature in the
true underlying problem, so it underfits. Consider the case where the amount of
noise is very small: there is only one degree-2 polynomial that can pass through
the examples, so long as there are at least three examples, independent of which
examples they are, so long as they were sampled from the data generating
distribution. On the other hand, depending on the sample of training examples,
very different degree-10 polynomials could be obtained. The degree-10 polyno-
mial is capable of representing the correct function (and better capable of fitting
the noise), but when fitting to only a small number of points, there are so many
different degree-10 polynomials capable of explaining the 20 randomly chosen
training examples that it is very unlikely we will choose the right one, so we overfit.
With more noise, the differences between these fitted degree-10 polynomials can
grow enormously, whereas the different degree-2 polynomials fitted with different
samples of 20 noisy points must all be similar to each other. In this example, only
the quadratic model recovers a prediction close to the true structure of the task
and generalizes well to new data.
Figure 5.2: We fit three models to this example training set. The training data was
generated synthetically, by adding a small amount of noise to a quadratic function. (Left)
A linear function fit to the data suffers from underfitting—it cannot capture the curvature
that is present in the data. (Center) A quadratic function fit to the data generalizes well
to unseen points. It does not suffer from a significant amount of overfitting or underfitting.
(Right) A polynomial of degree 10 fit to the data suffers from overfitting. It is tightly fit
to noise in the data and has inappropriately steep slope at the edges of the data.
Here we have only described changing a model’s capacity by changing the
number of input features it has (and simultaneously adding new parameters asso-
ciated with those features). There are many other ways of controlling the capacity
of a machine learning algorithm, which we will explore in the sections ahead.
Many of these ideas date back to Occam’s razor (c. 1287-1347), also known as
the principle of parsimony, which states that among competing hypotheses (here,
read functions that could explain the observed data), one should choose the “sim-
pler” one. This idea was formalized and made more precise in the 20th century by
the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vap-
nik, 1982; Blumer et al., 1989; Vapnik, 1995). This body of work provides various
means of quantifying model capacity and showing that the discrepancy between
training error and generalization error is bounded by a quantity that grows with
the ratio of capacity to number of training examples (Vapnik and Chervonenkis,
1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide
intellectual justification that machine learning algorithms can work, but they are
rarely used in practice when working with deep learning algorithms. This is in
part because the bounds are often quite loose, and in part because it can be quite
difficult to determine the capacity of deep learning algorithms.
Typically, training error decreases until it asymptotes to the minimum pos-
sible error value as model capacity increases (assuming your error measure has
a minimum value). Typically, generalization error has a U-shaped curve as a
function of model capacity. This is illustrated in Figure 5.3.
Training and generalization error also vary as the size of the training set varies.
See Fig. 5.4 for an illustration. The figure introduces the notion of parametric
and non-parametric learning algorithms. Parametric ones have a fixed maximum
Figure 5.3: Typical relationship between capacity (horizontal axis) and both training
(bottom curve, dotted) and generalization (or test) error (top curve, bold). At the left end
of the graph, training error and generalization error are both high. This is the underfitting
regime. As we increase capacity, training error decreases, but the gap between training
and generalization error increases. Eventually, the size of this gap outweighs the decrease
in training error, and we enter the overfitting regime, where capacity is too large, above
the optimal capacity.
capacity (their capacity can still be reduced by various means, such as a poor
optimization procedure), and they are called parametric because they have a
fixed-size parameter vector. On the other hand, non-parametric learners are allowed
to set their capacity based on the given data, i.e., the number of parameters is
something that can be determined after the data is observed, and typically more
data allows a greater capacity, i.e., more parameters. Note that it is possible
for the model to have optimal capacity and yet still have a large gap between
training and generalization error. In this situation, we can only reduce this gap
by gathering more training examples.
It’s worth mentioning that capacity is not just determined by which model
we use. The model specifies which family of functions the learning algorithm can
choose from when varying the parameters in order to reduce a training objective.
This is called the representational capacity of the model. In many cases, finding
the best function within this family is a very difficult optimization problem. In
practice, the learning algorithm does not actually find the best function, just one
that significantly reduces the training error. These additional restrictions mean
that the model’s effective capacity may be less than its representational capacity.
5.4 The No Free Lunch Theorem
Although learning theory, sketched above, suggests that it is possible to generalize,
one should consider a serious caveat, discussed here. Generally speaking, inductive
reasoning, or inferring general rules from a limited set of examples, is not logically
valid. To logically infer a rule describing every member of a set, one must have
information about every member of that set. One may wonder then how the
claims that machine learning can generalize well are logically valid.
In part, machine learning avoids this problem by offering only probabilistic
rules, rather than the entirely certain rules used in purely logical reasoning. Ma-
chine learning promises to find rules that are probably correct about most members
of the set they concern.
Unfortunately, even this does not resolve the entire problem. The no free
lunch theorem for machine learning (Wolpert, 1996) states that, averaged over
all possible data generating distributions, every classification algorithm has the
same error rate when classifying previously unobserved points. In other words,
in some sense, no machine learning algorithm is universally any better than any
other. The most sophisticated algorithm we can conceive of has the same average
performance (over all possible tasks) as merely predicting that every point belongs
to the same class.
Fortunately, these results hold only when we average over all possible data
generating distributions. If we make assumptions about the kinds of probability
distributions we encounter in real-world applications, then we can design learning
algorithms that perform well on these distributions.
Figure 5.4: This plot shows the effect of the dataset size on the train and test error of
the model, as well as on the optimal model capacity. Note that the y-axis is used to show
two different values which are not actually comparable—error, and capacity. If we choose
a single model with fixed capacity (red), known as a parametric learning algorithm and
retrain it with different amounts of training data, then the training error will increase as
the size of the training set increases. This is because larger datasets are harder to fit.
Simultaneously, the test error will decrease, because fewer incorrect hypotheses will be
consistent with the training data. Ultimately the train and test error will converge. If
we instead consider a learning algorithm that can adapt its capacity with training set
size, then the optimal capacity (black) increases with the number of training examples,
and reaches an asymptote which only depends on the required complexity for the task to
be learned. This kind of learning algorithm is called non-parametric. The generalization
error of the optimal capacity model (green) decreases and approaches an asymptote,
called the Bayes error (the error made by an oracle that knows the data generating
distribution). Usually the asymptotic error is greater than zero because there is some
noise in the true distribution that the model is asked to capture.
This means that the goal of machine learning research is not to seek a universal
learning algorithm or the absolute best learning algorithm. Instead, our goal is
to understand what kinds of distributions are relevant to the “real world” that
an AI agent experiences, and what kinds of machine learning algorithms perform
well on data drawn from the kinds of data generating distributions we care about.
5.5 Hyperparameters, Validation Sets and Cross-Validation
Most machine learning algorithms have several settings that we can use to control
the behavior of the learning algorithm. These settings are called hyperparameters.
The values of hyperparameters are not adapted by the learning algorithm itself
(though we can design a nested learning procedure where one learning algorithm
learns the best hyperparameters for another learning algorithm).
In the polynomial regression example we saw in Fig. 5.2, there is a single
hyperparameter: the degree of the polynomial, which acts as a capacity hyper-
parameter.
Sometimes a setting is chosen to be a hyperparameter that the learning algo-
rithm does not learn because it is difficult to optimize. More frequently, we do
not learn the hyperparameter because it is not appropriate to learn that hyper-
parameter on the training set. This applies to all hyperparameters that control
model capacity. If learned on the training set, such hyperparameters would always
choose the maximum possible model capacity, resulting in overfitting (refer to
Figure 5.3).
To solve this problem, we need a validation set of examples that the training
algorithm does not observe.
Earlier we discussed how a held-out test set, composed of examples coming
from the same distribution as the training set, can be used to estimate the general-
ization error of a learner, after the learning process has completed. It is important
that the test examples are not used in any way to make choices about the model,
including its hyperparameters. For this reason, no example from the test set can
be used in the validation set.
For this reason, we always construct the validation set from the training data.
Specifically, we split the training data into two disjoint subsets. One of these
subsets is used to learn the parameters. The other subset is our validation set,
used to estimate the generalization error during or after training, allowing for the
hyperparameters to be updated accordingly. The subset of data used to learn
the parameters is still typically called the training set, even though this may
be confused with the larger pool of data used for the entire training process.
The subset of data used to guide the selection of hyperparameters is called the
validation set. Since the validation set is used to “train” the hyperparameters,
the validation set error will underestimate the test set error, though typically by
a smaller amount than the training error. Typically, one uses about 80% of the
data for training and 20% for validation.
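A minimal sketch (not from the text) of carving a validation set out of the
training data along these 80/20 lines; the array shapes are invented.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)  # hypothetical training data

    perm = rng.permutation(len(X))  # shuffle before splitting
    n_train = int(0.8 * len(X))     # 80% used to learn the parameters
    X_train, y_train = X[perm[:n_train]], y[perm[:n_train]]
    X_val, y_val = X[perm[n_train:]], y[perm[n_train:]]  # 20% held out for hyperparameters
    print(len(X_train), len(X_val))  # 800 200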
In practice, when the same test set has been used repeatedly to evaluate
performance of different algorithms over many years, and especially if we consider
all the attempts from the scientific community at beating the reported state-of-
the-art performance on that test set, we end up having optimistic evaluations with
the test set as well. Benchmarks can thus become stale and then do not reflect
the true field performance of a trained system. Thankfully, the community tends
to move on to new (and usually more ambitious and larger) benchmark datasets.
5.5.1 Cross-Validation
One issue with the idea of splitting the dataset into train/test or train/validation/test
subsets is that only a small fraction of examples are used to evaluate generaliza-
tion. The consequence is that there is a lot of statistical uncertainty around the
estimated average test error, making it difficult to claim that algorithm A works
better than algorithm B on the given task.
With large datasets with hundreds of thousands of examples or more, this
is not a serious issue, but when the dataset is too small, there are alternative
procedures, which allow one to use all of the examples in the estimation of the
mean test error, at the price of increased computational cost. These procedures
are based on the idea of repeating the training / testing computation on different
randomly chosen subsets or splits of the original dataset. The most common of
these is the k-fold cross-validation procedure, in which a partition of the dataset
is formed by splitting it in k non-overlapping subsets. Then k train/test splits
can be obtained by keeping each time the i-th subset as a test set and the rest as a
training set. The average test error across all these k training/testing experiments
can then be reported. One problem is that there exists no unbiased estimators of
the variance of such average error estimators (Bengio and Grandvalet, 2004), but
approximations are typically used.
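A hedged sketch of k-fold cross-validation, using the least-squares learner from
Section 5.2 as the algorithm being evaluated; the dataset and the choice k = 5
are invented for the example.

    import numpy as np

    def k_fold_mse(X, y, k=5, seed=0):
        # Estimate the mean test MSE by k-fold cross-validation.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), k)  # k non-overlapping subsets
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            w = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
            errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
        return np.mean(errors)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 3))
    y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)
    print(k_fold_mse(X, y))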
If model selection or hyperparameter optimization is required, things get more
computationally expensive: one can recurse the k-fold cross-validation idea inside
the training set. So we can have an outer loop that estimates test error and
provides a “training set” for a hyperparameter-free learner, calling it k times to
“train.” That hyperparameter-free learner can then split its received training set
by k-fold cross-validation into internal training/validation subsets (for example,
splitting into k − 1 subsets is convenient, to reuse the same test blocks as the
outer loop), call a hyperparameter-specific learner for each choice of hyperparameter
value on each of the training partitions of this inner loop, and compute the
validation error by averaging, across the k − 1 validation sets, the errors made by
the k − 1 hyperparameter-specific learners trained on each of the internal training
subsets.
5.6 Estimators, Bias, and Variance
The field of statistics gives us many tools that can be used to achieve the machine
learning goal of solving a task not only on the training set but also generalizing
beyond it. Foundational concepts such as parameter estimation, bias and variance
are useful to formally characterize notions of generalization, underfitting and
overfitting.
5.6.1 Point Estimation
Point estimation is the attempt to provide the single “best” prediction of some
quantity of interest. In general the quantity of interest can be a single parameter
or a vector of parameters in some parametric model, such as the weights in our
linear regression example in Section 5.2, but it can also be a whole function.
In order to distinguish estimates of parameters from their true value, our
convention will be to denote a point estimate of a parameter $\theta$ by $\hat{\theta}$.
Let $\{x^{(1)}, \dots, x^{(m)}\}$ be a set of $m$ independent and identically distributed
(i.i.d.) data points. A point estimator is any function of the data:
\[ \hat{\theta}_m = g(x^{(1)}, \dots, x^{(m)}). \qquad (5.2) \]
In other words, any statistic¹ is a point estimate. Notice that no mention is made
of any correspondence between the estimator and the parameter being estimated.
There is also no constraint that the range of $g(x^{(1)}, \dots, x^{(m)})$ should correspond
to that of the true parameter.
This definition of a point estimator is very general and allows the designer
of an estimator great flexibility. What distinguishes “just any” function of the
data from most of the estimators that are in common usage is their properties.
For now, we take the frequentist perspective on statistics. That is, we assume
that the true parameter value $\theta$ is fixed but unknown, while the point estimate
$\hat{\theta}$ is a function of the data. Since the data is drawn from a random process, any
function of the data is random. Therefore $\hat{\theta}$ is a random variable.

Point estimation can also refer to the estimation of the relationship between
input and target variables. We refer to these types of point estimates as function
estimators.
¹ A statistic is a function of the data, typically of the whole training set, such as the mean.
Function Estimation As we mentioned above, sometimes we are interested in
performing function estimation (or function approximation). Here we are trying
to predict a variable (or vector) $y$ given an input vector $x$ (also called the
covariates). We consider that there is a function $f(x)$ that describes the relationship
between $y$ and $x$. For example, we may assume that $y = f(x) + \epsilon$, where $\epsilon$ stands
for the part of $y$ that is not predictable from $x$.

In function estimation, we are interested in approximating $f$ with a model or
estimate $\hat{f}$. Note that we are really not adding anything new here to our notion of
a point estimator; the function estimator $\hat{f}$ is simply a point estimator in function
space.

The linear regression example we discussed above in Section 5.2 and the
polynomial regression example discussed in Section 5.3 are both examples of
function estimation where we estimate a model $\hat{f}$ of the relationship between an
input $x$ and target $y$.
In the following we will review the most commonly studied properties of point
estimators and discuss what they tell us about these estimators.

As $\hat{\theta}$ and $\hat{f}$ are random variables (or vectors, or functions), they are distributed
according to some probability distribution. We refer to this distribution as the
sampling distribution. When we discuss properties of the estimator, we are really
describing properties of the sampling distribution.
5.6.2 Bias
The bias of an estimator is defined as
\[ \text{bias}(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta, \qquad (5.3) \]
where the expectation is over the data (seen as samples from a random variable)
and $\theta$ is the true underlying value of $\theta$ according to the data generating
distribution. An estimator $\hat{\theta}_m$ is said to be unbiased if $\text{bias}(\hat{\theta}_m) = 0$, i.e., if
$\mathbb{E}(\hat{\theta}_m) = \theta$. An estimator $\hat{\theta}_m$ is said to be asymptotically unbiased if
$\lim_{m \to \infty} \text{bias}(\hat{\theta}_m) = 0$, i.e., if $\lim_{m \to \infty} \mathbb{E}(\hat{\theta}_m) = \theta$.
Example: Bernoulli Distribution Consider a set of samples $\{x^{(1)}, \dots, x^{(m)}\}$
that are independently and identically distributed according to a Bernoulli
distribution, $x^{(i)} \in \{0, 1\}$, where $i \in [1, m]$. The Bernoulli p.m.f. (probability mass
function, or probability function) is given by $P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}$.

We are interested in knowing if the estimator $\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
is biased.
\[
\begin{aligned}
\text{bias}(\hat{\theta}_m) &= \mathbb{E}[\hat{\theta}_m] - \theta \\
&= \mathbb{E}\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[ x^{(i)} \right] - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \sum_{x^{(i)} = 0}^{1} x^{(i)} \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})} - \theta \\
&= \frac{1}{m} \sum_{i=1}^{m} \theta - \theta \\
&= \theta - \theta = 0
\end{aligned}
\]
Since $\text{bias}(\hat{\theta}_m) = 0$, we say that our estimator $\hat{\theta}_m$ is unbiased.
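A quick numerical check of this result (a sketch, not from the text): simulate many
datasets from a Bernoulli distribution with an assumed $\theta$ and confirm that the
average of the estimator across datasets is close to $\theta$.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, m, n_datasets = 0.3, 20, 100000  # assumed true parameter and sizes

    # Draw n_datasets independent datasets of m Bernoulli samples each.
    data = rng.binomial(1, theta, size=(n_datasets, m))
    theta_hat = data.mean(axis=1)    # one estimate per dataset

    print(theta_hat.mean() - theta)  # estimated bias: close to 0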
Example: Gaussian Distribution Estimator of the Mean Now, consider a
set of samples $\{x^{(1)}, \dots, x^{(m)}\}$ that are independently and identically distributed
according to a Gaussian (Normal) distribution, $x^{(i)} \sim \mathcal{N}(\mu, \sigma^2)$, where
$i \in [1, m]$. The Gaussian p.d.f. (probability density function) is given by
\[ p(x^{(i)}; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \frac{(x^{(i)} - \mu)^2}{\sigma^2} \right). \]
A common estimator of the Gaussian mean parameter is known as the sample
mean:
\[ \hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \qquad (5.4) \]
To determine the bias of the sample mean, we are again interested in calculating
its expectation:
\[
\begin{aligned}
\text{bias}(\hat{\mu}_m) &= \mathbb{E}[\hat{\mu}_m] - \mu \\
&= \mathbb{E}\left[ \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right] - \mu \\
&= \left( \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[ x^{(i)} \right] \right) - \mu \\
&= \left( \frac{1}{m} \sum_{i=1}^{m} \mu \right) - \mu \\
&= \mu - \mu = 0
\end{aligned}
\]
Thus we find that the sample mean is an unbiased estimator of the Gaussian mean
parameter.
Example: Gaussian Distribution Estimators of the Variance Sticking
with the Gaussian family of distributions, we consider two different estimators
of the variance parameter $\sigma^2$. We are interested in knowing if either estimator is
biased.
The first estimator of $\sigma^2$ we consider is known as the sample variance:
\[ \hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2, \qquad (5.5) \]
where $\hat{\mu}_m$ is the sample mean, defined above. More formally, we are interested in
computing
\[ \text{bias}(\hat{\sigma}_m^2) = \mathbb{E}[\hat{\sigma}_m^2] - \sigma^2. \]
We begin by simplifying the term $\mathbb{E}[\hat{\sigma}_m^2]$:
\[
\begin{aligned}
\mathbb{E}[\hat{\sigma}_m^2] &= \mathbb{E}\left[ \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2 \right] \\
&= \mathbb{E}\left[ \frac{1}{m} \sum_{i=1}^{m} \left( (x^{(i)})^2 - 2 x^{(i)} \hat{\mu}_m + \hat{\mu}_m^2 \right) \right] \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( \mathbb{E}\left[ (x^{(i)})^2 \right] - 2 \, \mathbb{E}\left[ x^{(i)} \frac{1}{m} \sum_{j=1}^{m} x^{(j)} \right] + \mathbb{E}\left[ \frac{1}{m} \sum_{j=1}^{m} x^{(j)} \, \frac{1}{m} \sum_{k=1}^{m} x^{(k)} \right] \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( \left( 1 - \frac{2}{m} \right) \mathbb{E}\left[ (x^{(i)})^2 \right] - \frac{2}{m} \sum_{j \neq i} \mathbb{E}\left[ x^{(i)} x^{(j)} \right] + \frac{1}{m^2} \sum_{j=1}^{m} \mathbb{E}\left[ (x^{(j)})^2 \right] + \frac{1}{m^2} \sum_{j=1}^{m} \sum_{k \neq j} \mathbb{E}\left[ x^{(j)} x^{(k)} \right] \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( \frac{m-2}{m} (\mu^2 + \sigma^2) - \frac{2(m-1)}{m} \mu^2 + \frac{1}{m} (\mu^2 + \sigma^2) + \frac{m-1}{m} \mu^2 \right) \\
&= \frac{m-1}{m} \sigma^2
\end{aligned}
\]
So $\text{bias}(\hat{\sigma}_m^2) = \mathbb{E}[\hat{\sigma}_m^2] - \sigma^2 = -\sigma^2/m$, and therefore the sample variance is a biased
estimator.
We now consider a modified estimator of the variance, sometimes called the
unbiased sample variance:
\[ \tilde{\sigma}_m^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2. \qquad (5.6) \]
As the name suggests, this estimator is unbiased; that is, we find that $\mathbb{E}[\tilde{\sigma}_m^2] = \sigma^2$:
\[
\begin{aligned}
\mathbb{E}[\tilde{\sigma}_m^2] &= \mathbb{E}\left[ \frac{1}{m-1} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2 \right] \\
&= \frac{m}{m-1} \mathbb{E}[\hat{\sigma}_m^2] \\
&= \frac{m}{m-1} \left( \frac{m-1}{m} \sigma^2 \right) \\
&= \sigma^2.
\end{aligned}
\]
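In numpy these two estimators correspond to the ddof argument of np.var, as the
brief sketch below shows on simulated data.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=2.0, size=10)  # true variance is 4

    var_biased = np.var(x)            # divides by m: the sample variance (Eq. 5.5)
    var_unbiased = np.var(x, ddof=1)  # divides by m - 1: the unbiased sample variance (Eq. 5.6)
    print(var_biased, var_unbiased)   # the unbiased estimate is larger by the factor m/(m-1)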
We have two estimators: one is biased and the other is not. While unbiased
estimators are clearly desirable, they are not always the “best” estimators. As we
will see, we often use biased estimators that possess other important properties.
5.6.3 Variance
Another property of the estimator that we might want to consider is how much
we expect it to vary as a function of the data sample. Just as we computed the
expectation of the estimator to determine its bias, we can compute its variance.
\[ \text{Var}(\hat{\theta}) = \mathbb{E}[\hat{\theta}^2] - \mathbb{E}[\hat{\theta}]^2 \qquad (5.7) \]
The variance of an estimator provides a measure of how we would expect the
estimate we compute from data to vary as we independently resample the dataset
from the underlying data generating process. Just as we might like an estimator
to exhibit low bias, we would also like it to have relatively low variance.
We can also define the standard error (se) of the estimator as
\[ \text{se}(\hat{\theta}) = \sqrt{\text{Var}[\hat{\theta}]}. \qquad (5.8) \]
Example: Bernoulli Distribution Let's once again consider a set of samples
$\{x^{(1)}, \dots, x^{(m)}\}$ drawn independently and identically from a Bernoulli
distribution (recall $P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}$). This time we are interested in
computing the variance of the estimator $\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$.
\[
\begin{aligned}
\text{Var}\left( \hat{\theta}_m \right) &= \text{Var}\left( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \text{Var}\left( x^{(i)} \right) \\
&= \frac{1}{m^2} \sum_{i=1}^{m} \theta(1 - \theta) \\
&= \frac{1}{m^2} m \, \theta(1 - \theta) \\
&= \frac{1}{m} \theta(1 - \theta)
\end{aligned}
\]
Note that the variance of the estimator decreases as a function of m, the num-
ber of examples in the dataset. This is a common property of popular estimators
that we will return to when we discuss consistency (see Sec. 5.6.5).
Example: Gaussian Distribution Estimators of the Variance We again
consider a set of samples $\{x^{(1)}, \dots, x^{(m)}\}$ independently and identically distributed
according to a Gaussian distribution, $x^{(i)} \sim \mathcal{N}(\mu, \sigma^2)$, where $i \in [1, m]$.
We now consider the variance of the two estimators of the variance: the sample
variance,
\[ \hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2, \qquad (5.9) \]
and the unbiased sample variance,
\[ \tilde{\sigma}_m^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left( x^{(i)} - \hat{\mu}_m \right)^2. \qquad (5.10) \]
In order to determine the variance of these estimators, we will take advantage
of a known relationship between the sample variance and the chi-squared
distribution: specifically, that $\frac{m-1}{\sigma^2} \tilde{\sigma}^2$ happens to be $\chi^2$ distributed. We can then use
this together with the fact that the variance of a $\chi^2$ random variable with $m - 1$
degrees of freedom is $2(m-1)$:
\[
\begin{aligned}
\text{Var}\left( \frac{m-1}{\sigma^2} \tilde{\sigma}^2 \right) &= 2(m-1) \\
\frac{(m-1)^2}{\sigma^4} \text{Var}\left( \tilde{\sigma}^2 \right) &= 2(m-1) \\
\text{Var}\left( \tilde{\sigma}^2 \right) &= \frac{2\sigma^4}{m-1}
\end{aligned}
\]
By noticing that $\hat{\sigma}^2 = \frac{m-1}{m} \tilde{\sigma}^2$, and using $\tilde{\sigma}^2$'s relationship to the $\chi^2$ distribution,
it is straightforward to show that $\text{Var}\left( \hat{\sigma}^2 \right) = \frac{2(m-1)\sigma^4}{m^2}$.
To derive this last relation, we used the fact that $\text{Var}\left( \tilde{\sigma}^2 \right) = \left( \frac{m}{m-1} \right)^2 \text{Var}\left( \hat{\sigma}^2 \right)$,
which implies $\text{Var}\left( \tilde{\sigma}^2 \right) > \text{Var}\left( \hat{\sigma}^2 \right)$. So while the bias of $\tilde{\sigma}^2$ is smaller than the bias of
$\hat{\sigma}^2$, the variance of $\tilde{\sigma}^2$ is greater.
5.6.4 Trading off Bias and Variance and the Mean Squared Error
Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of how far any particular sampling of the data is likely to deviate from the expected estimate.
What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For
example, let’s imagine that we are interested in approximating the function shown
in Fig. 5.2 and we are only offered the choice between a model with large bias
and one that suffers from large variance. How do we choose between them?
In machine learning, perhaps the most common and empirically successful way to negotiate this kind of trade-off is cross-validation, discussed in Section 5.5.1. Alternatively, we can also compare the mean squared error (MSE) of the estimates:
\[
\mathrm{MSE} = \mathbb{E}\left[\left(\hat\theta_{n} - \theta\right)^{2}\right] = \mathrm{Bias}(\hat\theta_{n})^{2} + \mathrm{Var}(\hat\theta_{n}) \tag{5.11}
\]
The MSE measures the overall expected deviation—in a squared error sense—
between the estimator and the true value of the parameter θ. As is clear from Eq.
5.11, evaluating the MSE incorporates both the bias and the variance. Desirable
estimators are those with small MSE and these are estimators that manage to
keep both their bias and variance somewhat in check.
The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting discussed in Section 5.3. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in Figure 5.5, where we see again the U-shaped curve of generalization error as a function of capacity, as in Section 5.3 and Figure 5.3.
Example: Gaussian Distribution Estimators of the Variance In the last section, comparing the sample variance, σ̂², and the unbiased sample variance, σ̃², we saw that while σ̂² has higher bias, σ̃² has higher variance.
Figure 5.5: As capacity increases (x-axis), bias (dotted) decreases and variance (dashed)
increases, yielding another U-shaped curve for generalization error (bold curve). If we
vary capacity along one axis, there is an optimal capacity, with underfitting when the
capacity is below this optimum and overfitting when it is above.
The mean squared error offers a way of balancing the trade-off between bias and variance and suggests which estimator we might prefer. For σ̂², the mean squared error is given by:
\begin{aligned}
\mathrm{MSE}(\hat\sigma^{2}_{m}) &= \mathrm{Bias}(\hat\sigma^{2}_{m})^{2} + \mathrm{Var}(\hat\sigma^{2}_{m}) \qquad (5.12)\\
&= \left(\frac{\sigma^{2}}{m}\right)^{2} + \frac{2(m-1)\sigma^{4}}{m^{2}} \qquad (5.13)\\
&= \frac{1 + 2(m-1)}{m^{2}}\,\sigma^{4} \qquad (5.14)\\
&= \frac{2m-1}{m^{2}}\,\sigma^{4} \qquad (5.15)
\end{aligned}
The mean squared error of the unbiased alternative is given by:
\begin{aligned}
\mathrm{MSE}(\tilde\sigma^{2}_{m}) &= \mathrm{Bias}(\tilde\sigma^{2}_{m})^{2} + \mathrm{Var}(\tilde\sigma^{2}_{m}) \qquad (5.16)\\
&= 0 + \frac{2\sigma^{4}}{m-1} \qquad (5.17)\\
&= \frac{2}{m-1}\,\sigma^{4}. \qquad (5.18)
\end{aligned}
Comparing the two, we see that the MSE of the unbiased sample variance, σ̃²_m, is actually higher than the MSE of the (biased) sample variance, σ̂²_m. This implies that despite incurring bias in the estimator σ̂²_m, the resulting reduction in variance more than makes up for the difference, at least in a mean squared sense.
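A quick simulation along the same lines (illustrative only; σ² = 1 and m = 10 are arbitrary assumptions) confirms the two closed-form MSEs:

```python
# Illustrative check of the two closed-form MSEs for sigma2 = 1, m = 10.
import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 1.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print(((ss / m - sigma2) ** 2).mean(), (2 * m - 1) / m ** 2)   # ~ 0.19
print(((ss / (m - 1) - sigma2) ** 2).mean(), 2 / (m - 1))      # ~ 0.22
```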
5.6.5 Consistency
As we have already discussed, sometimes we may wish to choose an estimator that is biased, for example in order to minimize its variance. However, we might still wish that, as the number of data points in our dataset increases, our point estimates converge to the true value of the parameter. More formally, we would like that θ̂_n →p θ as n → ∞.² This condition is known as consistency³ and ensures that the bias induced by the estimator diminishes as the number of data examples grows.
Asymptotic unbiasedness is not equivalent to consistency. For example, consider estimating the mean parameter µ of a normal distribution N(µ, σ²), with a dataset consisting of n samples {x_1, . . . , x_n}. We could use the first sample x_1 of the dataset as an unbiased estimator: θ̂ = x_1. In that case, E(θ̂_n) = θ, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that θ̂_n → θ as n → ∞.
5.7 Maximum Likelihood Estimation
In the previous section we discussed a number of common properties of estima-
tors but we never mentioned where these estimators come from. In this section,
we discuss one of the most common approaches to deriving estimators: via the
maximum likelihood principle.
Consider a set of m independent examples X = (x^(1), . . . , x^(m)) with x^(i) ∼ P(x) independently, where P(x) is the true but unknown data generating distribution. More generally, the data may not need to be sampled independently, so we have a data generating process which produces the sequence X, i.e., X ∼ P(X).
Consider a family of probability functions P, parameterized by θ, over the same space, i.e., P(x; θ) maps any configuration x to a real number estimating the true probability P(x), or more generally (in the non-independent case), we have a P(X; θ) that returns the probability of any whole sequence X.
The maximum likelihood estimator for θ is then defined as
θ
ML
= arg max
θ
P (X; θ). (5.19)
² The symbol →p means that the convergence is in probability, i.e., for any ε > 0, P(|θ̂_n − θ| > ε) → 0 as n → ∞.
³ This is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of θ̂ to θ.
In the i.i.d. scenario, because of the i.i.d. assumption, we can rewrite
\[
P_{\theta}(X) = \prod_{i=1}^{m} P(x^{(i)};\theta). \tag{5.20}
\]
Note that sometimes we make our model assume that the examples are i.i.d. even though we know they are not, because it simplifies the model (which sometimes means that better generalization can be achieved), so this is a modeling choice and not necessarily an assumption about the true data generating process. Combining the above two equations and noting that the logarithm of the arg max is the arg max of the logarithm, we obtain the ordinary maximum likelihood estimator under a model that assumes i.i.d. examples:
\[
\theta_{\mathrm{ML}} = \arg\max_{\theta}\,\sum_{i=1}^{m}\log P(x^{(i)};\theta). \tag{5.21}
\]
This formulation is convenient because it corresponds to an objective func-
tion that is additive in the examples, something that is exploited in numerical
optimization methods such as stochastic gradient descent (Section 8.3.2), which
is heavily used for deep learning. In practice, we will often use numerical opti-
mization to approximately maximize the likelihood, so we will not have the true
maximum likelihood estimator, but something that approximates it.
There is an interesting connection between the objective function for maximum likelihood, on the right hand side of Eq. 5.21, and the notion of KL divergence introduced in Section 3.9 with Eq. 3.3. The KL divergence compares a candidate distribution Q with a target distribution P. If we replace Q by the empirical distribution, we obtain the average negative log-likelihood plus the entropy of Q, which is a constant as far as P is concerned:
\begin{aligned}
D_{\mathrm{KL}}(Q\|P) &= \mathbb{E}_{x\sim Q}\left[\log\frac{Q(x)}{P(x)}\right] \qquad (5.22)\\
&= -\mathbb{E}_{x\sim Q}\left[\log P(x)\right] + \mathbb{E}_{x\sim Q}\left[\log Q(x)\right] \qquad (5.23)\\
&= -\mathbb{E}_{x\sim Q}\left[\log P(x)\right] - H[Q] \qquad (5.25)\\
&= -\frac{1}{m}\sum_{i=1}^{m}\log P(x^{(i)}) - \log m \qquad (5.26)
\end{aligned}
where Eq. 5.22 is the definition of KL divergence, Eq. 5.23 splits it into two terms, the first one being the cross-entropy between Q and P (with Q as the reference) and the second one being the negative entropy of Q (see Eq. 3.2). Eq. 5.25 is then obtained by taking Q as the empirical distribution (Eq. 3.4), and we obtain Eq. 5.26, which is
the average negative log-likelihood plus a constant. Hence maximizing likelihood
is minimizing the cross-entropy between the empirical distribution and the model
as well as minimizing the KL divergence between these two distributions. Note
that when we make Q the data generating distribution, we obtain the generaliza-
tion or expected negative log-likelihood.
5.7.1 Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where
our goal is not to estimate a probability function but rather a conditional prob-
ability, e.g., P(y|x; θ), to predict y given x. This is actually the most common
situation where we do supervised learning (Section 5.9), i.e., the examples are
pairs (x, y). If X represents all our inputs and Y all our observed targets, then
the conditional maximum likelihood estimator is
\[
\theta_{\mathrm{ML}} = \arg\max_{\theta}\, P(Y \mid X;\theta). \tag{5.27}
\]
If the examples are assumed to be i.i.d., then this can be decomposed into
\[
\theta_{\mathrm{ML}} = \arg\max_{\theta}\,\sum_{i=1}^{m}\log P(y^{(i)} \mid x^{(i)};\theta). \tag{5.28}
\]
Example: Linear Regression Let us consider as an example the special case of linear regression, introduced earlier in Section 5.2. In that case, the conditional density of y, given x = x, is a Gaussian with mean µ(x) that is a learned function of x, with unconditional variance σ². Since the examples are assumed to be i.i.d., the conditional log-likelihood (Eq. 5.27) becomes
\[
\log P(Y \mid X;\theta) = \sum_{i=1}^{m}\log P(y^{(i)} \mid x^{(i)};\theta) = -\sum_{i=1}^{m}\frac{1}{2\sigma^{2}}\left\|\hat{y}^{(i)} - y^{(i)}\right\|^{2} - m\log\sigma - \frac{m}{2}\log(2\pi)
\]
where ŷ^(i) = µ(x^(i)) is the output of the linear regression on the i-th input x^(i), and m is the number of training examples. Comparing the above with the mean squared error (Section 5.2), we immediately see that if σ is fixed, maximizing the above is equivalent (up to an additive and a multiplicative constant that do not change the value of the optimal parameter) to minimizing the training set mean squared error, i.e.,
\[
\mathrm{MSE}_{\mathrm{train}} = \frac{1}{m}\sum_{i=1}^{m}\left\|\hat{y}^{(i)} - y^{(i)}\right\|^{2}.
\]
Note that the MSE is an average rather than a sum, which is more practical from
a numerical point of view (so you can compare MSEs of sets of different sizes
more easily). In practice, researchers reporting log-likelihoods and conditional log-likelihoods also tend to report the per-example average log-likelihood, for the very same reason. The exponential of the negative average log-likelihood is also called the perplexity and is used in language modeling applications.
Whereas in the case of linear regression we have µ(x) = w·x, the above equally applies to other forms of regression, e.g., with a neural network whose output µ(x) predicts the expected value of y given x.
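As a sketch of this equivalence between Gaussian maximum likelihood and least squares (illustrative data and assumptions; the closed-form solution via the normal equations is standard linear algebra):

```python
# Illustrative: the ML solution under the Gaussian likelihood is exactly the
# least-squares fit, computed here via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(size=m)

w_ml = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
print(w_ml)                                # close to w_true
print(((X @ w_ml - y) ** 2).mean())        # MSE_train at the ML solution
```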
5.7.2 Properties of Maximum Likelihood
The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.
The maximum likelihood estimator has the property of consistency (see Sec. 5.6.5
above), i.e., as more training examples are considered, the estimator converges to the best
one in some sense. There are other inductive principles besides the maximum
likelihood estimator, many of which share the property of being consistent esti-
mators. However, there is the question of how many training examples one needs
to achieve a particular generalization error, or equivalently what estimation error
one gets for a given number of training examples, also called efficiency. This is
typically studied in the parametric case (like in linear regression) where our goal
is to estimate the value of a parameter (and assuming it is possible to identify
the true parameter), not the value of a function. A way to measure how close
we are to the true parameter is by the expected mean squared error, computing
the squared difference between the estimated and true parameter values, where
the expectation is over m training samples from the data generating distribution.
That parametric mean squared error decreases as m increases, and for m large, the Cramér–Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.
For these reasons (consistency and efficiency), the maximum likelihood induc-
tion principle is often considered the preferred one in machine learning, modulo
slight adjustments such as described in the next Section, to better deal with
the non-asymptotic case where the number of examples is small enough to yield
overfitting behavior.
5.8 Bayesian Statistics and Prior Probability Distributions
We know from the No Free Lunch theorem that machine learning algorithms
must in some sense be tailored to the world that they live in. So far, we have
only described one way to design a learning algorithm to fit a particular world:
the model family. This approach is somewhat inflexible because each possible
model is either included in the family or excluded from it entirely. There is no
way to use the model family to specify that some models are preferable to others.
In many cases, we would like to say that some models are in some way extreme
or unlikely, but that the learning algorithm can still choose to believe in those
models if a preponderance of evidence assembles in favor of them.
Bayesian statistics provides a natural and theoretically elegant way to achieve
this flexibility. In the Bayesian approach to statistical estimation, the statistician
selects a prior probability distribution over the model parameters, expressing what
we know or believe about how likely each member of the model family is to be
the correct choice, before any data is observed and the training algorithm begins.
Historically, statistics has become divided between two communities. One of
these communities is known as frequentist statistics or orthodox statistics. The
other is known as Bayesian statistics. The difference is mainly one of world view
but can have important practical implications.
As discussed in Sec. 5.6.1, the frequentist perspective is that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a random variable on account of it being a function of the data (which are seen as random).
The Bayesian perspective on statistics is quite different and, in some sense,
more intuitive. The Bayesian uses probability to reflect degrees of certainty of
states of knowledge. The data is directly observed and so is not random. On the
other hand, the true parameter θ is unknown or uncertain and thus is represented
as a random variable.
Before observing the data, we represent our knowledge of θ using the prior
probability distribution, p(θ) (sometimes referred to as simply ’the prior’). Gener-
ally, the prior distribution is quite broad (i.e. with high entropy) to reflect a high
degree of uncertainty in the value of θ before observing any data. For example, we
might assume a priori that θ lies in some finite range or volume, with a uniform
distribution. Many priors instead reflect a preference for “simpler” solutions (such
as smaller magnitude coefficients, or a function that is closer to being constant).
Now consider that we have a set of data samples {x^(1), . . . , x^(m)}. We can recover the effect of the data on our belief about θ by combining the data likelihood p(x^(1), . . . , x^(m) | θ) with the prior via Bayes' rule:
\[
p(\theta \mid x^{(1)},\ldots,x^{(m)}) = \frac{p(x^{(1)},\ldots,x^{(m)} \mid \theta)\,p(\theta)}{p(x^{(1)},\ldots,x^{(m)})} \tag{5.29}
\]
If the data is at all informative about the value of θ, the posterior distribution p(θ | x^(1), . . . , x^(m)) will have less entropy (will be more "peaky") than the prior p(θ).
Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood point estimate of θ, the Bayesian makes decisions with respect to a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^(m+1), is given by
\[
p(x^{(m+1)} \mid x^{(1)},\ldots,x^{(m)}) = \int p(x^{(m+1)} \mid \theta)\,p(\theta \mid x^{(1)},\ldots,x^{(m)})\,d\theta \tag{5.30}
\]
Here each value of θ with positive probability density contributes to the prediction
of the next example, with the contribution weighted by the posterior density itself.
After having observed {x^(1), . . . , x^(m)}, if we are still quite uncertain about the value of θ, then this uncertainty is incorporated directly into any predictions we might make.
In Sec. 5.6, we discussed how frequentist statistics addresses the uncertainty in a given point estimator of θ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed (or training) data. The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting.
The second important difference between the Bayesian approach to estimation
and the Maximum Likelihood approach is due to the contribution of the Bayesian
prior distribution. The prior has an influence by shifting probability density towards regions of the parameter space that are preferred a priori. In practice,
the prior often expresses a preference for models that are simpler or more smooth.
One important effect of the prior is to actually reduce the uncertainty (or entropy)
in the posterior density over θ.
We have already noted that combining the prior, p(θ), with the data likelihood p(x^(1), . . . , x^(m) | θ) results in a distribution that is less entropic (more peaky) than the prior. This is just the result of a basic property of probability distributions: Entropy(product of two densities) ≤ Entropy(either density). This implies that the posterior density on θ is also less entropic than the data likelihood alone (when viewed and normalized as a density over θ). The hypothesis space with the Bayesian approach is, to some extent, more constrained than that with an ML approach. Thus we expect a contribution of the prior to be a further reduction in overfitting as compared to ML estimation.
Example: Linear Regression Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector x ∈ R^n to predict the value of a scalar y ∈ R. The prediction is parametrized by the vector w ∈ R^n:
\[
\hat{y} = w^{\top}x.
\]
Given a set of m training samples (X^(train), y^(train)), we can express the prediction of y over the entire training set as:
\[
\hat{y}^{(\mathrm{train})} = X^{(\mathrm{train})}w.
\]
Expressed as a Gaussian conditional distribution on y^(train), we have
\[
p(y^{(\mathrm{train})} \mid X^{(\mathrm{train})}, w) = \mathcal{N}\!\left(y^{(\mathrm{train})};\, X^{(\mathrm{train})}w,\, I\right) \propto \exp\!\left(-\frac{1}{2}\left(y^{(\mathrm{train})} - X^{(\mathrm{train})}w\right)^{\top}\left(y^{(\mathrm{train})} - X^{(\mathrm{train})}w\right)\right),
\]
where we will follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^(train), y^(train)) as simply (X, y).
To determine the posterior distribution over the model parameter vector w,
we first need to specify a prior distribution. The prior should reflect our naive
belief about the value of these parameters. While it is sometimes difficult or
unnatural to express our prior beliefs in terms of the parameters of the model, in
practice we typically assume a fairly broad distribution expressing a high degree
of uncertainty about θ in our prior belief.
For real-valued parameters it is common to use a Gaussian as a prior distribution:
\[
p(w) = \mathcal{N}(w;\, \mu_{0},\, \Lambda_{0}) \propto \exp\!\left(-\frac{1}{2}(w - \mu_{0})^{\top}\Lambda_{0}^{-1}(w - \mu_{0})\right)
\]
where µ₀ and Λ₀ are the prior distribution mean vector and covariance matrix respectively.⁴

⁴ Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix Λ₀ = diag(λ₀).
With the prior thus specified, we can now proceed in determining the posterior
distribution over the model parameters.
\begin{aligned}
p(w \mid X, y) &\propto p(y \mid X, w)\,p(w)\\
&\propto \exp\!\left(-\frac{1}{2}(y - Xw)^{\top}(y - Xw)\right)\exp\!\left(-\frac{1}{2}(w - \mu_{0})^{\top}\Lambda_{0}^{-1}(w - \mu_{0})\right)\\
&\propto \exp\!\left(-\frac{1}{2}\left(-2y^{\top}Xw + w^{\top}X^{\top}Xw + w^{\top}\Lambda_{0}^{-1}w - 2\mu_{0}^{\top}\Lambda_{0}^{-1}w\right)\right)
\end{aligned}
We now make the substitutions Λ_m = (X^⊤X + Λ₀⁻¹)⁻¹ and µ_m = Λ_m(X^⊤y + Λ₀⁻¹µ₀) into the derivation of the posterior (and complete the square) to get:
\begin{aligned}
p(w \mid X, y) &\propto \exp\!\left(-\frac{1}{2}(w - \mu_{m})^{\top}\Lambda_{m}^{-1}(w - \mu_{m}) + \frac{1}{2}\mu_{m}^{\top}\Lambda_{m}^{-1}\mu_{m}\right) \qquad (5.31)\\
&\propto \exp\!\left(-\frac{1}{2}(w - \mu_{m})^{\top}\Lambda_{m}^{-1}(w - \mu_{m})\right). \qquad (5.32)
\end{aligned}
In the above, we have dropped all terms that do not include the parameter vector w. In Eq. 5.32, we recognize that the posterior distribution has the form of a Gaussian distribution with mean vector µ_m and covariance matrix Λ_m. It is interesting to note that this justifies our dropping all terms unrelated to w, since we know that the posterior distribution must be normalized and, as a Gaussian, we know what that normalization constant must be (where n is the dimension of the input):
\[
p(w \mid X^{(\mathrm{train})}, y^{(\mathrm{train})}) = \frac{1}{\sqrt{(2\pi)^{n}\,|\Lambda_{m}|}}\exp\!\left(-\frac{1}{2}(w - \mu_{m})^{\top}\Lambda_{m}^{-1}(w - \mu_{m})\right). \tag{5.33}
\]
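A minimal NumPy sketch of these update equations (not from the text; the data, the prior mean µ₀ = 0, and the prior covariance Λ₀ = I are illustrative assumptions):

```python
# Illustrative posterior update for Bayesian linear regression, assuming the
# prior mean mu0 = 0 and prior covariance Lambda0 = I (arbitrary choices).
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 2
X = rng.normal(size=(m, n))
y = X @ np.array([0.5, -1.0]) + rng.normal(size=m)

mu0 = np.zeros(n)
Lambda0_inv = np.linalg.inv(np.eye(n))

Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)   # posterior covariance
mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)   # posterior mean
print(mu_m)
print(Lambda_m)
```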
5.8.1 Maximum A Posteriori (MAP) Estimation
While, in principle, we can use the full Bayesian posterior distribution over the parameter θ as our estimate of this parameter, it is still often desirable to have a single point estimate (for example, most operations involving the Bayesian posterior for most interesting models are intractable and must be heavily approximated). Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal posterior probability (or maximal probability density in the more common case of continuous θ):
\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid x) = \arg\max_{\theta}\,\left[\log p(x \mid \theta) + \log p(\theta)\right] \tag{5.34}
\]
We recognize, above on the right hand side, log p(x | θ), i.e., the standard log-likelihood term, and log p(θ), corresponding to the prior distribution.
As discussed above, the advantage brought by introducing the influence of the
prior on the MAP estimate is to leverage information other than that contained
in the training data. This additional information helps to reduce the variance in
the MAP point estimate (in comparison to the ML estimate). However, it does
so at the price of increased bias.
Example: Regularized Linear Regression We discussed above the Bayesian approach to linear regression. Given a set of m training samples of input-output pairs, (X^(train), y^(train)), we can express the prediction of y over the entire training set as:
\[
\hat{y}^{(\mathrm{train})} = X^{(\mathrm{train})}w,
\]
where the prediction is parametrized by the vector w ∈ R^n.
Recall from Sec. 5.7.1 that the maximum likelihood estimate for the model parameters is given by:
\[
\hat{w}_{\mathrm{ML}} = \left(X^{(\mathrm{train})\top}X^{(\mathrm{train})}\right)^{-1}X^{(\mathrm{train})\top}y^{(\mathrm{train})} \tag{5.35}
\]
For the sake of comparison to the maximum likelihood solution, we will make the simplifying assumption that the prior covariance matrix is scalar: Λ₀ = λ₀I. As mentioned previously, in practice this is a very common form of prior distribution. We will also assume that µ₀ = 0. This is also a very common assumption in practice and corresponds to acknowledging that, a priori, we do not know if the features of x have a positive or negative correlation with y. Adding these assumptions, the MAP estimate of the model parameters (corresponding to the mean of the Gaussian posterior density, in Eq. 5.32) becomes:
\[
\hat{w}_{\mathrm{MAP}} = \Lambda_{m}X^{(\mathrm{train})\top}y^{(\mathrm{train})} \tag{5.36}
\]
where, as before, µ₀ and Λ₀ are the prior mean and covariance respectively, and Λ_m is the posterior covariance, given by:
\[
\Lambda_{m} = \left(X^{(\mathrm{train})\top}X^{(\mathrm{train})} + \lambda_{0}^{-1}I\right)^{-1} \tag{5.37}
\]
Comparing Eqs. 5.35 and 5.36, we see that the MAP estimate amounts to a
weighted combination of the prior maximum probability value, µ
0
, and the ML
estimate. As the variance of the prior distribution tends to infinity, the MAP
estimate reduces to the ML estimate. As the variance of the prior tends to zero,
the MAP estimate tends to zero (actually it tends to µ
0
which here is assumed
to be zero).
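The shrinkage behavior is easy to see numerically; here is an illustrative sketch (an arbitrary dataset and a sweep over illustrative λ₀ values, not part of the original text):

```python
# Illustrative shrinkage: the MAP estimate interpolates between the ML
# estimate (large prior variance) and zero (small prior variance).
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 2
X = rng.normal(size=(m, n))
y = X @ np.array([0.5, -1.0]) + rng.normal(size=m)

w_ml = np.linalg.solve(X.T @ X, X.T @ y)
for lam0 in (1e3, 1.0, 1e-3):                          # prior variance
    Lambda_m = np.linalg.inv(X.T @ X + np.eye(n) / lam0)
    print(lam0, Lambda_m @ (X.T @ y))                  # the MAP estimate
print(w_ml)                                            # the ML estimate
```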
We can make the model capacity tradeoff between the ML estimate and the
MAP estimate more explicit by analyzing the bias and variance of these estimates.
It is relatively easy to show that the ML estimate is unbiased, i.e., that E[ŵ_ML] = w, and that it has a variance given by:
\[
\mathrm{Var}(\hat{w}_{\mathrm{ML}}) = \left(X^{(\mathrm{train})\top}X^{(\mathrm{train})}\right)^{-1} \tag{5.38}
\]
In order to derive the bias of the MAP estimate, we need to calculate the expectation (writing y^(train) = X^(train)w + ε, where ε is the zero-mean Gaussian noise):
\begin{aligned}
\mathbb{E}[\hat{w}_{\mathrm{MAP}}] &= \mathbb{E}\left[\Lambda_{m}X^{(\mathrm{train})\top}y^{(\mathrm{train})}\right]\\
&= \mathbb{E}\left[\Lambda_{m}X^{(\mathrm{train})\top}\left(X^{(\mathrm{train})}w + \epsilon\right)\right]\\
&= \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}w + \Lambda_{m}X^{(\mathrm{train})\top}\mathbb{E}[\epsilon]\\
&= \left(X^{(\mathrm{train})\top}X^{(\mathrm{train})} + \lambda_{0}^{-1}I\right)^{-1}X^{(\mathrm{train})\top}X^{(\mathrm{train})}w. \qquad (5.39)
\end{aligned}
We see that while the expected value of the ML estimate is the true parameter value w (i.e., the parameters that we assume generated the data), the expected value of the MAP estimate is a weighted average of w and the prior mean µ₀. We compute the bias as:
\begin{aligned}
\mathrm{Bias}(\hat{w}_{\mathrm{MAP}}) &= \mathbb{E}[\hat{w}_{\mathrm{MAP}}] - w\\
&= -\left(\lambda_{0}X^{(\mathrm{train})\top}X^{(\mathrm{train})} + I\right)^{-1}w.
\end{aligned}
Since the bias is not zero, we can conclude that the MAP estimate is biased, and as expected we can see that as the variance of the prior λ₀ → ∞, the bias tends to zero. As the variance of the prior λ₀ → 0, the bias tends to −w.
In order to compute the variance, we use the identity Var(θ̂) = E[θ̂θ̂^⊤] − E[θ̂]E[θ̂]^⊤. So before computing the variance we need to compute E[ŵ_MAP ŵ_MAP^⊤]:
\begin{aligned}
\mathbb{E}\left[\hat{w}_{\mathrm{MAP}}\hat{w}_{\mathrm{MAP}}^{\top}\right] &= \mathbb{E}\left[\Lambda_{m}X^{(\mathrm{train})\top}y^{(\mathrm{train})}y^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m}\right]\\
&= \mathbb{E}\left[\Lambda_{m}X^{(\mathrm{train})\top}\left(X^{(\mathrm{train})}w + \epsilon\right)\left(X^{(\mathrm{train})}w + \epsilon\right)^{\top}X^{(\mathrm{train})}\Lambda_{m}\right]\\
&= \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}ww^{\top}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m} + \Lambda_{m}X^{(\mathrm{train})\top}\mathbb{E}\left[\epsilon\epsilon^{\top}\right]X^{(\mathrm{train})}\Lambda_{m}\\
&= \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}ww^{\top}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m} + \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m}\\
&= \mathbb{E}[\hat{w}_{\mathrm{MAP}}]\,\mathbb{E}[\hat{w}_{\mathrm{MAP}}]^{\top} + \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m}
\end{aligned}
(using E[εε^⊤] = I, the unit-variance Gaussian noise assumption made above).
With E[ŵ_MAP ŵ_MAP^⊤] thus computed, the variance of the MAP estimate of
our linear regression model is given by:
\begin{aligned}
\mathrm{Var}(\hat{w}_{\mathrm{MAP}}) &= \mathbb{E}\left[\hat{w}_{\mathrm{MAP}}\hat{w}_{\mathrm{MAP}}^{\top}\right] - \mathbb{E}\left[\hat{w}_{\mathrm{MAP}}\right]\mathbb{E}\left[\hat{w}_{\mathrm{MAP}}^{\top}\right]\\
&= \mathbb{E}[\hat{w}_{\mathrm{MAP}}]\,\mathbb{E}[\hat{w}_{\mathrm{MAP}}]^{\top} + \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m} - \mathbb{E}[\hat{w}_{\mathrm{MAP}}]\,\mathbb{E}[\hat{w}_{\mathrm{MAP}}]^{\top}\\
&= \Lambda_{m}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\Lambda_{m}\\
&= \left(X^{(\mathrm{train})\top}X^{(\mathrm{train})} + \lambda_{0}^{-1}I\right)^{-1}X^{(\mathrm{train})\top}X^{(\mathrm{train})}\left(X^{(\mathrm{train})\top}X^{(\mathrm{train})} + \lambda_{0}^{-1}I\right)^{-1} \qquad (5.40)
\end{aligned}
It is perhaps difficult to compare Eqs. 5.38 and 5.40. But if we assume that w is one-dimensional (along with x), it becomes a bit easier to see that, as long as λ₀ is bounded,
\[
\mathrm{Var}(\hat{w}_{\mathrm{ML}}) = \frac{1}{\sum_{i=1}^{m}x_{i}^{2}} \;>\; \mathrm{Var}(\hat{w}_{\mathrm{MAP}}) = \frac{\lambda_{0}^{2}\sum_{i=1}^{m}x_{i}^{2}}{\left(1 + \lambda_{0}\sum_{i=1}^{m}x_{i}^{2}\right)^{2}}.
\]
From the above analysis we can see that the role of the prior in the MAP
estimate is to trade increased bias for a reduction in variance. The goal, of
course, is to try to avoid overfitting. The incurred bias is a consequence of the
reduction in model capacity caused by limiting the space of hypotheses to those
with significant probability density under the prior.
5.8.2 Regularization
In the previous section we saw how incorporating a prior distribution reduces the variance of the parameter estimate and prevents overfitting. In practice, many machine learning algorithms do not fit into the probabilistic framework where principles such as maximum likelihood and maximum a posteriori estimation apply. Fortunately, this variance reduction strategy readily generalizes.
Regularization is the process of adding information not available in the training data in an effort to control model capacity and prevent overfitting. We introduce an additional term, sometimes referred to as the "regularizer," into the objective function. The regularizer is a function of the model parameters (or, more generally, a function of the model) and indicates a preference for some functions or parameter values (or models) over others: typically the regularizer prefers simpler solutions.
One popular regularizer applied to neural networks and other deep learning models is known as the weight decay regularizer,
\[
\eta\,\|\theta\|^{2},
\]
where θ is the parameter vector and η is a scalar that controls the strength of the regularizer against the other terms in the objective function to be minimized.
Weight decay actually corresponds to the MAP objective function with a Gaus-
sian prior on θ, where η is inversely proportional to the prior Gaussian variance.
Although regularization is more general, popular regularizers are frequently in-
terpretable as Bayesian priors. Regularization is treated in considerable depth in
Chapter 7.
5.9 Supervised Learning
Supervised learning algorithms are, roughly speaking, learning algorithms that
learn to associate some input with some output, given a training set of examples
of inputs x and outputs y.
5.9.1 Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x). We can do this simply by using maximum conditional likelihood estimation (Sec. 5.7.1), or just maximum likelihood for short, to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).
We have already seen that linear regression corresponds to the family p(y | x; θ) = N(y | θ^⊤x, I). We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two
scenario by defining a different family of probability distributions. If we have two
classes, class 0 and class 1, then we need only specify the probability of one of
these classes. The probability of class 1 determines the probability of class 0,
because these two values must add up to 1.
The normal distribution over real-valued numbers that we used for linear
regression is parameterized in terms of a mean. Any value we supply for this
mean is valid. A distribution over a binary variable is slightly more complicated,
because its mean must always be between 0 and 1. One way to solve this problem
is to use the logistic sigmoid function to squash the output of the linear function
into the interval (0, 1) and interpret that value as a probability:
\[
p(y = 1 \mid x;\theta) = \sigma(\theta^{\top}x).
\]
This approach is known as logistic regression (a somewhat strange name since we
use the model for classification rather than regression).
In the case of linear regression, we were able to find the optimal weights
by solving the normal equations. Logistic regression is somewhat more difficult.
There is no closed-form solution for its optimal weights. Instead, we must search
for them by maximizing the log-likelihood. We can do this by minimizing the
negative log-likelihood (NLL) using gradient descent.
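A minimal gradient-descent sketch for logistic regression (illustrative only; the learning rate, iteration count, and data are arbitrary assumptions, not prescriptions from the text):

```python
# Illustrative logistic regression trained by gradient descent on the NLL.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.normal(size=(m, n))
theta_true = np.array([2.0, -1.0])
y = (rng.random(m) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(n)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ theta))    # sigma(theta^T x) for each example
    theta -= 0.5 * (X.T @ (p - y) / m)  # gradient step on the mean NLL
print(theta)                            # roughly recovers theta_true
```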
This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
5.9.2 Support Vector Machines
One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function w^⊤x + b. Unlike logistic regression, the support vector machine does not provide probabilities, but only outputs a class identity.
One key innovation associated with support vector machines is the kernel trick.
The kernel trick consists of observing that many machine learning algorithms can
be written exclusively in terms of dot products between examples. For example,
it can be shown that the linear function used by the support vector machine can be re-written as
\[
w^{\top}x + b = b + \sum_{i=1}^{m}\alpha_{i}\,x^{\top}x^{(i)}
\]
where x^(i) is a training example and α is a vector of coefficients. Rewriting the
learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^(i)) = φ(x)^⊤φ(x^(i)) called a kernel.
We can then make predictions using the function
\[
f(x) = b + \sum_{i}\alpha_{i}\,k(x, x^{(i)}). \tag{5.41}
\]
This function is linear in the space that φ maps to, but non-linear as a function
of x.
The kernel trick is powerful for two reasons. First, it allows us to learn models
that are non-linear as a function of x using convex optimization techniques that
are guaranteed to converge efficiently. This is only possible because we consider φ
fixed and only optimize α, i.e., the optimization algorithm can view the decision
function as being linear in a different space. Second, the kernel function k need not
be implemented in terms of explicitly applying the φ mapping and then applying
the dot product. The dot product in φ space might be equivalent to a non-
linear but computationally less expensive operation in x space. For example, we
could design an infinite-dimensional feature mapping φ(x) over the non-negative
integers. Suppose that this mapping returns a vector containing x ones followed
by infinitely many zeros. Explicitly constructing this mapping, or taking the dot
product between two such vectors, costs infinite time and memory. But we can
write a kernel function k(x, x^(i)) = min(x, x^(i)) that is exactly equivalent to this infinite-dimensional dot product. The most commonly used kernel is the Gaussian kernel
\[
k(u, v) = \mathcal{N}(u - v;\, 0,\, \sigma^{2}I) \tag{5.42}
\]
where N(x; µ, Σ) is the standard normal density. This kernel corresponds to a dot product k(u, v) = φ(u)^⊤φ(v) in an infinite-dimensional feature space φ and also has an interpretation as a similarity function, acting like a kind of template matching.
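To make Eq. 5.41 concrete, here is a small sketch of a Gaussian-kernel predictor (not from the text). The coefficients α and bias b are assumed to be already given, since learning them is precisely the training problem; all names and constants are illustrative:

```python
# Illustrative kernel predictor in the form of Eq. 5.41, with a Gaussian
# kernel. The coefficients alpha and the bias b are placeholder values.
import numpy as np

def gaussian_kernel(U, V, sigma=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2)), up to normalization
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 1))
alpha = rng.normal(size=20)   # placeholder coefficients
b = 0.1

def f(X_new):
    return b + gaussian_kernel(X_new, X_train) @ alpha

print(f(np.array([[0.0], [1.0]])))
```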
Support vector machines are not the only algorithm that can be enhanced
using the kernel trick. Many linear models can be enhanced in this way. This
category of algorithms is known as kernel machines or kernel methods.
A major drawback to kernel machines is that the cost of learning the α coefficients is quadratic in the number of training examples. A related problem is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term α_i k(x, x^(i)) to the decision function. Support vector machines are able to mitigate this by learning an α vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have non-zero α_i.
These training examples are known as support vectors. Another major drawback
of common kernel machines (such as those using the Gaussian kernel) is more
statistical and regards their difficulty in generalizing to complex variations far
from the training examples, as discussed in Section 5.12.
The analysis of the statistical limitations of support vector machines with
general purpose kernels like the Gaussian kernels actually motivated the rebirth
of neural networks through deep learning. Support vector machines and other
kernel machines have often been viewed as a competitor to deep learning (though
some deep networks can in fact be interpreted as support vector machines with
learned kernels). The current deep learning renaissance began when deep net-
works were shown to outperform support vector machines on the MNIST bench-
mark dataset (Hinton et al., 2006). One of the main reasons for the current
popularity of deep learning relative to support vector machines is the fact that
the cost of training kernel machines usually scales quadratically with the number
of examples in the training set. For a deep network of fixed size, the memory cost
of training is constant with respect to training set size (except for the memory
needed to store the examples themselves) and the runtime of a single pass through
the training set is linear in training set size. These asymptotic results meant that
kernelized SVMs dominated while datasets were small, but deep models currently
dominate now that datasets are large.
5.10 Unsupervised Learning
Whereas supervised learning is geared at a very specific task such as predicting
a variable y given a variable x from the observed data, unsupervised learning
tries to extract more general-purpose statistical structure from the data. In fact, several more advanced forms of unsupervised learning can be thought of as the rather general task of extracting all the possible information from the observed data. Let us call the observed data z (which could correspond to a pair (x, y) or maybe to observing just x alone), corresponding to the joint observation of many individual variables z_1, z_2, z_3, . . .; then some unsupervised learning procedures basically amount to learning what it takes to be able to predict any subset of the variables given any other subset.
Unsupervised learning can also be seen as the types of learning algorithms
that extract potentially useful information from inputs x alone, without any pre-
specified label y. The objective is then to use the extracted information later, for
some supervised task involving the prediction of y given x. This would be a form
of semi-supervised learning, in which we combine unlabeled examples (with only
examples of x) with labeled examples (with (x, y) pairs).
Learning a representation of data A classic unsupervised learning task is to
find the ‘best’ representation of the data. By ‘best’ we can mean different things,
but generally speaking we are looking for a representation that preserves as much
information about x as possible while obeying some penalty or constraint aimed
at keeping the representation simpler or more accessible than x itself.
There are multiple ways of defining a simpler representation, some of the
most common include lower dimensional representations, sparse representations
and independent representations. Low-dimensional representations attempt to
compress as much information about x as possible in a smaller representation.
Sparse representations generally embed the dataset into a high-dimensional representation⁵ where the number of non-zero entries is small. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.

⁵ Sparse representations often use over-complete representations: the representation dimension is greater than the original dimensionality of the data.
Of course these three criteria are certainly not mutually exclusive. Low-
dimensional representations often yield elements that are more-or-less mutually
independent. This happens because the pressure to encode as much information
about the data x as possible into a low-dimensional representation drives the el-
ements of this representation to be more independent. Any dependency between
the variables in f(x) is evidence of redundancy and implies that the representation
f(x) could have captured more information about x.
The notion of representation is one of the central themes of deep learning
and therefore one of the central themes in this book. Chapter 17 discusses some
of the qualities we would like in our learned representations, along with specific
representation learning algorithms more powerful than the simple one presented
next, Principal Components Analysis.
5.10.1 Principal Components Analysis
In the remainder of this section we will consider one of the most widely used unsupervised learning methods: Principal Components Analysis (PCA). PCA is an orthogonal, linear transformation of the data that projects it into a representation where the elements are uncorrelated (shown in Figure 5.6).
Figure 5.6: Illustration of the data representation learned via PCA.
In section 2.12, we saw that we could learn a one-dimensional representation
that best reconstructs the original data (in the sense of mean squared error) and
that this representation actually corresponds to the first principal component of
the data. Thus we can use PCA as a simple and effective dimensionality reduction
method that preserves as much of the information in the data as possible (again,
as measured by least-squares reconstruction error). In the following, we will
take a look at other properties of the PCA representation. Specifically, we will
study how the PCA representation can be said to decorrelate the original data
representation X.
Let us consider the n ×m-dimensional design matrix X. We will assume that
the data has a mean of zero, E[x] = 0. If this is not the case, the data can easily
be centered (mean removed). The unbiased sample covariance matrix associated
with X is given by:
\[
\mathrm{Var}[x] = \frac{1}{n-1}X^{\top}X \tag{5.43}
\]
One important aspect of PCA is that it finds a representation (through a linear transformation) z = W^⊤x where Var[z] is diagonal. To do this, we will make use of the singular value decomposition (SVD) of X: X = UΣW^⊤, where Σ is an n × m-dimensional rectangular diagonal matrix with the singular values of X on the main diagonal, U is an n × n matrix whose columns are orthonormal (i.e., unit length and orthogonal) and W is an m × m matrix also composed of orthonormal column vectors.
Using the SVD of X, we can re-express the variance of X as:
\begin{aligned}
\mathrm{Var}[x] &= \frac{1}{n-1}X^{\top}X \qquad (5.44)\\
&= \frac{1}{n-1}\left(U\Sigma W^{\top}\right)^{\top}U\Sigma W^{\top} \qquad (5.45)\\
&= \frac{1}{n-1}W\Sigma^{\top}U^{\top}U\Sigma W^{\top} \qquad (5.46)\\
&= \frac{1}{n-1}W\Sigma^{2}W^{\top}, \qquad (5.47)
\end{aligned}
where we use the orthonormality of U (U^⊤U = I) and define Σ² as an m × m-dimensional diagonal matrix with the squares of the singular values of X on the diagonal, i.e., the i-th diagonal element is given by Σ²_{i,i}. This shows that if we take z = W^⊤x, we can ensure that the covariance of z is diagonal as required.
\begin{aligned}
\mathrm{Var}[z] &= \frac{1}{n-1}Z^{\top}Z \qquad (5.48)\\
&= \frac{1}{n-1}W^{\top}X^{\top}XW \qquad (5.49)\\
&= \frac{1}{n-1}W^{\top}W\Sigma^{2}W^{\top}W \qquad (5.50)\\
&= \frac{1}{n-1}\Sigma^{2} \qquad (5.51)
\end{aligned}
Similar to our analysis of the variance of X above, we exploit the orthonormality of W (i.e., W^⊤W = I). Our use of the SVD to solve for the PCA components of X (i.e., the elements of z) reveals an interesting connection to the eigendecomposition of a matrix related to X. Specifically, the columns of W are the eigenvectors of the m × m-dimensional matrix X^⊤X.
The above analysis shows that when we project the data x to z, via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by Σ²), which immediately implies that the individual elements of z are mutually uncorrelated.
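A compact sketch of this computation (illustrative random data; note that numpy.linalg.svd returns W^⊤ as its third output):

```python
# Illustrative PCA via the SVD; the data is random and correlated by
# construction, and the projected components come out uncorrelated.
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))  # correlated columns
X = X - X.mean(axis=0)                                  # center the data

U, S, Wt = np.linalg.svd(X, full_matrices=False)        # X = U Sigma W^T
Z = X @ Wt.T                                            # project: Z = X W

print(np.round(Z.T @ Z / (n - 1), 6))  # diagonal covariance, as derived
print(S ** 2 / (n - 1))                # the entries Sigma^2 / (n - 1)
```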
This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (mediated via the transformation W) that aligns the principal axes of variance with the basis of the new representation space associated with z, as illustrated in Fig. 5.6. While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation. These issues are discussed below in Sec. 5.12 and later in detail in Chapter 17.
5.11 Weakly Supervised Learning
Weakly supervised learning is another class of learning methods that stands be-
tween supervised and unsupervised learning. It refers to a setting where the
datasets consist of (x, y) pairs, as in supervised learning, but where the labels
y are either unreliably present (i.e. with missing values) or noisy (i.e. where the
label given is not the true label).
Methods for working with weakly labeled data have recently grown in impor-
tance due to the—largely untapped—potential for using large quantities of readily
available weakly labeled data in a transfer learning paradigm to help solve prob-
lems where large, clean datasets are hard to come by. The Internet has become
a major source of this kind of noisy data.
For example, although we would like to train a computer vision system with
labels indicating the presence and location of every object (and which pixels
correspond to which object) in every image, such labeling is very human-labor
intensive. Instead, we want to take advantage of images for which only the main
object is identified, like the ImageNet dataset (Deng et al., 2009), or worse, of
video for which some general and high-level semantic spoken caption is approx-
imately temporally aligned with the corresponding frames of the video, like the
DVS data (Descriptive Video service) which has recently been released (Torabi
et al., 2015).
5.12 The Curse of Dimensionality and Statistical Limitations of Local Generalization
The number of variable configurations grows exponentially with the number of
variables, i.e., with dimension, which brings up a statistical form of the curse
of dimensionality, introduced in the next section. Many non-parametric learning
algorithms, such as kernel machines with a Gaussian kernel, rely on a simple pref-
erence over functions which corresponds to an assumption of smoothness or local
constancy. As argued in Section 5.12.2 that follows, this allows these algorithms to
generalize near the training examples, but does not allow them to generalize in a
non-trivial way far from them: the number of ups and downs that can be captured
is limited by the number of training examples. This is particularly problematic
with high-dimensional data, because of the curse of dimensionality. In order to
reduce that difficulty, researchers have introduced the ideas of dimensionality reduction and manifold learning, introduced in Section 5.12.3. This motivates the introduction of additional a priori knowledge about the task to be learned, as well as the idea of learning to better represent the data, the topic which constitutes the bulk of the rest of this book.
5.12.1 The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number
of dimensions in the data is high. This phenomenon is known as the curse of
dimensionality. Of particular concern is that the number of possible distinct con-
figurations of the variables of interest increases exponentially as the dimensionality
increases.
Figure 5.7: As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially. In the figure we first consider one-dimensional data (left), i.e., one variable for which we only care to distinguish 10 regions of interest. With enough examples falling within each of these regions (cells, in the figure), learning algorithms can easily generalize correctly, i.e., estimate the value of the target function within each region (and possibly interpolate between neighboring regions). With 2 dimensions (center), but still caring to distinguish 10 different values of each variable, we need to keep track of up to 10×10 = 100 regions, and we need at least that many examples to cover all those regions. With 3 dimensions (right) this grows to 10³ = 1000 regions and at least that many examples. For d dimensions and V values to be distinguished along each axis, it looks like we need O(V^d) regions and examples. This is an instance of the curse of dimensionality. However, note that if the data distribution is concentrated on a smaller set of regions, we may actually not need to cover all the possible regions, only those where probability is non-negligible. Figure graciously provided by, and with authorization from, Nicolas Chapados.
The curse of dimensionality rears its ugly head in many places in computer
science, and especially so in machine learning.
One challenge posed by the curse of dimensionality is a statistical challenge.
As illustrated in Figure 5.7, a statistical challenge arises because the number of
possible configurations of the variables of interest is much larger than the number
of training examples. To understand the issue, let us consider that the input space
is organized into a grid, like in the figure. In low dimensions we can describe this
space with a low number of grid cells that are mostly occupied by the data. The
least we can assume about the data generating distribution is that our learner
should provide the same answer to two examples falling in the same grid cell.
It is a form of local constancy assumption, a notion that we develop further in
the next section. When generalizing to a new data point, we can usually tell
what to do simply by inspecting the training examples that lie in the same cell
as the new input. For example, if estimating the probability density at some
point x, we can just return the number of training examples in the same unit
volume cell as x, divided by the total number of training examples. If we wish to
classify an example, we can return the most common class of training examples
in the same cell. If we are doing regression we can average the target values
observed over the examples in that cell. But what about the cells for which
we have seen no example? Because in high-dimensional spaces the number of
configurations is going to be huge, much larger than our number of examples,
most configurations will have no training example associated with it. How could
we possibly say something meaningful about these new configurations? A simple
answer is to extend the local constancy assumption into a smoothness assumption,
as explained next.
5.12.2 Smoothness and Local Constancy A Priori Preference
As argued previously, and especially in high-dimensional spaces (because of the
curse of dimensionality introduced above), machine learning algorithms need pri-
ors, i.e., a preference over the space of solutions, in order to generalize to new
configurations not seen in the training set. The specification of these preferences
includes the choice of model family, as well as any regularizer or other aspects
of the algorithm that influence the final outcome of training. We consider here
a particular family of preferences which underlie many classical machine learning
algorithms, and which we call the smoothness prior or the local constancy prior.
We find that when the function to be learned has many ups and downs, and this
is typically the case in high-dimensional spaces because of the curse of dimen-
sionality (see above), then the smoothness prior is insufficient to achieve good
generalization. We argue that more assumptions are needed in order to gener-
alize better, in this setting. Deep learning algorithms typically introduce such
additional assumptions. This starts with the classical multi-layer neural networks
studied in the next chapter (Chapter 6), and in Chapter 17 we return to the ad-
vantages that representation learning, distributed representations and depth can
bring towards generalization, even in high-dimensional spaces.
Different smoothness or local constancy priors can be expressed, but what
they basically say is that the target function or distribution of interest f* is such that
\[
f^{*}(x) \approx f^{*}(x + \epsilon) \tag{5.52}
\]
for most configurations x and small change ε. In other words, if we know a
good answer (e.g., for an example x) then that answer is probably good in the
neighborhood of x, and if we have several good answers in some neighborhood we
would combine them (e.g., by some form of averaging or interpolation) to produce
an answer that agrees with them as much as possible.
An example of a locally constant family of functions is the histogram, which
breaks the input space into a number of distinguishable regions or “bins” in which
any x can fall, and produces a constant output within each region. Another
example of a piecewise-constant learned function is what we obtain with k-nearest
neighbors predictors, where f(x) is constant in some region R containing all the
points x that have the same set of k nearest neighbors from the training set. If we
are doing classification and k=1, f(x) is just the output class associated with the
nearest neighbor of x in the training set. If we are doing regression, f(x) is the
average of the outputs associated with the k nearest neighbors of x. Note that
in both cases, for k = 1, the number of distinguishable regions cannot be more
than the number of training examples. See Murphy (2012); Bishop (2006) or other
machine learning textbooks for more material on histograms and nearest-neighbor
classifiers.
Figure 5.8: Illustration of interpolation and kernel-based methods, which construct a
smooth function by interpolating in various ways between the training examples (circles),
which act like knot points controlling the shape of the implicit regions that separate them
as well as the values to output within each region. Depending on the type of kernel, one
obtains a piecewise constant (histogram-like, in dotted red), a piecewise linear (dashed
black) or a smoother kernel (bold blue). The underlying assumption is that the target
function is as smooth or locally as constant as possible. This assumption allows us to generalize locally, i.e., to extend the answer known at some point x to nearby points, and this works very well so long as, as in the figure, there are enough examples to cover most of the ups and downs of the target function.
To obtain even more smoothness, we can interpolate between neighboring
training examples, as illustrated in Figure 5.8. For example, non-parametric ker-
nel density estimation methods and kernel regression methods construct a learned
function f of the form of Eq. 5.41 for classification or regression, or alternatively,
e.g., in the Parzen regression estimator, of the form
\[
f(x) = b + \sum_{i=1}^{n}\alpha_{i}\,\frac{k(x, x^{(i)})}{\sum_{j=1}^{n}k(x, x^{(j)})}.
\]
If the kernel function k is discrete (e.g., 0 or 1), then this can include the above cases where f is piecewise constant and a discrete set of regions (no more than one per training example) can be distinguished. However, better results can often be obtained if k is smooth, e.g., the Gaussian kernel from Eq. 5.42. With k a local kernel (Bengio et al., 2006a; Bengio and LeCun, 2007b; Bengio, 2009),⁶ we can think of each x^(i) as a template and the kernel function as a similarity function that matches a template and a test example.
With the Gaussian kernel, we do not have a piecewise constant function but
instead a continuous and smooth function. In fact, the choice of k can be shown
to correspond to a particular form of smoothness. Equivalently, we can think of
many of these estimators as the result of smoothing the empirical distribution
by convolving it with a function associated with the kernel, e.g., the Gaussian
kernel density estimator is the empirical distribution convolved with the Gaussian
density.
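A minimal sketch of such a kernel-weighted (Nadaraya-Watson style) regression estimator, taking b = 0 and α_i = y^(i) as in the classical non-parametric setting discussed just below; the bandwidth and data are illustrative assumptions:

```python
# Illustrative Parzen-style kernel regression with Gaussian kernel weights.
import numpy as np

def parzen_regression(x_query, X_train, y_train, sigma=0.5):
    d2 = (x_query[:, None] - X_train[None, :]) ** 2
    k = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian kernel weights
    return (k * y_train).sum(axis=1) / k.sum(axis=1)

rng = np.random.default_rng(0)
X_train = np.linspace(0.0, 6.0, 30)
y_train = np.sin(X_train) + rng.normal(scale=0.1, size=30)

# Predictions near the data roughly follow sin(x)
print(parzen_regression(np.array([1.0, 3.0, 5.0]), X_train, y_train))
```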
Although in classical non-parametric estimators the α_i of Eq. 5.41 are fixed (e.g., to 1/n for density estimation and to y^(i) for supervised learning from examples (x^(i), y^(i))), they can be optimized, and this is the basis of more modern non-parametric kernel methods (Schölkopf and Smola, 2002) such as the Support Vector Machine (Boser et al., 1992; Cortes and Vapnik, 1995) (see also Section 5.9.2).
However, as illustrated in Figure 5.8, even though these smooth kernel meth-
ods generalize better, the main thing that has changed is that one can basically
interpolate between the neighboring examples, in some space associated with the
kernel. One can then think of the training examples as control knots which locally
specify the shape of each region and the associated output.
⁶ i.e., with k(u, v) large when u = v and decreasing as u and v get farther apart
Figure 5.9: Decision tree (right) and how it cuts the input space into regions, with a
constant output in each region (left). Each node of the tree (circle or square) is associated
with a region (the entire space for the root node, with the empty string identifier).
Internal nodes (circles) split their region in two, in this case (the most common) via
an axis aligned cut. Leaf nodes (squares) are associated with an “answer”, such as the
average target output for the training examples that fall in the corresponding region.
Each node is displayed with a binary string identifier corresponding to its position in
the tree, obtained by adding a bit to its parent (0=choose left or top, 1=choose right or
bottom). Note that the result is a piecewise-constant function, and note how the number
of regions (pieces) cannot be greater than the number of examples; hence it is not possible
to learn a function that has more ups and downs than the number of training examples.
Another type of non-parametric learning algorithm that also breaks the input
space into regions and has separate parameters for each region is the decision
tree (Breiman et al., 1984) and its many variants. We give a brief account here
and illustrate decision trees in Figure 5.9, but please refer as needed to Murphy
(2012); Bishop (2006) or other machine learning textbooks for more material on
decision trees. Each node of the decision tree is associated with a region in the
input space, and internal nodes break that region into one sub-region for each
child of the node (typically using an axis-aligned cut). Typically, a constant
output f(n(x)) is returned by the decision tree predictor for any x falling in the
region associated with a particular leaf node n(x). Because each example only
informs the region in which it falls about which output to produce, one cannot
have more regions than training examples. If the target function can be well
approximated by cutting the input space into N regions (with a different answer
in each region), then at least N examples are needed (and a multiple of N is
needed to achieve some level of statistical confidence in the predicted output).
All this is also true if the tree is used for density estimation (the output is simply
an estimate of the density within the region, which can be obtained as the ratio of
the number of training examples in the region to the region's volume) or if
a non-constant (e.g., linear) predictor is associated with each leaf (then more
examples are needed within each leaf node, but the relationship between the number
of regions and the number of examples remains linear). We examine below how this
may hurt the generalization ability of decision trees and other learning algorithms
that are based only on the smoothness or local constancy priors when the input
is high-dimensional, i.e., because of the curse of dimensionality.
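The linear relationship between regions and examples is easy to check empirically. The sketch below is an illustration using scikit-learn's DecisionTreeRegressor, which is only one of the many tree variants mentioned above; it counts the leaves of a fully grown tree.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    for n in (10, 50, 200):
        X = rng.uniform(0, 1, size=(n, 1))
        y = np.sin(8 * np.pi * X[:, 0])            # many ups and downs
        tree = DecisionTreeRegressor().fit(X, y)   # grown until leaves are pure
        print(n, tree.get_n_leaves())              # never exceeds n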
In all cases, the smoothness assumption (Eq. 5.52) allows the learner to gen-
eralize locally. Since we assume that the target function obeys f^*(x) ≈ f^*(x + ε)
most of the time for small ε, we can generalize the empirical distribution (or the
(x, y) training pairs) to the neighborhood of the training examples. If (x^{(i)}, y^{(i)})
is a supervised (input, target) training example, then we expect f^*(x^{(i)}) ≈ y^{(i)},
and therefore if x is a near neighbor of x^{(i)}, we expect that f^*(x) ≈ y^{(i)}. By con-
sidering more neighbors, we can obtain better generalization, by better exploiting
the smoothness assumption.
In general, to distinguish O(N) regions in input space, all of these meth-
ods require O(N) examples (and typically there are O(N) parameters associated
with the O(N) regions). This is illustrated in Figure 5.10 in the case of a nearest-
neighbor or clustering scenario, where each training example can be used to define
one region. Is there a way to represent a complex function that has many more
regions to be distinguished than the number of training examples? Clearly, as-
suming only smoothness of the underlying function will not allow a learner to do
that. For example, imagine that the target function is a kind of checkerboard,
i.e., with many variations, but a simple structure to them, and imagine that
the number of training examples is substantially less than the number of black
and white regions. Based on local generalization and the smoothness or local
constancy prior, we could get the correct answer within a constant-color region,
but we could not correctly predict the checkerboard pattern. The only thing that
an example tells us, under this prior, is that nearby points should have the same
color, and the only way to get the checkerboard right is to cover all of its cells
with at least one example.
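The checkerboard thought experiment can be sketched in a few lines; the board size and number of examples below are illustrative assumptions. A 1-nearest-neighbor classifier generalizes purely locally, so a cell with no training example in it simply inherits the color of whichever example is nearest.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def checkerboard(X, cells=8):
        ij = np.floor(X * cells).astype(int)
        return (ij[:, 0] + ij[:, 1]) % 2            # 0/1 color of each cell

    rng = np.random.RandomState(0)
    X_train = rng.uniform(0, 1, size=(30, 2))       # far fewer examples than
    y_train = checkerboard(X_train)                 # the 64 cells of the board
    X_test = rng.uniform(0, 1, size=(2000, 2))

    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    print("1-NN accuracy:", knn.score(X_test, checkerboard(X_test)))

Because many cells contain no training example at all, the accuracy stays well below that of a learner able to exploit the periodic structure of the board.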
Figure 5.10: Illustration of how non-parametric learning algorithms that exploit only the
smoothness or local constancy priors typically break up the input space into regions, with
the examples in those regions being used both to define the region boundaries and to
determine the output within each region. The figure shows the case of clustering or 1-nearest-
neighbor classifiers, for which each training example (cross of a different color) defines a
region or a template (here, the different regions form a Voronoi tessellation). The number
of these contiguous regions cannot grow faster than the number of training examples. In
the case of a decision tree, the regions are recursively obtained by axis-aligned cuts within
existing regions, but for these and for kernel machines with a local kernel (such as the
Gaussian kernel), the same property holds, and generalization can only be local: each
training example only informs the learner about how to generalize in some neighborhood
around it.
The smoothness assumption and the associated non-parametric learning algorithms
work extremely well so long as there are enough examples to cover most of
the ups and downs of the target function. This is generally the case when the function
to be learned is smooth enough, as is typical for low-dimensional data. But if the
function is not very smooth (if we want to distinguish a huge number of
regions compared to the number of examples), is there any hope of generalizing
well?
Both of these questions are answered positively in Chapter 17. The key insight
is that a very large number of regions, e.g., O(2^N), can be defined with O(N)
examples, so long as we introduce some dependencies between the regions via ad-
ditional priors about the underlying data generating distribution. In this way, we
can actually generalize non-locally (Bengio and Monperrus, 2005; Bengio et al.,
2006b). A neural network can actually learn a checkerboard pattern. Similarly,
some recurrent neural networks can learn the n-bit parity (at least for some not
too large values of n). Of course we could also solve the checkerboard task by
making a much stronger assumption, e.g., that the target function is periodic.
However, neural networks can generalize to a much wider variety of structures,
and indeed our AI tasks have structure that is much too complex to be limited
to periodicity, so we want learning algorithms that embody more general-purpose
assumptions. The core idea in deep learning is that we assume that the data was
generated by the composition of factors or features, potentially at multiple levels
in a hierarchy. These apparently mild assumptions allow an exponential gain in
the relationship between the number of examples and the number of regions that
can be distinguished, as discussed in Chapter 17. Priors that are based on com-
positionality, such as arising from learning distributed representations and from
a deep composition of representations, can give an exponential advantage, which
can hopefully counter the exponential curse of dimensionality. Chapter 17 dis-
cusses these questions from the angle of representation learning and the objective
of disentangling the underlying factors of variation.
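The flavor of this exponential gain can be sketched as follows: n binary features, here the sign patterns of n random linear units, can in principle distinguish up to 2^n regions of the input space even though only O(n) parameters are involved. The random features and dimensions below are illustrative assumptions, not a learned representation.

    import numpy as np

    rng = np.random.RandomState(0)
    n_features, d, n_points = 10, 10, 100000
    W = rng.randn(n_features, d)        # 10 random hyperplanes in 10-D
    b = rng.randn(n_features)
    X = rng.randn(n_points, d)

    codes = X.dot(W.T) + b > 0          # one 10-bit code per input point
    distinct = {tuple(c) for c in codes}
    print(len(distinct), "distinct codes out of a possible", 2 ** n_features)
    # Each feature is shared across exponentially many regions, unlike the
    # local templates above, where one example informs only one region.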
5.12.3 Manifold Learning and the Curse of Dimensionality
We consider here a particular type of machine learning task called manifold learn-
ing. Although manifold learning algorithms were introduced to reduce the curse of
dimensionality, we will argue that they allow one to visualize and highlight how the
smoothness prior is not sufficient to generalize in high-dimensional spaces. Chapter 18
is devoted to the manifold perspective on representation learning and goes into much
greater detail on this topic, as well as on actual manifold learning algorithms based
on neural networks.
A manifold is a connected region, i.e., a set of points, associated with a neigh-
borhood around each point, which makes it locally look like a Euclidean space.
The notion of neighbor implies the existence of transformations that can be applied
to move on the manifold from one position to a neighboring one. Although
there is a formal mathematical meaning to this term, in machine learning it tends
to be used more loosely to talk about a connected set of points that can be well
approximated by considering only a small number of degrees of freedom, or di-
mensions, embedded in a higher-dimensional space. Each dimension corresponds
to a local direction of variation, i.e., moving along the manifold in some direction.
The manifolds we talk about in machine learning are subsets of points, also called
submanifolds, of the embedding space (which is itself a manifold).
Manifold learning algorithms assume that the data distribution is concen-
trated in a small number of dimensions, i.e., that the set of high-probability con-
figurations can be approximated by a low-dimensional manifold. Figure 5.6 (left)
illustrates a distribution that is concentrated near a linear manifold (the manifold
is along a 1-dimensional straight line). Manifold learning was introduced in the
case of continuous-valued data and the unsupervised learning setting, although
this probability concentration idea can be generalized to both discrete data and
the supervised learning setting: the key assumption remains that probability mass
is highly concentrated.
Is this assumption reasonable? It seems to be true for almost all of the AI
tasks, such as those involving images, sounds, and text. To be convinced of this, we
will invoke (a) the observation that probability mass is concentrated, and (b) the
observation that the objects we encounter can generally be transformed into other
plausible configurations via small changes (which indicates a notion of direction of
variation while staying on the “manifold”). For (a), consider that if the assumption
of probability concentration were false, then sampling uniformly at random from the
set of all configurations (e.g., uniformly in R^n) should produce probable (data-like)
configurations reasonably often. But this is not what we observe in practice. For
example, generate pixel configurations for an image by independently picking the
grey level (or a binary 0 vs. 1) for each pixel. What kind of images do you get?
You get “white noise” images that look like the static of old television sets when no signal
is coming in, as illustrated in Figure 5.11 (left). What is the probability that
you would obtain something that looks like a natural image, with this procedure?
Almost zero, because the set of probable configurations (near the manifold of
natural images) occupies a very small volume out of the total set of pixel con-
figurations. Similarly, if you generate a document by picking letters randomly,
what is the probability that you will get a meaningful English-language text? Al-
most zero, again, because most of the long sequences of letters do not correspond
to a natural language sequence: the distribution of natural language sequences
occupies a very small volume in the total space of sequences of letters.
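The thought experiment for images takes only a line or two of code; the image size below is an illustrative assumption, and displaying the result (e.g., with matplotlib) is left optional.

    import numpy as np

    rng = np.random.RandomState(0)
    noise = rng.randint(0, 256, size=(64, 64))  # one uniform grey level per pixel
    # Virtually every draw from this process looks like television static:
    # natural images occupy an exponentially tiny fraction of the
    # 256 ** (64 * 64) possible pixel configurations.
    print(noise.shape, noise.min(), noise.max())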
Figure 5.11: Sampling images uniformly at random, e.g., by picking each pixel independently
according to a uniform distribution, gives rise to white noise images such as the one illustrated
on the left. Although there is a non-zero probability of generating something that looks like a
natural image (like those on the right), that probability is exponentially tiny (exponential
in the number of pixels!). This suggests that natural images are very “special”, and that
they occupy a tiny volume of the space of images.
The above thought experiments, which are in agreement with the many exper-
imental results of the manifold learning literature, e.g. (Cayton, 2005; Narayanan
and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum
et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003;
Weinberger and Saul, 2004), clearly establish that for a large class of datasets of
interest in AI, the manifold hypothesis is true: the data generating distribution
concentrates in a small number of dimensions, as in the cartoon of Figure 18.4,
from Chapter 18. That chapter explores the relationships between representation
learning and manifold learning: if the data distribution concentrates on a smaller
number of dimensions, then we can think of these dimensions as natural coordi-
nates for the data, and we can think of representation learning algorithms as ways
to map the input space to a new and often lower-dimensional space which captures
the leading dimensions of variation present in the data as axes or dimensions of
the representation.
An initial hope of early work on manifold learning (Roweis and Saul, 2000;
Tenenbaum et al., 2000) was to reduce the effect of the curse of dimensionality,
by first reducing the data to a lower dimensional representation (e.g. mapping
(x_1, x_2) to z_1 in Figure 5.6 (right)), and then applying ordinary machine learning
in that transformed space. This dimensionality reduction can be achieved by
learning a transformation (generally non-linear, unlike with PCA introduced in
Section 5.10.1) of the data that is invertible for most training examples, i.e., one that
keeps the information in the input example. Of course, it is only possible to reconstruct
input examples from their low-dimensional representation because they lie on a lower-
dimensional manifold. This is basically how auto-encoders (Chapter 16)
are trained.
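A minimal sketch of this idea: train a network with a 1-dimensional bottleneck to reconstruct 2-D points lying near a 1-D manifold. Using scikit-learn's MLPRegressor as a stand-in auto-encoder, along with the architecture and data below, is an illustrative assumption, not the book's prescription.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.RandomState(0)
    t = rng.uniform(-1, 1, size=(500, 1))
    X = np.hstack([t, np.sin(3 * t)]) + 0.01 * rng.randn(500, 2)  # near a curve

    # Encoder and decoder share one training objective: reconstruct the input
    # through a 1-unit bottleneck (hidden sizes 8 -> 1 -> 8).
    ae = MLPRegressor(hidden_layer_sizes=(8, 1, 8), activation="tanh",
                      max_iter=5000, random_state=0).fit(X, X)
    print("reconstruction MSE:", np.mean((ae.predict(X) - X) ** 2))

Reconstruction through the bottleneck can only succeed because the data concentrates near a 1-D manifold; for data filling the full 2-D space, no such invertible 1-D map exists.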
The hope was that by non-linearly projecting the data into a new space of lower
dimension, we would reduce the curse of dimensionality by only looking at rel-
evant dimensions, i.e., a smaller set of regions of interest (cells, in Figure 5.7).
This can indeed be the case. However, as discussed in Chapter 18, the manifolds
can be highly curved and have a very large number of twists, still requiring a very
large number of regions to be distinguished (every up and down of each corner of
the manifold). And even if we were to reduce the dimensionality of an input from
10000 (e.g., 100 × 100 binary pixels) to 100, 2^100 is still too large a number of
configurations to hope to cover with a training set. This still rules out the use of
purely local generalization (i.e., the smoothness prior only) to model such manifolds,
as discussed in Chapter 18 around Figures 18.4 and 18.5. It may also be that although the effective
dimensionality of the data could be small, some examples could fall outside of the
main manifold, and we do not want to systematically lose that information.
A sparse representation then becomes a possible way to represent data that is
mostly low-dimensional, although occasionally occupying more dimensions. This
can be achieved with a high-dimensional representation whose elements are 0 most
of the time. The effective dimension (the number of non-zeros) can then change
depending on where we are in input space, which can be useful.
Sparse representations are discussed in Section 16.7.
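A sketch of what such a representation can look like: project the input into many dimensions, then soft-threshold so that most coordinates are exactly zero. The random dictionary and threshold below are illustrative assumptions rather than a learned sparse coding model (see Section 16.7 for the latter).

    import numpy as np

    rng = np.random.RandomState(0)
    D = rng.randn(2, 50)        # map 2-D inputs into a 50-D representation
    x = rng.randn(2)

    h = x.dot(D)
    code = np.sign(h) * np.maximum(np.abs(h) - 2.0, 0.0)  # soft-threshold
    print("non-zeros:", np.count_nonzero(code), "of", code.size)
    # The effective dimensionality (the number of non-zeros) varies with x.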
The next part of the book introduces specific deep learning algorithms that
aim at discovering representations that are useful for some task, i.e., that try to
extract the directions of variation that matter for the task of interest, often in a
supervised setting. The last part of the book concentrates more on unsupervised
representation learning algorithms, which attempt to capture all of the directions
of variation that are salient in the data distribution.