Chapter 5
Machine Learning Basics
Deep learning is a specific kind of machine learning. In order to understand
deep learning well, one must have a solid understanding of the basic principles
of machine learning. This chapter provides a brief course in the most important
general principles that will be applied throughout the rest of the book. Novice
readers or those who want a wider perspective are encouraged to consider machine
learning textbooks with a more comprehensive coverage of the fundamentals, such
as Murphy (2012) or Bishop (2006). If you are already familiar with machine
learning basics, feel free to skip ahead to Sec. 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.
We begin with a definition of what a learning algorithm is, and present an
example: the linear regression algorithm. We then proceed to describe how the
challenge of fitting the training data differs from the challenge of finding patterns
that generalize to new data. Most machine learning algorithms have settings
called hyperparameters that must be determined external to the learning algorithm
itself; we discuss how to set these using additional data. Machine learning is
essentially a form of applied statistics with increased emphasis on the use of
computers to statistically estimate complicated functions and a decreased emphasis
on proving confidence intervals around these functions; we therefore present the
two central approaches to statistics: frequentist estimators and Bayesian inference.
Most machine learning algorithms can be divided into the categories of supervised
learning and unsupervised learning; we describe these categories and give some
examples of simple learning algorithms from each category. Most deep learning
algorithms are based on an optimization algorithm called stochastic gradient
descent. We describe how to combine various algorithm components such as an
optimization algorithm, a cost function, a model, and a dataset to build a machine
learning algorithm. Finally, in Sec. 5.11, we describe some of the factors that have
limited the ability of traditional machine learning to generalize. These challenges
have motivated the development of deep learning algorithms that overcome these
obstacles.
5.1 Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But
what do we mean by learning? Mitchell (1997) provides the definition “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not make any attempt in this book to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
5.1.1 The Task, T
Machine learning allows us to tackle tasks that are too difficult to solve with
fixed programs written and designed by human beings. From a scientific and
philosophical point of view, machine learning is interesting because developing our
understanding of machine learning entails developing our understanding of the
principles that underlie intelligence.
In this relatively formal definition of the word “task,” the process of learning
itself is not the task. Learning is our means of attaining the ability to perform the
task. For example, if we want a robot to be able to walk, then walking is the task.
We could program the robot to learn to walk, or we could attempt to directly write
a program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine
learning system should process an example. An example is a collection of features
that have been quantitatively measured from some object or event that we want
the machine learning system to process. We typically represent an example as a
vector
x R
n
where each entry
x
i
of the vector is another feature. For example,
the features of an image are usually the values of the pixels in the image.
Many kinds of tasks can be solved with machine learning. Some of the most
common machine learning tasks include the following:
Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : R^n → {1, . . . , k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that allows computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and allow computers to interact more naturally with their users.
Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. When every input is available, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do. (A minimal numeric sketch of the marginalization strategy appears at the end of this list of tasks.)
Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : R^n → R. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.
Transcription: In this type of task, the machine learning system is asked to
observe a relatively unstructured representation of some kind of data and
transcribe it into discrete, textual form. For example, in optical character
recognition, the computer program is shown a photograph containing an
image of text and is asked to return this text in the form of a sequence
of characters (e.g., in ASCII or Unicode format). Google Street View uses
deep learning to process address numbers in this way (Goodfellow et al.,
2014d). Another example is speech recognition, where the computer program
is provided an audio waveform and emits a sequence of characters or word
ID codes describing the words that were spoken in the audio recording. Deep
learning is a crucial component of modern speech recognition systems used
at major companies including Microsoft, IBM and Google (Hinton et al.,
2012b).
Machine translation: In a machine translation task, the input already consists
of a sequence of symbols in some language, and the computer program must
convert this into a sequence of symbols in another language. This is commonly
applied to natural languages, such as to translate from English to French.
Deep learning has recently begun to have an important impact on this kind
of task (Sutskever et al., 2014a; Bahdanau et al., 2015).
Structured output: Structured output tasks involve any task where the output
is a vector (or other data structure containing multiple values) with important
relationships between the different elements. This is a broad category, and
subsumes the transcription and translation tasks described above, but also
many other tasks. One example is parsing—mapping a natural language
sentence into a tree that describes its grammatical structure and tagging nodes
of the trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011)
for an example of deep learning applied to a parsing task. Another example
is pixel-wise segmentation of images, where the computer program assigns
every pixel in an image to a specific category. For example, deep learning can
be used to annotate the locations of roads in aerial photographs (Mnih and
Hinton, 2010). The output need not have its form mirror the structure of
the input as closely as in these annotation-style tasks. For example, in image
captioning, the computer program observes an image and outputs a natural
language sentence describing the image (Kiros et al., 2014a,b; Mao et al.,
2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015;
Fang et al., 2015; Xu et al., 2015a). These tasks are called structured output
tasks because the program must output several values that are all tightly
inter-related. For example, the words produced by an image captioning
program must form a valid sentence.
Anomaly detection: In this type of task, the computer program sifts through
a set of events or objects, and flags some of them as being unusual or atypical.
An example of an anomaly detection task is credit card fraud detection. By
modeling your purchasing habits, a credit card company can detect misuse
of your cards. If a thief steals your credit card or credit card information,
the thief’s purchases will often come from a different probability distribution
over purchase types than your own. The credit card company can prevent
fraud by placing a hold on an account as soon as that card has been used
for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of
anomaly detection methods.
Synthesis and sampling: In this type of task, the machine learning algorithm
is asked to generate new examples that are similar to those in the training
data. Synthesis and sampling via machine learning can be useful for media
applications where it can be expensive or boring for an artist to generate large
volumes of content by hand. For example, video games can automatically
generate textures for large objects or landscapes, rather than requiring an
artist to manually label each pixel (Luo et al., 2013). In some cases, we
want the sampling or synthesis procedure to generate some specific kind of
output given the input. For example, in a speech synthesis task, we provide a
written sentence and ask the program to emit an audio waveform containing
a spoken version of that sentence. This is a kind of structured output task,
but with the added qualification that there is no single correct output for
each input, and we explicitly desire a large amount of variation in the output,
in order for the output to seem more natural and realistic.
Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ R^n, but with some entries x_i of x missing. The algorithm must provide a prediction of the values of the missing entries.
Denoising: In this type of task, the machine learning algorithm is given a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).
Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : R^n → R, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require that the learning algorithm has at least implicitly captured the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing and all of the other values, denoted x_{-i}, are given, then we know the distribution over it is given by p(x_i | x_{-i}). In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on p(x) are computationally intractable.
Of course, many other tasks and types of tasks are possible. The types of tasks
we list here are intended only to provide examples of what machine learning can
do, not to define a rigid taxonomy of tasks.
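Returning to the classification-with-missing-inputs task above, here is the promised sketch of the marginalization strategy in Python with NumPy. The joint distribution table and its values are invented for illustration; the text does not prescribe any particular model.

import numpy as np

# Hypothetical joint distribution p(x1, x2, y) over two binary features
# and a binary class, stored as a 2x2x2 table that sums to 1.
p_joint = np.array([[[0.15, 0.05], [0.10, 0.10]],
                    [[0.05, 0.15], [0.10, 0.30]]])

def classify(x1=None, x2=None):
    # Condition on each observed feature; marginalize (sum out) each
    # missing one. This covers all 2^n patterns of missing inputs with
    # a single learned joint distribution.
    p = p_joint
    p = p[x1] if x1 is not None else p.sum(axis=0)
    p = p[x2] if x2 is not None else p.sum(axis=0)
    return p / p.sum()  # normalized distribution over the class y

print(classify(x1=1))        # x2 missing: summed out
print(classify(x1=0, x2=1))  # fully observed input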
5.1.2 The Performance Measure, P
In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain
equivalent information by measuring the error rate, the proportion of examples for
which the model produces an incorrect output. We often refer to the error rate as
the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly
classified and 1 if it is not. For tasks such as density estimation, it does not make
sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we
must use a different performance metric that gives the model a continuous-valued
score for each example. The most common approach is to report the average
log-probability the model assigns to some examples.
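As a small concrete illustration of these measures, the following sketch computes accuracy, error rate, and average log-probability from hypothetical model outputs (all arrays here are invented for the example):

import numpy as np

y_true = np.array([0, 2, 1, 1, 0])   # hypothetical labels (3 classes)
y_pred = np.array([0, 2, 2, 1, 0])   # hypothetical model predictions

accuracy = np.mean(y_pred == y_true)  # proportion of correct outputs
error_rate = 1.0 - accuracy           # the expected 0-1 loss

# For density-style evaluation: average log-probability that a
# (hypothetical) probabilistic model assigns to the observed labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.6, 0.1],
                  [0.8, 0.1, 0.1]])
avg_log_prob = np.mean(np.log(probs[np.arange(len(y_true)), y_true]))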
Usually we are interested in how well the machine learning algorithm performs
on data that it has not seen before, since this determines how well it will work when
deployed in the real world. We therefore evaluate these performance measures
using a test set of data that is separate from the data used for training the machine
learning system.
The choice of performance measure may seem straightforward and objective,
but it is often difficult to choose a performance measure that corresponds well to
the desired behavior of the system.
In some cases, this is because it is difficult to decide what should be measured.
For example, when performing a transcription task, should we measure the accuracy
of the system at transcribing entire sequences, or should we use a more fine-grained
performance measure that gives partial credit for getting some elements of the
sequence correct? When performing a regression task, should we penalize the
system more if it frequently makes medium-sized mistakes or if it rarely makes
very large mistakes? These kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but
measuring it is impractical. For example, this arises frequently in the context of
density estimation. Many of the best probabilistic models represent probability
distributions only implicitly. Computing the actual probability value assigned to
a specific point in space in many such models is intractable. In these cases, one
must design an alternative criterion that still corresponds to the design objectives,
or design a good approximation to the desired criterion.
5.1.3 The Experience, E
Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
Most of the learning algorithms in this book can be understood as being allowed
to experience an entire dataset. A dataset is a collection of many examples, as
defined in Sec. 5.1.1. Sometimes we will also call examples data points.
One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example.
The features within each example are the measurements of each of the parts of the
plant: the sepal length, sepal width, petal length and petal width. The dataset
also records which species each plant belonged to. Three different species are
represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many features,
then learn useful properties of the structure of this dataset. In the context of deep
learning, we usually want to learn the entire probability distribution that generated
a dataset, whether explicitly as in density estimation or implicitly for tasks like
synthesis or denoising. Some other unsupervised learning algorithms perform other
roles, like clustering, which consists of dividing the dataset into clusters of similar
examples.
Supervised learning algorithms experience a dataset containing features, but
each example is also associated with a label or target. For example, the Iris dataset
is annotated with the species of each iris plant. A supervised learning algorithm
can study the Iris dataset and learn to classify iris plants into three different species
based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y | x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms.
The lines between them are often blurred. Many machine learning technologies can
be used to perform both tasks. For example, the chain rule of probability states
that for a vector x ∈ R^n, the joint distribution can be decomposed as

p(x) = ∏_{i=1}^n p(x_i | x_1, . . . , x_{i-1}).    (5.1)
This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring

p(y | x) = p(x, y) / Σ_{y'} p(x, y').    (5.2)
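As a minimal numeric sketch of Eq. 5.2 (the joint table below is invented), the conditional can be read off a learned joint distribution by normalizing each row:

import numpy as np

# Hypothetical joint p(x, y): x takes 3 values, y takes 2; sums to 1.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])

# Eq. 5.2: p(y | x) = p(x, y) / sum over y' of p(x, y')
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
print(p_y_given_x[1])  # distribution over y given the second value of x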
Though unsupervised learning and supervised learning are not completely formal or
distinct concepts, they do help to roughly categorize some of the things we do with
machine learning algorithms. Traditionally, people refer to regression, classification
and structured output problems as supervised learning. Density estimation in
support of other tasks is usually considered unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semi-
supervised learning, some examples include a supervision target but others do
not. In multi-instance learning, an entire collection of examples is labeled as
containing or not containing an example of a class, but the individual members
of the collection are not labeled. For a recent example of multi-instance learning
with deep models, see Kotzias et al. (2015).
Some machine learning algorithms do not just experience a fixed dataset. For
example, reinforcement learning algorithms interact with an environment, so there
is a feedback loop between the learning system and its experiences. Such algorithms
are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas
and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al.
(2013) for the deep learning approach to reinforcement learning.
Most machine learning algorithms simply experience a dataset. A dataset can
be described in many ways. In all cases, a dataset is a collection of examples,
which are in turn collections of features.
One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ R^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.
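As a concrete sketch (assuming scikit-learn is installed; the library is not mentioned in the text), the Iris dataset can be loaded directly as such a design matrix:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # design matrix, shape (150, 4): one example per row
y = iris.target  # species code for each example, shape (150,)

print(X.shape)             # (150, 4)
print(X[0, 0])             # sepal length of plant 0, i.e. X_{1,1}
print(iris.feature_names)  # the four feature (column) names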
Of course, to describe a dataset as a design matrix, it must be possible to
describe each example as a vector, and each of these vectors must be the same size.
This is not always possible. For example, if you have a collection of photographs
with different widths and heights, then different photographs will contain different
numbers of pixels, so not all of the photographs may be described with the same
length of vector. Sec. 9.7 and Chapter 10 describe how to handle different types
of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with m rows, we will describe it as a set containing m elements: {x^(1), x^(2), . . . , x^(m)}. This notation does not imply that any two example vectors x^(i) and x^(j) have the same size.
In the case of supervised learning, the example contains a label or target as
well as a collection of features. For example, if we want to use a learning algorithm
to perform object recognition from photographs, we need to specify which object
appears in each of the photos. We might do this with a numeric code, with 0
signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with y_i providing the label for example i.
Of course, sometimes the label may be more than just a single number. For
example, if we want to train a speech recognition system to transcribe entire
sentences, then the label for each example sentence is a sequence of words.
Just as there is no formal definition of supervised and unsupervised learning,
there is no rigid taxonomy of datasets or experiences. The structures described here
cover most cases, but it is always possible to design new ones for new applications.
5.1.4 Example: Linear Regression
Our definition of a machine learning algorithm as an algorithm that is capable
of improving a computer program’s performance at some task via experience is
somewhat abstract. To make this more concrete, we present an example of a simple
machine learning algorithm: linear regression. We will return to this example
repeatedly as we introduce more machine learning concepts that help to understand
its behavior.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ R^n as input and predict the value of a scalar y ∈ R as its output. In the case of linear regression, the output is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be

ŷ = w^T x    (5.3)

where w ∈ R^n is a vector of parameters.
Parameters are values that control the behavior of the system. In this case, w_i is the coefficient that we multiply by feature x_i before summing up the contributions from all the features. We can think of w as a set of weights that determine how each feature affects the prediction. If a feature x_i receives a positive weight w_i, then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature’s weight is large in magnitude, then it has a large effect on the prediction. If a feature’s weight is zero, it has no effect on the prediction.
We thus have a definition of our task T: to predict y from x by outputting ŷ = w^T x. Next we need a definition of our performance measure, P.
Suppose that we have a design matrix of m example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of y for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as X^(test) and the vector of regression targets as y^(test).
One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If ŷ^(test) gives the predictions of the model on the test set, then the mean squared error is given by

MSE_test = (1/m) Σ_i (ŷ^(test) − y^(test))_i^2.    (5.4)

Intuitively, one can see that this error measure decreases to 0 when ŷ^(test) = y^(test).
We can also see that

MSE_test = (1/m) ||ŷ^(test) − y^(test)||_2^2,    (5.5)

so the error increases whenever the Euclidean distance between the predictions and the targets increases.
To make a machine learning algorithm, we need to design an algorithm that will improve the weights w in a way that reduces MSE_test when the algorithm is allowed to gain experience by observing a training set (X^(train), y^(train)). One intuitive way of doing this (which we will justify later, in Sec. 5.5.1) is just to minimize the mean squared error on the training set, MSE_train.
To minimize MSE_train, we can simply solve for where its gradient is 0:

∇_w MSE_train = 0    (5.6)

∇_w (1/m) ||ŷ^(train) − y^(train)||_2^2 = 0    (5.7)

(1/m) ∇_w ||X^(train) w − y^(train)||_2^2 = 0    (5.8)
Figure 5.1: A linear regression problem, with a training set consisting of ten data points, each containing one feature. Because there is only one feature, the weight vector w contains only a single parameter to learn, w_1. (Left) Observe that linear regression learns to set w_1 such that the line y = w_1 x comes as close as possible to passing through all the training points. (Right) The plotted point indicates the value of w_1 found by the normal equations, which we can see minimizes the mean squared error on the training set.
∇_w (X^(train) w − y^(train))^T (X^(train) w − y^(train)) = 0    (5.9)

∇_w ( w^T (X^(train))^T X^(train) w − 2 w^T (X^(train))^T y^(train) + (y^(train))^T y^(train) ) = 0    (5.10)

2 (X^(train))^T X^(train) w − 2 (X^(train))^T y^(train) = 0    (5.11)

w = ( (X^(train))^T X^(train) )^{-1} (X^(train))^T y^(train)    (5.12)
The system of equations whose solution is given by Eq. 5.12 is known as the
normal equations. Evaluating Eq. 5.12 constitutes a simple learning algorithm.
For an example of the linear regression learning algorithm in action, see Fig. 5.1.
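The following sketch implements Eq. 5.12 on synthetic data (the generating weights and noise level are made up). Solving the linear system is used instead of forming the matrix inverse explicitly, a standard numerical practice:

import numpy as np

# Synthetic training data for illustration: 100 examples, 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])        # made-up generating weights
y_train = X_train @ true_w + rng.normal(scale=0.1, size=100)

# Eq. 5.12: w = (X^T X)^{-1} X^T y, computed by solving the normal
# equations X^T X w = X^T y rather than inverting X^T X.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Evaluate with the test-set MSE of Eq. 5.4 / 5.5.
X_test = rng.normal(size=(20, 3))
y_test = X_test @ true_w
mse_test = np.mean((X_test @ w - y_test) ** 2)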
It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter: an intercept term b. In this model

ŷ = w^T x + b    (5.13)

so the mapping from parameters to predictions is still a linear function, but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model’s predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter b, one can continue to use the model with only weights but augment x with an extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term “linear” when referring to affine functions throughout this book.
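A two-line sketch of that augmentation trick, with a made-up design matrix:

import numpy as np

X = np.array([[0.5, 1.2],
              [1.0, -0.3]])                       # made-up examples
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append constant 1
# The weight learned for the constant column acts as the bias b.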
The intercept term b is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being b in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm’s expected estimate of a quantity is not equal to the true quantity.
Linear regression is of course an extremely simple and limited learning algorithm,
but it provides an example of how a learning algorithm can work. In the subsequent
sections we will describe some of the basic principles underlying learning algorithm
design and demonstrate how these principles can be used to build more complicated
learning algorithms.
5.2 Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on
new,
previously unseen
inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training
set, we can compute some error measure on the training set called the training
error, and we reduce this training error. So far, what we have described is simply
an optimization problem. What separates machine learning from optimization is
that we want the generalization error, also called the test error, to be low as well.
The generalization error is defined as the expected value of the error on a new
input. Here the expectation is taken across different possible inputs, drawn from
the distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by
measuring its performance on a test set of examples that were collected separately
from the training set.
In our linear regression example, we trained the model by minimizing the training error,

(1/m^(train)) ||X^(train) w − y^(train)||_2^2,    (5.14)

but we actually care about the test error, (1/m^(test)) ||X^(test) w − y^(test)||_2^2.
How can we affect performance on the test set when we get to observe only the
training set? The field of statistical learning theory provides some answers. If the
training and the test set are collected arbitrarily, there is indeed little we can do.
If we are allowed to make some assumptions about how the training and test set
are collected, then we can make some progress.
The train and test data are generated by a probability distribution over datasets called the data generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data generating distribution, denoted p_data. This probabilistic framework and the i.i.d. assumptions allow us to mathematically study the relationship between training error and test error.
One immediate connection we can observe between the training and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution p(x, y) and we sample from it repeatedly to generate the train set and the test set. For some fixed value w, the expected training set error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling process. The only difference between the two conditions is the name we assign to the dataset we sample.
Of course, when we use a machine learning algorithm, we do not fix the
parameters ahead of time, then sample both datasets. We sample the training set,
then use it to choose the parameters to reduce training set error, then sample the
test set. Under this process, the expected test error is greater than or equal to
the expected value of training error. The factors determining how well a machine
learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning:
underfitting and overfitting. Underfitting occurs when the model is not able to
obtain a sufficiently low error value on the training set. Overfitting occurs when
the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering
its capacity. Informally, a model’s capacity is its ability to fit a wide variety of
functions. Models with low capacity may struggle to fit the training set. Models
with high capacity can overfit by memorizing properties of the training set that do
not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its
hypothesis space, the set of functions that the learning algorithm is allowed to
select as being the solution. For example, the linear regression algorithm has the
set of all linear functions of its input as its hypothesis space. We can generalize
linear regression to include polynomials, rather than just linear functions, in its
hypothesis space. Doing so increases the model’s capacity.
A polynomial of degree one gives us the linear regression model with which we
are already familiar, with prediction
ˆy = b + wx. (5.15)
By introducing x^2 as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of x:

ŷ = b + w_1 x + w_2 x^2.    (5.16)
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:

ŷ = b + Σ_{i=1}^9 w_i x^i.    (5.17)
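As a sketch of Eq. 5.17 in code (the training data is invented, and the Moore-Penrose pseudoinverse is used because the degree-9 problem is underdetermined with few points, as discussed for Fig. 5.2 below):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=10)    # made-up scalar inputs
y = 1.0 + 2.0 * x - 3.0 * x ** 2   # true function is quadratic

# Expand each scalar x into the features 1, x, x^2, ..., x^9, so the
# model of Eq. 5.17 is linear in its parameters.
X_poly = np.vander(x, N=10, increasing=True)

# With 10 points and 10 parameters the normal equations are nearly
# singular, so use the Moore-Penrose pseudoinverse (cf. Fig. 5.2).
w = np.linalg.pinv(X_poly) @ y
y_hat = X_poly @ w  # fits the training points (almost) exactly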
Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform
and the amount of training data they are provided with. Models with insufficient
capacity are unable to solve complex tasks. Models with high capacity can solve
complex tasks, but when their capacity is higher than needed to solve the present
task they may overfit.
Fig. 5.2 shows this principle in action. We compare a linear, quadratic and
degree-9 predictor attempting to fit a problem where the true underlying function
is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing
the correct function, but it is also capable of representing infinitely many other
functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution
that generalizes well when so many wildly different solutions exist. In this example,
the quadratic model is perfectly matched to the true structure of the task so it
generalizes well to new data.
Figure 5.2: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all of the training points exactly, but we have not been lucky enough for it to extract the correct structure. It now has a deep valley in between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.
So far we have only described changing a model’s capacity by changing the
number of input features it has (and simultaneously adding new parameters
associated with those features). There are in fact many ways of changing a model’s
capacity. Capacity is not determined only by the choice of model. The model
specifies which family of functions the learning algorithm can choose from when
varying the parameters in order to reduce a training objective. This is called the
representational capacity of the model. In many cases, finding the best function
within this family is a very difficult optimization problem. In practice, the learning
algorithm does not actually find the best function, but merely one that significantly
reduces the training error. These additional limitations, such as the imperfection
of the optimization algorithm, mean that the learning algorithm’s effective capacity
may be less than the representational capacity of the model family.
Our modern ideas about improving the generalization of machine learning
models are refinements of thought dating back to philosophers at least as early
as Ptolemy. Many early scholars invoke a principle of parsimony that is now
most widely known as Occam’s razor (c. 1287-1347). This principle states that
among competing hypotheses that explain known observations equally well, one
should choose the “simplest” one. This idea was formalized and made more precise
in the 20th century by the founders of statistical learning theory (Vapnik and
Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
Statistical learning theory provides various means of quantifying model capacity.
Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC
dimension. The VC dimension measures the capacity of a binary classifier. The
VC dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.
Quantifying the capacity of the model allows statistical learning theory to
make quantitative predictions. The most important results in statistical learning
theory show that the discrepancy between training error and generalization error
is bounded from above by a quantity that grows as the model capacity grows but
shrinks as the number of training examples increases (Vapnik and Chervonenkis,
1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide
intellectual justification that machine learning algorithms can work, but they are
rarely used in practice when working with deep learning algorithms. This is in
part because the bounds are often quite loose and in part because it can be quite
difficult to determine the capacity of deep learning algorithms. The problem of
determining the capacity of a deep learning model is especially difficult because the
effective capacity is limited by the capabilities of the optimization algorithm, and
we have little theoretical understanding of the very general non-convex optimization
problems involved in deep learning.
We must remember that while simpler functions are more likely to generalize
(to have a small gap between training and test error) we must still choose a
sufficiently complex hypothesis to achieve low training error. Typically, training
error decreases until it asymptotes to the minimum possible error value as model
capacity increases (assuming the error measure has a minimum value). Typically,
generalization error has a U-shaped curve as a function of model capacity. This is
illustrated in Fig. 5.3.
To reach the most extreme case of arbitrarily high capacity, we introduce
the concept of non-parametric models. So far, we have seen only parametric
Figure 5.3: Typical relationship between capacity and error. Training and test error
behave differently. At the left end of the graph, training error and generalization error
are both high. This is the underfitting regime. As we increase capacity, training error
decreases, but the gap between training and generalization error increases. Eventually,
the size of this gap outweighs the decrease in training error, and we enter the overfitting
regime, where capacity is too large, above the optimal capacity.
models, such as linear regression. Parametric models learn a function described
by a parameter vector whose size is finite and fixed before any data is observed.
Non-parametric models have no such limitation.
Sometimes, non-parametric models are just theoretical abstractions (such as
an algorithm that searches over all possible probability distributions) that cannot
be implemented in practice. However, we can also design practical non-parametric
models by making their complexity a function of the training set size. One example
of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set. When asked to predict a value for a test point x, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, ŷ = y_i where i = arg min_i ||X_{i,:} − x||_2^2. The algorithm can also be generalized to distance metrics other than the L^2 norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the y_i values for all X_{i,:} that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.
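A minimal sketch of this rule, including the tie-breaking by averaging (the tiny dataset is invented):

import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    # Squared L^2 distance from x to every stored training example.
    dists = np.sum((X_train - x) ** 2, axis=1)
    nearest = dists == dists.min()  # mask of all examples tied for nearest
    return y_train[nearest].mean()  # average the tied regression targets

X_train = np.array([[0.0], [1.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 3.0, 4.0])
print(nearest_neighbor_predict(X_train, y_train, np.array([1.1])))  # 2.0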
Finally, we can also create a non-parametric learning algorithm by wrapping a
parametric learning algorithm inside another algorithm that increases the number
of parameters as needed. For example, we could imagine an outer loop of learning
that changes the degree of the polynomial learned by linear regression on top of a
polynomial expansion of the input.
The ideal model is an oracle that simply knows the true probability distribution
that generates the data. Even such a model will still incur some error on many
problems, because there may still be some noise in the distribution. In the case
of supervised learning, the mapping from x to y may be inherently stochastic, or y may be a deterministic function that involves other variables besides those included in x. The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.
Training and generalization error vary as the size of the training set varies.
Expected generalization error can never increase as the number of training examples
increases. For non-parametric models, more data yields better generalization until
the best possible error is achieved. Any fixed parametric model with less than
optimal capacity will asymptote to an error value that exceeds the Bayes error. See
Fig. 5.4 for an illustration. Note that it is possible for the model to have optimal
capacity and yet still have a large gap between training and generalization error.
In this situation, we may be able to reduce this gap by gathering more training
examples.
5.2.1 The No Free Lunch Theorem
Learning theory claims that a machine learning algorithm can generalize well from
a finite training set of examples. This seems to contradict some basic principles of
logic. Inductive reasoning, or inferring general rules from a limited set of examples,
is not logically valid. To logically infer a rule describing every member of a set,
one must have information about every member of that set.
In part, machine learning avoids this problem by offering only probabilistic rules,
rather than the entirely certain rules used in purely logical reasoning. Machine
learning promises to find rules that are probably correct about most members of
the set they concern.
Unfortunately, even this does not resolve the entire problem. The no free lunch
theorem for machine learning (Wolpert, 1996) states that, averaged over all possible
data generating distributions, every classification algorithm has the same error
rate when classifying previously unobserved points. In other words, in some sense,
no machine learning algorithm is universally any better than any other. The most
sophisticated algorithm we can conceive of has the same average performance (over
all possible tasks) as merely predicting that every point belongs to the same class.
Figure 5.4: The effect of the training dataset size on the train and test error, as well as
on the optimal model capacity. We constructed a synthetic regression problem based on
adding moderate amount of noise to a degree 5 polynomial, generated a single test set,
and then generated several different sizes of training set. For each size, we generated 40
different training sets in order to plot error bars showing 95% confidence intervals. (Top)
The MSE on the train and test set for two different models: a quadratic model, and a
model with degree chosen to minimize the test error. Both are fit in closed form. For
the quadratic model, the training error increases as the size of the training set increases.
This is because larger datasets are harder to fit. Simultaneously, the test error decreases,
because fewer incorrect hypotheses are consistent with the training data. The quadratic
model does not have enough capacity to solve the task, so its test error asymptotes to
a high value. The test error at optimal capacity asymptotes to the Bayes error. The
training error can fall below the Bayes error, due to the ability of the training algorithm
to memorize specific instances of the training set. As the training size increases to infinity,
the training error of any fixed-capacity model (here, the quadratic model) must rise to at
least the Bayes error. (Bottom) As the training set size increases, the optimal capacity
(shown here as the degree of the optimal polynomial regressor) increases. The optimal
capacity plateaus after reaching sufficient complexity to solve the task.
Fortunately, these results hold only when we average over all possible data
generating distributions. If we make assumptions about the kinds of probability
distributions we encounter in real-world applications, then we can design learning
algorithms that perform well on these distributions.
This means that the goal of machine learning research is not to seek a universal
learning algorithm or the absolute best learning algorithm. Instead, our goal is to
understand what kinds of distributions are relevant to the “real world” that an AI
agent experiences, and what kinds of machine learning algorithms perform well on
data drawn from the kinds of data generating distributions we care about.
5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning
algorithms to perform well on a specific task. We do so by building a set of
preferences into the learning algorithm. When these preferences are aligned with
the learning problems we ask the algorithm to solve, it performs better.
So far, the only method of modifying a learning algorithm we have discussed is
to increase or decrease the model’s capacity by adding or removing functions from
the hypothesis space of solutions the learning algorithm is able to choose. We gave
the specific example of increasing or decreasing the degree of a polynomial for a
regression problem. The view we have described so far is oversimplified.
The behavior of our algorithm is strongly affected not just by how large we
make the set of functions allowed in its hypothesis space, but by the specific identity
of those functions. The learning algorithm we have studied so far, linear regression,
has a hypothesis space consisting of the set of linear functions of its input. These
linear functions can be very useful for problems where the relationship between
inputs and outputs truly is close to linear. They are less useful for problems
that behave in a very nonlinear fashion. For example, linear regression would
not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the number of these functions.
We can also give a learning algorithm a preference for one solution in its hypothesis space over another. This means that both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.
For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize a sum comprising both the mean squared error on the training set and a criterion J(w) that expresses a preference for the weights to have smaller squared L^2 norm. Specifically,

J(w) = MSE_train + λ w^T w,    (5.18)
where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features. As an example of how we can control a model’s tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 5.5 for the results.
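Minimizing Eq. 5.18 still admits a closed-form solution, obtained by setting the gradient to zero as in Sec. 5.1.4. This standard result (not derived in the text at this point) is sketched below on made-up data:

import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 5  # made-up problem size
X_train = rng.normal(size=(m, n))
y_train = X_train @ rng.normal(size=n) + rng.normal(scale=0.1, size=m)

lam = 0.1  # the hyperparameter λ; its value here is arbitrary
# Setting the gradient of J(w) = MSE_train + λ w^T w to zero gives
# (X^T X + λ m I) w = X^T y; the factor m appears because MSE_train
# includes a 1/m term.
w = np.linalg.solve(X_train.T @ X_train + lam * m * np.eye(n),
                    X_train.T @ y_train)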
Figure 5.5: We fit a high-degree polynomial regression model to our example training set from Fig. 5.2. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large λ, we can force the model to learn a function with no slope at all. This underfits because it can only represent a constant function. (Center) With a medium value of λ, the learning algorithm recovers a curve with the right general shape. Even though the model is capable of representing functions with much more complicated shape, weight decay has encouraged it to use a simpler function described by smaller coefficients. (Right) With weight decay approaching zero (i.e., using the Moore-Penrose pseudoinverse to solve the underdetermined problem with minimal regularization), the degree-9 polynomial overfits significantly, as we saw in Fig. 5.2.
More generally, we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(w) = w^T w. In Chapter 7, we will see that many other
regularizers are possible.
Expressing preferences for one function over another is a more general way
of controlling a model’s capacity than including or excluding members from the
hypothesis space. We can think of excluding a function from a hypothesis space as
expressing an infinitely strong preference against that function.
In our weight decay example, we expressed our preference for linear functions
defined with smaller weights explicitly, via an extra term in the criterion we
minimize. There are many other ways of expressing preferences for different
solutions, both implicitly and explicitly. Together, these different approaches are
known as regularization.
Regularization is any modification we make to
a learning algorithm that is intended to reduce its generalization error
but not its training error.
Regularization is one of the central concerns of the
field of machine learning, rivaled in its importance only by optimization.
The no free lunch theorem has made it clear that there is no best machine
learning algorithm, and, in particular, no best form of regularization. Instead
we must choose a form of regularization that is well-suited to the particular task
we want to solve. The philosophy of deep learning in general and this book in
particular is that a very wide range of tasks (such as all of the intellectual tasks
that people can do) may all be solved effectively using very general-purpose forms
of regularization.
5.3 Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control
the behavior of the learning algorithm. These settings are called hyperparameters.
The values of hyperparameters are not adapted by the learning algorithm itself
(though we can design a nested learning procedure where one learning algorithm
learns the best hyperparameters for another learning algorithm).
In the polynomial regression example we saw in Fig. 5.2, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control
model capacity. If learned on the training set, such hyperparameters would always
choose the maximum possible model capacity, resulting in overfitting (refer to
Fig. 5.3). For example, we can always fit the training set better with a higher
degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training
algorithm does not observe.
Earlier we discussed how a held-out test set, composed of examples coming from
the same distribution as the training set, can be used to estimate the generalization
error of a learner, after the learning process has completed. It is important that the
test examples are not used in any way to make choices about the model, including
its hyperparameters. For this reason, no example from the test set can be used
in the validation set. Therefore, we always construct the validation set from the
training data. Specifically, we split the training data into two disjoint subsets. One
of these subsets is used to learn the parameters. The other subset is our validation
set, used to estimate the generalization error during or after training, allowing
for the hyperparameters to be updated accordingly. The subset of data used to
learn the parameters is still typically called the training set, even though this
may be confused with the larger pool of data used for the entire training process.
The subset of data used to guide the selection of hyperparameters is called the
validation set. Typically, one uses about 80% of the training data for training and
20% for validation. Since the validation set is used to “train” the hyperparameters,
the validation set error will underestimate the generalization error, though typically
by a smaller amount than the training error. After all hyperparameter optimization
is complete, the generalization error may be estimated using the test set.
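To make the workflow concrete, here is a minimal NumPy sketch (the data, the ridge solver, and the candidate λ values are all illustrative choices of ours, not prescriptions): we fit on the 80% split, score each hyperparameter value on the 20% validation split, and keep the best.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 examples, 5 features, linear ground truth plus noise.
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# Split the training data: 80% to learn parameters, 20% for validation.
split = int(0.8 * len(X))
X_train, X_valid = X[:split], X[split:]
y_train, y_valid = y[:split], y[split:]

def fit_ridge(X, y, lam):
    # Closed-form weight decay (ridge) solution: (X^T X + lam I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Choose lambda by validation error, never by training error.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(candidates,
               key=lambda lam: mse(fit_ridge(X_train, y_train, lam),
                                   X_valid, y_valid))
print("selected lambda:", best_lam)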
In practice, when the same test set has been used repeatedly to evaluate
performance of different algorithms over many years, and especially if we consider
all the attempts from the scientific community at beating the reported state-of-
the-art performance on that test set, we end up having optimistic evaluations with
the test set as well. Benchmarks can thus become stale and then do not reflect the
true field performance of a trained system. Thankfully, the community tends to
move on to new (and usually more ambitious and larger) benchmark datasets.
5.3.1 Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be problematic
if it results in the test set being small. A small test set implies statistical uncertainty
around the estimated average test error, making it difficult to claim that algorithm
A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not
a serious issue. When the dataset is too small, there are alternative procedures,
which allow one to use all of the examples in the estimation of the mean test
error, at the price of increased computational cost. These procedures are based on
the idea of repeating the training and testing computation on different randomly
chosen subsets or splits of the original dataset. The most common of these is the
k-fold cross-validation procedure, shown in Algorithm 5.1, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set. One problem is that there exist no unbiased estimators of the variance of such average error estimators (Bengio and Grandvalet, 2004), but approximations are typically used.
5.4 Estimators, Bias and Variance
The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also of generalizing beyond it. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
5.4.1 Point Estimation
Point estimation is the attempt to provide the single “best” prediction of some
quantity of interest. In general the quantity of interest can be a single parameter
or a vector of parameters in some parametric model, such as the weights in our
linear regression example in Sec. 5.1.4, but it can also be a whole function.
In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter θ by θ̂.
Let {x^(1), . . . , x^(m)} be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

θ̂_m = g(x^(1), . . . , x^(m)). (5.19)
The definition does not require that g return a value that is close to the true θ or even that the range of g is the same as the set of allowable values of θ.
This definition of a point estimator is very general and allows the designer of an
estimator great flexibility. While almost any function thus qualifies as an estimator,
Algorithm 5.1 The k-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm A when the given dataset D is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss L on a small test set may have too high variance. The dataset D contains as elements the abstract examples z^(i) (for the i-th example), which could stand for an (input, target) pair z^(i) = (x^(i), y^(i)) in the case of supervised learning, or for just an input z^(i) = x^(i) in the case of unsupervised learning. The algorithm returns the vector of errors e for each example in D, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (Eq. 5.31). While these confidence intervals are not well-justified after the use of cross-validation, it is still common practice to use them to declare that algorithm A is better than algorithm B only if the confidence interval of the error of algorithm A lies below and does not intersect the confidence interval of algorithm B.
Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z^(i)
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example z^(i) ∈ D to a scalar ∈ R
Require: k, the number of folds
  Split D into k mutually exclusive subsets D_i, whose union is D.
  for i from 1 to k do
    f_i = A(D \ D_i)
    for z^(j) in D_i do
      e_j = L(f_i, z^(j))
    end for
  end for
  Return e
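A direct Python translation of this pseudocode is straightforward; in the sketch below, A and L stand for any learning algorithm and loss, and the mean-predictor usage at the end is a toy example of ours:

import numpy as np

def k_fold_xv(D, A, L, k, seed=0):
    # D: list of abstract examples z^(i); A: dataset -> learned function f;
    # L: (f, z) -> scalar loss. Returns the vector e of per-example errors.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(D)), k)  # k exclusive subsets D_i
    e = np.empty(len(D))
    for i in range(k):
        train = [D[j] for fold in folds[:i] + folds[i + 1:] for j in fold]
        f_i = A(train)                       # train on D \ D_i
        for j in folds[i]:                   # evaluate on the held-out fold D_i
            e[j] = L(f_i, D[j])
    return e                                 # mean(e) estimates generalization error

# Toy usage: A fits a constant (mean) predictor, L is squared error.
data = [(None, t) for t in np.random.default_rng(1).normal(size=50)]
A = lambda D: (lambda z, m=np.mean([t for _, t in D]): m)
L = lambda f, z: (f(z) - z[1]) ** 2
print(k_fold_xv(data, A, L, k=5).mean())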
a good estimator is a function whose output is close to the true underlying θ that generated the training data.
For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore θ̂ is a random variable.
Point estimation can also refer to the estimation of the relationship between
input and target variables. We refer to these types of point estimates as function
estimators.
Function Estimation As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. The linear regression example (discussed above in Sec. 5.1.4) and the polynomial regression example (discussed in Sec. 5.2) are both examples of scenarios that may be interpreted either as estimating a parameter w or estimating a function f̂ mapping from x to y.
We now review the most commonly studied properties of point estimators and
discuss what they tell us about these estimators.
5.4.2 Bias
The bias of an estimator is defined as:
bias(θ̂_m) = E(θ̂_m) − θ (5.20)

where the expectation is over the data (seen as samples from a random variable) and θ is the true underlying value of θ used to define the data generating distribution.
An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0, which implies that E(θ̂_m) = θ. An estimator θ̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E(θ̂_m) = θ.
Example: Bernoulli Distribution Consider a set of samples {x^(1), . . . , x^(m)} that are independently and identically distributed according to a Bernoulli
distribution with mean θ:

P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{(1−x^(i))}. (5.21)
A common estimator for the θ parameter of this distribution is the mean of the training samples:

θ̂_m = (1/m) Σ_{i=1}^m x^(i). (5.22)
To determine whether this estimator is biased, we can substitute Eq. 5.22 into Eq.
5.20:
bias(θ̂_m) = E[θ̂_m] − θ
 = E[(1/m) Σ_{i=1}^m x^(i)] − θ
 = (1/m) Σ_{i=1}^m E[x^(i)] − θ
 = (1/m) Σ_{i=1}^m Σ_{x^(i)=0}^{1} [x^(i) θ^{x^(i)} (1 − θ)^{(1−x^(i))}] − θ
 = (1/m) Σ_{i=1}^m (θ) − θ
 = θ − θ = 0

Since bias(θ̂) = 0, we say that our estimator θ̂ is unbiased.
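This conclusion is easy to check by simulation. In the following sketch (θ, m, and the number of trials are arbitrary choices of ours), the estimator averaged over many simulated datasets approaches θ up to Monte Carlo noise:

import numpy as np

rng = np.random.default_rng(0)
theta, m, trials = 0.3, 20, 100_000

# `trials` independent datasets of m Bernoulli(theta) samples each.
samples = rng.binomial(1, theta, size=(trials, m))
estimates = samples.mean(axis=1)        # one sample mean per dataset
print(estimates.mean() - theta)         # approximately 0: no systematic bias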
Example: Gaussian Distribution Estimator of the Mean Now, consider a set of samples {x^(1), . . . , x^(m)} that are independently and identically distributed according to a Gaussian distribution p(x^(i)) = N(x^(i); µ, σ^2), where i ∈ {1, . . . , m}.
Recall that the Gaussian probability density function is given by

p(x^(i); µ, σ^2) = (1/√(2πσ^2)) exp(−(1/2) (x^(i) − µ)^2 / σ^2). (5.23)
A common estimator of the Gaussian mean parameter is known as the sample mean:

µ̂_m = (1/m) Σ_{i=1}^m x^(i) (5.24)
To determine the bias of the sample mean, we are again interested in calculating its expectation:

bias(µ̂_m) = E[µ̂_m] − µ
 = E[(1/m) Σ_{i=1}^m x^(i)] − µ
 = (1/m) Σ_{i=1}^m E[x^(i)] − µ
 = (1/m) Σ_{i=1}^m µ − µ
 = µ − µ = 0

Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Estimators of the Variance of a Gaussian Distribution As an example, we compare two different estimators of the variance parameter σ^2 of a Gaussian distribution. We are interested in knowing if either estimator is biased. The first estimator of σ^2 we consider is known as the sample variance:

σ̂^2_m = (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)^2, (5.25)
where µ̂_m is the sample mean, defined above. More formally, we are interested in computing

bias(σ̂^2_m) = E[σ̂^2_m] − σ^2.

We begin by evaluating the term E[σ̂^2_m]:

E[σ̂^2_m] = E[(1/m) Σ_{i=1}^m (x^(i) − µ̂_m)^2] (5.26)
 = ((m − 1)/m) σ^2 (5.27)
Returning to Eq. 5.26, we conclude that the bias of σ̂^2_m is −σ^2/m. Therefore, the sample variance is a biased estimator.
The unbiased sample variance estimator

σ̃^2_m = (1/(m − 1)) Σ_{i=1}^m (x^(i) − µ̂_m)^2 (5.28)

provides an alternative approach. As the name suggests, this estimator is unbiased. That is, we find that E[σ̃^2_m] = σ^2:
E[σ̃^2_m] = E[(1/(m − 1)) Σ_{i=1}^m (x^(i) − µ̂_m)^2]
 = (m/(m − 1)) E[σ̂^2_m]
 = (m/(m − 1)) ((m − 1)/m) σ^2
 = σ^2.
We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the “best” estimators. As we will see, we often use biased estimators that possess other important properties.
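In NumPy these two estimators correspond to np.var with ddof=0 and ddof=1, and a short simulation (parameters chosen arbitrarily) makes the bias visible:

import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
biased = x.var(axis=1, ddof=0)      # divides by m,     Eq. 5.25
unbiased = x.var(axis=1, ddof=1)    # divides by m - 1, Eq. 5.28

print(biased.mean())    # approx (m-1)/m * sigma^2 = 3.6
print(unbiased.mean())  # approx sigma^2 = 4.0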
5.4.3 Variance and Standard Error
Another property of the estimator that we might want to consider is how much
we expect it to vary as a function of the data sample. Just as we computed the
expectation of the estimator to determine its bias, we can compute its variance.
The variance of an estimator is simply the variance

Var(θ̂) (5.29)

where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted SE(θ̂).
The variance or the standard error of an estimator provides a measure of how
we would expect the estimate we compute from data to vary as we independently
resample the dataset from the underlying data generating process. Just as we
might like an estimator to exhibit low bias, we would also like it to have relatively low variance.
When we compute any statistic using a finite number of samples, our estimate
of the true underlying parameter is uncertain, in the sense that we could have
obtained other samples from the same distribution and their statistics would have
been different. The expected degree of variation in any estimator is a source of
error that we want to quantify.
The standard error of the mean is given by

SE(µ̂_m) = √(Var[(1/m) Σ_{i=1}^m x^(i)]) = σ/√m, (5.30)

where σ^2 is the true variance of the samples x^(i). The standard error is often
estimated by using an estimate of σ. Unfortunately, neither the square root of
the sample variance nor the square root of the unbiased estimator of the variance
provides an unbiased estimate of the standard deviation. Both approaches tend
to underestimate the true standard deviation, but are still used in practice. The
square root of the unbiased estimator of the variance is less of an underestimate.
For large m, the approximation is quite reasonable.
The standard error of the mean is very useful in machine learning experiments.
We often estimate the generalization error by computing the sample mean of the
error on the test set. The number of examples in the test set determines the
accuracy of this estimate. Taking advantage of the central limit theorem, which
tells us that the mean will be approximately distributed with a normal distribution,
we can use the standard error to compute the probability that the true expectation
falls in any chosen interval. For example, the 95% confidence interval centered on
the mean µ̂_m is

(µ̂_m − 1.96 SE(µ̂_m), µ̂_m + 1.96 SE(µ̂_m)), (5.31)
under the normal distribution with mean µ̂_m and variance SE(µ̂_m)^2. In machine learning experiments, it is common to say that algorithm A is better than algorithm B if the upper bound of the 95% confidence interval for the error of algorithm A is less than the lower bound of the 95% confidence interval for the error of algorithm B.
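For instance, the interval of Eq. 5.31 can be computed from a vector of per-example test losses as in the sketch below (the uniform losses are placeholders, not real measurements):

import numpy as np

def confidence_interval_95(errors):
    # errors: per-example losses on the test set.
    m = len(errors)
    mu_hat = errors.mean()
    se_hat = errors.std(ddof=1) / np.sqrt(m)  # estimated standard error of the mean
    return mu_hat - 1.96 * se_hat, mu_hat + 1.96 * se_hat

errors = np.random.default_rng(0).uniform(0.0, 1.0, size=1000)
print(confidence_interval_95(errors))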
Example: Bernoulli Distribution We once again consider a set of samples {x^(1), . . . , x^(m)} drawn independently and identically from a Bernoulli distribution (recall P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{(1−x^(i))}). This time we are interested in computing
the variance of the estimator θ̂_m = (1/m) Σ_{i=1}^m x^(i).

Var(θ̂_m) = Var((1/m) Σ_{i=1}^m x^(i))
 = (1/m^2) Σ_{i=1}^m Var(x^(i))
 = (1/m^2) Σ_{i=1}^m θ(1 − θ)
 = (1/m^2) m θ(1 − θ)
 = (1/m) θ(1 − θ)
The variance of the estimator decreases as a function of m, the number of examples
in the dataset. This is a common property of popular estimators that we will
return to when we discuss consistency (see Sec. 5.4.5).
5.4.4 Trading off Bias and Variance to Minimize Mean Squared
Error
Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.
What happens when we are given a choice between two estimators, one with
more bias and one with more variance? How do we choose between them? For
example, imagine that we are interested in approximating the function shown in
Fig. 5.2 and we are only offered the choice between a model with large bias and
one that suffers from large variance. How do we choose between them?
The most common way to negotiate this trade-off is to use cross-validation.
Empirically, cross-validation is highly successful on many real-world tasks. Alter-
natively, we can also compare the mean squared error (MSE) of the estimates:
MSE = E[(θ̂_m − θ)^2]
 = Bias(θ̂_m)^2 + Var(θ̂_m) (5.32)

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter θ. As is clear from
Eq. 5.32, evaluating the MSE incorporates both the bias and the variance. Desirable
estimators are those with small MSE and these are estimators that manage to keep
both their bias and variance somewhat in check.
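The decomposition in Eq. 5.32 can also be checked numerically. The sketch below estimates the MSE, squared bias, and variance of the biased sample variance estimator (Eq. 5.25) by simulation; the parameter values are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
sigma2, m, trials = 4.0, 10, 500_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
est = x.var(axis=1, ddof=0)                 # the biased sample variance

mse = np.mean((est - sigma2) ** 2)          # E[(estimate - true)^2]
bias2 = (est.mean() - sigma2) ** 2
var = est.var()
print(mse, bias2 + var)                     # the two quantities agree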
Figure 5.6: As capacity increases (x-axis), bias (dotted) tends to decrease and variance (dashed) tends to increase, yielding another U-shaped curve for generalization error (bold
(dashed) tends to increase, yielding another U-shaped curve for generalization error (bold
curve). If we vary capacity along one axis, there is an optimal capacity, with underfitting
when the capacity is below this optimum and overfitting when it is above. This relationship
is similar to the relationship between capacity, underfitting, and overfitting, discussed in
Sec. 5.2 and Fig. 5.3.
The relationship between bias and variance is tightly linked to the machine
learning concepts of capacity, underfitting and overfitting. In the case where gen-
eralization error is measured by the MSE (where bias and variance are meaningful
components of generalization error), increasing capacity tends to increase variance
and decrease bias. This is illustrated in Fig. 5.6, where we see again the U-shaped
curve of generalization error as a function of capacity.
5.4.5 Consistency
So far we have discussed the properties of various estimators for a training set of
fixed size. Usually, we are also concerned with the behavior of an estimator as the
amount of training data grows. In particular, we usually wish that, as the number
of data points
m
in our dataset increases, our point estimates converge to the true
value of the corresponding parameters. More formally, we would like that
θ̂_m →^p θ as m → ∞. (5.33)
The symbol →^p means that the convergence is in probability, i.e. for any ε > 0, P(|θ̂_m − θ| > ε) → 0 as m → ∞. The condition described by Eq. 5.33 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of θ̂ to θ. Almost sure convergence of a sequence of random variables x^(1), x^(2), . . . to a value x occurs when p(lim_{m→∞} x^(m) = x) = 1.
Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter µ of a normal distribution N(x; µ, σ^2), with a dataset consisting of m samples: {x^(1), . . . , x^(m)}. We could use the first sample x^(1) of the dataset as an unbiased estimator: θ̂ = x^(1). In that case, E(θ̂_m) = θ, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that θ̂_m → θ as m → ∞.
5.5 Maximum Likelihood Estimation
Previously, we have seen some definitions of common estimators and analyzed
their properties. But where did these estimators come from? Rather than guessing
that some function might make a good estimator and then analyzing its bias and
variance, we would like to have some principle from which we can derive specific
functions that are good estimators for different models.
The most common such principle is the maximum likelihood principle.
Consider a set of m examples X = {x^(1), . . . , x^(m)} drawn independently from the true but unknown data generating distribution p_data(x).
Let p_model(x; θ) be a parametric family of probability distributions over the same space indexed by θ. In other words, p_model(x; θ) maps any configuration x to a real number estimating the true probability p_data(x).
The maximum likelihood estimator for θ is then defined as

θ_ML = arg max_θ p_model(X; θ) (5.34)
 = arg max_θ ∏_{i=1}^m p_model(x^(i); θ) (5.35)
This product over many probabilities can be inconvenient for a variety of reasons.
For example, it is prone to numerical underflow. To obtain a more convenient
but equivalent optimization problem, we observe that taking the logarithm of the
likelihood does not change its
arg max
but does conveniently transform a product
into a sum:
θ_ML = arg max_θ Σ_{i=1}^m log p_model(x^(i); θ). (5.36)
Because the arg max does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution p̂_data defined by the training data:

θ_ML = arg max_θ E_{x∼p̂_data} log p_model(x; θ). (5.37)
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution p̂_data defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

D_KL(p̂_data ‖ p_model) = E_{x∼p̂_data} [log p̂_data(x) − log p_model(x)]. (5.38)
The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize

−E_{x∼p̂_data} [log p_model(x)], (5.39)

which is of course the same as the maximization in Eq. 5.37.
Minimizing this KL divergence corresponds exactly to minimizing the cross-
entropy between the distributions. Many authors use the term “cross-entropy” to
identify specifically the negative log-likelihood of a Bernoulli or softmax distribution,
but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross
entropy between the empirical distribution defined by the training set and the
model. For example, mean squared error is the cross-entropy between the empirical
distribution and a Gaussian model.
We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution p̂_data. Ideally, we would like to match the true data generating distribution p_data, but we have no direct access to this distribution.
While the optimal θ is the same regardless of whether we are maximizing the
likelihood or minimizing the KL divergence, the values of the objective functions
are different. In software, we often phrase both as minimizing a cost function.
Maximum likelihood thus becomes minimization of the negative log-likelihood
(NLL), or equivalently, minimization of the cross entropy. The perspective of
maximum likelihood as minimum KL divergence becomes helpful in this case
because the KL divergence has a known minimum value of zero. The negative
log-likelihood can actually become negative when x is real-valued.
5.5.1 Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x; θ) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is

θ_ML = arg max_θ P(Y | X; θ). (5.40)
If the examples are assumed to be i.i.d., then this can be decomposed into

θ_ML = arg max_θ Σ_{i=1}^m log P(y^(i) | x^(i); θ). (5.41)
Example: Linear Regression as Maximum Likelihood Linear regression, introduced earlier in Sec. 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input x and produce an output value ŷ. The mapping from x to ŷ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily.
We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction ŷ, we now think of the model as producing a conditional distribution p(y | x). We can imagine that with an infinitely large training set, we might see several training examples with the same input value x but different values of y. The goal of the learning algorithm is now to fit the distribution p(y | x) to all of those different y values that are all compatible with x. To derive the same linear regression algorithm we obtained before, we define p(y | x) = N(y; ŷ(x; w), σ^2). The function ŷ(x; w) gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant σ^2 chosen by the user. We will see that this choice of the functional form of p(y | x) causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed
to be i.i.d., the conditional log-likelihood (Eq. 5.41) is given by

Σ_{i=1}^m log p(y^(i) | x^(i); θ) (5.42)
 = −m log σ − (m/2) log(2π) − Σ_{i=1}^m ‖ŷ^(i) − y^(i)‖^2 / (2σ^2), (5.43)
where ŷ^(i) is the output of the linear regression on the i-th input x^(i) and m is the number of training examples. Comparing the log-likelihood with the mean squared error,

MSE_train = (1/m) Σ_{i=1}^m ‖ŷ^(i) − y^(i)‖^2, (5.44)
we immediately see that maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error.
The two criteria have different values but the same location of the optimum. This
justifies the use of the MSE as a maximum likelihood estimation procedure. As we
will see, the maximum likelihood estimator has several desirable properties.
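A short sketch makes the equivalence concrete: the least-squares solution and the weights found by gradient descent on the Gaussian NLL coincide. The synthetic data, step size, and iteration count below are arbitrary choices of ours, and σ is fixed to 1:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Weights minimizing the MSE: the ordinary least-squares solution.
w_mse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weights minimizing the NLL of Eq. 5.43 with sigma = 1; its gradient
# with respect to w is X^T (X w - y).
w_ml = np.zeros(3)
for _ in range(5000):
    w_ml -= 0.001 * (X.T @ (X @ w_ml - y))

print(np.allclose(w_mse, w_ml, atol=1e-6))  # True: the optima coincide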
5.5.2 Properties of Maximum Likelihood
The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples m → ∞, in terms of its rate of convergence as m increases.
Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see Sec. 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:
• The true distribution p_data must lie within the model family p_model(·; θ). Otherwise, no estimator can recover p_data.

• The true distribution p_data must correspond to exactly one value of θ. Otherwise, maximum likelihood can recover the correct p_data, but will not be able to determine which value of θ was used by the data generating process.
There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. However, consistent estimators can differ in their statistical efficiency, meaning that one consistent
estimator may obtain lower generalization error for a fixed number of samples m, or equivalently, may require fewer examples to obtain a fixed level of generalization error.
Statistical efficiency is typically studied in the parametric case (like in linear
regression) where our goal is to estimate the value of a parameter (and assuming
it is possible to identify the true parameter), not the value of a function. A way to
measure how close we are to the true parameter is by the expected mean squared
error, computing the squared difference between the estimated and true parameter
values, where the expectation is over m training samples from the data generating distribution. That parametric mean squared error decreases as m increases, and for m large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.
For these reasons (consistency and efficiency), maximum likelihood is often
considered the preferred estimator to use for machine learning. When the number
of examples is small enough to yield overfitting behavior, regularization strategies
such as weight decay may be used to obtain a biased version of maximum likelihood
that has less variance when training data is limited.
5.6 Bayesian Statistics
So far we have discussed frequentist statistics and approaches based on estimating a single value of θ, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in Sec. 5.4.1, the frequentist perspective is that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter θ is unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of θ using the prior probability distribution, p(θ) (sometimes referred to as simply “the prior”). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the value of θ before observing any data. For example, one might assume a priori that θ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for “simpler” solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).
Now consider that we have a set of data samples {x^(1), . . . , x^(m)}. We can recover the effect of data on our belief about θ by combining the data likelihood p(x^(1), . . . , x^(m) | θ) with the prior via Bayes’ rule:

p(θ | x^(1), . . . , x^(m)) = p(x^(1), . . . , x^(m) | θ) p(θ) / p(x^(1), . . . , x^(m)) (5.45)
In the scenarios where Bayesian estimation is typically used, the prior begins as a
relatively uniform or Gaussian distribution with high entropy, and the observation
of the data usually causes the posterior to lose entropy and concentrate around a
few highly likely values of the parameters.
Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of θ, the Bayesian approach is to make predictions using a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^(m+1), is given by

p(x^(m+1) | x^(1), . . . , x^(m)) = ∫ p(x^(m+1) | θ) p(θ | x^(1), . . . , x^(m)) dθ. (5.46)
Here each value of θ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed {x^(1), . . . , x^(m)}, if we are still quite uncertain about the value of θ, then this uncertainty is incorporated directly into any predictions we might make.
In Sec. 5.4, we discussed how the frequentist approach addresses the uncertainty in a given point estimate of θ by evaluating its variance. The variance of the
estimator is an assessment of how the estimate might change with alternative
samplings of the observed data. The Bayesian answer to the question of how to deal
with the uncertainty in the estimator is to simply integrate over it, which tends to
protect well against overfitting. This integral is of course just an application of
the laws of probability, making the Bayesian approach simple to justify, while the
frequentist machinery for constructing an estimator is based on the rather ad hoc
decision to summarize all knowledge contained in the dataset with a single point
estimate.
The second important difference between the Bayesian approach to estimation
and the maximum likelihood approach is due to the contribution of the Bayesian
prior distribution. The prior has an influence by shifting probability mass density
towards regions of the parameter space that are preferred a priori. In practice,
the prior often expresses a preference for models that are simpler or more smooth.
Critics of the Bayesian approach identify the prior as a source of subjective human
judgment impacting the predictions.
Bayesian methods typically generalize much better when limited training data
is available, but typically suffer from high computational cost when the number of
training examples is large.
Example: Bayesian Linear Regression Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector x ∈ R^n to predict the value of a scalar y ∈ R. The prediction is parametrized by the vector w ∈ R^n:

ŷ = w^⊤ x. (5.47)

Given a set of m training samples (X^(train), y^(train)), we can express the prediction of y over the entire training set as:

ŷ^(train) = X^(train) w. (5.48)
Expressed as a Gaussian conditional distribution on y^(train), we have

p(y^(train) | X^(train), w) = N(y^(train); X^(train) w, I)
 ∝ exp(−(1/2) (y^(train) − X^(train) w)^⊤ (y^(train) − X^(train) w)),

where we follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^(train), y^(train)) as simply (X, y).
To determine the posterior distribution over the model parameter vector w, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about θ. For real-valued parameters it is common to use a Gaussian as a prior distribution:

p(w) = N(w; µ_0, Λ_0) ∝ exp(−(1/2) (w − µ_0)^⊤ Λ_0^{−1} (w − µ_0)),
where µ_0 and Λ_0 are the prior distribution mean vector and covariance matrix respectively.^1

^1 Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix Λ_0 = diag(λ_0).
With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters:

p(w | X, y) ∝ p(y | X, w) p(w)
 ∝ exp(−(1/2) (y − Xw)^⊤ (y − Xw)) exp(−(1/2) (w − µ_0)^⊤ Λ_0^{−1} (w − µ_0))
 ∝ exp(−(1/2) (−2 y^⊤ X w + w^⊤ X^⊤ X w + w^⊤ Λ_0^{−1} w − 2 µ_0^⊤ Λ_0^{−1} w)).
We now define Λ_m = (X^⊤ X + Λ_0^{−1})^{−1} and µ_m = Λ_m (X^⊤ y + Λ_0^{−1} µ_0). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:
p(w | X, y) ∝ exp(−(1/2) (w − µ_m)^⊤ Λ_m^{−1} (w − µ_m) + (1/2) µ_m^⊤ Λ_m^{−1} µ_m) (5.49)
 ∝ exp(−(1/2) (w − µ_m)^⊤ Λ_m^{−1} (w − µ_m)). (5.50)
All terms that do not include the parameter vector w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1.
Eq. 3.23 shows how to normalize a multivariate Gaussian distribution.
Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set µ_0 to 0. If we set Λ_0 = (1/α) I, then µ_m gives the same estimate of w as does frequentist linear regression with a weight decay penalty of α w^⊤ w. One difference is that the Bayesian estimate is undefined if α is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate µ_m.
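In code, the posterior update is a few lines of linear algebra. The sketch below assumes µ_0 = 0 and Λ_0 = (1/α)I on synthetic data, in which case µ_m coincides with the weight decay estimate just described:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=50)

alpha = 1.0
Lambda0_inv = alpha * np.eye(3)           # prior precision: Lambda_0 = (1/alpha) I
mu0 = np.zeros(3)

# Posterior parameters: Lambda_m = (X^T X + Lambda_0^{-1})^{-1},
# mu_m = Lambda_m (X^T y + Lambda_0^{-1} mu_0).
Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)

# mu_m matches frequentist linear regression with weight decay penalty alpha.
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)
print(np.allclose(mu_m, w_ridge))         # True
print(Lambda_m)                           # posterior covariance over w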
5.6.1 Maximum A Posteriori (MAP) Estimation
While the most principled approach is to make predictions using the full Bayesian
posterior distribution over the parameter
θ
, it is still often desirable to have a
single point estimate. One common reason for desiring a point estimate is that
most operations involving the Bayesian posterior for most interesting models are
intractable, and a point estimate offers a tractable approximation. Rather than
simply returning to the maximum likelihood estimate, we can still gain some of
the benefit of the Bayesian approach by allowing the prior to influence the choice
of the point estimate. One rational way to do this is to choose the maximum a
posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal
posterior probability (or maximal probability density in the more common case of
continuous θ):
θ_MAP = arg max_θ p(θ | x) = arg max_θ [log p(x | θ) + log p(θ)]. (5.51)
We recognize, above on the right hand side, log p(x | θ), i.e. the standard log-likelihood term, and log p(θ), corresponding to the prior distribution.
As an example, consider a linear regression model with a Gaussian prior on the weights w. If this prior is given by N(w; 0, (1/λ) I), then the log-prior term in Eq. 5.51 is proportional to the familiar λ w^⊤ w weight decay penalty, plus a term that does not depend on w and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.
As with full Bayesian inference, MAP Bayesian inference has the advantage of
leveraging information that is brought by the prior and cannot be found in the
training data. This additional information helps to reduce the variance in the
MAP point estimate (in comparison to the ML estimate). However, it does so at
the price of increased bias.
Many regularized estimation strategies, such as maximum likelihood learning
regularized with weight decay, can be interpreted as making the MAP approxima-
tion to Bayesian inference. This view applies when the regularization consists of
adding an extra term to the objective function that corresponds to log p(θ). Not
all regularization penalties correspond to MAP Bayesian inference. For example,
some regularizer terms may not be the logarithm of a probability distribution.
Other regularization terms depend on the data, which of course a prior probability
distribution is not allowed to do.
MAP Bayesian inference provides a straightforward way to design complicated
yet interpretable regularization terms. For example, a more complicated penalty
term can be derived by using a mixture of Gaussians, rather than a single Gaussian
distribution, as the prior (Nowlan and Hinton, 1992).
5.7 Supervised Learning Algorithms
Recall from Sec. 5.1.3 that supervised learning algorithms are, roughly speaking,
learning algorithms that learn to associate some input with some output, given a
training set of examples of inputs
x
and outputs
y
. In many cases the outputs
y
may be difficult to collect automatically and must be provided by a human
“supervisor,” but the term still applies even when the training set targets were
collected automatically.
5.7.1 Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a
probability distribution p(y | x). We can do this simply by using maximum
likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).
We have already seen that linear regression corresponds to the family
p(y | x; θ) = N(y; θ^⊤ x, I). (5.52)
We can generalize linear regression to the classification scenario by defining a
different family of probability distributions. If we have two classes, class 0 and
class 1, then we need only specify the probability of one of these classes. The
probability of class 1 determines the probability of class 0, because these two values
must add up to 1.
The normal distribution over real-valued numbers that we used for linear
regression is parametrized in terms of a mean. Any value we supply for this mean
is valid. A distribution over a binary variable is slightly more complicated, because
its mean must always be between 0 and 1. One way to solve this problem is to use
the logistic sigmoid function to squash the output of the linear function into the
interval (0, 1) and interpret that value as a probability:
p(y = 1 | x; θ) = σ(θ^⊤ x). (5.53)
This approach is known as logistic regression (a somewhat strange name since we
use the model for classification rather than regression).
In the case of linear regression, we were able to find the optimal weights by
solving the normal equations. Logistic regression is somewhat more difficult. There
is no closed-form solution for its optimal weights. Instead, we must search for
them by maximizing the log-likelihood. We can do this by minimizing the negative
log-likelihood (NLL) using gradient descent.
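A minimal sketch of this procedure (full-batch gradient descent with an arbitrary fixed step size, on synthetic linearly separable data):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=1000):
    # Minimize the NLL: -sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)].
    # Its gradient with respect to theta is X^T (p - y).
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ theta)
        theta -= lr * X.T @ (p - y) / len(y)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
theta = fit_logistic(X, y)
print(((sigmoid(X @ theta) > 0.5) == y).mean())  # training accuracy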
This same strategy can be applied to essentially any supervised learning problem,
by writing down a parametric family of conditional probability distributions over
the right kind of input and output variables.
5.7.2 Support Vector Machines
One of the most influential approaches to supervised learning is the support vector
machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to
logistic regression in that it is driven by a linear function
w
x
+
b
. Unlike logistic
regression, the support vector machine does not provide probabilities, but only
outputs a class identity. The SVM predicts that the positive class is present when
w
x
+
b
is positive. Likewise, it predicts that the negative class is present when
w
x + b is negative.
One key innovation associated with support vector machines is the kernel trick.
The kernel trick consists of observing that many machine learning algorithms can
be written exclusively in terms of dot products between examples. For example, it
can be shown that the linear function used by the support vector machine can be
re-written as

w^⊤ x + b = b + Σ_{i=1}^m α_i x^⊤ x^(i) (5.54)
where x^(i) is a training example and α is a vector of coefficients. Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^(i)) = φ(x) · φ(x^(i)) called a kernel. The · operator represents an inner product analogous to φ(x)^⊤ φ(x^(i)).
For some feature spaces, we may not use literally the vector inner product. In
some infinite dimensional spaces, we need to use other kinds of inner products, for
example, inner products based on integration rather than summation. A complete
development of these kinds of inner products is beyond the scope of this book.
After replacing dot products with kernel evaluations, we can make predictions
using the function

f(x) = b + Σ_i α_i k(x, x^(i)). (5.55)
This function is nonlinear with respect to x, but the relationship between φ(x) and f(x) is linear. Also, the relationship between α and f(x) is linear. The kernel-based function is exactly equivalent to preprocessing the data by applying φ(x) to all inputs, then learning a linear model in the new transformed space.
The kernel trick is powerful for two reasons. First, it allows us to learn models that are nonlinear as a function of x using convex optimization techniques that are guaranteed to converge efficiently. This is possible because we consider φ fixed and optimize only α, i.e., the optimization algorithm can view the decision function as being linear in a different space. Second, the kernel function k often admits an implementation that is significantly more computationally efficient than naively constructing two φ(x) vectors and explicitly taking their dot product.
In some cases, φ(x) can even be infinite dimensional, which would result in an infinite computational cost for the naive, explicit approach. In many cases, k(x, x′) is a nonlinear, tractable function of x even when φ(x) is intractable. As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping φ(x) over the non-negative integers x. Suppose that this mapping returns a vector containing x ones followed by infinitely many zeros. We can write a kernel function k(x, x^(i)) = min(x, x^(i)) that is exactly equivalent to the corresponding infinite-dimensional dot product.
The most commonly used kernel is the Gaussian kernel

k(u, v) = N(u − v; 0, σ^2 I) (5.56)

where N(x; µ, Σ) is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in v space radiating outward from u. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.
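The prediction function of Eq. 5.55 with the Gaussian kernel of Eq. 5.56 is short to write down. In the sketch below, the coefficients α come from regularized kernel least squares rather than from the SVM training problem, which keeps the example self-contained:

import numpy as np

def gaussian_kernel(U, V, sigma=1.0):
    # k(u, v) proportional to exp(-||u - v||^2 / (2 sigma^2)).
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 2))
y_train = np.sign(X_train[:, 0])            # toy labels in {-1, +1}

K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(K)), y_train)

def f(x):                                    # Eq. 5.55 with b = 0
    return gaussian_kernel(x[None, :], X_train)[0] @ alpha

print(np.sign(f(np.array([1.0, 0.0]))))     # predicts the positive class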
We can think of the Gaussian kernel as performing a kind of template matching. A training example x associated with training label y becomes a template for class y. When a test point x′ is near x according to Euclidean distance, the Gaussian kernel has a large response, indicating that x′ is very similar to the x template. The model then puts a large weight on the associated training label y. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
Support vector machines are not the only algorithm that can be enhanced
using the kernel trick. Many other linear models can be enhanced in this way. The
category of algorithms that employ the kernel trick is known as kernel machines
or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).
A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term α_i k(x, x^(i)) to the decision function. Support vector machines are able to mitigate this by learning an α vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have non-zero α_i. These training examples are known
as support vectors.
Kernel machines also suffer from a high computational cost of training when
the dataset is large. We will revisit this idea in Sec. 5.9. Kernel machines with
generic kernels struggle to generalize well. We will explain why in Sec. 5.11. The
modern incarnation of deep learning was designed to overcome these limitations of
kernel machines. The current deep learning renaissance began when Hinton et al.
(2006) demonstrated that a neural network could outperform the RBF kernel SVM
on the MNIST benchmark.
5.7.3 Other Simple Supervised Learning Algorithms
We have already briefly encountered another non-probabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a non-parametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k-nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set. This works for essentially any kind of supervised learning where we can define an average over y values. In the case of classification, we can average over one-hot code vectors c with c_y = 1 and c_i = 0 for all other values of i. We can then interpret the average over these one-hot codes as giving a probability distribution over classes. As a non-parametric learning algorithm, k-nearest neighbor can achieve very high capacity. For example, suppose we have a multiclass classification task and measure performance with 0-1 loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the number of training examples approaches infinity. The error in excess of the Bayes error results from choosing a single neighbor by breaking ties between equally distant neighbors randomly. When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all of these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate. The high capacity of k-nearest neighbors allows it to obtain high accuracy given a large training set. However, it does so at high computational cost, and it may generalize very badly given a small, finite training set. One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x ∈ R^100 drawn from an isotropic Gaussian
distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, i.e. that y = x_1 in all cases. Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_100, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
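This failure mode is easy to reproduce; in the sketch below (1-nearest-neighbor regression, with dataset sizes chosen arbitrarily), y = x_1 exactly, yet the 99 irrelevant features dominate the Euclidean distance:

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 100))
y_train = X_train[:, 0]                     # the output is exactly x_1

def one_nn_predict(x):
    # Return the y value of the Euclidean-nearest training example.
    i = np.argmin(((X_train - x) ** 2).sum(axis=1))
    return y_train[i]

X_test = rng.normal(size=(200, 100))
preds = np.array([one_nn_predict(x) for x in X_test])
print(np.mean((preds - X_test[:, 0]) ** 2))  # near 2, the error of a random guess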
Figure 5.7: Diagrams describing how a decision tree works. (Top) Each node of the tree
chooses to send the input example to the child node on the left (0) or the child node on
the right (1). Internal nodes are drawn as circles and leaf nodes as squares. Each node is
displayed with a binary string identifier corresponding to its position in the tree, obtained
by appending a bit to its parent identifier (0=choose left or top, 1=choose right or bottom).
(Bottom) The tree divides space into regions. The 2D plane shows how a decision tree
might divide R^2
. The nodes of the tree are plotted in this plane, with each internal node
drawn along the dividing line it uses to categorize examples, and leaf nodes drawn in the
center of the region of examples they receive. The result is a piecewise-constant function,
with one piece per leaf. Each leaf requires at least one training example to define, so it is
not possible for the decision tree to learn a function that has more local maxima than the
number of training examples.
Another type of learning algorithm that also breaks the input space into regions
and has separate parameters for each region is the decision tree (Breiman et al.,
1984) and its many variants. As shown in Fig. 5.7, each node of the decision tree
is associated with a region in the input space, and internal nodes break that region
into one sub-region for each child of the node (typically using an axis-aligned
cut). Space is thus sub-divided into non-overlapping regions, with a one-to-one
correspondence between leaf nodes and input regions. Each leaf node usually maps
every point in its input region to the same output. Decision trees are usually
trained with specialized algorithms that are beyond the scope of this book. The
learning algorithm can be considered non-parametric if it is allowed to learn a tree
of arbitrary size, though decision trees are usually regularized with size constraints
that turn them into parametric models in practice. Decision trees as they are
typically used, with axis-aligned splits and constant outputs within each node,
struggle to solve some problems that are easy even for logistic regression. For
example, if we have a two-class problem and the positive class occurs wherever
x_2 > x_1, the decision boundary is not axis-aligned. The decision tree will thus
need to approximate the decision boundary with many nodes, implementing a step
function that constantly walks back and forth across the true decision function
with axis-aligned steps.
As we have seen, nearest neighbor predictors and decision trees have many
limitations. Nonetheless, they are useful learning algorithms when computational
resources are constrained. We can also build intuition for more sophisticated
learning algorithms by thinking about the similarities and differences between
sophisticated algorithms and k-NN or decision tree baselines.
See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine
learning textbooks for more material on traditional supervised learning algorithms.
5.8 Unsupervised Learning Algorithms
Recall from Sec. 5.1.3 that unsupervised algorithms are those that experience only
“features” but not a supervision signal. The distinction between supervised and
unsupervised algorithms is not formally and rigidly defined because there is no
objective test for distinguishing whether a value is a feature or a target provided by
a supervisor. Informally, unsupervised learning refers to most attempts to extract
information from a distribution that do not require human labor to annotate
examples. The term is usually associated with density estimation, learning to
draw samples from a distribution, learning to denoise data from some distribution,
finding a manifold that the data lies near, or clustering the data into groups of
related examples.
A classic unsupervised learning task is to find the “best” representation of the
data. By “best” we can mean different things, but generally speaking we are looking
for a representation that preserves as much information about
x
as possible while
obeying some penalty or constraint aimed at keeping the representation simpler or
more accessible than x itself.
There are multiple ways of defining a simpler representation. Three of the
most common include lower dimensional representations, sparse representations
and independent representations. Low-dimensional representations attempt to
compress as much information about
x
as possible in a smaller representation.
Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and
Ghahramani, 1997) embed the dataset into a representation whose entries are
mostly zeroes for most inputs. The use of sparse representations typically requires
increasing the dimensionality of the representation, so that the representation
becoming mostly zeroes does not discard too much information. This results in an
overall structure of the representation that tends to distribute data along the axes
of the representation space. Independent representations attempt to disentangle
the sources of variation underlying the data distribution such that the dimensions
of the representation are statistically independent.
Of course these three criteria are certainly not mutually exclusive. Low-
dimensional representations often yield elements that have fewer or weaker de-
pendencies than the original high-dimensional data. This is because one way to
reduce the size of a representation is to find and remove redundancies. Identifying
and removing more redundancy allows the dimensionality reduction algorithm to
achieve more compression while discarding less information.
The notion of representation is one of the central themes of deep learning and
therefore one of the central themes in this book. In this section, we develop some
simple examples of representation learning algorithms. Together, these example
algorithms show how to operationalize all three of the criteria above. Most of the
remaining chapters introduce additional representation learning algorithms that
develop these criteria in different ways or introduce other criteria.
5.8.1 Principal Components Analysis
In Sec. 2.12, we saw that the principal components analysis algorithm provides a
means of compressing data. We can also view PCA as an unsupervised learning
algorithm that learns a representation of data. This representation is based on
two of the criteria for a simple representation described above. PCA learns a
Figure 5.8: PCA learns a linear projection that aligns the direction of greatest variance
with the axes of the new space. (Left) The original data consists of samples of
x
. In this
space, the variance might occur along directions that are not axis-aligned. (Right) The
transformed data
z
=
x
W
now varies most along the axis
z
1
. The direction of second
most variance is now along z
2
.
representation that has lower dimensionality than the original input. It also learns
a representation whose elements have no linear correlation with each other. This
is a first step toward the criterion of learning representations whose elements are
statistically independent. To achieve full independence, a representation learning
algorithm must also remove the nonlinear relationships between variables.
PCA learns an orthogonal, linear transformation of the data that projects an
input x to a representation z, as shown in Fig. 5.8. In Sec. 2.12, we saw that we
could learn a one-dimensional representation that best reconstructs the original
data (in the sense of mean squared error) and that this representation actually
corresponds to the first principal component of the data. Thus we can use PCA
as a simple and effective dimensionality reduction method that preserves as much
of the information in the data as possible (again, as measured by least-squares
reconstruction error). In the following, we will study how the PCA representation
decorrelates the original data representation X.
Let us consider the m × n design matrix X. We will assume that the data has a
mean of zero, E[x] = 0. If this is not the case, the data can easily be centered by
subtracting the mean from all examples in a preprocessing step.

The unbiased sample covariance matrix associated with X is given by

\mathrm{Var}[x] = \frac{1}{m-1} X^\top X.   (5.57)
PCA finds a representation (through linear transformation) z = x^\top W where
Var[z] is diagonal.
In Sec. 2.12, we saw that the principal components of a design matrix X are
given by the eigenvectors of X^\top X. From this view,

X^\top X = W \Lambda W^\top.   (5.58)
In this section, we exploit an alternative derivation of the principal components. The
principal components may also be obtained via the singular value decomposition.
Specifically, they are the right singular vectors of X. To see this, let W be the
right singular vectors in the decomposition X = U \Sigma W^\top. We then recover
the original eigenvector equation with W as the eigenvector basis:

X^\top X = \left( U \Sigma W^\top \right)^\top U \Sigma W^\top = W \Sigma^2 W^\top.   (5.59)
The SVD is helpful to show that PCA results in a diagonal Var[z]. Using the
SVD of X, we can express the variance of X as:

\mathrm{Var}[x] = \frac{1}{m-1} X^\top X
               = \frac{1}{m-1} \left( U \Sigma W^\top \right)^\top U \Sigma W^\top
               = \frac{1}{m-1} W \Sigma^\top U^\top U \Sigma W^\top
               = \frac{1}{m-1} W \Sigma^2 W^\top,   (5.60)
where we use the fact that U^\top U = I because the U matrix of the singular
value decomposition is defined to be orthogonal. This shows that if we take
z = x^\top W, we can ensure that the covariance of z is diagonal as required:
\mathrm{Var}[z] = \frac{1}{m-1} Z^\top Z
               = \frac{1}{m-1} W^\top X^\top X W
               = \frac{1}{m-1} W^\top W \Sigma^2 W^\top W
               = \frac{1}{m-1} \Sigma^2,   (5.61)
where this time we use the fact that W^\top W = I, again from the definition of
the SVD.
The above analysis shows that when we project the data x to z, via the linear
transformation W, the resulting representation has a diagonal covariance matrix
(as given by \Sigma^2), which immediately implies that the individual elements of
z are mutually uncorrelated.
This ability of PCA to transform data into a representation where the elements
are mutually uncorrelated is one of its most important properties. It is a simple
example of a representation that attempts to disentangle the unknown factors
of variation underlying the data. In the case of PCA, this disentangling takes
the form of finding a rotation of the input space (described by W) that aligns the
principal axes of variance with the basis of the new representation space associated
with z.
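
The derivation above can be checked numerically. The following minimal sketch,
written here in Python with NumPy (the data and variable names are our own
illustration, not part of the original presentation), obtains W from the SVD of
centered data and verifies that the covariance of z = x^\top W is diagonal:

import numpy as np

rng = np.random.default_rng(0)

# Correlated, zero-mean data: 1000 examples in 2 dimensions.
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])
X = X - X.mean(axis=0)

# The right singular vectors of X are the principal components.
U, S, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

# Each row of Z is the representation z = x^T W.
Z = X @ W

# Var[z] = Sigma^2 / (m - 1): the off-diagonal entries vanish
# (up to floating point error), as Eq. 5.61 predicts.
m = len(X)
print(np.round(Z.T @ Z / (m - 1), 6))
print(np.round(S ** 2 / (m - 1), 6))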
While correlation is an important category of dependency between elements of
the data, we are also interested in learning representations that disentangle more
complicated forms of feature dependencies. For this, we will need more than what
can be done with a simple linear transformation.
5.8.2 k-means Clustering
Another example of a simple representation learning algorithm is k-means
clustering. The k-means clustering algorithm divides the training set into k
different clusters of examples that are near each other. We can thus think of the
algorithm as providing a k-dimensional one-hot code vector h representing an
input x. If x belongs to cluster i, then h_i = 1 and all other entries of the
representation h are zero.
The one-hot code provided by k-means clustering is an example of a sparse
representation, because the majority of its entries are zero for every input. Later,
we will develop other algorithms that learn more flexible sparse representations,
where more than one entry can be non-zero for each input x. One-hot codes
are an extreme example of sparse representations that lose many of the benefits
of a distributed representation. The one-hot code still confers some statistical
advantages (it naturally conveys the idea that all examples in the same cluster are
similar to each other) and it confers the computational advantage that the entire
representation may be captured by a single integer.
The k-means algorithm works by initializing k different centroids
{\mu^{(1)}, . . . , \mu^{(k)}} to different values, then alternating between two
different steps until convergence. In one step, each training example is assigned
to cluster i, where i is the index of the nearest centroid \mu^{(i)}. In the other
step, each centroid \mu^{(i)} is updated to the mean of all training examples
x^{(j)} assigned to cluster i.
One difficulty pertaining to clustering is that the clustering problem is inherently
ill-posed, in the sense that there is no single criterion that measures how well a
clustering of the data corresponds to the real world. We can measure properties of
the clustering such as the average Euclidean distance from a cluster centroid to the
members of the cluster. This allows us to tell how well we are able to reconstruct
the training data from the cluster assignments. We do not know how well the
cluster assignments correspond to properties of the real world. Moreover, there
may be many different clusterings that all correspond well to some property of
the real world. We may hope to find a clustering that relates to one feature but
obtain a different, equally valid clustering that is not relevant to our task. For
example, suppose that we run two clustering algorithms on a dataset consisting of
images of red trucks, images of red cars, images of gray trucks, and images of gray
cars. If we ask each clustering algorithm to find two clusters, one algorithm may
find a cluster of cars and a cluster of trucks, while another may find a cluster of
red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering
algorithm, which is allowed to determine the number of clusters. This may assign
the examples to four clusters, red cars, red trucks, gray cars, and gray trucks. This
new clustering now at least captures information about both attributes, but it has
lost information about similarity. Red cars are in a different cluster from gray
cars, just as they are in a different cluster from gray trucks. The output of the
clustering algorithm does not tell us that red cars are more similar to gray cars
than they are to gray trucks. They are different from both things, and that is all
we know.
These issues illustrate some of the reasons that we may prefer a distributed
representation to a one-hot representation. A distributed representation could have
two attributes for each vehicle—one representing its color and one representing
whether it is a car or a truck. It is still not entirely clear what the optimal
distributed representation is (how can the learning algorithm know whether the
two attributes we are interested in are color and car-versus-truck rather than
manufacturer and age?) but having many attributes reduces the burden on the
algorithm to guess which single attribute we care about, and allows us to measure
similarity between objects in a fine-grained way by comparing many attributes
instead of just testing whether one attribute matches.
5.9 Stochastic Gradient Descent
Nearly all of deep learning is powered by one very important algorithm: stochastic
gradient descent or SGD. Stochastic gradient descent is an extension of the gradient
descent algorithm introduced in Sec. 4.3.
A recurring problem in machine learning is that large training sets are necessary
for good generalization, but large training sets are also more computationally
expensive.
The cost function used by a machine learning algorithm often decomposes as a
sum over training examples of some per-example loss function. For example, the
negative conditional log-likelihood of the training data can be written as

J(\theta) = \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} L(x, y, \theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta),   (5.62)

where L is the per-example loss L(x, y, \theta) = -\log p(y \mid x; \theta).
For these additive cost functions, gradient descent requires computing

\nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L(x^{(i)}, y^{(i)}, \theta).   (5.63)

The computational cost of this operation is O(m). As the training set size grows to
billions of examples, the time to take a single gradient step becomes prohibitively
long.
The insight of stochastic gradient descent is that the gradient is an expectation.
The expectation may be approximately estimated using a small set of samples.
Specifically, on each step of the algorithm, we can sample a minibatch of examples
B = {x^{(1)}, . . . , x^{(m')}} drawn uniformly from the training set. The minibatch
size m' is typically chosen to be a relatively small number of examples, ranging from
1 to a few hundred. Crucially, m' is usually held fixed as the training set size m
grows. We may fit a training set with billions of examples using updates computed
on only a hundred examples.
The estimate of the gradient is formed as

g = \frac{1}{m'} \nabla_{\theta} \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta),   (5.64)

using examples from the minibatch B. The stochastic gradient descent algorithm
then follows the estimated gradient downhill:

\theta \leftarrow \theta - \epsilon g,   (5.65)

where \epsilon is the learning rate.
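
Eq. 5.64 and Eq. 5.65 translate directly into code. Below is a NumPy sketch of
ours for a linear model trained with squared error (the Gaussian negative
log-likelihood up to constants); the data, minibatch size, and learning rate are
stand-ins chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100000, 5))                 # a large training set
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100000)

theta, eps, m_prime = np.zeros(5), 0.01, 100
for step in range(2000):
    # Sample a minibatch B of m' examples uniformly from the training set.
    idx = rng.choice(len(X), size=m_prime)
    # Minibatch estimate g of the gradient of the mean squared error.
    residual = X[idx] @ theta - y[idx]
    g = (2.0 / m_prime) * X[idx].T @ residual
    # Follow the estimated gradient downhill: theta <- theta - eps * g.
    theta = theta - eps * g

Note that the cost of each update is O(m') and does not depend on the full
training set size m.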
Gradient descent in general has often been regarded as slow or unreliable. In
the past, the application of gradient descent to non-convex optimization problems
was regarded as foolhardy or unprincipled. Today, we know that the machine
learning models described in Part II work very well when trained with gradient
descent. The optimization algorithm may not be guaranteed to arrive at even a
local minimum in a reasonable amount of time, but it often finds a very low value
of the cost function quickly enough to be useful.
Stochastic gradient descent has many important uses outside the context of
deep learning. It is the main way to train large linear models on very large
datasets. For a fixed model size, the cost per SGD update does not depend on the
training set size m. In practice, we often use a larger model as the training set size
increases, but we are not forced to do so. The number of updates required to reach
convergence usually increases with training set size. However, as m approaches
infinity, the model will eventually converge to its best possible test error before
SGD has sampled every example in the training set. Increasing m further will not
extend the amount of training time needed to reach the model’s best possible test
error. From this point of view, one can argue that the asymptotic cost of training
a model with SGD is O(1) as a function of m.
Prior to the advent of deep learning, the main way to learn nonlinear models
was to use the kernel trick in combination with a linear model. Many kernel learning
algorithms require constructing an m × m matrix G_{i,j} = k(x^{(i)}, x^{(j)}).
Constructing this matrix has computational cost O(m^2), which is clearly
undesirable for datasets with billions of examples. In academia, starting in 2006,
deep learning was initially interesting because it was able to generalize to new
examples better than competing algorithms when trained on medium-sized datasets
with tens of thousands of examples. Soon after, deep learning garnered additional
interest in industry, because it provided a scalable way of training nonlinear models
on large datasets.
Stochastic gradient descent and many enhancements to it are described further
in Chapter 8.
5.10 Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of
a fairly simple recipe: combine a specification of a dataset, a cost function, an
optimization procedure and a model.
For example, the linear regression algorithm combines a dataset consisting of
X and y, the cost function

J(w, b) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x),   (5.66)

the model specification p_{\text{model}}(y \mid x) = \mathcal{N}(y; x^\top w + b, 1),
and, in most cases, the optimization algorithm defined by solving for where the
gradient of the cost is zero using the normal equations.
By realizing that we can replace any of these components mostly independently
from the others, we can obtain a very wide variety of algorithms.
The cost function typically includes at least one term that causes the learning
process to perform statistical estimation. The most common cost function is the
negative log-likelihood, so that minimizing the cost function performs maximum
likelihood estimation.
The cost function may also include additional terms, such as regularization
terms. For example, we can add weight decay to the linear regression cost function
to obtain

J(w, b) = \lambda \|w\|_2^2 - \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x).   (5.67)
This still allows closed-form optimization.
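Concretely, writing the squared error in unnormalized form, the weight decay
objective \|Xw - y\|_2^2 + \lambda \|w\|_2^2 is minimized by the modified normal
equations w = (X^\top X + \lambda I)^{-1} X^\top y; the exact constant multiplying
\lambda depends on how the expectation in Eq. 5.67 is normalized. A minimal
NumPy sketch of ours, ignoring the bias b for brevity:

import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2:
    # the normal equations with weight decay added to X^T X.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(ridge_fit(X, y, lam=0.1))   # close to the generating weights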
If we change the model to be nonlinear, then most cost functions can no longer
be optimized in closed form. This requires us to choose an iterative numerical
optimization procedure, such as gradient descent.
The recipe for constructing a learning algorithm by combining models, costs, and
optimization algorithms supports both supervised and unsupervised learning. The
linear regression example shows how to support supervised learning. Unsupervised
learning can be supported by defining a dataset that contains only X and providing
an appropriate unsupervised cost and model. For example, we can obtain the first
PCA vector by specifying that our loss function is

J(w) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \|x - r(x; w)\|_2^2   (5.68)

while our model is defined to have w with norm one and reconstruction function
r(x) = w^\top x \, w.
In some cases, the cost function may be a function that we cannot actually
evaluate, for computational reasons. In these cases, we can still approximately
minimize it using iterative numerical optimization so long as we have some way of
approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not
immediately be obvious. If a machine learning algorithm seems especially unique or
hand-designed, it can usually be understood as using a special-case optimizer. Some
models such as decision trees or k-means require special-case optimizers because
their cost functions have flat regions that make them inappropriate for minimization
by gradient-based optimizers. Recognizing that most machine learning algorithms
can be described using this recipe helps to see the different algorithms as part of a
taxonomy of methods for doing related tasks that work for similar reasons, rather
than as a long list of algorithms that each have separate justifications.
5.11 Challenges Motivating Deep Learning
The simple machine learning algorithms described in this chapter work very well on
a wide variety of important problems. However, they have not succeeded in solving
the central problems in AI, such as recognizing speech or recognizing objects.
The development of deep learning was motivated in part by the failure of
traditional algorithms to generalize well on such AI tasks.
This section is about how the challenge of generalizing to new examples becomes
exponentially more difficult when working with high-dimensional data, and how
the mechanisms used to achieve generalization in traditional machine learning
are insufficient to learn complicated functions in high-dimensional spaces. Such
spaces also often impose high computational costs. Deep learning was designed to
overcome these and other obstacles.
5.11.1 The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number
of dimensions in the data is high. This phenomenon is known as the curse
of dimensionality. Of particular concern is that the number of possible distinct
configurations of a set of variables increases exponentially as the number of variables
increases.
Figure 5.9: As the number of relevant dimensions of the data increases (from left to
right), the number of configurations of interest may grow exponentially. (Left) In this
one-dimensional example, we have one variable for which we only care to distinguish 10
regions of interest. With enough examples falling within each of these regions (each region
corresponds to a cell in the illustration), learning algorithms can easily generalize correctly.
A straightforward way to generalize is to estimate the value of the target function within
each region (and possibly interpolate between neighboring regions). (Center) With 2
dimensions it is more difficult to distinguish 10 different values of each variable. We need
to keep track of up to 10 × 10 = 100 regions, and we need at least that many examples
to cover all those regions. (Right) With 3 dimensions this grows to 10^3 = 1000 regions
and at least that many examples. For d dimensions and v values to be distinguished
along each axis, we seem to need O(v^d) regions and examples. This is an instance of the
curse of dimensionality. Figure graciously provided by Nicolas Chapados.
The curse of dimensionality arises in many places in computer science, and
especially so in machine learning.
One challenge posed by the curse of dimensionality is a statistical challenge.
As illustrated in Fig. 5.9, a statistical challenge arises because the number of
possible configurations of x is much larger than the number of training examples.
To understand the issue, let us consider that the input space is organized into a
grid, like in the figure. In low dimensions we can describe this space with a low
number of grid cells that are mostly occupied by the data. When generalizing to a
new data point, we can usually tell what to do simply by inspecting the training
examples that lie in the same cell as the new input. For example, if estimating
the probability density at some point x, we can just return the number of training
examples in the same unit volume cell as x, divided by the total number of training
examples. If we wish to classify an example, we can return the most common class
of training examples in the same cell. If we are doing regression we can average
the target values observed over the examples in that cell. But what about the
cells for which we have seen no example? Because in high-dimensional spaces the
number of configurations is huge, much larger than our number of examples, most
configurations will have no training example associated with them.
How could we possibly say something meaningful about these new configurations?
Many traditional machine learning algorithms simply assume that the output at a
new point should be approximately the same as the output at the nearest training
point.
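
A small numerical experiment of our own makes the statistical problem tangible.
With a fixed budget of m = 10,000 uniformly distributed examples and v = 10
bins per axis, the fraction of grid cells containing at least one training example
collapses as the dimension d grows:

import numpy as np

rng = np.random.default_rng(0)
m, v = 10000, 10        # number of training examples; bins per axis

for d in (1, 2, 3, 4, 5):
    X = rng.uniform(size=(m, d))
    # Map each example to the index of the grid cell containing it.
    cells = np.unique(np.floor(X * v).astype(int), axis=0)
    # Fraction of the v^d cells that hold at least one example.
    print(d, len(cells) / v ** d)

At d = 5 there are v^d = 100,000 cells but only 10,000 examples, so at least 90%
of the cells are empty, and a purely local learner has nothing to say about them.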
5.11.2 Local Constancy and Smoothness Regularization
In order to generalize well, machine learning algorithms need to be guided by prior
beliefs about what kind of function they should learn. Previously, we have seen
these priors incorporated as explicit beliefs in the form of probability distributions
over parameters of the model. More informally, we may also discuss prior beliefs as
directly influencing the function itself and only indirectly acting on the parameters
via their effect on the function. Additionally, we informally discuss prior beliefs as
being expressed implicitly, by choosing algorithms that are biased toward choosing
some class of functions over another, even though these biases may not be expressed
(or even possible to express) in terms of a probability distribution representing our
degree of belief in various functions.
Among the most widely used of these implicit “priors” is the smoothness prior
or local constancy prior. This prior states that the function we learn should not
change very much within a small region.
Many simpler algorithms rely exclusively on this prior to generalize well, and
as a result they fail to scale to the statistical challenges involved in solving AI-
level tasks. Throughout this book, we will describe how deep learning introduces
additional (explicit and implicit) priors in order to reduce the generalization
error on sophisticated tasks. Here, we explain why the smoothness prior alone is
insufficient for these tasks.
There are many different ways to implicitly or explicitly express a prior belief
that the learned function should be smooth or locally constant. All of these different
methods are designed to encourage the learning process to learn a function f that
satisfies the condition

f(x) \approx f(x + \epsilon)   (5.69)

for most configurations x and small change \epsilon. In other words, if we know a
good answer for an input x (for example, if x is a labeled training example) then
that answer is probably good in the neighborhood of x. If we have several good
answers in some neighborhood we would combine them (by some form of averaging
or interpolation) to produce an answer that agrees with as many of them as much
as possible.
An extreme example of the local constancy approach is the k-nearest neighbors
family of learning algorithms. These predictors are literally constant over each
region containing all the points x that have the same set of k nearest neighbors in
the training set. For k = 1, the number of distinguishable regions cannot be more
than the number of training examples.
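
A one-nearest-neighbor predictor is only a few lines; in this NumPy sketch of
ours, every query simply receives the output of the closest training point, so the
prediction is constant over each such region:

import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    # Copy the output of the single nearest training example;
    # the predictor is piecewise constant over the input space.
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[dists.argmin(axis=1)]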
While the k-nearest neighbors algorithm copies the output from nearby training
examples, most kernel machines interpolate between training set outputs associated
with nearby training examples. An important class of kernels is the family of local
kernels where k(u, v) is large when u = v and decreases as u and v grow farther
apart from each other. A local kernel can be thought of as a similarity function
that performs template matching, by measuring how closely a test example x
resembles each training example x^{(i)}. Much of the modern motivation for deep
learning is derived from studying the limitations of local template matching and
how deep models are able to succeed in cases where local template matching fails
(Bengio et al., 2006b).
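
To make the template-matching view concrete, here is a kernel regressor built on
a local Gaussian kernel (a Nadaraya–Watson-style sketch of our own, with the
bandwidth as a free parameter); unlike the nearest-neighbor sketch above, it
interpolates between the outputs of all nearby training examples:

import numpy as np

def local_kernel_predict(X_train, y_train, X_query, bandwidth=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 * bandwidth^2)) is largest at u = v
    # and decays as u and v move apart, so each prediction is a weighted
    # average dominated by the training examples the query most resembles.
    sq_dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return (K @ y_train) / K.sum(axis=1)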
Decision trees also suffer from the limitations of exclusively smoothness-based
learning because they break the input space into as many regions as there are
leaves and use a separate parameter (or sometimes many parameters for extensions
of decision trees) in each region. If the target function requires a tree with at
least n leaves to be represented accurately, then at least n training examples are
required to fit the tree. A multiple of n is needed to achieve some level of statistical
confidence in the predicted output.
In general, to distinguish O(k) regions in input space, all of these methods
require O(k) examples. Typically there are O(k) parameters, with O(1) parameters
associated with each of the O(k) regions. The case of a nearest neighbor scenario,
where each training example can be used to define at most one region, is illustrated
in Fig. 5.10.
Is there a way to represent a complex function that has many more regions
to be distinguished than the number of training examples? Clearly, assuming
only smoothness of the underlying function will not allow a learner to do that.
For example, imagine that the target function is a kind of checkerboard. A
checkerboard contains many variations but there is a simple structure to them.
Imagine what happens when the number of training examples is substantially
smaller than the number of black and white squares on the checkerboard. Based
on only local generalization and the smoothness or local constancy prior, we would
be guaranteed to correctly guess the color of a new point if it lies within the same
checkerboard square as a training example. There is no guarantee that the learner
could correctly extend the checkerboard pattern to points lying in squares that do
not contain training examples. With this prior alone, the only information that an
example tells us is the color of its square, and the only way to get the colors of the
entire checkerboard right is to cover each of its cells with at least one example.

Figure 5.10: Illustration of how the nearest neighbor algorithm breaks up the input space
into regions. An example (represented here by a circle) within each region defines the
region boundary (represented here by the lines). The y value associated with each example
defines what the output should be for all points within the corresponding region. The
regions defined by nearest neighbor matching form a geometric pattern called a Voronoi
diagram. The number of these contiguous regions cannot grow faster than the number
of training examples. While this figure illustrates the behavior of the nearest neighbor
algorithm specifically, other machine learning algorithms that rely exclusively on the
local smoothness prior for generalization exhibit similar behaviors: each training example
only informs the learner about how to generalize in some neighborhood immediately
surrounding that example.
The smoothness assumption and the associated non-parametric learning algo-
rithms work extremely well so long as there are enough examples for the learning
algorithm to observe high points on most peaks and low points on most valleys
of the true underlying function to be learned. This is generally true when the
function to be learned is smooth enough and varies in few enough dimensions.
In high dimensions, even a very smooth function can change smoothly but in a
different way along each dimension. If the function additionally behaves differently
in different regions, it can become extremely complicated to describe with a set of
training examples. If the function is complicated (we want to distinguish a huge
number of regions compared to the number of examples), is there any hope to
generalize well?
The answer to both of these questions is yes. The key insight is that a very
large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long
as we introduce some dependencies between the regions via additional assumptions
about the underlying data generating distribution. In this way, we can actually
generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many
different deep learning algorithms provide implicit or explicit assumptions that are
reasonable for a broad range of AI tasks in order to capture these advantages.
Other approaches to machine learning often make stronger, task-specific as-
sumptions. For example, we could easily solve the checkerboard task by providing
the assumption that the target function is periodic. Usually we do not include such
strong, task-specific assumptions into neural networks so that they can generalize
to a much wider variety of structures. AI tasks have structure that is much too
complex to be limited to simple, manually specified properties such as periodicity,
so we want learning algorithms that embody more general-purpose assumptions.
The core idea in deep learning is that we assume that the data was generated
by the composition of factors, or features, potentially at multiple levels in a
hierarchy. Many other similarly generic assumptions can further improve deep
learning algorithms. These apparently mild assumptions allow an exponential gain
in the relationship between the number of examples and the number of regions
that can be distinguished. These exponential gains are described more precisely in
Sec. 6.4.1, Sec. 15.4, and Sec. 15.5. The exponential advantages conferred by the
use of deep, distributed representations counter the exponential challenges posed
by the curse of dimensionality.
5.11.3 Manifold Learning
An important concept underlying many ideas in machine learning is that of a
manifold.
A manifold is a connected region. Mathematically, it is a set of points, associated
with a neighborhood around each point. From any given point, the manifold locally
appears to be a Euclidean space. In everyday life, we experience the surface of the
world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.
The definition of a neighborhood surrounding each point implies the existence
of transformations that can be applied to move on the manifold from one position
to a neighboring one. In the example of the world’s surface as a manifold, one can
walk north, south, east, or west.
Although there is a formal mathematical meaning to the term “manifold,”
in machine learning it tends to be used more loosely to designate a connected
set of points that can be approximated well by considering only a small number
of degrees of freedom, or dimensions, embedded in a higher-dimensional space.
Each dimension corresponds to a local direction of variation. See Fig. 5.11 for an
example of training data lying near a one-dimensional manifold embedded in two-
dimensional space. In the context of machine learning, we allow the dimensionality
of the manifold to vary from one point to another. This often happens when a
manifold intersects itself. For example, a figure eight is a manifold that has a single
dimension in most places but two dimensions at the intersection at the center.
Figure 5.11: Data sampled from a distribution in a two-dimensional space that is actually
concentrated near a one-dimensional manifold, like a twisted string. The solid line indicates
the underlying manifold that the learner should infer.
Many machine learning problems seem hopeless if we expect the machine
learning algorithm to learn functions with interesting variations across all of
R^n. Manifold learning algorithms surmount this obstacle by assuming that most
of R^n consists of invalid inputs, and that interesting inputs occur only along
a collection of manifolds containing a small subset of points, with interesting
variations in the output of the learned function occurring only along directions
that lie on the manifold, or with interesting variations happening only when we
move from one manifold to another. Manifold learning was introduced in the case
of continuous-valued data and the unsupervised learning setting, although this
probability concentration idea can be generalized to both discrete data and the
supervised learning setting: the key assumption remains that probability mass is
highly concentrated.
The assumption that the data lies along a low-dimensional manifold may not
always be correct or useful. We argue that in the context of AI tasks, such as
those that involve processing images, sounds, or text, the manifold assumption is
at least approximately correct. The evidence in favor of this assumption consists
of two categories of observations.
The first observation in favor of the manifold hypothesis is that the probability
distribution over images, text strings, and sounds that occur in real life is highly
concentrated. Uniform noise essentially never resembles structured inputs from
these domains. Fig. 5.12 shows how, instead, uniformly sampled points look like the
patterns of static that appear on analog television sets when no signal is available.
Similarly, if you generate a document by picking letters uniformly at random, what
is the probability that you will get a meaningful English-language text? Almost
zero, again, because most of the long sequences of letters do not correspond to a
natural language sequence: the distribution of natural language sequences occupies
a very small volume in the total space of sequences of letters.
Figure 5.12: Sampling images uniformly at random (by randomly picking each pixel
according to a uniform distribution) gives rise to noisy images. Although there is a non-
zero probability to generate an image of a face or any other object frequently encountered
in AI applications, we never actually observe this happening in practice. This suggests
that the images encountered in AI applications occupy a negligible proportion of the
volume of image space.
Of course, concentrated probability distributions are not sufficient to show
that the data lies on a reasonably small number of manifolds. We must also
establish that the examples we encounter are connected to each other by other
examples, with each example surrounded by other highly similar examples that
may be reached by applying transformations to traverse the manifold. The second
argument in favor of the manifold hypothesis is that we can also imagine such
neighborhoods and transformations, at least informally. In the case of images, we
can certainly think of many possible transformations that allow us to trace out a
manifold in image space: we can gradually dim or brighten the lights, gradually
move or rotate objects in the image, gradually alter the colors on the surfaces of
objects, etc. It remains likely that there are multiple manifolds involved in most
applications. For example, the manifold of images of human faces may not be
connected to the manifold of images of cat faces.
These thought experiments convey some intuitive reasons supporting the manifold
hypothesis. More rigorous experiments (Cayton, 2005; Narayanan
and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al.,
2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger
and Saul, 2004) clearly support the hypothesis for a large class of datasets of
interest in AI.
When the data lies on a low-dimensional manifold, it can be most natural
for machine learning algorithms to represent the data in terms of coordinates on
the manifold, rather than in terms of coordinates in R^n. In everyday life, we can
think of roads as 1-D manifolds embedded in 3-D space. We give directions to
specific addresses in terms of address numbers along these 1-D roads, not in terms
of coordinates in 3-D space. Extracting these manifold coordinates is challenging,
but holds the promise to improve many machine learning algorithms. This general
principle is applied in many contexts. Fig. 5.13 shows the manifold structure of a
dataset consisting of faces. By the end of this book, we will have developed the
methods necessary to learn such a manifold structure. In Fig. 20.6, we will see
how a machine learning algorithm can successfully accomplish this goal.
Figure 5.13: Training examples from the QMUL Multiview Face Dataset (Gong et al., 2000)
for which the subjects were asked to move in such a way as to cover the two-dimensional
manifold corresponding to two angles of rotation. We would like learning algorithms to
be able to discover and disentangle such manifold coordinates. Fig. 20.6 illustrates such a
feat.

This concludes Part I, which has provided the basic concepts in mathematics
and machine learning which are employed throughout the remaining parts of the
book. You are now prepared to embark upon your study of deep learning.