Chapter 5
Machine Learning Basics
TODO: no free lunch theorem
Deep learning is a specific kind of machine learning. In order to understand deep
learning well, one must have a solid understanding of the basic principles of machine
learning. This chapter provides a brief course in the most important general princi-
ples that will be applied throughout the rest of the book. If you are already familiar
with machine learning basics, feel free to skip to the end of this chapter, starting at
Section 5.11. These sections cover basic machine learning notions having to do with
smoothness, local vs non-local generalization and the curse of dimensionality, as these
are not always discussed in basic courses in machine learning.
5.1 Learning algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But what
do we mean by learning? A popular definition of learning in the context of computer
programs is “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E” (Mitchell, 1997). One can imagine a very
wide variety of experiences E, tasks T , and performance measures P , and we do not
make any attempt in this book to provide a formal definition of what may be used for
each of these entities.
5.1.1 The task, T
Machine learning is mostly interesting because of the tasks we can accomplish with
it. From an engineering point of view, machine learning allows us to tackle tasks that
are too difficult to solve with fixed programs written and designed by human beings.
From a scientific and philosophical point of view, machine learning is interesting because
understanding it allows us to understand the principles that underlie intelligent behavior,
and intelligent behavior is defined as being able to accomplish certain tasks.
Many kinds of tasks can be solved with machine learning. This book describes these
kinds of tasks:
Classification: In this type of task, the algorithm is asked to output a function
$f : \mathbb{R}^n \to \{1, \ldots, k\}$. Here $f(x)$ can be interpreted as an estimate of the category
that x belongs to. There are other variants of the classification task, for example,
where f outputs a probability distribution over classes. An example of this type of
task is object recognition, where the input is an image (usually described as a set
of pixel brightness values), and the output is a numeric code identifying the object
in the image. For example, the Willow Garage PR2 robot is able to recognize
objects and deliver them to people on command (Goodfellow et al., 2010). Object
recognition is also commercially interesting to search within a set of photos, to
recognize faces in order to identify a person or unlock your phone, or for web
sites that use targeted advertising based on the content of user-uploaded photos.
Modern object recognition is best accomplished with deep learning (Krizhevsky
et al., 2012b; Zeiler and Fergus, 2014; Sermanet et al., 2014).
Classification with missing inputs : This is similar to classification, except rather
than providing a single classification function, the algorithm must learn a set of
functions. Each function corresponds to classifying x with a different subset of
its inputs missing. This kind of situation arises frequently in medical diagnosis,
because many kinds of medical tests are expensive or invasive.
Regression : In this type of task, the algorithm is asked to output a function
$f : \mathbb{R}^n \to \mathbb{R}$. This type of task is similar to classification, except the machine
learning system predicts a real-valued output instead of a category code. An
example of a regression task is the prediction of the expected claim amount that
an insured person will make (used to set insurance premia), or the prediction of
future prices of securities. These kinds of predictions are used for algorithmic
trading.
Transcription : This type of task is similar to classification, except that the output
is a sequence of symbols, rather than a category code (which can be thought of as
a single symbol). For example, in speech recognition and in machine translation,
the target output is a sequence of words. Such tasks fall under the more general
umbrella of structured output tasks, where the target output is a complex compo-
nential object involving many random variables (such as the individual words in
the translated sentence) that have a non-trivial joint distribution, given the input.
Density estimation: In this type of task, the machine learning algorithm is asked
to learn a function $p_{\text{model}} : \mathbb{R}^n \to \mathbb{R}$, where $p_{\text{model}}(x)$ can be interpreted as a
probability density function on the space that the examples were drawn from.
To do such a task well (we’ll specify exactly what that means when we discuss
performance measures P), the algorithm needs to learn the structure of the data
it has seen. It must know where examples cluster tightly and where they are
unlikely to occur. This can be used for anomaly detection (see below) but it can
also be used in principle (at the price of some computation) to answer all questions
of the form: what values are probable (or what are the expected values or other
such statistics) for some subset of the variables in x, given another subset of these
variables? This generalizes all of the tasks presented here and many more but
may involve intractable computations called inference, discussed in Chapter 16.
Density estimation is a form of unsupervised learning, and many other forms of
unsupervised learning are possible, which can solve all or a subset of such tasks,
provided appropriate inference is computationally feasible.
Anomaly detection: In this type of task, the machine learning algorithm is asked to
identify which events are normal and which events are unusual. This is closely
related to the density estimation task, since unusual events are less likely. An
example of an anomaly detection task is credit card fraud detection. By modeling
your purchasing habits, a credit card company can detect misuse of your cards.
If a thief steals your credit card or credit card information, the thief’s purchases
will presumably come from a different probability distribution over purchase types
than your own. The credit card company can prevent fraud by placing a hold on
an account as soon as that card has been used for an unusual purchase.
Synthesis and sampling: In this type of task, the machine learning algorithm is
asked to generate new examples that are similar to those in the training data.
This can be useful for media applications where it can be expensive or boring for
an artist to generate large volumes of content by hand. For example, video games
can automatically generate textures for large objects or landscapes, rather than
requiring an artist to manually label each pixel (Luo et al., 2013).
Imputation of missing values: In this type of task, the machine learning algorithm
is given a new example $x \in \mathbb{R}^n$, but with some entries $x_i$ of $x$ missing. The algorithm
must provide a prediction of the values of the missing entries. This task is
closely related to density estimation, because it can be solved by learning $p_{\text{model}}(x)$
then conditioning on the observed entries of $x$. TODO: example application
Of course, many other tasks and types of tasks are possible.
5.1.2 The performance measure, P
In order to evaluate the performance of a machine learning algorithm, we must design a
quantitative measure of its performance. Usually this performance measure P is specific
to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription,
we often measure the accuracy of the model. This is simply the proportion of examples
for which the model produces the correct output. For tasks such as density estimation,
we can measure the probability the model assigns to some examples.
Usually we are interested in how well the machine learning algorithm performs on
data that it has not seen before, since this determines how well it will work when
deployed in the real world. We therefore evaluate these performance measures using a
test set of data that is separate from the data used for training the machine learning
system. We discuss this idea further in section 5.3.
In machine learning, the performance measure is sometimes called the loss, which
represents the cost associated with a particular event (such as a classification). The
objective of learning is then to minimize the loss. For classification tasks, a common
choice for the loss function is the error, which is simply the proportion of examples
for which the model produces an incorrect output. Minimizing the error is, of course,
equivalent to maximizing the accuracy.
The choice of performance measure may seem straightforward and objective, but it
is often difficult to choose a performance measure that corresponds well to the desired
behavior of the system.
In some cases, this is because it is difficult to decide what should be measured.
For example, when performing a transcription task, should we measure the accuracy
of the system at transcribing entire sequences, or should we use a more fine-grained
performance measure that gives partial credit for getting some elements of the sequence
correct? When performing a regression task, should we penalize the system more if it
frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These
kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but mea-
suring it is impractical. This arises frequently in density estimation. Many of the best
probabilistic models represent probability distributions only implicitly. Computing the
actual probability value assigned to a specific point in space is intractable. In these cases,
one must design an alternative criterion that still corresponds to the design objectives,
or design a good approximation to the desired criterion.
Finally, in machine learning, our goal is to adapt the model parameters to optimize
the performance measure; however in many cases the relevant performance measure is
not amenable to direct application in the optimization process. In order for a perfor-
mance measure to be optimized directly, we almost always require a smooth signal such
as a gradient to let us know how to move the model parameters in order to improve
the performance measure. Many natural performance measures lack this property.
For example, consider again the use of classification accuracy as a measure of per-
formance. Very small (infinitesimal) changes in the model parameters are unlikely to
cause a change in the classification of any example. As a consequence, there is no local
signal to tell us how to update the model parameters to improve model performance.
As a result, in place of the natural loss functions, we often use surrogate performance
measures (also called surrogate loss functions) that are amenable to direct use as the
objective function optimized with respect to the model parameters.
5.1.3 The experience, E
In this book, the experience E usually consists of allowing the algorithm to observe
a dataset containing several examples. This is a relatively simple kind of experience;
other approaches to machine learning involve providing the algorithm with richer and
more interactive experiences. For example, in active learning the algorithm may request
specific additions to the dataset, and in reinforcement learning the algorithm continually
interacts with an environment.
A dataset can be described in many ways. In all cases, a dataset is a collection of
examples. Each example is a collection of observations called features collected from
a different time or place. If we wish to make a system for recognizing objects from
photographs, we might use a machine learning algorithm where each example is a pho-
tograph, and the features within the example are the brightness values of each of the
pixels within the photograph. If we wish to perform speech recognition, we might collect
a dataset where each example is a recording of a person saying a word or sentence, and
each of the features is the amplitude of the sound wave at a particular moment in time.
One common way of describing a dataset is with a design matrix. A design matrix
is a matrix containing a different example in each row. Each column of the matrix cor-
responds to a different feature. A famous example of a dataset that can be represented
with a design matrix is the Iris dataset (Fisher, 1936). This dataset contains mea-
surements of different parts of iris plants. The features measured are the sepal length,
sepal width, petal length, and petal width. In total, the dataset describes 150 different
individual iris plants. This means we can represent the dataset with a design matrix
$X \in \mathbb{R}^{150 \times 4}$, where $X_{i,1}$ is the sepal length of plant $i$, $X_{i,2}$ is the sepal width of plant $i$,
etc. We will describe most of the learning algorithms in this book in terms of how they
operate on design matrix datasets.
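As a concrete illustration, here is a minimal sketch of building this design matrix, assuming the scikit-learn library (which bundles a copy of the Iris dataset) is available:

```python
# Minimal sketch: the Iris data as a design matrix, assuming scikit-learn is installed.
from sklearn.datasets import load_iris

data = load_iris()
X = data.data      # design matrix of shape (150, 4): one row per plant, one column per feature
y = data.target    # integer species label for each of the 150 examples
print(X.shape)     # (150, 4)
print(X[0])        # sepal length, sepal width, petal length, petal width of plant 0
```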
Of course, to describe a dataset as a design matrix, it must be possible to describe
each example as a vector, and each of these vectors must be the same size. This is
not always possible. For example, if you have a collection of photographs with different
width and height, then different photographs will contain different numbers of pixels, so
not all of the photographs may be described with the same length of vector. Different
sections of this book describe how to handle different types of heterogeneous data.
In many cases, the example contains a label as well as a collection of features. For
example, if we want to use a learning algorithm to perform object recognition from
photographs, we need to specify which object appears in each of the photos. We might
do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying
a cat, etc. Often when working with a dataset containing a design matrix of feature
observations $X$, we also provide a vector of labels $\mathbf{y}$, with $y_i$ providing the label for
example $i$.
Of course, sometimes the label may be more than just a single number. For example,
if we want to train a speech recognition system to transcribe entire sentences, then the
label for each example sentence is a string of characters. These cases require more
sophisticated data structures.
5.2 Example: Linear regression
Let’s begin with an example of a simple machine learning algorithm. One of the simplest
machine learning algorithms is linear regression. As the name implies, this learning
algorithm solves a regression problem. In other words, the goal is to build a system that
can take a vector $\mathbf{x} \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output.
In the case of linear regression, the output is a linear function of the input. Let $\hat{y}$ be
the value that our model predicts $y$ should take on. We define the output to be
$$\hat{y} = \mathbf{w}^\top \mathbf{x}$$
where $\mathbf{w} \in \mathbb{R}^n$ is a vector of parameters.
Parameters are values that control the behavior of the system. In this case, $w_i$ is
the coefficient that we multiply by feature $x_i$ before summing up the contributions from
all the features. We can think of $\mathbf{w}$ as a set of weights that determine how each feature
affects the prediction. If a feature $x_i$ receives a positive weight $w_i$, then increasing the
value of that feature increases the value of our prediction $\hat{y}$. If a feature receives a
negative weight, then increasing the value of that feature decreases the value of our
prediction. If a feature's weight is large in magnitude, then it has a large effect on the
prediction. If a feature's weight is zero, it has no effect on the prediction.
We thus have a definition of our task T: to predict $y$ from $\mathbf{x}$ by outputting $\hat{y} = \mathbf{w}^\top \mathbf{x}$.
Next we need a definition of our performance measure, P.
Let’s suppose that we have a design matrix of m example inputs that we will not use
for training, only for evaluating how well the model performs. We also have a vector of
regression targets providing the correct value of y for each of these examples. Because
this dataset will only be used for evaluation, we call it the test set. Let’s refer to the
design matrix of inputs as $X^{(\text{test})}$ and the vector of regression targets as $\mathbf{y}^{(\text{test})}$.
One way of measuring the performance of the model is to compute the mean squared
error of the model on the test set. If $\hat{\mathbf{y}}^{(\text{test})}$ denotes the predictions of the model on the test
set, then the mean squared error is given by
$$\text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left( \hat{\mathbf{y}}^{(\text{test})} - \mathbf{y}^{(\text{test})} \right)_i^2 .$$
Intuitively, one can see that this error measure decreases to 0 when $\hat{\mathbf{y}}^{(\text{test})} = \mathbf{y}^{(\text{test})}$. We
can also see that
$$\text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{\mathbf{y}}^{(\text{test})} - \mathbf{y}^{(\text{test})} \right\|_2^2 ,$$
so the error increases whenever the Euclidean distance between the predictions and the
targets increases. We will justify the use of this performance measure more formally in
section 5.6.
To make a machine learning algorithm, we need to design an algorithm that will
improve the weights $\mathbf{w}$ in a way that reduces $\text{MSE}_{\text{test}}$ when the algorithm is allowed to
gain experience by observing a training set $(X^{(\text{train})}, \mathbf{y}^{(\text{train})})$. One intuitive way of doing
this is just to minimize the mean squared error on the training set, $\text{MSE}_{\text{train}}$. We will
justify this intuition later, in Section 5.6.
To minimize $\text{MSE}_{\text{train}}$, we can simply solve for where its gradient is 0:
$$\nabla_{\mathbf{w}} \text{MSE}_{\text{train}} = 0$$
$$\Rightarrow \nabla_{\mathbf{w}} \frac{1}{m} \left\| \hat{\mathbf{y}}^{(\text{train})} - \mathbf{y}^{(\text{train})} \right\|_2^2 = 0$$
$$\Rightarrow \frac{1}{m} \nabla_{\mathbf{w}} \left\| X^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right\|_2^2 = 0$$
Figure 5.1: Consider this example linear regression problem, with a training set consisting
of 5 data points, each containing one feature. This means that the weight vector
$\mathbf{w}$ contains only a single parameter to learn, $w_0$. (Left) Observe that linear regression
learns to set $w_0$ such that the line $y = w_0 x$ comes as close as possible to passing through
all the training points. (Right) The plotted point indicates the value of $w_0$ found by the
normal equations, which we can see minimizes the mean squared error on the training
set.
$$\Rightarrow \nabla_{\mathbf{w}} \left( X^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right)^\top \left( X^{(\text{train})} \mathbf{w} - \mathbf{y}^{(\text{train})} \right) = 0$$
$$\Rightarrow \nabla_{\mathbf{w}} \left( \mathbf{w}^\top X^{(\text{train})\top} X^{(\text{train})} \mathbf{w} - 2 \mathbf{w}^\top X^{(\text{train})\top} \mathbf{y}^{(\text{train})} + \mathbf{y}^{(\text{train})\top} \mathbf{y}^{(\text{train})} \right) = 0$$
$$\Rightarrow 2 X^{(\text{train})\top} X^{(\text{train})} \mathbf{w} - 2 X^{(\text{train})\top} \mathbf{y}^{(\text{train})} = 0$$
$$\Rightarrow \mathbf{w} = \left( X^{(\text{train})\top} X^{(\text{train})} \right)^{-1} X^{(\text{train})\top} \mathbf{y}^{(\text{train})} \quad (5.1)$$
The system of equations defined by Eq. 5.1 is known as the normal equations. Solving
these equations constitutes a simple learning algorithm. For an example of the linear
regression learning algorithm in action, see Fig. 5.1.
This is of course an extremely simple and limited learning algorithm, but it provides
an example of how a learning algorithm can work. In the subsequent sections we will de-
scribe some of the basic principles underlying learning algorithm design and demonstrate
how these principles can be used to build more complicated learning algorithms.
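To make the recipe concrete, here is a minimal sketch of the normal-equations learner in Python with NumPy; the synthetic data and noise level are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: y = X w_true + noise.
m, n = 100, 3
X_train = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y_train = X_train @ w_true + 0.1 * rng.normal(size=m)

# Normal equations (Eq. 5.1): w = (X^T X)^{-1} X^T y.
# np.linalg.solve is used instead of an explicit inverse for numerical stability.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Evaluate with mean squared error on held-out data drawn the same way.
X_test = rng.normal(size=(m, n))
y_test = X_test @ w_true + 0.1 * rng.normal(size=m)
mse_test = np.mean((X_test @ w - y_test) ** 2)
print(w, mse_test)
```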
5.3 Generalization, Capacity, Overfitting and Underfitting
5.3.1 Generalization
In our example machine learning algorithm, linear regression, we fit the model by min-
imizing the squared error on the training set. However, what we actually care about
is the performance of the model on new, previously unseen examples, such as the test
set. The expected error over all examples is known as the generalization error, and it
is the quantity that machine learning algorithms aim to minimize. Generalization is
the central objective of learning algorithms because for most problems of interest, the
number of possible input configurations is huge, so that new examples are almost surely
going to be different from any of the training examples. The learner has to generalize
from the training examples to new cases.
Formally, generalization performance is typically defined as the expected value of
the chosen performance measure, taken over the probability distribution of interest. For
example, in the above regression example, we care about the expected squared prediction
error. This probability distribution over which generalization performance is to be
measured would be the probability distribution from which future test cases are supposed
to arise. Hopefully, this is also the probability distribution from which training cases
are obtained. Otherwise, it will be difficult to obtain theoretical guarantees about
generalization. However, if some test examples from the target distribution are available,
they could be used to assess future generalization performance.
Figure 5.2: We fit three models to this example training set. The training data was
generated synthetically, by adding a small amount of noise to a quadratic function. (Left)
A linear function fit to the data suffers from underfitting–it cannot capture the curvature
that is present in the data. (Center) A quadratic function fit to the data generalizes
well to unseen points. It does not suffer from a significant amount of overfitting or
underfitting. (Right) A polynomial of degree 10 fit to the data suffers from overfitting.
It is tightly fit to noise in the data and has inappropriately steep slope at the edges of
the data.
5.3.2 Capacity
Capacity is a general term designating the ability of the learner to discover a function
taken from a more or less rich family of functions. For example, Figure 5.2 illustrates
the use of three families of functions over a scalar input $x$: the linear (actually, affine)
predictor (left of the figure),
$$\hat{y} = b + wx,$$
the quadratic predictor (middle of the figure),
$$\hat{y} = b + w_1 x + w_2 x^2,$$
and the degree-10 polynomial predictor (right of the figure),
$$\hat{y} = b + \sum_{i=1}^{10} w_i x^i .$$
The latter family is clearly richer, allowing it to capture more complex functions, but as
discussed next, a richer family of functions or a larger capacity can lead to the problem
of overfitting and poor generalization.
A simple definition of capacity is the number of training examples that the learner
could always fit, whatever their location and value.¹
As discussed in many parts of this book, there are many ways to change the capacity
of a learner. For example, with deep neural networks (Chapter 6), one can change
the number of hidden units of each layer, the number of training iterations, or use
regularization techniques (Chapter 7) to reduce capacity. In general, regularization is
equivalent to imposing a preference over the set of functions that a learner can obtain
as a solution. Such preferences can be thought of as a prior probability distribution
over the space of functions or parameters that the learner can access. This point of
view is particularly important according to the Bayesian statistics approach to machine
learning, introduced in Section 5.7.
5.3.3 Occam’s Razor, Underfitting and Overfitting
A fundamental element of machine learning is the trade-off between capacity and gener-
alization, and it has been the subject of a vast and mathematically grounded literature,
also known as statistical learning theory, with some of the best known contributions com-
ing from Vapnik and his collaborators, starting with the Vapnik-Chervonenkis dimension
or VC dimension (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989;
Vapnik, 1995), and with the many contributions reported at the COLT (COmputational
Learning Theory) conference. This trade-off is related to a much older idea, Occam’s
razor (c. 1287-1347), or principle of parsimony, which states that among competing hy-
potheses (here, read functions that could explain the observed data), one should choose
the “simpler” one. Simplicity here is just another word for the opposite of “richness of
the family of functions”, or capacity.
Machine learning algorithms can fail in two ways: underfitting and overfitting. Un-
derfitting is simple to understand: it happens when the learner cannot find a solution
that fits the observed data well enough. If we try to fit, with a linear regression, $(x, y)$
pairs for which $y$ is actually a quadratic function of $x$, then we are bound to have
underfitting. There will be important aspects of the data that our learner cannot cap-
ture. If the learner needs a richer model in order to capture the data, it is underfitting.
Sometimes, underfitting is not just due to the limitation imposed on the family of func-
tions but also due to limitations on computational resources. For example, deep neural
networks or recurrent neural networks can be very difficult to optimize (and in theory
¹ Barring degenerate cases, such as when two identical inputs are associated with different target labels,
e.g. examples $(x_1, y_1)$ and $(x_2, y_2)$ with $x_1 = x_2$ and $y_1 \neq y_2$, which can never be fit perfectly.
finding the global optimum of the training objective can be NP-hard). With a limited
computational budget (e.g., a limited number of training iterations of an iterative opti-
mization procedure), the learner may not be able to fit the training data well enough.
It could also be that the optimization procedure easily gets stuck in local minima of
the objective function and vastly more expensive optimization procedures are needed to
really extract all the juice from the data and the theoretical richness of the family of
functions in which the learner searches for a solution.
Overfitting is more subtle and not so easy to wrap one’s mind around. What would
be wrong with picking a very rich family of functions that is large enough to always fit
our training data? We could certainly fit the training data all right, but we might
lose the ability to generalize well. Why? The fundamental reason is that if the family of
functions accessible to the learner (with its limitations on computational resources) is too
large, then this family of functions may actually contain many functions which all fit the
training data well. Without sufficient data, we would therefore not be able to distinguish
which one was most appropriate, and the learner would make an arbitrary choice among
these apparently good solutions. Yet, these functions are all different from each other,
yielding different answers in at least some potential cases. Even if one of these functions
was the correct one, we might pick the wrong one, and the expected generalization
error would thus be at least proportional to the average discrepancy between these
functions, which is related to the variance of the estimator, discussed below (Section 5.5).
Occam's razor suggests picking the family of functions just large enough to leave only one
choice that fits the data well. Statistical learning theory has quantified this notion more
precisely, and shown a mathematical relationship between capacity (e.g. measured by the
VC-dimension) and the difference between generalization error and training error (also
known as optimism). Although generalization error cannot generally be characterized
exactly because it depends on the specifics of the training distribution, error bounds can
be proven, showing that the discrepancy between training error and generalization error
is bounded by a quantity that grows with the ratio of capacity to number of training
examples (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik,
1995).
Since training error decreases as a function of capacity, and adding a decreasing
function (training error) and an increasing function (the difference between generaliza-
tion and training error) yields a U-shaped function (first decreasing, then increasing),
it means that generalization error is a U-shaped function of capacity. This is illustrated
in Figure 5.3, showing the underfitting regime (left) and overfitting regime (right).
How do we know if we are underfitting or overfitting? If by increasing capacity we
decrease generalization error, then we are underfitting, otherwise we are overfitting.
The capacity associated with the transition from underfitting to overfitting is the optimal
capacity.
An interesting question is how one can select optimal capacity, and how it is expected
to vary. We can use a validation set to monitor generalization error as a function of ca-
pacity and thus estimate optimal capacity empirically. Methods such as cross-validation
(TODO) allow one to estimate generalization error even when the dataset is small, by
considering many possible splits of the data into training and validation examples. What
Figure 5.3: Typical relationship between capacity (horizontal axis) and both training
(bottom curve, blue) and generalization (or test) error (top curve, red). Capacity is
basically the number of examples that the learner can always fit. As we increase it,
training error can be reduced, but the optimism (the difference between training and
generalization error) increases. At some point, the increase in optimism
is larger than the decrease in training error (typically when the training error is low and
cannot go much lower), and we enter the overfitting regime, where capacity is too large,
above the optimal capacity. Before reaching optimal capacity, we are in the underfitting
regime.
does theory tell us about the optimal capacity? Because theory predicts that optimism
roughly increases with the ratio of capacity to number of training examples (Vapnik,
1995), it means that optimal capacity should increase with the number of training examples,
as illustrated in Figure 5.4. This also agrees with intuition: with more training examples, we
can “afford” a more complex model. If one only sees a single example $(x_1, y_1)$, a very
reasonable prediction for a function $y = f(x)$ is the constant function $f(x) = y_1$. If one
sees two examples $(x_1, y_1)$ and $(x_2, y_2)$, a very reasonable prediction is the straight line
that passes through these two points, i.e., a linear function of $x$. As we see more examples,
it makes sense to consider higher order polynomials, but certainly not of higher
order than the number of training examples.
With this notion in mind, let us consider an example. We can fit a polynomial
of degree $n$ to a dataset by using linear regression on a transformed space. Instead
of fitting a linear function to $\mathbf{x} = [x_1]^\top$, we fit it to $\mathbf{x}' = [1, x_1, x_1^2, \ldots, x_1^n]^\top$. For an
example of how lower order polynomials are more prone to underfitting and higher order
polynomials are more prone to overfitting, see Fig. 5.2.
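As a small sketch of this transformed-features trick (the degree, toy data, and noise level below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data, invented for illustration: a noisy quadratic.
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.normal(size=x.shape)

def polynomial_design_matrix(x, degree):
    """Map each scalar x to the transformed feature vector [1, x, x^2, ..., x^degree]."""
    return np.stack([x ** k for k in range(degree + 1)], axis=1)

# Fit the polynomial with ordinary linear regression on the transformed features.
degree = 2
X_poly = polynomial_design_matrix(x, degree)
w = np.linalg.lstsq(X_poly, y, rcond=None)[0]
print(w)  # roughly recovers the coefficients [1, 2, -3]
```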
Generally speaking, underfitting can be caused in two ways:
1. The model complexity (richness of the family of functions) is too limited for the
function being approximated. This is the situation illustrated in Fig. 5.2 and is
most commonly how machine learning practitioners perceive underfitting.
2. The optimization procedure fails to find a solution that well approximates the target
function despite the model having sufficient capacity for the task. While in practice
it is often difficult to decouple the two effects, this cause for underfitting is thought
to be fairly common with deep learning methods. This is one reason why there is so
much emphasis on models and methods that promote more effective optimization.
We discuss these in Chapter 8.
5.4 Estimating and Monitoring Generalization Error
TODO
5.5 Estimators, Bias, and Variance
TODO TODO: discuss sample variance and covariance using sample mean versus known
true mean
As a discipline, machine learning borrows many useful concepts from statistics. Foun-
dational concepts such as parameter estimation, bias and variance are all borrowed from
statistics. In this section we briefly review these concepts and relate them to the machine
learning concepts that we have already discussed, such as generalization.
5.5.1 Point Estimation
Point estimation is the attempt to provide the single “best” prediction of some quantity
of interest. In general the quantity of interest can be a single parameter or a vector
Figure 5.4: As the number of training examples increases, optimal capacity (bold black)
increases (we can afford a bigger and more flexible model), and the associated gener-
alization error (green bold) would decrease, eventually reaching the (non-parametric)
asymptotic error (green dashed line). If capacity was fixed (parametric setting), in-
creasing the number of training examples would also decrease generalization error (top
red curve), but not as fast, and training error would slowly increase (bottom red curve),
so that both would meet at an asymptotic value (dashed red line) corresponding to the
best achievable solution in some class of learned functions.
of parameters in some parametric model, such as the weights in our linear regression
example in Section 5.2, but it can also be a function.
In order to distinguish estimates of parameters from their true value, our convention
will be to denote a point estimate of a parameter $\theta$ by $\hat{\theta}$.
In the remainder of this section we will be assuming the orthodox perspective on
statistics. That is, we assume that the true parameter value $\theta$ is fixed but unknown,
while the point estimate $\hat{\theta}$ is a function of the data and is therefore a random variable.
Let $x_1, \ldots, x_n$ be $n$ independent and identically distributed (IID) data points. A
point estimator is any function of the data:
$$\hat{\theta}_n = g(x_1, \ldots, x_n). \quad (5.2)$$
In other words, any statistic² is a point estimate. Notice that no mention is made of
any correspondence between the estimator and the parameter being estimated. There
is also no constraint that the range of $g(x_1, \ldots, x_n)$ should correspond to that of the
parameter.
This definition of a point estimator is very general and allows the designer of an
estimator great flexibility. What distinguishes “just any” function of the data from
most of the estimators that are in common usage is their properties.
Point estimation can also refer to the estimation of the relationship between input
and target variables. We refer to these types of point estimates as function estimators.
Function Estimation As we mentioned above, sometimes we are interested in performing
function estimation (or function approximation). Here we are trying to predict
a variable (or vector) $y$ given an input vector $\mathbf{x}$ (also called the covariates). We consider
that there is a function $f(\mathbf{x})$ that describes the relationship between $y$ and $\mathbf{x}$. For
example, we may assume that $y = f(\mathbf{x}) + \epsilon$.
In function estimation, we are interested in approximating $f$ with a model or estimate
$\hat{f}$. Note that we are really not adding anything new here to our notion of a point
estimator; the function estimator $\hat{f}$ is simply a point estimator in function space.
The linear regression example we discussed above in Section 5.2 and the polynomial
regression example discussed in Section 5.3 are both examples of function estimation
where we estimate a model $\hat{f}$ of the relationship between an input $\mathbf{x}$ and target $y$.
In the following we will review the most commonly studied properties of point estimators
and discuss what they tell us about point estimators.
As $\hat{\theta}$ and $\hat{f}$ are random variables (or vectors, or functions), they are distributed
according to some probability distribution. We refer to this distribution as the sampling
distribution. When we discuss properties of the estimator, we are really describing
properties of the sampling distribution.
² A statistic is a function of the data, typically of the whole training set, such as the mean.
5.5.2 Bias
The bias of an estimator is defined as:
$$\text{bias}(\hat{\theta}_n) = \mathbb{E}(\hat{\theta}_n) - \theta \quad (5.3)$$
where the expectation is over the data (seen as samples from a random variable). An
estimator $\hat{\theta}_n$ is said to be unbiased if $\text{bias}(\hat{\theta}_n) = 0$, i.e., if $\mathbb{E}(\hat{\theta}_n) = \theta$. An estimator $\hat{\theta}_n$ is
said to be asymptotically unbiased if $\lim_{n\to\infty} \text{bias}(\hat{\theta}_n) = 0$, i.e., if $\lim_{n\to\infty} \mathbb{E}(\hat{\theta}_n) = \theta$.
Example: Bernoulli distribution Consider data samples $x \in \{0, 1\}^n$ that are
independently and identically distributed according to a Bernoulli distribution ($x_i \sim$
$\text{Bernoulli}(\theta)$, where $i \in [1, n]$). The Bernoulli p.m.f. (probability mass function, or
probability function) is given by $P(x_i; \theta) = \theta^{x_i} (1-\theta)^{(1-x_i)}$.
We are interested in knowing if the estimator $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n x_i$ is biased.
$$\text{bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta$$
$$= \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n x_i\right] - \theta$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x_i] - \theta$$
$$= \frac{1}{n}\sum_{i=1}^n \sum_{x_i=0}^{1} x_i\, \theta^{x_i}(1-\theta)^{(1-x_i)} - \theta$$
$$= \frac{1}{n}\sum_{i=1}^n \theta - \theta$$
$$= \theta - \theta = 0$$
Since $\text{bias}(\hat{\theta}) = 0$, we say that our estimator $\hat{\theta}$ is unbiased.
TODO: another example
Unbiased estimators are clearly desirable, though they
are not always the “best” estimators. As we will see we often use biased estimators that
possess other important properties.
5.5.3 Variance
Another property of the estimator that we might want to consider is how much we
expect it to vary as a function of the data sample. Just as we computed the expectation
of the estimator to determine its bias, we can compute its variance.
$$\text{Var}[\hat{\theta}] = \mathbb{E}[\hat{\theta}^2] - \mathbb{E}[\hat{\theta}]^2 \quad (5.4)$$
We can also define the standard error (se) of the estimator as
$$\text{se}(\hat{\theta}) = \sqrt{\text{Var}[\hat{\theta}]} \quad (5.5)$$
Example: Bernoulli distribution Let's once again consider a dataset $x \in \{0, 1\}^n$
whose entries are independently and identically distributed according to the Bernoulli distribution,
where $P(x_i; \theta) = \theta^{x_i}(1-\theta)^{(1-x_i)}$. This time we are interested in computing the
variance of the estimator $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n x_i$.
$$\text{Var}\left(\hat{\theta}_n\right) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right)$$
$$= \frac{1}{n^2}\sum_{i=1}^n \text{Var}(x_i)$$
$$= \frac{1}{n^2}\sum_{i=1}^n \theta(1-\theta)$$
$$= \frac{1}{n^2}\, n\, \theta(1-\theta)$$
$$= \frac{1}{n}\,\theta(1-\theta)$$
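A quick Monte Carlo check of both results (a sketch; the values of θ, n, and the number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

theta, n, trials = 0.3, 50, 200_000
# Draw many datasets of n Bernoulli(theta) samples and compute the estimator on each.
samples = rng.binomial(1, theta, size=(trials, n))
theta_hat = samples.mean(axis=1)

print(theta_hat.mean() - theta)     # empirical bias, close to 0
print(theta_hat.var())              # close to theta * (1 - theta) / n
print(theta * (1 - theta) / n)      # theoretical variance
```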
TODO: another example
5.5.4 Trading off Bias and Variance and the Mean Squared Error
Bias and variance measure two different sources of error in an estimator. Bias measures
the expected deviation from the true value of the function or parameter. Variance on the
other hand, provides a measure of the deviation from the true value that any particular
sampling of the data is likely to cause.
What happens when we are given a choice between two estimators, one with more
bias and one with less variance? How do we choose between them? For example, let’s
imagine that we are interested in approximating the function shown in Fig. 5.2 and we
are only offered the choice between a model with large bias and one that suffers from
large variance. How do we choose between them?
In machine learning, perhaps the most common and empirically successful way to negotiate
this kind of trade-off in general is cross-validation, discussed (TODO where?)
Alternatively, we can also compare the mean squared error (MSE) of the estimates:
$$\text{MSE} = \mathbb{E}\left[(\hat{\theta}_n - \theta)^2\right] = \text{Bias}(\hat{\theta}_n)^2 + \text{Var}[\hat{\theta}_n] \quad (5.6)$$
The MSE measures the overall expected deviation in a squared error sense between
the estimator and the true value of the parameter θ. As is clear from Eq. 5.6, evaluating
the MSE incorporates both the bias and the variance. Desirable estimators are those
with small MSE and these are estimators that manage to keep both their bias and
variance somewhat in check.
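The decomposition in Eq. 5.6 follows from expanding the square, a standard one-line argument reproduced here for completeness:
$$\mathbb{E}\left[(\hat{\theta}_n - \theta)^2\right] = \mathbb{E}\left[\hat{\theta}_n^2\right] - 2\theta\,\mathbb{E}\left[\hat{\theta}_n\right] + \theta^2 = \text{Var}[\hat{\theta}_n] + \left(\mathbb{E}[\hat{\theta}_n] - \theta\right)^2 .$$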
The relationship between bias and variance is tightly linked to the machine learning
concepts of capacity, underfitting and overfitting discussed in Section. 5.3. In the
case where generalization error is measured by the MSE (where bias and variance are
meaningful components of generalization error), increasing capacity tends to increase
variance and decrease bias. We thus see again the U-shaped curve of generalization
error as a function of capacity, as in Section 5.3.3 and Figure 5.3.
TODO
TODO Example trading off bias and variance let’s compare two estimators for
the variance of a Gaussian distribution. One known as the sample variance:
SampleVariance (5.7)
TODO
5.5.5 Consistency
As we’ve already discussed, sometimes we may wish to choose an estimator that is
biased. For example, in order to minimize the variance of the estimator. However we
might still wish that, as the number of data points in our dataset increases, our point
estimates converge to the true value of the parameter. More formally, we would like
that $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$.³ This condition is known as consistency⁴ and ensures that the
bias induced by the estimator is assured to diminish as the number of data examples
grows. In other words, the estimator is asymptotically unbiased.
Asymptotic unbiasedness is not equivalent to consistency. For example, consider estimating
the mean parameter $\mu$ of a normal distribution $N(\mu, \sigma^2)$, with a dataset consisting
of $n$ samples: $\{x_1, \ldots, x_n\}$. We could use the first sample $x_1$ of the dataset as
an unbiased estimator: $\hat{\theta} = x_1$. In that case, $\mathbb{E}(\hat{\theta}_n) = \theta$, so the estimator is unbiased
no matter how many data points are seen. This, of course, implies that the estimate is
asymptotically unbiased. However, this is not a consistent estimator as it is not the case
that $\hat{\theta}_n \to \theta$ as $n \to \infty$.
5.6 Maximum likelihood estimation
TODO: refer back to sec:linreg, we promised we’d justify using mean squared error as
a performance measure for linear regression justify minimizing mean squared error as a
learning algorithm for linear regression TODO: relate it to KL divergence TODO
³ The symbol $\xrightarrow{p}$ means that the convergence is in probability, i.e. for any $\epsilon > 0$, $P(|\hat{\theta}_n - \theta| > \epsilon) \to 0$
as $n \to \infty$.
⁴ This is sometimes referred to as weak consistency, with strong consistency referring to the almost
sure convergence of $\hat{\theta}$ to $\theta$.
In the previous section we discussed a number of common properties of estimators
but we never mentioned where these estimators come from. In this section we discuss
one of the most common approaches to deriving estimators: via the maximum likelihood
principle.
5.6.1 Properties of Maximum Likelihood
TODO: Asymptotic convergence and efficiency
TODO: how the log-likelihood criterion forces a learner that is not able to generalize
perfectly to yield an estimator that is much smoother than the target distribution.
5.6.2 Regularized Likelihood
In practice, many machine learning algorithms actually do not fit the maximum like-
lihood framework but instead a slightly more general one, in which we introduce an
additional term in the objective function, besides the objective function. We call that
term the regularizer and it depends on the chosen function or parameters, indicating
a preference over some functions or parameters over others. Typically, the regularizer
prefers simpler solutions, such as the weight decay regularizer
λ||θ||
2
where θ is the parameter vector and λ is a scalar that controls the strength of the
regularizer (here, to be minimized) against the negative log-likelihood. Regularization
is treated in much greater depth in Chapter 7. When we interpret the regularizer as an
a priori log-probability over parameters, it is also related to Bayesian statistics, treated
next.
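For instance, adding weight decay to the linear regression of Section 5.2 gives the ridge regression objective, whose minimizer also has a closed form. A minimal sketch (the data and the value of λ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data, invented for illustration.
m, n = 50, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

# Minimize ||Xw - y||^2 + lam * ||w||^2  ->  w = (X^T X + lam I)^{-1} X^T y.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(w_ridge)
```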
5.7 Bayesian Statistics
TODO: a quick overview of Bayesian statistics and the Bayesian world view
In this section we will briefly consider estimation theory from a Bayesian perspective.
Historically, the statistics community has been divided between what has become known
as orthodox statistics and Bayesian statistics. The difference is mainly one of world view
but can have important practical implications.
As we mentioned above, the orthodox perspective is that the true parameter value
$\theta$ is fixed but unknown, while the point estimate $\hat{\theta}$ is a random variable on account of
it being a function of the data (which are seen as random).
The Bayesian perspective on statistics is quite different and, in some sense, more
intuitive. The Bayesian uses probability to reflect states of knowledge (actually the lack
thereof, i.e. probability represents the certainty of states of knowledge). The data is
directly observed and so is not random. On the other hand, the true parameter θ is
unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of θ using the prior probability
distribution, p(θ) (sometimes referred to as simply ’the prior’). Generally, the prior
distribution is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty
in the value of θ before observing any data. For example, we might assume a priori that
θ lies in some finite range or volume, with a uniform distribution. Many priors instead
reflect a preference for “simpler” solutions (such as smaller magnitude coefficients, or a
function that is closer to being constant).
We can recover the effect of data on our belief about θ by combining the data
likelihood p(x | θ) with the prior via Bayes’ rule:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \quad (5.8)$$
If the data is at all informative about the value of θ, the posterior distribution p(θ | x)
will have less entropy (will be more ‘peaky’) than the prior p(θ).
We can consider the posterior distribution over values of θ as the basis for an estima-
tor of θ. Just as with the orthodox perspective, our Bayesian estimator of θ is associated
with a probability distribution. If we need a single point estimate, a good choice would
be to use the maximum a posteriori (MAP) point.
$$\theta_{\text{MAP}} = \arg\max_\theta\, p(\theta \mid x) = \arg\max_\theta\, \log p(x \mid \theta) + \log p(\theta) \quad (5.9)$$
where we recognize the regularized maximum likelihood training criterion, on the right
hand side, with log p(x | θ) being the log-likelihood and log p(θ) corresponding to the
regularizer. This shows that a more peaked prior corresponds to a stronger regularizer,
while a completely flat prior (e.g. $p(\theta)$ a constant that does not depend on $\theta$) means no
regularization at all.
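As a concrete instance, a zero-mean Gaussian prior on the parameters recovers the weight decay regularizer of Section 5.6.2 (the particular variance parameterization below is chosen only to make the correspondence exact):
$$p(\theta) = \mathcal{N}\!\left(\theta;\, 0,\, \tfrac{1}{2\lambda} I\right) \;\Rightarrow\; \log p(\theta) = -\lambda \|\theta\|^2 + \text{const},$$
so maximizing $\log p(x \mid \theta) + \log p(\theta)$ amounts to minimizing the negative log-likelihood plus the weight decay penalty $\lambda \|\theta\|^2$.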
TODO: Redo the regression example with a prior.
5.8 Supervised learning
5.8.1 Estimating Conditional Expectation by Minimizing Squared Er-
ror
In the context of linear regression, we discussed in Section 5.2 the squared error criterion
$$\mathbb{E}[\|Y - f(X)\|^2]$$
where the expectation is over the training set during training, and over the data gener-
ating distribution to obtain generalization error.
We can generalize its interpretation beyond the case where f is linear or affine,
uncovering an interesting property: minimizing it yields an estimator of the conditional
expectation of the output variable Y given the input variable X, i.e.,
$$\arg\min_f \mathbb{E}[\|Y - f(X)\|^2] = \mathbb{E}[Y \mid X] \quad (5.10)$$
where the expectations on the left and the right are over the same distribution over
(X, Y). In practice we will minimize over the training set (mean squared error) but
88
hope to obtain an estimator of the conditional expectation with respect to the data
generating distribution, i.e., to generalize. Note that the minimization in Eq. 5.10 is
over all possible functions f, i.e., it is a non-parametric result. The equation remains
true if f belongs to a particular parametric family (such as the affine functions) so long
as the true expected value also belongs to the same parametric family. The above can
be proven by computing derivatives of the expected squared error with respect to each
particular value f(x) (for every particular x), and setting the derivative to 0.
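Sketching that argument for a scalar output, and assuming we may differentiate under the expectation, for each fixed $x$:
$$\frac{\partial}{\partial f(x)}\, \mathbb{E}\left[(Y - f(x))^2 \mid X = x\right] = 2 f(x) - 2\, \mathbb{E}[Y \mid X = x] = 0 \;\Rightarrow\; f(x) = \mathbb{E}[Y \mid X = x].$$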
5.8.2 Estimating Probabilities or Conditional Probabilities by Maxi-
mum Likelihood
A demonstration similar to the above can be written down to show that the maximum
likelihood criterion yields an estimator of the data generating probability distribution
(or conditional distribution).
We show the conditional case, estimating the true Q(Y |X), but this immediately
applies to the unconditional case by ignoring X. Consider the negative log-likelihood
(NLL) training criterion,
$$\text{NLL} = \mathbb{E}[-\log P_\theta(Y \mid X)] = -\sum_{x,y} Q(x, y) \log P_\theta(y \mid x) \quad (5.11)$$
where the expectation is over the data generating distribution $Q(X, Y)$ and we have
used sums (which apply to discrete variables) but they should be replaced by integrals
when the variables are continuous. We can transform the NLL into a Kullback-Leibler
divergence between $P_\theta(Y \mid X)$ and $Q(Y \mid X)$, by subtracting the entropy term
$$H(Q(Y \mid X = x)) = -\mathbb{E}_{Q(Y \mid X = x)}[\log Q(Y \mid X = x)] = -\sum_y Q(y \mid x) \log Q(y \mid x)$$
where again we used sums but they should be replaced by integrals for continuous
variables. Subtracting the conditional entropy of Q(Y |X) (which does not depend on
$P_\theta$) from the NLL yields the KL-divergence between $Q(Y \mid X = x)$ and $P_\theta(Y \mid X = x)$,
with $Q$ as the reference:
$$-\sum_{x,y} Q(x, y) \log P_\theta(y \mid x) + \sum_{x,y} Q(x, y) \log Q(y \mid x) = \sum_x Q(x) \sum_y Q(y \mid x) \log \frac{Q(y \mid x)}{P_\theta(y \mid x)} = \mathbb{E}_{Q(X)}\left[\text{KL}\left(Q(Y \mid X) \,\|\, P_\theta(Y \mid X)\right)\right].$$
As we have seen in Section 3.9, the KL divergence is minimized when the two distri-
butions are equal, which proves that the non-parametric minimizer of the NLL over
$P_\theta(Y \mid X)$ is the target conditional distribution $Q(Y \mid X)$.
The above motivates the maximum likelihood criterion (or minimizing the NLL)
when trying to estimate probabilities or conditional probability distributions.
TODO– logistic regression TODO– SVMs
5.9 Unsupervised learning
Unsupervised learning is another broad class of machine learning methods that is distinct
from supervised learning. In supervised learning, the goal is to use input-label pairs, x, y
to learn a function f that predicts a label (or a distribution over labels) given the input,
i.e. ˆy = f(x). In unsupervised learning, no such label or other target is provided. The
data consists of a set of examples x and the objective is to learn about the statistical
structure of x itself.
Learning a representation of data A classic unsupervised learning task is to find
the ‘best’ representation of the data. By ‘best’ we can mean different things, but gen-
erally speaking we are looking for a representation that preserves as much information
about x as possible while obeying some penalty or constraint aimed at keeping the
representation simpler or more accessible than x itself.
There are multiple ways of defining a simpler representation, some of the most com-
mon include lower dimensional representations, sparse representations and independent
representations. Low-dimensional representations attempt to compress as much infor-
mation about x as possible in a smaller representation. Sparse representations generally
embed the dataset into a high-dimensional representation⁵ where the number of non-zero
entries is small. This results in an overall structure of the representation that tends to
distribute data along the axes of the representation space. Independent representations
attempt to disentangle the sources of variation underlying the data distribution such
that the dimensions of the representation are statistically independent.
Of course these three criteria are certainly not mutually exclusive. Low-dimensional
representations often yield elements that are more-or-less mutually independent. This
happens because the pressure to encode as much information about the data x as pos-
sible into a low-dimensional representation drives the elements of this representation to
be more independent. Any dependency between the variables in f(x) is evidence of
redundancy and implies that the representation could have captured more information
about x.
The notion of representation is one of the central themes of deep learning and there-
fore one of the central themes in this book. Section 5.11 of this chapter discusses some
of the qualities we would like in our learned representations.
5.9.1 Principal Components Analysis
In the remainder of this section we will consider one of the most widely used unsupervised
learning methods: Principal Components Analysis (PCA). PCA is an orthogonal, linear
transformation of the data that projects it into a representation where the elements are
uncorrelated (shown in Figure 5.5).
⁵ Sparse representations often use over-complete representations: the representation dimension is
greater than the original dimensionality of the data.
Figure 5.5: Illustration of the data representation learned via PCA.
In section 2.11, we saw that we could learn a one-dimensional representation that
best reconstructs the original data (in the sense of mean-squared error) and that this
representation actually corresponds to the first principal component of the data. Thus we
can use PCA as a simple and effective dimensionality reduction method that preserves as
much of the information in the data (again, as measured by least-squares reconstruction
error) as possible. In the following, we will take a look at other properties of the PCA
representation. Specifically, we will study how the PCA representation can be said to
decorrelate the original data representation X.
Let us consider the n × m-dimensional design matrix X. We will assume that the
data has a mean of zero, E[x] = 0. If this is not the case, the data can easily be centered
(mean removed). The sample covariance matrix associated with X is given by:
$$\text{Var}[x] = \frac{1}{n - 1} X^\top X \quad (5.12)$$
The goal of PCA is to find a representation $Z = XW$ where $\text{Var}[z]$ is diagonal. To do
this, we will make use of the singular value decomposition (SVD) of $X$: $X = U \Sigma W^\top$,
where $\Sigma$ is an $n \times m$-dimensional rectangular diagonal matrix with the singular values
of $X$ on the diagonal, $U$ is an $n \times n$ matrix whose columns are orthonormal (i.e. unit
length and orthogonal) and $W$ is an $m \times m$ matrix also composed of orthonormal column
vectors.
Using the SVD of X, we can re-express the variance of X as:
$$\text{Var}[x] = \frac{1}{n-1} X^\top X \quad (5.13)$$
$$= \frac{1}{n-1} \left(U \Sigma W^\top\right)^\top U \Sigma W^\top \quad (5.14)$$
$$= \frac{1}{n-1} W \Sigma^\top U^\top U \Sigma W^\top \quad (5.15)$$
$$= \frac{1}{n-1} W \Sigma^2 W^\top, \quad (5.16)$$
where we use the orthonormality of $U$ ($U^\top U = I$) and define $\Sigma^2$ as an $m \times m$-dimensional
diagonal matrix with the squares of the singular values of $X$ on the diagonal,
i.e. the $i$th diagonal element is given by $\sigma_i^2$.
This suggests that if we take Z = XW , we can ensure that the covariance associated
with Z is diagonal as required.
$$\text{Var}[z] = \frac{1}{n-1} Z^\top Z \quad (5.17)$$
$$= \frac{1}{n-1} W^\top X^\top X W \quad (5.18)$$
$$= \frac{1}{n-1} W^\top W \Sigma^2 W^\top W \quad (5.19)$$
$$= \frac{1}{n-1} \Sigma^2 \quad (5.20)$$
Similar to our analysis of the variance of X above, we exploit the orthonormality of $W$
($W^\top W = I$). Our use of SVD to solve for the PCA components of $X$ (i.e. elements of
$z$) reveals an interesting connection to the eigen-decomposition of a matrix related to
$X$. Specifically, the columns of $W$ are the eigenvectors of the $m \times m$-dimensional
matrix $X^\top X$.
The above analysis shows that when we project the data $X$ to $Z$, via
the linear transformation $W$, the resulting representation has a diagonal covariance
matrix (as given by $\Sigma^2$) which immediately implies that the individual elements of $Z$
are mutually uncorrelated.
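A short numerical check of this decorrelation property (a sketch on synthetic data; no particular dataset is implied):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic, correlated 2-D data, centered to have zero mean.
n = 1000
X = rng.normal(size=(n, 2)) @ np.array([[2.0, 1.2], [0.0, 0.5]])
X = X - X.mean(axis=0)

# SVD of the design matrix: X = U diag(s) W^T (NumPy returns W^T as `Wt`).
U, s, Wt = np.linalg.svd(X, full_matrices=False)
Z = X @ Wt.T  # project onto the principal components

cov_X = X.T @ X / (n - 1)
cov_Z = Z.T @ Z / (n - 1)
print(cov_X)              # off-diagonal entries far from zero: the raw features are correlated
print(cov_Z)              # off-diagonal entries ~0: the PCA representation is decorrelated
print(s ** 2 / (n - 1))   # matches the diagonal of cov_Z
```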
This ability of PCA to transform data into a representation where the elements are
mutually uncorrelated is a very important property of PCA. It is a simple example of a
representation that disentangles the factors of variation underlying the data. In the
case of PCA, this disentangling takes the form of finding a rotation of the input space
(mediated via the transformation W) that aligns the principal axes of variance with the
basis of the new representation space associated with Z, as illustrated in Fig. 5.5. While
correlation is an important category of dependency between elements of the data, we are
also interested in learning representations that disentangle more complicated forms of
feature dependencies. For this, we will need more than what can be done with a simple
linear transformation. These issues are discussed below in Sec. 5.11 and later in detail
in Chapter 14.
5.10 Weakly supervised learning
Weakly supervised learning is another class of learning methods that stands between
supervised and unsupervised learning. It refers to a setting where the dataset consists
of (x, y) pairs, as in supervised learning, but where the labels y are either unreliably
present (i.e. with missing values) or noisy (i.e. where the label given is not the true
label).
Methods for working with weakly labeled data have recently grown in importance
due to the largely untapped potential for using large quantities of readily available
weakly labeled data in a transfer learning paradigm to help solve problems where large,
clean datasets are hard to come by. The internet has become a major source of this
kind of noisy data.
5.11 The Smoothness Prior, Local Generalization and Non-
Parametric Models
A fairly large number of machine learning algorithms exploit a single basic prior:
the smoothness prior. It can be expressed in many different ways, but it basically says
that the target function or distribution of interest $f^*$ is smooth, i.e.,
$$f^*(x) \approx f^*(x + \epsilon) \quad (5.21)$$
for most configurations $x$ and small changes $\epsilon$.
For example, if we force the learned f(·) to be piecewise constant, we obtain a his-
togram, with the number of pieces being equal to the number of distinguishable regions,
or “bins” in which the different x values can fall. Another example of piecewise constant
learned function is what we obtain with K-nearest neighbors predictors, where f(x) is
constant in some region R containing all the points x that have the same set of K
nearest neighbors from the training set. If we are doing classification and K=1, f(x) is
just the output class associated with the nearest neighbor of x in the training set. If we
are doing regression, f(x) is the average of the outputs associated with the K nearest
neighbors of x. Note that in both cases, the number of distinguishable regions cannot
be more than the number of training examples.
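As a concrete illustration (a minimal sketch, not taken from the text), the K-nearest-neighbor predictor just described can be written in a few lines of NumPy; its output is constant within each region whose points share the same set of K nearest training examples.

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=1):
    # Euclidean distances from the query point to every training example.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Average of the outputs of the k nearest neighbors (constant within a region).
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
print(knn_predict(np.array([1.4]), X_train, y_train, k=1))  # 0.0 (nearest neighbor's output)
print(knn_predict(np.array([1.4]), X_train, y_train, k=3))  # 0.33... (average over 3 neighbors)
```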
Figure 5.6: Illustration of interpolation and kernel-based methods, which construct a smooth function by interpolating in various ways between the training examples (circles), which act like knot points controlling the shape of the implicit regions that separate them as well as the values to output within each region. Depending on the type of kernel, one obtains a piecewise constant function (histogram-like, in dotted red), a piecewise linear function (dashed black), or a smoother function (bold blue). The underlying assumption is that the target function is smooth, i.e., does not vary much locally. This allows the learner to generalize locally, and it works very well so long as, as in the figure, there are enough examples to cover most of the ups and downs of the target function.
To obtain even more smoothness, we can interpolate between neighboring training
examples, as illustrated in Figure 5.6. For example, non-parametric kernel density es-
timation methods and kernel regression methods construct a learned function f of the
form
$$f(x) = b + \sum_{i=1}^{n} \alpha_i K(x, x_i) \qquad (5.22)$$
where the $x_i$ are the n training examples, or
$$f(x) = b + \frac{\sum_{i=1}^{n} \alpha_i K(x, x_i)}{\sum_{j=1}^{n} K(x, x_j)}.$$
If the kernel function K is discrete (e.g. 0 or 1), then this can include the above cases
where f is piecewise constant and a discrete set of regions (no more than one per training
example) can be distinguished. However, better results can often be obtained if K is
smooth, e.g., the Gaussian kernel
$$K(u, v) = N(u - v; 0, \sigma^2 I)$$
where $N(x; \mu, \Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance $\Sigma$. With K a local kernel (Bengio et al., 2006a; Bengio and LeCun, 2007a; Bengio, 2009)$^6$, we can think of each $x_i$ as a template and of the kernel function as a similarity function that matches a template against a test example.
With the Gaussian kernel, we do not have a piecewise constant function but instead a
continuous and smooth function. In fact, the choice of K can be shown to correspond to
a particular form of smoothness. Equivalently, we can think of many of these estimators
as the result of smoothing the empirical distribution by convolving it with a function
associated with the kernel, e.g., the Gaussian kernel density estimator is the empirical
distribution convolved with the Gaussian density.
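As an illustration (a minimal sketch under the assumptions that $\alpha_i = y_i$ and $b = 0$ in the normalized form of Eq. 5.22; not code from the text), Gaussian kernel regression produces a smooth, locally weighted average of the training outputs.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=0.5):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def kernel_regression(x_query, X_train, y_train, sigma=0.5):
    # Normalized kernel smoother: the weights sum to one, so f is a local average.
    weights = np.array([gaussian_kernel(x_query, xi, sigma) for xi in X_train])
    return np.sum(weights * y_train) / np.sum(weights)

X_train = np.linspace(0, 2 * np.pi, 20).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
print(kernel_regression(np.array([1.0]), X_train, y_train))  # close to sin(1) ~ 0.84
```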
Although in classical estimators the $\alpha_i$ of Eq. 5.22 are fixed (e.g., to 1/n for density estimation and to $y_i$ for supervised learning from examples $(x_i, y_i)$), they can be optimized, and this is the basis of more modern non-parametric kernel methods (Schölkopf and Smola, 2002) such as the Support Vector Machine (Boser et al., 1992; Cortes and Vapnik, 1995).
However, as illustrated in Figure 5.6, even though these smooth kernel methods
generalize better, the main thing that has changed is that one can basically interpolate
between the neighboring examples, in some space associated with the kernel. One can
then think of the training examples as control knots which locally specify the shape of
each region and the associated output.
Another type of non-parametric learning algorithm that also breaks the input space into regions, with separate parameters for each region, is the decision tree (Breiman et al., 1984) and its many variants. Each node of the decision tree is associated with a region, which it splits into one sub-region for each child of the node (typically using an axis-aligned cut). Typically, a constant output is returned for all inputs associated
with a leaf node, i.e., within the associated region. Because each example only informs
the region in which it falls about the target output, one cannot have more regions
than training examples. If the target function can be well approximated by cutting the
input space into N regions (with a different answer in each region), then at least N
examples are needed (and a multiple of N is needed to achieve some level of statistical
confidence in the predicted output). All this is also true if the tree is used for density estimation (the output is simply an estimate of the density within the region, which can be obtained as the ratio of the number of training examples in the region to the region volume) or if a non-constant (e.g., linear) predictor is associated with each leaf (then more examples are needed within each leaf node, but the relationship between the number of regions and the number of examples remains linear). Below (Section 5.12) we
examine how this makes decision trees and other such learning algorithms based only
on the smoothness prior potential victims of the curse of dimensionality.
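A quick empirical check of this point (a sketch assuming scikit-learn is available; not from the text): no matter how deep we let a decision tree grow, the number of leaf regions it carves out can never exceed the number of training examples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 5))
y = rng.normal(size=N)

# A fully grown tree: at most one leaf region per training example.
tree = DecisionTreeRegressor(max_depth=None).fit(X, y)
print(tree.get_n_leaves() <= N)  # True: at most N regions from N examples
```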
In all cases, the smoothness assumption (Eq. 5.21) allows the learner to generalize locally. Since we assume that the target function obeys $f^*(x) \approx f^*(x + \epsilon)$ most of the time for small $\epsilon$, we can generalize the empirical distribution (or the (x, y) training pairs) to their neighborhood. If $(x_i, y_i)$ is a supervised training example, then we expect $f^*(x_i) \approx y_i$, and therefore if $x_i$ is a near neighbor of x, we expect that $f^*(x) \approx y_i$. By considering more neighbors, we can obtain better generalization, by better exploiting the smoothness assumption.

$^6$ i.e., with K(u, v) large when u = v and decreasing as they get farther apart.
Figure 5.7: Illustration of how learning algorithms that exploit only the smoothness prior
typically break up the input space into regions, with examples in those regions being
used both to define the region boundaries and what the output should be within each
region. The figure shows the case of clustering or 1-nearest-neighbor classifiers, for which
each training example (cross of a different color) defines a region or a template (here, the
different regions form a Voronoi tessellation). The number of these contiguous regions
cannot grow faster than the number of training examples. In the case of a decision tree,
the regions are recursively obtained by axis-aligned cuts within existing regions, but for
these and for kernel machines with a local kernel (such as the Gaussian kernel), the
same property holds, and generalization can only be local: each training example only
informs the learner about how to generalize in some neighborhood around it.
In general, to distinguish O(N) regions in input space, all of these methods require
O(N) examples (and typically there are O(N ) parameters associated with the O(N)
regions). This is illustrated in Figure 5.7 in the case of a nearest-neighbor or clustering
scenario, where each training example can be used to define one region. Is there a way
to represent a complex function that has many more regions to be distinguished than
the number of training examples?
The smoothness assumption and the associated learning algorithms work extremely
well so long as there are enough examples to cover most of the ups and downs of the
target function. This is generally true when the function to be learned is smooth enough,
which is typically the case for low-dimensional data. And if it is not very smooth (we
want to distinguish a huge number of regions compared to the number of examples), is
there any hope to generalize well?
Both of these questions are answered positively in Chapter 14. The key insight is that a very large number of regions, e.g., $O(2^N)$, can be defined with $O(N)$ examples, so long as we introduce some dependencies between the regions, associated with additional priors about the underlying data-generating distribution. In this way, we can actually
generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006b). For example,
if we assume that the target function is periodic, then we can generalize from examples
of a few periods to an arbitrarily large number of repetitions of the basic pattern.
However, we want a more general-purpose assumption, and the idea in deep learning is to assume that the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy. These apparently mild assumptions allow
an exponential gain in the relationship between the number of examples and the number
of regions that can be distinguished, as discussed in Chapter 14.
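To make the periodicity example above concrete (a minimal sketch with assumed details, e.g. a known period T and a small Fourier basis; not from the text): fitting a periodic basis on examples drawn from a single period lets the learner predict accurately arbitrarily far outside the training range, something no purely local method could do.

```python
import numpy as np

T = 2 * np.pi
x_train = np.linspace(0, T, 50)                  # examples from one period only
y_train = np.sin(x_train) + 0.3 * np.sin(3 * x_train)

def features(x):
    # Periodic (Fourier) basis with period T encodes the periodicity prior.
    return np.stack([np.sin(2 * np.pi * k * x / T) for k in range(1, 4)] +
                    [np.cos(2 * np.pi * k * x / T) for k in range(1, 4)], axis=1)

w, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

x_far = np.array([100.0])                        # far outside the training range
prediction = features(x_far) @ w
truth = np.sin(100.0) + 0.3 * np.sin(300.0)
print(prediction, truth)                         # nearly identical
```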
5.12 Manifold Learning and the Curse of Dimensionality
Manifold learning algorithms assume that the data distribution is concentrated in a
small number of dimensions. Manifold learning was introduced in the case of continuous-
valued data and the unsupervised learning setting, although this probability concentra-
tion idea can be generalized to both discrete data and the supervised learning setting:
the key assumption remains that probability mass is highly concentrated.
Is this assumption reasonable? It seems to be true for almost all AI tasks, such as those involving images, sounds, and text. To be convinced of this, consider that if the assumption were false, then sampling uniformly at random should produce probable (data-like) configurations reasonably often. Instead, take the example of generating images by independently picking the grey level (or a binary 0 vs. 1 value) for each pixel. What kind of images do you get? You get “white noise” images that look like the static on old television sets when no signal is coming in, as illustrated in Figure 5.8.
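This thought experiment is easy to reproduce (a trivial sketch, not from the text): sampling each pixel independently and uniformly yields white-noise images, and the probability of hitting any particular natural-looking image is vanishingly small.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_image = rng.integers(0, 2, size=(64, 64))  # each binary pixel chosen uniformly
# The probability of producing any one specific 64x64 binary image is 2**(-4096).
print(noise_image[:4, :4])
```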
Figure 5.8: Sampling images uniformly at random, e.g., by randomly picking each pixel according to a uniform distribution, gives rise to white noise images such as the one illustrated on the left. Although there is a non-zero probability of generating something that looks like a natural image (like those on the right), that probability is exponentially tiny (exponential in the number of pixels!). This suggests that natural images are very “special”, and that they occupy a tiny volume of the space of images.
The above thought experiment, which is in agreement with the many experimental results of the manifold learning literature, e.g. (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand,
2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004),
clearly establishes that for a large class of datasets, the manifold hypothesis is true:
the data generating distribution concentrates in a small number of dimensions, as in
the cartoon of Figure 13.4, from Chapter 13. That chapter explores the relationships
between representation learning and manifold learning: if the data distribution concen-
trates on a smaller number of dimensions, then we can think of these dimensions as
natural coordinates for the data.
An initial hope of early work on manifold learning (Roweis and Saul, 2000; Tenen-
baum et al., 2000) was to reduce the effect of the curse of dimensionality, by reducing
the data to a lower dimensional representation. The curse of dimensionality refers to
the exponential increase of the number of configurations of interest as a function of the
number of dimensions. This is illustrated in Figure 5.9.
Figure 5.9: As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially. In the figure we first consider one-dimensional data (left), i.e., one variable for which we only care to distinguish 10 regions of interest. With enough examples falling within each of these regions (cells, in the figure), we can easily generalize correctly, i.e., estimate the value of the target function within each region (and possibly interpolate between neighboring regions). With 2 dimensions (center), but still caring to distinguish 10 different values of each variable, we need to keep track of up to 10 × 10 = 100 regions, and we need at least that many examples to cover all those regions. With 3 dimensions (right) this grows to $10^3 = 1000$ regions and at least that many examples. For d dimensions and V values to be distinguished along each axis, it looks like we need $O(V^d)$ regions and examples. This is an instance of the curse of dimensionality. However, note that if the data distribution is concentrated on a smaller set of regions, we may actually not need to cover all the possible regions, only those where the probability is non-negligible. Figure graciously provided by, and used with authorization from, Nicolas Chapados.
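The numbers in the figure can be checked directly (a trivial arithmetic sketch, not from the text): with V = 10 distinguishable values per axis, the number of regions, and hence the number of examples a purely local method needs, grows as $V^d$.

```python
V = 10
for d in (1, 2, 3, 10, 100):
    print(d, V ** d)
# d = 1 -> 10, d = 2 -> 100, d = 3 -> 1000, d = 10 -> 10**10, d = 100 -> 10**100
```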
The hope was that by non-linearly projecting the data into a new space of lower dimension, we would reduce the curse of dimensionality by only looking at relevant dimensions, i.e., a smaller set of regions of interest (cells, in Figure 5.9). This can indeed be the case; however, as discussed in Chapter 13, the manifolds can be highly curved and have a very large number of twists, still requiring a very large number of regions to be distinguished (every up and down of each corner of the manifold). And even if we were to reduce the dimensionality of an input from 10000 (e.g., 100 × 100 binary pixels) to 100, $V^{100}$ is still too large to hope to cover with a training set. This rules out the use of purely local generalization (i.e., the smoothness prior only) to model such manifolds, as discussed in Chapter 13 around Figures 13.4 and 13.5. It may also be that, although the effective dimensionality of the data is small, some examples fall outside of the main manifold, and we do not want to systematically lose that information. A sparse representation then becomes a possible way to represent data that is mostly low-dimensional, although occasionally occupying more dimensions. This can be achieved with a high-dimensional representation whose elements are 0 most of the time. We can see that the effective dimension (the number of non-zeros) can then change depending on where we are in input space, which can be useful. Sparse representations are discussed in Section 13.3.
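As a small illustration of this idea (a sketch with made-up vectors, not from the text), a sparse representation is a high-dimensional vector whose entries are mostly zero, and its effective dimension, the number of non-zeros, can differ from one input to another.

```python
import numpy as np

h1 = np.array([0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.7, 0.0])  # effective dimension 2
h2 = np.array([0.0, 2.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # effective dimension 1
print(np.count_nonzero(h1), np.count_nonzero(h2))
```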
5.13 Challenges of High-Dimensional Distributions
High-dimensional random variables actually bring two challenges: a statistical challenge
and a computational challenge.
The statistical challenge regards generalization: as sketched in the previous section,
the number of configurations we may want to distinguish can grow exponentially with the
number of dimensions of interest, and this quickly becomes much larger than the number
of examples one can possibly have (or use with bounded computational resources). This
raises the need to generalize to never-seen configurations, and not just those in the
immediate neighborhood of the training examples. This is difficult because the volume
of these regions where we want to generalize grows exponentially with the number of
dimensions.
In order to achieve that kind of non-local generalization, it is necessary to introduce
priors other than the smoothness prior, and much of deep learning research is about such
priors. In particular, priors that are based on compositionality, such as those arising from learning distributed representations and from a deep composition of representations,
can give an exponential advantage, which can hopefully counter the exponential curse
of dimensionality. Chapter 14 discusses these questions from the angle of representation
learning and the objective of disentangling the underlying factors of variation.
The computational challenge associated with high-dimensional distributions arises
because many algorithms for learning or using a trained model (especially those based
on estimating an explicit probability function) involve intractable computations that
grow exponentially with the number of dimensions.
For example, suppose that we have been able to learn a probability function p(X) which maps points $x = (x_1, x_2, \ldots, x_d)$ in a d-dimensional space to a scalar p(X = x). Then simply predicting the joint probability of a subset of $d'$ of these variables given $d''$ of the others requires computing a sum (or an integral) over all the configurations of the remaining $d - d' - d''$ variables:
$$p(x_1, \ldots, x_{d'} \mid x_{d'+1}, \ldots, x_{d'+d''}) = \frac{\sum_{x_{d'+d''+1}, \ldots, x_d} p(x_1, \ldots, x_d)}{\sum_{x_1, \ldots, x_{d'},\, x_{d'+d''+1}, \ldots, x_d} p(x_1, \ldots, x_d)} \qquad (5.23)$$
This is an example of inference. Exact inference is generally intractable, but approximate
techniques are discussed in Chapter 16.
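The brute-force sketch below (an illustrative toy with a made-up random joint table, not the text's method) computes such a conditional over d binary variables by explicit summation, making the $O(2^d)$ cost of exact inference visible.

```python
import itertools
import numpy as np

d = 10                                   # already 2**10 = 1024 configurations
rng = np.random.default_rng(0)
p = rng.random(2 ** d)
p /= p.sum()                             # a joint distribution p(x_1, ..., x_d)

def joint(x):
    # Look up p(x) for a tuple of d bits via its binary encoding.
    return p[int("".join(map(str, x)), 2)]

def marginal(fixed):
    # Sum the joint over every variable not clamped in `fixed` ({index: value}).
    free = [i for i in range(d) if i not in fixed]
    total = 0.0
    for bits in itertools.product((0, 1), repeat=len(free)):
        x = [0] * d
        for i, v in fixed.items():
            x[i] = v
        for i, v in zip(free, bits):
            x[i] = v
        total += joint(tuple(x))
    return total

def conditional(query, observed):
    # p(query | observed) = p(query, observed) / p(observed), as in Eq. 5.23.
    return marginal({**observed, **query}) / marginal(observed)

print(conditional({0: 1}, {1: 0, 2: 1}))  # p(x_1 = 1 | x_2 = 0, x_3 = 1)
```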
Another common example of inference is where we want to find the most probable configuration out of some exponentially large set, e.g.,
$$\arg\max_{x_1, \ldots, x_{d'}} p(x_1, \ldots, x_{d'} \mid x_{d'+1}, \ldots, x_d).$$
In that case, the answer involves maximizing rather than summing over an exponentially large number of configurations, and optimization or search methods can be used to help approximate a solution (an exact solution is rarely achievable with non-exponential computational resources).
Another, even more common, example is the difficulty of computing the normalization constant, called the partition function, associated with a probability function, studied in more detail in Section 9.2.3 and Chapter 15. Unless we strongly restrict the mathematical form of the probability function, we may end up, e.g., with Restricted Boltzmann Machines (Section 9.7.1) and other undirected graphical models, with a probability function whose value can be computed efficiently only up to a normalization constant Z(θ):
$$p(x) = \frac{p_\theta(x)}{Z(\theta)}$$
where the normalization constant or partition function Z(θ) is a generally intractable sum or integral
$$Z(\theta) = \sum_x p_\theta(x)$$
and $p_\theta$ is an unnormalized probability function with parameters θ. In order to learn θ, we would like to compute the derivative of Z(θ) with respect to θ, and this is generally not computationally tractable in an exact way, but many approximations have been proposed. Techniques to address the issues raised by the partition function and its gradient are discussed in Chapter 15.
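The toy sketch below (an assumed pairwise energy model over binary vectors, not a model from the text) shows why Z(θ) is the bottleneck: the unnormalized $p_\theta(x)$ is cheap to evaluate, but the normalizer is a sum over all $2^d$ configurations, which is only feasible here because d is tiny.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 12
W = rng.normal(size=(d, d))
W = (W + W.T) / 2                        # symmetric pairwise weights
b = rng.normal(size=d)

def p_tilde(x):
    # Unnormalized probability p_theta(x) = exp(x^T W x / 2 + b^T x).
    return np.exp(x @ W @ x / 2 + b @ x)

# Partition function: brute-force sum over all 2**d binary configurations.
Z = sum(p_tilde(np.array(x, dtype=float)) for x in itertools.product((0, 1), repeat=d))

def prob(x):
    return p_tilde(np.array(x, dtype=float)) / Z

print(Z, prob((1,) * d))
```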
Sometimes the x in the above examples should be viewed as including not just visible
variables (the variables in our dataset) but also latent variables. Latent variables are
unobserved variables that we introduce in our probabilistic models in order to increase
their expressive power. Probabilistic graphical models (including ones with latent vari-
ables) are discussed at length in Chapters 9 and 17. In the context of probabilistic models, one often resorts to Markov chain Monte Carlo (MCMC) methods in order to estimate such intractable sums (seen as expected values). MCMC is introduced in Chapter 9 and comes back in Chapters 15 and 17. One issue that arises with MCMC is that it may require sequentially visiting a large fraction of the modes (regions of high probability) of the distribution. If the number of these modes (which would be larger for more complex distributions) is very large (and it is potentially exponentially large), then the MCMC estimators may be too unreliable. Again, this has to do with having an exponentially large set of configurations (or modes, or regions) that we would like to visit in order to perform some computation.
One way to confront these intractable computations is to approximate them, and
many approaches have been proposed, discussed in the chapters listed above. Another
way is to avoid these intractable computations altogether by design, and methods that
do not require such computations are thus very appealing, although maybe at the price of
losing the ability to actually compute the probability function itself. Several generative
models based on auto-encoders have been proposed in recent years, with that motivation,
and are discussed at the end of Chapter 17.