
region associated with a particular leaf node n(x). Because each example only
informs the region in which it falls about which output to produce, one cannot
have more regions than training examples. If the target function can be well
approximated by cutting the input space into N regions (with a different answer
in each region), then at least N examples are needed (and a multiple of N is
needed to achieve some level of statistical confidence in the predicted output).
All this is also true if the tree is used for density estimation (the output is simply
an estimate of the density within the region, which can be obtained as the ratio of
the number of training examples in the region to the region's volume) or if
a non-constant (e.g., linear) predictor is associated with each leaf (more
examples are then needed within each leaf node, but the relationship between the number
of regions and the number of examples remains linear). We examine below how this
may hurt the generalization ability of decision trees and other learning algorithms
that rely only on the smoothness or local constancy priors when the input
is high-dimensional, i.e., how they suffer from the curse of dimensionality.
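
To make the counting argument concrete, here is a minimal sketch (not from the text; it assumes scikit-learn's DecisionTreeRegressor is available) that grows an unpruned tree on targets with many variations and verifies that the number of leaves, i.e., constant-output regions, never exceeds the number of training examples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative check: an unpruned tree carves out at most one leaf (region)
# per training example, so the number of leaves can never exceed n_train.
rng = np.random.RandomState(0)
for n_train in (10, 100, 1000):
    X = rng.uniform(-1.0, 1.0, size=(n_train, 2))
    y = np.sin(4 * X[:, 0]) * np.cos(4 * X[:, 1])            # target with many variations
    tree = DecisionTreeRegressor(random_state=0).fit(X, y)   # grown until leaves are pure
    print(n_train, tree.get_n_leaves())                      # leaf count is bounded by n_train
```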
In all cases, the smoothness assumption (Eq. 5.52) allows the learner to gen-
eralize locally. Since we assume that the target function obeys
$f^*(x) \approx f^*(x + \epsilon)$
most of the time for small $\epsilon$, we can generalize the empirical distribution (or the
$(x, y)$ training pairs) to the neighborhood of the training examples. If $(x^{(i)}, y^{(i)})$
is a supervised (input, target) training example, then we expect $f^*(x^{(i)}) \approx y^{(i)}$,
and therefore if $x$ is a near neighbor of $x^{(i)}$, we expect that $f^*(x) \approx y^{(i)}$. By con-
sidering more neighbors, we can obtain better generalization by better exploiting
the smoothness assumption.
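
As a concrete instance of this kind of local generalization, the following sketch (an illustration of ours, not an algorithm defined in the text) implements a nearest-neighbor predictor: with k = 1 it simply copies the target of the closest training example, and a larger k averages over more neighbors.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict f*(x) by averaging the targets of the k training examples
    nearest to x, relying only on the smoothness prior f*(x) ~ f*(x + eps)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training input
    nearest = np.argsort(dists)[:k]              # indices of the k closest examples
    return y_train[nearest].mean()               # local average as the prediction

# Tiny usage example with made-up data.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
y_train = np.array([0.0, 1.0, 0.1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=2))  # averages the two nearest targets
```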
In general, to distinguish O(N) regions in input space, all of these meth-
ods require O(N) examples (and typically there are O(N) parameters associated
with the O(N) regions). This is illustrated in Figure 5.11 in the case of a nearest-
neighbor or clustering scenario, where each training example can be used to define
one region. Is there a way to represent a complex function that has many more
regions to be distinguished than the number of training examples? Clearly, as-
suming only smoothness of the underlying function will not allow a learner to do
that. For example, imagine that the target function is a kind of checkerboard,
i.e., it has many variations but a simple structure to them, and imagine that
the number of training examples is substantially smaller than the number of black
and white regions. Based on local generalization and the smoothness or local
constancy prior, we could get the correct answer within a constant-colour region,
but we could not correctly predict the checkerboard pattern. The only thing that
an example tells us, with this prior, is that nearby points should have the same
colour, and the only way to get the checkerboard right is to cover all of its cells
with at least one example.
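
The checkerboard argument is easy to check numerically. The sketch below (our illustration, assuming scikit-learn's KNeighborsClassifier; the helper checkerboard is hypothetical) labels points of the unit square by the colour of an 8-by-8 checkerboard cell and shows that a nearest-neighbor learner only recovers the pattern once the training set roughly covers every cell; with far fewer examples than cells, it can only be right within cells that happen to contain an example, exactly as the prior predicts.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def checkerboard(X, n_cells=8):
    """Colour (0 or 1) of the checkerboard cell containing each point of the unit square."""
    cells = np.floor(X * n_cells).astype(int)
    return (cells[:, 0] + cells[:, 1]) % 2

rng = np.random.RandomState(0)
X_test = rng.uniform(size=(5000, 2))
for n_train in (16, 64, 1024):          # an 8x8 board has 64 cells
    X_train = rng.uniform(size=(n_train, 2))
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, checkerboard(X_train))
    acc = clf.score(X_test, checkerboard(X_test))
    print(n_train, round(acc, 2))       # accuracy stays poor until the cells are covered
```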
The smoothness assumption and the associated non-parametric learning algo-