
2006a; Bengio and LeCun, 2007a; Bengio, 2009)⁶, we can think of each x_i as a template and the kernel function as a similarity function that matches a template and a test example.
With the Gaussian kernel, we do not have a piecewise constant function but instead a
continuous and smooth function. In fact, the choice of K can be shown to correspond to
a particular form of smoothness. Equivalently, we can think of many of these estimators
as the result of smoothing the empirical distribution by convolving it with a function
associated with the kernel, e.g., the Gaussian kernel density estimator is the empirical
distribution convolved with the Gaussian density.
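As a concrete illustration of this last point, here is a minimal sketch (Python/NumPy; the 1-D sample and the bandwidth h are made-up values for illustration) of the Gaussian kernel density estimator written as an average of Gaussian densities centered on the training points, i.e., the empirical distribution convolved with a Gaussian:

```python
import numpy as np

def gaussian_kde_estimate(x_query, x_train, h):
    # Average of Gaussian densities N(x_query; x_i, h^2) over the n training points x_i:
    # equivalent to convolving the empirical distribution with a Gaussian of width h.
    z = (x_query[:, None] - x_train[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x_train) * h * np.sqrt(2.0 * np.pi))

x_train = np.array([-1.0, 0.0, 0.2, 3.0])   # hypothetical 1-D training sample
x_query = np.linspace(-4.0, 6.0, 5)
print(gaussian_kde_estimate(x_query, x_train, h=0.5))
```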
Although in classical estimators the α_i of Eq. 5.22 are fixed (e.g., to 1/n for density estimation and to y_i for supervised learning from examples (x_i, y_i)), they can be optimized, and this is the basis of more modern non-parametric kernel methods (Schölkopf and Smola, 2002) such as the Support Vector Machine (Boser et al., 1992; Cortes and Vapnik, 1995).
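To make the contrast concrete, the sketch below (Python with scikit-learn, used purely for illustration; the toy labels, the RBF kernel and its gamma are assumptions, and Eq. 5.22 is taken here to have the form f(x) = Σ_i α_i K(x, x_i)) compares a predictor whose coefficients are fixed to α_i = y_i with an SVM that optimizes them:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(U, V, gamma=1.0):
    # Gaussian (RBF) kernel matrix with entries K(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((U[:, None, :] - V[None, :, :]) ** 2, axis=-1))

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.sign(X[:, 0] * X[:, 1])            # toy labels in {-1, +1}
X_test = rng.randn(200, 2)
y_test = np.sign(X_test[:, 0] * X_test[:, 1])

# Fixed coefficients: alpha_i = y_i, i.e., a pure template matcher.
fixed_pred = np.sign(rbf_kernel(X_test, X) @ y)
# Optimized coefficients: the SVM learns the alpha_i (keeping only support vectors).
svm_pred = SVC(kernel="rbf", gamma=1.0).fit(X, y).predict(X_test)

print("fixed alpha_i = y_i accuracy:", np.mean(fixed_pred == y_test))
print("optimized alpha_i (SVM) accuracy:", np.mean(svm_pred == y_test))
```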
However, as illustrated in Figure 5.6, even though these smooth kernel methods generalize better, the main change is that one can now interpolate between neighboring examples, in some space associated with the kernel. One can then think of the training examples as control knots that locally specify the shape of each region and the associated output.
Another type of non-parametric learning algorithm that also breaks the input space
into regions and has separate parameters for each region is the decision tree (Breiman
et al., 1984) and its many variants. Each node of the decision tree is associated with a
region, which it splits into one sub-region per child (typically with an axis-aligned cut). A constant output is usually returned for all inputs associated with a leaf node, i.e., within the associated region. Because each example only informs
the region in which it falls about the target output, one cannot have more regions
than training examples. If the target function can be well approximated by cutting the
input space into N regions (with a different answer in each region), then at least N
examples are needed (and a multiple of N is needed to achieve some level of statistical
confidence in the predicted output). All of this remains true if the tree is used for density estimation (the output is then an estimate of the density within the region, obtained as the ratio of the number of training examples in the region to the region's volume) or if a non-constant (e.g., linear) predictor is associated with each leaf (more examples are then needed within each leaf node, but the relationship between the number of regions and the number of examples remains linear). Below (Section 5.12) we
examine how this makes decision trees and other such learning algorithms based only
on the smoothness prior potential victims of the curse of dimensionality.
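The following small sketch (Python with scikit-learn; the piecewise-constant target and the sample sizes are invented for illustration) shows the point made above: a tree trained on fewer examples than there are regions cannot carve out all of the regions, whereas more examples allow more leaves and lower error:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n_regions = 16
target = lambda x: np.floor(x * n_regions)   # 16 constant regions on [0, 1)

for n_train in (8, 64):
    X = rng.uniform(0.0, 1.0, size=(n_train, 1))
    tree = DecisionTreeRegressor().fit(X, target(X).ravel())
    X_test = rng.uniform(0.0, 1.0, size=(1000, 1))
    err = np.mean(tree.predict(X_test) != target(X_test).ravel())
    # A tree never has more leaves (regions) than training examples.
    print(n_train, "examples ->", tree.get_n_leaves(), "leaves, test error rate", err)
```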
In all cases, the smoothness assumption (Eq. 5.21) allows the learner to generalize locally. Since we assume that the target function obeys f∗(x) ≈ f∗(x + ε) most of the time for small ε, we can generalize the empirical distribution (or the (x, y) training pairs) to their neighborhood. If (x_i, y_i) is a supervised training example, then we expect f∗(x_i) ≈ y_i, and therefore if x_i is a near neighbor of x, we expect that f∗(x) ≈ y_i. By
⁶ i.e., with K(u, v) large when u = v and decreasing as they get farther apart