Finally, setting $\epsilon = 2m^{-1/(d+1)}$ and noting that
$$
\frac{2^d \epsilon^{-d}}{me} + \epsilon
= \frac{2^d\, 2^{-d}\, m^{d/(d+1)}}{me} + 2m^{-1/(d+1)}
= m^{-1/(d+1)}\Bigl(\frac{1}{e} + 2\Bigr)
\le 4m^{-1/(d+1)},
$$
we conclude our proof.
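As a quick numerical sanity check of the algebra above (our own addition, not part of the book's proof), the following Python snippet evaluates both sides of the chain of equalities for a few arbitrary values of m and d:

```python
import math

# Check: with eps = 2 * m**(-1/(d+1)),
#   2**d * eps**(-d) / (m*e) + eps  equals  m**(-1/(d+1)) * (1/e + 2)  <=  4 * m**(-1/(d+1))
for d in (1, 5, 20):                 # arbitrary dimensions
    for m in (10, 1000, 10**6):      # arbitrary sample sizes
        eps = 2 * m ** (-1 / (d + 1))
        lhs = 2 ** d * eps ** (-d) / (m * math.e) + eps
        rhs = m ** (-1 / (d + 1)) * (1 / math.e + 2)
        assert math.isclose(lhs, rhs) and rhs <= 4 * m ** (-1 / (d + 1))
print("algebraic identity and final bound hold for all tested (d, m) pairs")
```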
The theorem implies that if we first fix the data-generating distribution and
then let m go to infinity, then the error of the 1-NN rule converges to twice the
Bayes error. The analysis can be generalized to larger values of k, showing that
the expected error of the $k$-NN rule converges to $(1 + \sqrt{8/k})$ times the error of the
Bayes classifier. This is formalized in Theorem 19.5, whose proof is left as a guided
exercise.
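As a concrete reference point, here is a minimal sketch of the k-NN majority-vote rule discussed here; it is our own illustration rather than the book's pseudocode, and the function and variable names are ours. Setting k = 1 recovers the 1-NN rule.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict a binary label for x by majority vote over its k nearest
    training points in Euclidean distance; k=1 gives the 1-NN rule.
    Use an odd k to avoid ties."""
    dists = np.linalg.norm(X_train - x, axis=1)    # distances to all training points
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return int(np.round(y_train[nearest].mean()))  # majority vote for labels in {0, 1}

# Tiny usage example with arbitrary data
X_train = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y_train = np.array([0, 1, 0, 1])
print(knn_predict(X_train, y_train, np.array([0.15, 0.15]), k=3))  # -> 0
```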
19.2.2 The “Curse of Dimensionality”
The upper bound given in Theorem 19.3 grows with c (the Lipschitz coefficient of η)
and with d, the Euclidean dimension of the domain set X. In fact, it is easy to see that
a necessary condition for the last term in Theorem 19.3 to be smaller than $\epsilon$ is that
$m \ge (4c\sqrt{d}/\epsilon)^{d+1}$. That is, the size of the training set should increase exponentially
with the dimension. The following theorem tells us that this is not just an artifact
of our upper bound, but, for some distributions, this amount of examples is indeed
necessary for learning with the NN rule.
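To see how quickly this required sample size grows, the following snippet (our own illustration, with arbitrary choices of c and ε) evaluates $(4c\sqrt{d}/\epsilon)^{d+1}$ for a few dimensions:

```python
import math

c, eps = 2.0, 0.1            # arbitrary Lipschitz coefficient and target accuracy
for d in (1, 2, 5, 10):
    m_needed = (4 * c * math.sqrt(d) / eps) ** (d + 1)
    print(f"d = {d:2d}:  m >= {m_needed:.3e}")
# The required m grows exponentially with d (the "curse of dimensionality").
```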
Theorem 19.4. For any $c > 1$, and every learning rule, $L$, there exists a distribution
over $[0,1]^d \times \{0,1\}$, such that $\eta(x)$ is $c$-Lipschitz, the Bayes error of the distribution is
$0$, but for sample sizes $m \le (c+1)^d/2$, the true error of the rule $L$ is greater than $1/4$.
Proof. Fix any values of $c$ and $d$. Let $G_c^d$ be the grid on $[0,1]^d$ with distance of
$1/c$ between points on the grid. That is, each point on the grid is of the form
$(a_1/c, \ldots, a_d/c)$, where each $a_i$ is in $\{0, \ldots, c-1, c\}$. Note that, since any two distinct points
on this grid are at least $1/c$ apart, any function $\eta : G_c^d \to [0,1]$ is a $c$-Lipschitz
function. It follows that the set of all $c$-Lipschitz functions over $G_c^d$ contains the
set of all binary valued functions over that domain. We can therefore invoke the
No-Free-Lunch result (Theorem 5.1) to obtain a lower bound on the needed sam-
ple sizes for learning that class. The number of points on the grid is $(c+1)^d$; hence,
if $m < (c+1)^d/2$, Theorem 5.1 implies the lower bound we are after.
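The grid construction in the proof is easy to reproduce numerically. The sketch below (our own illustration, taking small integer values of c and d) builds $G_c^d$, checks that it has $(c+1)^d$ points, and verifies that distinct grid points are at least $1/c$ apart, which is what makes every $[0,1]$-valued function on the grid $c$-Lipschitz:

```python
import itertools, math

c, d = 3, 2                                   # small illustrative values (c taken integer here)
grid = [tuple(a / c for a in point)
        for point in itertools.product(range(c + 1), repeat=d)]
assert len(grid) == (c + 1) ** d              # number of grid points is (c+1)^d

min_dist = min(math.dist(p, q)
               for p, q in itertools.combinations(grid, 2))
assert min_dist >= 1 / c - 1e-12              # distinct points are at least 1/c apart
print(len(grid), min_dist)                    # -> 16 0.333...
```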
The exponential dependence on the dimension is known as the curse of dimen-
sionality. As we saw, the 1-NN rule might fail if the number of examples is smaller
than $\Omega((c+1)^d)$. Therefore, while the 1-NN rule does not restrict itself to a prede-
fined set of hypotheses, it still relies on some prior knowledge – its success depends
on the assumption that the dimension and the Lipschitz constant of the underlying
distribution, η, are not too high.