Finally, setting ε = 2m^{-1/(d+1)} and noting that

$$\frac{2^d\,\epsilon^{-d}}{m e} + \epsilon \;=\; \frac{2^d\, 2^{-d}\, m^{d/(d+1)}}{m e} + 2m^{-1/(d+1)} \;=\; m^{-1/(d+1)}\left(\frac{1}{e} + 2\right) \;\le\; 4m^{-1/(d+1)},$$

we conclude our proof.
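As a quick numeric sanity check of this algebra (not part of the text's proof; the values of m and d below are arbitrary illustrative choices):

```python
import math

# Check that, with eps = 2*m^(-1/(d+1)),
#   2^d * eps^(-d) / (m*e) + eps  equals  m^(-1/(d+1)) * (1/e + 2)  and is <= 4*m^(-1/(d+1)).
for m, d in [(1_000, 2), (10_000, 5), (1_000_000, 10)]:
    eps = 2 * m ** (-1 / (d + 1))
    lhs = 2 ** d * eps ** (-d) / (m * math.e) + eps
    mid = m ** (-1 / (d + 1)) * (1 / math.e + 2)
    bound = 4 * m ** (-1 / (d + 1))
    assert math.isclose(lhs, mid) and mid <= bound
    print(f"m={m}, d={d}: {lhs:.5f} = {mid:.5f} <= {bound:.5f}")
```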



                    The theorem implies that if we first fix the data-generating distribution and
                 then let m go to infinity, then the error of the 1-NN rule converges to twice the
                 Bayes error. The analysis can be generalized to larger values of k, showing that
                  the expected error of the k-NN rule converges to (1 + √(8/k)) times the error of the
                 Bayes classifier. This is formalized in Theorem 19.5, whose proof is left as a guided
                 exercise.
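Although the text treats the k-NN rule abstractly, it may help to see it in code. The following is a minimal sketch (not from the text) of the 1-NN/k-NN prediction rule under the Euclidean metric, with an arbitrary synthetic data set for the usage example:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """k-NN rule: label each test point by a majority vote over the labels of
    its k nearest training points (Euclidean distance); k=1 is the 1-NN rule."""
    X_train, X_test = np.asarray(X_train, float), np.asarray(X_test, float)
    y_train = np.asarray(y_train)
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)    # distance to every training point
        nearest = np.argsort(dists)[:k]                # indices of the k closest ones
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])        # majority vote (ties broken arbitrarily)
    return np.array(preds)

# Usage on synthetic 2-D data with a deterministic (zero Bayes error) labeling rule.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
print(knn_predict(X, y, rng.uniform(size=(5, 2)), k=3))
```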



                 19.2.2 The “Curse of Dimensionality”
                 The upper bound given in Theorem 19.3 grows with c (the Lipschitz coefficient of η)
                 and with d, the Euclidean dimension of the domain set X. In fact, it is easy to see that
                  a necessary condition for the last term in Theorem 19.3 to be smaller than ε is that
                  m ≥ (4c√d/ε)^{d+1}. That is, the size of the training set should increase exponentially
                 with the dimension. The following theorem tells us that this is not just an artifact
                 of our upper bound, but, for some distributions, this amount of examples is indeed
                 necessary for learning with the NN rule.
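
Before stating the theorem, a quick illustration (not in the text) of how fast the requirement m ≥ (4c√d/ε)^{d+1} grows with the dimension; the choices c = 1 and ε = 0.1 below are arbitrary:

```python
import math

# Training-set size required by the condition m >= (4*c*sqrt(d)/eps)^(d+1),
# for an illustrative Lipschitz constant c and accuracy eps.
c, eps = 1.0, 0.1
for d in (1, 2, 5, 10, 20):
    m_needed = (4 * c * math.sqrt(d) / eps) ** (d + 1)
    print(f"d = {d:2d}: m >= {m_needed:.2e}")
```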

                  Theorem 19.4. For any c > 1, and every learning rule, L, there exists a distribution
                  over [0,1]^d × {0,1}, such that η(x) is c-Lipschitz, the Bayes error of the distribution is
                  0, but for sample sizes m ≤ (c + 1)^d / 2, the true error of the rule L is greater than 1/4.
                  Proof. Fix any values of c and d. Let G_c^d be the grid on [0,1]^d with distance of
                  1/c between points on the grid. That is, each point on the grid is of the form
                  (a_1/c, ..., a_d/c) where a_i is in {0, ..., c−1, c}. Note that, since any two distinct points
                  on this grid are at least 1/c apart, any function η : G_c^d → [0,1] is a c-Lipschitz
                  function. It follows that the set of all c-Lipschitz functions over G_c^d contains the
                  set of all binary valued functions over that domain. We can therefore invoke the
                  No-Free-Lunch result (Theorem 5.1) to obtain a lower bound on the needed sam-
                  ple sizes for learning that class. The number of points on the grid is (c + 1)^d; hence,
                  if m < (c + 1)^d / 2, Theorem 5.1 implies the lower bound we are after.
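
As a small illustration of the construction in this proof (not from the text), the snippet below enumerates the grid G_c^d for modest values of c and d and confirms that it contains (c+1)^d points, the quantity that enters the No-Free-Lunch bound:

```python
import itertools

def grid_points(c, d):
    """All points (a_1/c, ..., a_d/c) of the grid G_c^d on [0,1]^d,
    with each a_i ranging over {0, 1, ..., c}."""
    coords = [a / c for a in range(c + 1)]
    return list(itertools.product(coords, repeat=d))

for c, d in [(2, 2), (3, 3), (4, 5)]:
    pts = grid_points(c, d)
    assert len(pts) == (c + 1) ** d
    print(f"c={c}, d={d}: |grid| = {len(pts)}, lower bound on m is (c+1)^d/2 = {(c + 1) ** d // 2}")
```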
                    The exponential dependence on the dimension is known as the curse of dimen-
                 sionality. As we saw, the 1-NN rule might fail if the number of examples is smaller
                  than Ω((c + 1)^d). Therefore, while the 1-NN rule does not restrict itself to a prede-
                 fined set of hypotheses, it still relies on some prior knowledge – its success depends
                 on the assumption that the dimension and the Lipschitz constant of the underlying
                 distribution, η, are not too high.