
A Gentle Start

                 Figure 2.1. Each point in the large circle represents a possible $m$-tuple of instances.
                 Each colored oval represents the set of “misleading” $m$-tuples of instances for some
                 “bad” predictor $h \in \mathcal{H}_B$. The ERM can potentially overfit whenever it gets a
                 misleading training set $S$. That is, for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$.
                 Equation (2.9) guarantees that for each individual bad hypothesis, $h \in \mathcal{H}_B$, at most
                 a $(1-\epsilon)^m$ fraction of the training sets would be misleading. In particular, the
                 larger $m$ is, the smaller each of these colored ovals becomes. The union bound formalizes
                 the fact that the area representing the training sets that are misleading with respect to
                 some $h \in \mathcal{H}_B$ (that is, the training sets in $M$) is at most the sum of the
                 areas of the colored ovals. Therefore, it is bounded by $|\mathcal{H}_B|$ times the maximum
                 size of a colored oval. Any sample $S$ outside the colored ovals cannot cause the ERM rule
                 to overfit.
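
                 In symbols, the figure's argument combines Equation (2.9) with the union bound. The
                 display below restates that chain of inequalities in the notation of this chapter:

                 \begin{align*}
                 \mathcal{D}^m(M) \;=\; \mathcal{D}^m\Big(\bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}\Big)
                 \;&\le\; \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\big(\{ S|_x : L_S(h) = 0 \}\big)
                 && \text{(union bound)}\\
                 \;&\le\; |\mathcal{H}_B|\,(1-\epsilon)^m \;\le\; |\mathcal{H}|\, e^{-\epsilon m}.
                 && \text{(Equation (2.9))}
                 \end{align*}

                 Requiring $|\mathcal{H}|\,e^{-\epsilon m} \le \delta$ and solving for $m$ yields the
                 sample-size condition $m \ge \log(|\mathcal{H}|/\delta)/\epsilon$ under which the corollary
                 below holds.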

                 Then, for any labeling function, $f$, and for any distribution, $\mathcal{D}$, for which the
                 realizability assumption holds (that is, for some $h \in \mathcal{H}$, $L_{(\mathcal{D},f)}(h) = 0$),
                 with probability of at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$,
                 we have that for every ERM hypothesis, $h_S$, it holds that

                                               $L_{(\mathcal{D},f)}(h_S) \le \epsilon.$

                    The preceding corollary tells us that for a sufficiently large $m$, the $\mathrm{ERM}_{\mathcal{H}}$
                 rule over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately
                 (up to an error of $\epsilon$) correct. In the next chapter we formally define the model of
                 Probably Approximately Correct (PAC) learning.
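
                 To make “sufficiently large $m$” concrete: the corollary applies whenever
                 $m \ge \log(|\mathcal{H}|/\delta)/\epsilon$, so the required sample size can be computed
                 directly. The following minimal Python sketch is illustrative only; the function name and
                 the example numbers are ours, not from the text.

                     import math

                     def sample_complexity(hypothesis_count, epsilon, delta):
                         # Smallest integer m satisfying m >= log(|H| / delta) / epsilon
                         # (natural logarithm, as in the corollary's premise).
                         return math.ceil(math.log(hypothesis_count / delta) / epsilon)

                     # Example: |H| = 10**6 hypotheses, target error epsilon = 0.01,
                     # and confidence 1 - delta = 0.99.
                     print(sample_complexity(10**6, epsilon=0.01, delta=0.01))  # prints 1843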


                 2.4 EXERCISES

                  2.1 Overfitting of polynomial matching: We have shown that the predictor defined in
                     Equation (2.3) leads to overfitting. While this predictor seems to be very unnatural,
                     the goal of this exercise is to show that it can be described as a thresholded poly-
                     nomial. That is, show that given a training set
                     $S = \{(x_i, f(x_i))\}_{i=1}^{m} \subseteq (\mathbb{R}^d \times \{0,1\})^m$,
                     there exists a polynomial $p_S$ such that $h_S(x) = 1$ if and only if $p_S(x) \ge 0$,
                     where $h_S$ is as defined in Equation (2.3). It follows that learning the class of all
                     thresholded polynomials using the ERM rule may lead to overfitting.
                  2.2 Let $\mathcal{H}$ be a class of binary classifiers over a domain $\mathcal{X}$. Let
                     $\mathcal{D}$ be an unknown distribution over $\mathcal{X}$, and let $f$ be the target
                     hypothesis in $\mathcal{H}$. Fix some $h \in \mathcal{H}$. Show that the expected value
                     of $L_S(h)$ over the choice of $S|_x$ equals $L_{(\mathcal{D},f)}(h)$, namely,

                                              $\mathbb{E}_{S|_x \sim \mathcal{D}^m} [L_S(h)] = L_{(\mathcal{D},f)}(h).$
                  2.3 Axis aligned rectangles: An axis aligned rectangle classifier in the plane is a
                     classifier that assigns the value 1 to a point if and only if it is inside a certain
                     rectangle.