
Example 6.1. Let $\mathcal{H}$ be the set of threshold functions over the real line, namely, $\mathcal{H} = \{h_a : a \in \mathbb{R}\}$, where $h_a : \mathbb{R} \to \{0,1\}$ is a function such that $h_a(x) = \mathbb{1}_{[x<a]}$. To remind the reader, $\mathbb{1}_{[x<a]}$ is 1 if $x < a$ and 0 otherwise. Clearly, $\mathcal{H}$ is of infinite size. Nevertheless, the following lemma shows that $\mathcal{H}$ is learnable in the PAC model using the ERM algorithm.
Lemma 6.1. Let $\mathcal{H}$ be the class of thresholds as defined earlier. Then, $\mathcal{H}$ is PAC learnable, using the ERM rule, with sample complexity of $m_{\mathcal{H}}(\epsilon,\delta) \leq \lceil \log(2/\delta)/\epsilon \rceil$.
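For concreteness, a worked instance of this bound (the particular accuracy and confidence values $\epsilon = 0.1$, $\delta = 0.05$ are illustrative choices, not from the text):
$$m_{\mathcal{H}}(0.1,\, 0.05) \;\leq\; \left\lceil \frac{\log(2/0.05)}{0.1} \right\rceil \;=\; \left\lceil \frac{\log 40}{0.1} \right\rceil \;=\; \lceil 36.9 \rceil \;=\; 37,$$
so 37 examples already suffice to guarantee error at most $0.1$ with probability at least $0.95$.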


Proof. Let $a^*$ be a threshold such that the hypothesis $h^*(x) = \mathbb{1}_{[x<a^*]}$ achieves $L_{\mathcal{D}}(h^*) = 0$. Let $\mathcal{D}_x$ be the marginal distribution over the domain $\mathcal{X}$ and let $a_0 < a^* < a_1$ be such that
$$\mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a_0, a^*)] \;=\; \mathbb{P}_{x \sim \mathcal{D}_x}[x \in (a^*, a_1)] \;=\; \epsilon.$$

[Figure: the real line with the points $a_0$, $a^*$, and $a_1$ marked; each of the intervals $(a_0, a^*)$ and $(a^*, a_1)$ carries probability mass $\epsilon$.]
(If $\mathcal{D}_x((-\infty, a^*)) \leq \epsilon$ we set $a_0 = -\infty$, and similarly for $a_1$.) Given a training set $S$, let $b_0 = \max\{x : (x,1) \in S\}$ and $b_1 = \min\{x : (x,0) \in S\}$ (if no example in $S$ is positive we set $b_0 = -\infty$, and if no example in $S$ is negative we set $b_1 = \infty$). Let $b_S$ be a threshold corresponding to an ERM hypothesis, $h_S$, which implies that $b_S \in (b_0, b_1)$. Therefore, a sufficient condition for $L_{\mathcal{D}}(h_S) \leq \epsilon$ is that both $b_0 \geq a_0$ and $b_1 \leq a_1$. In other words,
$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \;\leq\; \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0 \,\vee\, b_1 > a_1],$$
                 and using the union bound we can bound the preceding by

$$\mathbb{P}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) > \epsilon] \;\leq\; \mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] + \mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1]. \tag{6.1}$$
The event $b_0 < a_0$ happens if and only if all examples in $S$ are not in the interval $(a_0, a^*)$, whose probability mass is defined to be $\epsilon$, namely,
$$\mathbb{P}_{S \sim \mathcal{D}^m}[b_0 < a_0] \;=\; \mathbb{P}_{S \sim \mathcal{D}^m}[\forall (x,y) \in S,\ x \notin (a_0, a^*)] \;=\; (1-\epsilon)^m \;\leq\; e^{-\epsilon m}.$$
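As a side remark (not spelled out in the text), the last step is the elementary inequality $1 - \epsilon \leq e^{-\epsilon}$; since here $\epsilon \in (0,1)$, both sides are nonnegative and the inequality is preserved when raised to the power $m$:
$$(1-\epsilon)^m \;\leq\; \left(e^{-\epsilon}\right)^m \;=\; e^{-\epsilon m}.$$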
Since we assume $m > \log(2/\delta)/\epsilon$, it follows that the right-hand side is at most $\delta/2$. In the same way it is easy to see that $\mathbb{P}_{S \sim \mathcal{D}^m}[b_1 > a_1] \leq \delta/2$. Combining this with Equation (6.1) we conclude our proof.
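The ERM rule analyzed in this proof is simple enough to state in a few lines of code. The following is a minimal sketch, not from the book: the function name erm_threshold and the test setup ($\mathcal{D}_x$ uniform on $[0,1]$, $a^* = 0.5$, $\epsilon = 0.1$, $\delta = 0.05$) are illustrative assumptions. The function returns a threshold $b_S \in (b_0, b_1)$ exactly as in the proof.

# Minimal sketch of the ERM rule for threshold functions (Lemma 6.1).
# Assumption: labels are realizable by some h_{a*}(x) = 1[x < a*], so
# every positive example lies strictly below every negative example.
import numpy as np

def erm_threshold(xs, ys):
    """Return a threshold b_S with zero empirical error on (xs, ys)."""
    pos = xs[ys == 1]
    neg = xs[ys == 0]
    b0 = pos.max() if pos.size else -np.inf   # largest positive example
    b1 = neg.min() if neg.size else np.inf    # smallest negative example
    # Any b_S in (b0, b1) is an ERM hypothesis; pick a convenient point.
    if np.isneginf(b0) and np.isposinf(b1):
        return 0.0
    if np.isneginf(b0):
        return b1 - 1.0
    if np.isposinf(b1):
        return b0 + 1.0
    return (b0 + b1) / 2.0

# Illustrative check under an assumed D_x = Uniform(0, 1) with a* = 0.5.
rng = np.random.default_rng(0)
eps, delta = 0.1, 0.05
m = int(np.ceil(np.log(2 / delta) / eps))     # = 37, the bound of Lemma 6.1
xs = rng.uniform(0.0, 1.0, size=m)
ys = (xs < 0.5).astype(int)                   # labels given by h_{a*}
b_S = erm_threshold(xs, ys)
# Under Uniform(0, 1) the true error is the length of the interval
# between b_S (clipped to [0, 1]) and a* = 0.5.
true_error = abs(np.clip(b_S, 0.0, 1.0) - 0.5)
print(f"m = {m}, b_S = {b_S:.3f}, L_D(h_S) = {true_error:.3f}")

With probability at least $1-\delta = 0.95$ over the draw of the sample, the printed true error is below $\epsilon = 0.1$, matching the guarantee of the lemma.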

                 6.2 THE VC-DIMENSION

We see, therefore, that while finiteness of $\mathcal{H}$ is a sufficient condition for learnability, it is not a necessary condition. As we will show, a property called the VC-dimension of a hypothesis class gives the correct characterization of its learnability. To motivate the definition of the VC-dimension, let us recall the No-Free-Lunch theorem (Theorem 5.1) and its proof. There, we have shown that without restricting the hypothesis class, for any learning algorithm, an adversary can construct a distribution for which the learning algorithm will perform poorly, while there is another