Page 67 - Understanding Machine Learning
P. 67

6.5 Proof of Theorem 6.7  49


              6.5 PROOF OF THEOREM 6.7
              We have already seen that 1 → 2 in Chapter 4. The implications 2 → 3and 3 → 4
              are trivial and so is 2 → 5. The implications 4 → 6and 5 → 6 follow from the No-
              Free-Lunch theorem. The difficult part is to show that 6 → 1. The proof is based on
              two main claims:

                If VCdim(H) = d, then even though H might be infinite, when restricting it to
                                                                  d
                 a finite set C ⊂ X, its “effective” size, |H C |, is only O(|C| ). That is, the size of
                 H C grows polynomially rather than exponentially with |C|. This claim is often
                 referred to as Sauer’s lemma, but it has also been stated and proved indepen-
                 dently by Shelah and by Perles. The formal statement is given in Section 6.5.1
                 later.
                In Section 4 we have shown that finite hypothesis classes enjoy the uniform con-
                 vergence property. In Section 6.5.2 later we generalize this result and show
                 that uniform convergence holds whenever the hypothesis class has a “small
                 effective size.” By “small effective size” we mean classes for which |H C | grows
                 polynomially with |C|.


              6.5.1 Sauer’s Lemma and the Growth Function

              We defined the notion of shattering, by considering the restriction of H to a finite
              set of instances. The growth function measures the maximal “effective” size of H on
              aset of m examples. Formally:
              Definition 6.9 (Growth Function). Let H be a hypothesis class. Then the growth
              function of H, denoted τ H : N → N, is defined as

                                       τ H (m) =  max   |H C |.
                                               C⊂X :|C|=m

              In words, τ H (m) is the number of different functions from a set C of size m to {0,1}
              that can be obtained by restricting H to C.
                                                                             m
                 Obviously, if VCdim(H) = d then for any m ≤ d we have τ H (m) = 2 .In such
              cases, H induces all possible functions from C to {0,1}. The following beautiful
              lemma, proposed independently by Sauer, Shelah, and Perles, shows that when m
              becomes larger than the VC-dimension, the growth function increases polynomially
              rather than exponentially with m.

              Lemma 6.10 (Sauer-Shelah-Perles). Let H be a hypothesis class with VCdim(H) ≤
                                            d   m

              d < ∞. Then, for all m, τ H (m) ≤  i=0  i  . In particular, if m > d + 1 then τ H (m) ≤
                    d
              (em/d) .
              Proof of Sauer’s Lemma*
              To prove the lemma it suffices to prove the following stronger claim: For any C =
              {c 1 ,...,c m } we have

                                 ∀H, |H C |≤ |{B ⊆ C : H shatters B}|.           (6.3)
   62   63   64   65   66   67   68   69   70   71   72