
Compression Bounds

the examples in S. Note that this protocol is equivalent to the protocol we defined before; hence Lemma 30.1 still holds.

Applying a union bound over the choice of the sequence of indices, we obtain the following theorem.
Theorem 30.2. Let k be an integer and let B : Z^k → H be a mapping from sequences of k examples to the hypothesis class. Let m ≥ 2k be a training set size and let A : Z^m → H be a learning rule that receives a training sequence S of size m and returns a hypothesis such that A(S) = B(z_{i_1}, ..., z_{i_k}) for some (i_1, ..., i_k) ∈ [m]^k. Let V = {z_j : j ∉ (i_1, ..., i_k)} be the set of examples which were not selected for defining A(S). Then, with probability of at least 1 − δ over the choice of S we have

    L_{\mathcal{D}}(A(S)) \;\le\; L_V(A(S)) + \sqrt{L_V(A(S))\,\frac{4k\log(m/\delta)}{m}} + \frac{8k\log(m/\delta)}{m}.
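To get a feel for the scale of this bound, the following minimal sketch (the helper name compression_bound is ours, and a 0-1 loss taking values in [0,1] is assumed) simply evaluates the right-hand side of Theorem 30.2 for given values of m, k, δ, and the validation error L_V(A(S)).

    import math

    def compression_bound(l_v, m, k, delta):
        # Right-hand side of Theorem 30.2:
        # L_V + sqrt(L_V * 4k log(m/delta) / m) + 8k log(m/delta) / m
        t = k * math.log(m / delta) / m
        return l_v + math.sqrt(4 * l_v * t) + 8 * t

    # Illustrative numbers (not from the text): 100,000 examples,
    # compression size 50, validation error 2%, confidence 95%.
    print(compression_bound(l_v=0.02, m=100_000, k=50, delta=0.05))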
Proof. For any I ∈ [m]^k let h_I = B(z_{i_1}, ..., z_{i_k}). Let n = m − k. Combining Lemma 30.1 with the union bound we have

    \mathbb{P}\Big[\exists I \in [m]^k \text{ s.t. } L_{\mathcal{D}}(h_I) - L_V(h_I) \ge \sqrt{\frac{2 L_V(h_I)\log(1/\delta)}{n}} + \frac{4\log(1/\delta)}{n}\Big]
    \;\le\; \sum_{I \in [m]^k} \mathbb{P}\Big[L_{\mathcal{D}}(h_I) - L_V(h_I) \ge \sqrt{\frac{2 L_V(h_I)\log(1/\delta)}{n}} + \frac{4\log(1/\delta)}{n}\Big]
    \;\le\; m^k \delta.

Denote δ' = m^k δ. Using the assumption k ≤ m/2, which implies that n = m − k ≥ m/2, together with the observation that log(1/δ) = log(m^k/δ') ≤ k log(m/δ'), the above implies that with probability of at least 1 − δ' we have

    L_{\mathcal{D}}(A(S)) \;\le\; L_V(A(S)) + \sqrt{L_V(A(S))\,\frac{4k\log(m/\delta')}{m}} + \frac{8k\log(m/\delta')}{m},

which concludes our proof.

As a direct corollary we obtain:
Corollary 30.3. Assuming the conditions of Theorem 30.2, and further assuming that L_V(A(S)) = 0, then, with probability of at least 1 − δ over the choice of S, we have

    L_{\mathcal{D}}(A(S)) \;\le\; \frac{8k\log(m/\delta)}{m}.
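For instance (an illustrative calculation, not from the text, taking log to be the natural logarithm), with m = 100,000, k = 50, and δ = 0.05 the corollary gives L_{\mathcal{D}}(A(S)) ≤ 8 · 50 · log(100,000/0.05)/100,000 ≈ 0.058; that is, a hypothesis reconstructed from 50 examples and consistent with all the remaining ones errs on at most roughly 6% of the distribution.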
                    These results motivate the following definition:
Definition 30.4. (Compression Scheme) Let H be a hypothesis class of functions from X to Y and let k be an integer. We say that H has a compression scheme of size k if the following holds:

For all m there exists A : Z^m → [m]^k and B : Z^k → H such that for all h ∈ H, if we feed any training set of the form (x_1, h(x_1)), ..., (x_m, h(x_m)) into A and then feed (x_{i_1}, h(x_{i_1})), ..., (x_{i_k}, h(x_{i_k})) into B, where (i_1, ..., i_k) is the output of A, then the output of B, denoted h', satisfies L_S(h') = 0.
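As a concrete illustration (not taken from the text), the sketch below implements a size-2 compression scheme for the class of intervals on the real line: A keeps the indices of the leftmost and rightmost positive examples, and B returns the tightest interval containing them. If the labels were generated by some interval, every negative example lies outside that interval while every positive example lies inside the reconstructed one, so L_S(h') = 0. The function names A and B mirror the definition, but the handling of an all-negative sample is our own convention.

    # Minimal sketch of a size-2 compression scheme for intervals on the line,
    # under the realizability assumption of Definition 30.4.

    def A(S):
        """Compress: indices of the leftmost and rightmost positive examples."""
        pos = [i for i, (x, y) in enumerate(S) if y == 1]
        if not pos:
            # All-negative sample: repeat an arbitrary index; B will then
            # decompress to the empty interval (a convention, not from the text).
            return (0, 0)
        left = min(pos, key=lambda i: S[i][0])
        right = max(pos, key=lambda i: S[i][0])
        return (left, right)

    def B(examples):
        """Decompress: the tightest interval containing the selected positives."""
        xs = [x for x, y in examples if y == 1]
        if not xs:
            return lambda x: 0          # empty interval
        a, b = min(xs), max(xs)
        return lambda x: 1 if a <= x <= b else 0

    # If S was labeled by some interval, the reconstructed hypothesis h_prime
    # makes no mistakes on S, i.e. L_S(h_prime) = 0.
    S = [(0.1, 0), (0.4, 1), (0.7, 1), (0.9, 0)]
    h_prime = B([S[i] for i in A(S)])
    assert all(h_prime(x) == y for x, y in S)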