
Proof of the Fundamental Theorem of Learning Theory


Note that $(S,T) \in B'$ implies $S \in B$ and therefore $\mathbb{1}_{[(S,T)\in B']} = \mathbb{1}_{[(S,T)\in B']}\, \mathbb{1}_{[S\in B]}$, which gives

\[
\mathbb{P}[(S,T) \in B'] \;=\; \mathbb{E}_{S\sim \mathcal{D}^m}\, \mathbb{E}_{T\sim \mathcal{D}^m}\, \mathbb{1}_{[(S,T)\in B']}\, \mathbb{1}_{[S\in B]}
\;=\; \mathbb{E}_{S\sim \mathcal{D}^m}\, \mathbb{1}_{[S\in B]}\, \mathbb{E}_{T\sim \mathcal{D}^m}\, \mathbb{1}_{[(S,T)\in B']},
\]

where the second equality holds because $S$ and $T$ are independent and $\mathbb{1}_{[S\in B]}$ does not depend on $T$.
Fix some $S$. Then either $\mathbb{1}_{[S\in B]} = 0$, or $S \in B$ and then there exists $h_S$ such that $\mathcal{D}(h_S) \ge \epsilon$ and $|h_S \cap S| = 0$. It follows that a sufficient condition for $(S,T) \in B'$ is that $|T \cap h_S| > \frac{\epsilon m}{2}$. Therefore, whenever $S \in B$ we have

\[
\mathbb{E}_{T\sim \mathcal{D}^m}\, \mathbb{1}_{[(S,T)\in B']} \;\ge\; \mathbb{P}_{T\sim \mathcal{D}^m}\!\left[\, |T \cap h_S| > \tfrac{\epsilon m}{2} \,\right].
\]
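To see why this condition suffices, note that membership in $B'$ (as the indicators in the proof of Claim 2 below make explicit) requires some $h$ with $\mathcal{D}(h) \ge \epsilon$, no hits on $S$, and at least $\frac{\epsilon m}{2}$ hits on the combined sample. The hypothesis $h_S$ satisfies the first two requirements by its choice above, and since $|h_S \cap S| = 0$, every hit of $h_S$ on the combined sample must come from $T$:

\[
|h_S \cap (S,T)| \;=\; |h_S \cap S| + |h_S \cap T| \;=\; |h_S \cap T|.
\]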
But, since we now assume $S \in B$, we know that $\mathcal{D}(h_S) = \rho \ge \epsilon$. Therefore, $|T \cap h_S|$ is a binomial random variable with parameters $\rho$ (probability of success for a single try) and $m$ (number of tries). Chernoff's inequality implies

\[
\mathbb{P}\!\left[\, |T \cap h_S| \le \tfrac{\rho m}{2} \,\right] \;\le\; e^{-\frac{2}{m\rho}\left(m\rho - \frac{m\rho}{2}\right)^2} \;=\; e^{-m\rho/2} \;\le\; e^{-\epsilon m/2} \;\le\; e^{-d \log(1/\delta)/2} \;=\; \delta^{d/2} \;\le\; 1/2.
\]
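The first equality in this chain is routine algebra, while the third inequality uses $\rho \ge \epsilon$ and the fourth holds provided the sample size satisfies $\epsilon m \ge d \log(1/\delta)$, which the theorem's choice of $m$ is assumed to guarantee:

\[
\frac{2}{m\rho}\left(m\rho - \frac{m\rho}{2}\right)^2 \;=\; \frac{2}{m\rho}\cdot\frac{m^2\rho^2}{4} \;=\; \frac{m\rho}{2}.
\]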
Thus,

\[
\mathbb{P}\!\left[\, |T \cap h_S| > \tfrac{\epsilon m}{2} \,\right] \;=\; 1 - \mathbb{P}\!\left[\, |T \cap h_S| \le \tfrac{\epsilon m}{2} \,\right] \;\ge\; 1 - \mathbb{P}\!\left[\, |T \cap h_S| \le \tfrac{\rho m}{2} \,\right] \;\ge\; 1/2,
\]

where the first inequality holds because $\epsilon \le \rho$, so the event $\{|T \cap h_S| \le \frac{\epsilon m}{2}\}$ is contained in the event $\{|T \cap h_S| \le \frac{\rho m}{2}\}$.
                 Combining all the preceding we conclude the proof of Claim 1.
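Spelling the combination out: plugging the bound $\mathbb{E}_{T\sim \mathcal{D}^m}\, \mathbb{1}_{[(S,T)\in B']} \ge \frac{1}{2}$ (valid whenever $S \in B$) into the factored expectation above gives

\[
\mathbb{P}[(S,T)\in B'] \;=\; \mathbb{E}_{S\sim \mathcal{D}^m}\, \mathbb{1}_{[S\in B]}\, \mathbb{E}_{T\sim \mathcal{D}^m}\, \mathbb{1}_{[(S,T)\in B']} \;\ge\; \tfrac{1}{2}\, \mathbb{E}_{S\sim \mathcal{D}^m}\, \mathbb{1}_{[S\in B]} \;=\; \tfrac{1}{2}\, \mathbb{P}[S\in B],
\]

that is, $\mathbb{P}[S\in B] \le 2\, \mathbb{P}[(S,T)\in B']$.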


Claim 2 (Symmetrization):
\[
\mathbb{P}[(S,T) \in B'] \;\le\; e^{-\epsilon m/4}\, \tau_H(2m).
\]
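Taken together with Claim 1, this will give

\[
\mathbb{P}[S \in B] \;\le\; 2\, e^{-\epsilon m/4}\, \tau_H(2m),
\]

which can then be made fully explicit by bounding the growth function $\tau_H(2m)$, e.g., via Sauer's lemma.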

Proof of Claim 2: To simplify notation, let $\alpha = \frac{\epsilon m}{2}$ and, for a sequence $A = (x_1, \ldots, x_{2m})$, let $A_0 = (x_1, \ldots, x_m)$; that is, $A$ plays the role of the pair $(S,T)$ and $A_0$ plays the role of $S$. Using the definition of $B'$ we get that

\[
\mathbb{P}[A \in B'] \;=\; \mathbb{E}_{A\sim \mathcal{D}^{2m}}\, \max_{h\in H}\, \mathbb{1}_{[\mathcal{D}(h)\ge \epsilon]}\, \mathbb{1}_{[|h\cap A_0|=0]}\, \mathbb{1}_{[|h\cap A|\ge \alpha]}
\;\le\; \mathbb{E}_{A\sim \mathcal{D}^{2m}}\, \max_{h\in H}\, \mathbb{1}_{[|h\cap A_0|=0]}\, \mathbb{1}_{[|h\cap A|\ge \alpha]},
\]

where the inequality holds because we simply dropped the indicator $\mathbb{1}_{[\mathcal{D}(h)\ge \epsilon]} \le 1$.
Now, let $H_A$ denote the set of effective hypotheses on $A$, namely, $H_A = \{h \cap A : h \in H\}$. It follows that

\[
\mathbb{P}[A \in B'] \;\le\; \mathbb{E}_{A\sim \mathcal{D}^{2m}}\, \max_{h\in H_A}\, \mathbb{1}_{[|h\cap A_0|=0]}\, \mathbb{1}_{[|h\cap A|\ge \alpha]}
\;\le\; \mathbb{E}_{A\sim \mathcal{D}^{2m}}\, \sum_{h\in H_A}\, \mathbb{1}_{[|h\cap A_0|=0]}\, \mathbb{1}_{[|h\cap A|\ge \alpha]}.
\]
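The point of passing from $H$ to $H_A$ is that $H_A$ is finite: by the definition of the growth function, its size satisfies

\[
|H_A| \;\le\; \tau_H(2m),
\]

so the sum above has at most $\tau_H(2m)$ terms; this is the source of the $\tau_H(2m)$ factor in Claim 2.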
Let $J = \{ j \subset [2m] : |j| = m \}$. For any $j = (j_1, \ldots, j_m) \in J$ and $A = (x_1, \ldots, x_{2m})$, define $A_j = (x_{j_1}, \ldots, x_{j_m})$. Since the elements of $A$ are chosen i.i.d., we have that for any $j \in J$ and any function $f(A, A_0)$ it holds that $\mathbb{E}_{A\sim \mathcal{D}^{2m}}[f(A, A_0)] = \mathbb{E}_{A\sim \mathcal{D}^{2m}}[f(A, A_j)]$. Since this holds for every fixed $j$, it also holds for the expectation over $j$ chosen at random from $J$. In particular, it holds for the function $f(A, A_0) = \sum_{h\in H_A}\, \mathbb{1}_{[|h\cap A_0|=0]}\, \mathbb{1}_{[|h\cap A|\ge \alpha]}$. We