
Proof of the Fundamental Theorem of Learning Theory

The sum within the parentheses is minimized when $A(\mathbf{j},\mathbf{y})(c_i)$ is the maximizer of $P[\mathbf{y}^I \,|\, b_i]$ over $b_i \in \{\pm 1\}$, which is exactly the Maximum-Likelihood rule. Repeating the same argument for all $i$ we conclude our proof.


Fix $i$. For every $\mathbf{j}$, let $n_i(\mathbf{j}) = |\{t : j_t = i\}|$ be the number of examples in which the instance is $c_i$. For the Maximum-Likelihood rule, we have that the quantity
$$
\mathop{\mathbb{E}}_{\mathbf{b}\sim U(\{\pm 1\}^d)}\ \ \mathop{\mathbb{E}}_{\forall r,\ y_r \sim b_{j_r}}\ \mathbb{1}\!\left[A_{\mathrm{ML}}(S)(c_i) \neq b_i\right]
$$
is exactly the probability that a $\mathrm{Binomial}\!\big(n_i(\mathbf{j}),\,(1-\rho)/2\big)$ random variable will be larger than $n_i(\mathbf{j})/2$. Using Lemma B.11, and the assumption $\rho^2 \le 1/2$, we have that
$$
P\!\left[B \ge n_i(\mathbf{j})/2\right] \;\ge\; \frac{1}{2}\left(1 - \sqrt{1 - e^{-2 n_i(\mathbf{j})\rho^2}}\right).
$$
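To unpack the binomial claim briefly (under the labeling model used in this construction, in which each $y_r$ equals $b_{j_r}$ with probability $(1+\rho)/2$): the Maximum-Likelihood rule predicts for $c_i$ the majority value among the $n_i(\mathbf{j})$ observed labels $\{y_t : j_t = i\}$, and each of these labels disagrees with $b_i$ independently with probability $(1-\rho)/2$. Hence, writing $B$ for the number of disagreeing labels,
$$
B \sim \mathrm{Binomial}\!\left(n_i(\mathbf{j}),\ \tfrac{1-\rho}{2}\right),
$$
and the rule errs on $c_i$ precisely when the disagreeing labels form a majority, namely when $B$ exceeds $n_i(\mathbf{j})/2$ (up to how ties are broken when $n_i(\mathbf{j})$ is even).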
We have thus shown that
$$
\begin{aligned}
\frac{\rho}{d}\sum_{i=1}^{d}\ \mathop{\mathbb{E}}_{\mathbf{j}\sim U([d]^m)}\ \mathop{\mathbb{E}}_{\mathbf{b}\sim U(\{\pm 1\}^d)}\ \mathop{\mathbb{E}}_{\forall r,\ y_r \sim b_{j_r}}\ \mathbb{1}\!\left[A(S)(c_i)\neq b_i\right]
&\;\ge\; \frac{\rho}{2d}\sum_{i=1}^{d}\ \mathop{\mathbb{E}}_{\mathbf{j}\sim U([d]^m)}\left(1-\sqrt{1-e^{-2\rho^2 n_i(\mathbf{j})}}\right) \\
&\;\ge\; \frac{\rho}{2d}\sum_{i=1}^{d}\ \mathop{\mathbb{E}}_{\mathbf{j}\sim U([d]^m)}\left(1-\sqrt{2\rho^2\, n_i(\mathbf{j})}\right),
\end{aligned}
$$
where in the last inequality we used the inequality $1-e^{-a}\le a$.
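For completeness, this elementary inequality (equivalently, $e^{-a}\ge 1-a$) applied with $a = 2\rho^2 n_i(\mathbf{j})$, together with the monotonicity of the square root, gives
$$
\sqrt{1-e^{-2\rho^2 n_i(\mathbf{j})}} \;\le\; \sqrt{2\rho^2\, n_i(\mathbf{j})}.
$$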
                    Since the square root function is concave, we can apply Jensen’s inequality to
                 obtain that the above is lower bounded by

$$
\begin{aligned}
&\ge\; \frac{\rho}{2d}\sum_{i=1}^{d}\left(1-\sqrt{2\rho^2\ \mathop{\mathbb{E}}_{\mathbf{j}\sim U([d]^m)} n_i(\mathbf{j})}\right) \\
&=\; \frac{\rho}{2d}\sum_{i=1}^{d}\left(1-\sqrt{2\rho^2\, m/d}\right) \\
&=\; \frac{\rho}{2}\left(1-\sqrt{2\rho^2\, m/d}\right).
\end{aligned}
$$
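The first equality above uses $\mathbb{E}_{\mathbf{j}\sim U([d]^m)}\, n_i(\mathbf{j}) = m/d$: writing $n_i(\mathbf{j}) = \sum_{t=1}^{m} \mathbb{1}[j_t = i]$ and noting that each coordinate $j_t$ is uniform over $[d]$, linearity of expectation gives
$$
\mathop{\mathbb{E}}_{\mathbf{j}}\, n_i(\mathbf{j}) \;=\; \sum_{t=1}^{m} P[j_t = i] \;=\; \frac{m}{d}.
$$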
As long as $m < \frac{d}{8\rho^2}$, this term is larger than $\rho/4$.
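Indeed, if $m < \frac{d}{8\rho^2}$ then $2\rho^2 m/d < 1/4$, hence $\sqrt{2\rho^2 m/d} < 1/2$, and therefore
$$
\frac{\rho}{2}\left(1-\sqrt{2\rho^2\, m/d}\right) \;>\; \frac{\rho}{2}\cdot\frac{1}{2} \;=\; \frac{\rho}{4}.
$$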
In summary, we have shown that if $m < \frac{d}{8\rho^2}$ then for any algorithm there exists a distribution such that
$$
\mathop{\mathbb{E}}_{S\sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S))\right] \;-\; \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) \;\ge\; \frac{\rho}{4}.
$$
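As an aside, and not part of the proof, the tail bound borrowed from Lemma B.11 above can be checked numerically. The following short Python sketch (illustrative only; the helper names are made up for this check) compares the exact tail $P[B \ge n/2]$ of a $\mathrm{Binomial}(n,(1-\rho)/2)$ variable with the lower bound $\frac{1}{2}\big(1-\sqrt{1-e^{-2n\rho^2}}\big)$ for a few values of $n$ and $\rho$ satisfying $\rho^2 \le 1/2$.

    # Numerical sanity check (illustrative sketch, not part of the proof):
    # compare the exact binomial tail P[B >= n/2] with the lower bound
    # (1/2) * (1 - sqrt(1 - exp(-2 * n * rho**2))) obtained via Lemma B.11.
    import math

    def binomial_tail(n, p):
        """Exact P[B >= n/2] for B ~ Binomial(n, p)."""
        threshold = math.ceil(n / 2)
        return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(threshold, n + 1))

    def lemma_lower_bound(n, rho):
        """The bound (1/2) * (1 - sqrt(1 - exp(-2 * n * rho**2)))."""
        return 0.5 * (1 - math.sqrt(1 - math.exp(-2 * n * rho**2)))

    for rho in (0.05, 0.1, 0.2):
        for n in (1, 5, 10, 50, 100):
            tail = binomial_tail(n, (1 - rho) / 2)
            bound = lemma_lower_bound(n, rho)
            assert tail >= bound, (n, rho, tail, bound)
            print(f"n={n:3d}  rho={rho:.2f}  tail={tail:.4f}  bound={bound:.4f}")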