Page 363 - Understanding Machine Learning

28.2 The Lower Bound for the Agnostic Case


                 Next, fix some learning algorithm A. As in the proof of the No-Free-Lunch
              theorem, we have that

\[
\begin{aligned}
&\max_{D_b : b \in \{\pm 1\}^d} \; \mathop{\mathbb{E}}_{S \sim D_b^m} \Big[ L_{D_b}(A(S)) - \min_{h \in \mathcal{H}} L_{D_b}(h) \Big] && (28.3) \\
&\quad \ge \mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim D_b^m} \Big[ L_{D_b}(A(S)) - \min_{h \in \mathcal{H}} L_{D_b}(h) \Big] && (28.4) \\
&\quad = \mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim D_b^m} \left[ \rho \cdot \frac{\bigl|\{ i \in [d] : A(S)(c_i) \neq b_i \}\bigr|}{d} \right] && (28.5) \\
&\quad = \frac{\rho}{d} \sum_{i=1}^{d} \mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{S \sim D_b^m} \mathbb{1}_{[A(S)(c_i) \neq b_i]}, && (28.6)
\end{aligned}
\]
where the first equality follows from Equation (28.2). In addition, using the definition of D_b, to sample S ∼ D_b^m we can first sample (j_1, …, j_m) ∼ U([d])^m, set x_r = c_{j_r}, and finally sample y_r such that P[y_r = b_{j_r}] = (1 + ρ)/2. Let us simplify the notation and use y ∼ b to denote sampling according to P[y = b] = (1 + ρ)/2. Therefore, the right-hand side of Equation (28.6) equals
\[
\frac{\rho}{d} \sum_{i=1}^{d} \mathop{\mathbb{E}}_{j \sim U([d])^m} \; \mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{\forall r,\, y_r \sim b_{j_r}} \mathbb{1}_{[A(S)(c_i) \neq b_i]}. \tag{28.7}
\]
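The two-step sampling procedure for S ∼ D_b^m described above can be sketched in code. The following is a hypothetical illustration (the function name `sample_S` and the representation of examples as `(index, label)` pairs are assumptions, not from the text): draw each index j_r uniformly from [d], then draw the label y_r so that it agrees with b_{j_r} with probability (1 + ρ)/2.

```python
import random

def sample_S(b, m, rho, rng=random):
    """Sample S ~ D_b^m: draw j_r ~ U([d]), set x_r = c_{j_r}
    (represented here by the index j_r), and draw y_r so that
    P[y_r = b_{j_r}] = (1 + rho) / 2."""
    d = len(b)
    S = []
    for _ in range(m):
        j = rng.randrange(d)              # j_r ~ U([d])
        if rng.random() < (1 + rho) / 2:
            y = b[j]                      # y_r agrees with b_{j_r}
        else:
            y = -b[j]                     # y_r disagrees with b_{j_r}
        S.append((j, y))
    return S
```

At ρ = 1 the labels are deterministic (y_r = b_{j_r} always), and at ρ = 0 they are uniform coin flips, matching the distance-ρ-from-uniform interpretation of D_b.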
We now proceed in two steps. First, we show that among all learning algorithms, A, the one which minimizes Equation (28.7) (and hence also Equation (28.4)) is the Maximum-Likelihood learning rule, denoted A_ML. Formally, for each i, A_ML(S)(c_i) is the majority vote among the set {y_r : r ∈ [m], x_r = c_i}. Second, we lower bound Equation (28.7) for A_ML.
Lemma 28.1. Among all algorithms, Equation (28.4) is minimized for A being the Maximum-Likelihood algorithm, A_ML, defined as
\[
\forall i, \quad A_{\mathrm{ML}}(S)(c_i) = \operatorname{sign}\!\left( \sum_{r : x_r = c_i} y_r \right).
\]
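This rule is simply a per-point majority vote over the observed labels. A minimal sketch (the helper name `A_ML` and the `(index, label)` encoding of S are assumptions for illustration; the text does not specify a tie-breaking rule, so ties are broken here as +1):

```python
def A_ML(S, d):
    """Maximum-Likelihood rule: at each point c_i, predict the sign of
    the sum of labels y_r over examples with x_r = c_i."""
    totals = [0] * d
    for j, y in S:          # S is a list of (index of c_{j_r}, y_r) pairs
        totals[j] += y
    # sign of the vote total; a tie (total == 0) is broken arbitrarily as +1
    return [1 if t >= 0 else -1 for t in totals]
```

For example, if c_0 receives labels (+1, +1, −1) and c_1 receives (−1, −1), the rule outputs +1 at c_0 and −1 at c_1.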
Proof. Fix some j ∈ [d]^m. Note that given j and y ∈ {±1}^m, the training set S is fully determined. Therefore, we can write A(j, y) instead of A(S). Let us also fix i ∈ [d]. Denote by b^{¬i} the sequence (b_1, …, b_{i−1}, b_{i+1}, …, b_d). Also, for any y ∈ {±1}^m, let y^I denote the elements of y corresponding to indices for which j_r = i and let y^{¬I} be the rest of the elements of y. We have
\[
\begin{aligned}
&\mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \; \mathop{\mathbb{E}}_{\forall r,\, y_r \sim b_{j_r}} \mathbb{1}_{[A(S)(c_i) \neq b_i]} \\
&\quad = \frac{1}{2} \sum_{b_i \in \{\pm 1\}} \; \mathop{\mathbb{E}}_{b^{\neg i} \sim U(\{\pm 1\}^{d-1})} \sum_{y} P[y \mid b^{\neg i}, b_i] \, \mathbb{1}_{[A(j,y)(c_i) \neq b_i]} \\
&\quad = \mathop{\mathbb{E}}_{b^{\neg i} \sim U(\{\pm 1\}^{d-1})} \sum_{y^{\neg I}} P[y^{\neg I} \mid b^{\neg i}] \left( \frac{1}{2} \sum_{y^{I}} \sum_{b_i \in \{\pm 1\}} P[y^{I} \mid b_i] \, \mathbb{1}_{[A(j,y)(c_i) \neq b_i]} \right).
\end{aligned}
\]
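The second equality uses that, conditioned on j, the labels are independent and each y_r depends only on b_{j_r}: the likelihood factors as P[y | b^{¬i}, b_i] = P[y^I | b_i] · P[y^{¬I} | b^{¬i}]. A small numerical check of this factorization, under the sampling model defined earlier (all names here are hypothetical illustration, not from the text):

```python
from math import prod, isclose

def likelihood(y, j, b, rho):
    """P[y | b] given indices j: each y_r equals b_{j_r} w.p. (1 + rho)/2."""
    return prod((1 + rho) / 2 if y_r == b[j_r] else (1 - rho) / 2
                for y_r, j_r in zip(y, j))

# Example: d = 3, m = 4, target coordinate i = 1.
rho = 0.4
b = [1, -1, 1]
j = [0, 1, 1, 2]
y = [1, -1, 1, -1]
i = 1

# Split y into y^I (positions with j_r = i) and y^{not I} (the rest).
I = [r for r in range(len(j)) if j[r] == i]
notI = [r for r in range(len(j)) if j[r] != i]

full = likelihood(y, j, b, rho)
factored = (likelihood([y[r] for r in I], [j[r] for r in I], b, rho)
            * likelihood([y[r] for r in notI], [j[r] for r in notI], b, rho))
assert isclose(full, factored)  # P[y | b] = P[y^I | b_i] * P[y^{not I} | b^{not i}]
```

The assertion holds by construction, since the likelihood is a product over coordinates and each factor involves only the single bit b_{j_r}.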