
28.2 The Lower Bound for the Agnostic Case


$D_+$ and $D_-$, such that for $b \in \{\pm 1\}$ we have
$$
D_b(\{(x,y)\}) =
\begin{cases}
\frac{1+yb\rho}{2} & \text{if } x = c\\
0 & \text{otherwise.}
\end{cases}
$$
That is, all the distribution mass is concentrated on two examples, $(c,1)$ and $(c,-1)$, where the probability of $(c,b)$ is $\frac{1+\rho}{2}$ and the probability of $(c,-b)$ is $\frac{1-\rho}{2}$ (here $\rho \in (0,1)$ is the bias of the label toward $b$).
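For concreteness, a worked instance with the bias set to $\rho = \tfrac{1}{2}$ (an illustrative value, not one used by the proof):
$$
D_+(\{(c,1)\}) = \frac{1+\rho}{2} = \frac{3}{4},
\qquad
D_+(\{(c,-1)\}) = \frac{1-\rho}{2} = \frac{1}{4},
$$
and symmetrically $D_-(\{(c,-1)\}) = \tfrac{3}{4}$ and $D_-(\{(c,1)\}) = \tfrac{1}{4}$. The two distributions share the same marginal over $\mathcal{X}$ (all mass on $c$) and differ only in which label they favor.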
Let $A$ be an arbitrary algorithm. Any training set sampled from $D_b$ has the form $S = ((c,y_1),\ldots,(c,y_m))$. Therefore, it is fully characterized by the vector $\mathbf{y} = (y_1,\ldots,y_m) \in \{\pm 1\}^m$. Upon receiving a training set $S$, the algorithm $A$ returns a hypothesis $h : \mathcal{X} \to \{\pm 1\}$. Since the error of $A$ w.r.t. $D_b$ only depends on $h(c)$, we can think of $A$ as a mapping from $\{\pm 1\}^m$ into $\{\pm 1\}$. Therefore, we denote by $A(\mathbf{y})$ the value in $\{\pm 1\}$ corresponding to the prediction $h(c)$, where $h$ is the hypothesis that $A$ outputs upon receiving the training set $S = ((c,y_1),\ldots,(c,y_m))$.
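To make this reduction concrete, here is a minimal sketch of an algorithm viewed as a map from label vectors to a single prediction; the majority-vote learner and all names in it are illustrative choices, not part of the text:

```python
import random

def majority_vote(y):
    """One possible A: map a label vector y in {±1}^m to a
    prediction in {±1} by taking the majority label seen at c."""
    return 1 if sum(y) >= 0 else -1

def sample_labels(b, rho, m, rng):
    """Draw y ~ P_b: each y_i equals b with probability (1+rho)/2,
    independently, matching the definition of D_b above."""
    return [b if rng.random() < (1 + rho) / 2 else -b for _ in range(m)]

rng = random.Random(0)
y = sample_labels(b=+1, rho=0.5, m=7, rng=rng)
print(y, "->", majority_vote(y))  # A(y), i.e., the prediction h(c)
```

Any learner collapses to such a function of $\mathbf{y}$ in this construction, since every training example shares the same instance $c$.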
Note that for any hypothesis $h$ we have
$$
L_{D_b}(h) = \frac{1 - h(c)\,b\,\rho}{2}.
$$
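This is a one-step check of the definition of the risk (spelled out here for completeness): $h$ errs exactly on the example $(c,-h(c))$, so
$$
L_{D_b}(h) = \Pr_{(x,y)\sim D_b}[h(x) \neq y]
= D_b(\{(c,-h(c))\})
= \frac{1 + (-h(c))\,b\,\rho}{2}
= \frac{1 - h(c)\,b\,\rho}{2}.
$$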
In particular, the Bayes optimal hypothesis is $h_b$ and
$$
L_{D_b}(A(\mathbf{y})) - L_{D_b}(h_b)
= \frac{1 - A(\mathbf{y})\,b\,\rho}{2} - \frac{1-\rho}{2}
=
\begin{cases}
\rho & \text{if } A(\mathbf{y}) \neq b\\
0 & \text{otherwise.}
\end{cases}
$$
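Both cases follow by substitution (written out as a quick check): if $A(\mathbf{y}) = b$ then $A(\mathbf{y})\,b = 1$ and the two fractions cancel, while if $A(\mathbf{y}) = -b$ then $A(\mathbf{y})\,b = -1$ and
$$
\frac{1+\rho}{2} - \frac{1-\rho}{2} = \rho.
$$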
Fix $A$. For $b \in \{\pm 1\}$, let $Y^b = \{\mathbf{y} \in \{\pm 1\}^m : A(\mathbf{y}) \neq b\}$. The distribution $D_b$ induces a probability $P_b$ over $\{\pm 1\}^m$. Hence,
$$
\Pr\big[L_{D_b}(A(\mathbf{y})) - L_{D_b}(h_b) = \rho\big]
= P_b(Y^b)
= \sum_{\mathbf{y}} P_b[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq b]}.
$$
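Explicitly, since the $m$ examples are drawn i.i.d. from $D_b$, the induced probability factorizes (a spelled-out form of $P_b$, implicit in the text):
$$
P_b[\mathbf{y}] = \prod_{i=1}^{m} \frac{1 + y_i\,b\,\rho}{2}.
$$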
Denote $N^+ = \{\mathbf{y} : |\{i : y_i = 1\}| \geq m/2\}$ and $N^- = \{\pm 1\}^m \setminus N^+$. Note that for any $\mathbf{y} \in N^+$ we have $P_+[\mathbf{y}] \geq P_-[\mathbf{y}]$ and for any $\mathbf{y} \in N^-$ we have $P_-[\mathbf{y}] \geq P_+[\mathbf{y}]$.
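To see why (a short check using the product form of $P_b$ above), write $k = |\{i : y_i = 1\}|$; then
$$
\frac{P_+[\mathbf{y}]}{P_-[\mathbf{y}]}
= \prod_{i=1}^{m} \frac{1 + y_i\rho}{1 - y_i\rho}
= \left(\frac{1+\rho}{1-\rho}\right)^{2k-m},
$$
which is at least $1$ exactly when $k \geq m/2$, i.e., when $\mathbf{y} \in N^+$; the case $\mathbf{y} \in N^-$ is symmetric.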
Therefore,
$$
\begin{aligned}
\max_{b\in\{\pm 1\}} \Pr\big[L_{D_b}(A(\mathbf{y})) - L_{D_b}(h_b) = \rho\big]
&= \max_{b\in\{\pm 1\}} \sum_{\mathbf{y}} P_b[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq b]} \\
&\geq \frac{1}{2} \sum_{\mathbf{y}} P_+[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq +]}
 + \frac{1}{2} \sum_{\mathbf{y}} P_-[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq -]} \\
&= \frac{1}{2} \sum_{\mathbf{y}\in N^+} \big(P_+[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq +]} + P_-[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq -]}\big) \\
&\quad + \frac{1}{2} \sum_{\mathbf{y}\in N^-} \big(P_+[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq +]} + P_-[\mathbf{y}]\,\mathbb{1}_{[A(\mathbf{y}) \neq -]}\big)
\end{aligned}
$$
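As a numerical sanity check on this chain (a simulation sketch, not part of the proof; the majority-vote learner and the parameter values are assumptions made only for illustration), one can estimate both sides of the first inequality, the max over $b$ versus the average over $b$:

```python
import random

def majority_vote(y):
    # A concrete choice of A: predict the majority label in y
    return 1 if sum(y) >= 0 else -1

def mistake_prob(b, rho, m, trials, rng):
    # Monte Carlo estimate of P_b[A(y) != b]: the probability that
    # the excess risk of A over the Bayes hypothesis equals rho
    mistakes = 0
    for _ in range(trials):
        y = [b if rng.random() < (1 + rho) / 2 else -b for _ in range(m)]
        if majority_vote(y) != b:
            mistakes += 1
    return mistakes / trials

rng = random.Random(0)
rho, m, trials = 0.2, 9, 100_000
p_plus = mistake_prob(+1, rho, m, trials, rng)
p_minus = mistake_prob(-1, rho, m, trials, rng)
# The max over b dominates the average over b, as in the first
# inequality of the displayed chain.
print(f"max: {max(p_plus, p_minus):.4f} >= avg: {0.5 * (p_plus + p_minus):.4f}")
```

Even for this natural learner, the mistake probability decays only once $m$ grows on the order of $1/\rho^2$, which is consistent with the $1/\epsilon^2$ dependence this section is establishing.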