              Hence,
\begin{align*}
\frac{Z_{t+1}}{Z_t} &= \frac{\sum_{i=1}^{m} e^{-y_i f_{t+1}(x_i)}}{\sum_{j=1}^{m} e^{-y_j f_t(x_j)}} \\
&= \frac{\sum_{i=1}^{m} e^{-y_i f_t(x_i)}\, e^{-y_i w_{t+1} h_{t+1}(x_i)}}{\sum_{j=1}^{m} e^{-y_j f_t(x_j)}} \\
&= \sum_{i=1}^{m} D^{(t+1)}_i\, e^{-y_i w_{t+1} h_{t+1}(x_i)} \\
&= e^{-w_{t+1}} \sum_{i:\, y_i h_{t+1}(x_i)=1} D^{(t+1)}_i \;+\; e^{w_{t+1}} \sum_{i:\, y_i h_{t+1}(x_i)=-1} D^{(t+1)}_i \\
&= e^{-w_{t+1}}\,(1-\epsilon_{t+1}) + e^{w_{t+1}}\,\epsilon_{t+1} \\
&= \frac{1}{\sqrt{1/\epsilon_{t+1} - 1}}\,(1-\epsilon_{t+1}) + \sqrt{1/\epsilon_{t+1} - 1}\;\epsilon_{t+1} \\
&= \sqrt{\frac{\epsilon_{t+1}}{1-\epsilon_{t+1}}}\,(1-\epsilon_{t+1}) + \sqrt{\frac{1-\epsilon_{t+1}}{\epsilon_{t+1}}}\;\epsilon_{t+1} \\
&= 2\sqrt{\epsilon_{t+1}\,(1-\epsilon_{t+1})},
\end{align*}
where we used the definition $w_{t+1} = \frac{1}{2}\log\!\left(\frac{1}{\epsilon_{t+1}} - 1\right)$, which gives $e^{w_{t+1}} = \sqrt{1/\epsilon_{t+1} - 1}$.
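The collapse of $e^{-w_{t+1}}(1-\epsilon_{t+1}) + e^{w_{t+1}}\epsilon_{t+1}$ into $2\sqrt{\epsilon_{t+1}(1-\epsilon_{t+1})}$ is also easy to check numerically. The short Python sketch below (the sample values of the error are arbitrary, and the function name is only illustrative) evaluates both expressions for a few error values:

```python
import math

def per_round_factor(eps):
    # Z_{t+1}/Z_t written as e^{-w}(1 - eps) + e^{w} * eps,
    # with w = 0.5 * log(1/eps - 1) as in AdaBoost
    w = 0.5 * math.log(1.0 / eps - 1.0)
    return math.exp(-w) * (1.0 - eps) + math.exp(w) * eps

for eps in (0.1, 0.25, 0.4, 0.49):
    # both columns agree with 2 * sqrt(eps * (1 - eps))
    print(eps, per_round_factor(eps), 2.0 * math.sqrt(eps * (1.0 - eps)))
```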


By our assumption, $\epsilon_{t+1} \le \frac{1}{2} - \gamma$. Since the function $g(a) = a(1-a)$ is monotonically increasing in $[0,1/2]$, we obtain that
\[
2\sqrt{\epsilon_{t+1}\,(1-\epsilon_{t+1})} \;\le\; 2\sqrt{\left(\tfrac{1}{2} - \gamma\right)\left(\tfrac{1}{2} + \gamma\right)} \;=\; \sqrt{1 - 4\gamma^2}.
\]
Finally, using the inequality $1 - a \le e^{-a}$ we have that $\sqrt{1 - 4\gamma^2} \le e^{-4\gamma^2/2} = e^{-2\gamma^2}$.



              This shows that Equation (10.3) holds and thus concludes our proof.
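As a quick numerical sanity check of the per-round bound just proved, the snippet below takes an arbitrary illustrative edge (gamma = 0.1) and the worst case allowed by the assumption, eps = 1/2 − gamma, and confirms the chain 2√(ε(1−ε)) ≤ √(1−4γ²) ≤ e^{−2γ²}:

```python
import math

gamma = 0.1          # illustrative edge of the weak learner
eps = 0.5 - gamma    # worst case permitted by eps <= 1/2 - gamma

lhs = 2.0 * math.sqrt(eps * (1.0 - eps))   # per-round factor Z_{t+1}/Z_t
mid = math.sqrt(1.0 - 4.0 * gamma ** 2)    # sqrt(1 - 4*gamma^2)
rhs = math.exp(-2.0 * gamma ** 2)          # bound appearing in Equation (10.3)

assert lhs <= mid <= rhs
print(lhs, mid, rhs)   # roughly 0.9798, 0.9798, 0.9802 for gamma = 0.1
```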
                 Each iteration of AdaBoost involves O(m) operations as well as a single call to
              the weak learner. Therefore, if the weak learner can be implemented efficiently (as
              happens in the case of ERM with respect to decision stumps) then the total training
              process will be efficient.
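To make the per-iteration cost concrete, here is a minimal Python sketch of a single AdaBoost round with an ERM decision-stump weak learner. The names (erm_stump, adaboost_round) and the brute-force threshold search are illustrative choices, not the book's pseudocode; a sorted scan over thresholds would make the stump search itself faster. Apart from the call to the weak learner, the reweighting and the update of the output hypothesis take O(m) time.

```python
import numpy as np

def erm_stump(X, y, D):
    """Brute-force ERM decision stump: minimizes the D-weighted 0-1 error over
    all stumps of the form b * sign(x_j - theta)."""
    m, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):
        vals = np.sort(X[:, j])
        # candidate thresholds: below all points, between consecutive points, above all points
        thetas = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0, [vals[-1] + 1.0]))
        for theta in thetas:
            for b in (1.0, -1.0):
                pred = b * np.sign(X[:, j] - theta)
                pred[pred == 0] = b              # break ties toward b
                err = D[pred != y].sum()         # D-weighted error of this stump
                if err < best_err:
                    best_err, best = err, (j, theta, b)
    j, theta, b = best
    def h(Xq):
        p = b * np.sign(Xq[:, j] - theta)
        p[p == 0] = b
        return p
    return h, best_err

def adaboost_round(X, y, D, f):
    """One AdaBoost iteration; everything except the weak-learner call is O(m)."""
    h, eps = erm_stump(X, y, D)                  # weak learner (assumes 0 < eps < 1/2)
    w = 0.5 * np.log(1.0 / eps - 1.0)            # w_{t+1} = (1/2) log(1/eps_{t+1} - 1)
    new_f = lambda Xq, f=f: f(Xq) + w * h(Xq)    # f_{t+1} = f_t + w_{t+1} h_{t+1}
    D = D * np.exp(-w * y * h(X))                # exponential reweighting
    return D / D.sum(), new_f                    # renormalize D to a distribution
```

Starting from the uniform distribution D = np.ones(m) / m and f = lambda Xq: np.zeros(len(Xq)), with labels y in {−1, +1}, calling adaboost_round for T rounds and outputting the sign of the final f gives the boosted predictor.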
Remark 10.2. Theorem 10.2 assumes that at each iteration of AdaBoost, the weak
learner returns a hypothesis with weighted sample error of at most 1/2 − γ. According
to the definition of a weak learner, it can fail with probability δ. Using the union
bound, the probability that the weak learner will not fail at any of the iterations is at
least 1 − δT. As we show in Exercise 10.1, the dependence of the sample complexity
on δ can always be logarithmic in 1/δ, and therefore invoking the weak learner
with a very small δ is not problematic. We can therefore assume that δT is also
small. Furthermore, since the weak learner is only applied with distributions over
the training set, in many cases we can implement the weak learner so that it will have
a zero probability of failure (i.e., δ = 0). This is the case, for example, in the weak
learner that implements ERM with respect to decision stumps.