
Boosting

“strong” classifier that is based on a weighted sum of all the weak hypotheses. The pseudocode of AdaBoost is presented in the following.

AdaBoost

input:
  training set $S = (x_1, y_1), \ldots, (x_m, y_m)$
  weak learner $\mathrm{WL}$
  number of rounds $T$
initialize $D^{(1)} = \left(\frac{1}{m}, \ldots, \frac{1}{m}\right)$.
for $t = 1, \ldots, T$:
  invoke weak learner $h_t = \mathrm{WL}(D^{(t)}, S)$
  compute $\epsilon_t = \sum_{i=1}^{m} D_i^{(t)} \, \mathbb{1}_{[y_i \neq h_t(x_i)]}$
  let $w_t = \frac{1}{2} \log\left(\frac{1}{\epsilon_t} - 1\right)$
  update $D_i^{(t+1)} = \dfrac{D_i^{(t)} \exp(-w_t y_i h_t(x_i))}{\sum_{j=1}^{m} D_j^{(t)} \exp(-w_t y_j h_t(x_j))}$ for all $i = 1, \ldots, m$
output the hypothesis $h_s(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t h_t(x)\right)$.
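As a concrete complement to the pseudocode, the following is a minimal NumPy sketch of the same loop. The function names, the weak-learner calling convention `weak_learner(D, X, y)`, and the brute-force `stump_learner` are illustrative assumptions of this sketch, not an API from the book (though the book does use decision stumps as its running example of a weak-hypothesis class).

```python
import numpy as np

def stump_learner(D, X, y):
    """Brute-force weighted decision stump (illustrative weak learner,
    not the book's code). Minimizes the D-weighted 0-1 error over all
    (feature, threshold, sign) triples."""
    m, d = X.shape
    best_err, best_stump = np.inf, None
    for j in range(d):                    # candidate feature
        for theta in np.unique(X[:, j]):  # candidate threshold
            for b in (1, -1):             # candidate sign
                pred = np.where(X[:, j] <= theta, b, -b)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best_stump = err, (j, theta, b)
    j, theta, b = best_stump
    return lambda Xq: np.where(Xq[:, j] <= theta, b, -b)

def adaboost(X, y, weak_learner, T):
    """Direct transcription of the pseudocode above.
    Assumes y takes values in {-1, +1} and that weak_learner(D, X, y)
    returns a callable h with predictions h(X) in {-1, +1}^m."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)               # D^(1) = (1/m, ..., 1/m)
    hyps, ws = [], []
    for _ in range(T):
        h = weak_learner(D, X, y)         # h_t = WL(D^(t), S)
        pred = h(X)
        eps = np.sum(D * (pred != y))     # eps_t, assumed in (0, 1/2)
        w = 0.5 * np.log(1.0 / eps - 1.0) # w_t = (1/2) log(1/eps_t - 1)
        hyps.append(h)
        ws.append(w)
        D = D * np.exp(-w * y * pred)     # multiplicative reweighting
        D /= D.sum()                      # renormalize to a distribution
    return lambda Xq: np.sign(sum(w * h(Xq) for w, h in zip(ws, hyps)))
```

For example, `h = adaboost(X, y, stump_learner, 50)` returns the boosted classifier $h_s$; by Theorem 10.2 below, its training error is at most $e^{-2\gamma^2 \cdot 50}$ whenever every stump achieves weighted error at most $1/2 - \gamma$.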
The following theorem shows that the training error of the output hypothesis decreases exponentially fast with the number of boosting rounds.

Theorem 10.2. Let $S$ be a training set and assume that at each iteration of AdaBoost, the weak learner returns a hypothesis for which $\epsilon_t \le 1/2 - \gamma$. Then, the training error of the output hypothesis of AdaBoost is at most

$$L_S(h_s) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{[h_s(x_i) \neq y_i]} \le \exp(-2\gamma^2 T).$$
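To get a feel for the rate (a numerical illustration, not from the original text): if the weak learner always achieves edge $\gamma = 0.1$, i.e. $\epsilon_t \le 0.4$ at every round, then after $T = 500$ rounds the bound gives $L_S(h_s) \le e^{-2 \cdot (0.1)^2 \cdot 500} = e^{-10} \approx 4.5 \times 10^{-5}$, so fewer than one training example in $20{,}000$ is misclassified.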

Proof. For each $t$, denote $f_t = \sum_{p \le t} w_p h_p$. Therefore, the output of AdaBoost is $f_T$. In addition, denote

$$Z_t = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f_t(x_i)}.$$
Note that for any hypothesis we have that $\mathbb{1}_{[h(x) \neq y]} \le e^{-y h(x)}$. Therefore, $L_S(f_T) \le Z_T$, so it suffices to show that $Z_T \le e^{-2\gamma^2 T}$. To upper bound $Z_T$ we rewrite it as

$$Z_T = \frac{Z_T}{Z_0} = \frac{Z_T}{Z_{T-1}} \cdot \frac{Z_{T-1}}{Z_{T-2}} \cdots \frac{Z_2}{Z_1} \cdot \frac{Z_1}{Z_0}, \qquad (10.2)$$
where we used the fact that $Z_0 = 1$ because $f_0 \equiv 0$. Therefore, it suffices to show that for every round $t$,

$$\frac{Z_{t+1}}{Z_t} \le e^{-2\gamma^2}. \qquad (10.3)$$
To do so, we first note that using a simple inductive argument, for all $t$ and $i$,

$$D_i^{(t+1)} = \frac{e^{-y_i f_t(x_i)}}{\sum_{j=1}^{m} e^{-y_j f_t(x_j)}}.$$
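To spell out that inductive step (a brief expansion, using nothing beyond the update rule in the pseudocode): each round multiplies $D_i^{(t)}$ by $e^{-w_t y_i h_t(x_i)}$ and then renormalizes, so the normalization constants cancel across rounds and

$$D_i^{(t+1)} \propto D_i^{(1)} \prod_{p=1}^{t} e^{-w_p y_i h_p(x_i)} = \frac{1}{m}\, e^{-y_i \sum_{p \le t} w_p h_p(x_i)} = \frac{1}{m}\, e^{-y_i f_t(x_i)};$$

normalizing over $i$ gives exactly the displayed expression.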