Page 122 - Understanding Machine Learning

Boosting

Finally, since the VC-dimension of decision stumps is 2, if the sample size is greater than Ω(log(1/δ)/ε²), then with probability of at least 1 − δ, the ERM_B rule returns a hypothesis with an error of at most 1/3 + ε. Setting ε = 1/12 we obtain that the error of ERM_B is at most 1/3 + 1/12 = 1/2 − 1/12.
   We see that ERM_B is a γ-weak learner for H with γ = 1/12. We next show how to implement the ERM rule efficiently for decision stumps.



                 10.1.1 Efficient Implementation of ERM for Decision Stumps
Let X = ℝ^d and consider the base hypothesis class of decision stumps over ℝ^d, namely,

   H_DS = { x ↦ sign(θ − x_i) · b : θ ∈ ℝ, i ∈ [d], b ∈ {±1} }.
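For concreteness, a decision stump of this form can be written as a tiny Python predictor. The helper name and the convention that sign(0) = +1 are my own, not the book's:

```python
def stump_predict(x, theta, i, b=1):
    """Decision stump h(x) = sign(theta - x[i]) * b, with sign(0) taken as +1.

    x: a feature vector in R^d; theta: threshold; i: feature index; b: +1 or -1.
    """
    return b if x[i] <= theta else -b
```

With b = 1 the stump predicts +1 on the halfspace where the i-th coordinate is at most θ, and −1 otherwise; b = −1 flips both predictions.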
For simplicity, assume that b = 1; that is, we focus on all the hypotheses in H_DS of the form sign(θ − x_i). Let S = ((x_1, y_1), ..., (x_m, y_m)) be a training set. We will show how to implement an ERM rule, namely, how to find a decision stump that minimizes L_S(h). Furthermore, since in the next section we will show that AdaBoost requires finding a hypothesis with a small risk relative to some distribution over S, we will show here how to minimize such risk functions. Concretely, let D be a probability vector in ℝ^m (that is, all elements of D are nonnegative and Σ_i D_i = 1). The weak learner we describe later receives D and S and outputs a decision stump h : X → Y that minimizes the risk w.r.t. D,

   L_D(h) = Σ_{i=1}^m D_i 1_[h(x_i) ≠ y_i].

Note that if D = (1/m, ..., 1/m) then L_D(h) = L_S(h).
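The weighted risk L_D(h) is straightforward to compute directly; here is a minimal sketch (the function name is mine):

```python
def weighted_risk(h, S, D):
    """L_D(h) = sum_i D_i * 1[h(x_i) != y_i] over a sample S = [(x_i, y_i), ...]."""
    return sum(d for (x, y), d in zip(S, D) if h(x) != y)
```

With the uniform vector D = (1/m, ..., 1/m) this coincides with the empirical risk L_S(h), as noted above.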
   Recall that each decision stump is parameterized by an index j ∈ [d] and a threshold θ. Therefore, minimizing L_D(h) amounts to solving the problem

   min_{j∈[d]} min_{θ∈ℝ} [ Σ_{i : y_i = 1} D_i 1_[x_{i,j} > θ]  +  Σ_{i : y_i = −1} D_i 1_[x_{i,j} ≤ θ] ].   (10.1)
Fix j ∈ [d] and let us sort the examples so that x_{1,j} ≤ x_{2,j} ≤ ... ≤ x_{m,j}. Define

   Θ_j = { (x_{i,j} + x_{i+1,j})/2 : i ∈ [m − 1] } ∪ { x_{1,j} − 1, x_{m,j} + 1 }.

Note that for any θ ∈ ℝ there exists θ′ ∈ Θ_j that yields the same predictions for the sample S as the threshold θ. Therefore, instead of minimizing over θ ∈ ℝ we can minimize over θ′ ∈ Θ_j.
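Enumerating the candidate set Θ_j for one feature is a short computation: the midpoints between consecutive sorted values, plus one sentinel threshold below the minimum and one above the maximum. A sketch (helper name mine):

```python
def candidate_thresholds(xs):
    """Theta_j for one feature: midpoints of consecutive sorted values,
    plus a threshold below all points and one above all points."""
    xs = sorted(xs)
    mids = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return [xs[0] - 1] + mids + [xs[-1] + 1]
```

Every θ ∈ ℝ induces the same labeling of S as one of these |Θ_j| = m + 1 candidates, which is why the search over ℝ reduces to a finite search.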
   This already gives us an efficient procedure: Choose j ∈ [d] and θ ∈ Θ_j that minimize the objective value of Equation (10.1). For every j and θ ∈ Θ_j we have to calculate a sum over m examples; therefore the runtime of this approach would be O(dm²). We next show a simple trick that enables us to minimize the objective in time O(dm).
   The observation is as follows. Suppose we have calculated the objective for θ ∈ (x_{i−1,j}, x_{i,j}). Let F(θ) be the value of the objective. Then, when we consider θ′ ∈ (x_{i,j}, x_{i+1,j}) we have that

   F(θ′) = F(θ) − D_i 1_[y_i = 1] + D_i 1_[y_i = −1] = F(θ) − y_i D_i.
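The whole procedure can be sketched in Python. This is my own rendering of the running-sum trick, with b fixed to 1 as in the text; the function name and bookkeeping are illustrative, not the book's. Each feature needs one sort plus one O(m) scan over its candidate thresholds:

```python
def erm_decision_stump(X, y, D):
    """Find (j, theta) minimizing the objective of Equation (10.1), b = 1.

    X: list of m feature vectors in R^d; y: labels in {+1, -1};
    D: nonnegative weights summing to 1.
    Returns (minimal weighted error F, feature index j, threshold theta).
    """
    m, d = len(X), len(X[0])
    best = (float('inf'), None, None)
    for j in range(d):
        order = sorted(range(m), key=lambda i: X[i][j])
        # For a threshold below all points, only the positive examples
        # satisfy x_{i,j} > theta, so F = sum of positive weights.
        F = sum(D[i] for i in range(m) if y[i] == 1)
        theta = X[order[0]][j] - 1
        if F < best[0]:
            best = (F, j, theta)
        for k, i in enumerate(order):
            # Move theta just past X[i][j]: F(theta') = F(theta) - y_i * D_i.
            F -= y[i] * D[i]
            if k < m - 1:
                theta = (X[i][j] + X[order[k + 1]][j]) / 2
            else:
                theta = X[i][j] + 1
            # Record F only once theta has passed all copies of a repeated value,
            # so that theta lies strictly between distinct sample points.
            if k == m - 1 or X[i][j] != X[order[k + 1]][j]:
                if F < best[0]:
                    best = (F, j, theta)
    return best
```

Allowing b = −1 as well requires no extra scan: negating b errs exactly where the b = 1 stump is correct, so its weighted error is 1 − F, and one can track the maximum of F alongside the minimum.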