              Therefore, we can calculate the objective at θ′ in constant time, given the objective
              at the previous threshold, θ. It follows that after a preprocessing step in which we
              sort the examples with respect to each coordinate, the minimization problem can be
              performed in time O(dm). This yields the following pseudocode.

                                      ERM for Decision Stumps

                       input:
                          training set S = (x_1, y_1), ..., (x_m, y_m)
                          distribution vector D

                       goal: Find j*, θ* that solve Equation (10.1)

                       initialize: F* = ∞
                       for j = 1, ..., d
                          sort S using the j'th coordinate, and denote
                             x_{1,j} ≤ x_{2,j} ≤ ··· ≤ x_{m,j} ≤ x_{m+1,j}  def=  x_{m,j} + 1
                          F = Σ_{i: y_i = 1} D_i
                          if F < F*
                             F* = F,  θ* = x_{1,j} − 1,  j* = j
                          for i = 1, ..., m
                             F = F − y_i D_i
                             if F < F* and x_{i,j} ≠ x_{i+1,j}
                                F* = F,  θ* = ½ (x_{i,j} + x_{i+1,j}),  j* = j
                       output j*, θ*
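
              The following is a minimal NumPy sketch of the pseudocode above. It assumes the
              sample is given as an (m, d) array X, labels y in {±1}, and a distribution vector D,
              and that Equation (10.1) (not reproduced on this page) is the D-weighted error of
              stumps of the form h(x) = sign(θ − x_j), which is the convention the initialization
              F = Σ_{i: y_i=1} D_i suggests; the function name erm_decision_stump is ours.

                  import numpy as np

                  def erm_decision_stump(X, y, D):
                      """Weighted ERM for decision stumps of the form h(x) = sign(theta - x[j]).

                      X: (m, d) array of examples, y: labels in {+1, -1},
                      D: distribution over the m examples.
                      Returns the minimizing pair (j_star, theta_star).
                      """
                      m, d = X.shape
                      F_star, theta_star, j_star = np.inf, None, None
                      for j in range(d):
                          # Sort the examples by their j'th coordinate.
                          order = np.argsort(X[:, j])
                          x_sorted = X[order, j]
                          y_sorted = y[order]
                          D_sorted = D[order]
                          # Threshold below the smallest coordinate: every example is
                          # predicted -1, so the error is the total mass of the positives.
                          F = D[y == 1].sum()
                          if F < F_star:
                              F_star, theta_star, j_star = F, x_sorted[0] - 1, j
                          # Sliding the threshold past x_{i,j} flips example i's prediction to +1.
                          x_padded = np.append(x_sorted, x_sorted[-1] + 1)  # x_{m+1,j} = x_{m,j} + 1
                          for i in range(m):
                              F = F - y_sorted[i] * D_sorted[i]
                              if F < F_star and x_padded[i] != x_padded[i + 1]:
                                  F_star = F
                                  theta_star = 0.5 * (x_padded[i] + x_padded[i + 1])
                                  j_star = j
                      return j_star, theta_star

              As written, the sort inside the loop over coordinates costs O(dm log m); the O(dm)
              bound quoted above assumes the d sorts are done once in a preprocessing step.
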
              10.2 ADABOOST
              AdaBoost (short for Adaptive Boosting) is an algorithm that has access to a weak
              learner and finds a hypothesis with a low empirical risk. The AdaBoost algorithm
              receives as input a training set of examples S = (x_1, y_1), ..., (x_m, y_m), where for
              each i, y_i = f(x_i) for some labeling function f. The boosting process proceeds in
              a sequence of consecutive rounds. At round t, the booster first defines a distribution
              over the examples in S, denoted D^(t). That is, D^(t) ∈ R^m_+ and Σ_{i=1}^m D_i^(t) = 1.
              Then, the booster passes the distribution D^(t) and the sample S to the weak learner.
              (That way, the weak learner can construct i.i.d. examples according to D^(t) and f.)
              The weak learner is assumed to return a "weak" hypothesis, h_t, whose error,

                           ε_t  def=  L_{D^(t)}(h_t)  def=  Σ_{i=1}^m D_i^(t) 1_[h_t(x_i) ≠ y_i],
              is at most ½ − γ (of course, there is a probability of at most δ that the weak learner
              fails). Then, AdaBoost assigns a weight for h_t as follows: w_t = ½ log(1/ε_t − 1).
              That is, the weight of h_t is inversely proportional to the error of h_t. At the end
              of the round, AdaBoost updates the distribution so that examples on which h_t errs
              will get a higher probability mass while examples on which h_t is correct will get a
              lower probability mass. Intuitively, this will force the weak learner to focus on the
              problematic examples in the next round. The output of the AdaBoost algorithm is
              a "strong" classifier that is based on a weighted sum of all the weak hypotheses.
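
              To make one round concrete, the following is a minimal NumPy sketch, assuming
              labels in {±1}, the weak hypothesis h_t given as a callable on the whole sample, and
              the standard exponential reweighting rule D_i^(t+1) ∝ D_i^(t) exp(−w_t y_i h_t(x_i)),
              which this paragraph describes only qualitatively (the full update is specified later
              in the chapter); the function name adaboost_round is ours.

                  import numpy as np

                  def adaboost_round(h_t, X, y, D_t):
                      """One boosting round: weighted error, hypothesis weight, new distribution.

                      h_t: callable mapping the (m, d) array X to predictions in {+1, -1}
                      y:   true labels in {+1, -1}
                      D_t: current distribution over the m examples
                      """
                      preds = h_t(X)
                      # Weighted error: eps_t = sum_i D_i^(t) * 1[h_t(x_i) != y_i]
                      eps_t = D_t[preds != y].sum()
                      # Hypothesis weight: w_t = (1/2) * log(1/eps_t - 1); smaller error, larger weight.
                      w_t = 0.5 * np.log(1.0 / eps_t - 1.0)
                      # Exponential reweighting (assumed here): examples h_t misclassifies gain
                      # probability mass, correctly classified examples lose mass.
                      D_next = D_t * np.exp(-w_t * y * preds)
                      return w_t, D_next / D_next.sum()

              Since ε_t ≤ ½ − γ, the argument of the log is greater than 1, so w_t > 0, and the
              weight grows as the error shrinks, matching the remark above.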