therefore obtain the update rule
\[
\mathbf{w}^{(t+1)} =
\begin{cases}
\mathbf{w}^{(t)} & \text{if } y_t \langle \mathbf{w}^{(t)}, \mathbf{x}_t \rangle > 0 \\
\mathbf{w}^{(t)} + \eta\, y_t \mathbf{x}_t & \text{otherwise}
\end{cases}
\]
Denote by $M$ the set of rounds in which $\operatorname{sign}(\langle \mathbf{w}^{(t)}, \mathbf{x}_t \rangle) \neq y_t$. Note that on round $t$,
the prediction of the Perceptron can be rewritten as
\[
p_t = \operatorname{sign}\bigl(\langle \mathbf{w}^{(t)}, \mathbf{x}_t \rangle\bigr)
    = \operatorname{sign}\Bigl(\Bigl\langle \eta \sum_{i \in M : i < t} y_i \mathbf{x}_i,\ \mathbf{x}_t \Bigr\rangle\Bigr).
\]
This form implies that the predictions of the Perceptron algorithm and the set $M$
do not depend on the actual value of $\eta$ as long as $\eta > 0$. We have therefore obtained
the Perceptron algorithm:
Perceptron
initialize: $\mathbf{w}^{(1)} = \mathbf{0}$
for $t = 1, 2, \ldots, T$
  receive $\mathbf{x}_t$
  predict $p_t = \operatorname{sign}(\langle \mathbf{w}^{(t)}, \mathbf{x}_t \rangle)$
  if $y_t \langle \mathbf{w}^{(t)}, \mathbf{x}_t \rangle \le 0$
    $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + y_t \mathbf{x}_t$
  else
    $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)}$
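A minimal Python sketch of the pseudocode above may help; the function name perceptron and the synthetic data in the usage example are illustrative choices, not taken from the text.

import numpy as np

def perceptron(X, y):
    # Online Perceptron over the sequence (x_1, y_1), ..., (x_T, y_T).
    # X has shape (T, d); y holds labels in {-1, +1}.
    # Returns the final weight vector and the list M of mistake rounds.
    T, d = X.shape
    w = np.zeros(d)                    # w^(1) = 0
    M = []                             # rounds on which an update was made
    for t in range(T):
        x_t, y_t = X[t], y[t]
        p_t = np.sign(np.dot(w, x_t))  # the prediction p_t emitted on round t
        if y_t * np.dot(w, x_t) <= 0:  # mistake (or zero margin)
            w = w + y_t * x_t          # w^(t+1) = w^(t) + y_t x_t
            M.append(t)
        # else: w^(t+1) = w^(t), i.e. no change
    return w, M

# Illustrative usage on synthetic, linearly separable data (assumed example):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]))
w, M = perceptron(X, y)
print(len(M), w)

Replacing the update by w = w + eta * y_t * x_t for any eta > 0 only rescales $\mathbf{w}^{(t)}$, so the signs of the predictions, and hence the set $M$, are unchanged, matching the observation above.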
To analyze the Perceptron, we rely on the analysis of Online Gradient Descent
given in the previous section. In our case, the subgradient of $f_t$ we use in the
Perceptron is $\mathbf{v}_t = -\mathbb{1}_{[y_t \langle \mathbf{w}^{(t)}, \mathbf{x}_t \rangle \le 0]}\, y_t \mathbf{x}_t$. Indeed, the Perceptron's update is $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{v}_t$, and as discussed before this is equivalent to $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ for every
$\eta > 0$. Therefore, Theorem 21.15 tells us that
\[
\sum_{t=1}^{T} f_t(\mathbf{w}^{(t)}) - \sum_{t=1}^{T} f_t(\mathbf{w}^{\star})
\;\le\;
\frac{1}{2\eta}\,\|\mathbf{w}^{\star}\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 .
\]
Since $f_t(\mathbf{w}^{(t)})$ is a surrogate for the $0{-}1$ loss, we know that $\sum_{t=1}^{T} f_t(\mathbf{w}^{(t)}) \ge |M|$.
Denote $R = \max_t \|\mathbf{x}_t\|$; since $\mathbf{v}_t$ is nonzero only on rounds in $M$, where $\|\mathbf{v}_t\| = \|\mathbf{x}_t\| \le R$, we obtain
\[
|M| - \sum_{t=1}^{T} f_t(\mathbf{w}^{\star})
\;\le\;
\frac{1}{2\eta}\,\|\mathbf{w}^{\star}\|^2 + \frac{\eta}{2}\,|M|\,R^2 .
\]
Setting $\eta = \frac{\|\mathbf{w}^{\star}\|}{R\sqrt{|M|}}$ and rearranging, we obtain
\[
|M| - R\,\|\mathbf{w}^{\star}\|\,\sqrt{|M|} - \sum_{t=1}^{T} f_t(\mathbf{w}^{\star}) \;\le\; 0. \tag{21.6}
\]
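To verify this last step, note that with this choice of $\eta$ the two terms on the right-hand side of the preceding bound coincide:
\[
\frac{1}{2\eta}\,\|\mathbf{w}^{\star}\|^2
= \frac{R\sqrt{|M|}}{2\,\|\mathbf{w}^{\star}\|}\,\|\mathbf{w}^{\star}\|^2
= \frac{R\,\|\mathbf{w}^{\star}\|\sqrt{|M|}}{2},
\qquad
\frac{\eta}{2}\,|M|\,R^2
= \frac{\|\mathbf{w}^{\star}\|}{2R\sqrt{|M|}}\,|M|\,R^2
= \frac{R\,\|\mathbf{w}^{\star}\|\sqrt{|M|}}{2},
\]
so the right-hand side equals $R\,\|\mathbf{w}^{\star}\|\sqrt{|M|}$, and moving it to the left-hand side yields (21.6).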
This inequality implies