21.4 The Online Perceptron Algorithm
w ∈ R^d}. In Section 9.1.2 we have presented the batch version of the Perceptron, which aims to solve the ERM problem with respect to H. We now present an online version of the Perceptron algorithm.
Let X = R^d, Y = {−1, 1}. On round t, the learner receives a vector x_t ∈ R^d. The learner maintains a weight vector w^(t) ∈ R^d and predicts p_t = sign(⟨w^(t), x_t⟩). Then, it receives y_t ∈ Y and pays 1 if p_t ≠ y_t and 0 otherwise.
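To make the protocol concrete, here is a minimal Python sketch of this online loop (an illustration, not part of the text); the name online_protocol and the update hook are hypothetical placeholders, and the update rule is left abstract because it is only derived later in this section:

    import numpy as np

    def online_protocol(x_stream, y_stream, update, d):
        """On round t, predict sign(<w^(t), x_t>), observe y_t, and pay 1 on a mistake.
        `update` is a placeholder mapping (w^(t), x_t, y_t) to w^(t+1)."""
        w = np.zeros(d)                                  # start from w^(1) = 0
        mistakes = 0
        for x_t, y_t in zip(x_stream, y_stream):
            p_t = 1.0 if np.dot(w, x_t) > 0 else -1.0    # sign(<w, x_t>), breaking ties to -1
            if p_t != y_t:
                mistakes += 1                            # pay 1 for a prediction mistake
            w = update(w, x_t, y_t)                      # learner may update before the next round
        return mistakes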
The goal of the learner is to make as few prediction mistakes as possible. In
Section 21.1 we characterized the optimal algorithm and showed that the best
achievable mistake bound depends on the Littlestone dimension of the class. We
show later that if d ≥ 2 then Ldim(H) = ∞, which implies that we have no hope of making few prediction mistakes. Indeed, consider the tree for which v_1 = (1/2, 1, 0, ..., 0), v_2 = (1/4, 1, 0, ..., 0), v_3 = (3/4, 1, 0, ..., 0), etc. Because of the density of the reals, this tree is shattered by the subset of H which contains all hypotheses that are parametrized by w of the form w = (−1, a, 0, ..., 0), for a ∈ [0, 1]. We conclude that indeed Ldim(H) = ∞.
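To see the shattering concretely, the following short Python sketch (illustrative only, not from the text) follows a root-to-leaf path of the binary-search tree above, using only the first coordinate of each node, and checks that every labeling of the path is produced by some hypothesis of the form w = (−1, a, 0, ..., 0) with a ∈ [0, 1]:

    import itertools

    def realizing_a(labels):
        """Return a in [0, 1] such that x -> sign(a - x_1), i.e. the halfspace
        w = (-1, a, 0, ..., 0), gives exactly `labels` on the visited tree nodes."""
        lo, hi = 0.0, 1.0
        path = []
        for y in labels:
            v = (lo + hi) / 2          # first coordinate of the current node (1/2, then 1/4 or 3/4, ...)
            path.append(v)
            if y == +1:                # need a > v so that sign(a - v) = +1
                lo = v
            else:                      # need a < v so that sign(a - v) = -1
                hi = v
        a = (lo + hi) / 2              # density of the reals: the remaining interval is never empty
        assert all((1 if a > v else -1) == y for v, y in zip(path, labels))
        return a

    # every one of the 2^6 labelings of a depth-6 path is realized
    for labels in itertools.product([+1, -1], repeat=6):
        realizing_a(labels)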
To sidestep this impossibility result, the Perceptron algorithm relies on the tech-
nique of surrogate convex losses (see Section 12.3). This is also closely related to the
notion of margin we studied in Chapter 15.
A weight vector w makes a mistake on an example (x, y) whenever the sign of ⟨w, x⟩ does not equal y. Therefore, we can write the 0−1 loss function as follows:
ℓ(w, (x, y)) = 1_[y⟨w, x⟩ ≤ 0].
On rounds on which the algorithm makes a prediction mistake, we shall use the
hinge-loss as a surrogate convex loss function
f_t(w) = max{0, 1 − y_t⟨w, x_t⟩}.
The hinge-loss satisfies the two conditions:
• f_t is a convex function
• For all w, f_t(w) ≥ ℓ(w, (x_t, y_t)). In particular, this holds for w^(t).
On rounds on which the algorithm is correct, we shall define f_t(w) = 0. Clearly, f_t is convex in this case as well. Furthermore, f_t(w^(t)) = ℓ(w^(t), (x_t, y_t)) = 0.
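As a quick sanity check, the snippet below (an illustration, not from the text; the function names are placeholders) implements the 0−1 loss and the round-dependent surrogate f_t, and verifies numerically that on a mistake round the hinge loss upper bounds the 0−1 loss at arbitrary points w:

    import numpy as np

    def zero_one_loss(w, x, y):
        """0-1 loss: 1 if y<w, x> <= 0, else 0."""
        return 1.0 if y * np.dot(w, x) <= 0 else 0.0

    def surrogate(w_t, x_t, y_t):
        """Round-t surrogate f_t: the hinge loss on mistake rounds, the zero function otherwise."""
        if y_t * np.dot(w_t, x_t) <= 0:                       # mistake round
            return lambda w: max(0.0, 1.0 - y_t * np.dot(w, x_t))
        return lambda w: 0.0

    # f_t is a maximum of affine functions, hence convex; check the domination
    # condition f_t(w) >= 0-1 loss at randomly drawn points w on a mistake round.
    rng = np.random.default_rng(0)
    w_t = np.zeros(3)                                         # y_t<w_t, x_t> = 0, so this is a mistake round
    x_t, y_t = np.array([1.0, -2.0, 0.5]), 1.0
    f_t = surrogate(w_t, x_t, y_t)
    for _ in range(1000):
        w = rng.normal(size=3)
        assert f_t(w) >= zero_one_loss(w, x_t, y_t)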
Remark 21.5. In Section 12.3 we used the same surrogate loss function for all the
examples. In the online model, we allow the surrogate to depend on the specific
round. It can even depend on w^(t). Our ability to use a round-specific surrogate
stems from the worst-case type of analysis we employ in online learning.
Let us now run the Online Gradient Descent algorithm on the sequence of functions, f_1, ..., f_T, with the hypothesis class being all vectors in R^d (hence, the projection step is vacuous). Recall that the algorithm initializes w^(1) = 0 and its update rule is
w^(t+1) = w^(t) − η v_t
for some v_t ∈ ∂f_t(w^(t)). In our case, if y_t⟨w^(t), x_t⟩ > 0 then f_t is the zero function and we can take v_t = 0. Otherwise, it is easy to verify that v_t = −y_t x_t is in ∂f_t(w^(t)). We