sequence of labels $(y_1,\ldots,y_d) \in \{0,1\}^d$ there exists a hypothesis $h \in H$ that gives
exactly this sequence of labels. The following theorem relates the VC dimension to
the Littlestone dimension.
Theorem 21.9. For any class H, VCdim(H) ≤ Ldim(H), and there are classes for
which strict inequality holds. Furthermore, the gap can be arbitrarily large.
Proof. We first prove that VCdim(H) ≤ Ldim(H). Suppose VCdim(H) = d and
let $x_1,\ldots,x_d$ be a shattered set. We now construct a complete binary tree of
instances $v_1,\ldots,v_{2^d-1}$, where all nodes at depth $i$ are set to be $x_i$ – see the following
illustration:
[Illustration: a complete binary tree of depth 3; the root is labeled $x_1$, both nodes at depth 2 are labeled $x_2$, and all four nodes at depth 3 are labeled $x_3$.]
Now, the definition of a shattered set clearly implies that we got a valid shattered
tree of depth d, and we conclude that VCdim(H) ≤ Ldim(H). To show that the gap
can be arbitrarily large, simply note that the class given in Example 21.4 has VC
dimension of 1 whereas its Littlestone dimension is infinite.
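To make this construction concrete, here is a small Python sketch (not part of the book; the three-point domain and the class H below, taken to be all of its Boolean labelings so that VCdim(H) = 3, are assumptions made only for illustration). It builds the tree $v_1,\ldots,v_{2^d-1}$ in breadth-first order, with every node at depth $i$ holding $x_i$, and checks that each root-to-leaf label sequence is realized by some hypothesis, i.e., that the tree is shattered.

from itertools import product

d = 3
shattered_set = ["x1", "x2", "x3"]   # a set shattered by H
# Illustrative class: all 2^d labelings of the shattered set.
H = [dict(zip(shattered_set, labels)) for labels in product([0, 1], repeat=d)]

# Complete binary tree v_1, ..., v_{2^d - 1} in breadth-first order;
# every node at depth i (1-indexed) holds the instance x_i.
tree = [shattered_set[depth] for depth in range(d) for _ in range(2 ** depth)]

# Shattered-tree check: for every label sequence (y_1, ..., y_d), some h in H
# predicts exactly those labels on the instances along the corresponding path.
for labels in product([0, 1], repeat=d):
    node, path = 0, []
    for y in labels:
        path.append((tree[node], y))
        node = 2 * node + 1 + y      # left child for y = 0, right child for y = 1
    assert any(all(h[x] == y for x, y in path) for h in H)
print("every root-to-leaf label sequence is realized by some h in H")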
21.2 ONLINE CLASSIFICATION IN THE UNREALIZABLE CASE
In the previous section we studied online learnability in the realizable case. We now
consider the unrealizable case. Similarly to the agnostic PAC model, we no longer
assume that all labels are generated by some h ∈ H, but we require the learner to
be competitive with the best fixed predictor from H. This is captured by the regret
of the algorithm, which measures how “sorry” the learner is, in retrospect, not to
have followed the predictions of some hypothesis h ∈ H. Formally, the regret of an
algorithm A relative to h when running on a sequence of T examples is defined as
\[
\mathrm{Regret}_A(h,T) \;=\; \sup_{(x_1,y_1),\ldots,(x_T,y_T)} \left[ \sum_{t=1}^{T} |p_t - y_t| \;-\; \sum_{t=1}^{T} |h(x_t) - y_t| \right], \tag{21.1}
\]
and the regret of the algorithm relative to a hypothesis class $H$ is
\[
\mathrm{Regret}_A(H,T) \;=\; \sup_{h \in H} \mathrm{Regret}_A(h,T). \tag{21.2}
\]
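As a concrete reading of these two definitions, the following Python sketch (not from the book; the toy class {h0, h1}, the example sequence, and the predictions are assumptions chosen only for illustration) evaluates the bracketed term of Equation (21.1) for a single fixed sequence, so the supremum over sequences is not taken, and then maximizes over the class as in Equation (21.2).

def regret_vs_hypothesis(predictions, examples, h):
    # Learner's cumulative loss minus the cumulative loss of hypothesis h,
    # i.e., the bracketed term of Equation (21.1) for this particular sequence.
    learner_loss = sum(abs(p - y) for p, (_, y) in zip(predictions, examples))
    h_loss = sum(abs(h(x) - y) for x, y in examples)
    return learner_loss - h_loss

def regret_vs_class(predictions, examples, H):
    # Regret relative to the class H, as in Equation (21.2), on this sequence.
    return max(regret_vs_hypothesis(predictions, examples, h) for h in H)

# Toy class: h0 always predicts 0 and h1 always predicts 1.
H = [lambda x: 0, lambda x: 1]

# One possible sequence of examples (x_t, y_t) and learner predictions p_t in [0, 1].
examples = [(1, 1), (2, 0), (3, 1), (4, 1)]
predictions = [0.5, 0.5, 1.0, 0.0]

print(regret_vs_class(predictions, examples, H))   # prints 1.0 for this sequence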
We restate the learner’s goal as having the lowest possible regret relative to $H$. An
interesting question is whether we can derive an algorithm with low regret, meaning
that $\mathrm{Regret}_A(H,T)$ grows sublinearly with the number of rounds, $T$, which implies
that the difference between the error rate of the learner and the best hypothesis in
H tends to zero as T goes to infinity.
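In other words, dividing by $T$, a sublinear regret bound guarantees that on the sequence actually presented to the learner
\[
\frac{1}{T}\sum_{t=1}^{T} |p_t - y_t| \;-\; \inf_{h \in H} \frac{1}{T}\sum_{t=1}^{T} |h(x_t) - y_t| \;\le\; \frac{\mathrm{Regret}_A(H,T)}{T} \;\xrightarrow[T \to \infty]{}\; 0.
\]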
We first show that this is an impossible mission—no algorithm can obtain a
sublinear regret bound even if $|H| = 2$. Indeed, consider $H = \{h_0, h_1\}$, where $h_0$
is the function that always returns 0 and $h_1$ is the function that always returns 1. An