To analyze the algorithm we first note that the number of experts is
$$ d \;=\; \sum_{L=0}^{\mathrm{Ldim}(H)} \binom{T}{L}. \qquad (21.4) $$
It can be shown that when T ≥ Ldim(H) + 2, the right-hand side of the equation is bounded by $(eT/\mathrm{Ldim}(H))^{\mathrm{Ldim}(H)}$ (the proof can be found in Lemma A.5).
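For concreteness, a minimal Python sketch evaluating equation (21.4) and the bound of Lemma A.5; the values T = 100 and Ldim(H) = 5 below are illustrative assumptions, not values from the text:

```python
# A quick numerical check of equation (21.4) and the (eT/Ldim(H))^Ldim(H) bound;
# T and ldim_H are hypothetical values chosen for illustration.
import math

T, ldim_H = 100, 5

# d = sum_{L=0}^{Ldim(H)} C(T, L): the number of experts (equation 21.4)
d = sum(math.comb(T, L) for L in range(ldim_H + 1))

# For T >= Ldim(H) + 2, Lemma A.5 bounds this sum by (eT/Ldim(H))^Ldim(H)
bound = (math.e * T / ldim_H) ** ldim_H

assert T >= ldim_H + 2 and d <= bound
print(d, bound)  # 79375496 vs. roughly 4.75e8
```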
Theorem 21.11 tells us that the expected number of mistakes of Weighted-Majority is at most the number of mistakes of the best expert plus $\sqrt{2\log(d)\,T}$. We will next show that the number of mistakes of the best expert is at most the number of mistakes of the best hypothesis in H. The following key lemma shows that, on any sequence of instances, for each hypothesis h ∈ H there exists an expert with the same behavior.
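For readers who want to see the expert-advice layer concretely, here is a minimal randomized Weighted-Majority sketch in the spirit of Theorem 21.11; the `experts` interface (functions from a round index to a {0,1} prediction) and all concrete names are illustrative assumptions, not the book's pseudocode.

```python
# A sketch of randomized Weighted-Majority over d experts, assuming the
# interface described above; not the book's exact pseudocode.
import math
import random

def weighted_majority(experts, labels):
    """Expected mistakes <= best expert's mistakes + sqrt(2 log(d) T) (Theorem 21.11)."""
    d, T = len(experts), len(labels)
    eta = math.sqrt(2 * math.log(d) / T)  # learning rate used in the regret analysis
    w = [1.0] * d                         # one weight per expert, initially uniform
    mistakes = 0
    for t in range(T):
        preds = [f(t) for f in experts]
        i = random.choices(range(d), weights=w)[0]  # sample an expert prop. to weight
        mistakes += int(preds[i] != labels[t])
        for j in range(d):                # exponentially down-weight erring experts
            if preds[j] != labels[t]:
                w[j] *= math.exp(-eta)
    return mistakes
```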
Lemma 21.13. Let H be any hypothesis class with Ldim(H) < ∞. Let x_1, x_2, ..., x_T be any sequence of instances. For any h ∈ H, there exists L ≤ Ldim(H) and indices 1 ≤ i_1 < i_2 < ··· < i_L ≤ T such that when running Expert(i_1, i_2, ..., i_L) on the sequence x_1, x_2, ..., x_T, the expert predicts h(x_t) on each online round t = 1, 2, ..., T.
Proof. Fix h ∈ H and the sequence x_1, x_2, ..., x_T. We must construct L and the indices i_1, i_2, ..., i_L. Consider running SOA on the input (x_1, h(x_1)), (x_2, h(x_2)), ..., (x_T, h(x_T)). SOA makes at most Ldim(H) mistakes on such input. We define L to be the number of mistakes made by SOA and we define {i_1, i_2, ..., i_L} to be the set of rounds on which SOA made those mistakes.
Now consider running Expert(i_1, i_2, ..., i_L) on the sequence x_1, x_2, ..., x_T. By construction, the set V_t maintained by Expert(i_1, i_2, ..., i_L) equals the set V_t maintained by SOA when running on the sequence (x_1, h(x_1)), ..., (x_T, h(x_T)) (by induction on t: as long as the two sets agree, the expert's prediction on round t equals h(x_t), so both algorithms update V_t to the set of hypotheses consistent with h(x_t)). The predictions of SOA differ from the predictions of h if and only if the round is in {i_1, i_2, ..., i_L}. Since Expert(i_1, i_2, ..., i_L) predicts exactly like SOA if t is not in {i_1, i_2, ..., i_L} and the opposite of SOA's prediction if t is in {i_1, i_2, ..., i_L}, we conclude that the predictions of the expert are always the same as the predictions of h.
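The construction in this proof is easy to run. Below is a minimal sketch for a small finite class, where hypotheses are tuples of {0,1} labels over a finite domain; the recursive ldim(), the toy class H, and the 0-indexed rounds are illustrative assumptions made here, not the book's code.

```python
# Lemma 21.13's construction on a toy finite class (a sketch, not the book's code).

def ldim(H):
    """Littlestone dimension of a finite class of {0,1} label tuples (exponential time)."""
    if not H:
        return -1  # convention: SOA then never predicts into an empty version class
    best = 0
    for x in range(len(next(iter(H)))):
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:  # x splits the class, so a shattered tree can branch here
            best = max(best, 1 + min(ldim(H0), ldim(H1)))
    return best

def soa_mistake_rounds(H, xs, ys):
    """Run SOA on (x_t, y_t): predict the Ldim-maximizing label, update by the true label."""
    V, mistakes = frozenset(H), []
    for t, (x, y) in enumerate(zip(xs, ys)):
        V0 = frozenset(h for h in V if h[x] == 0)
        V1 = frozenset(h for h in V if h[x] == 1)
        if (0 if ldim(V0) >= ldim(V1) else 1) != y:
            mistakes.append(t)
        V = V0 if y == 0 else V1  # SOA keeps the hypotheses consistent with the true label
    return mistakes

def expert(H, xs, flips):
    """Expert(i_1,...,i_L): follow SOA's rule but flip the prediction on rounds in flips."""
    V, preds = frozenset(H), []
    for t, x in enumerate(xs):
        V0 = frozenset(h for h in V if h[x] == 0)
        V1 = frozenset(h for h in V if h[x] == 1)
        soa = 0 if ldim(V0) >= ldim(V1) else 1
        y = (1 - soa) if t in flips else soa
        preds.append(y)
        V = V0 if y == 0 else V1  # the expert updates with its *own* prediction
    return preds

# Toy class over a 3-point domain, a target h in H, and rounds indexed from 0.
H = {(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
h, xs = (1, 0, 0), [0, 1, 2]
ys = [h[x] for x in xs]

flips = set(soa_mistake_rounds(H, xs, ys))  # the indices i_1,...,i_L of the proof
assert len(flips) <= ldim(H)                # L <= Ldim(H)
assert expert(H, xs, flips) == ys           # the expert replicates h on every round
```

Here SOA (labeled by h) errs on exactly one round, and flipping the expert on that round reproduces h everywhere, as the lemma asserts.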
The previous lemma holds in particular for the hypothesis in H that makes the
least number of mistakes on the sequence of examples, and we therefore obtain the
following:
Corollary 21.14. Let (x_1, y_1), (x_2, y_2), ..., (x_T, y_T) be a sequence of examples and let H be a hypothesis class with Ldim(H) < ∞. There exists L ≤ Ldim(H) and indices 1 ≤ i_1 < i_2 < ··· < i_L ≤ T, such that Expert(i_1, i_2, ..., i_L) makes at most as many mistakes as the best h ∈ H does, namely,
$$ \min_{h \in H} \sum_{t=1}^{T} |h(x_t) - y_t| $$
mistakes on the sequence of examples.
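As a hedged illustration, the benchmark quantity of Corollary 21.14, computed for the hypothetical toy class and sequence from the sketch above, is a one-liner:

```python
# min over h in H of sum_t |h(x_t) - y_t|, reusing the hypothetical H, xs, ys above
best_mistakes = min(sum(abs(h[x] - y) for x, y in zip(xs, ys)) for h in H)
assert best_mistakes == 0  # here the target h itself lies in H, so the minimum is zero
```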
Together with Theorem 21.11, this proves the upper bound part of Theorem 21.10.