Page 274 - Understanding Machine Learning
        P. 274
     Online Learning
           256
                    To analyze the algorithm we first note that the number of experts is
                                                  Ldim(H)
                                                         T
                                              d =           .                       (21.4)
                                                         L
                                                   L=0
                 It can be shown that when T ≥ Ldim(H) + 2, the right-hand side of the equation is
                                          Ldim(H)
                 bounded by eT/Ldim(H)          (the proof can be found in Lemma A.5).
                    Theorem 21.11 tells us that the expected number of mistakes of Weighted-
                 Majority is at most the number of mistakes of the best expert plus  2log(d)T .We
                 will next show that the number of mistakes of the best expert is at most the number
                 of mistakes of the best hypothesis in H. The following key lemma shows that, on
                 any sequence of instances, for each hypothesis h ∈ H there exists an expert with the
                 same behavior.
                 Lemma 21.13. Let H be any hypothesis class with Ldim(H) < ∞.Let x 1 ,x 2 ,...,x T
                 be any sequence of instances. For any h ∈ H, there exists L ≤ Ldim(H) and indices
                 1 ≤ i 1 < i 2 < ··· < i L ≤ T such that when running Expert(i 1 ,i 2 ,...,i L ) on the sequence
                 x 1 ,x 2 ,...,x T , the expert predicts h(x t ) on each online round t = 1,2,...,T .
                 Proof. Fix h ∈ H and the sequence x 1 ,x 2 ,...,x T . We must construct L and the
                 indices i 1 ,i 2 ,...,i L . Consider running SOA on the input (x 1 ,h(x 1 )), (x 2 ,h(x 2 )), ...,
                 (x T ,h(x T )). SOA makes at most Ldim(H) mistakes on such input. We define L to
                 be the number of mistakes made by SOA and we define {i 1 ,i 2 ,...,i L } to be the set
                 of rounds in which SOA made the mistakes.
                    Now, consider the Expert(i 1 ,i 2 ,...,i L ) running on the sequence x 1 ,x 2 ,...,x T .
                 By construction, the set V t maintained by Expert(i 1 ,i 2 ,...,i L ) equals the set V t
                 maintained by SOA when running on the sequence (x 1 ,h(x 1 )),...,(x T ,h(x T )). The
                 predictions of SOA differ from the predictions of h if and only if the round is
                 in {i 1 ,i 2 ,...,i L }.Since Expert(i 1 ,i 2 ,...,i L ) predicts exactly like SOA if t is not
                 in {i 1 ,i 2 ,...,i L } and the opposite of SOAs’ predictions if t is in {i 1 ,i 2 ,...,i L },we
                 conclude that the predictions of the expert are always the same as the predic-
                 tions of h.
                    The previous lemma holds in particular for the hypothesis in H that makes the
                 least number of mistakes on the sequence of examples, and we therefore obtain the
                 following:
                 Corollary 21.14. Let (x 1 , y 1 ),(x 2 , y 2 ),...,(x T , y T ) be a sequence of examples and let
                 H be a hypothesis class with Ldim(H) < ∞. There exists L ≤ Ldim(H) and indices
                 1 ≤ i 1 < i 2 < ··· < i L ≤ T , such that Expert(i 1 ,i 2 ,...,i L ) makes at most as many
                 mistakes as the best h ∈ H does, namely,
                                                  T
                                             min    |h(x t ) − y t |
                                             h∈H
                                                 t=1
                 mistakes on the sequence of examples.
                    Together with Theorem 21.11, the upper bound part of Theorem 21.10 is proven.





