This tree is shattered by H. And, because of the density of the reals, this tree can be made arbitrarily deep.
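For concreteness, and assuming (as the appeal to the density of the reals suggests) that H is the class of thresholds over $[0,1]$ from the preceding example, such a tree of depth $d$ can be written down explicitly using dyadic points:

    % a depth-d tree shattered by thresholds on [0,1]:
    % the root queries the midpoint, and each answer halves the
    % subinterval of thresholds consistent with the answers so far
    v_1 = \tfrac{1}{2}, \qquad
    \text{a node } v \text{ at depth } k \text{ has children } v \mp 2^{-(k+1)} .

Every root-to-leaf sequence of answers then corresponds to a nested sequence of subintervals, each containing a threshold consistent with all the answers given so far, so the tree is shattered at every depth $d$.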

Lemma 21.6 states that $\mathrm{Ldim}(H)$ lower bounds the mistake bound of any algorithm. Interestingly, there is a standard algorithm whose mistake bound matches this lower bound. The algorithm is similar to the Halving algorithm. Recall that the prediction of Halving is made according to a majority vote of the hypotheses which are consistent with previous examples. We denoted this set by $V_t$. Put another way, Halving partitions $V_t$ into two sets: $V_t^+ = \{h \in V_t : h(x_t) = 1\}$ and $V_t^- = \{h \in V_t : h(x_t) = 0\}$. It then predicts according to the larger of the two groups. The rationale behind this prediction is that whenever Halving makes a mistake it ends up with $|V_{t+1}| \le \frac{1}{2}|V_t|$.
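As an illustrative sketch (not part of the text), the Halving rule for a finite class can be implemented directly. Here hypotheses are assumed to be Python callables returning a label in {0, 1}, and the label sequence is assumed to be realizable by some hypothesis in the class:

    # A minimal sketch of the Halving algorithm for a finite class.
    # Hypotheses are callables h(x) -> {0, 1}; the stream of labels
    # is assumed to be realizable by some hypothesis in the class.
    def halving(hypotheses, examples):
        V = list(hypotheses)                 # V_t: consistent hypotheses
        mistakes = 0
        for x, y in examples:
            V_pos = [h for h in V if h(x) == 1]
            V_neg = [h for h in V if h(x) == 0]
            p = 1 if len(V_pos) >= len(V_neg) else 0   # majority vote
            if p != y:
                mistakes += 1                # the majority was wrong, so
                                             # |V_{t+1}| <= 0.5 |V_t|
            V = V_pos if y == 1 else V_neg   # keep consistent hypotheses
        return mistakes

Since every mistake at least halves the consistent set, this makes at most $\log_2 |H|$ mistakes on any realizable sequence.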
The optimal algorithm we present in the following uses the same idea, but instead of predicting according to the larger class, it predicts according to the class with larger Ldim.
Standard Optimal Algorithm (SOA)

    input: A hypothesis class H
    initialize: $V_1 = H$
    for $t = 1, 2, \ldots$
      receive $x_t$
      for $r \in \{0,1\}$ let $V_t^{(r)} = \{h \in V_t : h(x_t) = r\}$
      predict $p_t = \operatorname{argmax}_{r \in \{0,1\}} \mathrm{Ldim}(V_t^{(r)})$
        (in case of a tie predict $p_t = 1$)
      receive true label $y_t$
      update $V_{t+1} = \{h \in V_t : h(x_t) = y_t\}$
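To make the algorithm concrete, here is a sketch (an illustration, not from the text) for a small finite class over a finite domain, where each hypothesis is represented as a dict mapping every point of the domain to a label in {0, 1}. Computing $\mathrm{Ldim}$ this way is exponential in general, so this is only a toy implementation; the recursion follows the shattered-tree definition, with the convention that the empty class has dimension $-1$:

    # Littlestone dimension of a finite class over a finite domain.
    # Each hypothesis is a dict mapping every x in `domain` to 0 or 1.
    def ldim(H, domain):
        if len(H) <= 1:
            return len(H) - 1      # Ldim of empty class: -1; singleton: 0
        best = 0
        for x in domain:
            H0 = [h for h in H if h[x] == 0]
            H1 = [h for h in H if h[x] == 1]
            if not H0 or not H1:
                continue           # x does not split H; no tree rooted here
            # A tree rooted at x is shattered iff both label-subclasses
            # shatter trees that are one level shallower.
            best = max(best, 1 + min(ldim(H0, domain), ldim(H1, domain)))
        return best

    # SOA: predict with the subclass of larger Ldim; ties go to label 1.
    def soa(H, domain, examples):
        V = list(H)
        mistakes = 0
        for x, y in examples:
            V_split = {r: [h for h in V if h[x] == r] for r in (0, 1)}
            p = max((0, 1), key=lambda r: (ldim(V_split[r], domain), r))
            if p != y:
                mistakes += 1      # by Lemma 21.7 this lowers Ldim(V_t)
            V = V_split[y]         # keep hypotheses consistent with y_t
        return mistakes

For example, taking the domain $\{0, 1, 2, 3\}$ and H the five threshold functions on it, ldim returns 2, and soa makes at most two mistakes on any realizable sequence.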

The following lemma formally establishes the optimality of the preceding algorithm.

Lemma 21.7. SOA enjoys the mistake bound $M_{\mathrm{SOA}}(H) \le \mathrm{Ldim}(H)$.
Proof. It suffices to prove that whenever the algorithm makes a prediction mistake we have $\mathrm{Ldim}(V_{t+1}) \le \mathrm{Ldim}(V_t) - 1$. We prove this claim by assuming the contrary, that is, $\mathrm{Ldim}(V_{t+1}) = \mathrm{Ldim}(V_t)$; since $V_{t+1} \subseteq V_t$, this is the only other possibility. If this holds true, then the definition of $p_t$ implies that $\mathrm{Ldim}(V_t^{(r)}) = \mathrm{Ldim}(V_t)$ for both $r = 1$ and $r = 0$: on a mistake, $V_{t+1} = V_t^{(y_t)}$ with $y_t \ne p_t$, and $p_t$ was chosen to maximize $\mathrm{Ldim}(V_t^{(r)})$. But then we can construct a shattered tree of depth $\mathrm{Ldim}(V_t) + 1$ for the class $V_t$, by taking $x_t$ as the root and attaching shattered trees of depth $\mathrm{Ldim}(V_t)$ for $V_t^{(0)}$ and $V_t^{(1)}$ as its subtrees, which leads to the desired contradiction.

Combining Lemma 21.7 and Lemma 21.6 we obtain:

Corollary 21.8. Let H be any hypothesis class. Then, the standard optimal algorithm enjoys the mistake bound $M_{\mathrm{SOA}}(H) = \mathrm{Ldim}(H)$ and no other algorithm $A$ can have $M_A(H) < \mathrm{Ldim}(H)$.

Comparison to VC Dimension

In the PAC learning model, learnability is characterized by the VC dimension of the class H. Recall that the VC dimension of a class H is the maximal number $d$ such that there are instances $x_1, \ldots, x_d$ that are shattered by H. That is, for any