Page 276 - Understanding Machine Learning

Online Learning
Theorem 21.15. The Online Gradient Descent algorithm enjoys the following regret bound for every $\mathbf{w}^\star \in \mathcal{H}$,
$$\mathrm{Regret}_A(\mathbf{w}^\star, T) \;\le\; \frac{\|\mathbf{w}^\star\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|\mathbf{v}_t\|^2 .$$
If we further assume that $f_t$ is $\rho$-Lipschitz for all $t$, then setting $\eta = 1/\sqrt{T}$ yields
$$\mathrm{Regret}_A(\mathbf{w}^\star, T) \;\le\; \frac{1}{2}\left(\|\mathbf{w}^\star\|^2 + \rho^2\right)\sqrt{T} .$$
If we further assume that $\mathcal{H}$ is $B$-bounded and we set $\eta = \frac{B}{\rho\sqrt{T}}$ then
$$\mathrm{Regret}_A(\mathcal{H}, T) \;\le\; B\,\rho\,\sqrt{T} .$$
Proof. The analysis is similar to the analysis of Stochastic Gradient Descent with projections. Using the projection lemma, the definition of $\mathbf{w}^{(t+\frac{1}{2})}$, and the definition of subgradients, we have that for every $t$,
$$\begin{aligned}
\|\mathbf{w}^{(t+1)} - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(t)} - \mathbf{w}^\star\|^2
&= \|\mathbf{w}^{(t+1)} - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(t+\frac{1}{2})} - \mathbf{w}^\star\|^2 + \|\mathbf{w}^{(t+\frac{1}{2})} - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(t)} - \mathbf{w}^\star\|^2 \\
&\le \|\mathbf{w}^{(t+\frac{1}{2})} - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(t)} - \mathbf{w}^\star\|^2 \\
&= \|\mathbf{w}^{(t)} - \eta\,\mathbf{v}_t - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(t)} - \mathbf{w}^\star\|^2 \\
&= -2\eta\,\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\, \mathbf{v}_t\rangle + \eta^2\,\|\mathbf{v}_t\|^2 \\
&\le -2\eta\left(f_t(\mathbf{w}^{(t)}) - f_t(\mathbf{w}^\star)\right) + \eta^2\,\|\mathbf{v}_t\|^2 .
\end{aligned}$$
Summing over $t$ and observing that the left-hand side is a telescopic sum we obtain that
$$\|\mathbf{w}^{(T+1)} - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(1)} - \mathbf{w}^\star\|^2 \;\le\; -2\eta \sum_{t=1}^{T}\left(f_t(\mathbf{w}^{(t)}) - f_t(\mathbf{w}^\star)\right) + \eta^2 \sum_{t=1}^{T}\|\mathbf{v}_t\|^2 .$$
Rearranging the inequality and using the fact that $\mathbf{w}^{(1)} = \mathbf{0}$, we get that
$$\sum_{t=1}^{T}\left(f_t(\mathbf{w}^{(t)}) - f_t(\mathbf{w}^\star)\right) \;\le\; \frac{\|\mathbf{w}^{(1)} - \mathbf{w}^\star\|^2 - \|\mathbf{w}^{(T+1)} - \mathbf{w}^\star\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|\mathbf{v}_t\|^2
\;\le\; \frac{\|\mathbf{w}^\star\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|\mathbf{v}_t\|^2 .$$
This proves the first bound in the theorem. The second bound follows from the assumption that $f_t$ is $\rho$-Lipschitz, which implies that $\|\mathbf{v}_t\| \le \rho$: substituting $\eta = 1/\sqrt{T}$ into the first bound gives $\frac{\sqrt{T}}{2}\|\mathbf{w}^\star\|^2 + \frac{\rho^2 T}{2\sqrt{T}} = \frac{1}{2}(\|\mathbf{w}^\star\|^2 + \rho^2)\sqrt{T}$. The third bound follows similarly by substituting $\eta = \frac{B}{\rho\sqrt{T}}$ and using $\|\mathbf{w}^\star\| \le B$.
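The guarantee above can be sanity-checked numerically. The sketch below (an illustration under assumed names, not code from the book) runs Online Gradient Descent with Euclidean projection onto a $B$-bounded ball on a toy sequence of $\rho$-Lipschitz linear losses $f_t(\mathbf{w}) = \langle \mathbf{z}_t, \mathbf{w}\rangle$, using the step size $\eta = B/(\rho\sqrt{T})$ from the theorem, and checks the regret against the $B\rho\sqrt{T}$ bound; the function name and the random data are assumptions.

```python
import numpy as np

def online_gradient_descent(grads, eta, T, dim, B=None):
    """Online Gradient Descent: take a subgradient step w - eta * v_t,
    then project onto the Euclidean ball of radius B (if B is given).
    Returns the sequence of predictions w^(1), ..., w^(T)."""
    w = np.zeros(dim)  # w^(1) = 0, as in the proof
    ws = []
    for t in range(T):
        ws.append(w.copy())
        v = grads[t](w)          # a subgradient of f_t at w^(t)
        w = w - eta * v          # gradient step, giving w^(t+1/2)
        if B is not None:        # projection step, giving w^(t+1)
            norm = np.linalg.norm(w)
            if norm > B:
                w = w * (B / norm)
    return ws

# Toy linear losses f_t(w) = <z_t, w>; each is rho-Lipschitz with
# rho = max_t ||z_t||, and v_t = z_t is a (sub)gradient everywhere.
rng = np.random.default_rng(0)
T, dim, B = 200, 5, 1.0
zs = [rng.standard_normal(dim) for _ in range(T)]
rho = max(np.linalg.norm(z) for z in zs)
eta = B / (rho * np.sqrt(T))   # step size from the theorem

ws = online_gradient_descent(
    grads=[(lambda w, z=z: z) for z in zs],
    eta=eta, T=T, dim=dim, B=B)

# For linear losses the best fixed point in the B-ball is
# w* = -B * sum(z_t) / ||sum(z_t)||.
z_sum = np.sum(zs, axis=0)
w_star = -B * z_sum / np.linalg.norm(z_sum)
regret = sum(z @ w for z, w in zip(zs, ws)) - sum(z @ w_star for z in zs)
bound = B * rho * np.sqrt(T)
assert regret <= bound + 1e-9  # Regret_A(H, T) <= B * rho * sqrt(T)
```

In this toy run the measured regret sits well below the worst-case $B\rho\sqrt{T}$ bound; the bound is tight only for adversarial loss sequences.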
                 21.4 THE ONLINE PERCEPTRON ALGORITHM
The Perceptron is a classic online learning algorithm for binary classification with the hypothesis class of homogeneous halfspaces, namely, $\mathcal{H} = \{\mathbf{x} \mapsto \mathrm{sign}(\langle \mathbf{w}, \mathbf{x}\rangle):$