Page 171 - Understanding Machine Learning

14.1 Gradient Descent  153


satisfies
$$\sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle \le \frac{\|\mathbf{w}\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 . \tag{14.5}$$
In particular, for every $B, \rho > 0$, if for all $t$ we have $\|\mathbf{v}_t\| \le \rho$ and if we set $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then for every $\mathbf{w}$ with $\|\mathbf{w}\| \le B$ we have
$$\frac{1}{T} \sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle \le \frac{B \rho}{\sqrt{T}} .$$
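Since the lemma holds for an *arbitrary* sequence of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_T$, inequality (14.5) can be checked numerically. The following sketch (an illustration, not part of the book) runs the update rule $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ from $\mathbf{w}^{(1)} = \mathbf{0}$ on random vectors and compares both sides of (14.5) for a random competitor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 5, 100, 0.1

# An arbitrary sequence of vectors v_1, ..., v_T.
vs = rng.normal(size=(T, d))

# Run the update rule w^(t+1) = w^(t) - eta * v_t, starting from w^(1) = 0,
# recording each iterate w^(t) before it is updated.
w = np.zeros(d)
iterates = []
for v in vs:
    iterates.append(w.copy())
    w = w - eta * v

# Compare both sides of inequality (14.5) for a random competitor w_star.
w_star = rng.normal(size=d)
lhs = sum(np.dot(wt - w_star, v) for wt, v in zip(iterates, vs))
rhs = np.dot(w_star, w_star) / (2 * eta) + (eta / 2) * sum(np.dot(v, v) for v in vs)
assert lhs <= rhs
```

The inequality holds for every choice of `vs`, `eta`, and `w_star`, because the proof only drops the nonnegative term $\|\mathbf{w}^{(T+1)} - \mathbf{w}\|^2$ from an exact identity.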
Proof. Using algebraic manipulations (completing the square), we obtain:
$$\begin{aligned}
\left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle
&= \frac{1}{\eta} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \eta \mathbf{v}_t \right\rangle \\
&= \frac{1}{2\eta} \left( -\left\| \mathbf{w}^{(t)} - \mathbf{w} - \eta \mathbf{v}_t \right\|^2 + \left\| \mathbf{w}^{(t)} - \mathbf{w} \right\|^2 + \eta^2 \|\mathbf{v}_t\|^2 \right) \\
&= \frac{1}{2\eta} \left( -\left\| \mathbf{w}^{(t+1)} - \mathbf{w} \right\|^2 + \left\| \mathbf{w}^{(t)} - \mathbf{w} \right\|^2 \right) + \frac{\eta}{2} \|\mathbf{v}_t\|^2 ,
\end{aligned}$$
where the last equality follows from the definition of the update rule. Summing the equality over $t$, we have
$$\sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle = \frac{1}{2\eta} \sum_{t=1}^{T} \left( -\left\| \mathbf{w}^{(t+1)} - \mathbf{w} \right\|^2 + \left\| \mathbf{w}^{(t)} - \mathbf{w} \right\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 . \tag{14.6}$$
The first sum on the right-hand side is a telescopic sum that collapses to
$$\left\| \mathbf{w}^{(1)} - \mathbf{w} \right\|^2 - \left\| \mathbf{w}^{(T+1)} - \mathbf{w} \right\|^2 .$$
Plugging this into Equation (14.6), we have
$$\begin{aligned}
\sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle
&= \frac{1}{2\eta} \left( \left\| \mathbf{w}^{(1)} - \mathbf{w} \right\|^2 - \left\| \mathbf{w}^{(T+1)} - \mathbf{w} \right\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 \\
&\le \frac{1}{2\eta} \left\| \mathbf{w}^{(1)} - \mathbf{w} \right\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 \\
&= \frac{1}{2\eta} \|\mathbf{w}\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 ,
\end{aligned}$$
where the last equality is due to the definition $\mathbf{w}^{(1)} = \mathbf{0}$. This proves the first part of the lemma (Equation (14.5)). The second part follows by upper bounding $\|\mathbf{w}\|$ by $B$ and $\|\mathbf{v}_t\|$ by $\rho$, dividing by $T$, and plugging in the value of $\eta$.
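The second part of the lemma can likewise be illustrated numerically. This sketch (an illustration, not part of the book) uses vectors of norm exactly $\rho$, a competitor of norm exactly $B$, and the step size $\eta = \sqrt{B^2/(\rho^2 T)}$ from the lemma, then checks that the average is at most $B\rho/\sqrt{T}$:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
d, T, B, rho = 5, 400, 2.0, 1.0
eta = math.sqrt(B**2 / (rho**2 * T))  # the step size chosen in the lemma

# Vectors with norm exactly rho (so ||v_t|| <= rho holds).
vs = rng.normal(size=(T, d))
vs *= rho / np.linalg.norm(vs, axis=1, keepdims=True)

# A competitor with norm exactly B (so ||w|| <= B holds).
w_star = rng.normal(size=d)
w_star *= B / np.linalg.norm(w_star)

# Accumulate sum_t <w^(t) - w, v_t> along the update rule, from w^(1) = 0.
w = np.zeros(d)
total = 0.0
for v in vs:
    total += np.dot(w - w_star, v)
    w = w - eta * v

# The averaged bound from the second part of the lemma.
assert total / T <= B * rho / math.sqrt(T)
```

With this $\eta$, the two terms of the right-hand side of (14.5) are balanced ($\frac{B^2}{2\eta} = \frac{\eta \rho^2 T}{2} = \frac{B\rho\sqrt{T}}{2}$), which is exactly what yields the $B\rho/\sqrt{T}$ rate after dividing by $T$.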

Lemma 14.1 applies to the GD algorithm with $\mathbf{v}_t = \nabla f(\mathbf{w}^{(t)})$. As we will show later in Lemma 14.7, if $f$ is $\rho$-Lipschitz, then $\|\nabla f(\mathbf{w}^{(t)})\| \le \rho$. We therefore satisfy