Page 178 - Understanding Machine Learning

Stochastic Gradient Descent

Therefore,

$$
\|w - u\|^2 = \|w - v + v - u\|^2
            = \|w - v\|^2 + \|v - u\|^2 + 2\langle v - w,\; u - v\rangle
            \ge \|v - u\|^2 .
$$



Equipped with the preceding lemma, we can easily adapt the analysis of SGD to the case in which we add projection steps onto a closed and convex set. Simply note that for every t,

$$
\|w^{(t+1)} - w^\star\|^2 - \|w^{(t)} - w^\star\|^2
= \|w^{(t+1)} - w^\star\|^2 - \|w^{(t+\frac{1}{2})} - w^\star\|^2
  + \|w^{(t+\frac{1}{2})} - w^\star\|^2 - \|w^{(t)} - w^\star\|^2
\le \|w^{(t+\frac{1}{2})} - w^\star\|^2 - \|w^{(t)} - w^\star\|^2 .
$$

Therefore, Lemma 14.1 holds when we add projection steps and hence the rest of
the analysis follows directly.
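As an illustrative sketch (not the book's code; the helper names and the choice of a Euclidean ball as the constraint set are ours), the projection step can be implemented like this: each iteration first takes the unconstrained half-step $w^{(t+\frac{1}{2})}$ and then projects it back onto the convex set to obtain $w^{(t+1)}$.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto the closed convex set {x : ||x|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_sgd(grad_fn, dim, eta, T, radius, rng):
    """SGD where each step w^(t+1/2) = w^(t) - eta*v_t is followed by a
    projection back onto the constraint set; outputs the averaged iterate."""
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        v = grad_fn(w, rng)                    # stochastic (sub)gradient v_t
        w_half = w - eta * v                   # unconstrained half-step w^(t+1/2)
        w = project_onto_ball(w_half, radius)  # projection step gives w^(t+1)
        iterates.append(w)
    return np.mean(np.stack(iterates), axis=0)  # output w-bar
```

Because every $w^{(t)}$ lies in the (convex) ball, their average does too, so the output is always feasible.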
                 14.4.2 Variable Step Size

Another variant of SGD is decreasing the step size as a function of t. That is, rather
than updating with a constant η, we use a step size η_t that depends on t. For instance, we can set $\eta_t = \frac{B}{\rho\sqrt{t}}$ and
achieve a bound similar to Theorem 14.8. The idea is that when we are closer to the
minimum of the function, we take our steps more carefully, so as not to “overshoot”
the minimum.
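A minimal sketch of this variant, assuming the constants B and ρ from the analysis are known in advance (the function name and signature are ours, not the book's):

```python
import numpy as np

def sgd_decreasing_step(grad_fn, dim, B, rho, T, rng):
    """SGD with the decreasing step size eta_t = B / (rho * sqrt(t)),
    outputting the average of all iterates."""
    w = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        eta_t = B / (rho * np.sqrt(t))   # smaller, more careful steps as t grows
        w = w - eta_t * grad_fn(w, rng)
        iterates.append(w)
    return np.mean(np.stack(iterates), axis=0)
```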


14.4.3 Other Averaging Techniques

We have set the output vector to be $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$. There are alternative
approaches, such as outputting $w^{(t)}$ for some random $t \in [T]$, or outputting the aver-
age of $w^{(t)}$ over the last $\alpha T$ iterations, for some $\alpha \in (0,1)$. One can also take
a weighted average of the last few iterates. These more sophisticated averaging
schemes can improve the convergence speed in some situations, such as in the case
of strongly convex functions defined in the following.
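One of these alternatives, averaging over the last $\alpha T$ iterations (sometimes called a suffix average), can be sketched as follows; the function name and constants are ours:

```python
import numpy as np

def sgd_suffix_average(grad_fn, dim, eta, T, alpha, rng):
    """Plain SGD, but output the average of w^(t) over only the
    last alpha*T iterations rather than over all T of them."""
    w = np.zeros(dim)
    iterates = []
    for _ in range(T):
        w = w - eta * grad_fn(w, rng)
        iterates.append(w)
    k = max(1, int(alpha * T))          # number of trailing iterates to keep
    return np.mean(np.stack(iterates[-k:]), axis=0)
```

Discarding the early iterates removes the transient phase of the optimization from the average, which is what helps in the strongly convex case discussed next.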


                 14.4.4 Strongly Convex Functions*
                 In this section we show a variant of SGD that enjoys a faster convergence rate for
                 problems in which the objective function is strongly convex (see Definition 13.4 of
                 strong convexity in the previous chapter). We rely on the following claim, which
                 generalizes Lemma 13.5.
Claim 14.10. If f is λ-strongly convex, then for every w, u and v ∈ ∂ f (w) we have

$$
\langle w - u,\, v\rangle \;\ge\; f(w) - f(u) + \frac{\lambda}{2}\,\|w - u\|^2 .
$$
                    The proof is similar to the proof of Lemma 13.5 and is left as an exercise.
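Since the proof is left as an exercise, the following is only a numerical sanity check of the claim, on the simple λ-strongly convex function $f(w) = \frac{\lambda}{2}\|w\|^2$, whose gradient at w is λw (for this particular f the inequality in fact holds with equality):

```python
import numpy as np

def claim_holds(w, u, lam, tol=1e-9):
    """Check <w - u, v> >= f(w) - f(u) + (lam/2)*||w - u||^2
    for f(w) = (lam/2)*||w||^2, where the gradient at w is v = lam*w."""
    f = lambda x: 0.5 * lam * float(np.dot(x, x))
    v = lam * w                                   # gradient of this f at w
    lhs = float(np.dot(w - u, v))
    rhs = f(w) - f(u) + 0.5 * lam * float(np.dot(w - u, w - u))
    return lhs >= rhs - tol
```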