

              Summing over t, dividing by T, and using the linearity of expectation, we get that
              Equation (14.10) holds, which concludes our proof.


              14.4 VARIANTS

              In this section we describe several variants of Stochastic Gradient Descent.


              14.4.1 Adding a Projection Step
               In the previous analyses of the GD and SGD algorithms we required that the norm
               of w be at most B, which is equivalent to requiring that w is in the set H =
               {w : ‖w‖ ≤ B}. In terms of learning, this means restricting ourselves to a B-bounded
               hypothesis class. Yet any step we take in the opposite direction of the gradient (or
               its expected direction) might result in stepping outside of this bound, and there is
               no guarantee that w̄ satisfies it either. We show in the following how to overcome this
               problem while maintaining the same convergence rate.
                 The basic idea is to add a projection step; namely, we will now have a two-step
              update rule, where we first subtract a subgradient from the current value of w and
              then project the resulting vector onto H. Formally,
                 1. w^(t+1/2) = w^(t) − η v_t
                 2. w^(t+1) = argmin_{w ∈ H} ‖w − w^(t+1/2)‖
              The projection step replaces the current value of w by the vector in H closest to it.
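               When H is a Euclidean ball of radius B, the projection in step 2 has a simple closed
               form: a point outside the ball is rescaled back to the boundary, and a point inside is
               left unchanged. The following is a minimal Python sketch of the resulting two-step
               update; the function names, the starting point w^(1) = 0, and the toy objective in the
               usage lines at the end are illustrative assumptions rather than part of the text.

```python
import numpy as np


def project_onto_ball(w, B):
    """Euclidean projection onto H = {w : ||w|| <= B}.

    If w already lies in H it is returned unchanged; otherwise it is
    rescaled to the boundary of the ball, the closest point of H.
    """
    norm = np.linalg.norm(w)
    return w if norm <= B else (B / norm) * w


def sgd_with_projection(subgradient, dim, B, eta, T, rng):
    """Projected SGD: a gradient step followed by a projection onto H.

    `subgradient(w, rng)` should return a stochastic (sub)gradient v_t whose
    expectation lies in the subdifferential of the objective at w.
    Returns the averaged iterate w_bar, which lies in H because H is convex.
    """
    w = np.zeros(dim)                     # w^(1) = 0 lies in H
    total = np.zeros(dim)
    for _ in range(T):
        total += w                        # accumulate w^(t) for the average
        v = subgradient(w, rng)           # stochastic subgradient v_t
        w_half = w - eta * v              # step 1: w^(t+1/2) = w^(t) - eta * v_t
        w = project_onto_ball(w_half, B)  # step 2: project back onto H
    return total / T                      # w_bar = (1/T) * sum_t w^(t)


# Toy usage: noisy gradients of f(w) = ||w - w_star||^2 with w_star outside H,
# so the projection step is actually exercised.
rng = np.random.default_rng(0)
w_star = np.array([3.0, 0.0])


def noisy_grad(w, rng):
    # gradient of ||w - w_star||^2 plus Gaussian noise
    return 2.0 * (w - w_star) + rng.normal(scale=0.1, size=w.shape)


w_bar = sgd_with_projection(noisy_grad, dim=2, B=1.0, eta=0.01, T=5000)
print(w_bar, np.linalg.norm(w_bar))       # the norm stays at most B = 1
```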
                  Clearly, the projection step guarantees that w^(t) ∈ H for all t. Since H is convex,
               this also implies that w̄ ∈ H, as required. We next show that the analysis of SGD with
               projections remains the same. This is based on the following lemma.
               Lemma 14.9 (Projection Lemma). Let H be a closed convex set and let v be the
               projection of w onto H, namely,

                                      v = argmin_{x ∈ H} ‖x − w‖².

               Then, for every u ∈ H,

                                      ‖w − u‖² − ‖v − u‖² ≥ 0.
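               As a quick illustrative check of the lemma (the specific numbers are not from the
               text): take H to be the unit ball, w = (2, 0) with projection v = (1, 0), and
               u = (0, 1) ∈ H. Then

               \[
               \|w - u\|^2 - \|v - u\|^2 = (4 + 1) - (1 + 1) = 3 \ge 0 .
               \]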
              Proof. By the convexity of H, for every α ∈ (0,1) we have that v + α(u − v) ∈ H.
              Therefore, from the optimality of v we obtain
                             ‖v − w‖² ≤ ‖v + α(u − v) − w‖²
                                      = ‖v − w‖² + 2α ⟨v − w, u − v⟩ + α² ‖u − v‖².
               Rearranging, we obtain
                                      2 ⟨v − w, u − v⟩ ≥ −α ‖u − v‖².
              Taking the limit α → 0 we get that

                                      ⟨v − w, u − v⟩ ≥ 0.
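               From this inequality the claim of the lemma follows by expanding ‖w − u‖² around v;
               a sketch of this concluding step, in the same notation:

               \begin{align*}
               \|w - u\|^2 &= \|(w - v) + (v - u)\|^2 \\
                           &= \|w - v\|^2 + 2\langle w - v,\, v - u\rangle + \|v - u\|^2 \\
                           &\ge \|v - u\|^2 ,
               \end{align*}

               since ⟨w − v, v − u⟩ = ⟨v − w, u − v⟩ ≥ 0, which is exactly the claim
               ‖w − u‖² − ‖v − u‖² ≥ 0.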