Page 183 - Understanding Machine Learning


Define $f(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + L_S(\mathbf{w})$. Note that $f$ is a $\lambda$-strongly convex function; therefore, we can apply the SGD variant given in Section 14.4.4 (with $\mathcal{H} = \mathbb{R}^d$). To apply this algorithm, we only need to find a way to construct an unbiased estimate of a subgradient of $f$ at $\mathbf{w}^{(t)}$. This is easily done by noting that if we pick $z$ uniformly at random from $S$, and choose $\mathbf{v}_t \in \partial \ell(\mathbf{w}^{(t)}, z)$, then the expected value of $\lambda \mathbf{w}^{(t)} + \mathbf{v}_t$ is a subgradient of $f$ at $\mathbf{w}^{(t)}$.
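As a concrete illustration, this procedure can be sketched in Python. The function name `sgd_regularized` is ours, and the choice of the hinge loss for $\partial \ell$ is an illustrative assumption, not fixed by the text:

```python
import numpy as np

def sgd_regularized(S, lam, T, subgrad_loss, d):
    """SGD for f(w) = (lam/2)*||w||^2 + L_S(w), with H = R^d
    (so no projection step is needed), using step size 1/(lam*t).

    subgrad_loss(w, z) must return some v in the subdifferential
    of the loss at (w, z)."""
    rng = np.random.default_rng(0)
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        z = S[rng.integers(len(S))]   # z drawn uniformly at random from S
        v = subgrad_loss(w, z)        # v_t, a subgradient of the loss at w
        g = lam * w + v               # unbiased estimate of a subgradient of f
        w = w - g / (lam * t)         # step size eta_t = 1/(lam*t)
        w_sum += w
    return w_sum / T                  # the averaged output, w-bar

# Hinge loss subgradient as an illustrative choice of the loss:
def hinge_subgrad(w, z):
    x, y = z
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)
```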
To analyze the resulting algorithm, we first rewrite the update rule (assuming that $\mathcal{H} = \mathbb{R}^d$ and therefore the projection step does not matter) as follows:
$$
\begin{aligned}
\mathbf{w}^{(t+1)} &= \mathbf{w}^{(t)} - \frac{1}{\lambda t}\left(\lambda \mathbf{w}^{(t)} + \mathbf{v}_t\right) \\
&= \left(1 - \frac{1}{t}\right)\mathbf{w}^{(t)} - \frac{1}{\lambda t}\,\mathbf{v}_t \\
&= \frac{t-1}{t}\,\mathbf{w}^{(t)} - \frac{1}{\lambda t}\,\mathbf{v}_t \\
&= \frac{t-1}{t}\left(\frac{t-2}{t-1}\,\mathbf{w}^{(t-1)} - \frac{1}{\lambda (t-1)}\,\mathbf{v}_{t-1}\right) - \frac{1}{\lambda t}\,\mathbf{v}_t \\
&\;\;\vdots \\
&= -\frac{1}{\lambda t}\sum_{i=1}^{t} \mathbf{v}_i. \qquad (14.15)
\end{aligned}
$$
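The telescoping identity in Equation (14.15) is purely algebraic, so it can be checked numerically for arbitrary vectors $\mathbf{v}_t$; a minimal sketch:

```python
import numpy as np

# Check Equation (14.15): iterating
#   w(t+1) = w(t) - (1/(lam*t)) * (lam*w(t) + v_t)
# from w(1) = 0 gives w(T+1) = -(1/(lam*T)) * sum of v_1..v_T.
# The v_t here are arbitrary fixed vectors; in the algorithm they
# would be loss subgradients, but the identity holds regardless.
rng = np.random.default_rng(1)
lam, T, d = 0.5, 50, 3
vs = rng.standard_normal((T, d))

w = np.zeros(d)
for t in range(1, T + 1):
    w = w - (lam * w + vs[t - 1]) / (lam * t)

closed_form = -vs.sum(axis=0) / (lam * T)
assert np.allclose(w, closed_form)
```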
If we assume that the loss function is $\rho$-Lipschitz, it follows that for all $t$ we have $\|\mathbf{v}_t\| \le \rho$, and therefore (by Equation (14.15), $\lambda \mathbf{w}^{(t)}$ is an average of $\mathbf{v}_1, \ldots, \mathbf{v}_{t-1}$) also $\|\lambda \mathbf{w}^{(t)}\| \le \rho$, which yields
$$\left\|\lambda \mathbf{w}^{(t)} + \mathbf{v}_t\right\| \le 2\rho.$$

Theorem 14.11 therefore tells us that after performing $T$ iterations we have that
$$\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \;\le\; \frac{4\rho^2}{\lambda T}\left(1 + \log(T)\right).$$
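Since the right-hand side of this bound decreases monotonically in $T$, one can numerically invert it to find how many iterations suffice for a given target accuracy $\epsilon$. The helper below is an illustrative sketch of ours, not from the text:

```python
import math

def iterations_for_accuracy(rho, lam, eps):
    """Smallest T for which the bound (4*rho**2/(lam*T))*(1 + log(T))
    is at most eps.  Uses doubling followed by binary search, which is
    valid because the bound is non-increasing in T for T >= 1."""
    bound = lambda T: (4 * rho**2 / (lam * T)) * (1 + math.log(T))
    T = 1
    while bound(T) > eps:   # double until the bound is small enough
        T *= 2
    lo, hi = T // 2, T      # invariant: bound(lo) > eps >= bound(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if bound(mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi

# Example usage with illustrative parameter values:
T_needed = iterations_for_accuracy(rho=1.0, lam=0.1, eps=0.1)
```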


              14.6 SUMMARY
              We have introduced the Gradient Descent and Stochastic Gradient Descent algo-
              rithms, along with several of their variants. We have analyzed their convergence rate
              and calculated the number of iterations that would guarantee an expected objective
of at most $\epsilon$ plus the optimal objective. Most importantly, we have shown that by using SGD we can directly minimize the risk function. We do so by sampling a point i.i.d. from $\mathcal{D}$ and using a subgradient of the loss of the current hypothesis $\mathbf{w}^{(t)}$
              at this point as an unbiased estimate of the gradient (or a subgradient) of the risk
              function. This implies that a bound on the number of iterations also yields a sam-
              ple complexity bound. Finally, we have also shown how to apply the SGD method
              to the problem of regularized risk minimization. In future chapters we show how
              this yields extremely simple solvers to some optimization problems associated with
              regularized risk minimization.