Page 176 - Understanding Machine Learning
P. 176

Stochastic Gradient Descent
           158

                 Therefore, for any 
> 0,to achieve E[ f ( ¯ w)] − f (w ) ≤ 
, it suffices to run the SGD
                 algorithm for a number of iterations that satisfies
                                                      2 2
                                                     B ρ
                                                 T ≥     .
                                                      
 2
                 Proof. Let us introduce the notation v 1:t to denote the sequence v 1 ,...,v t .Taking
                 expectation of Equation (14.2), we obtain
                                                        T



                                                               (t)

                                E [ f ( ¯ w) − f (w )] ≤ E  T 1  ( f (w ) − f (w )) .
                               v 1:T              v 1:T
                                                        t=1
                 Since Lemma 14.1 holds for any sequence v 1 ,v 2 ,...v T , it applies to SGD as well. By
                 taking expectation of the bound in the lemma we have
                                              T
                                            1      (t)          B ρ
                                        E        w   − w ,v t   ≤ √ .               (14.9)
                                       v 1:T T                   T
                                              t=1
                 It is left to show that

                                T                           T
                                       (t)               1      (t)
                              1
                          E       ( f (w ) − f (w )) ≤ E       w  − w ,v t   ,     (14.10)
                              T                          T
                          v 1:T                      v 1:T
                                t=1                        t=1
                 which we will hereby prove.
                    Using the linearity of the expectation we have
                                      T                   T
                                   1      (t)          1          (t)
                               E        w   − w ,v t   =     E [ w  − w ,v t  ].
                              v 1:T T                  T    v 1:T
                                     t=1                 t=1
                 Next, we recall the law of total expectation: For every two random variables α,β,
                 and a function g, E α [g(α)] = E β E α [g(α)|β]. Setting α = v 1:t and β = v 1:t−1 we get
                 that


                                E [ w (t)  − w , v t  ] = E [ w (t)  − w , v t  ]
                               v 1:T              v 1:t

                                                = E E [ w  (t)  − w ,v t  |v 1:t−1 ].
                                                  v 1:t−1 v 1:t
                 Once we know v 1:t−1 , the value of w (t)  is not random any more and therefore


                            E E [ w (t)  − w ,v t  |v 1:t−1 ] = E  w (t)  − w , E[v t |v 1:t−1 ] .
                           v 1:t−1 v 1:t               v 1:t−1      v t
                                                                                    (t)
                                                                          (t)
                 Since w (t)  only depends on v 1:t−1 and SGD requires that E v t  [v t |w ] ∈ ∂ f (w )we
                                              (t)
                              [v t |v 1:t−1 ] ∈ ∂ f (w ). Thus,
                 obtain that E v t
                                                                  (t)


                               E  w (t)  − w ,E[v t |v 1:t−1 ] ≥  E [ f (w ) − f (w )].
                              v 1:t−1      v t           v 1:t−1
                 Overall, we have shown that


                                                             (t)
                                   E [ w (t)  − w ,v t  ] ≥ E [ f (w ) − f (w )]
                                   v 1:T             v 1:t−1

                                                            (t)
                                                   = E [ f (w ) − f (w )].
                                                     v 1:T
   171   172   173   174   175   176   177   178   179   180   181