Stochastic Gradient Descent
Therefore, for any $\epsilon > 0$, to achieve $\mathbb{E}[f(\bar{w})] - f(w^\star) \le \epsilon$, it suffices to run the SGD algorithm for a number of iterations that satisfies
$$T \ge \frac{B^2 \rho^2}{\epsilon^2}.$$
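For concreteness, the following is a minimal Python sketch of the SGD variant analyzed here: a fixed step size $\eta = B/(\rho\sqrt{T})$, the number of iterations $T$ chosen according to the bound above, and the averaged iterate $\bar{w}$ returned at the end. The names sgd_average and subgradient_oracle are hypothetical; the oracle is assumed to return, given the current iterate, a random vector whose conditional expectation is a subgradient of $f$ and whose norm is at most $\rho$.

import math
import numpy as np

def sgd_average(subgradient_oracle, dim, B, rho, epsilon, rng=None):
    """Run SGD with a fixed step size and return the averaged iterate w_bar.

    subgradient_oracle(w, rng) is assumed to return a random vector v with
    E[v | w] in the subdifferential of f at w and with norm at most rho.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    # The bound above suggests T >= B^2 * rho^2 / epsilon^2 iterations.
    T = math.ceil((B ** 2) * (rho ** 2) / (epsilon ** 2))
    eta = B / (rho * math.sqrt(T))      # fixed step size used in the analysis
    w = np.zeros(dim)                   # w^(1) = 0
    w_sum = np.zeros(dim)
    for _ in range(T):
        w_sum += w                      # accumulate w^(1), ..., w^(T)
        v = subgradient_oracle(w, rng)  # stochastic subgradient at w^(t)
        w = w - eta * v                 # w^(t+1) = w^(t) - eta * v_t
    return w_sum / T                    # w_bar = (1/T) * sum_t w^(t)

As in the analysis, no projection step is used; the guarantee only requires $\|w^\star\| \le B$ and $\|v_t\| \le \rho$.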
Proof. Let us introduce the notation $v_{1:t}$ to denote the sequence $v_1, \ldots, v_t$. Taking expectation of Equation (14.2), we obtain
$$\mathbb{E}_{v_{1:T}}\big[f(\bar{w}) - f(w^\star)\big] \le \mathbb{E}_{v_{1:T}}\left[\frac{1}{T}\sum_{t=1}^{T}\big(f(w^{(t)}) - f(w^\star)\big)\right].$$
Since Lemma 14.1 holds for any sequence $v_1, v_2, \ldots, v_T$, it applies to SGD as well. By taking expectation of the bound in the lemma we have
$$\mathbb{E}_{v_{1:T}}\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle w^{(t)} - w^\star, v_t\big\rangle\right] \le \frac{B\rho}{\sqrt{T}}. \tag{14.9}$$
It is left to show that
$$\mathbb{E}_{v_{1:T}}\left[\frac{1}{T}\sum_{t=1}^{T}\big(f(w^{(t)}) - f(w^\star)\big)\right] \le \mathbb{E}_{v_{1:T}}\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle w^{(t)} - w^\star, v_t\big\rangle\right], \tag{14.10}$$
which we will hereby prove.
Using the linearity of the expectation we have
$$\mathbb{E}_{v_{1:T}}\left[\frac{1}{T}\sum_{t=1}^{T}\big\langle w^{(t)} - w^\star, v_t\big\rangle\right] = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{v_{1:T}}\big[\big\langle w^{(t)} - w^\star, v_t\big\rangle\big].$$
Next, we recall the law of total expectation: for every two random variables $\alpha, \beta$ and a function $g$, $\mathbb{E}_{\alpha}[g(\alpha)] = \mathbb{E}_{\beta}\,\mathbb{E}_{\alpha}[g(\alpha)\,|\,\beta]$. Setting $\alpha = v_{1:t}$ and $\beta = v_{1:t-1}$ we get that
$$\mathbb{E}_{v_{1:T}}\big[\big\langle w^{(t)} - w^\star, v_t\big\rangle\big] = \mathbb{E}_{v_{1:t}}\big[\big\langle w^{(t)} - w^\star, v_t\big\rangle\big] = \mathbb{E}_{v_{1:t-1}}\,\mathbb{E}_{v_{1:t}}\big[\big\langle w^{(t)} - w^\star, v_t\big\rangle \,\big|\, v_{1:t-1}\big].$$
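The law of total expectation is the only probabilistic tool needed in this step. As a sanity check (a toy simulation with an arbitrary made-up pair $(\alpha, \beta)$, not part of the proof), the following Python snippet confirms numerically that $\mathbb{E}_{\alpha}[g(\alpha)]$ and $\mathbb{E}_{\beta}\,\mathbb{E}_{\alpha}[g(\alpha)\,|\,\beta]$ agree.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy pair (alpha, beta): beta is a fair coin flip and, given beta,
# alpha is Gaussian with mean beta and unit variance.
beta = rng.integers(0, 2, size=n)
alpha = beta + rng.standard_normal(n)
g = lambda a: a ** 2                   # any fixed function of alpha

lhs = g(alpha).mean()                  # Monte Carlo estimate of E[g(alpha)]
# E_beta[ E[g(alpha) | beta] ] = sum_b P(beta = b) * E[g(alpha) | beta = b]
rhs = sum(np.mean(beta == b) * g(alpha[beta == b]).mean() for b in (0, 1))
print(lhs, rhs)                        # both are approximately 1.5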
Once we know $v_{1:t-1}$, the value of $w^{(t)}$ is not random any more and therefore
$$\mathbb{E}_{v_{1:t-1}}\,\mathbb{E}_{v_{1:t}}\big[\big\langle w^{(t)} - w^\star, v_t\big\rangle \,\big|\, v_{1:t-1}\big] = \mathbb{E}_{v_{1:t-1}}\big[\big\langle w^{(t)} - w^\star, \mathbb{E}_{v_t}[v_t \,|\, v_{1:t-1}]\big\rangle\big].$$
Since $w^{(t)}$ only depends on $v_{1:t-1}$ and SGD requires that $\mathbb{E}_{v_t}[v_t \,|\, w^{(t)}] \in \partial f(w^{(t)})$, we obtain that $\mathbb{E}_{v_t}[v_t \,|\, v_{1:t-1}] \in \partial f(w^{(t)})$. Thus, by the definition of the subgradient,
$$\mathbb{E}_{v_{1:t-1}}\big[\big\langle w^{(t)} - w^\star, \mathbb{E}_{v_t}[v_t \,|\, v_{1:t-1}]\big\rangle\big] \ge \mathbb{E}_{v_{1:t-1}}\big[f(w^{(t)}) - f(w^\star)\big].$$
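The requirement $\mathbb{E}_{v_t}[v_t \,|\, w^{(t)}] \in \partial f(w^{(t)})$ is typically arranged by sampling. As an illustration (an assumed objective for this sketch, not an example taken from the text), let $f(w) = \frac{1}{m}\sum_{i=1}^{m} |\langle w, x_i\rangle - y_i|$; a subgradient of the loss on a single uniformly drawn example is then an unbiased estimate of a subgradient of $f$, with norm at most $\max_i \|x_i\|$, which can serve as $\rho$.

import numpy as np

def stochastic_subgradient(w, X, y, rng):
    """Unbiased stochastic subgradient for f(w) = (1/m) * sum_i |<w, x_i> - y_i|.

    Drawing i uniformly and returning a subgradient of |<w, x_i> - y_i| gives
    E[v | w] in the subdifferential of f at w, since the average of subgradients
    of the summands is a subgradient of the average.
    """
    m = X.shape[0]
    i = rng.integers(m)              # uniformly drawn example index
    residual = X[i] @ w - y[i]
    return np.sign(residual) * X[i]  # sign(0) = 0 is a valid subgradient choice

An oracle of this form can be plugged directly into a routine such as the sketch given after the statement of the bound above.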
Overall, we have shown that
$$\mathbb{E}_{v_{1:T}}\big[\big\langle w^{(t)} - w^\star, v_t\big\rangle\big] \ge \mathbb{E}_{v_{1:t-1}}\big[f(w^{(t)}) - f(w^\star)\big] = \mathbb{E}_{v_{1:T}}\big[f(w^{(t)}) - f(w^\star)\big].$$
Summing this inequality over $t$, dividing by $T$, and using the linearity of expectation yields Equation (14.10), which concludes our proof.