Page 175 - Understanding Machine Learning

P. 175

14.3 Stochastic Gradient Descent (SGD) 157

Figure 14.3. An illustration of the gradient descent algorithm (left) and the stochastic
2
2
gradient descent algorithm (right). The function to be minimized is 1.25(x +6) +(y −8) .
For the stochastic case, the solid line depicts the averaged value of w.

Stochastic Gradient Descent (SGD) for minimizing f (w)
parameters: Scalar η> 0, integer T > 0
initialize: w (1) = 0
for t = 1,2,...,T
(t)
(t)
choose v t at random from a distribution such that E[v t |w ] ∈ ∂ f (w )
update w (t+1) = w (t) − ηv t
1 T (t)
output ¯ w = w
T t=1
An illustration of stochastic gradient descent versus gradient descent is given
in Figure 14.3. As wewill seeinSection 14.5, in the context of learning problems,
it is easy to ﬁnd a random vector whose expectation is a subgradient of the risk
function.

14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions
Recall the bound we achieved for the GD algorithm in Corollary 14.2.For the
(t)
stochastic case, in which only the expectation of v t is in ∂ f (w ), we cannot directly
apply Equation (14.3). However, since the expected value of v t is a subgradient of
(t)
f at w , we can still derive a similar bound on the expected output of stochastic
gradient descent. This is formalized in the following theorem.

Theorem 14.8. Let B,ρ > 0.Let f be a convex function and let w ∈
2
argmin f (w). Assume that SGD is run for T iterations with η = B .
w: w ≤B ρ T
2
Assume also that for all t, v t ≤ ρ with probability 1. Then,
B ρ

E[ f ( ¯ w)] − f (w ) ≤ √ .
T

170 171 172 173 174 175 176 177 178 179 180