Figure 14.1. An illustration of the gradient descent algorithm. The function to be minimized is $1.25(x_1 + 6)^2 + (x_2 - 8)^2$.
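As a concrete illustration (not part of the original text), the following minimal Python sketch runs gradient descent on the figure's objective; the step size and the number of iterations are illustrative assumptions, not values prescribed here.

```python
import numpy as np

# Gradient descent on the figure's objective
# f(x) = 1.25*(x1 + 6)^2 + (x2 - 8)^2, whose minimizer is (-6, 8).
# The step size eta and the iteration count are illustrative choices.

def f(x):
    return 1.25 * (x[0] + 6) ** 2 + (x[1] - 8) ** 2

def grad_f(x):
    return np.array([2.5 * (x[0] + 6), 2.0 * (x[1] - 8)])

eta = 0.1                       # fixed step size (assumption)
w = np.zeros(2)                 # initialization w^(1) = 0
iterates = [w.copy()]
for t in range(100):
    w = w - eta * grad_f(w)     # w^(t+1) = w^(t) - eta * grad f(w^(t))
    iterates.append(w.copy())

w_bar = np.mean(iterates, axis=0)   # averaged iterate, as used in the analysis below
print(w, f(w))                      # w approaches (-6, 8), f(w) approaches 0
```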
14.1.1 Analysis of GD for Convex-Lipschitz Functions
To analyze the convergence rate of the GD algorithm, we limit ourselves to the case
of convex-Lipschitz functions (as we have seen, many problems lend themselves
easily to this setting). Let $\mathbf{w}^\star$ be any vector and let $B$ be an upper bound on $\|\mathbf{w}^\star\|$. It is convenient to think of $\mathbf{w}^\star$ as the minimizer of $f(\mathbf{w})$, but the analysis that follows holds for every $\mathbf{w}^\star$.
We would like to obtain an upper bound on the suboptimality of our solution with respect to $\mathbf{w}^\star$, namely, $f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)$, where $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}$. From the definition of $\bar{\mathbf{w}}$, and using Jensen's inequality, we have that
\[
\begin{aligned}
f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star)
 &= f\!\left(\frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}\right) - f(\mathbf{w}^\star) \\
 &\le \frac{1}{T} \sum_{t=1}^{T} f\!\left(\mathbf{w}^{(t)}\right) - f(\mathbf{w}^\star) \\
 &= \frac{1}{T} \sum_{t=1}^{T} \left( f\!\left(\mathbf{w}^{(t)}\right) - f(\mathbf{w}^\star) \right).
\end{aligned}
\tag{14.2}
\]
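A quick numerical sanity check of the Jensen step (an added illustration; the convex quadratic and the random points standing in for the iterates are assumptions):

```python
import numpy as np

# Sanity check of the Jensen step: for a convex f,
# f(average of points) <= average of f(points).
# The quadratic below and the random points are illustrative choices.

def f(x):
    return 1.25 * (x[0] + 6) ** 2 + (x[1] - 8) ** 2

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))          # stand-ins for the iterates w^(1..T)
lhs = f(points.mean(axis=0))               # f(w_bar)
rhs = np.mean([f(w) for w in points])      # (1/T) * sum_t f(w^(t))
assert lhs <= rhs
print(lhs, rhs)
```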
For every $t$, because of the convexity of $f$, we have that
\[
f\!\left(\mathbf{w}^{(t)}\right) - f(\mathbf{w}^\star) \le \left\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\ \nabla f\!\left(\mathbf{w}^{(t)}\right) \right\rangle . \tag{14.3}
\]
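The following short check (again an added illustration, using the figure's quadratic as an assumed convex $f$) verifies this first-order convexity inequality at randomly drawn pairs of points:

```python
import numpy as np

# Check of the first-order convexity inequality (14.3):
# f(u) - f(w) <= <u - w, grad f(u)> for a convex f.
# The quadratic below and the random points are illustrative choices.

def f(x):
    return 1.25 * (x[0] + 6) ** 2 + (x[1] - 8) ** 2

def grad_f(x):
    return np.array([2.5 * (x[0] + 6), 2.0 * (x[1] - 8)])

rng = np.random.default_rng(1)
for _ in range(1000):
    u, w = rng.normal(size=2), rng.normal(size=2)
    assert f(u) - f(w) <= np.dot(u - w, grad_f(u)) + 1e-9
```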
Combining the preceding we obtain
\[
f(\bar{\mathbf{w}}) - f(\mathbf{w}^\star) \le \frac{1}{T} \sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}^\star,\ \nabla f\!\left(\mathbf{w}^{(t)}\right) \right\rangle .
\]
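This combined bound can also be checked numerically along a gradient descent run (an added illustration; the objective, the step size, and the horizon below are assumptions):

```python
import numpy as np

# Numerical illustration of the combined bound along a GD trajectory:
# f(w_bar) - f(w_star) <= (1/T) * sum_t <w^(t) - w_star, grad f(w^(t))>.
# The objective, step size eta, and horizon T are illustrative choices.

def f(x):
    return 1.25 * (x[0] + 6) ** 2 + (x[1] - 8) ** 2

def grad_f(x):
    return np.array([2.5 * (x[0] + 6), 2.0 * (x[1] - 8)])

w_star = np.array([-6.0, 8.0])   # minimizer of f, playing the role of w^star
eta, T = 0.1, 50
w = np.zeros(2)                  # w^(1) = 0
iterates = []
for t in range(T):
    iterates.append(w.copy())
    w = w - eta * grad_f(w)

w_bar = np.mean(iterates, axis=0)
lhs = f(w_bar) - f(w_star)
rhs = np.mean([np.dot(wt - w_star, grad_f(wt)) for wt in iterates])
assert lhs <= rhs + 1e-9
print(lhs, rhs)
```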
To bound the right-hand side we rely on the following lemma:
Lemma 14.1. Let $\mathbf{v}_1, \ldots, \mathbf{v}_T$ be an arbitrary sequence of vectors. Any algorithm with an initialization $\mathbf{w}^{(1)} = \mathbf{0}$ and an update rule of the form
\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t \tag{14.4}
\]