
To analyze SGD for convex-smooth problems, let us define $z_1,\ldots,z_T$ to be the random samples of the SGD algorithm, let $f_t(\cdot) = \ell(\cdot, z_t)$, and note that $v_t = \nabla f_t(w^{(t)})$. For all $t$, $f_t$ is a convex function and therefore $f_t(w^{(t)}) - f_t(w^\star) \le \langle v_t, w^{(t)} - w^\star \rangle$. Summing over $t$ and using Lemma 14.1 we obtain
$$\sum_{t=1}^{T} \left( f_t(w^{(t)}) - f_t(w^\star) \right) \;\le\; \sum_{t=1}^{T} \langle v_t, w^{(t)} - w^\star \rangle \;\le\; \frac{\|w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 .$$
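For convenience, recall the content of Lemma 14.1 used in the second inequality above (as stated earlier in this chapter): any algorithm initialized at $w^{(1)} = 0$ with an update rule of the form $w^{(t+1)} = w^{(t)} - \eta v_t$ satisfies, for every $w^\star$,
$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \;\le\; \frac{\|w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 .$$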
Combining the preceding with the self-boundedness of $f_t$ (recall that a nonnegative $\beta$-smooth function satisfies $\|\nabla f(w)\|^2 \le 2\beta f(w)$) yields
$$\sum_{t=1}^{T} \left( f_t(w^{(t)}) - f_t(w^\star) \right) \;\le\; \frac{\|w^\star\|^2}{2\eta} + \eta\beta \sum_{t=1}^{T} f_t(w^{(t)}) .$$
Dividing by $T$ and rearranging, we obtain
$$\frac{1}{T}\sum_{t=1}^{T} f_t(w^{(t)}) \;\le\; \frac{1}{1-\eta\beta}\left( \frac{1}{T}\sum_{t=1}^{T} f_t(w^\star) + \frac{\|w^\star\|^2}{2\eta T} \right) .$$
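In more detail, moving the $\eta\beta\sum_t f_t(w^{(t)})$ term to the left-hand side of the previous inequality gives
$$(1-\eta\beta)\sum_{t=1}^{T} f_t(w^{(t)}) \;\le\; \sum_{t=1}^{T} f_t(w^\star) + \frac{\|w^\star\|^2}{2\eta} ,$$
and dividing by $(1-\eta\beta)T$ (which requires $\eta < 1/\beta$, as is indeed the case for the step size chosen in Corollary 14.14 below) yields the display above.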
Next, we take the expectation of both sides of the preceding inequality with respect to $z_1,\ldots,z_T$. Clearly, $\mathbb{E}[f_t(w^\star)] = L_{\mathcal{D}}(w^\star)$. In addition, using the same argument as in the proof of Theorem 14.8 we have that
$$\mathbb{E}\left[ \frac{1}{T}\sum_{t=1}^{T} f_t(w^{(t)}) \right] = \mathbb{E}\left[ \frac{1}{T}\sum_{t=1}^{T} L_{\mathcal{D}}(w^{(t)}) \right] \;\ge\; \mathbb{E}\left[ L_{\mathcal{D}}(\bar{w}) \right] .$$
                 Combining all we conclude our proof.
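Spelling the combination out, the two preceding displays together give, for every $w^\star$,
$$\mathbb{E}\left[ L_{\mathcal{D}}(\bar{w}) \right] \;\le\; \frac{1}{1-\eta\beta}\left( L_{\mathcal{D}}(w^\star) + \frac{\|w^\star\|^2}{2\eta T} \right) ,$$
which is the guarantee from which the corollary below is derived.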
                    As a direct corollary we obtain:
Corollary 14.14. Consider a convex-smooth-bounded learning problem with parameters $\beta$, $B$. Assume in addition that $\ell(0,z) \le 1$ for all $z \in Z$. For every $\epsilon > 0$, set $\eta = \frac{1}{\beta(1+3/\epsilon)}$. Then, running SGD with $T \ge 12 B^2 \beta / \epsilon^2$ yields
$$\mathbb{E}\left[ L_{\mathcal{D}}(\bar{w}) \right] \;\le\; \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon .$$
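As a quick numerical illustration (the numbers are chosen here only for concreteness), take $\beta = 1$, $B = 1$, and $\epsilon = 0.1$. Then the corollary prescribes
$$\eta = \frac{1}{1 \cdot (1 + 3/0.1)} = \frac{1}{31} \approx 0.032 , \qquad T \;\ge\; \frac{12 \cdot 1^2 \cdot 1}{0.1^2} = 1200 ,$$
so roughly 1200 stochastic gradient steps suffice to be within $0.1$ of the best predictor in $\mathcal{H}$, in expectation.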


                 14.5.3 SGD for Regularized Loss Minimization

We have shown that SGD enjoys the same worst-case sample complexity bound as regularized loss minimization. However, on some distributions, regularized loss minimization may yield a better solution. Therefore, in some cases we may want to solve the optimization problem associated with regularized loss minimization, namely,$^1$
$$\min_{w} \left( \frac{\lambda}{2}\|w\|^2 + L_S(w) \right) . \qquad (14.14)$$
                 Since we are dealing with convex learning problems in which the loss function is
                 convex, the preceding problem is also a convex optimization problem that can be
                 solved using SGD as well, as we shall see in this section.

$^1$ We divided $\lambda$ by 2 for convenience.
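Although the analysis that follows develops a specific SGD variant for (14.14), a minimal sketch of the basic idea, applying plain SGD with a fixed step size to the regularized objective, may help fix intuition. The names below (sgd_regularized, grad_loss, eta) are illustrative choices rather than the book's notation, and the fixed step size is a simplification; the book's own treatment of (14.14) may use a different step-size schedule.

import numpy as np

def sgd_regularized(grad_loss, samples, lam, eta, T, dim, rng=None):
    """Plain SGD on the objective  min_w  (lam/2)*||w||^2 + L_S(w).

    At each step one example z is drawn uniformly from the sample S and
    the update uses the stochastic gradient  lam*w + grad_loss(w, z).
    The averaged iterate is returned, as in the SGD variant analyzed in
    this chapter.  (Fixed step size eta; this is a simplified sketch.)
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for _ in range(T):
        z = samples[rng.integers(len(samples))]   # draw z uniformly from S
        v = lam * w + grad_loss(w, z)             # stochastic gradient of the regularized loss
        w = w - eta * v                           # gradient step
        w_sum += w
    return w_sum / T                              # averaged output w_bar

# Example with the squared loss  l(w,(x,y)) = 0.5*(<w,x> - y)^2,
# whose gradient with respect to w is (<w,x> - y) * x.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
    S = list(zip(X, y))
    grad = lambda w, z: (w @ z[0] - z[1]) * z[0]
    w_bar = sgd_regularized(grad, S, lam=0.1, eta=0.01, T=5000, dim=5, rng=rng)
    print(w_bar)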