Page 178 - Understanding Machine Learning
Stochastic Gradient Descent
Therefore,
\[
\|w - u\|^2 = \|(w - v) + (v - u)\|^2
= \|w - v\|^2 + \|v - u\|^2 + 2\langle v - w,\, u - v\rangle
\ge \|v - u\|^2 .
\]
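As a quick numerical sanity check (not part of the text), the chain of (in)equalities above can be verified for projection onto the Euclidean unit ball, whose projection operator has the closed form v = w / max(1, ||w||); the radius and the sample points below are arbitrary illustrative choices:

```python
import numpy as np

def project_ball(w, r=1.0):
    """Euclidean projection of w onto the ball of radius r."""
    norm = np.linalg.norm(w)
    return w if norm <= r else (r / norm) * w

rng = np.random.default_rng(0)
w = 3.0 * rng.normal(size=5)            # a point (likely) outside the ball
v = project_ball(w)                      # its projection onto the ball
u = rng.normal(size=5)
u = 0.9 * u / np.linalg.norm(u)          # an arbitrary point inside the ball

# The conclusion of the lemma: for every u in the set, ||w - u|| >= ||v - u||.
assert np.linalg.norm(w - u) >= np.linalg.norm(v - u)
```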
Equipped with the preceding lemma, we can easily adapt the analysis of SGD to
the case in which we add projection steps on a closed and convex set. Simply note
that for every t,
\[
\|w^{(t+1)} - w^\star\|^2 - \|w^{(t)} - w^\star\|^2
= \|w^{(t+1)} - w^\star\|^2 - \|w^{(t+\frac{1}{2})} - w^\star\|^2
+ \|w^{(t+\frac{1}{2})} - w^\star\|^2 - \|w^{(t)} - w^\star\|^2
\le \|w^{(t+\frac{1}{2})} - w^\star\|^2 - \|w^{(t)} - w^\star\|^2 ,
\]
where the inequality follows from the preceding lemma, since $w^{(t+1)}$ is the projection of $w^{(t+\frac{1}{2})}$ onto the set and $w^\star$ belongs to the set.
Therefore, Lemma 14.1 holds when we add projection steps and hence the rest of
the analysis follows directly.
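The projected update described above can be sketched in a few lines; this is an illustrative implementation, not the book's code, and the example objective, radius, and step size below are placeholders:

```python
import numpy as np

def projected_sgd(grad, w0, eta, T, project):
    """SGD with a projection step: take w^(t+1/2) = w^(t) - eta * v_t,
    then set w^(t+1) to its projection onto the set H (via `project`)."""
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for t in range(T):
        v = grad(w)                    # (sub)gradient estimate v_t
        w = project(w - eta * v)       # gradient step, then projection
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)   # output the averaged iterate

# Placeholder problem: minimize f(w) = ||w - c||^2 over the unit ball.
c = np.array([2.0, 0.0])
grad = lambda w: 2.0 * (w - c)
project = lambda w: w / max(1.0, np.linalg.norm(w))
w_bar = projected_sgd(grad, np.zeros(2), eta=0.1, T=200, project=project)
```

Because every iterate is projected back onto the ball, the averaged output also lies in the ball (a convex set contains averages of its points).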
14.4.2 Variable Step Size
Another variant of SGD is decreasing the step size as a function of t. That is, rather than updating with a constant η, we use η_t. For instance, we can set $\eta_t = \frac{B}{\rho\sqrt{t}}$ and achieve a bound similar to Theorem 14.8. The idea is that when we are closer to the
minimum of the function, we take our steps more carefully, so as not to “overshoot”
the minimum.
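A minimal sketch of this variant (the values of B and ρ and the quadratic objective below are illustrative placeholders, not computed from the problem): the only change from fixed-step SGD is that the step size shrinks with t.

```python
import numpy as np

def sgd_decreasing_step(grad, w0, B, rho, T):
    """SGD with the variable step size eta_t = B / (rho * sqrt(t))."""
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        eta_t = B / (rho * np.sqrt(t))   # step size shrinks as t grows
        w = w - eta_t * grad(w)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)     # averaged output, as before

# Placeholder objective: f(w) = ||w||^2, minimized at the origin.
w_bar = sgd_decreasing_step(lambda w: 2.0 * w, np.array([4.0, -4.0]),
                            B=1.0, rho=2.0, T=500)
```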
14.4.3 Other Averaging Techniques
We have set the output vector to be $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w^{(t)}$. There are alternative approaches, such as outputting $w^{(t)}$ for some random $t \in [T]$, or outputting the average of $w^{(t)}$ over the last $\alpha T$ iterations, for some $\alpha \in (0,1)$. One can also take
a weighted average of the last few iterates. These more sophisticated averaging
schemes can improve the convergence speed in some situations, such as in the case
of strongly convex functions defined in the following.
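The two alternative output rules mentioned above are straightforward to write down; a sketch, with an illustrative sequence of iterates standing in for the output of SGD:

```python
import numpy as np

def suffix_average(iterates, alpha):
    """Average of w^(t) over the last alpha*T iterations, alpha in (0, 1)."""
    T = len(iterates)
    k = max(1, int(np.ceil(alpha * T)))   # number of trailing iterates kept
    return np.mean(iterates[-k:], axis=0)

def random_iterate(iterates, rng):
    """Output w^(t) for a uniformly random t in [T]."""
    return iterates[rng.integers(len(iterates))]

# Illustrative iterates that approach the minimizer (the origin).
iterates = [np.array([1.0 / (t + 1), 0.0]) for t in range(10)]
w_suffix = suffix_average(iterates, alpha=0.5)   # mean of the last 5 iterates
w_rand = random_iterate(iterates, np.random.default_rng(0))
```

Since the later iterates are closer to the minimizer here, the suffix average lands closer to it than the full average would.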
14.4.4 Strongly Convex Functions*
In this section we show a variant of SGD that enjoys a faster convergence rate for
problems in which the objective function is strongly convex (see Definition 13.4 of
strong convexity in the previous chapter). We rely on the following claim, which
generalizes Lemma 13.5.
Claim 14.10. If f is λ-strongly convex, then for every w, u and v ∈ ∂ f(w) we have
\[
\langle w - u,\, v \rangle \ge f(w) - f(u) + \frac{\lambda}{2}\|w - u\|^2 .
\]
The proof is similar to the proof of Lemma 13.5 and is left as an exercise.
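While the proof is left as an exercise, the claim is easy to check numerically for a function whose subgradient is available in closed form; a sketch for the λ-strongly convex quadratic $f(w) = \frac{\lambda}{2}\|w\|^2 + \langle b, w\rangle$ (my choice of example, for which ∂f(w) = {λw + b} and the inequality in fact holds with equality):

```python
import numpy as np

lam = 0.7
b = np.array([1.0, -2.0, 0.5])
f = lambda w: 0.5 * lam * np.dot(w, w) + np.dot(b, w)
grad = lambda w: lam * w + b             # the unique subgradient of f at w

rng = np.random.default_rng(1)
w, u = rng.normal(size=3), rng.normal(size=3)
v = grad(w)

lhs = np.dot(w - u, v)
rhs = f(w) - f(u) + 0.5 * lam * np.dot(w - u, w - u)
assert lhs >= rhs - 1e-9                 # Claim 14.10 holds at these points
```

For this quadratic the slack is exactly zero; for a general λ-strongly convex f the left-hand side may strictly exceed the right-hand side.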