Page 171 - Understanding Machine Learning

14.1 Gradient Descent  153


satisfies
$$\sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle \le \frac{\|\mathbf{w}\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 . \tag{14.5}$$
In particular, for every $B, \rho > 0$, if for all $t$ we have $\|\mathbf{v}_t\| \le \rho$ and if we set $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then for every $\mathbf{w}$ with $\|\mathbf{w}\| \le B$ we have
$$\frac{1}{T} \sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle \le \frac{B \rho}{\sqrt{T}} .$$
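Since the lemma holds for an *arbitrary* sequence of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_T$, inequality (14.5) can be checked numerically. The following sketch (an illustration, not part of the book) runs the update rule $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$ from $\mathbf{w}^{(1)} = \mathbf{0}$ on random vectors and compares both sides of (14.5) for a random competitor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 5, 100, 0.1

# An arbitrary sequence of vectors v_1, ..., v_T.
vs = rng.normal(size=(T, d))

# Run the update rule w^(t+1) = w^(t) - eta * v_t, starting from w^(1) = 0,
# recording each iterate w^(t) before it is updated.
w = np.zeros(d)
iterates = []
for v in vs:
    iterates.append(w.copy())
    w = w - eta * v

# Compare both sides of inequality (14.5) for a random competitor w_star.
w_star = rng.normal(size=d)
lhs = sum(np.dot(wt - w_star, v) for wt, v in zip(iterates, vs))
rhs = np.dot(w_star, w_star) / (2 * eta) + (eta / 2) * sum(np.dot(v, v) for v in vs)
assert lhs <= rhs
```

The inequality holds for every choice of `vs`, `eta`, and `w_star`, because the proof only drops the nonnegative term $\|\mathbf{w}^{(T+1)} - \mathbf{w}\|^2$ from an exact identity.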
Proof. Using algebraic manipulations (completing the square), we obtain:
$$\begin{aligned}
\left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle
&= \frac{1}{\eta} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \eta \mathbf{v}_t \right\rangle \\
&= \frac{1}{2\eta} \left( -\left\| \mathbf{w}^{(t)} - \mathbf{w} - \eta \mathbf{v}_t \right\|^2 + \left\| \mathbf{w}^{(t)} - \mathbf{w} \right\|^2 + \eta^2 \|\mathbf{v}_t\|^2 \right) \\
&= \frac{1}{2\eta} \left( -\left\| \mathbf{w}^{(t+1)} - \mathbf{w} \right\|^2 + \left\| \mathbf{w}^{(t)} - \mathbf{w} \right\|^2 \right) + \frac{\eta}{2} \|\mathbf{v}_t\|^2 ,
\end{aligned}$$
where the last equality follows from the definition of the update rule. Summing the equality over $t$, we have
$$\sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle = \frac{1}{2\eta} \sum_{t=1}^{T} \left( -\left\| \mathbf{w}^{(t+1)} - \mathbf{w} \right\|^2 + \left\| \mathbf{w}^{(t)} - \mathbf{w} \right\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 . \tag{14.6}$$
The first sum on the right-hand side is a telescopic sum that collapses to
$$\left\| \mathbf{w}^{(1)} - \mathbf{w} \right\|^2 - \left\| \mathbf{w}^{(T+1)} - \mathbf{w} \right\|^2 .$$
Plugging this into Equation (14.6), we have
$$\begin{aligned}
\sum_{t=1}^{T} \left\langle \mathbf{w}^{(t)} - \mathbf{w}, \mathbf{v}_t \right\rangle
&= \frac{1}{2\eta} \left( \left\| \mathbf{w}^{(1)} - \mathbf{w} \right\|^2 - \left\| \mathbf{w}^{(T+1)} - \mathbf{w} \right\|^2 \right) + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 \\
&\le \frac{1}{2\eta} \left\| \mathbf{w}^{(1)} - \mathbf{w} \right\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 \\
&= \frac{1}{2\eta} \|\mathbf{w}\|^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|\mathbf{v}_t\|^2 ,
\end{aligned}$$
where the last equality is due to the definition $\mathbf{w}^{(1)} = \mathbf{0}$. This proves the first part of the lemma (Equation (14.5)). The second part follows by upper bounding $\|\mathbf{w}\|$ by $B$ and $\|\mathbf{v}_t\|$ by $\rho$, dividing by $T$, and plugging in the value of $\eta$.
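The second part of the lemma can likewise be illustrated numerically. This sketch (an illustration, not part of the book) uses vectors of norm exactly $\rho$, a competitor of norm exactly $B$, and the step size $\eta = \sqrt{B^2/(\rho^2 T)}$ from the lemma, then checks that the average is at most $B\rho/\sqrt{T}$:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
d, T, B, rho = 5, 400, 2.0, 1.0
eta = math.sqrt(B**2 / (rho**2 * T))  # the step size chosen in the lemma

# Vectors with norm exactly rho (so ||v_t|| <= rho holds).
vs = rng.normal(size=(T, d))
vs *= rho / np.linalg.norm(vs, axis=1, keepdims=True)

# A competitor with norm exactly B (so ||w|| <= B holds).
w_star = rng.normal(size=d)
w_star *= B / np.linalg.norm(w_star)

# Accumulate sum_t <w^(t) - w, v_t> along the update rule, from w^(1) = 0.
w = np.zeros(d)
total = 0.0
for v in vs:
    total += np.dot(w - w_star, v)
    w = w - eta * v

# The averaged bound from the second part of the lemma.
assert total / T <= B * rho / math.sqrt(T)
```

With this $\eta$, the two terms of the right-hand side of (14.5) are balanced ($\frac{B^2}{2\eta} = \frac{\eta \rho^2 T}{2} = \frac{B\rho\sqrt{T}}{2}$), which is exactly what yields the $B\rho/\sqrt{T}$ rate after dividing by $T$.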

Lemma 14.1 applies to the GD algorithm with $\mathbf{v}_t = \nabla f(\mathbf{w}^{(t)})$. As we will show later in Lemma 14.7, if $f$ is $\rho$-Lipschitz, then $\|\nabla f(\mathbf{w}^{(t)})\| \le \rho$. We therefore satisfy