satisfies
\[
\sum_{t=1}^{T}\bigl\langle w^{(t)}-w^{\star},\,v_t\bigr\rangle \;\le\; \frac{\|w^{\star}\|^{2}}{2\eta}+\frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|^{2}. \tag{14.5}
\]
In particular, for every $B,\rho>0$, if for all $t$ we have that $\|v_t\|\le\rho$ and if we set $\eta=\sqrt{\frac{B^{2}}{\rho^{2}T}}$, then for every $w^{\star}$ with $\|w^{\star}\|\le B$ we have
\[
\frac{1}{T}\sum_{t=1}^{T}\bigl\langle w^{(t)}-w^{\star},\,v_t\bigr\rangle \;\le\; \frac{B\rho}{\sqrt{T}}.
\]
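As an aside, the bound in Equation (14.5) is easy to check numerically. The following short Python sketch (not from the text; the dimensions, step size, and random test data are illustrative only) runs the generic update rule $w^{(t+1)}=w^{(t)}-\eta v_t$ on an arbitrary sequence of vectors and compares both sides of the inequality.

    import numpy as np

    # Illustrative numerical check of Lemma 14.1 (sketch; parameters are arbitrary).
    rng = np.random.default_rng(0)
    d, T = 5, 1000
    eta = 0.01
    w_star = rng.normal(size=d)        # comparison vector w* (any vector works)
    vs = rng.normal(size=(T, d))       # arbitrary vectors v_1, ..., v_T

    w = np.zeros(d)                    # initialization w^(1) = 0
    lhs = 0.0
    for v in vs:
        lhs += np.dot(w - w_star, v)   # accumulate <w^(t) - w*, v_t>
        w = w - eta * v                # update rule w^(t+1) = w^(t) - eta * v_t

    rhs = np.dot(w_star, w_star) / (2 * eta) + (eta / 2) * np.sum(vs ** 2)
    print(lhs <= rhs)                  # Equation (14.5) predicts: True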
Proof. Using algebraic manipulations (completing the square), we obtain:
\begin{align*}
\bigl\langle w^{(t)}-w^{\star},\,v_t\bigr\rangle
  &= \frac{1}{\eta}\,\bigl\langle w^{(t)}-w^{\star},\,\eta v_t\bigr\rangle\\
  &= \frac{1}{2\eta}\left(-\bigl\|w^{(t)}-w^{\star}-\eta v_t\bigr\|^{2}+\bigl\|w^{(t)}-w^{\star}\bigr\|^{2}+\eta^{2}\|v_t\|^{2}\right)\\
  &= \frac{1}{2\eta}\left(-\bigl\|w^{(t+1)}-w^{\star}\bigr\|^{2}+\bigl\|w^{(t)}-w^{\star}\bigr\|^{2}\right)+\frac{\eta}{2}\|v_t\|^{2},
\end{align*}
where the last equality follows from the definition of the update rule. Summing the equality over $t$, we have
\[
\sum_{t=1}^{T}\bigl\langle w^{(t)}-w^{\star},\,v_t\bigr\rangle
 = \frac{1}{2\eta}\sum_{t=1}^{T}\left(-\bigl\|w^{(t+1)}-w^{\star}\bigr\|^{2}+\bigl\|w^{(t)}-w^{\star}\bigr\|^{2}\right)
 + \frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|^{2}. \tag{14.6}
\]
The first sum on the right-hand side is a telescopic sum that collapses to
\[
\bigl\|w^{(1)}-w^{\star}\bigr\|^{2}-\bigl\|w^{(T+1)}-w^{\star}\bigr\|^{2}.
\]
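In more detail, each intermediate term $\|w^{(t)}-w^{\star}\|^{2}$ with $2\le t\le T$ appears once with a plus sign and once with a minus sign, so
\[
\sum_{t=1}^{T}\left(-\bigl\|w^{(t+1)}-w^{\star}\bigr\|^{2}+\bigl\|w^{(t)}-w^{\star}\bigr\|^{2}\right)
 = \bigl\|w^{(1)}-w^{\star}\bigr\|^{2}-\bigl\|w^{(T+1)}-w^{\star}\bigr\|^{2}.
\]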
Plugging this in Equation (14.6), we have
\begin{align*}
\sum_{t=1}^{T}\bigl\langle w^{(t)}-w^{\star},\,v_t\bigr\rangle
  &= \frac{1}{2\eta}\left(\bigl\|w^{(1)}-w^{\star}\bigr\|^{2}-\bigl\|w^{(T+1)}-w^{\star}\bigr\|^{2}\right)+\frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|^{2}\\
  &\le \frac{1}{2\eta}\,\bigl\|w^{(1)}-w^{\star}\bigr\|^{2}+\frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|^{2}\\
  &= \frac{1}{2\eta}\,\|w^{\star}\|^{2}+\frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|^{2},
\end{align*}
where the last equality is due to the definition $w^{(1)}=0$. This proves the first part of the lemma (Equation (14.5)). The second part follows by upper bounding $\|w^{\star}\|$ by $B$, $\|v_t\|$ by $\rho$, dividing by $T$, and plugging in the value of $\eta$.
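Spelled out, substituting $\eta=\sqrt{B^{2}/(\rho^{2}T)}=B/(\rho\sqrt{T})$ into the right-hand side of Equation (14.5) gives
\begin{align*}
\frac{1}{T}\sum_{t=1}^{T}\bigl\langle w^{(t)}-w^{\star},\,v_t\bigr\rangle
 &\le \frac{1}{T}\left(\frac{B^{2}}{2\eta}+\frac{\eta}{2}\,\rho^{2}T\right)
  = \frac{1}{T}\left(\frac{B\rho\sqrt{T}}{2}+\frac{B\rho\sqrt{T}}{2}\right)
  = \frac{B\rho}{\sqrt{T}}.
\end{align*}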
Lemma 14.1 applies to the GD algorithm with $v_t=\nabla f\bigl(w^{(t)}\bigr)$. As we will show later in Lemma 14.7, if $f$ is $\rho$-Lipschitz, then $\bigl\|\nabla f\bigl(w^{(t)}\bigr)\bigr\|\le\rho$. We therefore satisfy