Page 183 - Understanding Machine Learning


Define $f(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + L_S(\mathbf{w})$. Note that $f$ is a $\lambda$-strongly convex function; therefore, we can apply the SGD variant given in Section 14.4.4 (with $\mathcal{H} = \mathbb{R}^d$). To apply this algorithm, we only need to find a way to construct an unbiased estimate of a subgradient of $f$ at $\mathbf{w}^{(t)}$. This is easily done by noting that if we pick $z$ uniformly at random from $S$, and choose $\mathbf{v}_t \in \partial \ell(\mathbf{w}^{(t)}, z)$, then the expected value of $\lambda \mathbf{w}^{(t)} + \mathbf{v}_t$ is a subgradient of $f$ at $\mathbf{w}^{(t)}$.
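As a concrete illustration, this procedure can be sketched in Python. The function name `sgd_regularized` is ours, and the choice of the hinge loss for $\partial \ell$ is an illustrative assumption, not fixed by the text:

```python
import numpy as np

def sgd_regularized(S, lam, T, subgrad_loss, d):
    """SGD for f(w) = (lam/2)*||w||^2 + L_S(w), with H = R^d
    (so no projection step is needed), using step size 1/(lam*t).

    subgrad_loss(w, z) must return some v in the subdifferential
    of the loss at (w, z)."""
    rng = np.random.default_rng(0)
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        z = S[rng.integers(len(S))]   # z drawn uniformly at random from S
        v = subgrad_loss(w, z)        # v_t, a subgradient of the loss at w
        g = lam * w + v               # unbiased estimate of a subgradient of f
        w = w - g / (lam * t)         # step size eta_t = 1/(lam*t)
        w_sum += w
    return w_sum / T                  # the averaged output, w-bar

# Hinge loss subgradient as an illustrative choice of the loss:
def hinge_subgrad(w, z):
    x, y = z
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)
```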
To analyze the resulting algorithm, we first rewrite the update rule (assuming that $\mathcal{H} = \mathbb{R}^d$ and therefore the projection step does not matter) as follows:
$$
\begin{aligned}
\mathbf{w}^{(t+1)} &= \mathbf{w}^{(t)} - \frac{1}{\lambda t}\left(\lambda \mathbf{w}^{(t)} + \mathbf{v}_t\right) \\
&= \left(1 - \frac{1}{t}\right)\mathbf{w}^{(t)} - \frac{1}{\lambda t}\,\mathbf{v}_t \\
&= \frac{t-1}{t}\,\mathbf{w}^{(t)} - \frac{1}{\lambda t}\,\mathbf{v}_t \\
&= \frac{t-1}{t}\left(\frac{t-2}{t-1}\,\mathbf{w}^{(t-1)} - \frac{1}{\lambda (t-1)}\,\mathbf{v}_{t-1}\right) - \frac{1}{\lambda t}\,\mathbf{v}_t \\
&\;\;\vdots \\
&= -\frac{1}{\lambda t}\sum_{i=1}^{t} \mathbf{v}_i. \qquad (14.15)
\end{aligned}
$$
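The telescoping identity in Equation (14.15) is purely algebraic, so it can be checked numerically for arbitrary vectors $\mathbf{v}_t$; a minimal sketch:

```python
import numpy as np

# Check Equation (14.15): iterating
#   w(t+1) = w(t) - (1/(lam*t)) * (lam*w(t) + v_t)
# from w(1) = 0 gives w(T+1) = -(1/(lam*T)) * sum of v_1..v_T.
# The v_t here are arbitrary fixed vectors; in the algorithm they
# would be loss subgradients, but the identity holds regardless.
rng = np.random.default_rng(1)
lam, T, d = 0.5, 50, 3
vs = rng.standard_normal((T, d))

w = np.zeros(d)
for t in range(1, T + 1):
    w = w - (lam * w + vs[t - 1]) / (lam * t)

closed_form = -vs.sum(axis=0) / (lam * T)
assert np.allclose(w, closed_form)
```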
If we assume that the loss function is $\rho$-Lipschitz, it follows that for all $t$ we have $\|\mathbf{v}_t\| \le \rho$, and therefore (by Equation (14.15), $\lambda \mathbf{w}^{(t)}$ is an average of $\mathbf{v}_1, \ldots, \mathbf{v}_{t-1}$) also $\|\lambda \mathbf{w}^{(t)}\| \le \rho$, which yields
$$\left\|\lambda \mathbf{w}^{(t)} + \mathbf{v}_t\right\| \le 2\rho.$$

Theorem 14.11 therefore tells us that after performing $T$ iterations we have that
$$\mathbb{E}[f(\bar{\mathbf{w}})] - f(\mathbf{w}^\star) \;\le\; \frac{4\rho^2}{\lambda T}\left(1 + \log(T)\right).$$
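Since the right-hand side of this bound decreases monotonically in $T$, one can numerically invert it to find how many iterations suffice for a given target accuracy $\epsilon$. The helper below is an illustrative sketch of ours, not from the text:

```python
import math

def iterations_for_accuracy(rho, lam, eps):
    """Smallest T for which the bound (4*rho**2/(lam*T))*(1 + log(T))
    is at most eps.  Uses doubling followed by binary search, which is
    valid because the bound is non-increasing in T for T >= 1."""
    bound = lambda T: (4 * rho**2 / (lam * T)) * (1 + math.log(T))
    T = 1
    while bound(T) > eps:   # double until the bound is small enough
        T *= 2
    lo, hi = T // 2, T      # invariant: bound(lo) > eps >= bound(hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if bound(mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi

# Example usage with illustrative parameter values:
T_needed = iterations_for_accuracy(rho=1.0, lam=0.1, eps=0.1)
```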


              14.6 SUMMARY
              We have introduced the Gradient Descent and Stochastic Gradient Descent algo-
              rithms, along with several of their variants. We have analyzed their convergence rate
              and calculated the number of iterations that would guarantee an expected objective
of at most $\epsilon$ plus the optimal objective. Most importantly, we have shown that by using SGD we can directly minimize the risk function. We do so by sampling a point i.i.d. from $\mathcal{D}$ and using a subgradient of the loss of the current hypothesis $\mathbf{w}^{(t)}$
              at this point as an unbiased estimate of the gradient (or a subgradient) of the risk
              function. This implies that a bound on the number of iterations also yields a sam-
              ple complexity bound. Finally, we have also shown how to apply the SGD method
              to the problem of regularized risk minimization. In future chapters we show how
              this yields extremely simple solvers to some optimization problems associated with
              regularized risk minimization.