The function $f(x) = x^2$ is not $\rho$-Lipschitz over $\mathbb{R}$ for any $\rho$. To see this, take
$x_1 = 0$ and $x_2 = 1 + \rho$; then
$$ f(x_2) - f(x_1) = (1+\rho)^2 > \rho(1+\rho) = \rho\,|x_2 - x_1|. $$
However, this function is $\rho$-Lipschitz over the set $C = \{x : |x| \le \rho/2\}$. Indeed, for
any $x_1, x_2 \in C$ we have
$$ |x_1^2 - x_2^2| = |x_1 + x_2|\,|x_1 - x_2| \le 2(\rho/2)\,|x_1 - x_2| = \rho\,|x_1 - x_2|. $$
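As an informal numerical illustration (a sketch added here, not part of the text; the choice $\rho = 3$ and the sampling scheme are arbitrary), one can sample pairs of points in $C$ and confirm that the observed slopes of $f(x) = x^2$ stay below $\rho$:

import numpy as np

# Informal check (illustration only): estimate the Lipschitz constant of
# f(x) = x**2 over C = {x : |x| <= rho/2} by sampling pairs of points and
# taking the largest observed slope |f(x1) - f(x2)| / |x1 - x2|.
rng = np.random.default_rng(0)
rho = 3.0
x1, x2 = rng.uniform(-rho / 2, rho / 2, size=(2, 100_000))
mask = np.abs(x1 - x2) > 1e-12            # avoid dividing by (near-)zero gaps
slopes = np.abs(x1**2 - x2**2)[mask] / np.abs(x1 - x2)[mask]
print(slopes.max())                       # stays below rho, matching the bound above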
The linear function $f : \mathbb{R}^d \to \mathbb{R}$ defined by $f(w) = \langle v, w \rangle + b$, where $v \in \mathbb{R}^d$, is
$\|v\|$-Lipschitz. Indeed, using the Cauchy-Schwarz inequality,
$$ |f(w_1) - f(w_2)| = |\langle v, w_1 - w_2 \rangle| \le \|v\|\,\|w_1 - w_2\|. $$
The following claim shows that composition of Lipschitz functions preserves
Lipschitzness.
Claim 12.7. Let $f(x) = g_1(g_2(x))$, where $g_1$ is $\rho_1$-Lipschitz and $g_2$ is $\rho_2$-Lipschitz.
Then, $f$ is $(\rho_1\rho_2)$-Lipschitz. In particular, if $g_2$ is the linear function $g_2(x) = \langle v, x \rangle + b$,
for some $v \in \mathbb{R}^d$, $b \in \mathbb{R}$, then $f$ is $(\rho_1 \|v\|)$-Lipschitz.
Proof.
\begin{align*}
|f(w_1) - f(w_2)| &= |g_1(g_2(w_1)) - g_1(g_2(w_2))| \\
&\le \rho_1\,\|g_2(w_1) - g_2(w_2)\| \\
&\le \rho_1 \rho_2\,\|w_1 - w_2\|.
\end{align*}
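For instance (an illustrative application added here, not from the original text), fix an example $(x, y)$ with $y \in \{\pm 1\}$ and consider the hinge loss as a function of $w$, namely $f(w) = \max\{0,\, 1 - y\langle w, x\rangle\}$. Writing $f = g_1 \circ g_2$ with $g_1(a) = \max\{0, 1 - a\}$, which is $1$-Lipschitz, and the linear function $g_2(w) = y\langle w, x\rangle = \langle yx, w\rangle$, Claim 12.7 gives
$$ |f(w_1) - f(w_2)| \le 1 \cdot \|yx\|\,\|w_1 - w_2\| = \|x\|\,\|w_1 - w_2\|, $$
so the hinge loss is $\|x\|$-Lipschitz with respect to $w$.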
12.1.3 Smoothness
The definition of a smooth function relies on the notion of gradient. Recall that the
gradient of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at $w$, denoted $\nabla f(w)$, is the vector
of partial derivatives of $f$, namely, $\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_d} \right)$.
Definition 12.8 (Smoothness). A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if
its gradient is $\beta$-Lipschitz; namely, for all $v, w$ we have $\|\nabla f(v) - \nabla f(w)\| \le \beta\,\|v - w\|$.
It is possible to show that smoothness implies that for all $v, w$ we have
$$ f(v) \le f(w) + \langle \nabla f(w), v - w \rangle + \frac{\beta}{2}\,\|v - w\|^2. \tag{12.5} $$
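As a concrete illustration (an example added here, not from the original text): for $f(w) = \|w\|^2$ we have $\nabla f(w) = 2w$, so $\|\nabla f(v) - \nabla f(w)\| = 2\|v - w\|$ and $f$ is $2$-smooth; expanding the square shows that Equation (12.5) holds with equality in this case,
$$ f(v) = \|w + (v - w)\|^2 = f(w) + \langle 2w, v - w \rangle + \|v - w\|^2 = f(w) + \langle \nabla f(w), v - w \rangle + \frac{2}{2}\,\|v - w\|^2. $$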
Recall that convexity of $f$ implies that $f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle$. Therefore,
when a function is both convex and smooth, we have both upper and lower bounds
on the difference between the function and its first order approximation.
Setting $v = w - \frac{1}{\beta}\nabla f(w)$ in the right-hand side of Equation (12.5) and rearranging
terms, we obtain
$$ \frac{1}{2\beta}\,\|\nabla f(w)\|^2 \le f(w) - f(v). $$
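For completeness (a worked expansion added here, not from the original text), this substitution gives $\langle \nabla f(w), v - w \rangle = -\frac{1}{\beta}\|\nabla f(w)\|^2$ and $\|v - w\|^2 = \frac{1}{\beta^2}\|\nabla f(w)\|^2$, so Equation (12.5) becomes
$$ f(v) \le f(w) - \frac{1}{\beta}\,\|\nabla f(w)\|^2 + \frac{\beta}{2}\cdot\frac{1}{\beta^2}\,\|\nabla f(w)\|^2 = f(w) - \frac{1}{2\beta}\,\|\nabla f(w)\|^2, $$
and rearranging yields the displayed inequality.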