Page 147 - Understanding Machine Learning

12.1 Convexity, Lipschitzness, and Smoothness

The function $f(x) = x^2$ is not $\rho$-Lipschitz over $\mathbb{R}$ for any $\rho$. To see this, take $x_1 = 0$ and $x_2 = 1 + \rho$; then
$$f(x_2) - f(x_1) = (1+\rho)^2 > \rho(1+\rho) = \rho\,|x_2 - x_1|.$$
However, this function is $\rho$-Lipschitz over the set $C = \{x : |x| \le \rho/2\}$. Indeed, for any $x_1, x_2 \in C$ we have
$$|x_1^2 - x_2^2| = |x_1 + x_2|\,|x_1 - x_2| \le 2(\rho/2)\,|x_1 - x_2| = \rho\,|x_1 - x_2|.$$
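A quick numerical sanity check of the bound above: we sample pairs of points from $C = \{x : |x| \le \rho/2\}$ and verify that the slope of $f(x) = x^2$ between them never exceeds $\rho$. This is only an illustrative sketch; the choice $\rho = 4$ and the grid resolution are arbitrary.

```python
# Numerically check that f(x) = x**2 is rho-Lipschitz on C = {x : |x| <= rho/2}.
rho = 4.0                      # arbitrary choice of rho for illustration
f = lambda x: x ** 2

# Sample a grid of points in C and measure the slope |f(x1) - f(x2)| / |x1 - x2|.
n = 201
grid = [-rho / 2 + rho * i / (n - 1) for i in range(n)]
max_slope = max(
    abs(f(x1) - f(x2)) / abs(x1 - x2)
    for x1 in grid for x2 in grid if x1 != x2
)
print(max_slope <= rho + 1e-9)  # True: the slope never exceeds rho on C
```

The maximum slope observed approaches $\rho$ near the boundary of $C$, matching the calculation $|x_1 + x_2| \le 2(\rho/2) = \rho$.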
The linear function $f : \mathbb{R}^d \to \mathbb{R}$ defined by $f(w) = \langle v, w \rangle + b$, where $v \in \mathbb{R}^d$, is $\|v\|$-Lipschitz. Indeed, using the Cauchy-Schwarz inequality,
$$|f(w_1) - f(w_2)| = |\langle v, w_1 - w_2 \rangle| \le \|v\|\,\|w_1 - w_2\|.$$
                 The following claim shows that composition of Lipschitz functions preserves
              Lipschitzness.

Claim 12.7. Let $f(x) = g_1(g_2(x))$, where $g_1$ is $\rho_1$-Lipschitz and $g_2$ is $\rho_2$-Lipschitz. Then, $f$ is $(\rho_1 \rho_2)$-Lipschitz. In particular, if $g_2$ is the linear function, $g_2(x) = \langle v, x \rangle + b$, for some $v \in \mathbb{R}^d$, $b \in \mathbb{R}$, then $f$ is $(\rho_1 \|v\|)$-Lipschitz.

Proof.
$$|f(w_1) - f(w_2)| = |g_1(g_2(w_1)) - g_1(g_2(w_2))| \le \rho_1 \|g_2(w_1) - g_2(w_2)\| \le \rho_1 \rho_2 \|w_1 - w_2\|.$$
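Claim 12.7 can be probed numerically. Below is a sketch taking $g_1(y) = |y|$ (which is $1$-Lipschitz) composed with the linear $g_2(x) = \langle v, x \rangle + b$, so the claim predicts that $f = g_1 \circ g_2$ is $\|v\|$-Lipschitz; the vector $v$, offset $b$, and random samples are arbitrary choices for illustration.

```python
import math
import random

# Claim 12.7 sketch: g1(y) = |y| is 1-Lipschitz and g2(x) = <v, x> + b is
# ||v||-Lipschitz, so f = g1 o g2 should be (1 * ||v||)-Lipschitz.
random.seed(0)
v, b = [3.0, -4.0], 2.0                     # arbitrary; ||v|| = 5
norm_v = math.sqrt(sum(c * c for c in v))

g2 = lambda x: sum(vi * xi for vi, xi in zip(v, x)) + b
f = lambda x: abs(g2(x))                    # g1 = abs, rho1 = 1

# Estimate the Lipschitz constant of f from random pairs of points.
max_ratio = 0.0
for _ in range(10_000):
    w1 = [random.uniform(-1, 1) for _ in v]
    w2 = [random.uniform(-1, 1) for _ in v]
    dist = math.dist(w1, w2)
    if dist > 0:
        max_ratio = max(max_ratio, abs(f(w1) - f(w2)) / dist)
print(max_ratio <= norm_v + 1e-9)           # True: rho1 * ||v|| bounds every ratio
```

No sampled pair of points violates the bound $\rho_1 \|v\| = 5$, consistent with the chain of inequalities in the proof.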





              12.1.3 Smoothness
The definition of a smooth function relies on the notion of gradient. Recall that the gradient of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ at $w$, denoted $\nabla f(w)$, is the vector of partial derivatives of $f$, namely, $\nabla f(w) = \left(\frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_d}\right)$.
Definition 12.8 (Smoothness). A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if its gradient is $\beta$-Lipschitz; namely, for all $v, w$ we have $\|\nabla f(v) - \nabla f(w)\| \le \beta \|v - w\|$.
It is possible to show that smoothness implies that for all $v, w$ we have
$$f(v) \le f(w) + \langle \nabla f(w), v - w \rangle + \frac{\beta}{2} \|v - w\|^2. \qquad (12.5)$$
Recall that convexity of $f$ implies that $f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle$. Therefore, when a function is both convex and smooth, we have both upper and lower bounds on the difference between the function and its first-order approximation.
Setting $v = w - \frac{1}{\beta} \nabla f(w)$ in the right-hand side of Equation (12.5) and rearranging terms, we obtain
$$\frac{1}{2\beta} \|\nabla f(w)\|^2 \le f(w) - f(v).$$
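This inequality is easy to verify on a concrete smooth function. The sketch below uses $f(w) = w^2$, which is $\beta$-smooth with $\beta = 2$ (its gradient $2w$ is $2$-Lipschitz); the test points are arbitrary. For this quadratic the inequality in fact holds with equality.

```python
# Verify (1/(2*beta)) * ||grad f(w)||^2 <= f(w) - f(v) for the beta-smooth
# function f(w) = w**2, where beta = 2 and grad f(w) = 2w.
beta = 2.0
f = lambda w: w ** 2
grad = lambda w: 2 * w

for w in [-3.0, 0.5, 10.0]:          # arbitrary test points
    v = w - grad(w) / beta           # the step v = w - (1/beta) * grad f(w)
    lhs = grad(w) ** 2 / (2 * beta)
    rhs = f(w) - f(v)
    assert lhs <= rhs + 1e-9         # holds with equality for this quadratic
print("inequality verified")
```

Here the step lands exactly at the minimizer $v = 0$, so $f(w) - f(v) = w^2$ coincides with $\frac{1}{2\beta}\|\nabla f(w)\|^2 = \frac{(2w)^2}{4} = w^2$.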