
Regularization and Stability

On the other hand, for any $v$ and $u$, and for all $i$, we have
\[
\begin{aligned}
f_S(v) - f_S(u) &= L_S(v) + \lambda\|v\|^2 - \bigl(L_S(u) + \lambda\|u\|^2\bigr) \\
&= L_{S^{(i)}}(v) + \lambda\|v\|^2 - \bigl(L_{S^{(i)}}(u) + \lambda\|u\|^2\bigr) \\
&\quad + \frac{\ell(v,z_i) - \ell(u,z_i)}{m} + \frac{\ell(u,z') - \ell(v,z')}{m}.
\end{aligned}
\tag{13.8}
\]
In particular, choosing $v = A(S^{(i)})$, $u = A(S)$, and using the fact that $v$ minimizes $L_{S^{(i)}}(w) + \lambda\|w\|^2$, we obtain that
\[
f_S(A(S^{(i)})) - f_S(A(S)) \;\le\; \frac{\ell(A(S^{(i)}),z_i) - \ell(A(S),z_i)}{m} + \frac{\ell(A(S),z') - \ell(A(S^{(i)}),z')}{m}.
\tag{13.9}
\]
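To spell out the step from Equation (13.8) to Equation (13.9): since $v = A(S^{(i)})$ is, by definition of the RLM rule, a minimizer of $L_{S^{(i)}}(w) + \lambda\|w\|^2$, the first two terms on the right-hand side of Equation (13.8) satisfy
\[
L_{S^{(i)}}(v) + \lambda\|v\|^2 - \bigl(L_{S^{(i)}}(u) + \lambda\|u\|^2\bigr) \;\le\; 0,
\]
so dropping them can only increase the right-hand side, which gives Equation (13.9).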
                 Combining this with Equation (13.7) we obtain that
\[
\lambda\,\|A(S^{(i)}) - A(S)\|^2 \;\le\; \frac{\ell(A(S^{(i)}),z_i) - \ell(A(S),z_i)}{m} + \frac{\ell(A(S),z') - \ell(A(S^{(i)}),z')}{m}.
\tag{13.10}
\]
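Recall that Equation (13.7), obtained earlier in the section from the strong convexity of $f_S$ and the fact that $A(S)$ is its minimizer, states that for every $v$,
\[
f_S(v) - f_S(A(S)) \;\ge\; \lambda\,\|v - A(S)\|^2 .
\]
Applying it with $v = A(S^{(i)})$ and combining with Equation (13.9) yields Equation (13.10).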
                    The two subsections that follow continue the stability analysis for either Lip-
                 schitz or smooth loss functions. For both families of loss functions we show that
                 RLM is stable and therefore it does not overfit.
                 13.3.1 Lipschitz Loss
If the loss function, $\ell(\cdot,z_i)$, is $\rho$-Lipschitz, then by the definition of Lipschitzness,
\[
\ell(A(S^{(i)}),z_i) - \ell(A(S),z_i) \;\le\; \rho\,\|A(S^{(i)}) - A(S)\|.
\tag{13.11}
\]
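For convenience, recall the definition being used here: $\ell(\cdot,z)$ is $\rho$-Lipschitz if
\[
|\ell(w,z) - \ell(w',z)| \;\le\; \rho\,\|w - w'\| \quad \text{for all } w, w'.
\]
Equation (13.11) and the inequality that follows simply apply this with $w = A(S^{(i)})$ and $w' = A(S)$, dropping the absolute value on the left-hand side.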
                 Similarly,
\[
\ell(A(S),z') - \ell(A(S^{(i)}),z') \;\le\; \rho\,\|A(S^{(i)}) - A(S)\|.
\]
                 Plugging these inequalities into Equation (13.10) we obtain
\[
\lambda\,\|A(S^{(i)}) - A(S)\|^2 \;\le\; \frac{2\rho\,\|A(S^{(i)}) - A(S)\|}{m},
\]
                 which yields
\[
\|A(S^{(i)}) - A(S)\| \;\le\; \frac{2\rho}{\lambda\,m}.
\]
                 Plugging the preceding back into Equation (13.11) we conclude that
\[
\ell(A(S^{(i)}),z_i) - \ell(A(S),z_i) \;\le\; \frac{2\rho^2}{\lambda\,m}.
\]
Since this holds for any $S$, $z'$, $i$ we immediately obtain:

Corollary 13.6. Assume that the loss function is convex and $\rho$-Lipschitz. Then, the RLM rule with the regularizer $\lambda\|w\|^2$ is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$.
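As a numerical illustration of this corollary, the following sketch fits the RLM rule for the absolute loss $\ell(w,(x,y)) = |\langle w, x\rangle - y|$, which is convex and $\rho$-Lipschitz in $w$ with $\rho = \max_i \|x_i\|$, and compares the observed replace-one differences with the bounds $\frac{2\rho}{\lambda m}$ and $\frac{2\rho^2}{\lambda m}$ derived above. The data distribution, the dimensions, and the use of scipy.optimize.minimize as a generic solver are illustrative assumptions, not part of the text.

```python
# A minimal numerical sketch of Corollary 13.6, under illustrative assumptions:
# the loss is the absolute loss l(w, (x, y)) = |<w, x> - y|, which is convex and
# rho-Lipschitz in w with rho = max_i ||x_i|| (the x_i below are normalized, so
# rho = 1), and the RLM objective is minimized with scipy's generic optimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, d, lam = 40, 3, 0.2          # sample size, dimension, regularization parameter
rho = 1.0                        # Lipschitz constant of the absolute loss when ||x|| <= 1

def loss(w, x, y):
    return abs(w @ x - y)

def rlm(X, y):
    """A(S) = argmin_w  (1/m) * sum_i |<w, x_i> - y_i|  +  lam * ||w||^2."""
    obj = lambda w: np.mean(np.abs(X @ w - y)) + lam * w @ w
    return minimize(obj, np.zeros(d), method="Powell").x

def sample(n):
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize so ||x_i|| = 1
    return X, rng.normal(size=n)

X, y = sample(m)
w_S = rlm(X, y)

norm_diffs, loss_diffs = [], []
for i in range(m):
    # Build S^(i): replace the i-th example by a fresh z' from the same distribution.
    (x_new,), (y_new,) = sample(1)
    Xi, yi = X.copy(), y.copy()
    Xi[i], yi[i] = x_new, y_new
    w_Si = rlm(Xi, yi)
    norm_diffs.append(np.linalg.norm(w_Si - w_S))
    loss_diffs.append(loss(w_Si, X[i], y[i]) - loss(w_S, X[i], y[i]))

print("max ||A(S^(i)) - A(S)||     :", max(norm_diffs),
      " bound 2*rho/(lam*m) =", 2 * rho / (lam * m))
print("average replace-one loss gap:", np.mean(loss_diffs),
      " rate 2*rho^2/(lam*m) =", 2 * rho**2 / (lam * m))
```

Because a generic solver only approximately minimizes the nonsmooth objective, small deviations due to optimization error are possible; the point of the check is that the observed differences should sit below the $\frac{2\rho^2}{\lambda m}$ rate and shrink as $\lambda m$ grows.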