
13.4 Controlling the Fitting-Stability Tradeoff


We now derive bounds on the empirical risk term for the RLM rule. Recall that the RLM rule is defined as $A(S) = \operatorname{argmin}_{w} \left( L_S(w) + \lambda \|w\|^2 \right)$. Fix some arbitrary vector $w^*$. We have

$$ L_S(A(S)) \;\le\; L_S(A(S)) + \lambda \|A(S)\|^2 \;\le\; L_S(w^*) + \lambda \|w^*\|^2 . $$

The first inequality holds because the regularization term is nonnegative, and the second because $A(S)$ minimizes the regularized objective $L_S(w) + \lambda \|w\|^2$.
Taking expectation of both sides with respect to $S$ and noting that $\mathbb{E}_S[L_S(w^*)] = L_D(w^*)$ (since $w^*$ does not depend on $S$), we obtain that

$$ \mathbb{E}_S[L_S(A(S))] \;\le\; L_D(w^*) + \lambda \|w^*\|^2 . \qquad (13.16) $$
Plugging this into Equation (13.15) we obtain

$$ \mathbb{E}_S[L_D(A(S))] \;\le\; L_D(w^*) + \lambda \|w^*\|^2 + \mathbb{E}_S\!\left[ L_D(A(S)) - L_S(A(S)) \right] . $$
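For reference, the stability bound being invoked next is the one established in Corollary 13.6: for a convex, ρ-Lipschitz loss, the RLM rule with the regularizer $\lambda \|w\|^2$ satisfies

$$ \mathbb{E}_S\!\left[ L_D(A(S)) - L_S(A(S)) \right] \;\le\; \frac{2\rho^2}{\lambda m} . $$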
              Combining the preceding with Corollary 13.6 we conclude:
Corollary 13.8. Assume that the loss function is convex and ρ-Lipschitz. Then, the RLM rule with the regularization function $\lambda \|w\|^2$ satisfies

$$ \forall w^*, \quad \mathbb{E}_S\!\left[ L_D(A(S)) \right] \;\le\; L_D(w^*) + \lambda \|w^*\|^2 + \frac{2\rho^2}{\lambda m} . $$
This bound is often called an oracle inequality – if we think of $w^*$ as a hypothesis with low risk, the bound tells us how many examples are needed so that $A(S)$ will be almost as good as $w^*$, had we known the norm of $w^*$. In practice, however, we usually do not know the norm of $w^*$. We therefore usually tune λ on the basis of a validation set, as described in Chapter 11.
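As a concrete, minimal illustration of this tuning procedure (not from the text), the following NumPy sketch fits an RLM linear predictor with the convex, 1-Lipschitz hinge loss by subgradient descent and selects λ from a small grid by validation error. The hinge loss, the grid of λ values, and all names here are illustrative assumptions.

    import numpy as np

    def rlm_hinge(X, y, lam, epochs=200, lr=0.1):
        # Approximately solve argmin_w  L_S(w) + lam * ||w||^2
        # for the average hinge loss, via plain subgradient descent.
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            margins = y * (X @ w)
            active = margins < 1                      # examples with nonzero hinge loss
            grad_loss = -(X[active] * y[active, None]).sum(axis=0) / n
            w -= lr * (grad_loss + 2 * lam * w)       # add gradient of lam * ||w||^2
        return w

    def zero_one_error(w, X, y):
        return np.mean(np.sign(X @ w) != y)

    # Split the data and tune lambda on a held-out validation set (Chapter 11).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=300))
    X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

    grid = [0.001, 0.01, 0.1, 1.0]                    # illustrative grid of lambda values
    models = {lam: rlm_hinge(X_tr, y_tr, lam) for lam in grid}
    best_lam = min(grid, key=lambda lam: zero_one_error(models[lam], X_va, y_va))
    print("selected lambda:", best_lam)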
We can also easily derive a PAC-like guarantee¹ from Corollary 13.8 for convex-Lipschitz-bounded learning problems:
Corollary 13.9. Let $(\mathcal{H}, Z, \ell)$ be a convex-Lipschitz-bounded learning problem with parameters ρ, B. For any training set size m, let $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$. Then, the RLM rule with the regularization function $\lambda \|w\|^2$ satisfies

$$ \mathbb{E}_S\!\left[ L_D(A(S)) \right] \;\le\; \min_{w \in \mathcal{H}} L_D(w) + \rho B \sqrt{\frac{8}{m}} . $$

In particular, for every $\epsilon > 0$, if $m \ge \frac{8 \rho^2 B^2}{\epsilon^2}$ then for every distribution $\mathcal{D}$, $\mathbb{E}_S[L_D(A(S))] \le \min_{w \in \mathcal{H}} L_D(w) + \epsilon$.
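To see where this choice of λ comes from, recall that in a convex-Lipschitz-bounded problem every $w \in \mathcal{H}$ satisfies $\|w\| \le B$, so Corollary 13.8 bounds the excess risk by $\lambda B^2 + \frac{2\rho^2}{\lambda m}$; the stated λ simply balances the two terms. A short check (not spelled out in the text):

$$ \lambda B^2 + \frac{2\rho^2}{\lambda m} \;=\; \sqrt{\frac{2\rho^2}{B^2 m}}\, B^2 \;+\; \frac{2\rho^2}{m}\sqrt{\frac{B^2 m}{2\rho^2}} \;=\; \rho B \sqrt{\frac{2}{m}} + \rho B \sqrt{\frac{2}{m}} \;=\; \rho B \sqrt{\frac{8}{m}} . $$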



The preceding corollary holds for Lipschitz loss functions. If instead the loss function is smooth and nonnegative, then we can combine Equation (13.16) with Corollary 13.7 to get:

Corollary 13.10. Assume that the loss function is convex, β-smooth, and nonnegative. Then, the RLM rule with the regularization function $\lambda \|w\|^2$, for $\lambda \ge \frac{2\beta}{m}$, satisfies the
¹ Again, the bound below is on the expected risk, but using Exercise 13.1 it can be used to derive an agnostic PAC learning guarantee.