
Regularization and Stability

Combining the preceding with Equation (13.14) and again using the assumption $\beta \le \lambda m/2$ yields

$$
\begin{aligned}
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)
&\le \sqrt{2\beta\,\ell(A(S), z_i)}\;\|A(S^{(i)}) - A(S)\| + \frac{\beta}{2}\,\|A(S^{(i)}) - A(S)\|^2 \\
&\le \left(\frac{4\beta}{\lambda m} + \frac{8\beta^2}{(\lambda m)^2}\right)\left(\sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')}\right)^2 \\
&\le \frac{8\beta}{\lambda m}\left(\sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')}\right)^2 \\
&\le \frac{24\beta}{\lambda m}\left(\ell(A(S), z_i) + \ell(A(S^{(i)}), z')\right),
\end{aligned}
$$
where the third inequality uses the assumption $\beta \le \lambda m/2$ (which gives $\frac{8\beta^2}{(\lambda m)^2} \le \frac{4\beta}{\lambda m}$), and in the last step we used the inequality $(a+b)^2 \le 3(a^2 + b^2)$. Taking expectation with respect to $S$, $z'$, $i$ and noting that $\mathbb{E}[\ell(A(S), z_i)] = \mathbb{E}[\ell(A(S^{(i)}), z')] = \mathbb{E}[L_S(A(S))]$, we conclude that:
Corollary 13.7. Assume that the loss function is β-smooth and nonnegative. Then, the RLM rule with the regularizer $\lambda \|w\|^2$, where $\lambda \ge \frac{2\beta}{m}$, satisfies

$$
\mathbb{E}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \;\le\; \frac{48\beta}{\lambda m}\,\mathbb{E}\left[L_S(A(S))\right].
$$
Note that if for all $z$ we have $\ell(0, z) \le C$, for some scalar $C > 0$, then for every $S$,

$$
L_S(A(S)) \;\le\; L_S(A(S)) + \lambda \|A(S)\|^2 \;\le\; L_S(0) + \lambda \|0\|^2 \;=\; L_S(0) \;\le\; C.
$$
                 Hence, Corollary 13.7 also implies that
                                      
$$
\mathbb{E}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \;\le\; \frac{48\beta C}{\lambda m}.
$$
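
To make the corollary concrete, the following is a small numerical sketch (an illustration, not part of the book) for ridge regression, i.e., the RLM rule with the squared loss $\ell(w,(x,y)) = (\langle w, x\rangle - y)^2$. When $\|x\| \le 1$ this loss is β-smooth with β = 2, and when $|y| \le 1$ we may take C = 1 since $\ell(0,(x,y)) = y^2 \le 1$. The data-generating model, dimension, sample size, and value of λ below are arbitrary choices; the script estimates the expected replace-one difference by Monte Carlo and prints it next to the (typically loose) bound 48βC/(λm).

import numpy as np

rng = np.random.default_rng(0)

def rlm(X, y, lam):
    # Ridge RLM rule in closed form: argmin_w (1/m) sum_i (<w,x_i> - y_i)^2 + lam*||w||^2
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def sq_loss(w, x, y):
    return float((w @ x - y) ** 2)

def sample(n, w_star):
    # Examples with ||x|| = 1 and |y| <= 1, so the squared loss is beta-smooth with
    # beta = 2 and loss(0, z) = y^2 <= 1, i.e. we may take C = 1.
    X = rng.normal(size=(n, len(w_star)))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.clip(X @ w_star + 0.1 * rng.normal(size=n), -1.0, 1.0)
    return X, y

d, m, beta, C = 5, 500, 2.0, 1.0
lam = 0.05                                    # satisfies lam >= 2*beta/m = 0.008
w_star = rng.normal(size=d)

gaps = []
for _ in range(300):
    X, y = sample(m, w_star)
    Xp, yp = sample(1, w_star)                # a fresh example z' = (Xp[0], yp[0])
    i = rng.integers(m)
    Xi, yi = X.copy(), y.copy()
    Xi[i], yi[i] = Xp[0], yp[0]               # S^(i): S with z_i replaced by z'
    w, wi = rlm(X, y, lam), rlm(Xi, yi, lam)
    gaps.append(sq_loss(wi, X[i], y[i]) - sq_loss(w, X[i], y[i]))

print("estimated replace-one difference:", np.mean(gaps))
print("bound 48*beta*C/(lam*m):        ", 48 * beta * C / (lam * m))
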
                 13.4 CONTROLLING THE FITTING-STABILITY TRADEOFF
                 We can rewrite the expected risk of a learning algorithm as

$$
\mathbb{E}_S[L_D(A(S))] \;=\; \mathbb{E}_S[L_S(A(S))] \;+\; \mathbb{E}_S\left[L_D(A(S)) - L_S(A(S))\right]. \qquad (13.15)
$$
The first term reflects how well $A(S)$ fits the training set, while the second term reflects the difference between the true and empirical risks of $A(S)$. As we have shown in Theorem 13.2, the second term is equivalent to the stability of $A$. Since our goal is to minimize the risk of the algorithm, we need the sum of both terms to be small.
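
Concretely, Theorem 13.2 states that the second term equals the expected replace-one difference that we have just bounded:

$$
\mathbb{E}_{S \sim D^m}\left[L_D(A(S)) - L_S(A(S))\right] \;=\; \mathbb{E}_{(S, z') \sim D^{m+1},\, i \sim U(m)}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right].
$$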
In the previous section we bounded the stability term and showed that it decreases as the regularization parameter $\lambda$ increases. On the other hand, the empirical risk increases with $\lambda$. We therefore face a tradeoff between fitting and overfitting. This tradeoff is quite similar to the bias-complexity tradeoff we discussed previously in the book.
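
The tradeoff can also be observed numerically by sweeping λ for the same ridge-regression RLM rule and estimating the two terms of Equation (13.15). The sketch below (again only an illustration with arbitrary sizes and λ values, reusing the closed-form ridge solver from the earlier sketch) approximates $L_D$ by the risk on a large held-out sample; typically the training risk grows with λ while the estimated gap between true and empirical risk shrinks.

import numpy as np

rng = np.random.default_rng(1)

def rlm(X, y, lam):
    # Ridge RLM rule in closed form: argmin_w (1/m) sum_i (<w,x_i> - y_i)^2 + lam*||w||^2
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def risk(w, X, y):
    # Empirical squared-loss risk of w on the sample (X, y).
    return float(np.mean((X @ w - y) ** 2))

d, m = 20, 50
w_star = rng.normal(size=d)
X_train = rng.normal(size=(m, d))
y_train = X_train @ w_star + rng.normal(size=m)
X_test = rng.normal(size=(20000, d))          # large sample as a stand-in for L_D
y_test = X_test @ w_star + rng.normal(size=20000)

print("lambda    L_S(A(S))    L_D(A(S)) - L_S(A(S))")
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    w = rlm(X_train, y_train, lam)
    fit = risk(w, X_train, y_train)           # fitting term
    gap = risk(w, X_test, y_test) - fit       # estimated stability / generalization term
    print(f"{lam:8.3f}  {fit:10.3f}  {gap:10.3f}")
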