

Let $B = \|w^\star\|_2$ and consider the set $H = \{w : \|w\|_2 \le B\}$. By the definition of hard-SVM and our assumption on the distribution, we have that $w_S \in H$ with probability 1 and that $L_S(w_S) = 0$. Therefore, using Theorem 26.12 we have that
$$
L_D(w_S) \;\le\; L_S(w_S) + \frac{2BR}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta)}{m}}.
$$
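Substituting $L_S(w_S) = 0$ and $B = \|w^\star\|_2$, the display becomes
$$
L_D(w_S) \;\le\; \frac{2R\|w^\star\|_2}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta)}{m}},
$$
which matches the form of the generalization bound stated in Theorem 26.13.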


Remark 26.1. Theorem 26.13 implies that the sample complexity of hard-SVM grows like $\frac{R^2\|w^\star\|^2}{\epsilon^2}$. Using a more delicate analysis and the separability assumption, it is possible to improve the bound to an order of $\frac{R^2\|w^\star\|^2}{\epsilon}$.
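To see where the $1/\epsilon^2$ rate comes from, a rough calculation (ignoring constants and the confidence term) requires the dominant term of the bound to be at most $\epsilon$:
$$
\frac{2R\|w^\star\|}{\sqrt{m}} \le \epsilon
\quad\Longleftrightarrow\quad
m \ge \frac{4R^2\|w^\star\|^2}{\epsilon^2},
$$
so a sample of size on the order of $R^2\|w^\star\|^2/\epsilon^2$ suffices to drive the error below $\epsilon$.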


The bound in the preceding theorem depends on $\|w^\star\|$, which is unknown. In the following we derive a bound that depends on the norm of the output of SVM; hence it can be calculated from the training set itself. The proof is similar to the derivation of bounds for structural risk minimization (SRM).
Theorem 26.14. Assume that the conditions of Theorem 26.13 hold. Then, with probability of at least $1-\delta$ over the choice of $S \sim D^m$, we have that
$$
\Pr_{(x,y)\sim D}\big[y \neq \operatorname{sign}(\langle w_S, x\rangle)\big]
\;\le\;
\frac{4R\|w_S\|}{\sqrt{m}} + \sqrt{\frac{4\ln\!\big(4\log_2(\|w_S\|)/\delta\big)}{m}}.
$$
Proof. For any integer $i$, let $B_i = 2^i$, $H_i = \{w : \|w\| \le B_i\}$, and let $\delta_i = \frac{\delta}{2i^2}$. Fix $i$; then, using Theorem 26.12, we have that with probability of at least $1-\delta_i$,
$$
\forall w \in H_i, \quad L_D(w) \;\le\; L_S(w) + \frac{2B_i R}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}}.
$$
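The specific choice of $\delta_i$ makes the failure probabilities summable, which is exactly what the union bound in the next step needs:
$$
\sum_{i=1}^{\infty} \delta_i \;=\; \frac{\delta}{2}\sum_{i=1}^{\infty}\frac{1}{i^2} \;=\; \frac{\pi^2}{12}\,\delta \;\le\; \delta.
$$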
Applying the union bound and using $\sum_{i=1}^{\infty}\delta_i \le \delta$, we obtain that with probability of at least $1-\delta$ this holds for all $i$. Therefore, for all $w$, if we let $i = \lceil \log_2(\|w\|) \rceil$ then $w \in H_i$, $B_i \le 2\|w\|$, and $\frac{2}{\delta_i} = \frac{(2i)^2}{\delta} \le \frac{(4\log_2(\|w\|))^2}{\delta}$. Therefore,

$$
\begin{aligned}
L_D(w) &\;\le\; L_S(w) + \frac{2B_i R}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}} \\
       &\;\le\; L_S(w) + \frac{4\|w\| R}{\sqrt{m}} + \sqrt{\frac{4\big(\ln(4\log_2(\|w\|)) + \ln(1/\delta)\big)}{m}}.
\end{aligned}
$$
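The second inequality uses $B_i \le 2\|w\|$ together with the bound on $2/\delta_i$ derived above:
$$
2\ln(2/\delta_i) \;\le\; 2\ln\!\Big(\frac{(4\log_2(\|w\|))^2}{\delta}\Big)
\;=\; 4\ln(4\log_2(\|w\|)) + 2\ln(1/\delta)
\;\le\; 4\big(\ln(4\log_2(\|w\|)) + \ln(1/\delta)\big).
$$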
              In particular, it holds for w S , which concludes our proof.
Remark 26.2. Note that none of the bounds we have derived depends on the dimension of $w$. This property is utilized when learning SVM with kernels, where the dimension of $w$ can be extremely large.
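As a rough illustration, take hypothetical values $R = 1$, $\|w_S\| = 10$, $m = 10^6$, and $\delta = 0.01$; the bound of Theorem 26.14 then gives
$$
\frac{4 \cdot 1 \cdot 10}{\sqrt{10^6}} + \sqrt{\frac{4\ln\!\big(4\log_2(10)/0.01\big)}{10^6}} \;\approx\; 0.04 + 0.005 \;=\; 0.045,
$$
regardless of whether $w$ lives in $\mathbb{R}^{10}$ or in an infinite-dimensional feature space induced by a kernel.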


26.4 GENERALIZATION BOUNDS FOR PREDICTORS WITH LOW $\ell_1$ NORM

In the previous section we derived generalization bounds for linear predictors with an $\ell_2$-norm constraint. In this section we consider the following general $\ell_1$-norm constraint formulation. Let $H = \{w : \|w\|_1 \le B\}$ be our hypothesis class, and let