
Nonuniform Learnability

and then Equation (7.3) implies that

    L_D(h) ≤ L_S(h) + ε_{n(h)}(m, w(n(h))·δ).

The SRM paradigm searches for h that minimizes this bound, as formalized in the following pseudocode:

Structural Risk Minimization (SRM)

prior knowledge:
    H = ⋃_n H_n, where H_n has the uniform convergence property with sample complexity m^{UC}_{H_n}
    w : ℕ → [0,1], where ∑_n w(n) ≤ 1
define: ε_n as in Equation (7.1); n(h) as in Equation (7.4)
input: training set S ∼ D^m, confidence δ
output: h ∈ argmin_{h∈H} [ L_S(h) + ε_{n(h)}(m, w(n(h))·δ) ]
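To make the rule concrete, here is a minimal Python sketch of SRM over the ERM outputs of finitely many classes H_1, ..., H_N. It is only an illustration under assumed ingredients: eps_n is a hypothetical stand-in for ε_n of Equation (7.1), pretending H_n has complexity parameter n, and erm_results is assumed to already hold each class's ERM hypothesis with its empirical risk.

import math

def weight(n: int) -> float:
    # The weighting function w(n) = 6 / (pi^2 n^2); these weights sum to 1.
    return 6.0 / (math.pi ** 2 * n ** 2)

def eps_n(n: int, m: int, delta: float) -> float:
    # Hypothetical stand-in for eps_n(m, delta) of Equation (7.1), pretending
    # H_n has complexity parameter n. In general this is whatever bound the
    # sample complexity m_UC of H_n inverts to.
    return math.sqrt((n + math.log(1.0 / delta)) / m)

def srm(erm_results, m: int, delta: float):
    # erm_results: list of triples (n, h_n, emp_risk), where h_n is the ERM
    # hypothesis of H_n on the training set S and emp_risk is L_S(h_n).
    # Returns the hypothesis minimizing L_S(h) + eps_{n(h)}(m, w(n(h)) * delta).
    def objective(triple):
        n, _h, emp_risk = triple
        return emp_risk + eps_n(n, m, weight(n) * delta)
    n_star, h_star, _ = min(erm_results, key=objective)
    return h_star, n_star

# Hypothetical usage: ERM over H_1, H_2, H_3 returned these empirical risks.
results = [(1, "h1", 0.30), (2, "h2", 0.18), (3, "h3", 0.17)]
h, n = srm(results, m=1000, delta=0.05)
print(h, n)

Note how the penalty ε_{n(h)}(m, w(n(h))·δ) grows with n both through the complexity of H_n and through the shrinking weight w(n); this is exactly the bias toward simpler classes described next.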
Unlike the ERM paradigm discussed in previous chapters, we no longer just care about the empirical risk, L_S(h), but we are willing to trade some of our bias toward low empirical risk with a bias toward classes for which ε_{n(h)}(m, w(n(h))·δ) is smaller, for the sake of a smaller estimation error.
Next we show that the SRM paradigm can be used for nonuniform learning of every class that is a countable union of uniformly converging hypothesis classes.

Theorem 7.5. Let H be a hypothesis class such that H = ⋃_{n∈ℕ} H_n, where each H_n has the uniform convergence property with sample complexity m^{UC}_{H_n}. Let w : ℕ → [0,1] be such that w(n) = 6/(n²π²). Then, H is nonuniformly learnable using the SRM rule with rate

    m^{NUL}_H(ε, δ, h) ≤ m^{UC}_{H_{n(h)}}(ε/2, 6δ/(πn(h))²).
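This particular w is a legitimate weighting function: by the well-known identity ∑_{n=1}^{∞} 1/n² = π²/6, we have

    ∑_{n=1}^{∞} w(n) = (6/π²) ∑_{n=1}^{∞} 1/n² = (6/π²)·(π²/6) = 1,

so the requirement ∑_n w(n) ≤ 1 holds with equality. Note also that w(n(h))·δ = 6δ/(πn(h))², which is exactly the confidence parameter appearing in the rate above.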
Proof. Let A be the SRM algorithm with respect to the weighting function w. For every h ∈ H, ε, and δ, let m ≥ m^{UC}_{H_{n(h)}}(ε, w(n(h))δ). Using the fact that ∑_n w(n) = 1, we can apply Theorem 7.4 to get that, with probability of at least 1 − δ over the choice of S ∼ D^m, we have that for every h′ ∈ H,

    L_D(h′) ≤ L_S(h′) + ε_{n(h′)}(m, w(n(h′))δ).

The preceding holds in particular for the hypothesis A(S) returned by the SRM rule. By the definition of SRM we obtain that

    L_D(A(S)) ≤ min_{h′∈H} [ L_S(h′) + ε_{n(h′)}(m, w(n(h′))δ) ] ≤ L_S(h) + ε_{n(h)}(m, w(n(h))δ).
Finally, if m ≥ m^{UC}_{H_{n(h)}}(ε/2, w(n(h))δ) then clearly ε_{n(h)}(m, w(n(h))δ) ≤ ε/2. In addition, from the uniform convergence property of each H_n we have that with probability of more than 1 − δ,

    L_S(h) ≤ L_D(h) + ε/2.

Combining all the preceding we obtain that L_D(A(S)) ≤ L_D(h) + ε, which concludes our proof.
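Explicitly, on the intersection of the two events above, the bounds chain as

    L_D(A(S)) ≤ L_S(h) + ε_{n(h)}(m, w(n(h))δ) ≤ L_S(h) + ε/2 ≤ (L_D(h) + ε/2) + ε/2 = L_D(h) + ε.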
                    Note that the previous theorem also proves Theorem 7.3.