
Learning via Uniform Convergence

Proof. For every h ∈ H,

    L_D(h_S) ≤ L_S(h_S) + ε/2 ≤ L_S(h) + ε/2 ≤ L_D(h) + ε/2 + ε/2 = L_D(h) + ε,

where the first and third inequalities are due to the assumption that S is ε/2-representative (Definition 4.1) and the second inequality holds since h_S is an ERM predictor.
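
The chain of inequalities above can be checked numerically. The sketch below is an illustration, not from the book: the finite threshold class, the noisy-label distribution, and all constants are arbitrary choices, and a large held-out sample stands in for the true distribution D. It constructs an ε for which S is ε/2-representative by definition and verifies that the ERM predictor's risk is within ε of the best hypothesis in the class.

```python
# Numerical sketch of Lemma 4.2 (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 21)      # finite class H of h_t(x) = 1[x >= t]

def sample(m):
    """m points with noisy threshold labels: y = 1[x >= 0.3], flipped w.p. 0.1."""
    x = rng.random(m)
    y = (x >= 0.3).astype(int) ^ (rng.random(m) < 0.1).astype(int)
    return x, y

def losses(x, y):
    """0-1 loss of every h in H on the sample (x, y)."""
    return np.array([np.mean(((x >= t).astype(int)) != y) for t in thresholds])

x_S, y_S = sample(500)                       # the training set S
L_S = losses(x_S, y_S)
L_D = losses(*sample(500_000))               # large sample standing in for D

eps = 2 * np.max(np.abs(L_S - L_D))          # S is (eps/2)-representative by construction
h_S = np.argmin(L_S)                         # index of the ERM predictor

# Lemma 4.2 predicts: L_D(h_S) <= min_h L_D(h) + eps.
print(f"L_D(h_S) = {L_D[h_S]:.4f}, min L_D = {L_D.min():.4f}, eps = {eps:.4f}")
assert L_D[h_S] <= L_D.min() + eps + 1e-12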
The preceding lemma implies that to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that with probability of at least 1 − δ over the random choice of a training set, it will be an ε-representative training set. The uniform convergence condition formalizes this requirement.
Definition 4.3 (Uniform Convergence). We say that a hypothesis class H has the uniform convergence property (w.r.t. a domain Z and a loss function ℓ) if there exists a function m_H^UC : (0,1)² → ℕ such that for every ε, δ ∈ (0,1) and for every probability distribution D over Z, if S is a sample of m ≥ m_H^UC(ε, δ) examples drawn i.i.d. according to D, then, with probability of at least 1 − δ, S is ε-representative.
Similar to the definition of sample complexity for PAC learning, the function m_H^UC measures the (minimal) sample complexity of obtaining the uniform convergence property, namely, how many examples we need to ensure that with probability of at least 1 − δ the sample would be ε-representative.
The term "uniform" here refers to having a fixed sample size that works for all members of H and over all possible probability distributions over the domain.
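
To make the definition concrete, the Monte Carlo sketch below (again illustrative: the same toy threshold class and arbitrary constants as in the earlier sketch, with a large sample standing in for D) estimates, for each m, the probability that a sample of size m is ε-representative. The smallest m at which this probability reaches 1 − δ is an empirical stand-in for m_H^UC(ε, δ).

```python
# Monte Carlo sketch of Definition 4.3 (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 21)      # finite class H of h_t(x) = 1[x >= t]

def sample(m):
    """m points with noisy threshold labels (flip probability 0.1)."""
    x = rng.random(m)
    y = (x >= 0.3).astype(int) ^ (rng.random(m) < 0.1).astype(int)
    return x, y

def losses(x, y):
    """0-1 loss of every h in H on the sample (x, y)."""
    return np.array([np.mean(((x >= t).astype(int)) != y) for t in thresholds])

L_D = losses(*sample(200_000))              # large sample standing in for D

def prob_representative(m, eps, trials=300):
    """Estimate the probability over S ~ D^m that sup_h |L_S(h) - L_D(h)| <= eps."""
    hits = sum(np.max(np.abs(losses(*sample(m)) - L_D)) <= eps
               for _ in range(trials))
    return hits / trials

for m in (100, 400, 1600):
    print(f"m={m:5d}  P(S is 0.05-representative) ≈ {prob_representative(m, 0.05):.2f}")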
                    The following corollary follows directly from Lemma 4.2 and the definition of
                 uniform convergence.

Corollary 4.4. If a class H has the uniform convergence property with a function m_H^UC, then the class is agnostically PAC learnable with the sample complexity m_H(ε, δ) ≤ m_H^UC(ε/2, δ). Furthermore, in that case, the ERM_H paradigm is a successful agnostic PAC learner for H.
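
The factor of 2 in the corollary is inherited from Lemma 4.2: given m ≥ m_H^UC(ε/2, δ) examples, with probability of at least 1 − δ the sample S is ε/2-representative, and on that event the lemma yields

    L_D(h_S) ≤ min_{h ∈ H} L_D(h) + ε,

which is exactly the agnostic PAC requirement.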


                 4.2 FINITE CLASSES ARE AGNOSTIC PAC LEARNABLE
                 In view of Corollary 4.4, the claim that every finite hypothesis class is agnostic PAC
                 learnable will follow once we establish that uniform convergence holds for a finite
                 hypothesis class.
To show that uniform convergence holds we follow a two-step argument, similar to the derivation in Chapter 2. The first step applies the union bound while the second step employs a measure concentration inequality. We now explain these two steps in detail.
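
Before turning to the details, the sketch below previews the concentration phenomenon that the second step relies on. It is illustrative only (the risk value and constants are arbitrary choices): for a single fixed h, L_S(h) is an average of m i.i.d. losses with mean L_D(h), so large deviations become rare as m grows.

```python
# Illustration of measure concentration for one fixed hypothesis h
# (arbitrary constants; not from the book).
import numpy as np

rng = np.random.default_rng(1)
L_D_h = 0.25   # hypothetical true risk L_D(h) of one fixed hypothesis
eps = 0.05

for m in (50, 500, 5000):
    # 10,000 independent draws of L_S(h): each is a mean of m Bernoulli(L_D_h) losses.
    L_S_h = rng.binomial(m, L_D_h, size=10_000) / m
    p_bad = np.mean(np.abs(L_S_h - L_D_h) > eps)
    print(f"m={m:5d}  P(|L_S(h) - L_D(h)| > {eps}) ≈ {p_bad:.4f}")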
Fix some ε, δ. We need to find a sample size m that guarantees that for any D, with probability of at least 1 − δ over the choice of S = (z_1, ..., z_m) sampled i.i.d. from D, we have that for all h ∈ H, |L_S(h) − L_D(h)| ≤ ε. That is,

    D^m({S : ∀h ∈ H, |L_S(h) − L_D(h)| ≤ ε}) ≥ 1 − δ.

Equivalently, we need to show that

    D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) < δ.
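
The union bound step then rewrites this "bad" event as a union over the finitely many members of H and bounds its probability by a sum (a standard step, consistent with the two-step plan above):

    D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε})
        = D^m(⋃_{h ∈ H} {S : |L_S(h) − L_D(h)| > ε})
        ≤ Σ_{h ∈ H} D^m({S : |L_S(h) − L_D(h)| > ε}),

so it suffices to make each summand smaller than δ/|H|, which is the job of the measure concentration step.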