
Multiclass Learnability

Let $A$ be some ERM algorithm for $H$. Assume that $A$ operates on a sample labeled by $h_A \in H$. Since $h_A$ is the only hypothesis in $H$ that might return the label $A$, if $A$ observes the label $A$, it "knows" that the learned hypothesis is $h_A$, and, as an ERM, must return it (note that in this case the error of the returned hypothesis is 0). Therefore, to specify an ERM, we should only specify the hypothesis it returns upon receiving a sample of the form

$$S = \{(x_1, *), \ldots, (x_m, *)\}.$$

We consider two ERMs: the first, $A_{\mathrm{good}}$, is defined by
$$A_{\mathrm{good}}(S) = h_\emptyset;$$
that is, it outputs the hypothesis which predicts '*' for every $x \in X$. The second ERM, $A_{\mathrm{bad}}$, is defined by
$$A_{\mathrm{bad}}(S) = h_{\{x_1,\ldots,x_m\}^c}.$$
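As a concrete illustration, here is a minimal Python sketch of the class and the two ERMs (the identifiers `STAR`, `h`, `erm_good`, and `erm_bad` are ours, not from the text). Note that both algorithms are indeed ERMs, since each returns a hypothesis consistent with an all-'*' sample:

    # Illustrative sketch; identifiers are ours, not from the text.
    STAR = "*"  # the special label '*'

    def h(A):
        """The hypothesis h_A: label A on points of A, '*' elsewhere."""
        A = frozenset(A)
        return lambda x: A if x in A else STAR

    def erm_good(sample, domain):
        """A_good: on an all-'*' sample, return h_emptyset."""
        return h(frozenset())

    def erm_bad(sample, domain):
        """A_bad: on an all-'*' sample, return h_{{x_1,...,x_m}^c}."""
        seen = {x for (x, _) in sample}
        return h(frozenset(domain) - seen)

    # Both outputs are consistent with S = {(x_1,*),...,(x_m,*)}:
    domain, S = range(10), [(0, STAR), (3, STAR)]
    assert all(erm_good(S, domain)(x) == STAR for (x, _) in S)
    assert all(erm_bad(S, domain)(x) == STAR for (x, _) in S)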
The following claim shows that the sample complexity of $A_{\mathrm{bad}}$ is about $|X|$ times larger than the sample complexity of $A_{\mathrm{good}}$. This establishes a gap between different ERMs. If $X$ is infinite, we even obtain a learnable class that is not learnable by every ERM.

Claim 29.9.
   1. Let $\epsilon, \delta > 0$, $D$ a distribution over $X$, and $h_A \in H$. Let $S$ be an i.i.d. sample consisting of $m \ge \frac{1}{\epsilon}\log\frac{1}{\delta}$ examples, sampled according to $D$ and labeled by $h_A$. Then, with probability of at least $1 - \delta$, the hypothesis returned by $A_{\mathrm{good}}$ will have an error of at most $\epsilon$.
   2. There exists a constant $a > 0$ such that for every $0 < \epsilon < a$ there exists a distribution $D$ over $X$ and $h_A \in H$ such that the following holds. The hypothesis returned by $A_{\mathrm{bad}}$ upon receiving a sample of size $m \le \frac{|X|-1}{6\epsilon}$, sampled according to $D$ and labeled by $h_A$, will have error $\ge \epsilon$ with probability $\ge e^{-1/6}$.
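To get a feel for the size of this gap, consider (our own choice of parameters, not part of the claim) $\epsilon = \delta = 0.01$ and $|X| = 10^6$. Then $A_{\mathrm{good}}$ succeeds once
$$m \ge \frac{1}{\epsilon}\log\frac{1}{\delta} = 100\log 100 \approx 461,$$
independently of $|X|$, whereas $A_{\mathrm{bad}}$ can still fail at any sample size up to
$$m \le \frac{|X|-1}{6\epsilon} = \frac{10^6 - 1}{0.06} \approx 1.7 \times 10^7.$$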
Proof. Let $D$ be a distribution over $X$ and suppose that the correct labeling is $h_A$. For any sample, $A_{\mathrm{good}}$ returns either $h_\emptyset$ or $h_A$. If it returns $h_A$ then its true error is zero. Thus, it returns a hypothesis with error $\ge \epsilon$ only if all the $m$ examples in the sample are from $X \setminus A$ while the error of $h_\emptyset$, $L_D(h_\emptyset) = P_D[A]$, is $\ge \epsilon$. Assume $m \ge \frac{1}{\epsilon}\log(\frac{1}{\delta})$; then the probability of the latter event is no more than $(1-\epsilon)^m \le e^{-\epsilon m} \le \delta$. This establishes item 1.
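For clarity, the last chain of inequalities spelled out (using the standard bound $1-\epsilon \le e^{-\epsilon}$):
$$(1-\epsilon)^m \;\le\; e^{-\epsilon m} \;\le\; e^{-\epsilon \cdot \frac{1}{\epsilon}\log\frac{1}{\delta}} \;=\; e^{-\log\frac{1}{\delta}} \;=\; \delta.$$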
Next we prove item 2. We restrict the proof to the case that $|X| = d < \infty$. The proof for infinite $X$ is similar. Suppose that $X = \{x_0, \ldots, x_{d-1}\}$.

Let $a > 0$ be small enough such that $1 - 2\epsilon \ge e^{-4\epsilon}$ for every $\epsilon < a$, and fix some $\epsilon < a$. Define a distribution on $X$ by setting $P[x_0] = 1 - 2\epsilon$ and, for all $1 \le i \le d-1$, $P[x_i] = \frac{2\epsilon}{d-1}$. Suppose that the correct hypothesis is $h_\emptyset$ and let the sample size be $m$. Clearly, the hypothesis returned by $A_{\mathrm{bad}}$ will err on all the examples from $X$ which are not in the sample. By Chernoff's bound, if $m \le \frac{d-1}{6\epsilon}$, then with probability $\ge e^{-1/6}$ the sample will include no more than $\frac{d-1}{2}$ examples from $X \setminus \{x_0\}$. At least $\frac{d-1}{2}$ of the points $x_1, \ldots, x_{d-1}$ are then missing from the sample, so the returned hypothesis will have error $\ge \frac{d-1}{2} \cdot \frac{2\epsilon}{d-1} = \epsilon$.
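The claim is easy to check empirically. The following Monte Carlo sketch (ours; the parameter values and the helper `true_error_bad` are illustrative) draws samples from the distribution used in item 2 and measures how often $A_{\mathrm{bad}}$ ends up with error at least $\epsilon$; here $A_{\mathrm{good}}$ always has error 0, since $h_\emptyset$ is the target:

    # Monte Carlo check of the gap (illustrative parameters, ours).
    import random

    random.seed(0)
    d, eps = 101, 0.05                   # |X| = d, domain {x_0, ..., x_{d-1}}
    domain = list(range(d))
    weights = [1 - 2 * eps] + [2 * eps / (d - 1)] * (d - 1)
    m = int((d - 1) / (6 * eps))         # sample size from item 2 of the claim

    def true_error_bad(seen):
        """L_D(h_B) for B = complement of the sample: the mass of unseen points."""
        return sum(w for x, w in zip(domain, weights) if x not in seen)

    trials = 1000
    failures = sum(
        true_error_bad(set(random.choices(domain, weights=weights, k=m))) >= eps
        for _ in range(trials)
    )
    print(f"m = {m}; fraction of runs with error >= eps: {failures / trials:.2f}")
    # The claim guarantees this fraction is at least e^{-1/6} ~ 0.85 at this m.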