Let A be some ERM algorithm for H. Assume that A operates on a sample labeled by h_A ∈ H. Since h_A is the only hypothesis in H that might return the label A, if the algorithm observes the label A, it "knows" that the learned hypothesis is h_A, and, as an ERM, must return it (note that in this case the error of the returned hypothesis is 0). Therefore, to specify an ERM, we need only specify the hypothesis it returns upon receiving a sample of the form

S = {(x_1, ∗), ..., (x_m, ∗)}.
We consider two ERMs: The first, A_good, is defined by

A_good(S) = h_∅;

that is, it outputs the hypothesis which predicts '∗' for every x ∈ X. The second ERM, A_bad, is defined by

A_bad(S) = h_{X \ {x_1, ..., x_m}};

that is, it outputs the hypothesis whose defining set is the complement of the set of instances appearing in the sample.
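As a concrete illustration, here is a minimal Python sketch of the two ERMs; representing a hypothesis by its defining set A, and the names h, erm_good, and erm_bad, are our own choices rather than notation from the text:

# A hypothesis is represented by its defining set A: h_A(x) = A if x is in A, else '*'.
def h(A):
    A = frozenset(A)
    return lambda x: A if x in A else '*'

def erm_good(sample):
    # On a sample labeled entirely by '*', A_good returns h_emptyset,
    # which predicts '*' on every point of the domain.
    return h(frozenset())

def erm_bad(sample, domain):
    # On a sample labeled entirely by '*', A_bad returns the hypothesis whose
    # defining set is the complement of the set of instances seen in the sample.
    seen = {x for (x, _) in sample}
    return h(set(domain) - seen)

# Both are ERMs: on S = {(x_1, '*'), ..., (x_m, '*')} each returns a hypothesis
# that outputs '*' on every sampled point x_i, hence has zero empirical error.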
The following claim shows that the sample complexity of A_bad is about |X| times larger than the sample complexity of A_good. This establishes a gap between different ERMs. If X is infinite, we even obtain a learnable class that is not learnable by every ERM.
Claim 29.9.
1. Let ε, δ > 0, let D be a distribution over X, and let h_A ∈ H. Let S be an i.i.d. sample consisting of m ≥ (1/ε) log(1/δ) examples, sampled according to D and labeled by h_A. Then, with probability of at least 1 − δ, the hypothesis returned by A_good will have an error of at most ε.
2. There exists a constant a > 0 such that for every 0 < ε < a there exists a distribution D over X and h_A ∈ H such that the following holds. The hypothesis returned by A_bad upon receiving a sample of size m ≤ (|X| − 1)/(6ε), sampled according to D and labeled by h_A, will have error ≥ ε with probability ≥ e^{−1/6}.
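To get a feel for the size of this gap, take for instance |X| = 1000 and ε = δ = 0.01 (numbers of our own choosing): item 1 guarantees that A_good succeeds once m ≥ 100 ln(100) ≈ 461, whereas item 2 says that A_bad can still fail with samples as large as 999/(6 · 0.01) ≈ 16,650 examples.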
Proof. Let D be a distribution over X and suppose that the correct labeling is h_A. For any sample, A_good returns either h_∅ or h_A. If it returns h_A then its true error is zero. Thus, it returns a hypothesis with error ≥ ε only if all the m examples in the sample are from X \ A while the error of h_∅, L_D(h_∅) = P_D[A], is ≥ ε. Assume m ≥ (1/ε) log(1/δ); then the probability of the latter event is no more than (1 − ε)^m ≤ e^{−εm} ≤ δ. This establishes item 1.
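As a quick numerical sanity check (with numbers of our own choosing): for ε = 0.1 and δ = 0.01 the bound asks for m ≥ 10 ln(100) ≈ 47 examples, and indeed (1 − 0.1)^47 ≈ 0.007 ≤ e^{−0.1·47} ≈ 0.009 ≤ 0.01 = δ.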
Next we prove item 2. We restrict the proof to the case that |X| = d < ∞. The proof for infinite X is similar. Suppose that X = {x_0, ..., x_{d−1}}. Let a > 0 be small enough such that 1 − 2ε ≥ e^{−4ε} for every ε < a, and fix some ε < a. Define a distribution on X by setting P[x_0] = 1 − 2ε and, for all 1 ≤ i ≤ d − 1, P[x_i] = 2ε/(d − 1). Suppose that the correct hypothesis is h_∅ and let the sample size be m. Clearly, the hypothesis returned by A_bad will err on all the examples from X that are not in the sample. Note that the expected number of examples from X \ {x_0} in the sample is 2εm, which is at most (d − 1)/3 when m ≤ (d − 1)/(6ε). By Chernoff's bound, if m ≤ (d − 1)/(6ε), then with probability ≥ e^{−1/6} the sample will include no more than (d − 1)/2 examples from X \ {x_0}. In that case at least (d − 1)/2 of the points x_1, ..., x_{d−1} are missing from the sample, each carrying probability mass 2ε/(d − 1); thus the returned hypothesis will have error ≥ ε.
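To build intuition for this construction, here is a small Monte Carlo sketch in Python; the parameter values d = 31 and ε = 0.05, and all names, are our own illustrative choices. It draws samples from the distribution above, applies A_bad, and estimates how often the true error of its output is at least ε:

import random

def trial(d=31, eps=0.05, rng=random):
    # Distribution from the proof: P[x_0] = 1 - 2*eps, P[x_i] = 2*eps/(d - 1) for i >= 1.
    m = int((d - 1) / (6 * eps))             # the largest sample size allowed by item 2
    weights = [1 - 2 * eps] + [2 * eps / (d - 1)] * (d - 1)
    sample = rng.choices(range(d), weights=weights, k=m)
    # The correct hypothesis is h_emptyset, so every label is '*'.  A_bad returns the
    # hypothesis whose defining set is the complement of the sampled instances, and
    # that hypothesis errs exactly on the points of X that do not appear in the sample.
    unseen = set(range(d)) - set(sample)
    return sum(weights[i] for i in unseen)   # true error of the hypothesis A_bad returns

errors = [trial() for _ in range(2000)]
# Item 2 predicts this fraction to be at least about e^(-1/6), roughly 0.85.
print(sum(e >= 0.05 for e in errors) / len(errors))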