

Since VCdim(H) = d, it will follow that |C| = |ρ(C)| ≤ d, as required.
   To establish our claim it is enough to show that |H_ρ(C)| = 2^|C|. Indeed, given a
subset B ⊂ C, by the definition of shattering, there exists h_B ∈ H_Ψ for which

        ∀x ∈ B, h_B(x) = f_0(x)   and   ∀x ∈ C \ B, h_B(x) = f_1(x).

Let w_B ∈ R^d be a vector that defines h_B. We have that, for every x ∈ B,

        ⟨w_B, Ψ(x, f_0(x))⟩ > ⟨w_B, Ψ(x, f_1(x))⟩  ⇒  ⟨w_B, ρ(x)⟩ > 0.

Similarly, for every x ∈ C \ B,

                          ⟨w_B, ρ(x)⟩ < 0.

It follows that the hypothesis g_B ∈ H defined by the same w_B ∈ R^d labels the
points in ρ(B) by 1 and the points in ρ(C \ B) by 0. Since this holds for every
B ⊆ C, we obtain that |C| = |ρ(C)| and |H_ρ(C)| = 2^|C|, which concludes our proof.
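   The key step above is just linearity of the inner product: ⟨w, Ψ(x, f_0(x))⟩ >
⟨w, Ψ(x, f_1(x))⟩ holds exactly when ⟨w, ρ(x)⟩ > 0. A minimal numeric sketch of this
equivalence (Python with NumPy; not from the book — the vectors u and v merely stand
in for Ψ(x, f_0(x)) and Ψ(x, f_1(x)), and all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)    # the vector defining h_B
u = rng.normal(size=d)    # stands in for Psi(x, f_0(x))
v = rng.normal(size=d)    # stands in for Psi(x, f_1(x))
rho = u - v               # rho(x) = Psi(x, f_0(x)) - Psi(x, f_1(x))

# By linearity, <w, u> - <w, v> = <w, rho>, so the two sign tests agree.
assert (w @ u > w @ v) == (w @ rho > 0)
```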
   The theorem is tight in the sense that there are mappings Ψ for which
Ndim(H_Ψ) = Ω(d). For example, this is true for the multivector construction
(see Section 17.2 and the Bibliographic Remarks at the end of this chapter). We
therefore conclude:

Corollary 29.8. Let X = R^n and let Ψ : X × [k] → R^nk be the class sensitive feature
mapping for the multivector construction:

              Ψ(x, y) = [ 0, ..., 0,  x_1, ..., x_n,  0, ..., 0 ],

where the first block of zeros lies in R^((y−1)n), the middle block is x ∈ R^n, and
the last block of zeros lies in R^((k−y)n). Let H_Ψ be as defined in Equation (29.1).
Then, the Natarajan dimension of H_Ψ satisfies

                   (k − 1)(n − 1) ≤ Ndim(H_Ψ) ≤ kn.
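   For concreteness, here is a minimal sketch of the multivector mapping (Python
with NumPy; the function name is hypothetical, and labels are 1-based as in the
corollary). A predictor in H_Ψ then outputs the y maximizing ⟨w, Ψ(x, y)⟩, which for
this Ψ is the familiar multiclass linear predictor with one weight block per class.

```python
import numpy as np

def multivector_psi(x, y, k):
    """Class-sensitive feature map: copy x into the y-th of k blocks
    of length n = len(x), with zeros everywhere else (y in 1..k)."""
    n = len(x)
    out = np.zeros(k * n)
    out[(y - 1) * n : y * n] = x
    return out

x = np.array([1.0, 2.0])          # n = 2
print(multivector_psi(x, 2, 3))   # [0. 0. 1. 2. 0. 0.]
```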


              29.4 ON GOOD AND BAD ERMS
              In this section we present an example of a hypothesis class with the property that not
              all ERMs for the class are equally successful. Furthermore, if we allow an infinite
              number of labels, we will also obtain an example of a class that is learnable by some
              ERM, but other ERMs will fail to learn it. Clearly, this also implies that the class is
              learnable but it does not have the uniform convergence property. For simplicity, we
              consider only the realizable case.
   The class we consider is defined as follows. The instance space X will be any
finite or countable set. Let P_f(X) be the collection of all finite and cofinite
subsets of X (that is, for each A ∈ P_f(X), either A or X \ A must be finite).
Instead of [k], the label set is Y = P_f(X) ∪ {∗}, where ∗ is some special label.
For every A ∈ P_f(X) define h_A : X → Y by

                     h_A(x) = { A   if x ∈ A,
                              { ∗   if x ∉ A.

Finally, the hypothesis class we take is

                       H = {h_A : A ∈ P_f(X)}.
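   A minimal sketch of this class for finite A (Python, not from the book; a
frozenset plays the role of A, the string '*' plays the role of the special label ∗,
and cofinite sets are omitted for simplicity):

```python
def make_h_A(A):
    """Hypothesis h_A: returns the set A itself on points of A, '*' elsewhere."""
    A = frozenset(A)
    return lambda x: A if x in A else '*'

# X can be any countable set, e.g. the natural numbers.
h = make_h_A({1, 2, 3})
print(h(2))   # frozenset({1, 2, 3}) -- the label on a point of A reveals A itself
print(h(7))   # '*'
```

Note that the label of any x ∈ A identifies A completely; this asymmetry between A
and its complement is what will separate good ERMs from bad ones.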