Page 373 - Understanding Machine Learning
Since VCdim(H) = d, it will follow that |C| = |ρ(C)| ≤ d, as required.
To establish our claim it is enough to show that |H_ρ(C)| = 2^|C|. Indeed, given a
subset B ⊂ C, by the definition of shattering, there exists h_B ∈ H for which

∀x ∈ B, h_B(x) = f_0(x)  and  ∀x ∈ C \ B, h_B(x) = f_1(x).

Let w_B ∈ R^d be a vector that defines h_B. We have that, for every x ∈ B,

⟨w_B, Ψ(x, f_0(x))⟩ > ⟨w_B, Ψ(x, f_1(x))⟩  ⟹  ⟨w_B, ρ(x)⟩ > 0.

Similarly, for every x ∈ C \ B,

⟨w_B, ρ(x)⟩ < 0.

It follows that the hypothesis g_B ∈ H defined by the same w_B ∈ R^d labels the points in
ρ(B) by 1 and the points in ρ(C \ B) by 0. Since this holds for every B ⊆ C we obtain
that |C| = |ρ(C)| and |H_ρ(C)| = 2^|C|, which concludes our proof.
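The key step of the proof is that, under a linear multiclass rule, preferring the label f_0(x) to f_1(x) is equivalent to the sign condition ⟨w, ρ(x)⟩ > 0. The following sketch checks this numerically (the feature map, sizes, and random data are illustrative assumptions, not from the text):

```python
import numpy as np

# Illustrative setup: a generic linear class-sensitive feature map
# Psi(x, y) = M[y] @ x, with rho(x) = Psi(x, y0) - Psi(x, y1).
n, d, k = 3, 6, 4
rng = np.random.default_rng(0)
M = rng.normal(size=(k, d, n))   # one matrix per label (hypothetical map)

def psi(x, y):
    return M[y] @ x

w = rng.normal(size=d)
for _ in range(100):
    x = rng.normal(size=n)
    y0, y1 = rng.choice(k, size=2, replace=False)  # play the roles of f0(x), f1(x)
    rho = psi(x, y0) - psi(x, y1)
    # Preferring y0 over y1 under w is exactly the condition <w, rho(x)> > 0.
    assert (w @ psi(x, y0) > w @ psi(x, y1)) == (w @ rho > 0)
```

By linearity of the inner product, the equivalence holds for any choice of Ψ and w; the loop merely spot-checks it on random instances.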
The theorem is tight in the sense that there are mappings Ψ for which
Ndim(H) = Ω(d). For example, this is true for the multivector construction
(see Section 17.2 and the Bibliographic Remarks at the end of this chapter). We
therefore conclude:
Corollary 29.8. Let X = R^n and let Ψ : X × [k] → R^{nk} be the class-sensitive feature
mapping for the multi-vector construction:

Ψ(x, y) = [ 0, ..., 0, x_1, ..., x_n, 0, ..., 0 ],

where the first block of zeros lies in R^{(y−1)n}, the middle block x_1, ..., x_n lies in R^n,
and the last block of zeros lies in R^{(k−y)n}. Let H be as defined in Equation (29.1).
Then, the Natarajan dimension of H satisfies

(k − 1)(n − 1) ≤ Ndim(H) ≤ kn.
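The multi-vector map is simple to realize in code. The sketch below (sizes are illustrative) places x in the y-th block of R^{nk}, and verifies that the induced predictor argmax_y ⟨w, Ψ(x, y)⟩ is just an argmax over k per-class inner products ⟨w_y, x⟩, where w_y is the y-th block of w:

```python
import numpy as np

n, k = 3, 4  # illustrative instance dimension and number of labels
rng = np.random.default_rng(1)
w = rng.normal(size=n * k)

def psi(x, y):
    """Psi(x, y) = [0,...,0, x_1,...,x_n, 0,...,0], x placed in block y."""
    out = np.zeros(n * k)
    out[(y - 1) * n : y * n] = x   # labels are y = 1, ..., k
    return out

def predict(x):
    """The linear multiclass rule: argmax over labels of <w, Psi(x, y)>."""
    return max(range(1, k + 1), key=lambda y: w @ psi(x, y))

# Equivalently: split w into k blocks and take the best per-class score.
x = rng.normal(size=n)
blocks = w.reshape(k, n)           # blocks[y-1] is the weight vector for label y
assert predict(x) == 1 + int(np.argmax(blocks @ x))
```

Since ⟨w, Ψ(x, y)⟩ = ⟨w_y, x⟩, this construction is exactly the familiar "one weight vector per class" multiclass linear predictor.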
29.4 ON GOOD AND BAD ERMS
In this section we present an example of a hypothesis class with the property that not
all ERMs for the class are equally successful. Furthermore, if we allow an infinite
number of labels, we will also obtain an example of a class that is learnable by some
ERM, but other ERMs will fail to learn it. Clearly, this also implies that the class is
learnable but it does not have the uniform convergence property. For simplicity, we
consider only the realizable case.
The class we consider is defined as follows. The instance space X will be any
finite or countable set. Let P_f(X) be the collection of all finite and cofinite subsets
of X (that is, for each A ∈ P_f(X), either A or X \ A must be finite). Instead of [k],
the label set is Y = P_f(X) ∪ {∗}, where ∗ is some special label. For every A ∈ P_f(X)
define h_A : X → Y by

h_A(x) = { A   if x ∈ A
         { ∗   if x ∉ A
Finally, the hypothesis class we take is
H = {h_A : A ∈ P_f(X)}.
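To make the class concrete, here is a small sketch, taking X to be the nonnegative integers as a stand-in for a countable instance set (all names are illustrative):

```python
# Sketch of H = {h_A : A in P_f(X)}: each hypothesis labels the points of
# A with the set A itself, and everything else with the special label *.
STAR = "*"

def make_h(A):
    """Build h_A for a finite A (cofinite A would need a different encoding)."""
    A = frozenset(A)
    return lambda x: A if x in A else STAR

h = make_h({2, 5})
assert h(5) == frozenset({2, 5})   # points in A are labeled with A itself
assert h(7) == STAR                # points outside A get the special label
```

Note the built-in asymmetry: a single non-∗ label in a sample reveals A exactly, whereas an all-∗ sample is consistent with many hypotheses. This asymmetry is what will separate the good ERMs from the bad ones in what follows.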