

Since VCdim(H) = d, it will follow that |C| = |ρ(C)| ≤ d, as required.
   To establish our claim it is enough to show that |H_ρ(C)| = 2^|C|. Indeed, given a
subset B ⊂ C, by the definition of shattering, there exists h_B ∈ H_Ψ for which

        ∀x ∈ B, h_B(x) = f_0(x)   and   ∀x ∈ C \ B, h_B(x) = f_1(x).

Let w_B ∈ R^d be a vector that defines h_B. We have that, for every x ∈ B,

        ⟨w_B, Ψ(x, f_0(x))⟩ > ⟨w_B, Ψ(x, f_1(x))⟩  ⇒  ⟨w_B, ρ(x)⟩ > 0.

Similarly, for every x ∈ C \ B,

                          ⟨w_B, ρ(x)⟩ < 0.

It follows that the hypothesis g_B ∈ H defined by the same w_B ∈ R^d labels the
points in ρ(B) by 1 and the points in ρ(C \ B) by 0. Since this holds for every
B ⊆ C, we obtain that |C| = |ρ(C)| and |H_ρ(C)| = 2^|C|, which concludes our proof.
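   The key step above is just linearity of the inner product: ⟨w, Ψ(x, f_0(x))⟩ >
⟨w, Ψ(x, f_1(x))⟩ holds exactly when ⟨w, ρ(x)⟩ > 0. A minimal numeric sketch of this
equivalence (Python with NumPy; not from the book — the vectors u and v merely stand
in for Ψ(x, f_0(x)) and Ψ(x, f_1(x)), and all names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)    # the vector defining h_B
u = rng.normal(size=d)    # stands in for Psi(x, f_0(x))
v = rng.normal(size=d)    # stands in for Psi(x, f_1(x))
rho = u - v               # rho(x) = Psi(x, f_0(x)) - Psi(x, f_1(x))

# By linearity, <w, u> - <w, v> = <w, rho>, so the two sign tests agree.
assert (w @ u > w @ v) == (w @ rho > 0)
```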
   The theorem is tight in the sense that there are mappings Ψ for which
Ndim(H_Ψ) = Ω(d). For example, this is true for the multivector construction
(see Section 17.2 and the Bibliographic Remarks at the end of this chapter). We
therefore conclude:

Corollary 29.8. Let X = R^n and let Ψ : X × [k] → R^nk be the class sensitive feature
mapping for the multivector construction:

              Ψ(x, y) = [ 0, ..., 0,  x_1, ..., x_n,  0, ..., 0 ],

where the first block of zeros lies in R^((y−1)n), the middle block is x ∈ R^n, and
the last block of zeros lies in R^((k−y)n). Let H_Ψ be as defined in Equation (29.1).
Then, the Natarajan dimension of H_Ψ satisfies

                   (k − 1)(n − 1) ≤ Ndim(H_Ψ) ≤ kn.
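   For concreteness, here is a minimal sketch of the multivector mapping (Python
with NumPy; the function name is hypothetical, and labels are 1-based as in the
corollary). A predictor in H_Ψ then outputs the y maximizing ⟨w, Ψ(x, y)⟩, which for
this Ψ is the familiar multiclass linear predictor with one weight block per class.

```python
import numpy as np

def multivector_psi(x, y, k):
    """Class-sensitive feature map: copy x into the y-th of k blocks
    of length n = len(x), with zeros everywhere else (y in 1..k)."""
    n = len(x)
    out = np.zeros(k * n)
    out[(y - 1) * n : y * n] = x
    return out

x = np.array([1.0, 2.0])          # n = 2
print(multivector_psi(x, 2, 3))   # [0. 0. 1. 2. 0. 0.]
```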


              29.4 ON GOOD AND BAD ERMS
              In this section we present an example of a hypothesis class with the property that not
              all ERMs for the class are equally successful. Furthermore, if we allow an infinite
              number of labels, we will also obtain an example of a class that is learnable by some
              ERM, but other ERMs will fail to learn it. Clearly, this also implies that the class is
              learnable but it does not have the uniform convergence property. For simplicity, we
              consider only the realizable case.
   The class we consider is defined as follows. The instance space X will be any
finite or countable set. Let P_f(X) be the collection of all finite and cofinite
subsets of X (that is, for each A ∈ P_f(X), either A or X \ A must be finite).
Instead of [k], the label set is Y = P_f(X) ∪ {∗}, where ∗ is some special label.
For every A ∈ P_f(X) define h_A : X → Y by

                     h_A(x) = { A   if x ∈ A,
                              { ∗   if x ∉ A.

Finally, the hypothesis class we take is

                       H = {h_A : A ∈ P_f(X)}.
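   A minimal sketch of this class for finite A (Python, not from the book; a
frozenset plays the role of A, the string '*' plays the role of the special label ∗,
and cofinite sets are omitted for simplicity):

```python
def make_h_A(A):
    """Hypothesis h_A: returns the set A itself on points of A, '*' elsewhere."""
    A = frozenset(A)
    return lambda x: A if x in A else '*'

# X can be any countable set, e.g. the natural numbers.
h = make_h_A({1, 2, 3})
print(h(2))   # frozenset({1, 2, 3}) -- the label on a point of A reveals A itself
print(h(7))   # '*'
```

Note that the label of any x ∈ A identifies A completely; this asymmetry between A
and its complement is what will separate good ERMs from bad ones.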