Page 372 - Understanding Machine Learning
P. 372

Multiclass  Learna bility
           354

                    How  tight  is  Lemma  29.5?  It  is  not  hard  to  see  that  for  some  classes,
                         OvA,k
                 Ndim(H      ) can be much smaller than dk  (see Exercise 29.1). However, there
                         bin
                                                                                  OvA,k
                 are several natural binary classes, H bin  (e.g., halfspaces), for which Ndim(H  ) =
                                                                                  bin
                  (dk) (see Exercise 29.6).
                 29.3.2  General  Multiclass-to-Binary  Reductio ns

                 The same reasoning used to establish Lemma 29.5 can be used to upper bound the
                 Natarajan dimension of more general multiclass-to-binary reductions. These reduc-
                 tions train several binary classifiers on the data. Then, given a new instance, they
                 predict its label by using some rule that takes into account the labels predicted by
                 the binary classifiers. These reductions include One-vs.-All and All-Pairs.
                    Suppose that such a method trains l binary classifiers from a binary class H bin ,

                            l

                 and r : {0,1} → [ k] is the rule that determines the (multiclass) label according to

                 the predictions of the binary classifiers. The hypothesis class corresponding to this
                                                                             l


                 method can be defined as follows. For every h = (h 1 ,...,h l ) ∈ (H bin ) define R(h):
                                                                                       ¯
                                                         ¯

                 X → [ k] by

                                           ¯
                                         R(h)( x) = r(h 1 ( x),...,h l ( x)).


                 Finally, let
                                           r                   l
                                                       ¯
                                                   ¯
                                         H bin  = {R(h) : h ∈ (H bin ) }.
                 Similarly to Lemma 29.5 it can be proven that:
                 Lemma 29.6. If d = VCdim(H bin ) then
                                                r
                                         Ndim(H bin ) ≤ 3ld log(ld).
                    The proof is left as Exercise 29.2.
                 29.3.3 Linear Multiclass Predictors
                 Next, we consider the class of linear multiclass predictors (see Section 17.2). Let
                               d
                   : X × [k] → R be some class-sensitive feature mapping and let
                                                                      +
                                  H   =  x  è  argmax w, (x,i)  : w ∈ R d  .        (29.1)
                                               i∈[k]
                                                                                        d
                 Each hypothesis in H   is determined by d parameters, namely, a vector w ∈ R .
                 Therefore, we would expect that the Natarajan dimension would be upper bounded
                 by d. Indeed:
                 Theorem 29.7. Ndim(H   ) ≤ d .
                 Proof. Let C ⊂ X be a shattered set, and let f 0 , f 1 : C → [k] be the two functions
                 that witness the shattering. We need to show that |C|≤ d. For every x ∈ C let ρ(x) =
                                                                def
                  (x, f 0 (x)) −  (x, f 1 (x)). We claim that the set ρ(C) ={ρ(x): x ∈ C} consists of
                 |C| elements (i.e., ρ is one to one) and is shattered by the binary hypothesis class of
                                                 d
                 homogeneous linear separators on R ,
                                                                 d
                                       H ={x  è  sign( w,x ): w ∈ R }.
   367   368   369   370   371   372   373   374   375   376   377