Page 372 - Understanding Machine Learning
P. 372
Multiclass Learna bility
354
How tight is Lemma 29.5? It is not hard to see that for some classes,
OvA,k
Ndim(H ) can be much smaller than dk (see Exercise 29.1). However, there
bin
OvA,k
are several natural binary classes, H bin (e.g., halfspaces), for which Ndim(H ) =
bin
(dk) (see Exercise 29.6).
29.3.2 General Multiclass-to-Binary Reductio ns
The same reasoning used to establish Lemma 29.5 can be used to upper bound the
Natarajan dimension of more general multiclass-to-binary reductions. These reduc-
tions train several binary classifiers on the data. Then, given a new instance, they
predict its label by using some rule that takes into account the labels predicted by
the binary classifiers. These reductions include One-vs.-All and All-Pairs.
Suppose that such a method trains l binary classifiers from a binary class H bin ,
l
and r : {0,1} → [ k] is the rule that determines the (multiclass) label according to
the predictions of the binary classifiers. The hypothesis class corresponding to this
l
method can be defined as follows. For every h = (h 1 ,...,h l ) ∈ (H bin ) define R(h):
¯
¯
X → [ k] by
¯
R(h)( x) = r(h 1 ( x),...,h l ( x)).
Finally, let
r l
¯
¯
H bin = {R(h) : h ∈ (H bin ) }.
Similarly to Lemma 29.5 it can be proven that:
Lemma 29.6. If d = VCdim(H bin ) then
r
Ndim(H bin ) ≤ 3ld log(ld).
The proof is left as Exercise 29.2.
29.3.3 Linear Multiclass Predictors
Next, we consider the class of linear multiclass predictors (see Section 17.2). Let
d
: X × [k] → R be some class-sensitive feature mapping and let
+
H = x è argmax w, (x,i) : w ∈ R d . (29.1)
i∈[k]
d
Each hypothesis in H is determined by d parameters, namely, a vector w ∈ R .
Therefore, we would expect that the Natarajan dimension would be upper bounded
by d. Indeed:
Theorem 29.7. Ndim(H ) ≤ d .
Proof. Let C ⊂ X be a shattered set, and let f 0 , f 1 : C → [k] be the two functions
that witness the shattering. We need to show that |C|≤ d. For every x ∈ C let ρ(x) =
def
(x, f 0 (x)) − (x, f 1 (x)). We claim that the set ρ(C) ={ρ(x): x ∈ C} consists of
|C| elements (i.e., ρ is one to one) and is shattered by the binary hypothesis class of
d
homogeneous linear separators on R ,
d
H ={x è sign( w,x ): w ∈ R }.