

17.2 LINEAR MULTICLASS PREDICTORS
In light of the inadequacy of reduction methods, in this section we study a more direct approach for learning multiclass predictors. We describe the family of linear multiclass predictors. To motivate the construction of this family, recall that a linear predictor for binary classification (i.e., a halfspace) takes the form
$$h(\mathbf{x}) = \operatorname{sign}(\langle \mathbf{w}, \mathbf{x} \rangle).$$

An equivalent way to express the prediction is as follows:
$$h(\mathbf{x}) = \operatorname*{argmax}_{y \in \{\pm 1\}} \langle \mathbf{w}, y\mathbf{x} \rangle,$$

where $y\mathbf{x}$ is the vector obtained by multiplying each element of $\mathbf{x}$ by $y$.
This representation leads to a natural generalization of halfspaces to multiclass problems as follows. Let $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ be a class-sensitive feature mapping. That is, $\Psi$ takes as input a pair $(\mathbf{x}, y)$ and maps it into a $d$-dimensional feature vector. Intuitively, we can think of the elements of $\Psi(\mathbf{x}, y)$ as score functions that assess how well the label $y$ fits the instance $\mathbf{x}$. We will elaborate on $\Psi$ later on. Given $\Psi$ and a vector $\mathbf{w} \in \mathbb{R}^d$, we can define a multiclass predictor, $h : \mathcal{X} \to \mathcal{Y}$, as follows:

$$h(\mathbf{x}) = \operatorname*{argmax}_{y \in \mathcal{Y}} \langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle.$$

That is, the prediction of $h$ for the input $\mathbf{x}$ is the label that achieves the highest weighted score, where weighting is according to the vector $\mathbf{w}$.
Let $W$ be some set of vectors in $\mathbb{R}^d$, for example, $W = \{\mathbf{w} \in \mathbb{R}^d : \|\mathbf{w}\| \le B\}$, for some scalar $B > 0$. Each pair $(\Psi, W)$ defines a hypothesis class of multiclass predictors:

$$H_{\Psi,W} = \left\{ \mathbf{x} \mapsto \operatorname*{argmax}_{y \in \mathcal{Y}} \langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle \; : \; \mathbf{w} \in W \right\}.$$

Of course, the immediate question, which we discuss in the sequel, is how to construct a good $\Psi$. Note that if $\mathcal{Y} = \{\pm 1\}$ and we set $\Psi(\mathbf{x}, y) = y\mathbf{x}$ and $W = \mathbb{R}^d$, then $H_{\Psi,W}$ becomes the hypothesis class of homogeneous halfspace predictors for binary classification.
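To make the definition concrete, here is a minimal sketch in Python with NumPy; the names `predict` and `psi_binary` are illustrative, not from the text. It evaluates $\langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle$ for every label, returns the maximizer, and checks the binary special case $\Psi(\mathbf{x}, y) = y\mathbf{x}$ noted above.

```python
import numpy as np

def predict(w, psi, x, labels):
    """Return the label y in `labels` maximizing <w, psi(x, y)>."""
    scores = [np.dot(w, psi(x, y)) for y in labels]
    return labels[int(np.argmax(scores))]

# Binary special case: psi(x, y) = y * x recovers a homogeneous halfspace,
# since the argmax over y in {+1, -1} of y * <w, x> equals sign(<w, x>).
def psi_binary(x, y):
    return y * x

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.3])
assert predict(w, psi_binary, x, labels=[1, -1]) == np.sign(np.dot(w, x))
```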

17.2.1 How to Construct $\Psi$

As mentioned before, we can think of the elements of $\Psi(\mathbf{x}, y)$ as score functions that assess how well the label $y$ fits the instance $\mathbf{x}$. Naturally, designing a good $\Psi$ is similar to the problem of designing a good feature mapping (as we discussed in Chapter 16 and as we will discuss in more detail in Chapter 25). Two examples of useful constructions are given in the following.

The Multivector Construction:
Let $\mathcal{Y} = \{1, \ldots, k\}$ and let $\mathcal{X} = \mathbb{R}^n$. We define $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, where $d = nk$, as follows:

$$\Psi(\mathbf{x}, y) = [\;\underbrace{0, \ldots, 0}_{\in \mathbb{R}^{(y-1)n}}, \;\underbrace{x_1, \ldots, x_n}_{\in \mathbb{R}^{n}}, \;\underbrace{0, \ldots, 0}_{\in \mathbb{R}^{(k-y)n}}\;]. \qquad (17.2)$$
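A minimal sketch of the mapping in Equation (17.2), assuming labels $1, \ldots, k$ as in the text; the helper name `psi_multivector` is illustrative:

```python
import numpy as np

def psi_multivector(x, y, k):
    """Place x into the y-th of k blocks of a zero vector of length k * n.

    x : array of length n; y : label in {1, ..., k}.
    """
    n = len(x)
    out = np.zeros(k * n)
    out[(y - 1) * n : y * n] = x  # the y-th block holds x, all others stay 0
    return out

# Example with n = 2, k = 3:
# psi_multivector([5., 7.], 2, 3) -> [0., 0., 5., 7., 0., 0.]
```

Note that with this construction $\mathbf{w} \in \mathbb{R}^{nk}$ decomposes into $k$ blocks $\mathbf{w}_1, \ldots, \mathbf{w}_k$, and $\langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle = \langle \mathbf{w}_y, \mathbf{x} \rangle$, so the predictor effectively maintains one weight vector per class.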