
17.1 One-versus-All and All-Pairs


to learn a function h : X → Y. Without loss of generality let us denote
Y = {1, ..., k}.

In the One-versus-All method (a.k.a. One-versus-Rest) we train k binary
classifiers, each of which discriminates between one class and the rest of the
classes. That is, given a training set S = (x_1, y_1), ..., (x_m, y_m), where
every y_i is in Y, we construct k binary training sets, S_1, ..., S_k, where
S_i = (x_1, (−1)^{𝟙[y_1 ≠ i]}), ..., (x_m, (−1)^{𝟙[y_m ≠ i]}). In words, S_i is
the set of instances labeled 1 if their label in S was i, and −1 otherwise. For
every i ∈ [k] we train a binary predictor h_i : X → {±1} based on S_i, hoping
that h_i(x) should equal 1 if and only if x belongs to class i. Then, given
h_1, ..., h_k, we construct a multiclass predictor using the rule


                     h(x) ∈ argmax_{i ∈ [k]} h_i(x).                  (17.1)

When more than one binary hypothesis predicts “1” we should somehow decide
which class to predict (e.g., we can arbitrarily decide to break ties by taking
the minimal index in argmax_i h_i(x)). A better approach can be applied whenever
each h_i hides additional information, which can be interpreted as the
confidence in the prediction y = i. For example, this is the case for
halfspaces, where the actual prediction is sign(⟨w, x⟩), but we can interpret
⟨w, x⟩ as the confidence in the prediction. In such cases, we can apply the
multiclass rule given in Equation (17.1) to the real-valued predictions. A
pseudocode of the One-versus-All approach is given in the following.



                              One-versus-All

         input:
           training set S = (x_1, y_1), ..., (x_m, y_m)
           algorithm for binary classification A
         foreach i ∈ Y
           let S_i = (x_1, (−1)^{𝟙[y_1 ≠ i]}), ..., (x_m, (−1)^{𝟙[y_m ≠ i]})
           let h_i = A(S_i)
         output:
           the multiclass hypothesis defined by h(x) ∈ argmax_{i ∈ Y} h_i(x)
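
To make the reduction concrete, here is a minimal Python sketch of
One-versus-All (the function names are ours, not the book's). It assumes
scikit-learn's LogisticRegression as the binary learning algorithm A; any
binary learner exposing a real-valued score would serve equally well. Following
the discussion above, prediction applies the rule of Equation (17.1) to the
real-valued confidences ⟨w, x⟩ rather than to the hard ±1 outputs.

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_versus_all_train(X, y, A=LogisticRegression):
    # Train one binary predictor h_i per class on the relabeled set S_i,
    # whose binary label is +1 iff the multiclass label equals i,
    # i.e. (-1)^{1[y_j != i]}.
    classes = np.unique(y)
    predictors = {}
    for i in classes:
        y_i = np.where(y == i, 1, -1)    # build S_i
        predictors[i] = A().fit(X, y_i)  # h_i = A(S_i)
    return classes, predictors

def one_versus_all_predict(X, classes, predictors):
    # h(x) in argmax_i h_i(x), computed over real-valued confidences.
    scores = np.column_stack(
        [predictors[i].decision_function(X) for i in classes])
    return classes[np.argmax(scores, axis=1)]

Note that np.argmax returns the smallest index among the maximizers, which
implements exactly the arbitrary tie-breaking rule mentioned above.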


Another popular reduction is the All-Pairs approach, in which all pairs of
classes are compared to each other. Formally, given a training set
S = (x_1, y_1), ..., (x_m, y_m), where every y_i is in [k], for every
1 ≤ i < j ≤ k we construct a binary training sequence, S_{i,j}, containing all
examples from S whose label is either i or j. For each such example, we set the
binary label in S_{i,j} to be +1 if the multiclass label in S is i and −1 if
the multiclass label in S is j. Next, we train a binary classification
algorithm based on every S_{i,j} to get h_{i,j}. Finally, we construct a
multiclass classifier by predicting the class that had the highest number of
“wins.” A pseudocode of the All-Pairs approach is given in the following.
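
In the same spirit, a hedged Python sketch of the All-Pairs reduction (again
with our own helper names and scikit-learn's LogisticRegression standing in
for the binary learner) might look as follows.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def all_pairs_train(X, y, A=LogisticRegression):
    # For every pair i < j, train h_{i,j} on S_{i,j}: the examples whose
    # label is i (relabeled +1) or j (relabeled -1).
    classes = np.unique(y)
    predictors = {}
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1, -1)
        predictors[(i, j)] = A().fit(X[mask], y_ij)
    return classes, predictors

def all_pairs_predict(X, classes, predictors):
    # Predict the class with the highest number of pairwise "wins."
    index = {c: t for t, c in enumerate(classes)}
    wins = np.zeros((X.shape[0], len(classes)))
    for (i, j), h in predictors.items():
        pred = h.predict(X)              # +1: i wins, -1: j wins
        wins[:, index[i]] += (pred == 1)
        wins[:, index[j]] += (pred == -1)
    return classes[np.argmax(wins, axis=1)]

Observe that All-Pairs trains k(k−1)/2 predictors rather than k, but each one
sees only the examples of its two classes, so every binary problem is smaller
than the corresponding One-versus-All problem.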