Page 225 - Understanding Machine Learning

17.5 Bipartite Ranking and Multivariate Performance Measures


              the positive labels we predict, namely, a/(a + b). The specificity is the fraction of true
              negatives that our predictor “catches,” namely, d/(d + b).
                 Note that as we decrease θ the recall increases (attaining the value 1 when θ =
              −∞). On the other hand, the precision and the specificity usually decrease as we
              decrease θ. Therefore, there is a tradeoff between precision and recall, and we can
              control it by changing θ. The loss functions defined in the following use various
              techniques for combining both the precision and recall.
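As a concrete illustration (a sketch of our own, not from the text; the helper names are made up), the counts a, b, c, d and the three measures can be computed from real-valued scores and a threshold θ, assuming labels in {+1, −1} and that an example is predicted positive when its score exceeds θ:

```python
# Sketch: confusion counts and derived measures for a threshold theta.
# a = true positives, b = false positives, c = false negatives, d = true negatives.

def confusion_counts(scores, y, theta):
    a = b = c = d = 0
    for s, yi in zip(scores, y):
        pred = 1 if s > theta else -1
        if pred == 1 and yi == 1:
            a += 1
        elif pred == 1 and yi == -1:
            b += 1
        elif pred == -1 and yi == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

def precision(a, b):
    return a / (a + b)      # fraction of predicted positives that are correct

def recall(a, c):
    return a / (a + c)      # sensitivity: fraction of true positives caught

def specificity(d, b):
    return d / (d + b)      # fraction of true negatives caught
```

Lowering θ can only move examples from predicted-negative to predicted-positive, so recall never decreases, which is the monotonicity the paragraph above relies on.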

                 Averaging sensitivity and specificity: This measure is the average of the sensitiv-
                  ity and specificity, namely, (1/2)(a/(a + c) + d/(d + b)). This is also the accuracy on positive
                  examples averaged with the accuracy on negative examples. Here, we set θ = 0
                  and the corresponding loss function is Δ(y′, y) = 1 − (1/2)(a/(a + c) + d/(d + b)).
                 F1-score: The F1 score is the harmonic mean of the precision and recall:
                  2/(1/Precision + 1/Recall). Its maximal value (of 1) is obtained when both precision and recall
                  are 1, and its minimal value (of 0) is obtained whenever one of them is 0 (even
                  if the other one is 1). The F1 score can be written using the numbers a, b, c
                  as follows: F1 = 2a/(2a + b + c). Again, we set θ = 0, and the loss function becomes
                  Δ(y′, y) = 1 − F1.

                 Fβ-score: It is like the F1 score, but we attach β² times more importance to
                  recall than to precision, that is, (1 + β²)/(1/Precision + β²/Recall). It can also be written as
                  Fβ = (1 + β²)a / ((1 + β²)a + b + β²c). Again, we set θ = 0, and the loss function becomes
                  1 − Fβ.
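The two forms of each F-score (the harmonic-mean form and the form in the counts a, b, c) are algebraically equivalent, which a quick numerical sketch can confirm (helper names are our own):

```python
# Sketch: F1 and F_beta written directly in the counts
# a (true positives), b (false positives), c (false negatives).

def f1(a, b, c):
    return 2 * a / (2 * a + b + c)

def f_beta(a, b, c, beta):
    # weight recall beta^2 times more heavily than precision
    return (1 + beta**2) * a / ((1 + beta**2) * a + b + beta**2 * c)
```

With β = 1 the Fβ score reduces to the F1 score, and β > 1 trades precision away for recall.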
                Recall at k: We measure the recall while the prediction must contain at most k
                 positive labels. That is, we should set θ so that a + b ≤ k. This is convenient, for
                 example, in the application of a fraud detection system, where a bank employee
                 can only handle a small number of suspicious transactions.
                Precision at k: We measure the precision while the prediction must contain at
                 least k positive labels. That is, we should set θ so that a + b ≥ k.
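For recall at k, setting θ so that a + b ≤ k amounts to predicting positive only for the k highest-scoring examples. A minimal sketch (our own helper name, assuming higher scores indicate positives):

```python
# Sketch: recall when the predictor may flag at most k examples as positive.

def recall_at_k(scores, y, k):
    # take the indices of the k highest scores: these are the predicted positives
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top_k = set(order[:k])
    a = sum(1 for i in top_k if y[i] == 1)      # true positives among the k
    positives = sum(1 for yi in y if yi == 1)   # a + c
    return a / positives
```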
                  The measures defined previously are often referred to as multivariate perfor-
              mance measures. Note that these measures are highly different from the average
              zero-one loss, which in the preceding notation equals (b + c)/(a + b + c + d). In the aforementioned example of fraud detection, when 99.9% of the examples are negatively
              labeled, the zero-one loss of predicting that all the examples are negatives is 0.1%.
              In contrast, the recall of such prediction is 0 and hence the F 1 score is also 0, which
              means that the corresponding loss will be 1.
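This imbalance argument is easy to check numerically; a sketch of our own with 999 negative and 1 positive example, where the all-negative predictor has a = b = 0:

```python
# Sketch: 1000 examples, 999 negative and 1 positive;
# predict "negative" for everything, so there are no predicted positives.
a, b, c, d = 0, 0, 1, 999

zero_one_loss = (b + c) / (a + b + c + d)   # tiny: only the one positive is missed
f1 = 2 * a / (2 * a + b + c)                # recall is 0, so F1 is 0
f1_loss = 1 - f1                            # maximal loss despite 99.9% accuracy
```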


              17.5.1 Linear Predictors for Bipartite Ranking

              We next describe how to train linear predictors for bipartite ranking. As in the
              previous section, a linear predictor for ranking is defined to be

                                      h_w(x̄) = (⟨w, x_1⟩, …, ⟨w, x_r⟩).

              The corresponding loss function is one of the multivariate performance measures
              described before. The loss function depends on y′ = h_w(x̄) via the binary vector it