
which cannot hold. We have thus shown that some permutation matrix, $C_i$, satisfies $\langle A, B\rangle = \langle A, C_i\rangle$. But, since for every other permutation matrix $C$ we have $\langle A, B\rangle \le \langle A, C\rangle$, we conclude that $C_i$ is an optimal solution of both Equation (17.9) and Equation (17.10).
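As a side remark (not part of the book's argument), this claim can be checked numerically: the permutation matrices are exactly the extreme points of the set of doubly stochastic matrices (the Birkhoff-von Neumann theorem), so a linear objective $\langle A, \cdot\rangle$ attains its optimum over that set at a permutation matrix. The sketch below is only illustrative; the matrix size, the random choice of $A$, and the use of scipy.optimize.linprog are assumptions made here, and minimizing versus maximizing is the same argument applied to $-A$.

```python
# Minimal numerical check: optimize a linear objective <A, B> over the
# doubly stochastic polytope and observe that, for a generic A, the
# optimum is a permutation matrix.
import numpy as np
from scipy.optimize import linprog

r = 5
rng = np.random.default_rng(0)
A = rng.standard_normal((r, r))          # generic objective matrix (assumption)

# Equality constraints: every row of B sums to 1 and every column sums to 1.
A_eq = np.zeros((2 * r, r * r))
for i in range(r):
    A_eq[i, i * r:(i + 1) * r] = 1.0     # row-sum constraints
    A_eq[r + i, i::r] = 1.0              # column-sum constraints
b_eq = np.ones(2 * r)

# linprog minimizes <A, B> subject to B >= 0 and the sum constraints.
res = linprog(c=A.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (r * r))
B = res.x.reshape(r, r)

# For a generic A the optimum is unique, so B is (numerically) a 0/1 matrix
# with a single 1 in every row and column, i.e., a permutation matrix.
print(np.round(B, 3))
```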


                 17.5 BIPARTITE RANKING AND MULTIVARIATE
                 PERFORMANCE MEASURES

In the previous section we described the problem of ranking. We used a vector $y \in \mathbb{R}^r$ for representing an order over the elements $x_1, \ldots, x_r$. If all elements in $y$ are different from each other, then $y$ specifies a full order over $[r]$. However, if two elements of $y$ attain the same value, $y_i = y_j$ for $i \neq j$, then $y$ can only specify a partial order over $[r]$. In such a case, we say that $x_i$ and $x_j$ are of equal relevance according to $y$. In the extreme case, $y \in \{\pm 1\}^r$, which means that each $x_i$ is either relevant or nonrelevant. This setting is often called "bipartite ranking." For example, in the fraud detection application mentioned in the previous section, each transaction is labeled as either fraudulent ($y_i = 1$) or benign ($y_i = -1$).
                    Seemingly, we can solve the bipartite ranking problem by learning a binary clas-
                 sifier, applying it on each instance, and putting the positive ones at the top of the
                 ranked list. However, this may lead to poor results as the goal of a binary learner
                 is usually to minimize the zero-one loss (or some surrogate of it), while the goal of
                 a ranker might be significantly different. To illustrate this, consider again the prob-
                 lem of fraud detection. Usually, most of the transactions are benign (say 99.9%).
                 Therefore, a binary classifier that predicts “benign” on all transactions will have a
                 zero-one error of 0.1%. While this is a very small number, the resulting predictor is
                 meaningless for the fraud detection application. The crux of the problem stems from
                 the inadequacy of the zero-one loss for what we are really interested in. A more ade-
                 quate performance measure should take into account the predictions over the entire
                 set of instances. For example, in the previous section we have defined the NDCG
                 loss, which emphasizes the correctness of the top-ranked items. In this section we
                 describe additional loss functions that are specifically adequate for bipartite ranking
                 problems.
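To put numbers on the fraud detection example above, here is a small simulation (illustrative assumptions, not from the book): labels are drawn with roughly 0.1% fraudulent transactions, and the constant "benign" predictor is evaluated. Its zero-one error is about 0.001, yet it identifies none of the fraudulent transactions.

```python
# Illustration: on heavily imbalanced data, the trivial "always benign"
# classifier has a tiny zero-one error but catches no fraud.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# y = +1 for fraudulent, -1 for benign; roughly 0.1% positives (assumed rate).
y = np.where(rng.random(n) < 0.001, 1, -1)

y_hat = -np.ones(n)                       # always predict "benign"

zero_one_error = np.mean(y_hat != y)      # fraction of mistakes
fraud_detected = np.mean(y_hat[y == 1] == 1)  # fraction of frauds caught

print(f"zero-one error:  {zero_one_error:.4f}")   # about 0.001
print(f"frauds detected: {fraud_detected:.4f}")   # exactly 0.0
```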
As in the previous section, we are given a sequence of instances, $\bar{x} = (x_1, \ldots, x_r)$, and we predict a ranking vector $y' \in \mathbb{R}^r$. The feedback vector is $y \in \{\pm 1\}^r$. We define a loss that depends on $y'$ and $y$ and also depends on a threshold $\theta \in \mathbb{R}$. This threshold transforms the vector $y' \in \mathbb{R}^r$ into the vector $(\mathrm{sign}(y'_1 - \theta), \ldots, \mathrm{sign}(y'_r - \theta)) \in \{\pm 1\}^r$. Usually, the value of $\theta$ is set to be 0. However, as we will see, we sometimes set $\theta$ while taking into account additional constraints on the problem.
The loss functions we define below depend on the following four numbers:

$$
\begin{aligned}
\text{True positives:} \quad & a = |\{i : y_i = +1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = +1\}| \\
\text{False positives:} \quad & b = |\{i : y_i = -1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = +1\}| \\
\text{False negatives:} \quad & c = |\{i : y_i = +1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = -1\}| \\
\text{True negatives:} \quad & d = |\{i : y_i = -1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = -1\}|
\end{aligned}
\qquad (17.11)
$$
The recall (a.k.a. sensitivity) of a prediction vector is the fraction of true positives $y'$ "catches," namely, $\frac{a}{a+c}$. The precision is the fraction of correct predictions among