which cannot hold. We have thus shown that some permutation matrix, C_i, satisfies
⟨A, B⟩ = ⟨A, C_i⟩. But, since for every other permutation matrix C we have
⟨A, B⟩ ≤ ⟨A, C⟩, we conclude that C_i is an optimal solution of both Equation (17.9)
and Equation (17.10).
17.5 BIPARTITE RANKING AND MULTIVARIATE
PERFORMANCE MEASURES
In the previous section we described the problem of ranking. We used a vector
y ∈ ℝ^r for representing an order over the elements x_1, . . . , x_r. If all elements in y
are different from each other, then y specifies a full order over [r]. However, if two
elements of y attain the same value, y_i = y_j for i ≠ j, then y can only specify a partial
order over [r]. In such a case, we say that x_i and x_j are of equal relevance according
to y. In the extreme case, y ∈ {±1}^r, which means that each x_i is either relevant
or nonrelevant. This setting is often called "bipartite ranking." For example, in the
fraud detection application mentioned in the previous section, each transaction is
labeled as either fraudulent (y_i = 1) or benign (y_i = −1).
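To make the representation concrete, here is a small sketch (the vectors and values
are illustrative, not taken from the text): a real-valued y induces an order over [r],
ties make that order partial, and restricting y to ±1 values gives the bipartite case.

    import numpy as np

    # Illustrative values only: r = 5 instances.
    y = np.array([0.7, 0.2, 0.7, -1.3, 3.1])   # y in R^r: larger value = more relevant
    order = np.argsort(-y, kind="stable")      # indices from most to least relevant
    # order = [4, 0, 2, 1, 3]; here y_0 = y_2, so the order between x_0 and x_2
    # is unspecified (a partial order). In the bipartite case y is restricted to +-1:
    y_bipartite = np.array([+1, -1, -1, -1, +1])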
Seemingly, we can solve the bipartite ranking problem by learning a binary clas-
sifier, applying it on each instance, and putting the positive ones at the top of the
ranked list. However, this may lead to poor results as the goal of a binary learner
is usually to minimize the zero-one loss (or some surrogate of it), while the goal of
a ranker might be significantly different. To illustrate this, consider again the prob-
lem of fraud detection. Usually, most of the transactions are benign (say 99.9%).
Therefore, a binary classifier that predicts “benign” on all transactions will have a
zero-one error of 0.1%. While this is a very small number, the resulting predictor is
meaningless for the fraud detection application. The crux of the problem stems from
the inadequacy of the zero-one loss for what we are really interested in. A more ade-
quate performance measure should take into account the predictions over the entire
set of instances. For example, in the previous section we have defined the NDCG
loss, which emphasizes the correctness of the top-ranked items. In this section we
describe additional loss functions that are specifically adequate for bipartite ranking
problems.
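To put numbers on the fraud detection illustration above (the data below are
synthetic and only mirror the 99.9% proportion mentioned in the text), the constant
"benign" predictor attains a 0.1% zero-one error yet never flags a single fraudulent
transaction:

    import numpy as np

    # Synthetic data mirroring the text's proportions: 1 fraud per 1000 transactions.
    y = np.array([+1] + [-1] * 999)        # +1 = fraudulent, -1 = benign
    y_pred = np.full_like(y, -1)           # predictor that outputs "benign" everywhere

    zero_one_error = np.mean(y_pred != y)  # 0.001, i.e., 0.1%
    frauds_caught = int(np.sum((y == +1) & (y_pred == +1)))  # 0 -- useless in practice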
As in the previous section, we are given a sequence of instances, x̄ = (x_1, . . . , x_r),
and we predict a ranking vector y′ ∈ ℝ^r. The feedback vector is y ∈ {±1}^r. We define
a loss that depends on y′ and y and on a threshold θ ∈ ℝ. This threshold
transforms the vector y′ ∈ ℝ^r into the vector (sign(y′_1 − θ), . . . , sign(y′_r − θ)) ∈ {±1}^r.
Usually, the value of θ is set to be 0. However, as we will see, we sometimes set θ
while taking into account additional constraints on the problem.
The loss functions we define in the following depend on the following 4 numbers:
    True positives:   a = |{i : y_i = +1 ∧ sign(y′_i − θ) = +1}|
    False positives:  b = |{i : y_i = −1 ∧ sign(y′_i − θ) = +1}|
    False negatives:  c = |{i : y_i = +1 ∧ sign(y′_i − θ) = −1}|
    True negatives:   d = |{i : y_i = −1 ∧ sign(y′_i − θ) = −1}|          (17.11)
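A minimal computational sketch of these counts, assuming the feedback y is given as
an array of ±1 labels and the prediction y′ as an array of real scores (the function
name and the tie-breaking at the threshold are my own choices, not the book's):

    import numpy as np

    def confusion_counts(y_true, y_score, theta=0.0):
        """Compute (a, b, c, d) of Equation (17.11) from feedback y and prediction y'."""
        # sign(y'_i - theta); scores exactly at the threshold are mapped to -1 here,
        # a choice the text leaves unspecified.
        pred = np.where(y_score - theta > 0, +1, -1)
        a = int(np.sum((y_true == +1) & (pred == +1)))   # true positives
        b = int(np.sum((y_true == -1) & (pred == +1)))   # false positives
        c = int(np.sum((y_true == +1) & (pred == -1)))   # false negatives
        d = int(np.sum((y_true == -1) & (pred == -1)))   # true negatives
        return a, b, c, d

    # Example: the recall defined next is a / (a + c).
    a, b, c, d = confusion_counts(np.array([+1, -1, +1, -1]),
                                  np.array([0.9, 0.4, -0.2, -0.8]))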
The recall (a.k.a. sensitivity) of a prediction vector is the fraction of true positives
y′ "catches," namely, a/(a + c). The precision is the fraction of correct predictions among