
which cannot hold. We have thus shown that some permutation matrix, $C_i$, satisfies $\langle A, B\rangle = \langle A, C_i\rangle$. But, since for every other permutation matrix $C$ we have $\langle A, B\rangle \le \langle A, C\rangle$, we conclude that $C_i$ is an optimal solution of both Equation (17.9) and Equation (17.10).
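As a side remark (not part of the book's argument), this claim can be checked numerically: the permutation matrices are exactly the extreme points of the set of doubly stochastic matrices (the Birkhoff-von Neumann theorem), so a linear objective $\langle A, \cdot\rangle$ attains its optimum over that set at a permutation matrix. The sketch below is only illustrative; the matrix size, the random choice of $A$, and the use of scipy.optimize.linprog are assumptions made here, and minimizing versus maximizing is the same argument applied to $-A$.

```python
# Minimal numerical check: optimize a linear objective <A, B> over the
# doubly stochastic polytope and observe that, for a generic A, the
# optimum is a permutation matrix.
import numpy as np
from scipy.optimize import linprog

r = 5
rng = np.random.default_rng(0)
A = rng.standard_normal((r, r))          # generic objective matrix (assumption)

# Equality constraints: every row of B sums to 1 and every column sums to 1.
A_eq = np.zeros((2 * r, r * r))
for i in range(r):
    A_eq[i, i * r:(i + 1) * r] = 1.0     # row-sum constraints
    A_eq[r + i, i::r] = 1.0              # column-sum constraints
b_eq = np.ones(2 * r)

# linprog minimizes <A, B> subject to B >= 0 and the sum constraints.
res = linprog(c=A.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (r * r))
B = res.x.reshape(r, r)

# For a generic A the optimum is unique, so B is (numerically) a 0/1 matrix
# with a single 1 in every row and column, i.e., a permutation matrix.
print(np.round(B, 3))
```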


                 17.5 BIPARTITE RANKING AND MULTIVARIATE
                 PERFORMANCE MEASURES

In the previous section we described the problem of ranking. We used a vector $y \in \mathbb{R}^r$ for representing an order over the elements $x_1, \ldots, x_r$. If all elements in $y$ are different from each other, then $y$ specifies a full order over $[r]$. However, if two elements of $y$ attain the same value, $y_i = y_j$ for $i \neq j$, then $y$ can only specify a partial order over $[r]$. In such a case, we say that $x_i$ and $x_j$ are of equal relevance according to $y$. In the extreme case, $y \in \{\pm 1\}^r$, which means that each $x_i$ is either relevant or nonrelevant. This setting is often called "bipartite ranking." For example, in the fraud detection application mentioned in the previous section, each transaction is labeled as either fraudulent ($y_i = 1$) or benign ($y_i = -1$).
                    Seemingly, we can solve the bipartite ranking problem by learning a binary clas-
                 sifier, applying it on each instance, and putting the positive ones at the top of the
                 ranked list. However, this may lead to poor results as the goal of a binary learner
                 is usually to minimize the zero-one loss (or some surrogate of it), while the goal of
                 a ranker might be significantly different. To illustrate this, consider again the prob-
                 lem of fraud detection. Usually, most of the transactions are benign (say 99.9%).
                 Therefore, a binary classifier that predicts “benign” on all transactions will have a
                 zero-one error of 0.1%. While this is a very small number, the resulting predictor is
                 meaningless for the fraud detection application. The crux of the problem stems from
                 the inadequacy of the zero-one loss for what we are really interested in. A more ade-
                 quate performance measure should take into account the predictions over the entire
                 set of instances. For example, in the previous section we have defined the NDCG
                 loss, which emphasizes the correctness of the top-ranked items. In this section we
                 describe additional loss functions that are specifically adequate for bipartite ranking
                 problems.
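To put numbers on the fraud detection example above, here is a small simulation (illustrative assumptions, not from the book): labels are drawn with roughly 0.1% fraudulent transactions, and the constant "benign" predictor is evaluated. Its zero-one error is about 0.001, yet it identifies none of the fraudulent transactions.

```python
# Illustration: on heavily imbalanced data, the trivial "always benign"
# classifier has a tiny zero-one error but catches no fraud.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# y = +1 for fraudulent, -1 for benign; roughly 0.1% positives (assumed rate).
y = np.where(rng.random(n) < 0.001, 1, -1)

y_hat = -np.ones(n)                       # always predict "benign"

zero_one_error = np.mean(y_hat != y)      # fraction of mistakes
fraud_detected = np.mean(y_hat[y == 1] == 1)  # fraction of frauds caught

print(f"zero-one error:  {zero_one_error:.4f}")   # about 0.001
print(f"frauds detected: {fraud_detected:.4f}")   # exactly 0.0
```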
As in the previous section, we are given a sequence of instances, $\bar{x} = (x_1, \ldots, x_r)$, and we predict a ranking vector $y' \in \mathbb{R}^r$. The feedback vector is $y \in \{\pm 1\}^r$. We define a loss that depends on $y'$ and $y$ and also depends on a threshold $\theta \in \mathbb{R}$. This threshold transforms the vector $y' \in \mathbb{R}^r$ into the vector $(\mathrm{sign}(y'_1 - \theta), \ldots, \mathrm{sign}(y'_r - \theta)) \in \{\pm 1\}^r$. Usually, the value of $\theta$ is set to be 0. However, as we will see, we sometimes set $\theta$ while taking into account additional constraints on the problem.
The loss functions we define below depend on the following four numbers:

$$
\begin{aligned}
\text{True positives:} \quad & a = |\{i : y_i = +1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = +1\}| \\
\text{False positives:} \quad & b = |\{i : y_i = -1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = +1\}| \\
\text{False negatives:} \quad & c = |\{i : y_i = +1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = -1\}| \\
\text{True negatives:} \quad & d = |\{i : y_i = -1 \,\wedge\, \mathrm{sign}(y'_i - \theta) = -1\}|
\end{aligned}
\qquad (17.11)
$$
The recall (a.k.a. sensitivity) of a prediction vector is the fraction of true positives $y'$ "catches," namely, $\frac{a}{a+c}$. The precision is the fraction of correct predictions among