Page 225 - Understanding Machine Learning
17.5 Bipartite Ranking and Multivariate Performance Measures
the positive labels we predict, namely, $\frac{a}{a+b}$. The specificity is the fraction of true negatives that our predictor "catches," namely, $\frac{d}{d+b}$.
Note that as we decrease θ the recall increases (attaining the value 1 when θ = −∞, since then every example is predicted positive). On the other hand, the precision and the specificity usually decrease as we decrease θ. Therefore, there is a tradeoff between precision and recall, and we can control it by changing θ. The loss functions defined in the following use various techniques for combining the precision and the recall.
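The tradeoff can be seen numerically. In this sketch the scores and labels are made up for illustration; an example is predicted positive when its score exceeds θ:

```python
# Hypothetical scores and binary labels, just to illustrate the theta tradeoff.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

def confusion(theta):
    # a = true positives, b = false positives, c = false negatives, d = true negatives
    a = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
    b = sum(1 for s, y in zip(scores, labels) if s > theta and y == 0)
    c = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 1)
    d = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 0)
    return a, b, c, d

for theta in (0.85, 0.5, 0.0):
    a, b, c, d = confusion(theta)
    print(theta, a / (a + c), a / (a + b))  # recall rises, precision falls
```

As θ drops from 0.85 to 0.0, recall climbs from 1/3 to 1 while precision falls from 1 to 1/2.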
Averaging sensitivity and specificity: This measure is the average of the sensitivity and the specificity, namely, $\frac{1}{2}\left(\frac{a}{a+c} + \frac{d}{d+b}\right)$. This is also the accuracy on positive examples averaged with the accuracy on negative examples. Here, we set θ = 0 and the corresponding loss function is $\Delta(\mathbf{y}', \mathbf{y}) = 1 - \frac{1}{2}\left(\frac{a}{a+c} + \frac{d}{d+b}\right)$.
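With the counts a, b, c, d as defined above, this loss is a one-liner; the counts in the example call are hypothetical:

```python
def balanced_loss(a, b, c, d):
    # 1 minus the average of sensitivity a/(a+c) and specificity d/(d+b)
    return 1 - 0.5 * (a / (a + c) + d / (d + b))

# Example: 8 true positives, 2 false positives, 2 false negatives, 8 true negatives:
# sensitivity = specificity = 0.8, so the loss is 0.2.
print(balanced_loss(8, 2, 2, 8))
```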
F1-score: The F1 score is the harmonic mean of the precision and the recall: $\frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$. Its maximal value (of 1) is obtained when both precision and recall are 1, and its minimal value (of 0) is obtained whenever one of them is 0 (even if the other one is 1). The F1 score can be written using the numbers a, b, c as follows: $F_1 = \frac{2a}{2a+b+c}$. Again, we set θ = 0, and the loss function becomes $\Delta(\mathbf{y}', \mathbf{y}) = 1 - F_1$.
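The two forms of the F1 score agree, which a quick check confirms (the counts a = 6, b = 2, c = 4 are arbitrary illustrative values):

```python
def f1(a, b, c):
    # F1 = 2a / (2a + b + c)
    return 2 * a / (2 * a + b + c)

def f1_from_pr(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 / (1 / precision + 1 / recall)

# a=6, b=2, c=4 gives precision = 6/8 and recall = 6/10; both forms yield 2/3.
print(f1(6, 2, 4))
print(f1_from_pr(6 / 8, 6 / 10))
```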
Fβ-score: It is like the F1 score, but we attach $\beta^2$ times more importance to recall than to precision, that is, $\frac{1+\beta^2}{\frac{1}{\text{Precision}} + \frac{\beta^2}{\text{Recall}}}$. It can also be written as $F_\beta = \frac{(1+\beta^2)a}{(1+\beta^2)a + b + \beta^2 c}$. Again, we set θ = 0, and the loss function becomes $\Delta(\mathbf{y}', \mathbf{y}) = 1 - F_\beta$.
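A small sketch of the closed form (the counts are again the illustrative a = 6, b = 2, c = 4 from above, not data from the text):

```python
def f_beta(a, b, c, beta):
    # F_beta = (1 + beta^2) a / ((1 + beta^2) a + b + beta^2 c)
    b2 = beta * beta
    return (1 + b2) * a / ((1 + b2) * a + b + b2 * c)

# beta = 1 recovers F1 = 2/3; beta = 2 weighs recall more heavily, and since
# recall (0.6) is lower than precision (0.75), the score drops to 30/48 = 0.625.
print(f_beta(6, 2, 4, 1))
print(f_beta(6, 2, 4, 2))
```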
Recall at k: We measure the recall subject to the constraint that the prediction contains at most k positive labels; that is, we set θ so that a + b ≤ k. This is convenient, for example, in a fraud detection application, where a bank employee can only handle a small number of suspicious transactions.
Precision at k: We measure the precision subject to the constraint that the prediction contains at least k positive labels; that is, we set θ so that a + b ≥ k.
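Setting θ so that a + b = k amounts to predicting positive exactly on the k highest-scoring examples. A minimal sketch with made-up scores and labels:

```python
def precision_recall_at_k(scores, labels, k):
    # Choose theta so that a + b = k: predict positive on the k top-scoring examples.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = set(order[:k])
    a = sum(1 for i in predicted if labels[i] == 1)  # true positives among the top k
    total_pos = sum(labels)
    return a / k, a / total_pos                      # precision@k, recall@k

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 0, 1, 1, 0]
# Top 2 contains one of the three positives: precision@2 = 1/2, recall@2 = 1/3.
print(precision_recall_at_k(scores, labels, 2))
```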
The measures defined previously are often referred to as multivariate performance measures. Note that these measures are very different from the average zero-one loss, which in the preceding notation equals $\frac{b+c}{a+b+c+d}$. In the aforementioned example of fraud detection, when 99.9% of the examples are negatively labeled, the zero-one loss of predicting that all the examples are negative is 0.1%. In contrast, the recall of such a prediction is 0 and hence the F1 score is also 0, which means that the corresponding loss will be 1.
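This contrast can be checked numerically with synthetic counts (1 positive among 1000 examples, and a predictor that declares everything negative):

```python
# All-negative predictor on 1 positive and 999 negatives:
# no predicted positives, one missed fraud.
a, b, c, d = 0, 0, 1, 999

zero_one = (b + c) / (a + b + c + d)  # fraction of mistakes
recall = a / (a + c)
f1 = 2 * a / (2 * a + b + c)

print(zero_one)      # 0.001, i.e. a 0.1% zero-one loss
print(recall, f1)    # both 0, so the F1-based loss 1 - F1 equals 1
```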
17.5.1 Linear Predictors for Bipartite Ranking
We next describe how to train linear predictors for bipartite ranking. As in the previous section, a linear predictor for ranking is defined to be
$h_{\mathbf{w}}(\bar{x}) = (\langle \mathbf{w}, \mathbf{x}_1 \rangle, \ldots, \langle \mathbf{w}, \mathbf{x}_r \rangle).$
The corresponding loss function is one of the multivariate performance measures described before. The loss function depends on $\mathbf{y}' = h_{\mathbf{w}}(\bar{x})$ via the binary vector it