function
$$h_w((x_1,\ldots,x_r)) = (\langle w, x_1\rangle, \ldots, \langle w, x_r\rangle). \tag{17.6}$$
As we discussed in Chapter 16, we can also apply a feature mapping that maps
instances into some feature space and then takes the inner products with w in the
feature space. For simplicity, we focus on the simpler form as in Equation (17.6).
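For concreteness, here is a minimal NumPy sketch of the linear ranking hypothesis in Equation (17.6); the function name and the toy data are ours, not from the text:

```python
import numpy as np

def linear_rank_scores(w, xs):
    """Linear ranking hypothesis of Equation (17.6): map the instance
    tuple (x_1, ..., x_r) to the score vector (<w,x_1>, ..., <w,x_r>)."""
    return xs @ w  # one inner product per instance

# Toy usage: three instances in R^2 scored by a single weight vector.
w = np.array([1.0, -0.5])
xs = np.array([[1.0, 0.0],
               [0.0, 2.0],
               [2.0, 1.0]])
print(linear_rank_scores(w, xs))  # [ 1.  -1.   1.5]
```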
Given some $W \subseteq \mathbb{R}^d$, we can now define the hypothesis class $H_W = \{h_w : w \in W\}$. Once we have defined this hypothesis class, and have chosen a ranking loss function, we can apply the ERM rule as follows: Given a training set, $S = (\bar{x}_1, y_1), \ldots, (\bar{x}_m, y_m)$, where each $(\bar{x}_i, y_i)$ is in $(\mathcal{X}^{r_i} \times \mathbb{R}^{r_i})$, for some $r_i \in \mathbb{N}$, we should search for $w \in W$ that minimizes the empirical loss, $\sum_{i=1}^m \Delta(h_w(\bar{x}_i), y_i)$. As in the case of binary classification, for many loss functions this problem is computationally hard, and we therefore
turn to describe convex surrogate loss functions. We describe the surrogates for the
Kendall tau loss and for the NDCG loss.
A Hinge Loss for the Kendall Tau Loss Function:
We can think of the Kendall tau loss as an average of 0−1 losses for each pair. In
particular, for every (i, j) we can rewrite
$$\mathbb{1}_{[\operatorname{sign}(y'_i - y'_j) \neq \operatorname{sign}(y_i - y_j)]} = \mathbb{1}_{[\operatorname{sign}(y_i - y_j)(y'_i - y'_j) \le 0]}.$$

In our case, $y'_i - y'_j = \langle w, x_i - x_j\rangle$. It follows that we can use the hinge loss upper bound as follows:

$$\mathbb{1}_{[\operatorname{sign}(y_i - y_j)(y'_i - y'_j) \le 0]} \le \max\left\{0,\; 1 - \operatorname{sign}(y_i - y_j)\langle w, x_i - x_j\rangle\right\}.$$
Taking the average over the pairs we obtain the following surrogate convex loss for
the Kendall tau loss function:
$$\Delta(h_w(\bar{x}), y) \;\le\; \frac{2}{r(r-1)} \sum_{i=1}^{r-1} \sum_{j=i+1}^{r} \max\left\{0,\; 1 - \operatorname{sign}(y_i - y_j)\langle w, x_i - x_j\rangle\right\}.$$
The right-hand side is convex with respect to w and upper bounds the Kendall tau
loss. It is also a $\rho$-Lipschitz function with parameter $\rho \le \max_{i,j} \|x_i - x_j\|$.
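As a numerical sanity check, the following NumPy sketch (function names and toy data are ours) computes both the Kendall tau loss and the hinge surrogate above, and verifies the upper bound on a small example:

```python
import numpy as np

def kendall_tau_loss(y_pred, y):
    """0-1 Kendall tau loss: the fraction of pairs (i, j) whose
    predicted order disagrees with (or ties against) the true order."""
    r = len(y)
    errs = [np.sign(y[i] - y[j]) * (y_pred[i] - y_pred[j]) <= 0
            for i in range(r - 1) for j in range(i + 1, r)]
    return float(np.mean(errs))

def kendall_hinge_surrogate(w, xs, y):
    """Convex surrogate: the average over pairs of
    max{0, 1 - sign(y_i - y_j) <w, x_i - x_j>}."""
    r = len(y)
    total = 0.0
    for i in range(r - 1):
        for j in range(i + 1, r):
            margin = np.sign(y[i] - y[j]) * np.dot(w, xs[i] - xs[j])
            total += max(0.0, 1.0 - margin)
    return 2.0 * total / (r * (r - 1))

# The surrogate upper bounds the loss and is convex in w.
w = np.array([1.0, -0.5])
xs = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
y = np.array([3.0, 1.0, 2.0])
y_pred = xs @ w
assert kendall_tau_loss(y_pred, y) <= kendall_hinge_surrogate(w, xs, y)
```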
A Hinge Loss for the NDCG Loss Function:
The NDCG loss function depends on the predicted ranking vector $y' \in \mathbb{R}^r$ via the permutation it induces. To derive a surrogate loss function we first make the following observation. Let $V$ be the set of all permutations of $[r]$ encoded as vectors; namely, each $v \in V$ is a vector in $[r]^r$ such that for all $i \neq j$ we have $v_i \neq v_j$. Then
(see Exercise 17.4),
$$\pi(y') = \operatorname*{argmax}_{v \in V} \sum_{i=1}^{r} v_i y'_i. \tag{17.7}$$
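Equation (17.7) can be verified numerically for small $r$: by the rearrangement inequality, the sum $\sum_i v_i y'_i$ is maximized by assigning the largest rank to the largest score, which is exactly the permutation induced by $y'$. The following sketch (names and the tie-breaking convention are ours) compares a brute-force argmax over all permutation vectors with the sorting-based construction:

```python
import itertools
import numpy as np

def induced_permutation(y_pred):
    """Permutation vector induced by the scores y': under the convention
    of Equation (17.7), the largest score receives the largest rank
    (ties broken arbitrarily by argsort)."""
    r = len(y_pred)
    v = np.empty(r, dtype=int)
    v[np.argsort(y_pred)] = np.arange(1, r + 1)
    return tuple(v)

def brute_force_argmax(y_pred):
    """Maximize sum_i v_i * y'_i over all permutation vectors v of [r]."""
    r = len(y_pred)
    return max(itertools.permutations(range(1, r + 1)),
               key=lambda v: sum(vi * yi for vi, yi in zip(v, y_pred)))

y_pred = np.array([0.3, -1.2, 2.0, 0.7])
assert induced_permutation(y_pred) == brute_force_argmax(y_pred)  # (2, 1, 4, 3)
```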