induces, which we denote by
$$b(y) = \big(\operatorname{sign}(y_1 - \theta), \ldots, \operatorname{sign}(y_r - \theta)\big) \in \{\pm 1\}^r. \qquad (17.12)$$
As in the previous section, to facilitate an efficient algorithm we derive a convex surrogate loss function on $\Delta$. The derivation is similar to the derivation of the generalized hinge loss for the NDCG ranking loss, as described in the previous section.
Our first observation is that for all the values of $\theta$ defined before, there is some $V \subseteq \{\pm 1\}^r$ such that $b(y)$ can be rewritten as
$$b(y) = \operatorname*{argmax}_{v \in V} \sum_{i=1}^{r} v_i\, y_i. \qquad (17.13)$$
This is clearly true for the case $\theta = 0$ if we choose $V = \{\pm 1\}^r$. The two measures for which $\theta$ is not taken to be 0 are precision at $k$ and recall at $k$. For precision at $k$ we can take $V$ to be the set $V_{\ge k}$, containing all vectors in $\{\pm 1\}^r$ whose number of ones is at least $k$. For recall at $k$, we can take $V$ to be $V_{\le k}$, which is defined analogously. See Exercise 17.5.
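To see Equation (17.13) in action, here is a small brute-force check (a sketch in Python, not part of the text; the function names and the random data are mine): for $\theta = 0$ the argmax over $V = \{\pm 1\}^r$ recovers coordinate-wise sign thresholding, while for a set such as $V_{\le k}$ the maximizer places its ones only on (at most $k$ of) the largest positive scores.

```python
import itertools
import numpy as np

def b_threshold(y_scores, theta=0.0):
    # Eq. (17.12): coordinate-wise sign(y_i - theta), with ties sent to -1.
    return np.where(y_scores - theta > 0, 1, -1)

def b_argmax(y_scores, V):
    # Eq. (17.13): the vector v in V maximizing sum_i v_i * y_i.
    return max(V, key=lambda v: float(np.dot(v, y_scores)))

r, k = 5, 2
rng = np.random.default_rng(0)
y_scores = rng.normal(size=r)

# theta = 0: V is all of {+-1}^r and the argmax equals sign thresholding.
V_all = [np.array(v) for v in itertools.product([-1, 1], repeat=r)]
assert np.array_equal(b_argmax(y_scores, V_all), b_threshold(y_scores))

# A "recall at k" style set (V_{<=k} in the text): at most k ones allowed.
V_le_k = [v for v in V_all if np.sum(v == 1) <= k]
print(b_argmax(y_scores, V_le_k))
```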
Once we have defined $b$ as in Equation (17.13), we can easily derive a convex surrogate loss as follows. Assuming that $y \in V$, we have that
$$\begin{aligned}
\Delta(h_{\mathbf w}(\bar x), y) &= \Delta\big(b(h_{\mathbf w}(\bar x)), y\big) \\
&\le \Delta\big(b(h_{\mathbf w}(\bar x)), y\big) + \sum_{i=1}^{r} \big(b_i(h_{\mathbf w}(\bar x)) - y_i\big)\,\langle \mathbf w, \mathbf x_i\rangle \\
&\le \max_{v \in V}\left[\Delta(v, y) + \sum_{i=1}^{r} (v_i - y_i)\,\langle \mathbf w, \mathbf x_i\rangle\right]. \qquad (17.14)
\end{aligned}$$
The right-hand side is a convex function with respect to $\mathbf w$.
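For small $r$, the right-hand side of Equation (17.14) can be evaluated by brute force over $V$, which also lets one check numerically that it upper-bounds the loss of the induced predictions. The sketch below assumes $\theta = 0$ and a placeholder $\Delta$; none of the names come from the text.

```python
import itertools
import numpy as np

def surrogate(w, X, y, V, Delta):
    # Right-hand side of Eq. (17.14):
    #   max over v in V of  Delta(v, y) + sum_i (v_i - y_i) * <w, x_i>.
    scores = X @ w
    return max(Delta(v, y) + float(np.dot(v - y, scores)) for v in V)

def Delta(v, y):
    # Placeholder performance-based loss on binary vectors (not from the text).
    return float(np.any(v != y))

r, d = 4, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(r, d))      # x_1, ..., x_r as rows
w = rng.normal(size=d)
y = np.array([1, -1, 1, -1])     # true binary labels, assumed to lie in V

V = [np.array(v) for v in itertools.product([-1, 1], repeat=r)]
b = np.where(X @ w > 0, 1, -1)   # b(h_w(x-bar)) with theta = 0

# The surrogate upper-bounds the loss of the induced predictions.
assert surrogate(w, X, y, V, Delta) >= Delta(b, y)
```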
                    We can now solve the learning problem using SGD as described in
                 Section 17.2.5. The main computational bottleneck is calculating a subgradient of
                 the loss function, which is equivalent to finding v that achieves the maximum in
                 Equation (17.14) (see Claim 14.6).
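In code, a single SGD step on the surrogate might look as follows: the maximizer $v^*$ of Equation (17.14) yields the subgradient $\sum_i (v^*_i - y_i)\,\mathbf x_i$. This is only a sketch; `argmax_v` stands for whatever routine computes the maximizer (brute force for small $r$, or the efficient search described next), and the step size $\eta$ is assumed given.

```python
import numpy as np

def sgd_step(w, X, y, argmax_v, Delta, eta):
    # One SGD step on the surrogate of Eq. (17.14).
    # argmax_v(scores, y, Delta) is assumed to return the v in V attaining
    #   max_v [ Delta(v, y) + sum_i (v_i - y_i) * <w, x_i> ].
    scores = X @ w                       # <w, x_i> for i = 1, ..., r
    v_star = argmax_v(scores, y, Delta)  # maximizer of Eq. (17.14)
    subgrad = (v_star - y) @ X           # sum_i (v*_i - y_i) * x_i
    return w - eta * subgrad
```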
In the following we describe how to find this maximizer efficiently for any performance measure that can be written as a function of the numbers $a, b, c, d$ given in Equation (17.11), and for which the set $V$ contains all elements in $\{\pm 1\}^r$ for which the values of $a, b$ satisfy some constraints. For example, for "recall at $k$" the set $V$ is all vectors for which $a + b \le k$.
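For concreteness, the counts $a, b, c, d$ of Equation (17.11) and a membership test such as $a + b \le k$ for "recall at $k$" are straightforward to compute (a small sketch; the helper names are mine):

```python
import numpy as np

def confusion_counts(v, y):
    # The four counts of Eq. (17.11) for a prediction v and true labels y.
    a = int(np.sum((v == 1) & (y == 1)))    # predicted +1, label +1
    b = int(np.sum((v == 1) & (y == -1)))   # predicted +1, label -1
    c = int(np.sum((v == -1) & (y == 1)))   # predicted -1, label +1
    d = int(np.sum((v == -1) & (y == -1)))  # predicted -1, label -1
    return a, b, c, d

def in_V_recall_at_k(v, y, k):
    # Membership in V for "recall at k": at most k predicted positives.
    a, b, _, _ = confusion_counts(v, y)
    return a + b <= k
```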
The idea is as follows. For any $\bar a, \bar b \in [r]$, let
$$\bar Y_{\bar a,\bar b} = \big\{ v : |\{i : v_i = 1 \wedge y_i = 1\}| = \bar a \ \wedge\ |\{i : v_i = 1 \wedge y_i = -1\}| = \bar b \big\}.$$
Any vector $v \in V$ falls into $\bar Y_{\bar a,\bar b}$ for some $\bar a, \bar b \in [r]$. Furthermore, if $\bar Y_{\bar a,\bar b} \cap V$ is not empty for some $\bar a, \bar b \in [r]$, then $\bar Y_{\bar a,\bar b} \cap V = \bar Y_{\bar a,\bar b}$. Therefore, we can search within each $\bar Y_{\bar a,\bar b}$ that has a nonempty intersection with $V$ separately, and then take the optimal value. The key observation is that once we are searching only within $\bar Y_{\bar a,\bar b}$, the value