induces, which we denote by
$$b(y) = \big(\operatorname{sign}(y_1 - \theta), \ldots, \operatorname{sign}(y_r - \theta)\big) \in \{\pm 1\}^r. \tag{17.12}$$
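As a concrete illustration (not from the text), the thresholding in Equation (17.12) can be sketched in a few lines of NumPy; the function name and the tie-breaking at $y_i = \theta$ are our own choices:

```python
import numpy as np

def b_threshold(y, theta=0.0):
    """Sketch of Eq. (17.12): threshold r real scores into a vector in {+1, -1}^r.
    Mapping the boundary case y_i == theta to +1 is an arbitrary choice."""
    return np.where(y - theta >= 0, 1, -1)
```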
As in the previous section, to facilitate an efficient algorithm we derive a convex surrogate loss function on $\Delta$. The derivation is similar to the derivation of the generalized hinge loss for the NDCG ranking loss, as described in the previous section.
Our first observation is that for all the values of $\theta$ defined before, there is some $V \subseteq \{\pm 1\}^r$ such that $b(y)$ can be rewritten as
$$b(y) = \operatorname*{argmax}_{v \in V} \sum_{i=1}^{r} v_i\, y_i. \tag{17.13}$$
This is clearly true for the case $\theta = 0$ if we choose $V = \{\pm 1\}^r$. The two measures for which $\theta$ is not taken to be 0 are precision at $k$ and recall at $k$. For precision at $k$ we can take $V$ to be the set $V_{\geq k}$, containing all vectors in $\{\pm 1\}^r$ whose number of ones is at least $k$. For recall at $k$, we can take $V$ to be $V_{\leq k}$, which is defined analogously. See Exercise 17.5.
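To see Equation (17.13) in action, the following brute-force sketch (illustrative names, feasible only for small $r$) checks that for $\theta = 0$ and $V = \{\pm 1\}^r$ the argmax recovers coordinate-wise thresholding, and constructs the set $V_{\geq k}$ used for precision at $k$:

```python
import itertools
import numpy as np

def b_argmax(y, V):
    """Eq. (17.13): the vector v in V maximizing sum_i v_i * y_i."""
    return max(V, key=lambda v: np.dot(v, y))

r, k = 4, 2
y = np.array([0.3, -1.2, 0.7, -0.1])

# theta = 0 with V = {+-1}^r recovers coordinate-wise sign thresholding
V_all = [np.array(v) for v in itertools.product([1, -1], repeat=r)]
assert np.array_equal(b_argmax(y, V_all), np.where(y >= 0, 1, -1))

# precision at k: V_{>=k} contains the vectors with at least k ones
V_geq_k = [v for v in V_all if np.sum(v == 1) >= k]
print(b_argmax(y, V_geq_k))  # maximizer subject to having at least k ones
```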
Once we have defined $b$ as in Equation (17.13), we can easily derive a convex surrogate loss as follows. Assuming that $y \in V$, we have that
$$\begin{aligned}
\Delta(h_w(\bar{x}), y) &= \Delta(b(h_w(\bar{x})), y) \\
&\le \Delta(b(h_w(\bar{x})), y) + \sum_{i=1}^{r} \big(b_i(h_w(\bar{x})) - y_i\big)\, \langle w, x_i \rangle \\
&\le \max_{v \in V} \left[ \Delta(v, y) + \sum_{i=1}^{r} (v_i - y_i)\, \langle w, x_i \rangle \right]. \tag{17.14}
\end{aligned}$$
The right-hand side is a convex function with respect to $w$.
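For small $r$, the right-hand side of Equation (17.14) can be evaluated by direct enumeration. The sketch below is illustrative only: `delta_fn` stands for any multivariate loss $\Delta$, and enumerating $V$ is exponential in $r$ (finding the maximizer efficiently is the subject of the rest of this section):

```python
import numpy as np

def surrogate_loss(w, X, y, V, delta_fn):
    """Brute-force value of the bound in Eq. (17.14):
    max over v in V of  Delta(v, y) + sum_i (v_i - y_i) * <w, x_i>."""
    scores = X @ w  # scores[i] = <w, x_i>
    return max(delta_fn(v, y) + np.dot(v - y, scores) for v in V)
```

Note that for $w = 0$ all scores vanish, so the value reduces to $\max_{v \in V} \Delta(v, y)$, the loss of the most-violating prediction.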
We can now solve the learning problem using SGD as described in Section 17.2.5. The main computational bottleneck is calculating a subgradient of the loss function, which is equivalent to finding $v$ that achieves the maximum in Equation (17.14) (see Claim 14.6).
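Concretely, if $v^\star$ attains the maximum in Equation (17.14), then $\sum_{i=1}^{r} (v^\star_i - y_i)\, x_i$ is a subgradient at $w$. A minimal sketch of one SGD step, again with a brute-force maximizer and illustrative names:

```python
import numpy as np

def sgd_step(w, X, y, V, delta_fn, eta):
    """One SGD step on the surrogate of Eq. (17.14)."""
    scores = X @ w
    # v* attaining the max yields a subgradient (cf. Claim 14.6)
    v_star = max(V, key=lambda v: delta_fn(v, y) + np.dot(v - y, scores))
    subgrad = X.T @ (v_star - y)  # = sum_i (v*_i - y_i) x_i
    return w - eta * subgrad
```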
In the following we describe how to find this maximizer efficiently for any performance measure that can be written as a function of the numbers $a, b, c, d$ given in Equation (17.11), and for which the set $V$ contains all elements in $\{\pm 1\}^r$ for which the values of $a, b$ satisfy some constraints. For example, for “recall at $k$” the set $V$ is all vectors for which $a + b \leq k$.
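For instance, the counts of Equation (17.11) and the constraint defining $V$ for “recall at $k$” can be sketched as follows; here we assume, as is standard, that $c$ and $d$ are the complementary counts (false negatives and true negatives), and the brute-force enumeration is for illustration only:

```python
import itertools
import numpy as np

def abcd(v, y):
    """The counts of Eq. (17.11): true positives, false positives,
    false negatives, and true negatives."""
    a = int(np.sum((v == 1) & (y == 1)))
    b = int(np.sum((v == 1) & (y == -1)))
    c = int(np.sum((v == -1) & (y == 1)))
    d = int(np.sum((v == -1) & (y == -1)))
    return a, b, c, d

def V_recall_at_k(r, k):
    """V for "recall at k": all v in {+-1}^r with a + b <= k, i.e. at most k ones."""
    all_v = (np.array(v) for v in itertools.product([1, -1], repeat=r))
    return [v for v in all_v if np.sum(v == 1) <= k]
```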
The idea is as follows. For any $a, b \in [r]$, let
$$\bar{Y}_{a,b} = \big\{ v : |\{ i : v_i = 1 \wedge y_i = 1 \}| = a \;\wedge\; |\{ i : v_i = 1 \wedge y_i = -1 \}| = b \big\}.$$
Any vector $v \in V$ falls into $\bar{Y}_{a,b}$ for some $a, b \in [r]$. Furthermore, if $\bar{Y}_{a,b} \cap V$ is not empty for some $a, b \in [r]$, then $\bar{Y}_{a,b} \cap V = \bar{Y}_{a,b}$. Therefore, we can search within each $\bar{Y}_{a,b}$ that has a nonempty intersection with $V$ separately, and then take the optimal value. The key observation is that once we are searching only within $\bar{Y}_{a,b}$, the value