induces, which we denote by
$$b(y) = \big(\operatorname{sign}(y_1 - \theta), \ldots, \operatorname{sign}(y_r - \theta)\big) \in \{\pm 1\}^r. \qquad (17.12)$$
As in the previous section, to facilitate an efficient algorithm we derive a convex surrogate loss function on $\Delta$. The derivation is similar to the derivation of the generalized hinge loss for the NDCG ranking loss, as described in the previous section.
Our first observation is that for all the values of $\theta$ defined before, there is some $V \subseteq \{\pm 1\}^r$ such that $b(y)$ can be rewritten as
$$b(y) = \operatorname*{argmax}_{v \in V} \sum_{i=1}^{r} v_i\, y_i. \qquad (17.13)$$
This is clearly true for the case $\theta = 0$ if we choose $V = \{\pm 1\}^r$. The two measures for which $\theta$ is not taken to be 0 are precision at $k$ and recall at $k$. For precision at $k$ we can take $V$ to be the set $V_{\ge k}$, containing all vectors in $\{\pm 1\}^r$ whose number of ones is at least $k$. For recall at $k$, we can take $V$ to be $V_{\le k}$, which is defined analogously. See Exercise 17.5.
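To see Equation (17.13) in action, here is a small brute-force check (a sketch in Python, not part of the text; the function names and the random data are mine): for $\theta = 0$ the argmax over $V = \{\pm 1\}^r$ recovers coordinate-wise sign thresholding, while for a set such as $V_{\le k}$ the maximizer places its ones only on (at most $k$ of) the largest positive scores.

```python
import itertools
import numpy as np

def b_threshold(y_scores, theta=0.0):
    # Eq. (17.12): coordinate-wise sign(y_i - theta), with ties sent to -1.
    return np.where(y_scores - theta > 0, 1, -1)

def b_argmax(y_scores, V):
    # Eq. (17.13): the vector v in V maximizing sum_i v_i * y_i.
    return max(V, key=lambda v: float(np.dot(v, y_scores)))

r, k = 5, 2
rng = np.random.default_rng(0)
y_scores = rng.normal(size=r)

# theta = 0: V is all of {+-1}^r and the argmax equals sign thresholding.
V_all = [np.array(v) for v in itertools.product([-1, 1], repeat=r)]
assert np.array_equal(b_argmax(y_scores, V_all), b_threshold(y_scores))

# A "recall at k" style set (V_{<=k} in the text): at most k ones allowed.
V_le_k = [v for v in V_all if np.sum(v == 1) <= k]
print(b_argmax(y_scores, V_le_k))
```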
Once we have defined $b$ as in Equation (17.13), we can easily derive a convex surrogate loss as follows. Assuming that $y \in V$, we have that
$$\begin{aligned}
\Delta(h_{\mathbf w}(\bar x), y) &= \Delta\big(b(h_{\mathbf w}(\bar x)), y\big) \\
&\le \Delta\big(b(h_{\mathbf w}(\bar x)), y\big) + \sum_{i=1}^{r} \big(b_i(h_{\mathbf w}(\bar x)) - y_i\big)\,\langle \mathbf w, \mathbf x_i\rangle \\
&\le \max_{v \in V}\left[\Delta(v, y) + \sum_{i=1}^{r} (v_i - y_i)\,\langle \mathbf w, \mathbf x_i\rangle\right]. \qquad (17.14)
\end{aligned}$$
The right-hand side is a convex function with respect to $\mathbf w$.
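For small $r$, the right-hand side of Equation (17.14) can be evaluated by brute force over $V$, which also lets one check numerically that it upper-bounds the loss of the induced predictions. The sketch below assumes $\theta = 0$ and a placeholder $\Delta$; none of the names come from the text.

```python
import itertools
import numpy as np

def surrogate(w, X, y, V, Delta):
    # Right-hand side of Eq. (17.14):
    #   max over v in V of  Delta(v, y) + sum_i (v_i - y_i) * <w, x_i>.
    scores = X @ w
    return max(Delta(v, y) + float(np.dot(v - y, scores)) for v in V)

def Delta(v, y):
    # Placeholder performance-based loss on binary vectors (not from the text).
    return float(np.any(v != y))

r, d = 4, 3
rng = np.random.default_rng(1)
X = rng.normal(size=(r, d))      # x_1, ..., x_r as rows
w = rng.normal(size=d)
y = np.array([1, -1, 1, -1])     # true binary labels, assumed to lie in V

V = [np.array(v) for v in itertools.product([-1, 1], repeat=r)]
b = np.where(X @ w > 0, 1, -1)   # b(h_w(x-bar)) with theta = 0

# The surrogate upper-bounds the loss of the induced predictions.
assert surrogate(w, X, y, V, Delta) >= Delta(b, y)
```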
                    We can now solve the learning problem using SGD as described in
                 Section 17.2.5. The main computational bottleneck is calculating a subgradient of
                 the loss function, which is equivalent to finding v that achieves the maximum in
                 Equation (17.14) (see Claim 14.6).
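In code, a single SGD step on the surrogate might look as follows: the maximizer $v^*$ of Equation (17.14) yields the subgradient $\sum_i (v^*_i - y_i)\,\mathbf x_i$. This is only a sketch; `argmax_v` stands for whatever routine computes the maximizer (brute force for small $r$, or the efficient search described next), and the step size $\eta$ is assumed given.

```python
import numpy as np

def sgd_step(w, X, y, argmax_v, Delta, eta):
    # One SGD step on the surrogate of Eq. (17.14).
    # argmax_v(scores, y, Delta) is assumed to return the v in V attaining
    #   max_v [ Delta(v, y) + sum_i (v_i - y_i) * <w, x_i> ].
    scores = X @ w                       # <w, x_i> for i = 1, ..., r
    v_star = argmax_v(scores, y, Delta)  # maximizer of Eq. (17.14)
    subgrad = (v_star - y) @ X           # sum_i (v*_i - y_i) * x_i
    return w - eta * subgrad
```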
In the following we describe how to find this maximizer efficiently for any performance measure that can be written as a function of the numbers $a, b, c, d$ given in Equation (17.11), and for which the set $V$ contains all elements in $\{\pm 1\}^r$ for which the values of $a, b$ satisfy some constraints. For example, for "recall at $k$" the set $V$ is all vectors for which $a + b \le k$.
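For concreteness, the counts $a, b, c, d$ of Equation (17.11) and a membership test such as $a + b \le k$ for "recall at $k$" are straightforward to compute (a small sketch; the helper names are mine):

```python
import numpy as np

def confusion_counts(v, y):
    # The four counts of Eq. (17.11) for a prediction v and true labels y.
    a = int(np.sum((v == 1) & (y == 1)))    # predicted +1, label +1
    b = int(np.sum((v == 1) & (y == -1)))   # predicted +1, label -1
    c = int(np.sum((v == -1) & (y == 1)))   # predicted -1, label +1
    d = int(np.sum((v == -1) & (y == -1)))  # predicted -1, label -1
    return a, b, c, d

def in_V_recall_at_k(v, y, k):
    # Membership in V for "recall at k": at most k predicted positives.
    a, b, _, _ = confusion_counts(v, y)
    return a + b <= k
```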
The idea is as follows. For any $\bar a, \bar b \in [r]$, let
$$\bar Y_{\bar a,\bar b} = \big\{ v : |\{i : v_i = 1 \wedge y_i = 1\}| = \bar a \ \wedge\ |\{i : v_i = 1 \wedge y_i = -1\}| = \bar b \big\}.$$
Any vector $v \in V$ falls into $\bar Y_{\bar a,\bar b}$ for some $\bar a, \bar b \in [r]$. Furthermore, if $\bar Y_{\bar a,\bar b} \cap V$ is not empty for some $\bar a, \bar b \in [r]$, then $\bar Y_{\bar a,\bar b} \cap V = \bar Y_{\bar a,\bar b}$. Therefore, we can search within each $\bar Y_{\bar a,\bar b}$ that has a nonempty intersection with $V$ separately, and then take the optimal value. The key observation is that once we are searching only within $\bar Y_{\bar a,\bar b}$, the value