
Multiclass and Ranking
Let us denote $\Psi(\bar{x},v) = \sum_{i=1}^{r} v_i x_i$; it follows that

$$
\pi(h_w(\bar{x})) = \operatorname*{argmax}_{v \in V} \sum_{i=1}^{r} v_i \langle w, x_i \rangle
= \operatorname*{argmax}_{v \in V} \left\langle w, \sum_{i=1}^{r} v_i x_i \right\rangle
= \operatorname*{argmax}_{v \in V} \langle w, \Psi(\bar{x},v) \rangle .
$$
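In other words, computing $\pi(h_w(\bar{x}))$ amounts to sorting the scores $\langle w, x_i \rangle$: by the rearrangement inequality, $\sum_{i=1}^{r} v_i \langle w, x_i \rangle$ is maximized by the permutation that pairs the largest value of $v$ with the largest score. A minimal NumPy sketch of this argmax (the array layout and the 1-based rank convention are illustrative assumptions, not part of the text):

```python
import numpy as np

def predict_permutation(w, X):
    """Compute pi(h_w(x_bar)) = argmax_{v in V} sum_i v[i] * <w, x_i>.

    By the rearrangement inequality, the maximizing permutation assigns
    the largest value of v to the largest score, so v is simply the
    ascending 1-based rank of each score.
    """
    scores = X @ w                            # <w, x_i> for i = 1, ..., r
    order = np.argsort(scores)                # indices, smallest score first
    v = np.empty(len(scores), dtype=int)
    v[order] = np.arange(1, len(scores) + 1)  # smallest score -> 1, largest -> r
    return v

# Illustrative usage with r = 5 instances in R^3:
rng = np.random.default_rng(0)
X, w = rng.normal(size=(5, 3)), rng.normal(size=3)
print(predict_permutation(w, X))
```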
On the basis of this observation, we can use the generalized hinge loss for cost-sensitive multiclass classification as a surrogate loss function for the NDCG loss as follows:

$$
\begin{aligned}
\Delta(h_w(\bar{x}),y) &\le \Delta(h_w(\bar{x}),y) + \langle w, \Psi(\bar{x},\pi(h_w(\bar{x}))) \rangle - \langle w, \Psi(\bar{x},\pi(y)) \rangle \\
&\le \max_{v \in V} \left[ \Delta(v,y) + \langle w, \Psi(\bar{x},v) \rangle - \langle w, \Psi(\bar{x},\pi(y)) \rangle \right] \\
&= \max_{v \in V} \left[ \Delta(v,y) + \sum_{i=1}^{r} \bigl( v_i - \pi(y)_i \bigr) \langle w, x_i \rangle \right]. \qquad (17.8)
\end{aligned}
$$

The first inequality holds because $\pi(h_w(\bar{x}))$ maximizes $\langle w, \Psi(\bar{x},v) \rangle$ over $v \in V$, so the term added on the right-hand side is nonnegative. The right-hand side is a convex function with respect to $w$, being a maximum of affine functions of $w$.
                    We can now solve the learning problem using SGD as described in
                 Section 17.2.5. The main computational bottleneck is calculating a subgradient of
                 the loss function, which is equivalent to finding v that achieves the maximum in
                 Equation (17.8) (see Claim 14.6). Using the definition of the NDCG loss, this is
                 equivalent to solving the problem

$$
\operatorname*{argmin}_{v \in V} \; \sum_{i=1}^{r} \left( \alpha_i v_i + \beta_i D(v_i) \right),
$$

where $\alpha_i = -\langle w, x_i \rangle$ and $\beta_i = y_i / G(y,y)$. We can think of this problem a little bit differently by defining a matrix $A \in \mathbb{R}^{r,r}$ where

$$
A_{i,j} = j\,\alpha_i + D(j)\,\beta_i .
$$
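As a concrete illustration, the following sketch builds $A$ from $w$, $\bar{x}$, and $y$. It assumes the standard NDCG discount $D(j) = 1/\log_2(j+1)$ and computes the normalization $G(y,y)$ as the best attainable discounted gain for the relevance vector $y$; both are assumptions to be matched against the exact definitions used in the text.

```python
import numpy as np

def discount(j):
    # Assumed standard NDCG discount D(j) = 1 / log2(j + 1); substitute a
    # truncated discount here if that is the definition in use.
    return 1.0 / np.log2(j + 1.0)

def build_cost_matrix(w, X, y):
    """Return A with A[i, j-1] = j * alpha_i + D(j) * beta_i (j is 1-based)."""
    r = X.shape[0]
    js = np.arange(1, r + 1)
    alpha = -(X @ w)                      # alpha_i = -<w, x_i>
    # G(y, y): the NDCG normalization, taken here to be the best attainable
    # discounted gain (largest discounts paired with largest relevances).
    g_norm = np.sum(discount(js) * np.sort(y)[::-1])
    beta = y / g_norm                     # beta_i = y_i / G(y, y)
    return np.outer(alpha, js) + np.outer(beta, discount(js))
```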


Now, let us think of each $j$ as a "worker," each $i$ as a "task," and $A_{i,j}$ as the cost of assigning task $i$ to worker $j$. With this view, the problem of finding $v$ becomes the problem of finding a minimal-cost assignment of the tasks to the workers. This problem is called "the assignment problem" and can be solved efficiently. One particular algorithm is the "Hungarian method" (Kuhn 1955); a sketch using an off-the-shelf assignment solver is given below.
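For concreteness, here is a minimal sketch that delegates to SciPy's `linear_sum_assignment`, a standard solver for the assignment problem, reusing `build_cost_matrix` from the previous snippet. The resulting minimizer $v^\star$ is exactly the maximizer needed for the subgradient of (17.8), so a one-step SGD update (with an illustrative step size) is included as usage; the function names here are assumptions, not the book's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def solve_for_v(w, X, y):
    """Minimize sum_i (alpha_i * v_i + beta_i * D(v_i)) over permutations v
    by solving the assignment problem on the cost matrix A."""
    A = build_cost_matrix(w, X, y)          # from the previous sketch
    rows, cols = linear_sum_assignment(A)   # minimal-cost perfect matching
    v = np.empty(A.shape[0], dtype=int)
    v[rows] = cols + 1                      # task i is assigned worker j = cols[i] + 1
    return v

def sgd_step(w, X, y, pi_y, eta):
    """One SGD step on the surrogate (17.8): by Claim 14.6, a subgradient
    at w is sum_i (v*_i - pi(y)_i) x_i, where v* attains the maximum."""
    v_star = solve_for_v(w, X, y)
    return w - eta * ((v_star - pi_y) @ X)
```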
Another way to solve the assignment problem is using linear programming. To do so, let us first write the