Since $h_{\mathbf{w}}(\mathbf{x}) \in \mathcal{Y}$ we can upper bound the right-hand side of the preceding by

$$\max_{y' \in \mathcal{Y}} \left( \Delta(y', y) + \langle \mathbf{w},\, \Psi(\mathbf{x}, y') - \Psi(\mathbf{x}, y) \rangle \right) \;\stackrel{\text{def}}{=}\; \ell(\mathbf{w}, (\mathbf{x}, y)). \qquad (17.3)$$

We use the term "generalized hinge loss" to denote the preceding expression. As we have shown, $\ell(\mathbf{w}, (\mathbf{x}, y)) \ge \Delta(h_{\mathbf{w}}(\mathbf{x}), y)$. Furthermore, equality holds whenever the score of the correct label is larger than the score of any other label, $y'$, by at least $\Delta(y', y)$; namely,

$$\forall y' \in \mathcal{Y} \setminus \{y\}, \quad \langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle \ge \langle \mathbf{w}, \Psi(\mathbf{x}, y') \rangle + \Delta(y', y).$$

It is also immediate to see that $\ell(\mathbf{w}, (\mathbf{x}, y))$ is a convex function with respect to $\mathbf{w}$, since it is a maximum over linear functions of $\mathbf{w}$ (see Claim 12.5 in Chapter 12), and that $\ell(\mathbf{w}, (\mathbf{x}, y))$ is $\rho$-Lipschitz with $\rho = \max_{y' \in \mathcal{Y}} \|\Psi(\mathbf{x}, y') - \Psi(\mathbf{x}, y)\|$.
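As a concrete illustration (a minimal sketch, not from the book), the following Python snippet evaluates the generalized hinge loss (17.3) and the predictor $h_{\mathbf{w}}$ for a linear multiclass model; the feature map `psi` follows the multivector construction of Section 17.2.1, while the dimensions, toy data, and the zero-one label loss are illustrative assumptions.

```python
# Minimal sketch of the generalized hinge loss (17.3); the feature map,
# label loss, and toy data below are illustrative choices, not the book's.
import numpy as np

def generalized_hinge_loss(w, psi, delta, x, y, labels):
    """max_{y'} [ Delta(y', y) + <w, Psi(x, y') - Psi(x, y)> ]."""
    base = w @ psi(x, y)
    return max(delta(yp, y) + w @ psi(x, yp) - base for yp in labels)

def predict(w, psi, x, labels):
    """h_w(x) = argmax_{y'} <w, Psi(x, y')>."""
    return max(labels, key=lambda yp: w @ psi(x, yp))

# Multivector construction (Section 17.2.1): Psi(x, y) places x in the
# block of coordinates associated with label y.
labels = [0, 1, 2]
n = 4  # dimension of x (an arbitrary choice for this sketch)

def psi(x, y):
    out = np.zeros(n * len(labels))
    out[y * n:(y + 1) * n] = x
    return out

delta = lambda yp, y: float(yp != y)  # zero-one label loss

rng = np.random.default_rng(0)
w = rng.normal(size=n * len(labels))
x, y = rng.normal(size=n), 1

loss = generalized_hinge_loss(w, psi, delta, x, y, labels)
# The surrogate upper bounds the true loss Delta(h_w(x), y):
assert loss >= delta(predict(w, psi, x, labels), y)
print(loss)
```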


Remark 17.2. We use the name "generalized hinge loss" since in the binary case, when $\mathcal{Y} = \{\pm 1\}$, if we set $\Psi(\mathbf{x}, y) = \frac{y\mathbf{x}}{2}$, then the generalized hinge loss becomes the vanilla hinge loss for binary classification,

$$\ell(\mathbf{w}, (\mathbf{x}, y)) = \max\{0,\, 1 - y \langle \mathbf{w}, \mathbf{x} \rangle\}.$$
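To spell out the reduction (a short derivation added here; it assumes, as the remark implicitly does, the zero-one label loss $\Delta(y', y) = \mathbb{1}_{[y' \neq y]}$): the maximum in (17.3) runs over only $y' = y$ and $y' = -y$, so

$$\ell(\mathbf{w}, (\mathbf{x}, y)) = \max\left\{ 0 + \left\langle \mathbf{w}, \tfrac{y\mathbf{x}}{2} - \tfrac{y\mathbf{x}}{2} \right\rangle,\; 1 + \left\langle \mathbf{w}, \tfrac{-y\mathbf{x}}{2} - \tfrac{y\mathbf{x}}{2} \right\rangle \right\} = \max\{0,\, 1 - y \langle \mathbf{w}, \mathbf{x} \rangle\}.$$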
Geometric Intuition:

The feature function $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ maps each $\mathbf{x}$ into $|\mathcal{Y}|$ vectors in $\mathbb{R}^d$. The value of $\ell(\mathbf{w}, (\mathbf{x}, y))$ will be zero if there exists a direction $\mathbf{w}$ such that, when projecting the $|\mathcal{Y}|$ vectors onto this direction, each vector is represented by the scalar $\langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle$, and we can rank the different points on the basis of these scalars so that

- The point corresponding to the correct $y$ is top-ranked.
- For each $y' \neq y$, the difference between $\langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle$ and $\langle \mathbf{w}, \Psi(\mathbf{x}, y') \rangle$ is larger than the loss of predicting $y'$ instead of $y$. The difference $\langle \mathbf{w}, \Psi(\mathbf{x}, y) \rangle - \langle \mathbf{w}, \Psi(\mathbf{x}, y') \rangle$ is also referred to as the "margin" (see Section 15.1).
This is illustrated in the following figure:

[Figure: the vectors $\Psi(x, y)$, $\Psi(x, y')$, and $\Psi(x, y'')$ projected onto the direction $w$; the projection of $\Psi(x, y)$ exceeds those of $\Psi(x, y')$ and $\Psi(x, y'')$ by at least $\Delta(y, y')$ and $\Delta(y, y'')$, respectively.]
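The zero-loss condition can also be checked numerically. Below is a small sketch (not from the text) using made-up projected scores $\langle \mathbf{w}, \Psi(\mathbf{x}, y') \rangle$ and label losses $\Delta(y', y)$; when every margin is at least the corresponding $\Delta$, the generalized hinge loss evaluates to zero.

```python
# Small numerical check of the zero-loss condition: if the correct label
# beats every other label by at least Delta(y', y) in the projected scores,
# the generalized hinge loss is zero.  All numbers below are made up.
scores = {"y": 3.0, "y1": 1.5, "y2": 0.5}   # <w, Psi(x, .)> for each label
delta  = {"y": 0.0, "y1": 1.0, "y2": 1.0}   # Delta(., y)

loss = max(delta[yp] + scores[yp] - scores["y"] for yp in scores)
margin_ok = all(scores["y"] >= scores[yp] + delta[yp]
                for yp in scores if yp != "y")

print(loss, margin_ok)   # 0.0 True: margins >= Delta  =>  zero surrogate loss
```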






                 17.2.5 Multiclass SVM and SGD

                 Once we have defined the generalized hinge loss, we obtain a convex-Lipschitz
                 learning problem and we can apply our general techniques for solving such