Since $h_{\mathbf{w}}(\mathbf{x}) \in \mathcal{Y}$ we can upper bound the right-hand side of the preceding by
$$\max_{y' \in \mathcal{Y}} \left( \Delta(y',y) + \langle \mathbf{w}, \Psi(\mathbf{x},y') - \Psi(\mathbf{x},y) \rangle \right) \;\stackrel{\mathrm{def}}{=}\; \ell(\mathbf{w},(\mathbf{x},y)). \tag{17.3}$$
We use the term "generalized hinge loss" to denote the preceding expression. As we have shown, $\ell(\mathbf{w},(\mathbf{x},y)) \geq \Delta(h_{\mathbf{w}}(\mathbf{x}),y)$. Furthermore, equality holds whenever the score of the correct label is larger than the score of any other label, $y'$, by at least $\Delta(y',y)$; namely,
$$\forall y' \in \mathcal{Y} \setminus \{y\}, \quad \langle \mathbf{w}, \Psi(\mathbf{x},y) \rangle \geq \langle \mathbf{w}, \Psi(\mathbf{x},y') \rangle + \Delta(y',y).$$
It is also immediate to see that $\ell(\mathbf{w},(\mathbf{x},y))$ is a convex function with respect to $\mathbf{w}$, since it is a maximum over linear functions of $\mathbf{w}$ (see Claim 12.5 in Chapter 12), and that $\ell(\mathbf{w},(\mathbf{x},y))$ is $\rho$-Lipschitz with $\rho = \max_{y' \in \mathcal{Y}} \|\Psi(\mathbf{x},y') - \Psi(\mathbf{x},y)\|$.
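To make the definition in (17.3) concrete, the following is a minimal sketch (not from the text) that evaluates the generalized hinge loss of a linear multiclass predictor; the 0-1 cost $\Delta(y',y) = \mathbb{1}_{[y' \neq y]}$ and the block feature map $\Psi(\mathbf{x},y)$, which places $\mathbf{x}$ in the coordinates associated with $y$, are illustrative assumptions here:

```python
import numpy as np

def psi(x, y, n_classes):
    """Assumed class-sensitive feature map: place x in the block of
    coordinates corresponding to label y (zeros elsewhere)."""
    d = x.shape[0]
    out = np.zeros(n_classes * d)
    out[y * d:(y + 1) * d] = x
    return out

def generalized_hinge_loss(w, x, y, n_classes,
                           delta=lambda yp, y: float(yp != y)):
    """ell(w,(x,y)) = max_{y'} [ Delta(y',y) + <w, Psi(x,y') - Psi(x,y)> ]."""
    score_correct = w @ psi(x, y, n_classes)
    return max(delta(yp, y) + w @ psi(x, yp, n_classes) - score_correct
               for yp in range(n_classes))

# Example: 3 classes, 2-dimensional instances.
rng = np.random.default_rng(0)
w = rng.normal(size=3 * 2)
x, y = np.array([1.0, -0.5]), 2
print(generalized_hinge_loss(w, x, y, n_classes=3))
```

Note that the maximum always includes the term for $y' = y$, which equals $0$, so the returned loss is nonnegative.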
Remark 17.2. We use the name "generalized hinge loss" since in the binary case, when $\mathcal{Y} = \{\pm 1\}$, if we set $\Psi(\mathbf{x},y) = \frac{y\mathbf{x}}{2}$, then the generalized hinge loss becomes the vanilla hinge loss for binary classification,
$$\ell(\mathbf{w},(\mathbf{x},y)) = \max\{0,\, 1 - y\langle \mathbf{w},\mathbf{x}\rangle\}.$$
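For completeness, here is a short derivation of this reduction, assuming the standard 0-1 cost $\Delta(y',y) = \mathbb{1}_{[y' \neq y]}$ (an assumption; the text leaves $\Delta$ general). The term in the maximum corresponding to $y' = y$ equals $0$, while the term corresponding to $y' = -y$ equals
$$1 + \left\langle \mathbf{w}, \tfrac{-y\mathbf{x}}{2} - \tfrac{y\mathbf{x}}{2} \right\rangle = 1 - y\langle \mathbf{w},\mathbf{x}\rangle,$$
so the maximum over $y' \in \{\pm 1\}$ is exactly $\max\{0,\, 1 - y\langle \mathbf{w},\mathbf{x}\rangle\}$.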
Geometric Intuition:
The feature function $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ maps each $\mathbf{x}$ into $|\mathcal{Y}|$ vectors in $\mathbb{R}^d$. The value of $\ell(\mathbf{w},(\mathbf{x},y))$ will be zero if there exists a direction $\mathbf{w}$ such that, when projecting the $|\mathcal{Y}|$ vectors onto this direction, each vector is represented by the scalar $\langle \mathbf{w}, \Psi(\mathbf{x},y) \rangle$, and we can rank the different points on the basis of these scalars so that
• The point corresponding to the correct $y$ is top-ranked.
• For each $y' \neq y$, the difference between $\langle \mathbf{w}, \Psi(\mathbf{x},y) \rangle$ and $\langle \mathbf{w}, \Psi(\mathbf{x},y') \rangle$ is larger than the loss of predicting $y'$ instead of $y$. The difference $\langle \mathbf{w}, \Psi(\mathbf{x},y) \rangle - \langle \mathbf{w}, \Psi(\mathbf{x},y') \rangle$ is also referred to as the "margin" (see Section 15.1).
This is illustrated in the figure following:
[Figure: the vectors $\Psi(\mathbf{x},y)$, $\Psi(\mathbf{x},y')$, and $\Psi(\mathbf{x},y'')$ projected onto the direction $\mathbf{w}$; the projection of $\Psi(\mathbf{x},y)$ exceeds those of $\Psi(\mathbf{x},y')$ and $\Psi(\mathbf{x},y'')$ by at least $\Delta(y',y)$ and $\Delta(y'',y)$, respectively.]
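As a rough numeric illustration of this picture (using the same assumed $\Psi$ and 0-1 cost $\Delta$ as in the earlier sketch, not a construction from the text), one can project the $|\mathcal{Y}|$ vectors onto $\mathbf{w}$ and check that when the correct label's projection beats every other projection by at least $\Delta(y',y)$, the generalized hinge loss is zero:

```python
import numpy as np

# Hypothetical setup: 3 classes, 2-dimensional instances, block Psi as before.
x, y, n_classes, d = np.array([2.0, 1.0]), 0, 3, 2
w = np.zeros(n_classes * d)
w[0:2] = [2.0, 1.0]    # block for class 0: large score for the correct label
w[2:4] = [0.1, 0.0]    # blocks for the other classes: small scores
w[4:6] = [-0.5, 0.2]

# <w, Psi(x, c)> for every label c, i.e., the projections onto w.
scores = np.array([w[c * d:(c + 1) * d] @ x for c in range(n_classes)])
margins = scores[y] - scores                                  # margin against each label
loss = max(float(c != y) + scores[c] - scores[y] for c in range(n_classes))
print(margins)  # margins for y' != y exceed 1 = Delta(y', y) ...
print(loss)     # ... hence the generalized hinge loss equals 0
```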
17.2.5 Multiclass SVM and SGD
Once we have defined the generalized hinge loss, we obtain a convex-Lipschitz
learning problem and we can apply our general techniques for solving such problems.