
17.2 Linear Multiclass Predictors


              problems. In particular, the RLM technique we have studied in Chapter 13 yields
              the multiclass SVM rule:


                                         Multiclass SVM

                      input: $(x_1, y_1), \ldots, (x_m, y_m)$
                      parameters:
                        regularization parameter $\lambda > 0$
                        loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$
                        class-sensitive feature mapping $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$
                      solve:
                      $$\min_{w \in \mathbb{R}^d} \; \lambda \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max_{y' \in \mathcal{Y}} \Big( \Delta(y', y_i) + \langle w, \Psi(x_i, y') - \Psi(x_i, y_i) \rangle \Big)$$
                      output the predictor $h_w(x) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle$
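                 To make the optimization problem concrete, here is a minimal Python sketch (our own illustration, not code from the text) of the objective and the resulting predictor. The feature mapping `psi`, the loss `delta`, and the label set `labels` are illustrative assumptions; the multivector construction is shown as one possible choice of `psi`.

```python
import numpy as np

def multiclass_svm_objective(w, X, y, psi, delta, labels, lam):
    """Evaluate the objective in the box above:
    lam * ||w||^2 + (1/m) * sum_i max_{y'} [ delta(y', y_i)
                                             + <w, psi(x_i, y') - psi(x_i, y_i)> ]."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        scores = [delta(y_prime, y_i) + w @ (psi(x_i, y_prime) - psi(x_i, y_i))
                  for y_prime in labels]
        total += max(scores)
    return lam * (w @ w) + total / m

def predict(w, x, psi, labels):
    """The output predictor h_w(x) = argmax_{y in labels} <w, psi(x, y)>."""
    return max(labels, key=lambda y: w @ psi(x, y))

def make_multivector_psi(n_features, labels):
    """One possible class-sensitive feature mapping (an assumption of this sketch):
    psi(x, y) copies x into the coordinate block associated with label y."""
    index = {label: i for i, label in enumerate(labels)}
    def psi(x, y):
        out = np.zeros(n_features * len(labels))
        out[index[y] * n_features:(index[y] + 1) * n_features] = x
        return out
    return psi
```

                 The inner maximum in `multiclass_svm_objective` is the generalized hinge loss of Equation (17.3) evaluated at $(x_i, y_i)$; taking `delta` to be the 0-1 loss, `delta = lambda y1, y2: float(y1 != y2)`, recovers the plain multiclass hinge loss.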
                 We can solve the optimization problem associated with multiclass SVM
              using generic convex optimization algorithms (or using the method described in
              Section 15.5). Let us analyze the risk of the resulting hypothesis. The analysis
              seamlessly follows from our general analysis for convex-Lipschitz problems given
              in Chapter 13. In particular, applying Corollary 13.8 and using the fact that the
              generalized hinge loss upper bounds the $\Delta$-loss, we immediately obtain an analog of
              Corollary 15.7:


              Corollary 17.1. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, let $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, and assume
              that for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ we have $\|\Psi(x, y)\| \le \rho/2$. Let $B > 0$. Consider running
              Multiclass SVM with $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$ on a training set $S \sim \mathcal{D}^m$ and let $h_w$ be the output of
              Multiclass SVM. Then,
              $$\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}^{\Delta}(h_w) \right] \;\le\; \mathop{\mathbb{E}}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}^{\text{g-hinge}}(w) \right] \;\le\; \min_{u : \|u\| \le B} L_{\mathcal{D}}^{\text{g-hinge}}(u) + \sqrt{\frac{8\rho^2 B^2}{m}},$$
              where $L_{\mathcal{D}}^{\Delta}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\Delta(h(x), y)]$ and $L_{\mathcal{D}}^{\text{g-hinge}}(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell^{\text{g-hinge}}(w, (x, y))]$ with $\ell^{\text{g-hinge}}$
              being the generalized hinge loss as defined in Equation (17.3).
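
                 As a quick numerical illustration (with arbitrarily chosen values of $\rho$, $B$, and $m$ that are not from the text), the corollary's choice of $\lambda$ and its excess-risk term can be computed directly:

```python
import math

def multiclass_svm_lambda(rho, B, m):
    """The regularization parameter from Corollary 17.1: sqrt(2*rho^2 / (B^2 * m))."""
    return math.sqrt(2 * rho**2 / (B**2 * m))

def excess_risk_bound(rho, B, m):
    """The additive term in Corollary 17.1: sqrt(8 * rho^2 * B^2 / m)."""
    return math.sqrt(8 * rho**2 * B**2 / m)

# Illustrative values (assumptions, not from the text): rho = 1, B = 10, m = 10000
print(multiclass_svm_lambda(1.0, 10.0, 10_000))   # ~0.0014
print(excess_risk_bound(1.0, 10.0, 10_000))       # ~0.28
```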

                 We can also apply the SGD learning framework for minimizing $L_{\mathcal{D}}^{\text{g-hinge}}(w)$
              as described in Chapter 14. Recall Claim 14.6, which dealt with subgradients
              of max functions. In light of this claim, in order to find a subgradient of the
              generalized hinge loss all we need to do is to find y ∈ Y that achieves the max-
              imum in the definition of the generalized hinge loss. This yields the following
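
                 For concreteness, here is a minimal sketch (our own illustration, not the algorithm presented in the text) of that subgradient computation; `psi`, `delta`, and `labels` are assumed to be supplied as in the earlier sketch.

```python
import numpy as np

def generalized_hinge_subgradient(w, x_i, y_i, psi, delta, labels):
    """Subgradient of w -> max_{y'} [ delta(y', y_i) + <w, psi(x_i, y') - psi(x_i, y_i)> ].
    By Claim 14.6 it suffices to find a label y_hat achieving the maximum; since each
    inner function is linear in w, the corresponding subgradient is
    psi(x_i, y_hat) - psi(x_i, y_i)."""
    y_hat = max(labels,
                key=lambda y: delta(y, y_i) + w @ (psi(x_i, y) - psi(x_i, y_i)))
    return psi(x_i, y_hat) - psi(x_i, y_i)

# Hypothetical single SGD step with step size eta on the example (x_i, y_i):
# w = w - eta * generalized_hinge_subgradient(w, x_i, y_i, psi, delta, labels)
```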