Page 218 - Understanding Machine Learning

Multiclass and Ranking

values are indicative of a certain letter. The second type of features takes the form
$$\Phi_{i,j,2}(\mathbf{x},\mathbf{y}) \;=\; \frac{1}{r}\sum_{t=2}^{r} \mathbb{1}_{[y_t = i]}\,\mathbb{1}_{[y_{t-1} = j]}.$$
That is, up to normalization by r, we count the number of times the letter i follows the letter j. Intuitively,
                 these features can capture rules like “It is likely to see the pair ‘qu’ in a word” or “It
                 is unlikely to see the pair ‘rz’ in a word.” Of course, some of these features will not
                 be very useful, so the goal of the learning process is to assign weights to features by
                 learning the vector w, so that the weighted score will give us a good prediction via

$$h_{\mathbf{w}}(\mathbf{x}) \;=\; \operatorname*{argmax}_{\mathbf{y}\in\mathcal{Y}} \,\langle \mathbf{w}, \Phi(\mathbf{x},\mathbf{y})\rangle.$$
It is left to show how to solve the optimization problem in the definition of $h_{\mathbf{w}}(\mathbf{x})$ efficiently, as well as how to solve the optimization problem in the definition of $\hat{\mathbf{y}}$ in the SGD algorithm. We can do this by applying a dynamic programming procedure. We describe the procedure for solving the maximization in the definition of $h_{\mathbf{w}}$ and leave as an exercise the maximization problem in the definition of $\hat{\mathbf{y}}$ in the SGD algorithm.
                    To derive the dynamic programming procedure, let us first observe that we can
                 write
$$\Phi(\mathbf{x},\mathbf{y}) \;=\; \sum_{t=1}^{r} \phi(\mathbf{x}, y_t, y_{t-1}),$$
for an appropriate $\phi : \mathcal{X} \times [q] \times ([q]\cup\{0\}) \to \mathbb{R}^d$, and for simplicity we assume that $y_0$ is always equal to 0. Indeed, each feature function $\Phi_{i,j,1}$ can be written in terms of
$$\phi_{i,j,1}(\mathbf{x}, y_t, y_{t-1}) \;=\; x_{i,t}\,\mathbb{1}_{[y_t = j]},$$

while the feature function $\Phi_{i,j,2}$ can be written in terms of
$$\phi_{i,j,2}(\mathbf{x}, y_t, y_{t-1}) \;=\; \mathbb{1}_{[y_t = i]}\,\mathbb{1}_{[y_{t-1} = j]}.$$

Therefore, the prediction can be written as
$$h_{\mathbf{w}}(\mathbf{x}) \;=\; \operatorname*{argmax}_{\mathbf{y}\in\mathcal{Y}} \,\sum_{t=1}^{r} \langle \mathbf{w}, \phi(\mathbf{x}, y_t, y_{t-1})\rangle. \qquad (17.4)$$
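The decomposition above can be sketched in code. The following is a minimal illustration, not the book's implementation: we realize the two feature families as one local map `phi` (its coordinate layout, the explicit position argument `t`, and the helper names `phi`/`Phi` are our choices, and we drop the $1/r$ normalization, which does not affect the argmax), and build $\Phi(\mathbf{x},\mathbf{y})$ as the sum of local terms.

```python
import numpy as np

def phi(x, t, y_t, y_prev, q):
    """Local feature vector phi(x, y_t, y_{t-1}) at position t (hypothetical layout).

    x is an (n, r) array whose t-th column holds the pixels of the t-th letter;
    labels are in {1, ..., q}, with 0 reserved for the dummy y_0.  The first
    n*q coordinates hold the pixel features x_{i,t} 1[y_t = j]; the remaining
    q*q coordinates hold the transition features 1[y_t = i] 1[y_{t-1} = j].
    The 1/r normalization from the text is omitted; it rescales all scores
    by the same positive constant and so leaves the argmax unchanged.
    """
    n, r = x.shape
    v = np.zeros(n * q + q * q)
    # pixel features: a block of n entries selected by the label y_t
    v[(y_t - 1) * n : y_t * n] = x[:, t - 1]
    # transition features: only active once a real predecessor exists
    if y_prev != 0:
        v[n * q + (y_t - 1) * q + (y_prev - 1)] = 1.0
    return v

def Phi(x, y, q):
    """Global feature map Phi(x, y) = sum_t phi(x, y_t, y_{t-1}), with y_0 = 0."""
    total = np.zeros(x.shape[0] * q + q * q)
    y_prev = 0
    for t, y_t in enumerate(y, start=1):
        total += phi(x, t, y_t, y_prev, q)
        y_prev = y_t
    return total
```

With this layout, the transition block of `Phi(x, y, q)` simply counts how often each letter pair occurs in `y`, matching the intuition behind $\Phi_{i,j,2}$.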
In the following we derive a dynamic programming procedure that solves every problem of the form given in Equation (17.4). The procedure will maintain a matrix $M \in \mathbb{R}^{q \times r}$ such that
$$M_{s,\tau} \;=\; \max_{(y_1,\ldots,y_\tau)\,:\,y_\tau = s} \;\sum_{t=1}^{\tau} \langle \mathbf{w}, \phi(\mathbf{x}, y_t, y_{t-1})\rangle.$$
Clearly, the maximum of $\langle \mathbf{w}, \Phi(\mathbf{x},\mathbf{y})\rangle$ equals $\max_s M_{s,r}$. Furthermore, we can calculate $M$ in a recursive manner:
$$M_{s,\tau} \;=\; \max_{s'} \left[ M_{s',\tau-1} + \langle \mathbf{w}, \phi(\mathbf{x}, s, s')\rangle \right]. \qquad (17.5)$$
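The recursion above, augmented with backpointers so the maximizing sequence itself can be recovered, can be sketched as follows. This is a Viterbi-style sketch under our own conventions, not code from the text: the local feature map is passed as a callable `phi(x, t, s, s_prev)` (the explicit position argument is our choice), labels are $1,\ldots,q$, and the dummy initial label is 0. The procedure runs in $O(r\,q^2)$ time, versus the $q^r$ of exhaustive search over $\mathcal{Y}$.

```python
import numpy as np

def predict(w, phi, x, q, r):
    """Maximize sum_t <w, phi(x, y_t, y_{t-1})> over y in [q]^r by the
    recursion of Equation (17.5), returning the argmax and its value."""
    M = np.empty((q, r))                 # M[s-1, tau-1]: best prefix score ending in s
    back = np.empty((q, r), dtype=int)   # backpointers to recover the argmax
    # base case tau = 1: the predecessor is the dummy label y_0 = 0
    for s in range(1, q + 1):
        M[s - 1, 0] = w @ phi(x, 1, s, 0)
        back[s - 1, 0] = 0
    # recursion: M[s, tau] = max_{s'} M[s', tau-1] + <w, phi(x, s, s')>
    for tau in range(2, r + 1):
        for s in range(1, q + 1):
            scores = [M[sp - 1, tau - 2] + w @ phi(x, tau, s, sp)
                      for sp in range(1, q + 1)]
            best = int(np.argmax(scores))
            M[s - 1, tau - 1] = scores[best]
            back[s - 1, tau - 1] = best + 1
    # backtrack from the best final label to read off the maximizing sequence
    y = [0] * r
    y[r - 1] = int(np.argmax(M[:, r - 1])) + 1
    for tau in range(r - 1, 0, -1):
        y[tau - 1] = back[y[tau] - 1, tau]
    return y, float(M[:, r - 1].max())
```

On small instances the result can be checked against brute-force enumeration of all $q^r$ sequences, which is a useful sanity test when adapting the sketch.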