Page 218 - Understanding Machine Learning

Multiclass and Ranking

values are indicative of a certain letter. The second type of features takes the form
$$\Phi_{i,j,2}(\mathbf{x},\mathbf{y}) \;=\; \frac{1}{r}\sum_{t=2}^{r} \mathbb{1}_{[y_t = i]}\,\mathbb{1}_{[y_{t-1} = j]}.$$
That is, up to normalization by r, we count the number of times the letter i follows the letter j. Intuitively,
                 these features can capture rules like “It is likely to see the pair ‘qu’ in a word” or “It
                 is unlikely to see the pair ‘rz’ in a word.” Of course, some of these features will not
                 be very useful, so the goal of the learning process is to assign weights to features by
                 learning the vector w, so that the weighted score will give us a good prediction via

$$h_{\mathbf{w}}(\mathbf{x}) \;=\; \operatorname*{argmax}_{\mathbf{y}\in\mathcal{Y}} \,\langle \mathbf{w}, \Phi(\mathbf{x},\mathbf{y})\rangle.$$
It is left to show how to solve the optimization problem in the definition of $h_{\mathbf{w}}(\mathbf{x})$ efficiently, as well as how to solve the optimization problem in the definition of $\hat{\mathbf{y}}$ in the SGD algorithm. We can do this by applying a dynamic programming procedure. We describe the procedure for solving the maximization in the definition of $h_{\mathbf{w}}$ and leave as an exercise the maximization problem in the definition of $\hat{\mathbf{y}}$ in the SGD algorithm.
                    To derive the dynamic programming procedure, let us first observe that we can
                 write
$$\Phi(\mathbf{x},\mathbf{y}) \;=\; \sum_{t=1}^{r} \phi(\mathbf{x}, y_t, y_{t-1}),$$
for an appropriate $\phi : \mathcal{X} \times [q] \times ([q]\cup\{0\}) \to \mathbb{R}^d$, and for simplicity we assume that $y_0$ is always equal to 0. Indeed, each feature function $\Phi_{i,j,1}$ can be written in terms of
$$\phi_{i,j,1}(\mathbf{x}, y_t, y_{t-1}) \;=\; x_{i,t}\,\mathbb{1}_{[y_t = j]},$$

while the feature function $\Phi_{i,j,2}$ can be written in terms of
$$\phi_{i,j,2}(\mathbf{x}, y_t, y_{t-1}) \;=\; \mathbb{1}_{[y_t = i]}\,\mathbb{1}_{[y_{t-1} = j]}.$$

Therefore, the prediction can be written as
$$h_{\mathbf{w}}(\mathbf{x}) \;=\; \operatorname*{argmax}_{\mathbf{y}\in\mathcal{Y}} \,\sum_{t=1}^{r} \langle \mathbf{w}, \phi(\mathbf{x}, y_t, y_{t-1})\rangle. \qquad (17.4)$$
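The decomposition above can be sketched in code. The following is a minimal illustration, not the book's implementation: we realize the two feature families as one local map `phi` (its coordinate layout, the explicit position argument `t`, and the helper names `phi`/`Phi` are our choices, and we drop the $1/r$ normalization, which does not affect the argmax), and build $\Phi(\mathbf{x},\mathbf{y})$ as the sum of local terms.

```python
import numpy as np

def phi(x, t, y_t, y_prev, q):
    """Local feature vector phi(x, y_t, y_{t-1}) at position t (hypothetical layout).

    x is an (n, r) array whose t-th column holds the pixels of the t-th letter;
    labels are in {1, ..., q}, with 0 reserved for the dummy y_0.  The first
    n*q coordinates hold the pixel features x_{i,t} 1[y_t = j]; the remaining
    q*q coordinates hold the transition features 1[y_t = i] 1[y_{t-1} = j].
    The 1/r normalization from the text is omitted; it rescales all scores
    by the same positive constant and so leaves the argmax unchanged.
    """
    n, r = x.shape
    v = np.zeros(n * q + q * q)
    # pixel features: a block of n entries selected by the label y_t
    v[(y_t - 1) * n : y_t * n] = x[:, t - 1]
    # transition features: only active once a real predecessor exists
    if y_prev != 0:
        v[n * q + (y_t - 1) * q + (y_prev - 1)] = 1.0
    return v

def Phi(x, y, q):
    """Global feature map Phi(x, y) = sum_t phi(x, y_t, y_{t-1}), with y_0 = 0."""
    total = np.zeros(x.shape[0] * q + q * q)
    y_prev = 0
    for t, y_t in enumerate(y, start=1):
        total += phi(x, t, y_t, y_prev, q)
        y_prev = y_t
    return total
```

With this layout, the transition block of `Phi(x, y, q)` simply counts how often each letter pair occurs in `y`, matching the intuition behind $\Phi_{i,j,2}$.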
In the following we derive a dynamic programming procedure that solves every problem of the form given in Equation (17.4). The procedure will maintain a matrix $M \in \mathbb{R}^{q \times r}$ such that
$$M_{s,\tau} \;=\; \max_{(y_1,\ldots,y_\tau)\,:\,y_\tau = s} \;\sum_{t=1}^{\tau} \langle \mathbf{w}, \phi(\mathbf{x}, y_t, y_{t-1})\rangle.$$
Clearly, the maximum of $\langle \mathbf{w}, \Phi(\mathbf{x},\mathbf{y})\rangle$ equals $\max_s M_{s,r}$. Furthermore, we can calculate $M$ in a recursive manner:
$$M_{s,\tau} \;=\; \max_{s'} \left[ M_{s',\tau-1} + \langle \mathbf{w}, \phi(\mathbf{x}, s, s')\rangle \right]. \qquad (17.5)$$
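The recursion above, augmented with backpointers so the maximizing sequence itself can be recovered, can be sketched as follows. This is a Viterbi-style sketch under our own conventions, not code from the text: the local feature map is passed as a callable `phi(x, t, s, s_prev)` (the explicit position argument is our choice), labels are $1,\ldots,q$, and the dummy initial label is 0. The procedure runs in $O(r\,q^2)$ time, versus the $q^r$ of exhaustive search over $\mathcal{Y}$.

```python
import numpy as np

def predict(w, phi, x, q, r):
    """Maximize sum_t <w, phi(x, y_t, y_{t-1})> over y in [q]^r by the
    recursion of Equation (17.5), returning the argmax and its value."""
    M = np.empty((q, r))                 # M[s-1, tau-1]: best prefix score ending in s
    back = np.empty((q, r), dtype=int)   # backpointers to recover the argmax
    # base case tau = 1: the predecessor is the dummy label y_0 = 0
    for s in range(1, q + 1):
        M[s - 1, 0] = w @ phi(x, 1, s, 0)
        back[s - 1, 0] = 0
    # recursion: M[s, tau] = max_{s'} M[s', tau-1] + <w, phi(x, s, s')>
    for tau in range(2, r + 1):
        for s in range(1, q + 1):
            scores = [M[sp - 1, tau - 2] + w @ phi(x, tau, s, sp)
                      for sp in range(1, q + 1)]
            best = int(np.argmax(scores))
            M[s - 1, tau - 1] = scores[best]
            back[s - 1, tau - 1] = best + 1
    # backtrack from the best final label to read off the maximizing sequence
    y = [0] * r
    y[r - 1] = int(np.argmax(M[:, r - 1])) + 1
    for tau in range(r - 1, 0, -1):
        y[tau - 1] = back[y[tau] - 1, tau]
    return y, float(M[:, r - 1].max())
```

On small instances the result can be checked against brute-force enumeration of all $q^r$ sequences, which is a useful sanity test when adapting the sketch.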