Page 331 - Understanding Machine Learning
P. 331

25.1 Feature Selection  313


              onto the subspace spanned by V t−1 and u j is the part of X j orthogonal to V t−1 (see
              Appendix C). Then,

                                             2
                          min V t−1 θ + αu j − y
                           θ,α

                                             2   2    2
                             = min  V t−1 θ − y  + α  u j   + 2α u j ,V t−1 θ − y
                               θ,α

                                             2   2    2
                             = min  V t−1 θ − y  + α  u j   + 2α u j ,−y
                               θ,α

                                             2         2    2
                             = min  V t−1 θ − y   + min α  u j   − 2α u j ,y
                                θ                 α

                                            2         2    2
                             =  V t−1 θ t−1 − y   + min α  u j   − 2α u j ,y
                                                 α
                                              ( u j ,y ) 2
                                           2
                             = V t−1 θ t−1 − y  −   2  .
                                                 u j
              It follows that we should select the feature
                                                   ( u j ,y ) 2
                                        j t = argmax    2  .
                                               j     u j

              The rest of the update is to set

                                                                ,y
                                          u j t               u j t
                              V t = V t−1 ,    ,   θ t = θ t−1 ;    .
                                              2                    2
                                          u j t               u j t
                 The OMP procedure maintains an orthonormal basis of the selected features,
              where in the preceding description, the orthonormalization property is obtained by
              a procedure similar to Gram-Schmidt orthonormalization. In practice, the Gram-
              Schmidt procedure is often numerically unstable. In the pseudocode that follows we
              use SVD (see Section C.4) at the end of each round to obtain an orthonormal basis
              in a numerically stable manner.


                                 Orthogonal Matching Pursuit (OMP)
                       input:
                                                             m
                          data matrix X ∈ R m,d  , labels vector y ∈ R ,
                          budget of features T
                       initialize: I 1 =∅
                       for t = 1,...,T
                          use SVD to find an orthonormal basis V ∈ R m,t−1  of X I t
                            (for t = 1set V to be the all zeros matrix)

                          foreach j ∈ [d] \ I t let u j = X j − VV X j
                                                ( u j ,y ) 2
                          let j t = argmax j /∈I t : u j  >0   u j   2
                          update I t+1 = I t ∪{ j t }
                       output I T +1
   326   327   328   329   330   331   332   333   334   335   336