onto the subspace spanned by $V_{t-1}$, and $u_j$ is the part of $X_j$ orthogonal to $V_{t-1}$ (see
Appendix C). Then,
$$
\begin{aligned}
\min_{\theta,\alpha} \left\| V_{t-1}\theta + \alpha u_j - y \right\|^2
&= \min_{\theta,\alpha} \left[ \|V_{t-1}\theta - y\|^2 + \alpha^2 \|u_j\|^2 + 2\alpha \langle u_j,\, V_{t-1}\theta - y\rangle \right] \\
&= \min_{\theta,\alpha} \left[ \|V_{t-1}\theta - y\|^2 + \alpha^2 \|u_j\|^2 + 2\alpha \langle u_j,\, -y\rangle \right] \\
&= \min_{\theta} \|V_{t-1}\theta - y\|^2 + \min_{\alpha} \left[ \alpha^2 \|u_j\|^2 - 2\alpha \langle u_j, y\rangle \right] \\
&= \|V_{t-1}\theta_{t-1} - y\|^2 + \min_{\alpha} \left[ \alpha^2 \|u_j\|^2 - 2\alpha \langle u_j, y\rangle \right] \\
&= \|V_{t-1}\theta_{t-1} - y\|^2 - \frac{\langle u_j, y\rangle^2}{\|u_j\|^2} \,,
\end{aligned}
$$
where the second equality holds because $u_j$ is orthogonal to the range of $V_{t-1}$, so $\langle u_j, V_{t-1}\theta\rangle = 0$, and the last equality follows by minimizing the quadratic in $\alpha$, whose minimizer is $\alpha = \langle u_j, y\rangle/\|u_j\|^2$.
It follows that we should select the feature
$$
j_t = \operatorname*{argmax}_{j} \frac{\langle u_j, y\rangle^2}{\|u_j\|^2} \,.
$$
The rest of the update is to set
$$
V_t = \left[ V_{t-1},\ \frac{u_{j_t}}{\|u_{j_t}\|} \right], \qquad
\theta_t = \left[ \theta_{t-1}\,;\ \frac{\langle u_{j_t}, y\rangle}{\|u_{j_t}\|} \right].
$$
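As a sanity check, the selection rule and this closed-form update can be verified numerically against a brute-force least-squares refit. The following NumPy sketch (not from the book; the data, the tolerance, and the initial index set are arbitrary choices made for illustration) runs a single round and asserts that the drop in squared error equals the greedy score and that $V_t\theta_t$ reproduces the least-squares fit over the enlarged feature set.

import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 10                             # examples and features (arbitrary)
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

# Orthonormal basis V of three arbitrarily chosen features, obtained via SVD.
I = [0, 1, 2]
U, S, _ = np.linalg.svd(X[:, I], full_matrices=False)
V = U[:, S > 1e-10]

# Optimal coefficients with respect to V and the current squared error.
theta = V.T @ y
residual_sq = np.linalg.norm(V @ theta - y) ** 2

# Greedy selection: maximize <u_j, y>^2 / ||u_j||^2 over the remaining features.
best_j, best_score = None, -np.inf
for j in range(d):
    if j in I:
        continue
    u = X[:, j] - V @ (V.T @ X[:, j])     # part of X_j orthogonal to span(V)
    if np.linalg.norm(u) < 1e-10:
        continue
    score = np.dot(u, y) ** 2 / np.dot(u, u)
    if score > best_score:
        best_j, best_score = j, score

# Closed-form update for the selected feature.
u = X[:, best_j] - V @ (V.T @ X[:, best_j])
V_new = np.column_stack([V, u / np.linalg.norm(u)])
theta_new = np.append(theta, np.dot(u, y) / np.linalg.norm(u))

# The error drop equals the greedy score, and V_new @ theta_new equals the
# least-squares fit over the enlarged feature set.
new_residual_sq = np.linalg.norm(V_new @ theta_new - y) ** 2
assert np.isclose(residual_sq - new_residual_sq, best_score)
w_ls, *_ = np.linalg.lstsq(X[:, I + [best_j]], y, rcond=None)
assert np.allclose(V_new @ theta_new, X[:, I + [best_j]] @ w_ls)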
The OMP procedure maintains an orthonormal basis of the selected features,
where in the preceding description, the orthonormalization property is obtained by
a procedure similar to Gram-Schmidt orthonormalization. In practice, the Gram-Schmidt
procedure is often numerically unstable. In the pseudocode that follows we
use SVD (see Section C.4) at the end of each round to obtain an orthonormal basis
in a numerically stable manner.
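With NumPy, for example, one way to carry out this step is to take the left singular vectors of $X_{I_t}$ whose singular values exceed a small tolerance. The helper below is a sketch of that idea (the function name and the tolerance are my own choices, not the book's).

import numpy as np

def orthonormal_basis(X_I, tol=1e-10):
    """Orthonormal basis of the column space of X_I via SVD.

    Returns the left singular vectors of X_I whose singular values exceed
    `tol`; a numerically stable alternative to Gram-Schmidt.
    """
    if X_I.shape[1] == 0:
        return np.zeros((X_I.shape[0], 0))
    U, S, _ = np.linalg.svd(X_I, full_matrices=False)
    return U[:, S > tol]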
Orthogonal Matching Pursuit (OMP)
input:
  data matrix $X \in \mathbb{R}^{m,d}$, labels vector $y \in \mathbb{R}^m$,
  budget of features $T$
initialize: $I_1 = \emptyset$
for $t = 1, \ldots, T$
  use SVD to find an orthonormal basis $V \in \mathbb{R}^{m,t-1}$ of $X_{I_t}$
  (for $t = 1$ set $V$ to be the all zeros matrix)
  foreach $j \in [d] \setminus I_t$ let $u_j = X_j - V V^\top X_j$
  let $j_t = \operatorname*{argmax}_{j \notin I_t : \|u_j\| > 0}\ \langle u_j, y\rangle^2 / \|u_j\|^2$
  update $I_{t+1} = I_t \cup \{j_t\}$
output $I_{T+1}$
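A possible NumPy implementation of this pseudocode might look as follows. It is only a sketch: the function name, the tolerance used to decide when a residual feature is numerically zero, and the early stop when no informative feature remains are my own choices, not the book's.

import numpy as np

def omp(X, y, T, tol=1e-10):
    """Orthogonal Matching Pursuit: greedily select up to T features of X.

    X : (m, d) data matrix, y : (m,) labels vector, T : feature budget.
    Returns the list of selected feature indices.
    """
    m, d = X.shape
    selected = []                            # I_1 = empty set
    V = np.zeros((m, 0))                     # for t = 1, the basis is empty (all zeros)
    for t in range(T):
        best_j, best_score = None, -np.inf
        for j in range(d):
            if j in selected:
                continue
            u = X[:, j] - V @ (V.T @ X[:, j])   # u_j = X_j - V V^T X_j
            norm_sq = np.dot(u, u)
            if norm_sq <= tol:                  # skip features already in the span
                continue
            score = np.dot(u, y) ** 2 / norm_sq
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:                      # no informative feature remains
            break
        selected.append(best_j)
        # Re-orthonormalize the selected columns with SVD for numerical stability.
        U, S, _ = np.linalg.svd(X[:, selected], full_matrices=False)
        V = U[:, S > tol]
    return selected

Given the returned indices, the least-squares coefficients on the selected features can be recovered afterward with np.linalg.lstsq(X[:, selected], y, rcond=None).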