most s nonzero elements, where s is a small integer. In particular, let ψ and φ be
functions that depend on µ_1,...,µ_k. The function ψ maps an instance vector x to a
vector ψ(x) ∈ R^k, where ψ(x) should have at most s nonzero elements. The function
φ(v) is defined to be ∑_{i=1}^k v_i µ_i. As before, our goal is to have a small reconstruction
error, and therefore we can define

ψ(x) = argmin_v ‖x − φ(v)‖²  s.t.  ‖v‖_0 ≤ s,

where ‖v‖_0 = |{ j : v_j ≠ 0}|. Note that when s = 1 and we further restrict ‖v‖_1 = 1
then we obtain the k-means encoding function; that is, ψ(x) is the indicator vector
of the centroid closest to x.
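To make the encoder/decoder pair concrete, here is a minimal NumPy sketch of φ and of the s = 1 encoding, assuming the vectors µ_1,...,µ_k are stored as the columns of a d × k matrix D; the names phi, psi_s1, and D are illustrative and not taken from the text.

```python
import numpy as np

def phi(v, D):
    # Decoder: phi(v) = sum_i v_i * mu_i, where mu_1,...,mu_k are the columns of D.
    return D @ v

def psi_s1(x, D):
    # Encoder for the s = 1 case with the extra restriction ||v||_1 = 1:
    # the indicator vector of the centroid (column of D) closest to x,
    # i.e., the k-means encoding function described in the text.
    distances = np.linalg.norm(D - x[:, None], axis=0)
    v = np.zeros(D.shape[1])
    v[np.argmin(distances)] = 1.0
    return v
```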
For larger values of s, the optimization problem in the preceding definition of ψ
becomes computationally difficult. Therefore, in practice, we sometimes use ℓ_1
regularization instead of the sparsity constraint and define ψ to be
ψ(x) = argmin_v ‖x − φ(v)‖² + λ ‖v‖_1,

where λ > 0 is a regularization parameter.
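One standard way to solve this ℓ_1-regularized problem for a fixed dictionary is iterative soft-thresholding (ISTA); the following sketch applies it to the objective above, again storing µ_1,...,µ_k as the columns of a matrix D. The function name l1_encode and the fixed iteration budget are assumptions made for illustration, not a method prescribed by the text.

```python
import numpy as np

def l1_encode(x, D, lam, n_iters=200):
    """Approximately compute psi(x) = argmin_v ||x - D v||^2 + lam * ||v||_1
    for a fixed dictionary D (columns mu_1,...,mu_k) via ISTA."""
    # Step size 1/L, where L = 2 * sigma_max(D)^2 bounds the Lipschitz
    # constant of the gradient of the squared-error term.
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2
    v = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * D.T @ (D @ v - x)      # gradient of ||x - D v||^2
        z = v - grad / L                    # gradient step
        v = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return v
```

Larger values of lam drive more coordinates of the returned code exactly to zero, which is how the ℓ_1 penalty acts as a surrogate for the sparsity constraint.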
Either way, the dictionary learning problem is now to find the vectors µ_1,...,µ_k
such that the reconstruction error, ∑_{i=1}^m ‖x_i − φ(ψ(x_i))‖², is as small as
possible. Even if ψ is defined using the ℓ_1 regularization,
this is still a computationally hard problem (similar to the k-means problem).
However, several heuristic search algorithms may give reasonably good solutions. These
algorithms are beyond the scope of this book.
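For concreteness, here is a sketch of one common heuristic for the dictionary learning problem: alternate between encoding every example with the current dictionary and updating the dictionary by least squares. The encoder is the same ISTA sketch as above; the initialization from random data points, the column normalization, and all function names are illustrative choices rather than an algorithm given in the text.

```python
import numpy as np

def l1_encode(x, D, lam, n_iters=100):
    # ISTA sketch for psi(x) = argmin_v ||x - D v||^2 + lam * ||v||_1 (as above).
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2
    v = np.zeros(D.shape[1])
    for _ in range(n_iters):
        z = v - 2.0 * D.T @ (D @ v - x) / L
        v = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return v

def learn_dictionary(X, k, lam, n_rounds=20, seed=0):
    """Alternating minimization for the reconstruction error
    sum_i ||x_i - phi(psi(x_i))||^2 over the dictionary D (d x k).
    X is an m x d matrix whose rows are the examples x_1,...,x_m."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    D = X[rng.choice(m, size=k, replace=False)].T           # initialize from data points
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12   # unit-norm columns
    for _ in range(n_rounds):
        # Encoding step: fix D, compute a sparse code for every example.
        V = np.stack([l1_encode(x, D, lam) for x in X], axis=1)   # k x m
        # Dictionary step: fix the codes, solve argmin_D ||X.T - D V||_F^2.
        D = np.linalg.lstsq(V.T, X, rcond=None)[0].T
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    return D
```

Like the k-means heuristic, this alternation is not guaranteed to reach a global minimum, which is consistent with the hardness noted above.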
25.4 SUMMARY
Many machine learning algorithms take the feature representation of instances for
granted. Yet the choice of representation requires careful attention. We discussed
approaches for feature selection, introducing filters, greedy selection algorithms,
and sparsity-inducing norms. Next we presented several examples of feature
transformations and demonstrated their usefulness. Last, we discussed feature
learning, and in particular dictionary learning. We have shown that feature selection,
manipulation, and learning all depend on some prior knowledge about the data.
25.5 BIBLIOGRAPHIC REMARKS
Guyon and Elisseeff (2003) surveyed several feature selection procedures, including
many types of filters.
Forward greedy selection procedures for minimizing a convex objective subject
to a polyhedron constraint date back to the Frank-Wolfe algorithm (Frank &
Wolfe 1956). The relation to boosting has been studied by several authors, including
(Warmuth, Liao & Rätsch 2006, Warmuth, Glocer & Vishwanathan 2008, Shalev-
Shwartz & Singer 2008). Matching pursuit has been studied in the signal processing
community (Mallat & Zhang 1993). Several papers analyzed greedy selection
methods under various conditions. See, for example, Shalev-Shwartz, Zhang, and Srebro
(2010) and the references therein.
The use of the ℓ_1-norm as a surrogate for sparsity has a long history
(e.g., Tibshirani (1996) and the references therein), and much work has been done