most s nonzero elements, where s is a small integer. In particular, let ψ and φ be
functions that depend on µ_1,...,µ_k. The function ψ maps an instance vector x to a
vector ψ(x) ∈ R^k, where ψ(x) should have at most s nonzero elements. The function
φ(v) is defined to be ∑_{i=1}^k v_i µ_i. As before, our goal is to have a small reconstruction
error, and therefore we can define

ψ(x) = argmin_v ‖x − φ(v)‖²  s.t.  ‖v‖_0 ≤ s,

where ‖v‖_0 = |{ j : v_j ≠ 0}|. Note that when s = 1 and we further restrict ‖v‖_1 = 1
then we obtain the k-means encoding function; that is, ψ(x) is the indicator vector
of the centroid closest to x.
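To make the encoder/decoder pair concrete, here is a minimal NumPy sketch of φ and of the s = 1 encoding, assuming the vectors µ_1,...,µ_k are stored as the columns of a d × k matrix D; the names phi, psi_s1, and D are illustrative and not taken from the text.

```python
import numpy as np

def phi(v, D):
    # Decoder: phi(v) = sum_i v_i * mu_i, where mu_1,...,mu_k are the columns of D.
    return D @ v

def psi_s1(x, D):
    # Encoder for the s = 1 case with the extra restriction ||v||_1 = 1:
    # the indicator vector of the centroid (column of D) closest to x,
    # i.e., the k-means encoding function described in the text.
    distances = np.linalg.norm(D - x[:, None], axis=0)
    v = np.zeros(D.shape[1])
    v[np.argmin(distances)] = 1.0
    return v
```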
For larger values of s, the optimization problem in the preceding definition of ψ
becomes computationally difficult. Therefore, in practice, we sometimes use ℓ_1
regularization instead of the sparsity constraint and define ψ to be
ψ(x) = argmin_v ‖x − φ(v)‖² + λ ‖v‖_1,

where λ > 0 is a regularization parameter.
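One standard way to solve this ℓ_1-regularized problem for a fixed dictionary is iterative soft-thresholding (ISTA); the following sketch applies it to the objective above, again storing µ_1,...,µ_k as the columns of a matrix D. The function name l1_encode and the fixed iteration budget are assumptions made for illustration, not a method prescribed by the text.

```python
import numpy as np

def l1_encode(x, D, lam, n_iters=200):
    """Approximately compute psi(x) = argmin_v ||x - D v||^2 + lam * ||v||_1
    for a fixed dictionary D (columns mu_1,...,mu_k) via ISTA."""
    # Step size 1/L, where L = 2 * sigma_max(D)^2 bounds the Lipschitz
    # constant of the gradient of the squared-error term.
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2
    v = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * D.T @ (D @ v - x)      # gradient of ||x - D v||^2
        z = v - grad / L                    # gradient step
        v = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return v
```

Larger values of lam drive more coordinates of the returned code exactly to zero, which is how the ℓ_1 penalty acts as a surrogate for the sparsity constraint.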
Either way, the dictionary learning problem is now to find the vectors µ_1,...,µ_k
such that the reconstruction error, ∑_{i=1}^m ‖x_i − φ(ψ(x_i))‖², is as small as
possible. Even if ψ is defined using the ℓ_1 regularization,
this is still a computationally hard problem (similar to the k-means problem).
However, several heuristic search algorithms may give reasonably good solutions. These
algorithms are beyond the scope of this book.
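For concreteness, here is a sketch of one common heuristic for the dictionary learning problem: alternate between encoding every example with the current dictionary and updating the dictionary by least squares. The encoder is the same ISTA sketch as above; the initialization from random data points, the column normalization, and all function names are illustrative choices rather than an algorithm given in the text.

```python
import numpy as np

def l1_encode(x, D, lam, n_iters=100):
    # ISTA sketch for psi(x) = argmin_v ||x - D v||^2 + lam * ||v||_1 (as above).
    L = 2.0 * np.linalg.norm(D, ord=2) ** 2
    v = np.zeros(D.shape[1])
    for _ in range(n_iters):
        z = v - 2.0 * D.T @ (D @ v - x) / L
        v = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return v

def learn_dictionary(X, k, lam, n_rounds=20, seed=0):
    """Alternating minimization for the reconstruction error
    sum_i ||x_i - phi(psi(x_i))||^2 over the dictionary D (d x k).
    X is an m x d matrix whose rows are the examples x_1,...,x_m."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    D = X[rng.choice(m, size=k, replace=False)].T           # initialize from data points
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12   # unit-norm columns
    for _ in range(n_rounds):
        # Encoding step: fix D, compute a sparse code for every example.
        V = np.stack([l1_encode(x, D, lam) for x in X], axis=1)   # k x m
        # Dictionary step: fix the codes, solve argmin_D ||X.T - D V||_F^2.
        D = np.linalg.lstsq(V.T, X, rcond=None)[0].T
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    return D
```

Like the k-means heuristic, this alternation is not guaranteed to reach a global minimum, which is consistent with the hardness noted above.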
25.4 SUMMARY
Many machine learning algorithms take the feature representation of instances for
granted. Yet the choice of representation requires careful attention. We discussed
approaches for feature selection, introducing filters, greedy selection algorithms,
and sparsity-inducing norms. Next we presented several examples of feature
transformations and demonstrated their usefulness. Last, we discussed feature
learning, and in particular dictionary learning. We have shown that feature selection,
manipulation, and learning all depend on some prior knowledge about the data.
25.5 BIBLIOGRAPHIC REMARKS
Guyon and Elisseeff (2003) surveyed several feature selection procedures, including
many types of filters.
Forward greedy selection procedures for minimizing a convex objective subject
to a polyhedron constraint date back to the Frank-Wolfe algorithm (Frank &
Wolfe 1956). The relation to boosting has been studied by several authors, including
(Warmuth, Liao & Rätsch 2006, Warmuth, Glocer & Vishwanathan 2008, Shalev-
Shwartz & Singer 2008). Matching pursuit has been studied in the signal processing
community (Mallat & Zhang 1993). Several papers analyzed greedy selection
methods under various conditions. See, for example, Shalev-Shwartz, Zhang, and Srebro
(2010) and the references therein.
The use of the ℓ_1-norm as a surrogate for sparsity has a long history
(e.g., Tibshirani (1996) and the references therein), and much work has been done