Feature Selection and Generation
where each w_i is a string representing a word in the dictionary, and given a document, (p_1,...,p_d), where each p_i is a word in the document, we represent the document as a vector x ∈ {0,1}^k, where x_i is 1 if w_i = p_j for some j ∈ [d], and x_i = 0 otherwise. It was empirically observed in many text processing tasks that lin-
ear predictors are quite powerful when applied on this representation. Intuitively,
we can think of each word as a feature that measures some aspect of the docu-
ment. Given labeled examples (e.g., topics of the documents), a learning algorithm
searches for a linear predictor that weights these features so that the right combination
of appearances of words is indicative of the label.
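As a concrete illustration, the following is a minimal sketch of this construction in Python; the toy dictionary and document are hypothetical and not taken from the text:

```python
# Minimal sketch: binary bag-of-words representation over a fixed dictionary.
# The dictionary below is a hypothetical toy example.
dictionary = ["learning", "predictor", "image", "word"]  # w_1, ..., w_k

def bag_of_words(document):
    """Map a document (a list of words p_1, ..., p_d) to x in {0,1}^k,
    where x_i = 1 iff the dictionary word w_i appears in the document."""
    present = set(document)
    return [1 if w in present else 0 for w in dictionary]

x = bag_of_words(["a", "linear", "predictor", "weights", "word", "features"])
print(x)  # [0, 1, 0, 1]
```

A linear predictor would then operate on vectors like x, assigning one weight per dictionary word.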
While in text processing there is a natural meaning to words and to the dictio-
nary, in other applications we do not have such an intuitive representation of an
instance. For example, consider the computer vision application of object recogni-
tion. Here, the instance is an image and the goal is to recognize which object appears
in the image. Applying a linear predictor on the pixel-based representation of the
image does not yield a good classifier. What we would like to have is a mapping
ψ that would take the pixel-based representation of the image and would output a
bag of “visual words,” representing the content of the image. For example, a “visual
word” can be “there is an eye in the image.” If we had such a representation, we could
have applied a linear predictor on top of this representation to train a classifier for,
say, face recognition. Our question is, therefore, How can we learn a dictionary
of “visual words” such that a bag-of-words representation of an image would be
helpful for predicting which object appears in the image?
A first naive approach for dictionary learning relies on a clustering algorithm
(see Chapter 22). Suppose that we learn a function c : X → {1,...,k}, where c(x)
is the cluster to which x belongs. Then, we can think of the clusters as “words,”
and of instances as “documents,” where a document x is mapped to the vector
ψ(x) ∈ {0,1}^k, where ψ(x)_i is 1 if and only if x belongs to the ith cluster. Now, it
is straightforward to see that applying a linear predictor on ψ(x) is equivalent to
assigning the same target value to all instances that belong to the same cluster. Fur-
thermore, if the clustering is based on distances from a class center (e.g., k-means),
then a linear predictor on ψ(x) yields a piece-wise constant predictor on x.
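To make this concrete, here is a short sketch of the construction; it assumes scikit-learn's KMeans and LinearRegression, which are not mentioned in the text, and uses hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one-dimensional instances with a real-valued target.
rng = np.random.RandomState(0)
X = rng.randn(200, 1)
y = np.sin(3 * X[:, 0])

k = 10
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

def psi(X):
    """Map each instance to the one-hot indicator in {0,1}^k of its cluster."""
    labels = kmeans.predict(X)
    Z = np.zeros((len(X), k))
    Z[np.arange(len(X)), labels] = 1.0
    return Z

# A linear predictor on psi(x) assigns a single value per cluster, i.e., it is a
# piecewise constant predictor on x (constant on each k-means cell).
lin = LinearRegression(fit_intercept=False).fit(psi(X), y)
print(lin.predict(psi(np.array([[0.0], [1.5]]))))
```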
Both the k-means and PCA approaches can be regarded as special cases of a
more general approach for dictionary learning which is called auto-encoders. In an auto-encoder we learn a pair of functions: an “encoder” function, ψ : R^d → R^k, and a “decoder” function, φ : R^k → R^d. The goal of the learning process is to find a pair of functions such that the reconstruction error, ∑_i ‖x_i − φ(ψ(x_i))‖^2, is small. Of
course, we can trivially set k = d and both ψ,φ to be the identity mapping, which
yields a perfect reconstruction. We therefore must restrict ψ and φ in some way. In
PCA, we constrain k < d and further restrict ψ and φ to be linear functions. In k-
means, k is not restricted to be smaller than d, but now ψ and φ rely on k centroids,
µ_1,...,µ_k, and ψ(x) returns an indicator vector in {0,1}^k that indicates the closest
centroid to x, while φ takes as input an indicator vector and returns the centroid
representing this vector.
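To spell out the k-means case in the encoder/decoder notation, here is a small sketch in plain NumPy; the centroids and sample points are hypothetical:

```python
import numpy as np

# Hypothetical centroids mu_1, ..., mu_k in R^d (here k = 3, d = 2).
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
k, d = centroids.shape

def psi(x):
    """Encoder: map x in R^d to the indicator in {0,1}^k of its closest centroid."""
    i = np.argmin(np.sum((centroids - x) ** 2, axis=1))
    e = np.zeros(k)
    e[i] = 1.0
    return e

def phi(v):
    """Decoder: map an indicator vector back to the centroid it represents."""
    return centroids.T @ v

# Reconstruction error sum_i ||x_i - phi(psi(x_i))||^2 on a toy sample.
X = np.array([[0.1, -0.2], [0.9, 1.2], [-1.1, 2.3]])
err = sum(np.sum((x - phi(psi(x))) ** 2) for x in X)
print(err)
```

In the PCA case, by contrast, ψ and φ are both linear maps and k < d.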
An important property of the k-means construction, which is key in allowing
k to be larger than d, is that ψ maps instances into sparse vectors. In fact, in k-
means only a single coordinate of ψ(x) is nonzero. An immediate extension of the
k-means construction is therefore to restrict the range of ψ to be vectors with at