Page 338 - Understanding Machine Learning
P. 338

Feature Selection and Generation
           320

                 where each w i is a string representing a word in the dictionary, and given a docu-
                 ment, (p 1 ,..., p d ), where each p i is a word in the document, we represent the
                                             k
                 document as a vector x ∈{0,1} ,where x i is 1 if w i = p j for some j ∈ [d], and
                  x i = 0 otherwise. It was empirically observed in many text processing tasks that lin-
                 ear predictors are quite powerful when applied on this representation. Intuitively,
                 we can think of each word as a feature that measures some aspect of the docu-
                 ment. Given labeled examples (e.g., topics of the documents), a learning algorithm
                 searches for a linear predictor that weights these features so that a right combination
                 of appearances of words is indicative of the label.
                    While in text processing there is a natural meaning to words and to the dictio-
                 nary, in other applications we do not have such an intuitive representation of an
                 instance. For example, consider the computer vision application of object recogni-
                 tion. Here, the instance is an image and the goal is to recognize which object appears
                 in the image. Applying a linear predictor on the pixel-based representation of the
                 image does not yield a good classifier. What we would like to have is a mapping
                 ψ that would take the pixel-based representation of the image and would output a
                 bag of “visual words,” representing the content of the image. For example, a “visual
                 word” can be “there is an eye in the image.” If we had such representation, we could
                 have applied a linear predictor on top of this representation to train a classifier for,
                 say, face recognition. Our question is, therefore, How can we learn a dictionary
                 of “visual words” such that a bag-of-words representation of an image would be
                 helpful for predicting which object appears in the image?
                    A first naive approach for dictionary learning relies on a clustering algorithm
                 (see Chapter 22). Suppose that we learn a function c : X →{1,...,k},where c(x)
                 is the cluster to which x belongs. Then, we can think of the clusters as “words,”
                 and of instances as “documents,” where a document x is mapped to the vector
                            k
                 ψ(x) ∈{0,1} ,where ψ(x) i is 1 if and only if x belongs to the ith cluster. Now, it
                 is straightforward to see that applying a linear predictor on ψ(x) is equivalent to
                 assigning the same target value to all instances that belong to the same cluster. Fur-
                 thermore, if the clustering is based on distances from a class center (e.g., k-means),
                 then a linear predictor on ψ(x) yields a piece-wise constant predictor on x.
                    Both the k-means and PCA approaches can be regarded as special cases of a
                 more general approach for dictionary learning which is called auto-encoders.In an
                                                                                    k
                                                                               d
                 auto-encoder we learn a pair of functions: an “encoder” function, ψ : R → R ,and
                                              d
                                         k
                 a “decoder” function, φ : R → R . The goal of the learning process is to find a pair
                                                                             2
                 of functions such that the reconstruction error,   x i − φ(ψ(x i ))  , is small. Of
                                                              i
                 course, we can trivially set k = d and both ψ,φ to be the identity mapping, which
                 yields a perfect reconstruction. We therefore must restrict ψ and φ in some way. In
                 PCA, we constrain k < d and further restrict ψ and φ to be linear functions. In k-
                 means, k is not restricted to be smaller than d, but now ψ and φ rely on k centroids,
                                                                  k
                 µ ,...,µ ,and ψ(x) returns an indicator vector in {0,1} that indicates the closest
                   1
                         k
                 centroid to x, while φ takes as input an indicator vector and returns the centroid
                 representing this vector.
                    An important property of the k-means construction, which is key in allowing
                 k to be larger than d,is that ψ maps instances into sparse vectors. In fact, in k-
                 means only a single coordinate of ψ(x) is nonzero. An immediate extension of the
                 k-means construction is therefore to restrict the range of ψ to be vectors with at
   333   334   335   336   337   338   339   340   341   342   343