
Feature Selection and Generation
For simplicity let us assume that $\frac{1}{m}\sum_{i} x_i^2 = 1$, and denote $\langle x,y \rangle = \sum_{i=1}^{m} x_i y_i$; then the optimal solution is
$$ w = \operatorname{sign}(\langle x,y \rangle)\,\bigl[\,|\langle x,y \rangle|/m - \lambda\,\bigr]_+ . $$
                 That is, the solution will be zero unless the correlation between the feature x and
                 the labels vector y is larger than λ.
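To make this concrete, the following NumPy sketch (my own illustration; the data and the names x, y, lam are not from the text) evaluates the closed-form solution and shows that a weakly correlated feature receives weight exactly zero.

```python
import numpy as np

def l1_single_feature_solution(x, y, lam):
    """Closed-form minimizer of (1/2m) sum_i (x_i*w - y_i)^2 + lam*|w|,
    assuming the normalization (1/m) sum_i x_i^2 = 1 used in the text."""
    m = len(x)
    corr = np.dot(x, y)                               # <x, y>
    return np.sign(corr) * max(abs(corr) / m - lam, 0.0)

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
x = rng.standard_normal(1000)                         # nearly uncorrelated with y
x /= np.sqrt(np.mean(x ** 2))                         # enforce (1/m) sum_i x_i^2 = 1
print(l1_single_feature_solution(x, y, lam=0.1))      # typically 0.0, since |<x,y>|/m < lam
```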
Remark 25.4. Unlike the $\ell_1$ norm, the $\ell_2$ norm does not induce sparse solutions. Indeed, consider the aforementioned problem with an $\ell_2$ regularization, namely,
$$ \operatorname*{argmin}_{w \in \mathbb{R}} \; \frac{1}{2m}\sum_{i=1}^{m} (x_i w - y_i)^2 + \lambda w^2 . $$
Then, the optimal solution is
$$ w = \frac{\langle x,y \rangle / m}{\|x\|^2/m + 2\lambda} . $$
This solution will be nonzero even if the correlation between x and y is very small. In contrast, as we have shown before, when using $\ell_1$ regularization, w will be nonzero only if the correlation between x and y is larger than the regularization parameter λ.
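For contrast, here is a companion sketch (same caveats: illustrative data and names) that evaluates the $\ell_2$ closed form on an equally weakly correlated feature; the weight comes out small but not exactly zero.

```python
import numpy as np

def l2_single_feature_solution(x, y, lam):
    """Closed-form minimizer of (1/2m) sum_i (x_i*w - y_i)^2 + lam*w^2."""
    m = len(x)
    return (np.dot(x, y) / m) / (np.dot(x, x) / m + 2 * lam)

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
x = rng.standard_normal(1000)                         # nearly uncorrelated with y
print(l2_single_feature_solution(x, y, lam=0.1))      # small, but not exactly zero
```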
Adding $\ell_1$ regularization to a linear regression problem with the squared loss yields the LASSO algorithm, defined as
$$ \operatorname*{argmin}_{w} \; \frac{1}{2m}\|Xw - y\|^2 + \lambda \|w\|_1 . \qquad (25.8) $$
Under some assumptions on the distribution and the regularization parameter λ, the LASSO will find sparse solutions (see, for example, (Zhao & Yu 2006) and the references therein). Another advantage of the $\ell_1$ norm is that a vector with low $\ell_1$ norm can be “sparsified” (see, for example, (Shalev-Shwartz, Zhang, and Srebro 2010) and the references therein).
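Equation (25.8) has no closed-form solution in general. One standard way to solve it, not discussed in the text, is iterative soft thresholding (ISTA); the sketch below is a minimal illustration, with the step size and example data chosen by me.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/2m)||Xw - y||^2 + lam*||w||_1 by iterative soft thresholding."""
    m, d = X.shape
    w = np.zeros(d)
    step = m / np.linalg.norm(X, ord=2) ** 2          # 1/L for the smooth term
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m                  # gradient of the squared-loss term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w

# Example: only the first of ten features is truly relevant.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)
print(np.round(lasso_ista(X, y, lam=0.1), 3))         # most coordinates come out exactly zero
```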


                 25.2 FEATURE MANIPULATION AND NORMALIZATION
Feature manipulation or normalization includes simple transformations that we apply to each of our original features. Such transformations may decrease the approximation or estimation errors of our hypothesis class or can yield a faster algorithm. Similarly to the problem of feature selection, here again there are no absolute “good” and “bad” transformations; rather, each transformation that we apply should be related to the learning algorithm we are going to apply on the resulting feature vector, as well as to our prior assumptions on the problem.
                    To motivate normalization, consider a linear regression problem with the
squared loss. Let $X \in \mathbb{R}^{m,d}$ be a matrix whose rows are the instance vectors and let $y \in \mathbb{R}^{m}$ be a vector of target values. Recall that ridge regression returns the vector
$$ \operatorname*{argmin}_{w} \; \frac{1}{2m}\|Xw - y\|^2 + \lambda \|w\|^2 \;=\; (2\lambda m I + X^\top X)^{-1} X^\top y . $$
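As a sanity check on this closed form, the following sketch (illustrative data, not from the text) computes the ridge solution and verifies that the gradient of the regularized objective vanishes there.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (2*lam*m*I + X^T X)^{-1} X^T y, the minimizer of
    (1/2m)||Xw - y||^2 + lam*||w||^2."""
    m, d = X.shape
    return np.linalg.solve(2 * lam * m * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)
lam, m = 0.5, 100
w = ridge_closed_form(X, y, lam)

# The gradient (1/m) X^T (Xw - y) + 2*lam*w of the objective vanishes at w.
grad = X.T @ (X @ w - y) / m + 2 * lam * w
print(np.allclose(grad, 0.0))                         # True
```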
Suppose that d = 2 and the underlying data distribution is as follows. First we sample y uniformly at random from {±1}. Then, we set $x_1$ to be $y + 0.5\alpha$, where α is sampled uniformly at random from {±1}, and we set $x_2$ to be $0.0001 y$. Note that the optimal