For simplicity let us assume that $\frac{1}{m}\|x\|^2 = 1$, and denote $\langle x,y\rangle = \sum_{i=1}^{m} x_i y_i$; then the optimal solution is

$$ w = \operatorname{sign}(\langle x,y\rangle)\,\big[\,|\langle x,y\rangle|/m - \lambda\,\big]_+ . $$
That is, the solution will be zero unless the correlation between the feature x and
the labels vector y is larger than λ.
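To make this closed form concrete, here is a minimal NumPy sketch; the data and the helper name `l1_single_feature_solution` are illustrative and not from the text, and the feature is rescaled so that the normalization assumption above holds.

```python
import numpy as np

def l1_single_feature_solution(x, y, lam):
    """Closed-form minimizer of (1/(2m)) * sum_i (x_i*w - y_i)^2 + lam*|w|,
    assuming the feature is normalized so that (1/m) * ||x||^2 == 1."""
    m = len(x)
    corr = np.dot(x, y)                      # <x, y>
    return np.sign(corr) * max(abs(corr) / m - lam, 0.0)

# Illustrative data: a feature that is only weakly correlated with the labels.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
x = rng.standard_normal(1000)
x = x / np.sqrt(np.mean(x ** 2))             # enforce (1/m) * ||x||^2 = 1

# Here |<x,y>|/m is typically well below lam, so the returned weight is zero.
print(l1_single_feature_solution(x, y, lam=0.1))
```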
Remark 25.4. Unlike the ℓ₁ norm, the ℓ₂ norm does not induce sparse solutions. Indeed, consider the aforementioned problem with an ℓ₂ regularization, namely,

$$ \operatorname*{argmin}_{w\in\mathbb{R}}\ \frac{1}{2m}\sum_{i=1}^{m}(x_i w - y_i)^2 + \lambda w^2 . $$
Then, the optimal solution is
$$ w = \frac{\langle x,y\rangle/m}{\|x\|^2/m + 2\lambda}\, . $$
This solution will be nonzero even if the correlation between x and y is very small. In contrast, as we have shown before, when using ℓ₁ regularization, w will be nonzero only if the correlation between x and y is larger than the regularization parameter λ.
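For contrast, the same kind of sketch for the ℓ₂ closed form above (again with illustrative data, regenerated so the snippet is self-contained): even though the correlation is far below λ, the returned weight is small but nonzero, whereas the ℓ₁ solution above is exactly zero.

```python
import numpy as np

def l2_single_feature_solution(x, y, lam):
    """Closed-form minimizer of (1/(2m)) * sum_i (x_i*w - y_i)^2 + lam*w^2."""
    m = len(x)
    return (np.dot(x, y) / m) / (np.mean(x ** 2) + 2 * lam)

# Same illustrative setup as before: a weakly correlated feature.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
x = rng.standard_normal(1000)
x = x / np.sqrt(np.mean(x ** 2))

print(l2_single_feature_solution(x, y, lam=0.1))   # small but nonzero
```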
Adding ℓ₁ regularization to a linear regression problem with the squared loss yields the LASSO algorithm, defined as

$$ \operatorname*{argmin}_{w}\ \frac{1}{2m}\|Xw - y\|^2 + \lambda\|w\|_1 . \qquad (25.8) $$
Under some assumptions on the distribution and the regularization parameter λ, the LASSO will find sparse solutions (see, for example, (Zhao & Yu 2006) and the references therein). Another advantage of the ℓ₁ norm is that a vector with low ℓ₁ norm can be "sparsified" (see, for example, (Shalev-Shwartz, Zhang & Srebro 2010) and the references therein).
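As a hedged illustration of (25.8) on made-up data, scikit-learn's `Lasso` estimator minimizes this same objective (its `alpha` parameter plays the role of λ; the intercept is disabled here to match the formula exactly), so the induced sparsity can be observed directly.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 3 of 50 features are relevant.
rng = np.random.default_rng(0)
m, d = 200, 50
X = rng.standard_normal((m, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(m)

# scikit-learn's Lasso minimizes (1/(2m)) * ||Xw - y||^2 + alpha * ||w||_1,
# which matches Equation (25.8) with alpha playing the role of lambda.
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print("nonzero coordinates:", np.flatnonzero(lasso.coef_))
```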
25.2 FEATURE MANIPULATION AND NORMALIZATION
Feature manipulation or normalization includes simple transformations that we apply to each of our original features. Such transformations may decrease the approximation or estimation error of our hypothesis class or may yield a faster algorithm. As with the problem of feature selection, here again there are no absolute "good" and "bad" transformations; rather, each transformation that we apply should be related to the learning algorithm we are going to run on the resulting feature vector, as well as to our prior assumptions on the problem.
To motivate normalization, consider a linear regression problem with the squared loss. Let X ∈ ℝ^{m,d} be a matrix whose rows are the instance vectors and let y ∈ ℝ^m be a vector of target values. Recall that ridge regression returns the vector

$$ \operatorname*{argmin}_{w}\ \frac{1}{2m}\|Xw - y\|^2 + \lambda\|w\|^2 \;=\; (2\lambda m I + X^\top X)^{-1} X^\top y . $$
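A short NumPy sketch (with made-up data, not the distribution described next) of this closed form, computing (2λmI + XᵀX)⁻¹Xᵀy directly:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution (2*lam*m*I + X^T X)^{-1} X^T y,
    the minimizer of (1/(2m)) * ||Xw - y||^2 + lam * ||w||^2."""
    m, d = X.shape
    return np.linalg.solve(2.0 * lam * m * np.eye(d) + X.T @ X, X.T @ y)

# Illustrative data (not from the book's example).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(ridge_closed_form(X, y, lam=0.01))   # close to w_true for small lambda
```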
Suppose that d = 2 and the underlying data distribution is as follows. First we sample y uniformly at random from {±1}. Then, we set x₁ to be y + 0.5α, where α is sampled uniformly at random from {±1}, and we set x₂ to be 0.0001y. Note that the optimal