with respect to this class, given a training set $S$, and using the homogenous version of $L_d$, is to find
$$
\operatorname*{argmin}_{\mathbf{w}} L_S(h_{\mathbf{w}}) \;=\; \operatorname*{argmin}_{\mathbf{w}} \frac{1}{m}\sum_{i=1}^{m} \bigl(\langle \mathbf{w},\mathbf{x}_i\rangle - y_i\bigr)^2 .
$$
To solve the problem we calculate the gradient of the objective function and
compare it to zero. That is, we need to solve
$$
\frac{2}{m}\sum_{i=1}^{m} \bigl(\langle \mathbf{w},\mathbf{x}_i\rangle - y_i\bigr)\,\mathbf{x}_i \;=\; 0 .
$$
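As a quick illustration of this step, here is a minimal NumPy sketch (the toy data and all variable names are our own, not from the text) that compares the closed-form gradient above with a finite-difference estimate at an arbitrary point:

```python
import numpy as np

# Toy data and an arbitrary point w (all our own choices, for illustration only).
rng = np.random.default_rng(4)
m, d = 6, 3
X = rng.normal(size=(d, m))      # columns are the instances x_i
y = rng.normal(size=m)
w = rng.normal(size=d)

def loss(w):
    # L_S(h_w) = (1/m) * sum_i (<w, x_i> - y_i)^2
    return np.mean((X.T @ w - y) ** 2)

# Closed-form gradient: (2/m) * sum_i (<w, x_i> - y_i) x_i.
grad = (2.0 / m) * X @ (X.T @ w - y)

# Central finite-difference estimate of the same gradient.
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
assert np.allclose(grad, fd, atol=1e-5)
```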
We can rewrite the problem as the linear system $A\mathbf{w} = \mathbf{b}$, where
$$
A = \left(\sum_{i=1}^{m} \mathbf{x}_i \mathbf{x}_i^{\top}\right) \quad\text{and}\quad \mathbf{b} = \sum_{i=1}^{m} y_i\,\mathbf{x}_i . \tag{9.6}
$$
Or, in matrix form:
$$
A = \begin{pmatrix} \vdots & & \vdots \\ \mathbf{x}_1 & \cdots & \mathbf{x}_m \\ \vdots & & \vdots \end{pmatrix}
\begin{pmatrix} \vdots & & \vdots \\ \mathbf{x}_1 & \cdots & \mathbf{x}_m \\ \vdots & & \vdots \end{pmatrix}^{\!\top} , \tag{9.7}
$$
$$
\mathbf{b} = \begin{pmatrix} \vdots & & \vdots \\ \mathbf{x}_1 & \cdots & \mathbf{x}_m \\ \vdots & & \vdots \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} . \tag{9.8}
$$
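To make Eqs. (9.6)–(9.8) concrete, here is a small NumPy sketch (toy data and names of our own choosing) checking that the summation form and the matrix form produce the same $A$ and $\mathbf{b}$:

```python
import numpy as np

# Toy data (our own, for illustration): m instances x_i in R^d and targets y_i.
rng = np.random.default_rng(0)
m, d = 5, 3
X = rng.normal(size=(d, m))   # column i of X is the instance x_i
y = rng.normal(size=m)        # targets y_1, ..., y_m

# Summation form, Eq. (9.6): A = sum_i x_i x_i^T,  b = sum_i y_i x_i.
A = sum(np.outer(X[:, i], X[:, i]) for i in range(m))
b = sum(y[i] * X[:, i] for i in range(m))

# Matrix form, Eqs. (9.7)-(9.8): A = X X^T,  b = X y.
assert np.allclose(A, X @ X.T)
assert np.allclose(b, X @ y)
```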
If $A$ is invertible then the solution to the ERM problem is
$$
\mathbf{w} = A^{-1}\mathbf{b} .
$$
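A minimal sketch of the invertible case (again with arbitrary toy data of our own; we use np.linalg.solve rather than forming $A^{-1}$ explicitly, which is the standard, numerically safer choice):

```python
import numpy as np

# Toy data (our own choice) with m >= d, so that A is invertible for generic data.
rng = np.random.default_rng(1)
m, d = 50, 3
X = rng.normal(size=(d, m))          # columns are the instances x_i
y = rng.normal(size=m)

A = X @ X.T                          # Eq. (9.7)
b = X @ y                            # Eq. (9.8)

# Invertible case: w = A^{-1} b, computed without forming A^{-1} explicitly.
w = np.linalg.solve(A, b)

# Sanity check: w also minimizes ||X^T w - y||^2, the least squares objective.
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
assert np.allclose(w, w_lstsq)
```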
The case in which $A$ is not invertible requires a few standard tools from linear algebra, which are available in Appendix C. It can be easily shown that if the training instances do not span the entire space of $\mathbb{R}^d$ then $A$ is not invertible. Nevertheless, we can always find a solution to the system $A\mathbf{w} = \mathbf{b}$ because $\mathbf{b}$ is in the range of $A$. Indeed, since $A$ is symmetric we can write it using its eigenvalue decomposition as $A = V D V^{\top}$, where $D$ is a diagonal matrix and $V$ is an orthonormal matrix (that is, $V^{\top}V$ is the identity $d \times d$ matrix). Define $D^{+}$ to be the diagonal matrix such that $D^{+}_{i,i} = 0$ if $D_{i,i} = 0$ and otherwise $D^{+}_{i,i} = 1/D_{i,i}$. Now, define
$$
A^{+} = V D^{+} V^{\top} \quad\text{and}\quad \hat{\mathbf{w}} = A^{+}\mathbf{b} .
$$
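The construction of $A^{+}$ and $\hat{\mathbf{w}}$ can be carried out directly from an eigendecomposition; the following sketch (our own toy setup, with a tolerance for numerically zero eigenvalues) mirrors the definition above and checks it against NumPy's built-in pseudoinverse:

```python
import numpy as np

# Toy data (our own choice) with fewer examples than dimensions, so A is singular.
rng = np.random.default_rng(2)
d, m = 4, 3
X = rng.normal(size=(d, m))
y = rng.normal(size=m)
A, b = X @ X.T, X @ y

# Eigendecomposition A = V D V^T (valid since A is symmetric).
eigvals, V = np.linalg.eigh(A)

# D^+: invert the nonzero eigenvalues, keep the (numerically) zero ones at zero.
tol = 1e-10
d_plus = np.zeros_like(eigvals)
nonzero = np.abs(eigvals) > tol
d_plus[nonzero] = 1.0 / eigvals[nonzero]

A_plus = V @ np.diag(d_plus) @ V.T   # A^+ = V D^+ V^T
w_hat = A_plus @ b                   # \hat{w} = A^+ b

# The construction agrees with NumPy's Moore-Penrose pseudoinverse.
assert np.allclose(A_plus, np.linalg.pinv(A))
```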
Let $\mathbf{v}_i$ denote the $i$'th column of $V$. Then, we have
$$
A\hat{\mathbf{w}} = A A^{+}\mathbf{b} = V D V^{\top} V D^{+} V^{\top}\mathbf{b} = V D D^{+} V^{\top}\mathbf{b} = \sum_{i : D_{i,i}\neq 0} \mathbf{v}_i \mathbf{v}_i^{\top}\,\mathbf{b} .
$$
That is, $A\hat{\mathbf{w}}$ is the projection of $\mathbf{b}$ onto the span of those vectors $\mathbf{v}_i$ for which $D_{i,i} \neq 0$. Since the linear span of $\mathbf{x}_1,\ldots,\mathbf{x}_m$ is the same as the linear span of those $\mathbf{v}_i$, and $\mathbf{b}$ is in the linear span of the $\mathbf{x}_i$, we obtain that $A\hat{\mathbf{w}} = \mathbf{b}$, which concludes our argument.
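Finally, a short numerical check of this conclusion (toy data of our own, chosen so that the $\mathbf{x}_i$ do not span $\mathbb{R}^d$ and $A$ is singular): even then $A\hat{\mathbf{w}} = \mathbf{b}$ holds, because $\mathbf{b}$ lies in the range of $A$.

```python
import numpy as np

# Toy data (our own choice): only m = 2 instances in R^5, so they cannot span R^d.
rng = np.random.default_rng(3)
d, m = 5, 2
X = rng.normal(size=(d, m))
y = rng.normal(size=m)
A, b = X @ X.T, X @ y

w_hat = np.linalg.pinv(A) @ b        # \hat{w} = A^+ b

# A is singular, yet A w_hat = b, because b = X y lies in the range of A.
assert np.linalg.matrix_rank(A) < d
assert np.allclose(A @ w_hat, b)
```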