with respect to this class, given a training set $S$, and using the homogenous version of $L_d$, is to find
$$
\operatorname*{argmin}_{\mathbf{w}} L_S(h_{\mathbf{w}}) \;=\; \operatorname*{argmin}_{\mathbf{w}} \frac{1}{m}\sum_{i=1}^{m} \bigl(\langle \mathbf{w},\mathbf{x}_i\rangle - y_i\bigr)^2 .
$$
To solve the problem we calculate the gradient of the objective function and
compare it to zero. That is, we need to solve
$$
\frac{2}{m}\sum_{i=1}^{m} \bigl(\langle \mathbf{w},\mathbf{x}_i\rangle - y_i\bigr)\,\mathbf{x}_i \;=\; 0 .
$$
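As a quick illustration of this step, here is a minimal NumPy sketch (the toy data and all variable names are our own, not from the text) that compares the closed-form gradient above with a finite-difference estimate at an arbitrary point:

```python
import numpy as np

# Toy data and an arbitrary point w (all our own choices, for illustration only).
rng = np.random.default_rng(4)
m, d = 6, 3
X = rng.normal(size=(d, m))      # columns are the instances x_i
y = rng.normal(size=m)
w = rng.normal(size=d)

def loss(w):
    # L_S(h_w) = (1/m) * sum_i (<w, x_i> - y_i)^2
    return np.mean((X.T @ w - y) ** 2)

# Closed-form gradient: (2/m) * sum_i (<w, x_i> - y_i) x_i.
grad = (2.0 / m) * X @ (X.T @ w - y)

# Central finite-difference estimate of the same gradient.
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
assert np.allclose(grad, fd, atol=1e-5)
```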
We can rewrite the problem as the linear system $A\mathbf{w} = \mathbf{b}$, where
$$
A = \left(\sum_{i=1}^{m} \mathbf{x}_i \mathbf{x}_i^{\top}\right) \quad\text{and}\quad \mathbf{b} = \sum_{i=1}^{m} y_i\,\mathbf{x}_i . \tag{9.6}
$$
Or, in matrix form:
$$
A = \begin{pmatrix} \vdots & & \vdots \\ \mathbf{x}_1 & \cdots & \mathbf{x}_m \\ \vdots & & \vdots \end{pmatrix}
\begin{pmatrix} \vdots & & \vdots \\ \mathbf{x}_1 & \cdots & \mathbf{x}_m \\ \vdots & & \vdots \end{pmatrix}^{\!\top} , \tag{9.7}
$$
$$
\mathbf{b} = \begin{pmatrix} \vdots & & \vdots \\ \mathbf{x}_1 & \cdots & \mathbf{x}_m \\ \vdots & & \vdots \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} . \tag{9.8}
$$
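To make Eqs. (9.6)–(9.8) concrete, here is a small NumPy sketch (toy data and names of our own choosing) checking that the summation form and the matrix form produce the same $A$ and $\mathbf{b}$:

```python
import numpy as np

# Toy data (our own, for illustration): m instances x_i in R^d and targets y_i.
rng = np.random.default_rng(0)
m, d = 5, 3
X = rng.normal(size=(d, m))   # column i of X is the instance x_i
y = rng.normal(size=m)        # targets y_1, ..., y_m

# Summation form, Eq. (9.6): A = sum_i x_i x_i^T,  b = sum_i y_i x_i.
A = sum(np.outer(X[:, i], X[:, i]) for i in range(m))
b = sum(y[i] * X[:, i] for i in range(m))

# Matrix form, Eqs. (9.7)-(9.8): A = X X^T,  b = X y.
assert np.allclose(A, X @ X.T)
assert np.allclose(b, X @ y)
```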
If $A$ is invertible then the solution to the ERM problem is
$$
\mathbf{w} = A^{-1}\mathbf{b} .
$$
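A minimal sketch of the invertible case (again with arbitrary toy data of our own; we use np.linalg.solve rather than forming $A^{-1}$ explicitly, which is the standard, numerically safer choice):

```python
import numpy as np

# Toy data (our own choice) with m >= d, so that A is invertible for generic data.
rng = np.random.default_rng(1)
m, d = 50, 3
X = rng.normal(size=(d, m))          # columns are the instances x_i
y = rng.normal(size=m)

A = X @ X.T                          # Eq. (9.7)
b = X @ y                            # Eq. (9.8)

# Invertible case: w = A^{-1} b, computed without forming A^{-1} explicitly.
w = np.linalg.solve(A, b)

# Sanity check: w also minimizes ||X^T w - y||^2, the least squares objective.
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)
assert np.allclose(w, w_lstsq)
```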
The case in which $A$ is not invertible requires a few standard tools from linear algebra, which are available in Appendix C. It can be easily shown that if the training instances do not span the entire space of $\mathbb{R}^d$ then $A$ is not invertible. Nevertheless, we can always find a solution to the system $A\mathbf{w} = \mathbf{b}$ because $\mathbf{b}$ is in the range of $A$. Indeed, since $A$ is symmetric we can write it using its eigenvalue decomposition as $A = V D V^{\top}$, where $D$ is a diagonal matrix and $V$ is an orthonormal matrix (that is, $V^{\top}V$ is the identity $d \times d$ matrix). Define $D^{+}$ to be the diagonal matrix such that $D^{+}_{i,i} = 0$ if $D_{i,i} = 0$ and otherwise $D^{+}_{i,i} = 1/D_{i,i}$. Now, define
$$
A^{+} = V D^{+} V^{\top} \quad\text{and}\quad \hat{\mathbf{w}} = A^{+}\mathbf{b} .
$$
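The construction of $A^{+}$ and $\hat{\mathbf{w}}$ can be carried out directly from an eigendecomposition; the following sketch (our own toy setup, with a tolerance for numerically zero eigenvalues) mirrors the definition above and checks it against NumPy's built-in pseudoinverse:

```python
import numpy as np

# Toy data (our own choice) with fewer examples than dimensions, so A is singular.
rng = np.random.default_rng(2)
d, m = 4, 3
X = rng.normal(size=(d, m))
y = rng.normal(size=m)
A, b = X @ X.T, X @ y

# Eigendecomposition A = V D V^T (valid since A is symmetric).
eigvals, V = np.linalg.eigh(A)

# D^+: invert the nonzero eigenvalues, keep the (numerically) zero ones at zero.
tol = 1e-10
d_plus = np.zeros_like(eigvals)
nonzero = np.abs(eigvals) > tol
d_plus[nonzero] = 1.0 / eigvals[nonzero]

A_plus = V @ np.diag(d_plus) @ V.T   # A^+ = V D^+ V^T
w_hat = A_plus @ b                   # \hat{w} = A^+ b

# The construction agrees with NumPy's Moore-Penrose pseudoinverse.
assert np.allclose(A_plus, np.linalg.pinv(A))
```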
Let $\mathbf{v}_i$ denote the $i$'th column of $V$. Then, we have
$$
A\hat{\mathbf{w}} = A A^{+}\mathbf{b} = V D V^{\top} V D^{+} V^{\top}\mathbf{b} = V D D^{+} V^{\top}\mathbf{b} = \sum_{i : D_{i,i}\neq 0} \mathbf{v}_i \mathbf{v}_i^{\top}\,\mathbf{b} .
$$
That is, $A\hat{\mathbf{w}}$ is the projection of $\mathbf{b}$ onto the span of those vectors $\mathbf{v}_i$ for which $D_{i,i} \neq 0$. Since the linear span of $\mathbf{x}_1,\ldots,\mathbf{x}_m$ is the same as the linear span of those $\mathbf{v}_i$, and $\mathbf{b}$ is in the linear span of the $\mathbf{x}_i$, we obtain that $A\hat{\mathbf{w}} = \mathbf{b}$, which concludes our argument.
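Finally, a short numerical check of this conclusion (toy data of our own, chosen so that the $\mathbf{x}_i$ do not span $\mathbb{R}^d$ and $A$ is singular): even then $A\hat{\mathbf{w}} = \mathbf{b}$ holds, because $\mathbf{b}$ lies in the range of $A$.

```python
import numpy as np

# Toy data (our own choice): only m = 2 instances in R^5, so they cannot span R^d.
rng = np.random.default_rng(3)
d, m = 5, 2
X = rng.normal(size=(d, m))
y = rng.normal(size=m)
A, b = X @ X.T, X @ y

w_hat = np.linalg.pinv(A) @ b        # \hat{w} = A^+ b

# A is singular, yet A w_hat = b, because b = X y lies in the range of A.
assert np.linalg.matrix_rank(A) < d
assert np.allclose(A @ w_hat, b)
```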