Linear Predictors
with respect to this class, given a training set S, and using the homogeneous version of $L_d$, is to find
\[
\operatorname*{argmin}_{w}\, L_S(h_w) \;=\; \operatorname*{argmin}_{w}\; \frac{1}{m}\sum_{i=1}^{m} \bigl(\langle w, x_i\rangle - y_i\bigr)^2 .
\]
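As an illustration (not from the book), here is a minimal NumPy sketch of this empirical risk; the names X, y, and w are our own, with X a d × m matrix whose columns are the instances $x_i$:

```python
import numpy as np

def squared_loss_risk(w, X, y):
    """Empirical risk L_S(h_w) = (1/m) * sum_i (<w, x_i> - y_i)^2.

    X: d x m matrix whose i-th column is the instance x_i.
    y: length-m vector of targets.
    w: length-d weight vector of the homogeneous linear predictor.
    """
    m = X.shape[1]
    residuals = X.T @ w - y   # entry i equals <w, x_i> - y_i
    return float(np.sum(residuals ** 2)) / m
```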
                 To solve the problem we calculate the gradient of the objective function and
                 compare it to zero. That is, we need to solve
\[
\frac{2}{m}\sum_{i=1}^{m}\bigl(\langle w, x_i\rangle - y_i\bigr)\, x_i \;=\; 0 .
\]
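Spelled out, the gradient of each summand (a step the displayed condition relies on) is
\[
\nabla_w \bigl(\langle w, x_i\rangle - y_i\bigr)^2 \;=\; 2\bigl(\langle w, x_i\rangle - y_i\bigr)\, x_i ,
\]
so summing over $i$ and dividing by $m$ gives the condition above.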
We can rewrite this as the linear system Aw = b, where
\[
A \;=\; \sum_{i=1}^{m} x_i x_i^{\top}
\qquad\text{and}\qquad
b \;=\; \sum_{i=1}^{m} y_i\, x_i .
\tag{9.6}
\]
Or, in matrix form:
\[
A \;=\;
\begin{pmatrix}
\vdots & & \vdots \\
x_1 & \cdots & x_m \\
\vdots & & \vdots
\end{pmatrix}
\begin{pmatrix}
\vdots & & \vdots \\
x_1 & \cdots & x_m \\
\vdots & & \vdots
\end{pmatrix}^{\!\top} ,
\tag{9.7}
\]
\[
b \;=\;
\begin{pmatrix}
\vdots & & \vdots \\
x_1 & \cdots & x_m \\
\vdots & & \vdots
\end{pmatrix}
\begin{pmatrix}
y_1 \\
\vdots \\
y_m
\end{pmatrix} .
\tag{9.8}
\]
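A quick NumPy check (ours, not the book's) that the matrix forms (9.7) and (9.8) agree with the summation forms in (9.6):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 5
X = rng.normal(size=(d, m))   # columns are x_1, ..., x_m
y = rng.normal(size=m)

# Summation forms from (9.6)
A_sum = sum(np.outer(X[:, i], X[:, i]) for i in range(m))
b_sum = sum(y[i] * X[:, i] for i in range(m))

# Matrix forms from (9.7) and (9.8)
A = X @ X.T
b = X @ y

assert np.allclose(A, A_sum) and np.allclose(b, b_sum)
```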
                    If A is invertible then the solution to the ERM problem is
\[
w \;=\; A^{-1} b .
\]
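Continuing the sketch above (with m ≥ d and instances that span $\mathbb{R}^d$, so A is invertible), the ERM solution can be computed by solving the linear system rather than forming the inverse explicitly:

```python
# w = A^{-1} b, computed without explicitly inverting A
w = np.linalg.solve(A, b)
```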
The case in which A is not invertible requires a few standard tools from linear algebra, which are available in Appendix C. It can be easily shown that if the training instances do not span the entire space of $\mathbb{R}^d$ then A is not invertible. Nevertheless, we can always find a solution to the system Aw = b, because b is in the range of A.
Indeed, since A is symmetric we can write it using its eigenvalue decomposition as $A = V D V^{\top}$, where D is a diagonal matrix and V is an orthonormal matrix (that is, $V^{\top} V$ is the identity $d \times d$ matrix). Define $D^{+}$ to be the diagonal matrix such that $D^{+}_{i,i} = 0$ if $D_{i,i} = 0$ and otherwise $D^{+}_{i,i} = 1/D_{i,i}$. Now, define
\[
A^{+} \;=\; V D^{+} V^{\top}
\qquad\text{and}\qquad
\hat{w} \;=\; A^{+} b .
\]
Let $v_i$ denote the $i$'th column of V. Then, we have
\[
A\hat{w} \;=\; A A^{+} b \;=\; V D V^{\top} V D^{+} V^{\top} b \;=\; V D D^{+} V^{\top} b \;=\; \sum_{i:\, D_{i,i}\neq 0} v_i v_i^{\top} b .
\]
That is, $A\hat{w}$ is the projection of b onto the span of those vectors $v_i$ for which $D_{i,i} \neq 0$. Since the linear span of $x_1,\ldots,x_m$ is the same as the linear span of those $v_i$, and b is in the linear span of the $x_i$, we obtain that $A\hat{w} = b$, which concludes our argument.
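The argument can also be checked numerically. A minimal NumPy sketch, assuming m < d so that the $x_i$ cannot span $\mathbb{R}^d$ and A is singular; the variable names and tolerance are our own:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 3                      # fewer instances than dimensions: A is singular
X = rng.normal(size=(d, m))      # columns are x_1, ..., x_m
y = rng.normal(size=m)

A = X @ X.T                      # rank at most m < d
b = X @ y                        # b lies in the range of A

# Eigenvalue decomposition A = V D V^T (A is symmetric, so eigh applies)
eigvals, V = np.linalg.eigh(A)

# Build D^+: invert the nonzero eigenvalues, keep zeros elsewhere
tol = 1e-10
d_plus = np.zeros_like(eigvals)
nonzero = np.abs(eigvals) > tol
d_plus[nonzero] = 1.0 / eigvals[nonzero]

A_plus = V @ np.diag(d_plus) @ V.T
w_hat = A_plus @ b

# A w_hat projects b onto span{v_i : D_ii != 0}; since b is already in that span, A w_hat = b
assert np.allclose(A @ w_hat, b)
```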