
Kernel Methods


for which $K(x,x') = \langle \psi(x),\psi(x') \rangle$. For simplicity, denote $x_0 = x'_0 = 1$. Then, we have

\[
\begin{aligned}
K(x,x') &= (1 + \langle x,x' \rangle)^k = (1 + \langle x,x' \rangle) \cdots (1 + \langle x,x' \rangle) \\
&= \Biggl( \sum_{j=0}^{n} x_j x'_j \Biggr) \cdots \Biggl( \sum_{j=0}^{n} x_j x'_j \Biggr) \\
&= \sum_{J \in \{0,1,\ldots,n\}^k} \; \prod_{i=1}^{k} x_{J_i} x'_{J_i} \\
&= \sum_{J \in \{0,1,\ldots,n\}^k} \; \prod_{i=1}^{k} x_{J_i} \prod_{i=1}^{k} x'_{J_i}.
\end{aligned}
\]

Now, if we define $\psi : \mathbb{R}^n \to \mathbb{R}^{(n+1)^k}$ such that for $J \in \{0,1,\ldots,n\}^k$ there is an element of $\psi(x)$ that equals $\prod_{i=1}^{k} x_{J_i}$, we obtain that

\[
K(x,x') = \langle \psi(x),\psi(x') \rangle.
\]
                 Since ψ contains all the monomials up to degree k, a halfspace over the range
                 of ψ corresponds to a polynomial predictor of degree k over the original space.
                 Hence, learning a halfspace with a k degree polynomial kernel enables us to learn
                 polynomial predictors of degree k over the original space.
   Note that here the complexity of implementing $K$ is $O(n)$ while the dimension of the feature space is on the order of $n^k$.
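To make the identity above concrete, here is a small numerical sketch (a Python/NumPy illustration, not from the text; the helper names are hypothetical) that enumerates the $(n+1)^k$ monomial features for a tiny $n$ and $k$ and checks that their inner product matches the direct $O(n)$ evaluation of $(1 + \langle x,x' \rangle)^k$.

    # Sketch: compare the explicit feature map with the direct kernel evaluation.
    import itertools
    import numpy as np

    def poly_kernel(x, xp, k):
        # Direct O(n) evaluation of the degree-k polynomial kernel.
        return (1.0 + np.dot(x, xp)) ** k

    def explicit_features(x, k):
        # psi: R^n -> R^{(n+1)^k}; the entry indexed by J is prod_{i=1}^k x_{J_i},
        # with x_0 = 1 appended to account for the constant term.
        x_ext = np.concatenate(([1.0], x))
        idx = range(len(x_ext))
        return np.array([np.prod([x_ext[j] for j in J])
                         for J in itertools.product(idx, repeat=k)])

    rng = np.random.default_rng(0)
    x, xp = rng.normal(size=3), rng.normal(size=3)             # n = 3
    k = 2
    print(poly_kernel(x, xp, k))                               # O(n) evaluation
    print(explicit_features(x, k) @ explicit_features(xp, k))  # (n+1)^k = 16 features
    # The two printed numbers agree up to floating-point error.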
Example 16.2 (Gaussian Kernel). Let the original instance space be $\mathbb{R}$ and consider the mapping $\psi$ where for each nonnegative integer $n \ge 0$ there exists an element $\psi(x)_n$ that equals $\frac{1}{\sqrt{n!}} e^{-\frac{x^2}{2}} x^n$. Then,

\[
\begin{aligned}
\langle \psi(x),\psi(x') \rangle &= \sum_{n=0}^{\infty} \left( \frac{1}{\sqrt{n!}} e^{-\frac{x^2}{2}} x^n \right) \left( \frac{1}{\sqrt{n!}} e^{-\frac{(x')^2}{2}} (x')^n \right) \\
&= e^{-\frac{x^2 + (x')^2}{2}} \sum_{n=0}^{\infty} \frac{(x x')^n}{n!} \\
&= e^{-\frac{\|x - x'\|^2}{2}}.
\end{aligned}
\]
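As a quick sanity check on this derivation (a sketch, not from the text; assumes Python with NumPy and an illustrative helper name), one can truncate the infinite-dimensional map $\psi$ after $N$ coordinates and watch the truncated inner product approach $e^{-(x-x')^2/2}$.

    # Sketch: truncate psi(x)_n = x^n e^{-x^2/2} / sqrt(n!) after N coordinates.
    import math
    import numpy as np

    def psi_truncated(x, N):
        # First N coordinates of the infinite-dimensional feature map.
        return np.array([x**n * math.exp(-x**2 / 2) / math.sqrt(math.factorial(n))
                         for n in range(N)])

    x, xp = 0.7, -0.4
    approx = psi_truncated(x, 30) @ psi_truncated(xp, 30)
    exact = math.exp(-(x - xp)**2 / 2)
    print(approx, exact)   # the truncated sum converges rapidly to the exact value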
                 Here the feature space is of infinite dimension while evaluating the kernel is very
simple. More generally, given a scalar $\sigma > 0$, the Gaussian kernel is defined to be

\[
K(x,x') = e^{-\frac{\|x - x'\|^2}{2\sigma}}.
\]
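The effect of $\sigma$ can be seen numerically. The following sketch (again Python/NumPy, not from the text) evaluates the kernel for one fixed pair of points under several scales: the same squared distance yields a value near 0 when $\sigma$ is small and near 1 when $\sigma$ is large.

    # Sketch: the scale parameter sigma decides what counts as "close."
    import numpy as np

    def gaussian_kernel(x, xp, sigma):
        # K(x, x') = exp(-||x - x'||^2 / (2 * sigma)), as defined above.
        return np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / (2 * sigma))

    x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # squared distance 2
    for sigma in (0.1, 1.0, 10.0):
        print(sigma, gaussian_kernel(x, xp, sigma))
    # sigma = 0.1  -> about 4.5e-05  (the points look far apart)
    # sigma = 1.0  -> about 0.37
    # sigma = 10.0 -> about 0.90    (the points look close)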
   Intuitively, the Gaussian kernel sets the inner product in the feature space between x, x' to be close to zero if the instances are far away from each other (in the original domain) and close to 1 if they are close. σ is a parameter that controls the scale determining what we mean by “close.” It is easy to verify that K implements an inner product in a space in which for any n and any monomial of order k