
Kernel Methods


for which $K(x,x') = \langle \psi(x),\psi(x') \rangle$. For simplicity, denote $x_0 = x'_0 = 1$. Then, we have

\[
\begin{aligned}
K(x,x') &= (1 + \langle x,x' \rangle)^k = (1 + \langle x,x' \rangle) \cdots (1 + \langle x,x' \rangle) \\
&= \Biggl( \sum_{j=0}^{n} x_j x'_j \Biggr) \cdots \Biggl( \sum_{j=0}^{n} x_j x'_j \Biggr) \\
&= \sum_{J \in \{0,1,\ldots,n\}^k} \; \prod_{i=1}^{k} x_{J_i} x'_{J_i} \\
&= \sum_{J \in \{0,1,\ldots,n\}^k} \; \prod_{i=1}^{k} x_{J_i} \prod_{i=1}^{k} x'_{J_i}.
\end{aligned}
\]

Now, if we define $\psi : \mathbb{R}^n \to \mathbb{R}^{(n+1)^k}$ such that for $J \in \{0,1,\ldots,n\}^k$ there is an element of $\psi(x)$ that equals $\prod_{i=1}^{k} x_{J_i}$, we obtain that

\[
K(x,x') = \langle \psi(x),\psi(x') \rangle.
\]
                 Since ψ contains all the monomials up to degree k, a halfspace over the range
                 of ψ corresponds to a polynomial predictor of degree k over the original space.
                 Hence, learning a halfspace with a k degree polynomial kernel enables us to learn
                 polynomial predictors of degree k over the original space.
   Note that here the complexity of implementing $K$ is $O(n)$ while the dimension of the feature space is on the order of $n^k$.
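To make the identity above concrete, here is a small numerical sketch (a Python/NumPy illustration, not from the text; the helper names are hypothetical) that enumerates the $(n+1)^k$ monomial features for a tiny $n$ and $k$ and checks that their inner product matches the direct $O(n)$ evaluation of $(1 + \langle x,x' \rangle)^k$.

    # Sketch: compare the explicit feature map with the direct kernel evaluation.
    import itertools
    import numpy as np

    def poly_kernel(x, xp, k):
        # Direct O(n) evaluation of the degree-k polynomial kernel.
        return (1.0 + np.dot(x, xp)) ** k

    def explicit_features(x, k):
        # psi: R^n -> R^{(n+1)^k}; the entry indexed by J is prod_{i=1}^k x_{J_i},
        # with x_0 = 1 appended to account for the constant term.
        x_ext = np.concatenate(([1.0], x))
        idx = range(len(x_ext))
        return np.array([np.prod([x_ext[j] for j in J])
                         for J in itertools.product(idx, repeat=k)])

    rng = np.random.default_rng(0)
    x, xp = rng.normal(size=3), rng.normal(size=3)             # n = 3
    k = 2
    print(poly_kernel(x, xp, k))                               # O(n) evaluation
    print(explicit_features(x, k) @ explicit_features(xp, k))  # (n+1)^k = 16 features
    # The two printed numbers agree up to floating-point error.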
Example 16.2 (Gaussian Kernel). Let the original instance space be $\mathbb{R}$ and consider the mapping $\psi$ where for each nonnegative integer $n \ge 0$ there exists an element $\psi(x)_n$ that equals $\frac{1}{\sqrt{n!}} e^{-\frac{x^2}{2}} x^n$. Then,

\[
\begin{aligned}
\langle \psi(x),\psi(x') \rangle &= \sum_{n=0}^{\infty} \left( \frac{1}{\sqrt{n!}} e^{-\frac{x^2}{2}} x^n \right) \left( \frac{1}{\sqrt{n!}} e^{-\frac{(x')^2}{2}} (x')^n \right) \\
&= e^{-\frac{x^2 + (x')^2}{2}} \sum_{n=0}^{\infty} \frac{(x x')^n}{n!} \\
&= e^{-\frac{\|x - x'\|^2}{2}}.
\end{aligned}
\]
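As a quick sanity check on this derivation (a sketch, not from the text; assumes Python with NumPy and an illustrative helper name), one can truncate the infinite-dimensional map $\psi$ after $N$ coordinates and watch the truncated inner product approach $e^{-(x-x')^2/2}$.

    # Sketch: truncate psi(x)_n = x^n e^{-x^2/2} / sqrt(n!) after N coordinates.
    import math
    import numpy as np

    def psi_truncated(x, N):
        # First N coordinates of the infinite-dimensional feature map.
        return np.array([x**n * math.exp(-x**2 / 2) / math.sqrt(math.factorial(n))
                         for n in range(N)])

    x, xp = 0.7, -0.4
    approx = psi_truncated(x, 30) @ psi_truncated(xp, 30)
    exact = math.exp(-(x - xp)**2 / 2)
    print(approx, exact)   # the truncated sum converges rapidly to the exact value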
                 Here the feature space is of infinite dimension while evaluating the kernel is very
simple. More generally, given a scalar $\sigma > 0$, the Gaussian kernel is defined to be

\[
K(x,x') = e^{-\frac{\|x - x'\|^2}{2\sigma}}.
\]
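The effect of $\sigma$ can be seen numerically. The following sketch (again Python/NumPy, not from the text) evaluates the kernel for one fixed pair of points under several scales: the same squared distance yields a value near 0 when $\sigma$ is small and near 1 when $\sigma$ is large.

    # Sketch: the scale parameter sigma decides what counts as "close."
    import numpy as np

    def gaussian_kernel(x, xp, sigma):
        # K(x, x') = exp(-||x - x'||^2 / (2 * sigma)), as defined above.
        return np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / (2 * sigma))

    x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # squared distance 2
    for sigma in (0.1, 1.0, 10.0):
        print(sigma, gaussian_kernel(x, xp, sigma))
    # sigma = 0.1  -> about 4.5e-05  (the points look far apart)
    # sigma = 1.0  -> about 0.37
    # sigma = 10.0 -> about 0.90    (the points look close)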
   Intuitively, the Gaussian kernel sets the inner product in the feature space between x, x' to be close to zero if the instances are far away from each other (in the original domain) and close to 1 if they are close. σ is a parameter that controls the scale determining what we mean by “close.” It is easy to verify that K implements an inner product in a space in which for any n and any monomial of order k