Page 380 - Understanding Machine Learning
P. 380
Compression Bounds
362
convex hull of {x 1 ,...,x m } and has minimal norm. Then, it represents it as a convex
combination of d points in the sample (it will be shown later that this is always pos-
sible). The output of A are these d points. The algorithm B receives these d points
and set w to be the point in their convex hull of minimal norm.
Next we prove that this indeed is a compression sceme. Since the data is linearly
separable, the convex hull of {x 1 ,...,x m } does not contain the origin. Consider the
point w in this convex hull closest to the origin. (This is a unique point which is the
Euclidean projection of the origin onto this convex hull.) We claim that w separates
1
the data. To see this, assume by contradiction that w,x i ≤ 0 for some i.Take
2
w
w = (1 −α)w+αx i for α = 2 2 ∈ (0,1). Then w is also in the convex hull and
x i + w
2 2 2 2 2
w = (1 − α) w + α x i + 2α(1 − α) w,x i
2 2 2 2
≤ (1 − α) w + α x i
4 2 2 4
x i w + x i w
= 2 2 2
( w + x i )
2 2
x i w
=
2
w + x i 2
1
2
= w ·
2
2
w / x i + 1
2
< w ,
which leads to a contradiction.
We have thus shown that w is also an ERM. Finally, since w is in the convex hull
of the examples, we can apply Caratheodory’s theorem to obtain that w is also in the
convex hull of a subset of d + 1 points of the polygon. Furthermore, the minimality
of w implies that w must be on a face of the polygon and this implies it can be
represented as a convex combination of d points.
It remains to show that w is also the projection onto the polygon defined by the
d points. But this must be true: On one hand, the smaller polygon is a subset of the
larger one; hence the projection onto the smaller cannot be smaller in norm. On the
other hand, w itself is a valid solution. The uniqueness of projection concludes our
proof.
30.2.3 Separating Polynomials
d
Let X = R and consider the class x → sign(p(x)) where p is a degree r polynomial.
Note that p(x) can be rewritten as w,ψ(x) where the elements of ψ(x)are
all the monomials of x up to degree r. Therefore, the problem of constructing a
compression scheme for p(x) reduces to the problem of constructing a compression
d
r
scheme for halfspaces in R where d = O(d ).
1 It can be shown that w is the direction of the max-margin solution.