To bound the left-hand side of Equation (6.4) we first note that for every $h \in \mathcal{H}$, we can rewrite $L_{\mathcal{D}}(h) = \mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)]$, where $S' = z'_1, \ldots, z'_m$ is an additional i.i.d. sample. Therefore,
\[
\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{\mathcal{D}}(h) - L_S(h) \big|
= \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \Big| \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big[ L_{S'}(h) - L_S(h) \big] \Big| .
\]
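As a quick numerical illustration of this "ghost sample" rewriting, the following minimal Python sketch (the distribution, the threshold hypothesis, and all names are illustrative assumptions, not part of the text) estimates $L_{\mathcal{D}}(h)$ for a fixed classifier by averaging the empirical loss $L_{S'}(h)$ over many independently drawn samples $S'$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    # Hypothetical distribution D: x ~ Uniform[0, 1], label 1[x >= 0.3],
    # with 10% of the labels flipped at random.
    x = rng.uniform(0, 1, m)
    y = (x >= 0.3).astype(int)
    y[rng.uniform(0, 1, m) < 0.1] ^= 1
    return x, y

def empirical_loss(t, x, y):
    # 0-1 loss on the sample of the threshold classifier h(x) = 1[x >= t]
    return np.mean((x >= t).astype(int) != y)

h, m = 0.5, 100
# L_D(h) = E_{S'~D^m}[L_{S'}(h)]: average the empirical loss over many
# independent ghost samples S'.
print(np.mean([empirical_loss(h, *sample(m)) for _ in range(10_000)]))
# For this D, L_D(h) = 0.1 + 0.8*|0.5 - 0.3| = 0.26 exactly.
```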
A generalization of the triangle inequality yields
\[
\Big| \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big[ L_{S'}(h) - L_S(h) \big] \Big|
\le \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big| L_{S'}(h) - L_S(h) \big| ,
\]
and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields
\[
\sup_{h \in \mathcal{H}}\, \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big| L_{S'}(h) - L_S(h) \big|
\le \underset{S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{S'}(h) - L_S(h) \big| .
\]
Formally, the previous two inequalities follow from Jensen's inequality (both the absolute value and the pointwise supremum are convex functions). Combining all we obtain
\begin{align*}
\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{\mathcal{D}}(h) - L_S(h) \big|
&\le \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{S'}(h) - L_S(h) \big| \\
&= \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| . \tag{6.5}
\end{align*}
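To see the resulting inequality (6.5) numerically, here is a hedged Monte Carlo sketch under the same illustrative setup (a small finite class of threshold classifiers; all concrete choices are assumptions made for the example). It estimates the left-hand side $\mathbb{E}_S \sup_h |L_{\mathcal{D}}(h) - L_S(h)|$ and the right-hand side $\mathbb{E}_{S,S'} \sup_h |L_{S'}(h) - L_S(h)|$:

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0, 1, 21)   # finite class H of threshold classifiers

def sample(m):
    # illustrative D: x ~ Uniform[0, 1], y = 1[x >= 0.3], 10% label noise
    x = rng.uniform(0, 1, m)
    y = (x >= 0.3).astype(int)
    y[rng.uniform(0, 1, m) < 0.1] ^= 1
    return x, y

def emp_losses(x, y):
    # empirical 0-1 loss of every h in H on the sample; shape (|H|,)
    preds = x[None, :] >= thresholds[:, None]
    return (preds != y[None, :]).mean(axis=1)

# For this D the true losses have the closed form L_D(t) = 0.1 + 0.8*|t - 0.3|.
true_losses = 0.1 + 0.8 * np.abs(thresholds - 0.3)

m, trials = 50, 2000
lhs = np.mean([np.abs(true_losses - emp_losses(*sample(m))).max()
               for _ in range(trials)])
rhs = np.mean([np.abs(emp_losses(*sample(m)) - emp_losses(*sample(m))).max()
               for _ in range(trials)])
print(lhs, rhs)   # empirically lhs <= rhs, as Equation (6.5) predicts
```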
The expectation on the right-hand side of Equation (6.5) is over a choice of two i.i.d. samples $S = z_1,\ldots,z_m$ and $S' = z'_1,\ldots,z'_m$. Since all of these $2m$ vectors are chosen i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the name of the random vector $z'_i$. If we do it, instead of the term $\big( \ell(h,z'_i) - \ell(h,z_i) \big)$ in Equation (6.5) we will have the term $-\big( \ell(h,z'_i) - \ell(h,z_i) \big)$. It follows that for every $\sigma \in \{\pm 1\}^m$ we have that Equation (6.5) equals
\[
\underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| .
\]
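In more detail (an elaboration of the renaming argument, spelled out here for concreteness): fix $\sigma \in \{\pm 1\}^m$ and let $T$ be the map that swaps $z_i \leftrightarrow z'_i$ exactly for those indices $i$ with $\sigma_i = -1$. Since all $2m$ vectors are i.i.d., $T$ preserves the distribution of $(S, S')$, and each swapped summand changes sign, so
\[
\underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right|
= \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| .
\]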
Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component of $\sigma$ uniformly at random from the uniform distribution over $\{\pm 1\}$, denoted $U_\pm$. Hence, Equation (6.5) also equals
\[
\underset{\sigma \sim U_\pm^m}{\mathbb{E}}\; \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| ,
\]
and by the linearity of expectation it also equals
\[
\underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\; \underset{\sigma \sim U_\pm^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| .
\]
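In this last form the inner expectation over $\sigma$ can be examined for each fixed pair $(S, S')$ separately, since for fixed samples it depends on $h$ only through the values $\ell(h,z_i)$ and $\ell(h,z'_i)$ on the $2m$ points. A minimal Monte Carlo sketch of that inner expectation, again under the illustrative threshold setup assumed in the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0, 1, 21)   # finite class H of threshold classifiers

def sample(m):
    # illustrative D, as in the earlier sketches
    x = rng.uniform(0, 1, m)
    y = (x >= 0.3).astype(int)
    y[rng.uniform(0, 1, m) < 0.1] ^= 1
    return x, y

def loss_matrix(x, y):
    # per-example 0-1 losses; entry (j, i) is l(h_j, z_i), shape (|H|, m)
    preds = x[None, :] >= thresholds[:, None]
    return (preds != y[None, :]).astype(float)

m = 50
x, y = sample(m)        # a fixed sample S
xg, yg = sample(m)      # a fixed ghost sample S'
diff = loss_matrix(xg, yg) - loss_matrix(x, y)   # l(h, z'_i) - l(h, z_i)

# Estimate E_sigma sup_h (1/m)|sum_i sigma_i * diff_i| for this fixed (S, S').
sigmas = rng.choice([-1.0, 1.0], size=(10_000, m))
print(np.abs(sigmas @ diff.T / m).max(axis=1).mean())
```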