To bound the left-hand side of Equation (6.4) we first note that for every $h \in \mathcal{H}$, we can rewrite $L_{\mathcal{D}}(h) = \mathbb{E}_{S' \sim \mathcal{D}^m}[L_{S'}(h)]$, where $S' = z'_1, \ldots, z'_m$ is an additional i.i.d. sample. Therefore,
\[
\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{\mathcal{D}}(h) - L_S(h) \big|
= \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \Big| \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big[ L_{S'}(h) - L_S(h) \big] \Big| .
\]
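As a quick numerical illustration of this "ghost sample" rewriting, the following minimal Python sketch (the distribution, the threshold hypothesis, and all names are illustrative assumptions, not part of the text) estimates $L_{\mathcal{D}}(h)$ for a fixed classifier by averaging the empirical loss $L_{S'}(h)$ over many independently drawn samples $S'$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    # Hypothetical distribution D: x ~ Uniform[0, 1], label 1[x >= 0.3],
    # with 10% of the labels flipped at random.
    x = rng.uniform(0, 1, m)
    y = (x >= 0.3).astype(int)
    y[rng.uniform(0, 1, m) < 0.1] ^= 1
    return x, y

def empirical_loss(t, x, y):
    # 0-1 loss on the sample of the threshold classifier h(x) = 1[x >= t]
    return np.mean((x >= t).astype(int) != y)

h, m = 0.5, 100
# L_D(h) = E_{S'~D^m}[L_{S'}(h)]: average the empirical loss over many
# independent ghost samples S'.
print(np.mean([empirical_loss(h, *sample(m)) for _ in range(10_000)]))
# For this D, L_D(h) = 0.1 + 0.8*|0.5 - 0.3| = 0.26 exactly.
```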
A generalization of the triangle inequality yields
\[
\Big| \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big[ L_{S'}(h) - L_S(h) \big] \Big|
\le \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big| L_{S'}(h) - L_S(h) \big| ,
\]
and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields
\[
\sup_{h \in \mathcal{H}}\, \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \big| L_{S'}(h) - L_S(h) \big|
\le \underset{S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{S'}(h) - L_S(h) \big| .
\]
Formally, the previous two inequalities follow from Jensen's inequality (both the absolute value and the pointwise supremum are convex functions). Combining all we obtain
\begin{align*}
\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{\mathcal{D}}(h) - L_S(h) \big|
&\le \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \big| L_{S'}(h) - L_S(h) \big| \\
&= \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| . \tag{6.5}
\end{align*}
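To see the resulting inequality (6.5) numerically, here is a hedged Monte Carlo sketch under the same illustrative setup (a small finite class of threshold classifiers; all concrete choices are assumptions made for the example). It estimates the left-hand side $\mathbb{E}_S \sup_h |L_{\mathcal{D}}(h) - L_S(h)|$ and the right-hand side $\mathbb{E}_{S,S'} \sup_h |L_{S'}(h) - L_S(h)|$:

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0, 1, 21)   # finite class H of threshold classifiers

def sample(m):
    # illustrative D: x ~ Uniform[0, 1], y = 1[x >= 0.3], 10% label noise
    x = rng.uniform(0, 1, m)
    y = (x >= 0.3).astype(int)
    y[rng.uniform(0, 1, m) < 0.1] ^= 1
    return x, y

def emp_losses(x, y):
    # empirical 0-1 loss of every h in H on the sample; shape (|H|,)
    preds = x[None, :] >= thresholds[:, None]
    return (preds != y[None, :]).mean(axis=1)

# For this D the true losses have the closed form L_D(t) = 0.1 + 0.8*|t - 0.3|.
true_losses = 0.1 + 0.8 * np.abs(thresholds - 0.3)

m, trials = 50, 2000
lhs = np.mean([np.abs(true_losses - emp_losses(*sample(m))).max()
               for _ in range(trials)])
rhs = np.mean([np.abs(emp_losses(*sample(m)) - emp_losses(*sample(m))).max()
               for _ in range(trials)])
print(lhs, rhs)   # empirically lhs <= rhs, as Equation (6.5) predicts
```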
The expectation on the right-hand side of Equation (6.5) is over a choice of two i.i.d. samples $S = z_1,\ldots,z_m$ and $S' = z'_1,\ldots,z'_m$. Since all of these $2m$ vectors are chosen i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the name of the random vector $z'_i$. If we do it, instead of the term $\big( \ell(h,z'_i) - \ell(h,z_i) \big)$ in Equation (6.5) we will have the term $-\big( \ell(h,z'_i) - \ell(h,z_i) \big)$. It follows that for every $\sigma \in \{\pm 1\}^m$ we have that Equation (6.5) equals
\[
\underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| .
\]
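In more detail (an elaboration of the renaming argument, spelled out here for concreteness): fix $\sigma \in \{\pm 1\}^m$ and let $T$ be the map that swaps $z_i \leftrightarrow z'_i$ exactly for those indices $i$ with $\sigma_i = -1$. Since all $2m$ vectors are i.i.d., $T$ preserves the distribution of $(S, S')$, and each swapped summand changes sign, so
\[
\underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right|
= \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| .
\]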
Since this holds for every $\sigma \in \{\pm 1\}^m$, it also holds if we sample each component of $\sigma$ uniformly at random from the uniform distribution over $\{\pm 1\}$, denoted $U_\pm$. Hence, Equation (6.5) also equals
\[
\underset{\sigma \sim U_\pm^m}{\mathbb{E}}\; \underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| ,
\]
and by the linearity of expectation it also equals
\[
\underset{S,S' \sim \mathcal{D}^m}{\mathbb{E}}\; \underset{\sigma \sim U_\pm^m}{\mathbb{E}}\, \sup_{h \in \mathcal{H}} \frac{1}{m} \left| \sum_{i=1}^{m} \sigma_i \big( \ell(h,z'_i) - \ell(h,z_i) \big) \right| .
\]
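In this last form the inner expectation over $\sigma$ can be examined for each fixed pair $(S, S')$ separately, since for fixed samples it depends on $h$ only through the values $\ell(h,z_i)$ and $\ell(h,z'_i)$ on the $2m$ points. A minimal Monte Carlo sketch of that inner expectation, again under the illustrative threshold setup assumed in the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0, 1, 21)   # finite class H of threshold classifiers

def sample(m):
    # illustrative D, as in the earlier sketches
    x = rng.uniform(0, 1, m)
    y = (x >= 0.3).astype(int)
    y[rng.uniform(0, 1, m) < 0.1] ^= 1
    return x, y

def loss_matrix(x, y):
    # per-example 0-1 losses; entry (j, i) is l(h_j, z_i), shape (|H|, m)
    preds = x[None, :] >= thresholds[:, None]
    return (preds != y[None, :]).astype(float)

m = 50
x, y = sample(m)        # a fixed sample S
xg, yg = sample(m)      # a fixed ghost sample S'
diff = loss_matrix(xg, yg) - loss_matrix(x, y)   # l(h, z'_i) - l(h, z_i)

# Estimate E_sigma sup_h (1/m)|sum_i sigma_i * diff_i| for this fixed (S, S').
sigmas = rng.choice([-1.0, 1.0], size=(10_000, m))
print(np.abs(sigmas @ diff.T / m).max(axis=1).mean())
```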