6.5 Proof of Theorem 6.7
Theorem 6.11. Let $\mathcal{H}$ be a class and let $\tau_{\mathcal{H}}$ be its growth function. Then, for every $\mathcal{D}$ and every $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $S \sim \mathcal{D}^m$ we have
$$\forall h \in \mathcal{H},\quad |L_{\mathcal{D}}(h) - L_S(h)| \;\le\; \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}}.$$
Before proving the theorem, let us first conclude the proof of Theorem 6.7.
Proof of Theorem 6.7. It suffices to prove that if the VC-dimension is finite then the
uniform convergence property holds. We will prove that
$$m^{\mathrm{UC}}_{\mathcal{H}}(\epsilon,\delta) \;\le\; 4\,\frac{16d}{(\delta\epsilon)^2}\log\!\left(\frac{16d}{(\delta\epsilon)^2}\right) + \frac{16d\log(2e/d)}{(\delta\epsilon)^2}.$$
From Sauer's lemma we have that for $m > d$, $\tau_{\mathcal{H}}(2m) \le (2em/d)^d$. Combining this with Theorem 6.11 we obtain that with probability of at least $1-\delta$,
$$|L_S(h) - L_{\mathcal{D}}(h)| \;\le\; \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}}.$$
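Spelling out the substitution (a worked step, not in the original text): taking logarithms in Sauer's bound gives $\log(\tau_{\mathcal{H}}(2m)) \le d\log(2em/d)$, and since $x \mapsto \sqrt{x}$ is increasing,
$$\frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}} \;\le\; \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}},$$
which is the bound displayed above.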
For simplicity assume that $\sqrt{d\log(2em/d)} \ge 4$; hence,
$$|L_S(h) - L_{\mathcal{D}}(h)| \;\le\; \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}.$$
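To make the simplification explicit (a worked step, not in the original text): under the assumption $4 \le \sqrt{d\log(2em/d)}$,
$$\frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}} \;\le\; \frac{2\sqrt{d\log(2em/d)}}{\delta\sqrt{2m}} \;=\; \frac{1}{\delta}\sqrt{\frac{4\,d\log(2em/d)}{2m}} \;=\; \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}.$$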
To ensure that the preceding is at most $\epsilon$ we need that
$$m \;\ge\; \frac{2d\log(m)}{(\delta\epsilon)^2} + \frac{2d\log(2e/d)}{(\delta\epsilon)^2}.$$
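This requirement can be unpacked as follows (a worked step, not in the original text): the bound $\frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}} \le \epsilon$ holds if and only if $m \ge \frac{2d\log(2em/d)}{(\delta\epsilon)^2}$, and writing $\log(2em/d) = \log(m) + \log(2e/d)$ splits the right-hand side into the two terms displayed above.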
Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a
sufficient condition for the preceding to hold is that
$$m \;\ge\; 4\,\frac{2d}{(\delta\epsilon)^2}\log\!\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d\log(2e/d)}{(\delta\epsilon)^2}.$$
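As a quick numerical sanity check (not from the text; the values $d = 10$, $\epsilon = 0.1$, $\delta = 0.05$ are illustrative), the following Python sketch evaluates the sufficient condition above and verifies that the resulting $m$ indeed satisfies the earlier requirement involving $\log(m)$:

```python
import math

# Illustrative values (not from the text): VC-dimension d, accuracy eps, confidence delta.
d, eps, delta = 10, 0.1, 0.05
de2 = (delta * eps) ** 2  # (delta * eps)^2, the common denominator in the bounds

# Sufficient sample size from the last display:
#   m >= 4 * (2d/(delta*eps)^2) * log(2d/(delta*eps)^2) + 4d*log(2e/d)/(delta*eps)^2
m = 4 * (2 * d / de2) * math.log(2 * d / de2) + 4 * d * math.log(2 * math.e / d) / de2

# The requirement it is meant to imply:
#   m >= 2d*log(m)/(delta*eps)^2 + 2d*log(2e/d)/(delta*eps)^2
required = 2 * d * math.log(m) / de2 + 2 * d * math.log(2 * math.e / d) / de2

print(f"m from the sufficient condition: {m:,.0f}")
print(f"m required by the earlier display: {required:,.0f}")
assert m >= required  # the sufficient condition implies the requirement
```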
Remark 6.4. The upper bound on $m^{\mathrm{UC}}_{\mathcal{H}}$ we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.
Proof of Theorem 6.11*
We will start by showing that
$$\mathop{\mathbb{E}}_{S\sim\mathcal{D}^m}\left[\sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)|\right] \;\le\; \frac{4+\sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}. \qquad (6.4)$$
Since the random variable $\sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)|$ is nonnegative, the proof of the theorem follows directly from the preceding using Markov's inequality (see Section B.1).
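To spell out that last step (a worked derivation, not in the original text): for a nonnegative random variable $X$, Markov's inequality gives $\mathbb{P}[X \ge \mathbb{E}[X]/\delta] \le \delta$. Applying this to $X = \sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)|$ and using (6.4), with probability of at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)| \;\le\; \frac{1}{\delta}\cdot\frac{4+\sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}},$$
which is exactly the bound stated in Theorem 6.11.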