6.5 Proof of Theorem 6.7
Theorem 6.11. Let $\mathcal{H}$ be a class and let $\tau_{\mathcal{H}}$ be its growth function. Then, for every $\mathcal{D}$ and every $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $S \sim \mathcal{D}^m$ we have
$$\forall h \in \mathcal{H},\quad |L_{\mathcal{D}}(h) - L_S(h)| \;\le\; \frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}}.$$
Before proving the theorem, let us first conclude the proof of Theorem 6.7.
Proof of Theorem 6.7. It suffices to prove that if the VC-dimension is finite then the
uniform convergence property holds. We will prove that
$$m^{\mathrm{UC}}_{\mathcal{H}}(\epsilon,\delta) \;\le\; 4\,\frac{16d}{(\delta\epsilon)^2}\log\!\left(\frac{16d}{(\delta\epsilon)^2}\right) + \frac{16d\log(2e/d)}{(\delta\epsilon)^2}.$$
From Sauer's lemma we have that for $m > d$, $\tau_{\mathcal{H}}(2m) \le (2em/d)^d$. Combining this with Theorem 6.11 we obtain that with probability of at least $1-\delta$,
$$|L_S(h) - L_{\mathcal{D}}(h)| \;\le\; \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}}.$$
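Spelling out the substitution (a worked step, not in the original text): taking logarithms in Sauer's bound gives $\log(\tau_{\mathcal{H}}(2m)) \le d\log(2em/d)$, and since $x \mapsto \sqrt{x}$ is increasing,
$$\frac{4 + \sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\delta\sqrt{2m}} \;\le\; \frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}},$$
which is the bound displayed above.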
For simplicity assume that $\sqrt{d\log(2em/d)} \ge 4$; hence,
$$|L_S(h) - L_{\mathcal{D}}(h)| \;\le\; \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}.$$
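To make the simplification explicit (a worked step, not in the original text): under the assumption $4 \le \sqrt{d\log(2em/d)}$,
$$\frac{4 + \sqrt{d\log(2em/d)}}{\delta\sqrt{2m}} \;\le\; \frac{2\sqrt{d\log(2em/d)}}{\delta\sqrt{2m}} \;=\; \frac{1}{\delta}\sqrt{\frac{4\,d\log(2em/d)}{2m}} \;=\; \frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}}.$$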
To ensure that the preceding is at most $\epsilon$ we need that
$$m \;\ge\; \frac{2d\log(m)}{(\delta\epsilon)^2} + \frac{2d\log(2e/d)}{(\delta\epsilon)^2}.$$
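This requirement can be unpacked as follows (a worked step, not in the original text): the bound $\frac{1}{\delta}\sqrt{\frac{2d\log(2em/d)}{m}} \le \epsilon$ holds if and only if $m \ge \frac{2d\log(2em/d)}{(\delta\epsilon)^2}$, and writing $\log(2em/d) = \log(m) + \log(2e/d)$ splits the right-hand side into the two terms displayed above.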
Standard algebraic manipulations (see Lemma A.2 in Appendix A) show that a
sufficient condition for the preceding to hold is that
$$m \;\ge\; 4\,\frac{2d}{(\delta\epsilon)^2}\log\!\left(\frac{2d}{(\delta\epsilon)^2}\right) + \frac{4d\log(2e/d)}{(\delta\epsilon)^2}.$$
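As a quick numerical sanity check (not from the text; the values $d = 10$, $\epsilon = 0.1$, $\delta = 0.05$ are illustrative), the following Python sketch evaluates the sufficient condition above and verifies that the resulting $m$ indeed satisfies the earlier requirement involving $\log(m)$:

```python
import math

# Illustrative values (not from the text): VC-dimension d, accuracy eps, confidence delta.
d, eps, delta = 10, 0.1, 0.05
de2 = (delta * eps) ** 2  # (delta * eps)^2, the common denominator in the bounds

# Sufficient sample size from the last display:
#   m >= 4 * (2d/(delta*eps)^2) * log(2d/(delta*eps)^2) + 4d*log(2e/d)/(delta*eps)^2
m = 4 * (2 * d / de2) * math.log(2 * d / de2) + 4 * d * math.log(2 * math.e / d) / de2

# The requirement it is meant to imply:
#   m >= 2d*log(m)/(delta*eps)^2 + 2d*log(2e/d)/(delta*eps)^2
required = 2 * d * math.log(m) / de2 + 2 * d * math.log(2 * math.e / d) / de2

print(f"m from the sufficient condition: {m:,.0f}")
print(f"m required by the earlier display: {required:,.0f}")
assert m >= required  # the sufficient condition implies the requirement
```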
Remark 6.4. The upper bound on $m^{\mathrm{UC}}_{\mathcal{H}}$ we derived in the proof of Theorem 6.7 is not the tightest possible. A tighter analysis that yields the bounds given in Theorem 6.8 can be found in Chapter 28.
Proof of Theorem 6.11*
We will start by showing that
$$\mathop{\mathbb{E}}_{S\sim\mathcal{D}^m}\left[\sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)|\right] \;\le\; \frac{4+\sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}}. \qquad (6.4)$$
Since the random variable $\sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)|$ is nonnegative, the proof of the theorem follows directly from the preceding using Markov's inequality (see Section B.1).
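To spell out that last step (a worked derivation, not in the original text): for a nonnegative random variable $X$, Markov's inequality gives $\mathbb{P}[X \ge \mathbb{E}[X]/\delta] \le \delta$. Applying this to $X = \sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)|$ and using (6.4), with probability of at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}|L_{\mathcal{D}}(h)-L_S(h)| \;\le\; \frac{1}{\delta}\cdot\frac{4+\sqrt{\log(\tau_{\mathcal{H}}(2m))}}{\sqrt{2m}},$$
which is exactly the bound stated in Theorem 6.11.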