3. For any $h^\star$, with probability of at least $1 - \delta$,
\[
L_D(\mathrm{ERM}_H(S)) - L_D(h^\star) \le 2\,R(\ell \circ H \circ S) + 5c\sqrt{\frac{2\ln(8/\delta)}{m}} .
\]
Proof. First note that the random variable $\mathrm{Rep}_D(\mathcal{F}, S) = \sup_{h \in H} \bigl( L_D(h) - L_S(h) \bigr)$
satisfies the bounded differences condition of Lemma 26.4 with a constant $2c/m$.
Combining the bounds in Lemma 26.4 with Lemma 26.2 we obtain that with
probability of at least $1 - \delta$,
\[
\mathrm{Rep}_D(\mathcal{F}, S) \le \mathbb{E}\bigl[\mathrm{Rep}_D(\mathcal{F}, S)\bigr] + c\sqrt{\frac{2\ln(2/\delta)}{m}}
\le 2\,\mathbb{E}_{S}\, R(\ell \circ H \circ S) + c\sqrt{\frac{2\ln(2/\delta)}{m}} .
\]
The first inequality of the theorem follows from the definition of $\mathrm{Rep}_D(\mathcal{F}, S)$. For
the second inequality we note that the random variable $R(\ell \circ H \circ S)$ also satisfies
the bounded differences condition of Lemma 26.4 with a constant $2c/m$. Therefore,
the second inequality follows from the first inequality, Lemma 26.4, and the union
bound. Finally, for the last inequality, denote $h_S = \mathrm{ERM}_H(S)$ and note that
\[
\begin{aligned}
L_D(h_S) - L_D(h^\star)
&= L_D(h_S) - L_S(h_S) + L_S(h_S) - L_S(h^\star) + L_S(h^\star) - L_D(h^\star) \\
&\le L_D(h_S) - L_S(h_S) + L_S(h^\star) - L_D(h^\star) ,
\end{aligned}
\tag{26.10}
\]
where the inequality holds because $L_S(h_S) \le L_S(h^\star)$, as $h_S$ minimizes the empirical risk over $H$.
The first summand on the right-hand side is bounded by the second inequality of
the theorem. For the second summand, we use the fact that $h^\star$ does not depend on
$S$; hence by using Hoeffding's inequality we obtain that with probability of at least
$1 - \delta/2$,
\[
L_S(h^\star) - L_D(h^\star) \le c\sqrt{\frac{\ln(4/\delta)}{2m}} .
\tag{26.11}
\]
Combining this with the union bound we conclude our proof.
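To spell out this final combination: assume, as stated earlier in the theorem (not shown in this excerpt), that its second inequality has the form $L_D(h) - L_S(h) \le 2\,R(\ell \circ H \circ S) + 4c\sqrt{2\ln(4/\delta)/m}$ for all $h \in H$. Applying it with $\delta/2$ in place of $\delta$ to bound the first summand, applying (26.11) to the second, and taking a union bound gives, with probability of at least $1 - \delta$,
\[
L_D(h_S) - L_D(h^\star) \le 2\,R(\ell \circ H \circ S) + 4c\sqrt{\frac{2\ln(8/\delta)}{m}} + c\sqrt{\frac{\ln(4/\delta)}{2m}}
\le 2\,R(\ell \circ H \circ S) + 5c\sqrt{\frac{2\ln(8/\delta)}{m}} ,
\]
since $\sqrt{\ln(4/\delta)/(2m)} \le \sqrt{2\ln(8/\delta)/m}$.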
The preceding theorem tells us that if the quantity $R(\ell \circ H \circ S)$ is small then it
is possible to learn the class $H$ using the ERM rule. It is important to emphasize
that the last two bounds given in the theorem depend on the specific training set $S$.
That is, we use $S$ both for learning a hypothesis from $H$ and for estimating its
quality. This type of bound is called a data-dependent bound.
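As a concrete illustration (not part of the text), once the sample $S$ is fixed, the data-dependent quantity $R(\ell \circ H \circ S)$ can be estimated numerically by Monte Carlo: draw Rademacher vectors $\sigma$ and average the suprema of the correlations $\frac{1}{m}\sum_i \sigma_i\,\ell(h, z_i)$. The Python sketch below assumes a finite hypothesis class, represented by a hypothetical array loss_matrix whose rows are the loss vectors $(\ell(h, z_1), \dots, \ell(h, z_m))$; the function name empirical_rademacher is illustrative.

import numpy as np

def empirical_rademacher(loss_matrix, n_draws=1000, seed=0):
    # Monte Carlo estimate of E_sigma[ sup_h (1/m) * sum_i sigma_i * ell(h, z_i) ],
    # where the sup over the finite class is a max over the rows of loss_matrix.
    rng = np.random.default_rng(seed)
    n_hyp, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher vector
        corr = loss_matrix @ sigma / m            # one correlation per hypothesis
        total += corr.max()                       # sup over the finite class
    return total / n_draws

# Toy usage: threshold classifiers on the line with the 0-1 loss and random labels.
rng = np.random.default_rng(1)
m = 50
x = np.sort(rng.uniform(size=m))
y = rng.choice([0, 1], size=m)
thresholds = np.concatenate(([0.0], (x[:-1] + x[1:]) / 2, [1.0]))
loss_matrix = np.array([((x >= t).astype(int) != y).astype(float) for t in thresholds])
print(empirical_rademacher(loss_matrix, n_draws=2000))

For an infinite hypothesis class the inner maximum must be replaced by an optimization over $H$, which is typically the expensive step.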
26.1.1 Rademacher Calculus
Let us now discuss some properties of the Rademacher complexity measure. These
properties will help us in deriving some simple bounds on $R(\ell \circ H \circ S)$ for specific
cases of interest.
The following lemma is immediate from the definition.
Lemma 26.6. For any $A \subset \mathbb{R}^m$, scalar $c \in \mathbb{R}$, and vector $\mathbf{a}_0 \in \mathbb{R}^m$, we have
\[
R(\{ c\,\mathbf{a} + \mathbf{a}_0 : \mathbf{a} \in A \}) \le |c|\, R(A) .
\]
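To verify this from the definition (a short sketch, not spelled out in the text), write $\sigma = (\sigma_1, \dots, \sigma_m)$ for the Rademacher vector:
\[
R(\{ c\,\mathbf{a} + \mathbf{a}_0 : \mathbf{a} \in A \})
= \mathbb{E}_{\sigma}\Bigl[ \sup_{\mathbf{a} \in A} \frac{1}{m} \sum_{i=1}^m \sigma_i (c\,a_i + a_{0,i}) \Bigr]
= \mathbb{E}_{\sigma}\Bigl[ \sup_{\mathbf{a} \in A} \frac{c}{m} \sum_{i=1}^m \sigma_i a_i \Bigr]
+ \underbrace{\mathbb{E}_{\sigma}\Bigl[ \frac{1}{m} \sum_{i=1}^m \sigma_i a_{0,i} \Bigr]}_{=0}
= |c|\, R(A) ,
\]
where the last equality holds because $\sigma$ and $-\sigma$ have the same distribution, so the sign of $c$ can be absorbed into $\sigma$; in particular, the inequality stated in the lemma follows.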
The following lemma tells us that the convex hull of A has the same complexity
as A.