Page 347 - Understanding Machine Learning

26.1 The Rademacher Complexity


3. For any $h^\star \in H$, with probability of at least $1 - \delta$,
$$ L_D(\mathrm{ERM}_H(S)) - L_D(h^\star) \;\le\; 2R(\ell \circ H \circ S) + 5c\sqrt{\frac{2\ln(8/\delta)}{m}}. $$

Proof. First note that the random variable $\mathrm{Rep}_D(F,S) = \sup_{h \in H} \left( L_D(h) - L_S(h) \right)$
satisfies the bounded differences condition of Lemma 26.4 with a constant $2c/m$.
Combining the bounds in Lemma 26.4 with Lemma 26.2 we obtain that with
probability of at least $1 - \delta$,
$$ \mathrm{Rep}_D(F,S) \;\le\; \mathbb{E}\,\mathrm{Rep}_D(F,S) + c\sqrt{\frac{2\ln(2/\delta)}{m}} \;\le\; 2\,\mathbb{E}_S\, R(\ell \circ H \circ S) + c\sqrt{\frac{2\ln(2/\delta)}{m}}. $$
The first inequality of the theorem follows from the definition of $\mathrm{Rep}_D(F,S)$. For
the second inequality we note that the random variable $R(\ell \circ H \circ S)$ also satisfies
the bounded differences condition of Lemma 26.4 with a constant $2c/m$. Therefore,
the second inequality follows from the first inequality, Lemma 26.4, and the union
bound. Finally, for the last inequality, denote $h_S = \mathrm{ERM}_H(S)$ and note that

$$
\begin{aligned}
L_D(h_S) - L_D(h^\star)
&= L_D(h_S) - L_S(h_S) + L_S(h_S) - L_S(h^\star) + L_S(h^\star) - L_D(h^\star) \\
&\le \left( L_D(h_S) - L_S(h_S) \right) + \left( L_S(h^\star) - L_D(h^\star) \right), \qquad (26.10)
\end{aligned}
$$
where the inequality holds because $L_S(h_S) \le L_S(h^\star)$, since $h_S$ minimizes the empirical risk over $H$.
The first summand on the right-hand side is bounded by the second inequality of
the theorem. For the second summand, we use the fact that $h^\star$ does not depend on
$S$; hence, by Hoeffding's inequality, with probability of at least $1 - \delta/2$,
$$ L_S(h^\star) - L_D(h^\star) \;\le\; c\sqrt{\frac{\ln(4/\delta)}{2m}}. \qquad (26.11) $$
              Combining this with the union bound we conclude our proof.
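The Hoeffding step in (26.11) is easy to check numerically. The sketch below is my own illustration, not part of the book's text; the uniform loss distribution and the values of $m$, $c$, and $\delta$ are arbitrary choices. It draws many training samples for a single fixed hypothesis and verifies that the gap $L_S(h^\star) - L_D(h^\star)$ exceeds the bound on at most a $\delta/2$ fraction of them.

```python
import numpy as np

# Simulation of the Hoeffding step (26.11) for one FIXED hypothesis:
# P( L_S - L_D > c * sqrt(ln(4/delta) / (2m)) ) <= delta / 2.
# The uniform loss distribution and all constants are illustrative choices.
rng = np.random.default_rng(0)
m, c, delta, trials = 100, 1.0, 0.1, 20000

L_D = 0.5                                      # true risk of the fixed hypothesis
bound = c * np.sqrt(np.log(4 / delta) / (2 * m))

losses = rng.uniform(0, c, size=(trials, m))   # one row = losses on one sample S
gaps = losses.mean(axis=1) - L_D               # L_S(h*) - L_D(h*), per sample
violation_rate = np.mean(gaps > bound)
print(violation_rate, "<=", delta / 2)
```

Note that Hoeffding's inequality applies here only because the hypothesis is fixed before the sample is drawn; for the data-dependent hypothesis $h_S$ the Rademacher bound is needed instead.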
The preceding theorem tells us that if the quantity $R(\ell \circ H \circ S)$ is small then it
              is possible to learn the class H using the ERM rule. It is important to emphasize
              that the last two bounds given in the theorem depend on the specific training set S.
              That is, we use S both for learning a hypothesis from H as well as for estimating the
              quality of it. This type of bound is called a data-dependent bound.
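To make the data-dependent quantity concrete, the empirical Rademacher complexity of a finite set can be estimated by Monte Carlo. The sketch below is my own illustration, not from the text: each row of the array `A` plays the role of the loss vector of one hypothesis on the sample $S$, so for a three-hypothesis class the estimate approximates $R(\ell \circ H \circ S)$.

```python
import numpy as np

def empirical_rademacher(A, n_draws=20000, seed=0):
    """Monte Carlo estimate of R(A) = E_sigma sup_{a in A} (1/m) <sigma, a>,
    where sigma has i.i.d. uniform +/-1 entries and rows of A are the vectors."""
    rng = np.random.default_rng(seed)
    k, m = A.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher sign vectors
    sups = (sigma @ A.T / m).max(axis=1)                # sup over the k vectors
    return sups.mean()

# Loss vectors of 3 hypotheses on a sample of size m = 50 (arbitrary data).
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 1.0, size=(3, 50))
print(empirical_rademacher(A))
```

Since the estimate is computed from the same sample $S$ that would be used for learning, it is exactly the kind of data-dependent quantity the bounds above refer to.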


              26.1.1 Rademacher Calculus

              Let us now discuss some properties of the Rademacher complexity measure. These
properties will help us in deriving some simple bounds on $R(\ell \circ H \circ S)$ for specific
              cases of interest.
                 The following lemma is immediate from the definition.

Lemma 26.6. For any $A \subset \mathbb{R}^m$, scalar $c \in \mathbb{R}$, and vector $\mathbf{a}_0 \in \mathbb{R}^m$, we have
$$ R(\{ c\,\mathbf{a} + \mathbf{a}_0 : \mathbf{a} \in A \}) \;\le\; |c|\, R(A). $$
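Lemma 26.6 can also be checked by simulation. In the sketch below (my own illustration; the set $A$, the scalar $c$, and the shift $\mathbf{a}_0$ are arbitrary choices) both sides of the inequality are estimated with a shared batch of sign vectors.

```python
import numpy as np

# Monte Carlo check of Lemma 26.6: R({c*a + a0 : a in A}) <= |c| * R(A).
# The set A, scalar c, and shift a0 are arbitrary illustrative choices.
rng = np.random.default_rng(0)
k, m, n_draws = 5, 40, 50000
A = rng.normal(size=(k, m))
c = -3.0
a0 = rng.normal(size=m)

def rademacher_estimate(B, sigma):
    # Average over sign draws of sup_{b in B} (1/m) <sigma, b>.
    return (sigma @ B.T / B.shape[1]).max(axis=1).mean()

sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
lhs = rademacher_estimate(c * A + a0, sigma)   # R of the transformed set
rhs = abs(c) * rademacher_estimate(A, sigma)   # |c| * R(A)
print(lhs, rhs)
```

For Rademacher signs the shift term $\langle \sigma, \mathbf{a}_0 \rangle$ averages to zero, so the two estimates come out nearly equal in this experiment; the lemma states the inequality that holds in general.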
                 The following lemma tells us that the convex hull of A has the same complexity
              as A.