Rademacher Complexities
Theorem 26.3. We have
$$\mathop{\mathbb{E}}_{S\sim D^m}\bigl[L_D(\mathrm{ERM}_H(S)) - L_S(\mathrm{ERM}_H(S))\bigr] \;\le\; 2\,\mathop{\mathbb{E}}_{S\sim D^m} R(\ell \circ H \circ S).$$
Furthermore, for any $h^\star \in H$,
$$\mathop{\mathbb{E}}_{S\sim D^m}\bigl[L_D(\mathrm{ERM}_H(S)) - L_D(h^\star)\bigr] \;\le\; 2\,\mathop{\mathbb{E}}_{S\sim D^m} R(\ell \circ H \circ S).$$
Furthermore, if $h^\star = \operatorname{argmin}_{h} L_D(h)$ then for each $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $S$ we have
$$L_D(\mathrm{ERM}_H(S)) - L_D(h^\star) \;\le\; \frac{2\,\mathbb{E}_{S'\sim D^m} R(\ell \circ H \circ S')}{\delta}.$$
Proof. The first inequality follows directly from Lemma 26.2. The second inequality follows because, for any fixed $h^\star$,
$$L_D(h^\star) \;=\; \mathop{\mathbb{E}}_S\bigl[L_S(h^\star)\bigr] \;\ge\; \mathop{\mathbb{E}}_S\bigl[L_S(\mathrm{ERM}_H(S))\bigr];$$
combining this with the first inequality yields the second. The third inequality follows from the second by relying on Markov's inequality (note that the random variable $L_D(\mathrm{ERM}_H(S)) - L_D(h^\star)$ is nonnegative).
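To make the quantity $R(\ell \circ H \circ S)$ appearing in these bounds concrete, here is a minimal numerical sketch (not from the text) that estimates the empirical Rademacher complexity by Monte Carlo sampling of sign vectors. The finite hypothesis class, the 0-1 losses, and the function name empirical_rademacher are illustrative assumptions, not the book's notation or code.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=2000, rng=None):
    # loss_matrix has shape (|H|, m): row h holds the vector
    # (ℓ(h, z_1), ..., ℓ(h, z_m)) of losses of hypothesis h on the sample S.
    rng = np.random.default_rng() if rng is None else rng
    n_h, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)      # Rademacher signs
        total += np.max(loss_matrix @ sigma) / m     # sup over h of (1/m) <sigma, loss vector of h>
    return total / n_draws

# Toy example: 3 hypotheses, 5 examples, arbitrary 0-1 losses.
losses = np.array([[0, 1, 0, 0, 1],
                   [1, 0, 0, 1, 0],
                   [0, 0, 1, 1, 1]], dtype=float)
print(empirical_rademacher(losses))
```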
Next, we derive bounds similar to the bounds in Theorem 26.3 with a better dependence on the confidence parameter δ. To do so, we first introduce the following bounded differences concentration inequality.
Lemma 26.4 (McDiarmid's Inequality). Let $V$ be some set and let $f : V^m \to \mathbb{R}$ be a function of $m$ variables such that for some $c > 0$, for all $i \in [m]$ and for all $x_1,\ldots,x_m, x_i' \in V$ we have
$$\bigl| f(x_1,\ldots,x_m) - f(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_m)\bigr| \;\le\; c.$$
Let $X_1,\ldots,X_m$ be $m$ independent random variables taking values in $V$. Then, with probability of at least $1-\delta$ we have
$$\bigl| f(X_1,\ldots,X_m) - \mathbb{E}\bigl[f(X_1,\ldots,X_m)\bigr]\bigr| \;\le\; c\,\sqrt{\ln(2/\delta)\,m/2}.$$
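As a quick sanity check (a standard instance, not part of the text): taking $f(x_1,\ldots,x_m) = \frac{1}{m}\sum_{i=1}^m x_i$ with each $x_i \in [a,b]$ gives bounded differences with $c = (b-a)/m$, so the lemma yields
$$\Bigl|\tfrac{1}{m}\textstyle\sum_{i=1}^m X_i - \mathbb{E}\bigl[\tfrac{1}{m}\sum_{i=1}^m X_i\bigr]\Bigr| \;\le\; \frac{b-a}{m}\sqrt{\ln(2/\delta)\,m/2} \;=\; (b-a)\sqrt{\frac{\ln(2/\delta)}{2m}},$$
which recovers Hoeffding's inequality for sample means.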
On the basis of the McDiarmid inequality we can derive generalization bounds
with a better dependence on the confidence parameter.
Theorem 26.5. Assume that for all $z$ and $h \in H$ we have that $|\ell(h,z)| \le c$. Then,
1. With probability of at least $1-\delta$, for all $h \in H$,
$$L_D(h) - L_S(h) \;\le\; 2\,\mathop{\mathbb{E}}_{S'\sim D^m} R(\ell \circ H \circ S') + c\,\sqrt{\frac{2\ln(2/\delta)}{m}}.$$
In particular, this holds for $h = \mathrm{ERM}_H(S)$.
2. With probability of at least $1-\delta$, for all $h \in H$,
$$L_D(h) - L_S(h) \;\le\; 2\,R(\ell \circ H \circ S) + 4c\,\sqrt{\frac{2\ln(4/\delta)}{m}}.$$
In particular, this holds for $h = \mathrm{ERM}_H(S)$.
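Item 2 gives a fully data-dependent bound, since $R(\ell \circ H \circ S)$ is computed from the observed sample. The following sketch (an illustration, reusing the hypothetical empirical_rademacher helper from above, not code from the book) evaluates its right-hand side for a given loss matrix, loss bound $c$, and confidence parameter $\delta$.

```python
import numpy as np

def gap_bound(loss_matrix, c, delta, n_draws=2000, rng=None):
    # Right-hand side of Theorem 26.5, item 2:
    #   2 * R(ℓ∘H∘S) + 4c * sqrt(2 ln(4/δ) / m),
    # with R(ℓ∘H∘S) replaced by its Monte Carlo estimate.
    m = loss_matrix.shape[1]
    rad = empirical_rademacher(loss_matrix, n_draws=n_draws, rng=rng)
    return 2.0 * rad + 4.0 * c * np.sqrt(2.0 * np.log(4.0 / delta) / m)

# With the toy loss matrix above, 0-1 losses (so c = 1), and δ = 0.05:
print(gap_bound(losses, c=1.0, delta=0.05))
```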