Rademacher Complexities
Theorem 26.3. We have
$$\mathop{\mathbb{E}}_{S\sim D^m}\bigl[L_D(\mathrm{ERM}_H(S)) - L_S(\mathrm{ERM}_H(S))\bigr] \;\le\; 2\,\mathop{\mathbb{E}}_{S\sim D^m} R(\ell \circ H \circ S).$$
Furthermore, for any $h^\star \in H$,
$$\mathop{\mathbb{E}}_{S\sim D^m}\bigl[L_D(\mathrm{ERM}_H(S)) - L_D(h^\star)\bigr] \;\le\; 2\,\mathop{\mathbb{E}}_{S\sim D^m} R(\ell \circ H \circ S).$$
Furthermore, if $h^\star = \operatorname{argmin}_{h} L_D(h)$ then for each $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $S$ we have
$$L_D(\mathrm{ERM}_H(S)) - L_D(h^\star) \;\le\; \frac{2\,\mathbb{E}_{S'\sim D^m} R(\ell \circ H \circ S')}{\delta}.$$
Proof. The first inequality follows directly from Lemma 26.2. The second inequality follows because, for any fixed $h^\star$,
$$L_D(h^\star) \;=\; \mathop{\mathbb{E}}_S\bigl[L_S(h^\star)\bigr] \;\ge\; \mathop{\mathbb{E}}_S\bigl[L_S(\mathrm{ERM}_H(S))\bigr];$$
combining this with the first inequality yields the second. The third inequality follows from the second by relying on Markov's inequality (note that the random variable $L_D(\mathrm{ERM}_H(S)) - L_D(h^\star)$ is nonnegative).
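To make the quantity $R(\ell \circ H \circ S)$ appearing in these bounds concrete, here is a minimal numerical sketch (not from the text) that estimates the empirical Rademacher complexity by Monte Carlo sampling of sign vectors. The finite hypothesis class, the 0-1 losses, and the function name empirical_rademacher are illustrative assumptions, not the book's notation or code.

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=2000, rng=None):
    # loss_matrix has shape (|H|, m): row h holds the vector
    # (ℓ(h, z_1), ..., ℓ(h, z_m)) of losses of hypothesis h on the sample S.
    rng = np.random.default_rng() if rng is None else rng
    n_h, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)      # Rademacher signs
        total += np.max(loss_matrix @ sigma) / m     # sup over h of (1/m) <sigma, loss vector of h>
    return total / n_draws

# Toy example: 3 hypotheses, 5 examples, arbitrary 0-1 losses.
losses = np.array([[0, 1, 0, 0, 1],
                   [1, 0, 0, 1, 0],
                   [0, 0, 1, 1, 1]], dtype=float)
print(empirical_rademacher(losses))
```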
Next, we derive bounds similar to the bounds in Theorem 26.3 with a better dependence on the confidence parameter δ. To do so, we first introduce the following bounded differences concentration inequality.
Lemma 26.4 (McDiarmid's Inequality). Let $V$ be some set and let $f : V^m \to \mathbb{R}$ be a function of $m$ variables such that for some $c > 0$, for all $i \in [m]$ and for all $x_1,\ldots,x_m, x_i' \in V$ we have
$$\bigl| f(x_1,\ldots,x_m) - f(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_m)\bigr| \;\le\; c.$$
Let $X_1,\ldots,X_m$ be $m$ independent random variables taking values in $V$. Then, with probability of at least $1-\delta$ we have
$$\bigl| f(X_1,\ldots,X_m) - \mathbb{E}\bigl[f(X_1,\ldots,X_m)\bigr]\bigr| \;\le\; c\,\sqrt{\ln(2/\delta)\,m/2}.$$
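As a quick sanity check (a standard instance, not part of the text): taking $f(x_1,\ldots,x_m) = \frac{1}{m}\sum_{i=1}^m x_i$ with each $x_i \in [a,b]$ gives bounded differences with $c = (b-a)/m$, so the lemma yields
$$\Bigl|\tfrac{1}{m}\textstyle\sum_{i=1}^m X_i - \mathbb{E}\bigl[\tfrac{1}{m}\sum_{i=1}^m X_i\bigr]\Bigr| \;\le\; \frac{b-a}{m}\sqrt{\ln(2/\delta)\,m/2} \;=\; (b-a)\sqrt{\frac{\ln(2/\delta)}{2m}},$$
which recovers Hoeffding's inequality for sample means.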
On the basis of the McDiarmid inequality we can derive generalization bounds
with a better dependence on the confidence parameter.
Theorem 26.5. Assume that for all $z$ and $h \in H$ we have that $|\ell(h,z)| \le c$. Then,
1. With probability of at least $1-\delta$, for all $h \in H$,
$$L_D(h) - L_S(h) \;\le\; 2\,\mathop{\mathbb{E}}_{S'\sim D^m} R(\ell \circ H \circ S') + c\,\sqrt{\frac{2\ln(2/\delta)}{m}}.$$
In particular, this holds for $h = \mathrm{ERM}_H(S)$.
2. With probability of at least $1-\delta$, for all $h \in H$,
$$L_D(h) - L_S(h) \;\le\; 2\,R(\ell \circ H \circ S) + 4c\,\sqrt{\frac{2\ln(4/\delta)}{m}}.$$
In particular, this holds for $h = \mathrm{ERM}_H(S)$.
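Item 2 gives a fully data-dependent bound, since $R(\ell \circ H \circ S)$ is computed from the observed sample. The following sketch (an illustration, reusing the hypothetical empirical_rademacher helper from above, not code from the book) evaluates its right-hand side for a given loss matrix, loss bound $c$, and confidence parameter $\delta$.

```python
import numpy as np

def gap_bound(loss_matrix, c, delta, n_draws=2000, rng=None):
    # Right-hand side of Theorem 26.5, item 2:
    #   2 * R(ℓ∘H∘S) + 4c * sqrt(2 ln(4/δ) / m),
    # with R(ℓ∘H∘S) replaced by its Monte Carlo estimate.
    m = loss_matrix.shape[1]
    rad = empirical_rademacher(loss_matrix, n_draws=n_draws, rng=rng)
    return 2.0 * rad + 4.0 * c * np.sqrt(2.0 * np.log(4.0 / delta) / m)

# With the toy loss matrix above, 0-1 losses (so c = 1), and δ = 0.05:
print(gap_bound(losses, c=1.0, delta=0.05))
```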