Page 395 - Understanding Machine Learning

P. 395

B.5 Bennet’s and Bernstein’s Inequalities 377

2
By using the inequality h(a) ≥ a /(2+2a/3) it is possible to derive the following:

Lemma B.9 (Bernstein’s Inequality). Let Z 1 ,..., Z m be i.i.d. random variables with
a zero mean. If for all i, P(|Z i | < M) = 1, then for all t > 0:

m 2
t /2
P Z i > t ≤ exp − .
2
E Z + Mt/3
i=1 j
B.5.1 Application

Bernstein’s inequality can be used to interpolate between the rate 1/
we derived
2
for PAC learning in the realizable case (in Chapter 2)and therate1/
we derived
for the unrealizable case (in Chapter 4).

Lemma B.10. Let : H × Z → [0,1] be a loss function. Let D be an arbitrary
distribution over Z. Fix some h. Then, for any δ ∈ (0,1) we have

2L D (h)log(1/δ) 2log(1/δ)
1. P L S (h) ≥ L D (h) + + ≤ δ
S∼D m 3m m

2L S (h)log(1/δ) 4log(1/δ)
2. P L D (h) ≥ L S (h) + + ≤ δ
m
S∼D m m
Proof. Deﬁne random variables α 1 ,...,α m s.t. α i = (h,z i ) − L D (h). Note that
E[α i ] = 0and that

2
2
E[α ] = E[ (h,z i ) ] − 2L D (h)E[ (h,z i )] + L D (h) 2
i
2
= E[ (h,z i ) ] − L D (h) 2
2
≤ E[ (h,z i ) ]
≤ E[ (h,z i )] = L D (h),

2
where in the last inequality we used the fact that (h,z i ) ∈ [0,1] and thus (h,z i ) ≤
(h,z i ). Applying Bernsein’s inequality over the α i ’s yields

m 2
t /2
P α i > t ≤ exp −
2
Eα + t/3
i=1 j
2
t /2 def
≤ exp − = δ.
mL D (h) + t/3

390 391 392 393 394 395 396 397 398 399 400