Page 51 - Understanding Machine Learning

4.2 Finite Classes Are Agnostic PAC Learnable


Writing

$$\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\} = \bigcup_{h \in \mathcal{H}} \{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\},$$

and applying the union bound (Lemma 2.2) we obtain

$$\mathcal{D}^m(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) \;\le\; \sum_{h \in \mathcal{H}} \mathcal{D}^m(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}). \tag{4.1}$$
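The union bound step can be checked numerically on a toy example (everything below, including the event probabilities, is an invented illustration, not part of the text): the probability that at least one of several events occurs never exceeds the sum of the individual probabilities.

```python
import random

# Toy Monte Carlo check of the union bound (Lemma 2.2):
# P(A_1 union ... union A_k) <= sum_i P(A_i).
rng = random.Random(0)
trials = 100_000

# Each "event" fires independently with its own probability (illustrative values).
probs = [0.05, 0.10, 0.02, 0.20, 0.08]

union_hits = 0
individual_hits = [0] * len(probs)
for _ in range(trials):
    fired = [rng.random() < p for p in probs]
    union_hits += any(fired)          # did at least one event occur?
    for i, f in enumerate(fired):
        individual_hits[i] += f       # per-event occurrence counts

p_union = union_hits / trials
sum_of_probs = sum(h / trials for h in individual_hits)
# p_union is at most sum_of_probs, as the union bound guarantees.
```

The bound is loose exactly when the events overlap a lot, which is why, in the derivation above, each summand must be driven down individually.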
Our second step will be to argue that each summand on the right-hand side of this inequality is small enough (for a sufficiently large $m$). That is, we will show that for any fixed hypothesis $h$ (chosen in advance, prior to the sampling of the training set), the gap between the true and empirical risks, $|L_S(h) - L_{\mathcal{D}}(h)|$, is likely to be small.
Recall that $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h,z)]$ and that $L_S(h) = \frac{1}{m}\sum_{i=1}^m \ell(h,z_i)$. Since each $z_i$ is sampled i.i.d. from $\mathcal{D}$, the expected value of the random variable $\ell(h,z_i)$ is $L_{\mathcal{D}}(h)$. By the linearity of expectation, it follows that $L_{\mathcal{D}}(h)$ is also the expected value of $L_S(h)$. Hence, the quantity $|L_{\mathcal{D}}(h) - L_S(h)|$ is the deviation of the random variable $L_S(h)$ from its expectation. We therefore need to show that the measure of $L_S(h)$ is concentrated around its expected value.
A basic statistical fact, the law of large numbers, states that when $m$ goes to infinity, empirical averages converge to their true expectation. This applies to $L_S(h)$, since it is the empirical average of $m$ i.i.d. random variables. However, the law of large numbers is only an asymptotic result, so it provides no information about the gap between the empirically estimated error and its true value for any given, finite, sample size.
                 Instead, we will use a measure concentration inequality due to Hoeffding, which
              quantifies the gap between empirical averages and their expected value.
Lemma 4.5 (Hoeffding's Inequality). Let $\theta_1, \ldots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\epsilon > 0$,

$$\mathbb{P}\left[\,\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\,\right] \;\le\; 2\exp\left(-2m\epsilon^2/(b-a)^2\right).$$
                 The proof can be found in Appendix B.
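The lemma can also be checked empirically with a small Monte Carlo experiment (a sketch, not part of the text; the Bernoulli distribution, the parameter choices, and the function names are all illustrative assumptions): draw many samples of size $m$, estimate the probability that the empirical mean deviates from $\mu$ by more than $\epsilon$, and compare with the Hoeffding bound.

```python
import math
import random

def hoeffding_bound(m, eps, a=0.0, b=1.0):
    # Right-hand side of Lemma 4.5: 2 * exp(-2 m eps^2 / (b - a)^2).
    return 2.0 * math.exp(-2.0 * m * eps ** 2 / (b - a) ** 2)

def deviation_probability(m, eps, trials=20_000, p=0.5, seed=0):
    # Monte Carlo estimate of P[ |(1/m) sum theta_i - mu| > eps ]
    # for theta_i ~ Bernoulli(p), so mu = p and theta_i in [0, 1].
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(1 for _ in range(m) if rng.random() < p) / m
        if abs(mean - p) > eps:
            hits += 1
    return hits / trials

m, eps = 100, 0.1
estimate = deviation_probability(m, eps)
bound = hoeffding_bound(m, eps)
# The estimated deviation probability stays below the Hoeffding bound.
```

Note that the bound is distribution-free: it depends only on $m$, $\epsilon$, and the range $[a,b]$, not on the particular distribution of the $\theta_i$, which is exactly why it can be applied uniformly to every hypothesis in the next step.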
Getting back to our problem, let $\theta_i$ be the random variable $\ell(h,z_i)$. Since $h$ is fixed and $z_1, \ldots, z_m$ are sampled i.i.d., it follows that $\theta_1, \ldots, \theta_m$ are also i.i.d. random variables. Furthermore, $L_S(h) = \frac{1}{m}\sum_{i=1}^m \theta_i$ and $L_{\mathcal{D}}(h) = \mu$. Let us further assume that the range of $\ell$ is $[0,1]$ and therefore $\theta_i \in [0,1]$. We therefore obtain that

$$\mathcal{D}^m(\{S : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) = \mathbb{P}\left[\,\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\,\right] \;\le\; 2\exp\left(-2m\epsilon^2\right). \tag{4.2}$$
              Combining this with Equation (4.1) yields


$$\mathcal{D}^m(\{S : \exists h \in \mathcal{H}, |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\}) \;\le\; \sum_{h \in \mathcal{H}} 2\exp\left(-2m\epsilon^2\right) = 2|\mathcal{H}|\exp\left(-2m\epsilon^2\right).$$
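To make the final bound concrete, one can ask how large $m$ must be so that the failure probability $2|\mathcal{H}|\exp(-2m\epsilon^2)$ drops below a target $\delta$; solving for $m$ gives $m \ge \log(2|\mathcal{H}|/\delta)/(2\epsilon^2)$. A minimal sketch of this rearrangement (the function name and the example numbers are my own, not from the text):

```python
import math

def sample_complexity(eps, delta, hypothesis_count):
    # Smallest integer m satisfying 2 * |H| * exp(-2 m eps^2) <= delta,
    # obtained by solving the final bound above for m.
    return math.ceil(math.log(2 * hypothesis_count / delta) / (2 * eps ** 2))

# Illustration with |H| = 1000, eps = 0.1, delta = 0.05:
m = sample_complexity(0.1, 0.05, 1000)
# With this m, the probability that some h in H has
# |L_S(h) - L_D(h)| > eps is at most delta.
```

Note the logarithmic dependence on $|\mathcal{H}|$: doubling the size of the hypothesis class adds only a constant number of extra examples.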