Proof of the Fundamental Theorem of Learning Theory
\[
\begin{aligned}
&\ge \frac{1}{2}\sum_{y\in N^{+}}\Big(P^{-}[y]\,\mathbf{1}_{[A(y)=+]}+P^{-}[y]\,\mathbf{1}_{[A(y)=-]}\Big)
 +\frac{1}{2}\sum_{y\in N^{-}}\Big(P^{+}[y]\,\mathbf{1}_{[A(y)=+]}+P^{+}[y]\,\mathbf{1}_{[A(y)=-]}\Big)\\
&= \frac{1}{2}\sum_{y\in N^{+}}P^{-}[y]+\frac{1}{2}\sum_{y\in N^{-}}P^{+}[y].
\end{aligned}
\]
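The last equality holds simply because, for every fixed $y$, exactly one of the events $A(y)=+$ and $A(y)=-$ occurs, so
\[
\mathbf{1}_{[A(y)=+]}+\mathbf{1}_{[A(y)=-]}=1
\quad\Longrightarrow\quad
\sum_{y\in N^{+}}P^{-}[y]\big(\mathbf{1}_{[A(y)=+]}+\mathbf{1}_{[A(y)=-]}\big)=\sum_{y\in N^{+}}P^{-}[y],
\]
and similarly for the sum over $N^{-}$.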
Next note that $\sum_{y\in N^{+}} P^{-}[y]=\sum_{y\in N^{-}} P^{+}[y]$, and both values are the probability that a $\mathrm{Binomial}(m,(1-\epsilon)/2)$ random variable will have value greater than $m/2$. Using Lemma B.11, this probability is lower bounded by
\[
\frac{1}{2}\left(1-\sqrt{1-\exp\!\left(-\,m\epsilon^{2}/(1-\epsilon^{2})\right)}\right)
\;\ge\;
\frac{1}{2}\left(1-\sqrt{1-\exp\!\left(-\,2m\epsilon^{2}\right)}\right),
\]
where we used the assumption that $\epsilon^{2}\le 1/2$. It follows that if $m\le 0.5\log(1/(4\delta))/\epsilon^{2}$
then there exists $b$ such that
\[
\mathbb{P}\big[L_{\mathcal{D}_b}(A(y))-L_{\mathcal{D}_b}(h_b)\ge\epsilon\big]
\;\ge\;
\frac{1}{2}\left(1-\sqrt{1-4\delta}\right)\;\ge\;\delta,
\]
where the last inequality follows by standard algebraic manipulations. This concludes our proof.
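For completeness, one way to carry out these calculations (assuming $4\delta\le 1$, so that the square roots are real) is:
\[
m\le \frac{0.5\log(1/(4\delta))}{\epsilon^{2}}
\;\Longrightarrow\;
\exp(-2m\epsilon^{2})\ge \exp\!\big(-\log(1/(4\delta))\big)=4\delta
\;\Longrightarrow\;
\frac{1}{2}\Big(1-\sqrt{1-\exp(-2m\epsilon^{2})}\Big)\ge\frac{1}{2}\Big(1-\sqrt{1-4\delta}\Big),
\]
and the final inequality follows from
\[
\frac{1}{2}\big(1-\sqrt{1-4\delta}\big)\ge\delta
\;\Longleftrightarrow\;
1-2\delta\ge\sqrt{1-4\delta}
\;\Longleftrightarrow\;
(1-2\delta)^{2}=1-4\delta+4\delta^{2}\ge 1-4\delta,
\]
where squaring is valid because $1-2\delta\ge 1/2>0$ when $4\delta\le 1$, and the last inequality holds since $4\delta^{2}\ge 0$.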
28.2.2 Showing That $m(\epsilon,1/8)\ge 8d/\epsilon^{2}$
We shall now prove that for every $\epsilon<1/(8\sqrt{2})$ we have that $m(\epsilon,\delta)\ge 8d/\epsilon^{2}$.
Let $\rho=8\epsilon$ and note that $\rho\in(0,1/\sqrt{2})$. We will construct a family of distributions as follows. First, let $C=\{c_1,\ldots,c_d\}$ be a set of $d$ instances which are shattered by $\mathcal{H}$.
Second, for each vector $(b_1,\ldots,b_d)\in\{\pm 1\}^{d}$, define a distribution $\mathcal{D}_b$ such that
\[
\mathcal{D}_b(\{(x,y)\})=
\begin{cases}
\dfrac{1}{d}\cdot\dfrac{1+y\,b_i\,\rho}{2} & \text{if } \exists i: x=c_i\\
0 & \text{otherwise.}
\end{cases}
\]
That is, to sample an example according to $\mathcal{D}_b$, we first sample an element $c_i\in C$ uniformly at random, and then set the label to be $b_i$ with probability $(1+\rho)/2$ or $-b_i$ with probability $(1-\rho)/2$.
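For intuition, one can check directly that $\mathcal{D}_b$ is indeed a probability distribution and that, conditioned on $x=c_i$, the label equals $b_i$ with probability $(1+\rho)/2$:
\[
\sum_{i=1}^{d}\sum_{y\in\{\pm 1\}}\frac{1}{d}\cdot\frac{1+y\,b_i\,\rho}{2}
=\sum_{i=1}^{d}\frac{1}{d}\cdot\frac{(1+b_i\rho)+(1-b_i\rho)}{2}
=\sum_{i=1}^{d}\frac{1}{d}=1,
\qquad
\mathbb{P}[y=b_i\mid x=c_i]=\frac{1+b_i^{2}\,\rho}{2}=\frac{1+\rho}{2}.
\]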
It is easy to verify that the Bayes optimal predictor for $\mathcal{D}_b$ is the hypothesis $h\in\mathcal{H}$ such that $h(c_i)=b_i$ for all $i\in[d]$, and its error is $\frac{1-\rho}{2}$. In addition, for any other
function $f:\mathcal{X}\to\{\pm 1\}$, it is easy to verify that
\[
L_{\mathcal{D}_b}(f)=\frac{1+\rho}{2}\cdot\frac{|\{i\in[d]: f(c_i)\neq b_i\}|}{d}
+\frac{1-\rho}{2}\cdot\frac{|\{i\in[d]: f(c_i)=b_i\}|}{d}.
\]
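One way to verify this formula is to condition on the sampled instance and use the label probabilities from the definition of $\mathcal{D}_b$:
\[
L_{\mathcal{D}_b}(f)=\mathbb{P}_{(x,y)\sim\mathcal{D}_b}[f(x)\neq y]
=\frac{1}{d}\sum_{i=1}^{d}\mathbb{P}[y\neq f(c_i)\mid x=c_i]
=\frac{1}{d}\sum_{i=1}^{d}\frac{1-f(c_i)\,b_i\,\rho}{2},
\]
and each summand equals $\frac{1+\rho}{2}$ when $f(c_i)\neq b_i$ and $\frac{1-\rho}{2}$ when $f(c_i)=b_i$, which gives the displayed expression.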
Therefore,
\[
L_{\mathcal{D}_b}(f)-\min_{h\in\mathcal{H}}L_{\mathcal{D}_b}(h)
=\rho\cdot\frac{|\{i\in[d]: f(c_i)\neq b_i\}|}{d}. \tag{28.2}
\]
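Indeed, writing $k=|\{i\in[d]: f(c_i)\neq b_i\}|$ and plugging it into the previous expression gives
\[
L_{\mathcal{D}_b}(f)=\frac{1+\rho}{2}\cdot\frac{k}{d}+\frac{1-\rho}{2}\cdot\frac{d-k}{d}
=\frac{1-\rho}{2}+\rho\cdot\frac{k}{d},
\qquad
\min_{h\in\mathcal{H}}L_{\mathcal{D}_b}(h)=\frac{1-\rho}{2},
\]
where the minimum is attained by the hypothesis $h\in\mathcal{H}$ with $h(c_i)=b_i$ for all $i$ (such an $h$ exists because $\mathcal{H}$ shatters $C$); subtracting the two yields Equation (28.2).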