Proof of the Fundamental Theorem of Learning Theory
The sum within the parentheses is minimized when $A(j,\mathbf{y})(c_i)$ is the maximizer of $\mathbb{P}[\mathbf{y}\,|\,b_i]$ over $b_i \in \{\pm 1\}$, which is exactly the Maximum-Likelihood rule. Repeating the same argument for all $i$ we conclude our proof.
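To make the Maximum-Likelihood step concrete, here is a minimal sketch (my own illustration, not part of the text). It assumes, in line with this construction, that each label observed for $c_i$ equals the true bit $b_i$ independently with probability $(1+\rho)/2$; under that assumption, maximizing the likelihood over $b_i \in \{\pm 1\}$ reduces to a majority vote over the observed labels.

```python
# Minimal sketch (illustration only): Maximum-Likelihood recovery of a single bit b_i
# from noisy labels. Assumption: each observed label equals b_i with probability
# (1 + rho) / 2, independently, matching the construction in this proof.
import numpy as np

def ml_bit(labels, rho):
    """Return the b in {+1, -1} maximizing the likelihood of the observed labels."""
    p_agree = (1 + rho) / 2                      # chance a single label matches b
    best = None
    for b in (+1, -1):
        agree = int(np.sum(labels == b))
        log_lik = agree * np.log(p_agree) + (len(labels) - agree) * np.log(1 - p_agree)
        if best is None or log_lik > best[1]:
            best = (b, log_lik)
    return best[0]

rng = np.random.default_rng(0)
rho, b_true, n = 0.2, +1, 15                     # n odd, so no ties in the vote
labels = np.where(rng.random(n) < (1 + rho) / 2, b_true, -b_true)
# ML coincides with the majority vote over the observed labels:
assert ml_bit(labels, rho) == np.sign(labels.sum())
```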
Fix $i$. For every $j$, let $n_i(j) = |\{t : j_t = i\}|$ be the number of instances in which the instance is $c_i$. For the Maximum-Likelihood rule, we have that the quantity
\[
\mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \;\mathop{\mathbb{E}}_{\forall r,\, y_r \sim b_{j_r}} \mathbb{1}_{[A_{\mathrm{ML}}(S)(c_i) \neq b_i]}
\]
is exactly the probability that a binomial $(n_i(j), (1-\rho)/2)$ random variable will be larger than $n_i(j)/2$. Using Lemma B.11, and the assumption $\rho^2 \le 1/2$, we have that
\[
\mathbb{P}[B \ge n_i(j)/2] \;\ge\; \frac{1}{2}\left(1 - \sqrt{1 - e^{-2 n_i(j)\rho^2}}\right).
\]
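The following is a small numerical spot-check of this tail bound (my own sketch, not from the text): it computes $\mathbb{P}[B \ge n/2]$ exactly for $B \sim \mathrm{Binomial}(n, (1-\rho)/2)$ and compares it against $\tfrac{1}{2}\big(1-\sqrt{1-e^{-2n\rho^2}}\big)$ for a few values of $n$ and $\rho$ with $\rho^2 \le 1/2$.

```python
# Numerical spot-check (my own sketch) of the tail bound used above:
#   P[B >= n/2] >= (1/2) * (1 - sqrt(1 - exp(-2 * n * rho**2)))
# for B ~ Binomial(n, (1 - rho) / 2), over a few (n, rho) pairs with rho**2 <= 1/2.
from math import comb, exp, sqrt

def tail_at_least_half(n, p):
    """Exact P[B >= n/2] for B ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if 2 * k >= n)

for n in (1, 5, 10, 50, 200):
    for rho in (0.05, 0.1, 0.3, 0.7):
        lhs = tail_at_least_half(n, (1 - rho) / 2)
        rhs = 0.5 * (1 - sqrt(1 - exp(-2 * n * rho**2)))
        assert lhs >= rhs, (n, rho, lhs, rhs)
print("tail bound verified on all tested (n, rho) pairs")
```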
We have thus shown that
\[
\begin{aligned}
\mathop{\mathbb{E}}_{j \sim U([d]^m)} \;\mathop{\mathbb{E}}_{b \sim U(\{\pm 1\}^d)} \;\mathop{\mathbb{E}}_{\forall r,\, y_r \sim b_{j_r}} \;\frac{\rho}{d}\sum_{i=1}^{d} \mathbb{1}_{[A(S)(c_i) \neq b_i]}
&\;\ge\; \frac{\rho}{2d}\sum_{i=1}^{d} \mathop{\mathbb{E}}_{j \sim U([d]^m)} \left(1 - \sqrt{1 - e^{-2\rho^2 n_i(j)}}\right) \\
&\;\ge\; \frac{\rho}{2d}\sum_{i=1}^{d} \mathop{\mathbb{E}}_{j \sim U([d]^m)} \left(1 - \sqrt{2\rho^2 n_i(j)}\right),
\end{aligned}
\]
where in the last inequality we used the inequality $1 - e^{-a} \le a$.
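As a quick sanity check of that last step (again my own sketch, not from the text): for $a \ge 0$ the inequality $1 - e^{-a} \le a$ gives $\sqrt{1-e^{-a}} \le \sqrt{a}$, hence $1-\sqrt{1-e^{-a}} \ge 1-\sqrt{a}$.

```python
# Quick check (my own) of the elementary step: for a >= 0, 1 - exp(-a) <= a,
# hence sqrt(1 - exp(-a)) <= sqrt(a) and 1 - sqrt(1 - exp(-a)) >= 1 - sqrt(a).
from math import exp, sqrt

for a in (0.0, 1e-3, 0.1, 0.5, 1.0, 2.0, 10.0):
    assert 1 - exp(-a) <= a + 1e-12
    assert 1 - sqrt(1 - exp(-a)) >= 1 - sqrt(a) - 1e-12
```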
Since the square root function is concave, we can apply Jensen's inequality to obtain that the above is lower bounded by
\[
\begin{aligned}
&\;\ge\; \frac{\rho}{2d}\sum_{i=1}^{d}\left(1 - \sqrt{2\rho^2 \mathop{\mathbb{E}}_{j \sim U([d]^m)} n_i(j)}\right) \\
&\;=\; \frac{\rho}{2d}\sum_{i=1}^{d}\left(1 - \sqrt{2\rho^2\, m/d}\right) \\
&\;=\; \frac{\rho}{2}\left(1 - \sqrt{2\rho^2\, m/d}\right),
\end{aligned}
\]
where the first equality holds because each of the $m$ coordinates of $j$ equals $i$ with probability $1/d$, so $\mathop{\mathbb{E}}_{j \sim U([d]^m)} n_i(j) = m/d$.
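The Jensen step can be illustrated numerically; the sketch below is my own, with arbitrary choices of $d$, $m$, $\rho$, and the index $i$. It draws $j \sim U([d]^m)$ repeatedly and checks that the empirical average of $1-\sqrt{2\rho^2 n_i(j)}$ is at least $1-\sqrt{2\rho^2 m/d}$.

```python
# Monte-Carlo illustration (my own sketch; d, m, rho, i chosen arbitrarily) of the
# Jensen step: E[n_i(j)] = m/d for j ~ U([d]^m), and by concavity of sqrt the
# average of 1 - sqrt(2 rho^2 n_i(j)) is at least 1 - sqrt(2 rho^2 m / d).
import numpy as np

rng = np.random.default_rng(1)
d, m, rho, i, trials = 20, 100, 0.05, 3, 20000

# n_i(j) = number of coordinates of j that equal i, for each of `trials` draws of j
n_i = (rng.integers(0, d, size=(trials, m)) == i).sum(axis=1)

empirical = np.mean(1 - np.sqrt(2 * rho**2 * n_i))
jensen_bound = 1 - np.sqrt(2 * rho**2 * m / d)

print(f"E[n_i(j)] ~ {n_i.mean():.3f}   (m/d = {m / d})")
print(f"mean of 1 - sqrt(2 rho^2 n_i(j)) ~ {empirical:.4f}  >=  {jensen_bound:.4f}")
```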
As long as $m < \frac{d}{8\rho^2}$ (so that $2\rho^2 m/d < 1/4$), this term would be larger than $\rho/4$.
In summary, we have shown that if $m < \frac{d}{8\rho^2}$ then for any algorithm there exists a distribution such that
\[
\mathop{\mathbb{E}}_{S \sim \mathcal{D}^m} L_{\mathcal{D}}(A(S)) \;-\; \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) \;\ge\; \rho/4.
\]
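Finally, a small check of the closing arithmetic (my own sketch, with arbitrary $d$ and $\rho$): any $m < \frac{d}{8\rho^2}$ gives $2\rho^2 m/d < 1/4$, hence $\frac{\rho}{2}\big(1-\sqrt{2\rho^2 m/d}\big) > \rho/4$.

```python
# Closing arithmetic check (my own sketch, arbitrary d and rho): whenever
# m < d / (8 rho^2) we get 2 rho^2 m / d < 1/4, so the lower bound
# (rho/2) * (1 - sqrt(2 rho^2 m / d)) exceeds rho / 4.
from math import floor, sqrt

for d, rho in ((10, 0.1), (100, 0.05), (1000, 0.2)):
    m_max = floor(d / (8 * rho**2)) - 1          # an m strictly below d / (8 rho^2)
    for m in (1, m_max // 2, m_max):
        bound = (rho / 2) * (1 - sqrt(2 * rho**2 * m / d))
        assert bound > rho / 4, (d, rho, m, bound)
print("bound exceeds rho/4 for all tested m < d / (8 rho^2)")
```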