Proof of the Fundamental Theorem of Learning Theory
Note that $(S,T) \in B'$ implies $S \in B$ and therefore $\mathbb{1}_{[(S,T)\in B']} = \mathbb{1}_{[(S,T)\in B']}\,\mathbb{1}_{[S\in B]}$, which gives
\[
\Pr_{S\sim D^m,\,T\sim D^m}[(S,T) \in B'] = \mathop{\mathbb{E}}_{S\sim D^m}\,\mathop{\mathbb{E}}_{T\sim D^m}\!\big[\mathbb{1}_{[(S,T)\in B']}\,\mathbb{1}_{[S\in B]}\big]
= \mathop{\mathbb{E}}_{S\sim D^m}\!\Big[\mathbb{1}_{[S\in B]}\,\mathop{\mathbb{E}}_{T\sim D^m}\!\big[\mathbb{1}_{[(S,T)\in B']}\big]\Big].
\]
Fix some $S$. Then, either $\mathbb{1}_{[S\in B]} = 0$ or $S \in B$, and then there exists $h_S$ such that $D(h_S) \ge \epsilon$ and $|h_S \cap S| = 0$. It follows that a sufficient condition for $(S,T) \in B'$ is that $|T \cap h_S| > \frac{\epsilon m}{2}$. Therefore, whenever $S \in B$ we have
\[
\mathop{\mathbb{E}}_{T\sim D^m}\!\big[\mathbb{1}_{[(S,T)\in B']}\big] \ge \Pr_{T\sim D^m}\!\Big[|T \cap h_S| > \frac{\epsilon m}{2}\Big].
\]
But, since we now assume $S \in B$, we know that $D(h_S) = \rho \ge \epsilon$. Therefore, $|T \cap h_S|$ is a binomial random variable with parameters $\rho$ (probability of success for a single try) and $m$ (number of tries). Chernoff's inequality implies
\[
\Pr\Big[|T \cap h_S| \le \frac{\rho m}{2}\Big] \le e^{-\frac{(m\rho - m\rho/2)^2}{2m\rho}} = e^{-m\rho/8} \le e^{-m\epsilon/8} \le e^{-d\log(1/\delta)/2} = \delta^{d/2} \le 1/2.
\]
Thus,
\[
\Pr\Big[|T \cap h_S| > \frac{\epsilon m}{2}\Big] = 1 - \Pr\Big[|T \cap h_S| \le \frac{\epsilon m}{2}\Big] \ge 1 - \Pr\Big[|T \cap h_S| \le \frac{\rho m}{2}\Big] \ge 1/2.
\]
Combining all the preceding we conclude the proof of Claim 1.
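As a quick numerical sanity check of the Chernoff step above, the following sketch (not part of the original text; the parameter values and the helper name chernoff_check are illustrative choices, and SciPy is assumed to be available) compares the exact binomial lower tail $\Pr[\mathrm{Bin}(m,\rho) \le \rho m/2]$ with the bound $e^{-m\rho/8}$.

import math
from scipy.stats import binom

def chernoff_check(m, rho):
    """Compare the exact tail P[Bin(m, rho) <= rho*m/2] with the bound e^{-m*rho/8}."""
    tail = binom.cdf(math.floor(rho * m / 2), m, rho)  # exact binomial CDF at rho*m/2
    bound = math.exp(-m * rho / 8)                     # Chernoff upper bound used above
    return tail, bound

for m, rho in [(100, 0.1), (500, 0.05), (2000, 0.02)]:
    tail, bound = chernoff_check(m, rho)
    print(f"m={m}, rho={rho}: exact tail = {tail:.3e} <= bound = {bound:.3e}")

In each case the exact tail is well below the bound, which illustrates why the tail probability drops below 1/2 once $m$ is large enough relative to $1/\epsilon$.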
Claim 2 (Symmetrization):
\[
\Pr[(S,T) \in B'] \le e^{-\epsilon m/4}\,\tau_H(2m).
\]
Proof of Claim 2: To simplify notation, let $\alpha = \frac{\epsilon m}{2}$ and, for a sequence $A = (x_1,\ldots,x_{2m})$, let $A_0 = (x_1,\ldots,x_m)$. Using the definition of $B'$ we get that
\[
\Pr[A \in B'] = \mathop{\mathbb{E}}_{A\sim D^{2m}}\!\Big[\max_{h\in H}\, \mathbb{1}_{[D(h)\ge\epsilon]}\,\mathbb{1}_{[|h\cap A_0|=0]}\,\mathbb{1}_{[|h\cap A|\ge\alpha]}\Big]
\le \mathop{\mathbb{E}}_{A\sim D^{2m}}\!\Big[\max_{h\in H}\, \mathbb{1}_{[|h\cap A_0|=0]}\,\mathbb{1}_{[|h\cap A|\ge\alpha]}\Big].
\]
Now, let $H_A$ denote the set of effective hypotheses on $A$, namely, $H_A = \{h \cap A : h \in H\}$. It follows that
\[
\Pr[A \in B'] \le \mathop{\mathbb{E}}_{A\sim D^{2m}}\!\Big[\max_{h\in H_A}\, \mathbb{1}_{[|h\cap A_0|=0]}\,\mathbb{1}_{[|h\cap A|\ge\alpha]}\Big]
\le \mathop{\mathbb{E}}_{A\sim D^{2m}}\!\Big[\sum_{h\in H_A} \mathbb{1}_{[|h\cap A_0|=0]}\,\mathbb{1}_{[|h\cap A|\ge\alpha]}\Big].
\]
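To make the notion of effective hypotheses concrete, here is a small illustrative sketch (not from the text; the threshold hypothesis class, the sample size, and the helper name effective_hypotheses are all assumptions made for the example). Although the class is infinite, at most $2m+1$ distinct intersections with a fixed $A$ can occur, in line with $|H_A| \le \tau_H(2m)$.

import numpy as np

def effective_hypotheses(A, thresholds):
    """Return H_A = {h ∩ A : h ∈ H} for the threshold class h_t = {x : x > t}.

    Each hypothesis is represented by the frozenset of indices of A that it
    contains, so distinct elements of the result are distinct behaviors on A."""
    behaviors = set()
    for t in thresholds:
        behaviors.add(frozenset(i for i, x in enumerate(A) if x > t))
    return behaviors

rng = np.random.default_rng(0)
m = 10
A = rng.uniform(0.0, 1.0, size=2 * m)        # a sample of 2m points
thresholds = np.linspace(-0.1, 1.1, 10_000)  # a dense grid standing in for all h_t
H_A = effective_hypotheses(A, thresholds)
print(len(H_A), "distinct behaviors on A; the growth function allows at most", 2 * m + 1)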
Let $J = \{j \subset [2m] : |j| = m\}$. For any $j \in J$ and $A = (x_1,\ldots,x_{2m})$ define $A_j = (x_{j_1},\ldots,x_{j_m})$. Since the elements of $A$ are chosen i.i.d., we have that for any $j \in J$ and any function $f(A, A_0)$ whose dependence on $A$ is invariant to reordering the elements of $A$ (as is the case for the function below), it holds that $\mathbb{E}_{A\sim D^{2m}}[f(A, A_0)] = \mathbb{E}_{A\sim D^{2m}}[f(A, A_j)]$. Since this holds for any $j$, it also holds for the expectation over $j$ chosen at random from $J$. In particular, it holds for the function $f(A, A_0) = \sum_{h\in H_A} \mathbb{1}_{[|h\cap A_0|=0]}\,\mathbb{1}_{[|h\cap A|\ge\alpha]}$. We