Page 378 - Understanding Machine Learning
P. 378
Compression Bounds
360
the examples in S. Note that this protocol is equivalent to the protocol we defined
before – hence Lemma 30.1 still holds.
Applying a union bound over the choice of the sequence of indices we obtain
the following theorem.
k
Theorem 30.2. Let k be an integer and let B : Z → H be a mapping from sequences
of k examples to the hypothesis class. Let m ≥ 2k be a training set size and let A :
m
Z → H be a learning rule that receives a training sequence S of size m and returns
k
) for some (i 1 ,...,i k ) ∈ [m] .Let V ={z j :
a hypothesis such that A(S) = B(z i 1 ,...,z i k
j /∈ (i 1 ,...,i k )} be the set of examples which were not selected for defining A(S). Then,
with probability of at least 1 − δ over the choice of S we have
4k log(m/δ) 8k log(m/δ)
L D (A(S)) ≤ L V (A(S)) + L V (A(S)) + .
m m
k
Proof. For any I ∈ [m] let h I = B(z i 1 ,...,z i k ). Let n = m − k. Combining
Lemma 30.1 with the union bound we have
2L V (h I )log(1/δ) 4log(1/δ)
k
P ∃I ∈ [m] s.t. L D (h I ) − L V (h I ) ≥ +
n n
2L V (h I )log(1/δ) 4log(1/δ)
≤ P L D (h I ) − L V (h I ) ≥ +
n n
I∈[m] k
k
≤ m δ.
k
Denote δ = m δ. Using the assumption k ≤ m/2, which implies that n = m −k ≥ m/2,
the above implies that with probability of at least 1 − δ we have that
4k log(m/δ ) 8k log(m/δ )
L D (A(S)) ≤ L V (A(S)) + L V (A(S)) + ,
m m
which concludes our proof.
As a direct corollary we obtain:
Corollary 30.3. Assuming the conditions of Theorem 30.2, and further assuming that
L V (A(S)) = 0, then, with probability of at least 1 − δ over the choice of S we have
8k log(m/δ)
L D (A(S)) ≤ .
m
These results motivate the following definition:
Definition 30.4. (Compression Scheme) Let H be a hypothesis class of functions
from X to Y and let k be an integer. We say that H has a compression scheme of
size k if the following holds:
k
k
m
For all m there exists A : Z → [m] and B : Z → H such that for all h ∈ H,if we
feed any training set of the form (x 1 ,h(x 1 )),...,(x m ,h(x m )) into A and then feed
)) into B,where (i 1 ,...,i k ) is the output of A, then the
(x i 1 ,h(x i 1 )),...,(x i k ,h(x i k
output of B, denoted h ,satisfies L S (h ) = 0.