refer to $S_1$ as a validation set and to $S_2$ as a training set. We can then estimate the representativeness of $S$ by

$$\sup_{f \in \mathcal{F}} \left( L_{S_1}(f) - L_{S_2}(f) \right). \tag{26.2}$$
This can be written more compactly by defining $\sigma = (\sigma_1, \ldots, \sigma_m) \in \{\pm 1\}^m$ to be a vector such that $S_1 = \{z_i : \sigma_i = 1\}$ and $S_2 = \{z_i : \sigma_i = -1\}$. If we further assume that $|S_1| = |S_2|$, then Equation (26.2) can be rewritten as

$$\frac{2}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i). \tag{26.3}$$
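To see why, note that when $|S_1| = |S_2| = m/2$, each empirical loss in Equation (26.2) is an average over $m/2$ points, so

$$L_{S_1}(f) - L_{S_2}(f) = \frac{2}{m} \sum_{i : \sigma_i = 1} f(z_i) - \frac{2}{m} \sum_{i : \sigma_i = -1} f(z_i) = \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(z_i),$$

and taking the supremum over $f \in \mathcal{F}$ yields Equation (26.3).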
The Rademacher complexity measure captures this idea by considering the expectation of the term appearing in Equation (26.3) with respect to a random choice of $\sigma$. Formally, let $\mathcal{F} \circ S$ be the set of all possible evaluations a function $f \in \mathcal{F}$ can achieve on a sample $S$, namely,

$$\mathcal{F} \circ S = \{ (f(z_1), \ldots, f(z_m)) : f \in \mathcal{F} \}.$$
Let the variables in $\sigma$ be distributed i.i.d. according to $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$. Then, the Rademacher complexity of $\mathcal{F}$ with respect to $S$ is defined as follows:

$$R(\mathcal{F} \circ S) \;\stackrel{\text{def}}{=}\; \frac{1}{m} \, \mathbb{E}_{\sigma \sim \{\pm 1\}^m} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i) \right]. \tag{26.4}$$
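Since the expectation in Equation (26.4) ranges over all $2^m$ sign vectors, it can be approximated by sampling. Below is a minimal Python sketch (not from the text): it assumes a finite class $\mathcal{F}$ whose evaluation vectors $(f(z_1), \ldots, f(z_m))$ are stacked as the rows of a matrix `F_S`; the matrix representation and the function name are illustrative choices.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of
# Equation (26.4) for a finite class F. Illustrative sketch: F_S stacks
# the evaluation vectors (f(z_1), ..., f(z_m)), one row per f in F.
import numpy as np

def rademacher_complexity_mc(F_S, n_trials=10000, seed=0):
    """Estimate R(F o S) = (1/m) E_sigma[ sup_f sum_i sigma_i f(z_i) ]."""
    rng = np.random.default_rng(seed)
    _, m = F_S.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # i.i.d. uniform random signs
        total += np.max(F_S @ sigma)             # sup over the finite class
    return total / (n_trials * m)

# Two constant functions f(z) = 1 and f(z) = -1 on a sample of size 4:
# here sup_f sum_i sigma_i f(z_i) = |sum_i sigma_i|, so R(F o S) = E|sum_i sigma_i| / m.
F_S = np.array([[1.0, 1.0, 1.0, 1.0],
                [-1.0, -1.0, -1.0, -1.0]])
print(rademacher_complexity_mc(F_S))  # approx. 0.375 for m = 4
```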
More generally, given a set of vectors $A \subset \mathbb{R}^m$, we define

$$R(A) \;\stackrel{\text{def}}{=}\; \frac{1}{m} \, \mathbb{E}_{\sigma} \left[ \sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i \right]. \tag{26.5}$$
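For a small finite set $A$, the expectation in Equation (26.5) can even be computed exactly by enumerating all $2^m$ sign vectors. A minimal sketch under the same illustrative assumptions (rows of a matrix represent the vectors of $A$):

```python
# Exact evaluation of R(A) from Equation (26.5) for a small finite
# subset A of R^m, by enumerating all 2^m sign vectors.
import itertools
import numpy as np

def rademacher_complexity_exact(A):
    """Compute R(A) = (1/m) E_sigma[ sup_{a in A} sum_i sigma_i a_i ]."""
    _, m = A.shape
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=m):
        total += np.max(A @ np.array(signs))  # sup over the finite set A
    return total / (2 ** m * m)               # uniform average over all sigma

# A singleton A = {a} has R(A) = 0, since E[sigma_i] = 0 for each i:
print(rademacher_complexity_exact(np.array([[1.0, 1.0]])))    # 0.0
# Adding the negation -a turns the supremum into |sum_i sigma_i a_i|:
print(rademacher_complexity_exact(np.array([[1.0, 1.0],
                                            [-1.0, -1.0]])))  # 0.5
```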
The following lemma bounds the expected value of the representativeness of S
by twice the expected Rademacher complexity.
Lemma 26.2.

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ \operatorname{Rep}_{\mathcal{D}}(\mathcal{F}, S) \right] \;\leq\; 2 \, \mathbb{E}_{S \sim \mathcal{D}^m} \, R(\mathcal{F} \circ S).$$
Proof. Let $S' = \{z'_1, \ldots, z'_m\}$ be another i.i.d. sample. Clearly, for all $f \in \mathcal{F}$, $L_{\mathcal{D}}(f) = \mathbb{E}_{S'}[L_{S'}(f)]$. Therefore, for every $f \in \mathcal{F}$ we have

$$L_{\mathcal{D}}(f) - L_S(f) = \mathbb{E}_{S'}[L_{S'}(f)] - L_S(f) = \mathbb{E}_{S'}[L_{S'}(f) - L_S(f)].$$
Taking the supremum over $f \in \mathcal{F}$ on both sides, and using the fact that the supremum of an expectation is at most the expectation of the supremum, we obtain

$$\sup_{f \in \mathcal{F}} \left( L_{\mathcal{D}}(f) - L_S(f) \right) = \sup_{f \in \mathcal{F}} \mathbb{E}_{S'} \left[ L_{S'}(f) - L_S(f) \right] \;\leq\; \mathbb{E}_{S'} \left[ \sup_{f \in \mathcal{F}} \left( L_{S'}(f) - L_S(f) \right) \right].$$