refer to $S_1$ as a validation set and to $S_2$ as a training set. We can then estimate the representativeness of $S$ by

$$\sup_{f \in \mathcal{F}} \left( L_{S_1}(f) - L_{S_2}(f) \right). \tag{26.2}$$
This can be written more compactly by defining $\sigma = (\sigma_1, \ldots, \sigma_m) \in \{\pm 1\}^m$ to be a vector such that $S_1 = \{z_i : \sigma_i = 1\}$ and $S_2 = \{z_i : \sigma_i = -1\}$. If we further assume that $|S_1| = |S_2|$, then Equation (26.2) can be rewritten as

$$\frac{2}{m} \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i). \tag{26.3}$$
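To see why, note that when $|S_1| = |S_2| = m/2$, each empirical loss in Equation (26.2) is an average over $m/2$ points, so

$$L_{S_1}(f) - L_{S_2}(f) = \frac{2}{m} \sum_{i : \sigma_i = 1} f(z_i) - \frac{2}{m} \sum_{i : \sigma_i = -1} f(z_i) = \frac{2}{m} \sum_{i=1}^{m} \sigma_i f(z_i),$$

and taking the supremum over $f \in \mathcal{F}$ yields Equation (26.3).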
The Rademacher complexity measure captures this idea by considering the expectation of the term appearing in Equation (26.3) with respect to a random choice of $\sigma$. Formally, let $\mathcal{F} \circ S$ be the set of all possible evaluations a function $f \in \mathcal{F}$ can achieve on a sample $S$, namely,

$$\mathcal{F} \circ S = \{ (f(z_1), \ldots, f(z_m)) : f \in \mathcal{F} \}.$$
Let the variables in $\sigma$ be distributed i.i.d. according to $\mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$. Then, the Rademacher complexity of $\mathcal{F}$ with respect to $S$ is defined as follows:

$$R(\mathcal{F} \circ S) \;\stackrel{\text{def}}{=}\; \frac{1}{m} \, \mathbb{E}_{\sigma \sim \{\pm 1\}^m} \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i) \right]. \tag{26.4}$$
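Since the expectation in Equation (26.4) ranges over all $2^m$ sign vectors, it can be approximated by sampling. Below is a minimal Python sketch (not from the text): it assumes a finite class $\mathcal{F}$ whose evaluation vectors $(f(z_1), \ldots, f(z_m))$ are stacked as the rows of a matrix `F_S`; the matrix representation and the function name are illustrative choices.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of
# Equation (26.4) for a finite class F. Illustrative sketch: F_S stacks
# the evaluation vectors (f(z_1), ..., f(z_m)), one row per f in F.
import numpy as np

def rademacher_complexity_mc(F_S, n_trials=10000, seed=0):
    """Estimate R(F o S) = (1/m) E_sigma[ sup_f sum_i sigma_i f(z_i) ]."""
    rng = np.random.default_rng(seed)
    _, m = F_S.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)  # i.i.d. uniform random signs
        total += np.max(F_S @ sigma)             # sup over the finite class
    return total / (n_trials * m)

# Two constant functions f(z) = 1 and f(z) = -1 on a sample of size 4:
# here sup_f sum_i sigma_i f(z_i) = |sum_i sigma_i|, so R(F o S) = E|sum_i sigma_i| / m.
F_S = np.array([[1.0, 1.0, 1.0, 1.0],
                [-1.0, -1.0, -1.0, -1.0]])
print(rademacher_complexity_mc(F_S))  # approx. 0.375 for m = 4
```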
More generally, given a set of vectors $A \subset \mathbb{R}^m$, we define

$$R(A) \;\stackrel{\text{def}}{=}\; \frac{1}{m} \, \mathbb{E}_{\sigma} \left[ \sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i \right]. \tag{26.5}$$
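For a small finite set $A$, the expectation in Equation (26.5) can even be computed exactly by enumerating all $2^m$ sign vectors. A minimal sketch under the same illustrative assumptions (rows of a matrix represent the vectors of $A$):

```python
# Exact evaluation of R(A) from Equation (26.5) for a small finite
# subset A of R^m, by enumerating all 2^m sign vectors.
import itertools
import numpy as np

def rademacher_complexity_exact(A):
    """Compute R(A) = (1/m) E_sigma[ sup_{a in A} sum_i sigma_i a_i ]."""
    _, m = A.shape
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=m):
        total += np.max(A @ np.array(signs))  # sup over the finite set A
    return total / (2 ** m * m)               # uniform average over all sigma

# A singleton A = {a} has R(A) = 0, since E[sigma_i] = 0 for each i:
print(rademacher_complexity_exact(np.array([[1.0, 1.0]])))    # 0.0
# Adding the negation -a turns the supremum into |sum_i sigma_i a_i|:
print(rademacher_complexity_exact(np.array([[1.0, 1.0],
                                            [-1.0, -1.0]])))  # 0.5
```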
The following lemma bounds the expected value of the representativeness of S
by twice the expected Rademacher complexity.
Lemma 26.2.

$$\mathbb{E}_{S \sim \mathcal{D}^m} \left[ \operatorname{Rep}_{\mathcal{D}}(\mathcal{F}, S) \right] \;\leq\; 2 \, \mathbb{E}_{S \sim \mathcal{D}^m} \, R(\mathcal{F} \circ S).$$
Proof. Let $S' = \{z'_1, \ldots, z'_m\}$ be another i.i.d. sample. Clearly, for all $f \in \mathcal{F}$, $L_{\mathcal{D}}(f) = \mathbb{E}_{S'}[L_{S'}(f)]$. Therefore, for every $f \in \mathcal{F}$ we have

$$L_{\mathcal{D}}(f) - L_S(f) = \mathbb{E}_{S'}[L_{S'}(f)] - L_S(f) = \mathbb{E}_{S'}[L_{S'}(f) - L_S(f)].$$
Taking the supremum over $f \in \mathcal{F}$ on both sides, and using the fact that the supremum of an expectation is at most the expectation of the supremum, we obtain

$$\sup_{f \in \mathcal{F}} \left( L_{\mathcal{D}}(f) - L_S(f) \right) = \sup_{f \in \mathcal{F}} \mathbb{E}_{S'} \left[ L_{S'}(f) - L_S(f) \right] \;\leq\; \mathbb{E}_{S'} \left[ \sup_{f \in \mathcal{F}} \left( L_{S'}(f) - L_S(f) \right) \right].$$