Page 356 - Understanding Machine Learning
P. 356
Covering Numbers
338
27.1.1 Properties
The following lemma is immediate from the definition.
m
m
Lemma 27.2. For any A ⊂ R ,scalar c > 0, and vector a 0 ∈ R , we have
∀r > 0, N(r,{ca + a 0 : a ∈ A}) ≤ N(cr, A).
Next, we derive a contraction principle.
Lemma 27.3. For each i ∈ [m],let φ i : R → R be a ρ-Lipschitz function; namely, for
m
all α,β ∈ R we have |φ i (α) − φ i (β)|≤ ρ |α − β|.For a ∈ R let φ(a) denote the vector
(φ 1 (a 1 ),...,φ m (a m )).Let φ ◦ A ={φ(a): a ∈ A}. Then,
N(ρ r,φ ◦ A) ≤ N(r, A).
Proof. Define B = φ ◦ A.Let A be an r-cover of A and define B = φ ◦ A . Then, for
all a ∈ A there exists a ∈ A with a − a ≤ r. So,
2 2 2 2 2
φ(a) − φ(a ) = (φ i (a i ) − φ i (a )) ≤ ρ (a i − a ) ≤ (ρr) .
i
i
i i
Hence, B is an (ρ r)-cover of B.
27.2 FROM COVERING TO RADEMACHER COMPLEXITY VIA CHAINING
The following lemma bounds the Rademacher complexity of A basedonthe cov-
ering numbers N(r, A). This technique is called Chaining and is attributed to
Dudley.
Lemma 27.4. Let c = min ¯ a max a∈A a − ¯ a . Then, for any integer M > 0,
M
c2 −M 6c −k
R(A) ≤ √ + 2 log(N(c2 −k , A)).
m m
k=1
Proof. Let ¯ a be a minimizer of the objective function given in the definition of c.
On the basis of Lemma 26.6, we can analyze the Rademacher complexity assuming
that ¯ a = 0.
Consider the set B 0 ={0} and note that it is a c-cover of A.Let B 1 ,..., B M
−k
∗
be sets such that each B k corresponds to a minimal (c2 )-cover of A.Let a =
argmax σ,a (where if there is more than one maximizer, choose one in an arbi-
a∈A
trary way, and if a maximizer does not exist, choose a such that σ,a is close
∗
∗
enough to the supremum). Note that a is a function of σ. For each k,let b (k) be the
∗
∗
nearest neighbor of a in B k (hence b (k) is also a function of σ). Using the triangle
inequality,
−k
b (k) − b (k−1) ≤ b (k) − a + a − b (k−1) ≤ c(2 −k + 2 −(k−1) ) = 3c 2 .
∗
∗
For each k define the set
−k
ˆ
B k ={(a − a ): a ∈ B k ,a ∈ B k−1 , a − a ≤ 3c2 }.