Nonuniform Learnability
and then Equation (7.3) implies that
$$L_{\mathcal{D}}(h) \le L_S(h) + \epsilon_{n(h)}\bigl(m, w(n(h)) \cdot \delta\bigr).$$
The SRM paradigm searches for $h$ that minimizes this bound, as formalized in the following pseudocode:
Structural Risk Minimization (SRM)

prior knowledge:
  $\mathcal{H} = \bigcup_n \mathcal{H}_n$, where each $\mathcal{H}_n$ has the uniform convergence property with sample complexity $m^{\mathrm{UC}}_{\mathcal{H}_n}$
  $w : \mathbb{N} \to [0,1]$, where $\sum_n w(n) \le 1$

define: $\epsilon_n$ as in Equation (7.1); $n(h)$ as in Equation (7.4)

input: training set $S \sim \mathcal{D}^m$, confidence $\delta$

output: $h \in \operatorname*{argmin}_{h \in \mathcal{H}} \left[ L_S(h) + \epsilon_{n(h)}\bigl(m, w(n(h)) \cdot \delta\bigr) \right]$
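To make the rule concrete, here is a minimal Python sketch, not the book's pseudocode verbatim: it assumes each $\mathcal{H}_n$ is a finite set of threshold predictors and substitutes the finite-class rate $\sqrt{\ln(2|\mathcal{H}_n|/\delta)/(2m)}$ for the abstract $\epsilon_n$ of Equation (7.1); the nested grids, toy data, and function names are purely illustrative.

```python
import math

# A minimal SRM sketch for finite hypothesis classes (illustrative only).
# Hypotheses are thresholds t, predicting 1 iff x >= t; eps is the
# finite-class uniform convergence rate, and the weights are
# w(n) = 6 / (pi^2 n^2), as in Theorem 7.5.

def w(n):
    # Weight of the n-th class; the weights over n = 1, 2, ... sum to 1.
    return 6.0 / (math.pi ** 2 * n ** 2)

def eps(m, size, delta):
    # Finite-class rate: eps_n(m, delta) = sqrt(ln(2|H_n|/delta) / (2m)).
    return math.sqrt(math.log(2 * size / delta) / (2 * m))

def empirical_risk(t, S):
    # Average 0-1 loss of the threshold predictor t on the sample S.
    return sum(1 for x, y in S if (x >= t) != (y == 1)) / len(S)

def srm(classes, S, delta):
    # Return argmin_h [ L_S(h) + eps_{n(h)}(m, w(n(h)) * delta) ].
    # n(h) is the minimal n with h in H_n, so we scan classes in order
    # and score each threshold only the first time we encounter it.
    m = len(S)
    best_t, best_bound, seen = None, float("inf"), set()
    for n, H_n in enumerate(classes, start=1):
        for t in H_n:
            if t in seen:
                continue
            seen.add(t)
            bound = empirical_risk(t, S) + eps(m, len(H_n), w(n) * delta)
            if bound < best_bound:
                best_t, best_bound = t, bound
    return best_t, best_bound

# Illustrative usage: nested threshold grids on [0, 10], doubling in
# resolution with n, and a toy sample labeled by the threshold 3.7.
classes = [[i / 2 ** n for i in range(10 * 2 ** n + 1)] for n in range(1, 6)]
S = [(x / 10, int(x / 10 >= 3.7)) for x in range(100)]
t_hat, bound = srm(classes, S, delta=0.05)
print(f"SRM chose threshold {t_hat} with bound value {bound:.3f}")
```

Note how the scan order implements $n(h)$: each threshold is scored with the index of the first class containing it, which is the minimal such index because the grids are nested.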
Unlike the ERM paradigm discussed in previous chapters, we no longer care only about the empirical risk, $L_S(h)$; we are willing to trade some of our bias toward low empirical risk for a bias toward classes for which $\epsilon_{n(h)}\bigl(m, w(n(h)) \cdot \delta\bigr)$ is smaller, for the sake of a smaller estimation error.
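To see the trade-off concretely, consider the following hedged illustration, assuming each $\mathcal{H}_n$ is finite so that the finite-class rate $\epsilon_n(m,\delta) \le \sqrt{\ln(2|\mathcal{H}_n|/\delta)/(2m)}$ from Chapter 4 applies, and taking $w(n) = 6/(\pi^2 n^2)$ as in Theorem 7.5 below. Then
$$\epsilon_{n(h)}\bigl(m, w(n(h)) \cdot \delta\bigr) \le \sqrt{\frac{\ln\bigl(2|\mathcal{H}_{n(h)}|\bigr) + 2\ln n(h) + \ln(\pi^2/6) + \ln(1/\delta)}{2m}},$$
so placing $h$ in a richer class inflates the penalty through both $|\mathcal{H}_{n(h)}|$ and $n(h)$, while (one hopes) lowering $L_S(h)$.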
Next we show that the SRM paradigm can be used for nonuniform learning of every class that is a countable union of uniformly converging hypothesis classes.
Theorem 7.5. Let $\mathcal{H}$ be a hypothesis class such that $\mathcal{H} = \bigcup_{n \in \mathbb{N}} \mathcal{H}_n$, where each $\mathcal{H}_n$ has the uniform convergence property with sample complexity $m^{\mathrm{UC}}_{\mathcal{H}_n}$. Let $w : \mathbb{N} \to [0,1]$ be such that $w(n) = \frac{6}{n^2 \pi^2}$. Then, $\mathcal{H}$ is nonuniformly learnable using the SRM rule with rate
$$m^{\mathrm{NUL}}_{\mathcal{H}}(\epsilon, \delta, h) \le m^{\mathrm{UC}}_{\mathcal{H}_{n(h)}}\!\left(\frac{\epsilon}{2},\; \frac{6\delta}{(\pi\, n(h))^2}\right).$$
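The choice of $w$ is a valid weighting function because $\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6$, hence $\sum_n w(n) = 1$, as required by Theorem 7.4. A quick numeric sanity check (illustrative):

```python
import math

# Partial sums of w(n) = 6 / (pi^2 n^2) approach 1 (Basel identity);
# the tail beyond N contributes roughly 6 / (pi^2 N).
partial = sum(6.0 / (math.pi ** 2 * n ** 2) for n in range(1, 10 ** 6 + 1))
print(partial)  # about 0.9999994
```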
Proof. Let $A$ be the SRM algorithm with respect to the weighting function $w$. For every $h \in \mathcal{H}$, $\epsilon$, and $\delta$, let $m \ge m^{\mathrm{UC}}_{\mathcal{H}_{n(h)}}\bigl(\epsilon, w(n(h)) \cdot \delta\bigr)$. Using the fact that $\sum_n w(n) = 1$, we can apply Theorem 7.4 to get that, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$, we have that for every $h' \in \mathcal{H}$,
$$L_{\mathcal{D}}(h') \le L_S(h') + \epsilon_{n(h')}\bigl(m, w(n(h')) \cdot \delta\bigr).$$
The preceding holds in particular for the hypothesis $A(S)$ returned by the SRM rule. By the definition of SRM we obtain that
$$L_{\mathcal{D}}(A(S)) \le \min_{h'} \left[ L_S(h') + \epsilon_{n(h')}\bigl(m, w(n(h')) \cdot \delta\bigr) \right] \le L_S(h) + \epsilon_{n(h)}\bigl(m, w(n(h)) \cdot \delta\bigr).$$
Finally, if $m \ge m^{\mathrm{UC}}_{\mathcal{H}_{n(h)}}\bigl(\epsilon/2, w(n(h)) \cdot \delta\bigr)$ then clearly $\epsilon_{n(h)}\bigl(m, w(n(h)) \cdot \delta\bigr) \le \epsilon/2$. In addition, from the uniform convergence property of each $\mathcal{H}_n$ we have that, with probability of more than $1 - \delta$,
$$L_S(h) \le L_{\mathcal{D}}(h) + \frac{\epsilon}{2}.$$
Combining all the preceding we obtain that $L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon$, which concludes our proof.
Note that the previous theorem also proves Theorem 7.3.