7.2 Structural Risk Minimization
In words, we have a fixed sample size m, and we are interested in the lowest possible
upper bound on the gap between empirical and true risks achievable by using a
sample of m examples.
From the definitions of uniform convergence and ε_n, it follows that for every m
and δ, with probability of at least 1 − δ over the choice of S ∼ D^m we have that

∀h ∈ H_n, |L_D(h) − L_S(h)| ≤ ε_n(m, δ). (7.2)
Let w : N → [0,1] be a function such that Σ_{n=1}^∞ w(n) ≤ 1. We refer to w as a
weight function over the hypothesis classes H_1, H_2, .... Such a weight function can
reflect the importance that the learner attributes to each hypothesis class, or some
measure of the complexity of different hypothesis classes. If H is a finite union of N
hypothesis classes, one can simply assign the same weight of 1/N to all hypothesis
classes. This equal weighting corresponds to no a priori preference for any hypothesis
class. Of course, if one believes (as prior knowledge) that a certain hypothesis class is
more likely to contain the correct target function, then it should be assigned a larger
weight, reflecting this prior knowledge. When H is a (countable) infinite union of
hypothesis classes, a uniform weighting is not possible but many other weighting
schemes may work. For example, one can choose w(n) = 6/(π²n²) or w(n) = 2^(−n). Later
in this chapter we will provide another convenient way to define weighting functions
using description languages.
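To make the requirement on w concrete, the following minimal sketch (Python, not part of the text) numerically checks that both weighting schemes mentioned above satisfy Σ_{n=1}^∞ w(n) ≤ 1; the function names and the truncation point N are illustrative choices.

```python
import math

# Two weighting schemes suggested in the text for a countable union of classes.
def w_inverse_square(n: int) -> float:
    # w(n) = 6 / (pi^2 * n^2); sums to exactly 1 over n = 1, 2, ...
    return 6.0 / (math.pi ** 2 * n ** 2)

def w_geometric(n: int) -> float:
    # w(n) = 2^(-n); also sums to exactly 1 over n = 1, 2, ...
    return 2.0 ** (-n)

# Numerically verify that the partial sums do not exceed 1 (they converge to 1).
N = 100_000  # truncation point, chosen arbitrarily for this illustration
for w in (w_inverse_square, w_geometric):
    total = sum(w(n) for n in range(1, N + 1))
    assert total <= 1.0, w.__name__
    print(f"{w.__name__}: sum over first {N} terms = {total:.6f}")
```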
The SRM rule follows a “bound minimization” approach. This means that the
goal of the paradigm is to find a hypothesis that minimizes a certain upper bound
on the true risk. The bound that the SRM rule wishes to minimize is given in the
following theorem.
Theorem 7.4. Let w : N → [0,1] be a function such that Σ_{n=1}^∞ w(n) ≤ 1. Let H be a
hypothesis class that can be written as H = ∪_{n∈N} H_n, where for each n, H_n satisfies
the uniform convergence property with a sample complexity function m_n^UC. Let ε_n
be as defined in Equation (7.1). Then, for every δ ∈ (0,1) and distribution D, with
probability of at least 1 − δ over the choice of S ∼ D^m, the following bound holds
(simultaneously) for every n ∈ N and h ∈ H_n:

|L_D(h) − L_S(h)| ≤ ε_n(m, w(n) · δ).
Therefore, for every δ ∈ (0,1) and distribution D, with probability of at least 1 − δ it
holds that
∀h ∈ H, L_D(h) ≤ L_S(h) + min_{n : h ∈ H_n} ε_n(m, w(n) · δ). (7.3)
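As a concrete instance (an illustration, not part of the theorem): if each H_n is finite, then the uniform convergence rate for finite classes allows one to take ε_n(m, δ) ≈ √(ln(2|H_n|/δ)/(2m)), in which case the bound in Equation (7.3) reads

∀h ∈ H, L_D(h) ≤ L_S(h) + min_{n : h ∈ H_n} √( ln(2|H_n|/(w(n)·δ)) / (2m) ).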
Proof. For each n define δ_n = w(n)·δ. Applying the assumption that uniform conver-
gence holds for all n with the rate given in Equation (7.2), we obtain that if we fix n
in advance, then with probability of at least 1 − δ_n over the choice of S ∼ D^m,

∀h ∈ H_n, |L_D(h) − L_S(h)| ≤ ε_n(m, δ_n).
Applying the union bound over n = 1, 2, ..., we obtain that with probability of at
least 1 − Σ_n δ_n = 1 − δ·Σ_n w(n) ≥ 1 − δ, the preceding holds for all n, which concludes
our proof.
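To illustrate how the bound of Equation (7.3) can be turned into a learning rule (the SRM rule discussed next), here is a minimal sketch in Python, assuming each H_n is given as a finite list of hypotheses so that the finite-class rate ε_n(m, δ) = √(ln(2|H_n|/δ)/(2m)) can be used; the names srm_select and eps_finite, and the choice of the 0-1 empirical loss, are illustrative assumptions, not from the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

def empirical_risk(h: Callable, S: Sequence[Tuple]) -> float:
    # 0-1 empirical loss L_S(h) on the sample S of (x, y) pairs.
    return sum(1 for x, y in S if h(x) != y) / len(S)

def eps_finite(class_size: int, m: int, delta: float) -> float:
    # Assumed uniform convergence rate for a finite class:
    # eps_n(m, delta) = sqrt(ln(2|H_n| / delta) / (2m)).
    return math.sqrt(math.log(2 * class_size / delta) / (2 * m))

def srm_select(classes: List[List[Callable]], S: Sequence[Tuple],
               w: Callable[[int], float], delta: float) -> Callable:
    # SRM as bound minimization: return a hypothesis minimizing
    # L_S(h) + eps_n(m, w(n) * delta) over all n and all h in H_n.
    m = len(S)
    best_h, best_bound = None, float("inf")
    for n, H_n in enumerate(classes, start=1):
        eps = eps_finite(len(H_n), m, w(n) * delta)
        for h in H_n:
            bound = empirical_risk(h, S) + eps
            if bound < best_bound:
                best_h, best_bound = h, bound
    return best_h
```

Because every class containing a given h is examined, the sketch effectively uses the smallest bound over {n : h ∈ H_n}, in line with the minimum appearing in Equation (7.3).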
Denote
n(h) = min{n : h ∈ H_n}, (7.4)