We now derive bounds on the empirical risk term for the RLM rule. Recall that
the RLM rule is defined as $A(S) = \operatorname{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + \lambda \|\mathbf{w}\|^2 \right)$. Fix some arbitrary
vector $\mathbf{w}^*$. We have
\[
L_S(A(S)) \;\le\; L_S(A(S)) + \lambda \|A(S)\|^2 \;\le\; L_S(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2,
\]
where the first inequality holds because the regularization term is nonnegative,
and the second holds because $A(S)$ minimizes the regularized objective.
Taking expectation of both sides with respect to $S$ and noting that $\mathbb{E}_S[L_S(\mathbf{w}^*)] = L_{\mathcal{D}}(\mathbf{w}^*)$, we obtain that
\[
\mathbb{E}_S[L_S(A(S))] \;\le\; L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2. \tag{13.16}
\]
Plugging this into Equation (13.15) we obtain
\[
\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \;\le\; L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2 + \mathbb{E}_S\left[ L_{\mathcal{D}}(A(S)) - L_S(A(S)) \right].
\]
Combining the preceding with Corollary 13.6 we conclude:
Corollary 13.8. Assume that the loss function is convex and $\rho$-Lipschitz. Then, the
RLM rule with the regularization function $\lambda \|\mathbf{w}\|^2$ satisfies
\[
\forall \mathbf{w}^*, \quad \mathbb{E}_S[L_{\mathcal{D}}(A(S))] \;\le\; L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2 + \frac{2\rho^2}{\lambda m}.
\]
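The two $\lambda$-dependent terms pull in opposite directions, and the following short calculation, filling in a step the text leaves implicit, shows where the choice of $\lambda$ in Corollary 13.9 below comes from. Assuming $\|\mathbf{w}^*\| \le B$, minimize $\lambda B^2 + \frac{2\rho^2}{\lambda m}$ over $\lambda$:
\[
\frac{d}{d\lambda}\left( \lambda B^2 + \frac{2\rho^2}{\lambda m} \right) = B^2 - \frac{2\rho^2}{\lambda^2 m} = 0
\quad\Longrightarrow\quad
\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}.
\]
At this value each of the two terms equals $\rho B \sqrt{2/m}$, so their sum is $\rho B \sqrt{8/m}$, matching the bound of Corollary 13.9.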
This bound is often called an oracle inequality – if we think of $\mathbf{w}^*$ as a hypothesis
with low risk, the bound tells us how many examples are needed so that $A(S)$ will
be almost as good as $\mathbf{w}^*$, had we known the norm of $\mathbf{w}^*$. In practice, however, we
usually do not know the norm of $\mathbf{w}^*$. We therefore usually tune $\lambda$ on the basis of a
validation set, as described in Chapter 11.
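To make this tuning procedure concrete, here is a minimal sketch in Python for the special case of the squared loss over linear predictors, where the RLM objective has a closed-form minimizer (ridge regression). The function names and the candidate grid of $\lambda$ values are illustrative assumptions, not part of the text.

```python
import numpy as np

def rlm(X, y, lam):
    """RLM for the squared loss: argmin_w  L_S(w) + lam * ||w||^2,
    with L_S(w) = (1/m) * sum_i (<w, x_i> - y_i)^2.
    Setting the gradient to zero gives the ridge-regression solution
    w = (X^T X + lam * m * I)^{-1} X^T y."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

def tune_lambda(X_tr, y_tr, X_va, y_va, lambdas):
    """Since ||w*|| is unknown, choose lambda by validation risk."""
    def val_risk(lam):
        w = rlm(X_tr, y_tr, lam)
        return np.mean((X_va @ w - y_va) ** 2)
    best = min(lambdas, key=val_risk)
    return best, rlm(X_tr, y_tr, best)

# Example usage with a geometric grid of candidate values:
# lam, w = tune_lambda(X_tr, y_tr, X_va, y_va,
#                      [10.0**k for k in range(-6, 2)])
```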
We can also easily derive a PAC-like guarantee¹ from Corollary 13.8 for convex-
Lipschitz-bounded learning problems:
Corollary 13.9. Let $(\mathcal{H}, Z, \ell)$ be a convex-Lipschitz-bounded learning problem with
parameters $\rho, B$. For any training set size $m$, let $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$. Then, the RLM rule with
the regularization function $\lambda \|\mathbf{w}\|^2$ satisfies
\[
\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \;\le\; \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}(\mathbf{w}) + \rho B \sqrt{\frac{8}{m}}.
\]
In particular, for every $\epsilon > 0$, if $m \ge \frac{8 \rho^2 B^2}{\epsilon^2}$, then for every distribution $\mathcal{D}$,
$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \le \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}(\mathbf{w}) + \epsilon$.
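As a quick numeric instantiation of this sample complexity (a worked example, not taken from the text): with $\rho = B = 1$ and target accuracy $\epsilon = 0.1$,
\[
m \;\ge\; \frac{8 \rho^2 B^2}{\epsilon^2} \;=\; \frac{8}{0.01} \;=\; 800,
\]
so $800$ examples suffice for the expected risk of the RLM rule to be within $0.1$ of the best predictor in $\mathcal{H}$.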
The preceding corollary holds for Lipschitz loss functions. If instead the loss
function is smooth and nonnegative, then we can combine Equation (13.16) with
Corollary 13.7 to get:
Corollary 13.10. Assume that the loss function is convex, $\beta$-smooth, and nonnegative.
Then, the RLM rule with the regularization function $\lambda \|\mathbf{w}\|^2$, for $\lambda \ge \frac{2\beta}{m}$, satisfies the
¹ Again, the bound below is on the expected risk, but using Exercise 13.1 it can be used to derive an
agnostic PAC learning guarantee.