We now derive bounds on the empirical risk term for the RLM rule. Recall that
the RLM rule is defined as $A(S) = \operatorname{argmin}_{\mathbf{w}} \left( L_S(\mathbf{w}) + \lambda \|\mathbf{w}\|^2 \right)$. Fix some arbitrary
vector $\mathbf{w}^*$. We have
\[
L_S(A(S)) \;\le\; L_S(A(S)) + \lambda \|A(S)\|^2 \;\le\; L_S(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2,
\]
where the first inequality holds because the regularization term is nonnegative,
and the second holds because $A(S)$ minimizes the regularized objective.
Taking expectation of both sides with respect to $S$ and noting that $\mathbb{E}_S[L_S(\mathbf{w}^*)] = L_{\mathcal{D}}(\mathbf{w}^*)$, we obtain that
\[
\mathbb{E}_S[L_S(A(S))] \;\le\; L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2. \tag{13.16}
\]
Plugging this into Equation (13.15) we obtain
\[
\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \;\le\; L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2 + \mathbb{E}_S\left[ L_{\mathcal{D}}(A(S)) - L_S(A(S)) \right].
\]
Combining the preceding with Corollary 13.6 we conclude:
Corollary 13.8. Assume that the loss function is convex and $\rho$-Lipschitz. Then, the
RLM rule with the regularization function $\lambda \|\mathbf{w}\|^2$ satisfies
\[
\forall \mathbf{w}^*, \quad \mathbb{E}_S[L_{\mathcal{D}}(A(S))] \;\le\; L_{\mathcal{D}}(\mathbf{w}^*) + \lambda \|\mathbf{w}^*\|^2 + \frac{2\rho^2}{\lambda m}.
\]
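The two $\lambda$-dependent terms pull in opposite directions, and the following short calculation, filling in a step the text leaves implicit, shows where the choice of $\lambda$ in Corollary 13.9 below comes from. Assuming $\|\mathbf{w}^*\| \le B$, minimize $\lambda B^2 + \frac{2\rho^2}{\lambda m}$ over $\lambda$:
\[
\frac{d}{d\lambda}\left( \lambda B^2 + \frac{2\rho^2}{\lambda m} \right) = B^2 - \frac{2\rho^2}{\lambda^2 m} = 0
\quad\Longrightarrow\quad
\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}.
\]
At this value each of the two terms equals $\rho B \sqrt{2/m}$, so their sum is $\rho B \sqrt{8/m}$, matching the bound of Corollary 13.9.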
This bound is often called an oracle inequality – if we think of $\mathbf{w}^*$ as a hypothesis
with low risk, the bound tells us how many examples are needed so that $A(S)$ will
be almost as good as $\mathbf{w}^*$, had we known the norm of $\mathbf{w}^*$. In practice, however, we
usually do not know the norm of $\mathbf{w}^*$. We therefore usually tune $\lambda$ on the basis of a
validation set, as described in Chapter 11.
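To make this tuning procedure concrete, here is a minimal sketch in Python for the special case of the squared loss over linear predictors, where the RLM objective has a closed-form minimizer (ridge regression). The function names and the candidate grid of $\lambda$ values are illustrative assumptions, not part of the text.

```python
import numpy as np

def rlm(X, y, lam):
    """RLM for the squared loss: argmin_w  L_S(w) + lam * ||w||^2,
    with L_S(w) = (1/m) * sum_i (<w, x_i> - y_i)^2.
    Setting the gradient to zero gives the ridge-regression solution
    w = (X^T X + lam * m * I)^{-1} X^T y."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

def tune_lambda(X_tr, y_tr, X_va, y_va, lambdas):
    """Since ||w*|| is unknown, choose lambda by validation risk."""
    def val_risk(lam):
        w = rlm(X_tr, y_tr, lam)
        return np.mean((X_va @ w - y_va) ** 2)
    best = min(lambdas, key=val_risk)
    return best, rlm(X_tr, y_tr, best)

# Example usage with a geometric grid of candidate values:
# lam, w = tune_lambda(X_tr, y_tr, X_va, y_va,
#                      [10.0**k for k in range(-6, 2)])
```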
We can also easily derive a PAC-like guarantee¹ from Corollary 13.8 for convex-
Lipschitz-bounded learning problems:
Corollary 13.9. Let $(\mathcal{H}, Z, \ell)$ be a convex-Lipschitz-bounded learning problem with
parameters $\rho, B$. For any training set size $m$, let $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$. Then, the RLM rule with
the regularization function $\lambda \|\mathbf{w}\|^2$ satisfies
\[
\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \;\le\; \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}(\mathbf{w}) + \rho B \sqrt{\frac{8}{m}}.
\]
In particular, for every $\epsilon > 0$, if $m \ge \frac{8 \rho^2 B^2}{\epsilon^2}$, then for every distribution $\mathcal{D}$,
$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \le \min_{\mathbf{w} \in \mathcal{H}} L_{\mathcal{D}}(\mathbf{w}) + \epsilon$.
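As a quick numeric instantiation of this sample complexity (a worked example, not taken from the text): with $\rho = B = 1$ and target accuracy $\epsilon = 0.1$,
\[
m \;\ge\; \frac{8 \rho^2 B^2}{\epsilon^2} \;=\; \frac{8}{0.01} \;=\; 800,
\]
so $800$ examples suffice for the expected risk of the RLM rule to be within $0.1$ of the best predictor in $\mathcal{H}$.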
The preceding corollary holds for Lipschitz loss functions. If instead the loss
function is smooth and nonnegative, then we can combine Equation (13.16) with
Corollary 13.7 to get:
Corollary 13.10. Assume that the loss function is convex, $\beta$-smooth, and nonnegative.
Then, the RLM rule with the regularization function $\lambda \|\mathbf{w}\|^2$, for $\lambda \ge \frac{2\beta}{m}$, satisfies the
¹ Again, the bound below is on the expected risk, but using Exercise 13.1 it can be used to derive an
agnostic PAC learning guarantee.