
Regularization and Stability

Combining the preceding with Equation (13.14) and again using the assumption $\beta \le \lambda m/2$ yields

$$
\begin{aligned}
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)
&\le \sqrt{2\beta\,\ell(A(S), z_i)}\;\|A(S^{(i)}) - A(S)\| + \frac{\beta}{2}\,\|A(S^{(i)}) - A(S)\|^2 \\
&\le \left(\frac{4\beta}{\lambda m} + \frac{8\beta^2}{(\lambda m)^2}\right)\left(\sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')}\right)^2 \\
&\le \frac{8\beta}{\lambda m}\left(\sqrt{\ell(A(S), z_i)} + \sqrt{\ell(A(S^{(i)}), z')}\right)^2 \\
&\le \frac{24\beta}{\lambda m}\left(\ell(A(S), z_i) + \ell(A(S^{(i)}), z')\right),
\end{aligned}
$$
where the third inequality uses the assumption $\beta \le \lambda m/2$ (which gives $\frac{8\beta^2}{(\lambda m)^2} \le \frac{4\beta}{\lambda m}$), and in the last step we used the inequality $(a+b)^2 \le 3(a^2 + b^2)$. Taking expectation with respect to $S$, $z'$, $i$ and noting that $\mathbb{E}[\ell(A(S), z_i)] = \mathbb{E}[\ell(A(S^{(i)}), z')] = \mathbb{E}[L_S(A(S))]$, we conclude that:
Corollary 13.7. Assume that the loss function is β-smooth and nonnegative. Then, the RLM rule with the regularizer $\lambda \|w\|^2$, where $\lambda \ge \frac{2\beta}{m}$, satisfies

$$
\mathbb{E}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \;\le\; \frac{48\beta}{\lambda m}\,\mathbb{E}\left[L_S(A(S))\right].
$$
Note that if for all $z$ we have $\ell(0, z) \le C$, for some scalar $C > 0$, then for every $S$,

$$
L_S(A(S)) \;\le\; L_S(A(S)) + \lambda \|A(S)\|^2 \;\le\; L_S(0) + \lambda \|0\|^2 \;=\; L_S(0) \;\le\; C.
$$
                 Hence, Corollary 13.7 also implies that
                                      
$$
\mathbb{E}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \;\le\; \frac{48\beta C}{\lambda m}.
$$
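
To make the corollary concrete, the following is a small numerical sketch (an illustration, not part of the book) for ridge regression, i.e., the RLM rule with the squared loss $\ell(w,(x,y)) = (\langle w, x\rangle - y)^2$. When $\|x\| \le 1$ this loss is β-smooth with β = 2, and when $|y| \le 1$ we may take C = 1 since $\ell(0,(x,y)) = y^2 \le 1$. The data-generating model, dimension, sample size, and value of λ below are arbitrary choices; the script estimates the expected replace-one difference by Monte Carlo and prints it next to the (typically loose) bound 48βC/(λm).

import numpy as np

rng = np.random.default_rng(0)

def rlm(X, y, lam):
    # Ridge RLM rule in closed form: argmin_w (1/m) sum_i (<w,x_i> - y_i)^2 + lam*||w||^2
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def sq_loss(w, x, y):
    return float((w @ x - y) ** 2)

def sample(n, w_star):
    # Examples with ||x|| = 1 and |y| <= 1, so the squared loss is beta-smooth with
    # beta = 2 and loss(0, z) = y^2 <= 1, i.e. we may take C = 1.
    X = rng.normal(size=(n, len(w_star)))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.clip(X @ w_star + 0.1 * rng.normal(size=n), -1.0, 1.0)
    return X, y

d, m, beta, C = 5, 500, 2.0, 1.0
lam = 0.05                                    # satisfies lam >= 2*beta/m = 0.008
w_star = rng.normal(size=d)

gaps = []
for _ in range(300):
    X, y = sample(m, w_star)
    Xp, yp = sample(1, w_star)                # a fresh example z' = (Xp[0], yp[0])
    i = rng.integers(m)
    Xi, yi = X.copy(), y.copy()
    Xi[i], yi[i] = Xp[0], yp[0]               # S^(i): S with z_i replaced by z'
    w, wi = rlm(X, y, lam), rlm(Xi, yi, lam)
    gaps.append(sq_loss(wi, X[i], y[i]) - sq_loss(w, X[i], y[i]))

print("estimated replace-one difference:", np.mean(gaps))
print("bound 48*beta*C/(lam*m):        ", 48 * beta * C / (lam * m))
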
                 13.4 CONTROLLING THE FITTING-STABILITY TRADEOFF
                 We can rewrite the expected risk of a learning algorithm as

$$
\mathbb{E}_S[L_D(A(S))] \;=\; \mathbb{E}_S[L_S(A(S))] \;+\; \mathbb{E}_S\left[L_D(A(S)) - L_S(A(S))\right]. \qquad (13.15)
$$
The first term reflects how well $A(S)$ fits the training set, while the second term reflects the difference between the true and empirical risks of $A(S)$. As we have shown in Theorem 13.2, the second term is equivalent to the stability of $A$. Since our goal is to minimize the risk of the algorithm, we need the sum of both terms to be small.
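
Concretely, Theorem 13.2 states that the second term equals the expected replace-one difference that we have just bounded:

$$
\mathbb{E}_{S \sim D^m}\left[L_D(A(S)) - L_S(A(S))\right] \;=\; \mathbb{E}_{(S, z') \sim D^{m+1},\, i \sim U(m)}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right].
$$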
In the previous section we bounded the stability term and showed that it decreases as the regularization parameter $\lambda$ increases. On the other hand, the empirical risk increases with $\lambda$. We therefore face a tradeoff between fitting and overfitting. This tradeoff is quite similar to the bias-complexity tradeoff we discussed previously in the book.
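
The tradeoff can also be observed numerically by sweeping λ for the same ridge-regression RLM rule and estimating the two terms of Equation (13.15). The sketch below (again only an illustration with arbitrary sizes and λ values, reusing the closed-form ridge solver from the earlier sketch) approximates $L_D$ by the risk on a large held-out sample; typically the training risk grows with λ while the estimated gap between true and empirical risk shrinks.

import numpy as np

rng = np.random.default_rng(1)

def rlm(X, y, lam):
    # Ridge RLM rule in closed form: argmin_w (1/m) sum_i (<w,x_i> - y_i)^2 + lam*||w||^2
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def risk(w, X, y):
    # Empirical squared-loss risk of w on the sample (X, y).
    return float(np.mean((X @ w - y) ** 2))

d, m = 20, 50
w_star = rng.normal(size=d)
X_train = rng.normal(size=(m, d))
y_train = X_train @ w_star + rng.normal(size=m)
X_test = rng.normal(size=(20000, d))          # large sample as a stand-in for L_D
y_test = X_test @ w_star + rng.normal(size=20000)

print("lambda    L_S(A(S))    L_D(A(S)) - L_S(A(S))")
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    w = rlm(X_train, y_train, lam)
    fit = risk(w, X_train, y_train)           # fitting term
    gap = risk(w, X_test, y_test) - fit       # estimated stability / generalization term
    print(f"{lam:8.3f}  {fit:10.3f}  {gap:10.3f}")
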