On the other hand, for any v and u, and for all i, we have
\[
\begin{aligned}
f_S(v) - f_S(u) &= L_S(v) + \lambda\|v\|^2 - \left(L_S(u) + \lambda\|u\|^2\right) \\
&= L_{S^{(i)}}(v) + \lambda\|v\|^2 - \left(L_{S^{(i)}}(u) + \lambda\|u\|^2\right) \\
&\quad + \frac{\ell(v, z_i) - \ell(u, z_i)}{m} + \frac{\ell(u, z') - \ell(v, z')}{m}. \qquad (13.8)
\end{aligned}
\]
The second equality holds because S and S^{(i)} differ only in their i-th example, so L_S(w) = L_{S^{(i)}}(w) + \frac{\ell(w, z_i) - \ell(w, z')}{m} for every w.
In particular, choosing v = A(S^{(i)}), u = A(S), and using the fact that v minimizes L_{S^{(i)}}(w) + λ‖w‖², we obtain that
\[
f_S(A(S^{(i)})) - f_S(A(S)) \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}. \qquad (13.9)
\]
Combining this with Equation (13.7) we obtain that
\[
\lambda\|A(S^{(i)}) - A(S)\|^2 \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}. \qquad (13.10)
\]
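To make the combination explicit (recall that Equation (13.7), obtained earlier from the strong convexity of f_S, states that f_S(v) − f_S(A(S)) ≥ λ‖v − A(S)‖² for every v):
\[
\lambda\|A(S^{(i)}) - A(S)\|^2 \;\le\; f_S(A(S^{(i)})) - f_S(A(S)) \;\le\; \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m},
\]
where the first inequality is (13.7) applied with v = A(S^{(i)}) and the second is (13.9).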
The two subsections that follow continue the stability analysis for two families of loss functions: Lipschitz losses and smooth losses. For both families we show that the RLM rule is stable and therefore does not overfit.
13.3.1 Lipschitz Loss
If the loss function, ℓ(·, z_i), is ρ-Lipschitz, then by the definition of Lipschitzness,
\[
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \rho\,\|A(S^{(i)}) - A(S)\|. \qquad (13.11)
\]
Similarly,
\[
\ell(A(S), z') - \ell(A(S^{(i)}), z') \le \rho\,\|A(S^{(i)}) - A(S)\|.
\]
Plugging these inequalities into Equation (13.10) we obtain
\[
\lambda\|A(S^{(i)}) - A(S)\|^2 \le \frac{2\rho\,\|A(S^{(i)}) - A(S)\|}{m},
\]
which yields
\[
\|A(S^{(i)}) - A(S)\| \le \frac{2\rho}{\lambda m}.
\]
Plugging the preceding back into Equation (13.11) we conclude that
\[
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \frac{2\rho^2}{\lambda m}.
\]
Since this holds for any S, z', and i, we immediately obtain:
Corollary 13.6. Assume that the loss function is convex and ρ-Lipschitz. Then, the RLM rule with the regularizer λ‖w‖² is on-average-replace-one-stable with rate \frac{2\rho^2}{\lambda m}.
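As a numerical sanity check of Corollary 13.6, the following Python sketch instantiates the RLM rule with the log-cosh loss ℓ(w, (x, y)) = log cosh(⟨w, x⟩ − y), which is convex and ρ-Lipschitz in w with ρ = max ‖x‖; the synthetic data, the value of λ, and the generic optimizer are all illustrative choices, not part of the book's analysis.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, d, lam = 50, 5, 0.1

# Training set S: m examples z_j = (x_j, y_j)  (synthetic, illustrative)
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

def ell(w, x, yv):
    # log-cosh loss: convex in w, and ||x||-Lipschitz in w
    return np.log(np.cosh(x @ w - yv))

def rlm(X, y):
    # RLM rule: A(S) = argmin_w  L_S(w) + lam * ||w||^2
    obj = lambda w: np.mean(np.log(np.cosh(X @ w - y))) + lam * (w @ w)
    return minimize(obj, np.zeros(d), method="BFGS").x

# S^(i): replace example i with a fresh example z' = (x', y')
i = 0
Xp, yp = X.copy(), y.copy()
Xp[i] = rng.normal(size=d)
yp[i] = rng.normal()

w_S, w_Si = rlm(X, y), rlm(Xp, yp)
# rho must cover every example appearing in S or S^(i)
rho = max(np.linalg.norm(X, axis=1).max(), np.linalg.norm(Xp[i]))

gap = np.linalg.norm(w_Si - w_S)
loss_gap = ell(w_Si, X[i], y[i]) - ell(w_S, X[i], y[i])
print(gap, "<=", 2 * rho / (lam * m))          # bound on ||A(S^(i)) - A(S)||
print(loss_gap, "<=", 2 * rho**2 / (lam * m))  # stability rate of Corollary 13.6

On typical draws both printed inequalities hold with large slack: replacing one of the m examples perturbs the minimizer of the 2λ-strongly convex objective only slightly, exactly as the derivation above predicts.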