On the other hand, for any v and u, and for all i, we have
\[
\begin{aligned}
f_S(v) - f_S(u) &= L_S(v) + \lambda\|v\|^2 - \left(L_S(u) + \lambda\|u\|^2\right) \\
&= L_{S^{(i)}}(v) + \lambda\|v\|^2 - \left(L_{S^{(i)}}(u) + \lambda\|u\|^2\right) \\
&\quad + \frac{\ell(v, z_i) - \ell(u, z_i)}{m} + \frac{\ell(u, z') - \ell(v, z')}{m}. \qquad (13.8)
\end{aligned}
\]
The second equality holds because S and S^{(i)} differ only in their i-th example, so L_S(w) = L_{S^{(i)}}(w) + \frac{\ell(w, z_i) - \ell(w, z')}{m} for every w.
In particular, choosing v = A(S^{(i)}), u = A(S), and using the fact that v minimizes L_{S^{(i)}}(w) + λ‖w‖², we obtain that
\[
f_S(A(S^{(i)})) - f_S(A(S)) \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}. \qquad (13.9)
\]
Combining this with Equation (13.7) we obtain that
\[
\lambda\|A(S^{(i)}) - A(S)\|^2 \le \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m}. \qquad (13.10)
\]
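To make the combination explicit (recall that Equation (13.7), obtained earlier from the strong convexity of f_S, states that f_S(v) − f_S(A(S)) ≥ λ‖v − A(S)‖² for every v):
\[
\lambda\|A(S^{(i)}) - A(S)\|^2 \;\le\; f_S(A(S^{(i)})) - f_S(A(S)) \;\le\; \frac{\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)}{m} + \frac{\ell(A(S), z') - \ell(A(S^{(i)}), z')}{m},
\]
where the first inequality is (13.7) applied with v = A(S^{(i)}) and the second is (13.9).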
The two subsections that follow continue the stability analysis for two families of loss functions: Lipschitz losses and smooth losses. For both families we show that the RLM rule is stable and therefore does not overfit.
13.3.1 Lipschitz Loss
If the loss function, ℓ(·, z_i), is ρ-Lipschitz, then by the definition of Lipschitzness,
\[
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \rho\,\|A(S^{(i)}) - A(S)\|. \qquad (13.11)
\]
Similarly,
\[
\ell(A(S), z') - \ell(A(S^{(i)}), z') \le \rho\,\|A(S^{(i)}) - A(S)\|.
\]
Plugging these inequalities into Equation (13.10) we obtain
\[
\lambda\|A(S^{(i)}) - A(S)\|^2 \le \frac{2\rho\,\|A(S^{(i)}) - A(S)\|}{m},
\]
which yields
\[
\|A(S^{(i)}) - A(S)\| \le \frac{2\rho}{\lambda m}.
\]
Plugging the preceding back into Equation (13.11) we conclude that
\[
\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \le \frac{2\rho^2}{\lambda m}.
\]
Since this holds for any S, z', and i, we immediately obtain:
Corollary 13.6. Assume that the loss function is convex and ρ-Lipschitz. Then, the RLM rule with the regularizer λ‖w‖² is on-average-replace-one-stable with rate \frac{2\rho^2}{\lambda m}.
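As a numerical sanity check of Corollary 13.6, the following Python sketch instantiates the RLM rule with the log-cosh loss ℓ(w, (x, y)) = log cosh(⟨w, x⟩ − y), which is convex and ρ-Lipschitz in w with ρ = max ‖x‖; the synthetic data, the value of λ, and the generic optimizer are all illustrative choices, not part of the book's analysis.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, d, lam = 50, 5, 0.1

# Training set S: m examples z_j = (x_j, y_j)  (synthetic, illustrative)
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

def ell(w, x, yv):
    # log-cosh loss: convex in w, and ||x||-Lipschitz in w
    return np.log(np.cosh(x @ w - yv))

def rlm(X, y):
    # RLM rule: A(S) = argmin_w  L_S(w) + lam * ||w||^2
    obj = lambda w: np.mean(np.log(np.cosh(X @ w - y))) + lam * (w @ w)
    return minimize(obj, np.zeros(d), method="BFGS").x

# S^(i): replace example i with a fresh example z' = (x', y')
i = 0
Xp, yp = X.copy(), y.copy()
Xp[i] = rng.normal(size=d)
yp[i] = rng.normal()

w_S, w_Si = rlm(X, y), rlm(Xp, yp)
# rho must cover every example appearing in S or S^(i)
rho = max(np.linalg.norm(X, axis=1).max(), np.linalg.norm(Xp[i]))

gap = np.linalg.norm(w_Si - w_S)
loss_gap = ell(w_Si, X[i], y[i]) - ell(w_S, X[i], y[i])
print(gap, "<=", 2 * rho / (lam * m))          # bound on ||A(S^(i)) - A(S)||
print(loss_gap, "<=", 2 * rho**2 / (lam * m))  # stability rate of Corollary 13.6

On typical draws both printed inequalities hold with large slack: replacing one of the m examples perturbs the minimizer of the 2λ-strongly convex objective only slightly, exactly as the derivation above predicts.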