Let $B = \|w^\star\|_2$ and consider the set $H = \{w : \|w\|_2 \le B\}$. By the definition of hard-SVM and our assumption on the distribution, we have that $w_S \in H$ with probability 1 and that $L_S(w_S) = 0$. Therefore, using Theorem 26.12 we have that
\[
L_{\mathcal{D}}(w_S) \;\le\; L_S(w_S) + \frac{2BR}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta)}{m}}.
\]
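To get a feel for how this bound scales, the following minimal Python sketch (not from the text; the function name and the values of $B$, $R$, $m$, and $\delta$ are purely illustrative) evaluates its right-hand side for a few sample sizes:

```python
import math

def hard_svm_bound(B, R, m, delta, train_loss=0.0):
    """Right-hand side of the Theorem 26.13 bound:
    L_S(w_S) + 2BR/sqrt(m) + sqrt(2 ln(2/delta) / m).
    For hard-SVM under the separability assumption, train_loss = 0."""
    return train_loss + 2 * B * R / math.sqrt(m) + math.sqrt(2 * math.log(2 / delta) / m)

# Illustrative values: ||w*||_2 = 5, instances with ||x|| <= R = 1, delta = 0.01.
for m in (1_000, 10_000, 100_000):
    print(f"m = {m:>7}: bound = {hard_svm_bound(B=5.0, R=1.0, m=m, delta=0.01):.4f}")
```

As expected, the printed values decay like $1/\sqrt{m}$, which is what drives the sample complexity discussed in the next remark.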
Remark 26.1. Theorem 26.13 implies that the sample complexity of hard-SVM grows like $\frac{R^2\|w^\star\|^2}{\epsilon^2}$. Using a more delicate analysis and the separability assumption, it is possible to improve the bound to an order of $\frac{R^2\|w^\star\|^2}{\epsilon}$.
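To see where this rate comes from, set the dominant term of the Theorem 26.13 bound below a target accuracy $\epsilon$ (the confidence term adds only an additive $2\ln(2/\delta)/\epsilon^2$ requirement, with no dependence on $R$ or $\|w^\star\|$):
\[
\frac{2\|w^\star\|_2 R}{\sqrt{m}} \le \epsilon
\quad\Longleftrightarrow\quad
m \ge \frac{4 R^2 \|w^\star\|_2^2}{\epsilon^2}.
\]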
The bound in the preceding theorem depends on $\|w^\star\|$, which is unknown. In the following we derive a bound that depends on the norm of the output of SVM; hence it can be calculated from the training set itself. The proof is similar to the derivation of bounds for structural risk minimization (SRM).
Theorem 26.14. Assume that the conditions of Theorem 26.13 hold. Then, with probability of at least $1 - \delta$ over the choice of $S \sim \mathcal{D}^m$, we have that
\[
\mathbb{P}_{(x,y)\sim \mathcal{D}}\big[\, y \ne \operatorname{sign}(\langle w_S, x \rangle) \,\big] \;\le\; \frac{4 R \|w_S\|}{\sqrt{m}} + \sqrt{\frac{4 \ln\big(4 \log_2(\|w_S\|)/\delta\big)}{m}}.
\]
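Because every quantity on the right-hand side is computable once the SVM has been trained, the bound can be evaluated directly from the data; here is a minimal Python sketch (the function name and all input values are hypothetical, not from the text):

```python
import math

def posterior_svm_bound(w_norm, R, m, delta):
    """Right-hand side of the Theorem 26.14 bound:
    4 R ||w_S|| / sqrt(m) + sqrt(4 ln(4 log2(||w_S||) / delta) / m).
    Meaningful for ||w_S|| >= 2, so that log2(||w_S||) >= 1 as the
    rounding step in the proof assumes."""
    return (4 * R * w_norm / math.sqrt(m)
            + math.sqrt(4 * math.log(4 * math.log2(w_norm) / delta) / m))

# Illustrative values: norm of the trained SVM solution measured as 8.0.
print(f"{posterior_svm_bound(w_norm=8.0, R=1.0, m=50_000, delta=0.05):.4f}")
```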
Proof. For any integer $i$, let $B_i = 2^i$, $H_i = \{w : \|w\| \le B_i\}$, and let $\delta_i = \frac{\delta}{2i^2}$. Fix $i$; then using Theorem 26.12 we have that with probability of at least $1 - \delta_i$,
\[
\forall w \in H_i, \quad L_{\mathcal{D}}(w) \;\le\; L_S(w) + \frac{2 B_i R}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}}.
\]
Applying the union bound and using $\sum_{i=1}^{\infty} \delta_i \le \delta$, we obtain that with probability of at least $1 - \delta$ this holds for all $i$. Therefore, for all $w$, if we let $i = \lceil \log_2(\|w\|) \rceil$ then $w \in H_i$, $B_i \le 2\|w\|$, and $\frac{2}{\delta_i} = \frac{(2i)^2}{\delta} \le \frac{(4\log_2(\|w\|))^2}{\delta}$. Therefore,
\[
\begin{aligned}
L_{\mathcal{D}}(w) &\le L_S(w) + \frac{2 B_i R}{\sqrt{m}} + \sqrt{\frac{2\ln(2/\delta_i)}{m}} \\
&\le L_S(w) + \frac{4 \|w\| R}{\sqrt{m}} + \sqrt{\frac{4\big(\ln(4\log_2(\|w\|)) + \ln(1/\delta)\big)}{m}}.
\end{aligned}
\]
In particular, it holds for w S , which concludes our proof.
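As a quick sanity check on the union-bound step above: with $\delta_i = \delta/(2i^2)$ we have $\sum_{i=1}^{\infty} \delta_i = \delta \cdot \pi^2/12 < \delta$, which the following snippet confirms numerically:

```python
import math

# Partial sum of 1/(2 i^2); the series converges to pi^2/12 ≈ 0.8225 < 1,
# so sum_i delta_i = delta * (pi^2/12) <= delta, as the union bound requires.
partial = sum(1.0 / (2 * i * i) for i in range(1, 1_000_000))
print(partial, math.pi ** 2 / 12)
```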
Remark 26.2. Note that none of the bounds we have derived depends on the dimension of $w$. This property is utilized when learning SVM with kernels, where the dimension of $w$ can be extremely large.
26.4 GENERALIZATION BOUNDS FOR PREDICTORS WITH LOW $\ell_1$ NORM
In the previous section we derived generalization bounds for linear predictors with an $\ell_2$-norm constraint. In this section we consider the following general $\ell_1$-norm constraint formulation. Let $H = \{w : \|w\|_1 \le B\}$ be our hypothesis class, and let