28.2 The Lower Bound for the Agnostic Case
D_+ and D_−, such that for b ∈ {±1} we have
\[
D_b(\{(x,y)\}) =
\begin{cases}
\frac{1+yb\epsilon}{2} & \text{if } x = c\\
0 & \text{otherwise.}
\end{cases}
\]
That is, all the distribution mass is concentrated on two examples, (c,1) and (c,−1), where the probability of (c,b) is (1+ε)/2 and the probability of (c,−b) is (1−ε)/2.
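To make the construction concrete, here is a small sampling sketch (my own illustration, not from the text; the helper name sample_training_set and the choices ε = 0.2, m = 5 are arbitrary). Under D_b every instance is c, and the label equals b with probability (1+ε)/2.

import random

def sample_training_set(b, epsilon, m, c="c", seed=0):
    """Draw m i.i.d. examples from D_b: the instance is always c, and the
    label equals b with probability (1 + epsilon) / 2, and -b otherwise."""
    rng = random.Random(seed)
    labels = [b if rng.random() < (1 + epsilon) / 2 else -b for _ in range(m)]
    return [(c, y) for y in labels]

# Example: b = +1 and epsilon = 0.2, so P[y = +1] = 0.6 and P[y = -1] = 0.4.
print(sample_training_set(+1, 0.2, m=5))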
Let A be an arbitrary algorithm. Any training set sampled from D_b has the form S = (c, y_1), ..., (c, y_m). Therefore, it is fully characterized by the vector y = (y_1, ..., y_m) ∈ {±1}^m. Upon receiving a training set S, the algorithm A returns a hypothesis h : X → {±1}. Since the error of A w.r.t. D_b only depends on h(c), we can think of A as a mapping from {±1}^m into {±1}. Therefore, we denote by A(y) the value in {±1} corresponding to the prediction h(c), where h is the hypothesis that A outputs upon receiving the training set S = (c, y_1), ..., (c, y_m).
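As a sketch of this reduction (my own illustration; the wrapper as_label_map and the constant learner are hypothetical), an algorithm mapping training sets to hypotheses can be viewed as a map from label vectors y ∈ {±1}^m to a single prediction at c:

def as_label_map(learner, c="c"):
    """View a learner (training set -> hypothesis) as a map y -> h(c),
    i.e., as a function from {+1, -1}^m into {+1, -1}."""
    def A(y):
        h = learner([(c, y_i) for y_i in y])
        return h(c)
    return A

# A toy learner that ignores the data and always predicts +1 everywhere.
constant_learner = lambda S: (lambda x: +1)
A = as_label_map(constant_learner)
print(A([+1, -1, -1]))  # prints +1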
Note that for any hypothesis h we have
\[
L_{D_b}(h) = \frac{1 - h(c)\,b\,\epsilon}{2}.
\]
In particular, the Bayes optimal hypothesis is h_b, the hypothesis with h_b(c) = b, and
\[
L_{D_b}(A(y)) - L_{D_b}(h_b) = \frac{1 - A(y)\,b\,\epsilon}{2} - \frac{1-\epsilon}{2} =
\begin{cases}
\epsilon & \text{if } A(y) \ne b\\
0 & \text{otherwise.}
\end{cases}
\]
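As a quick check of the two cases (a worked instance added here, not part of the original argument): if A(y) = b the difference is \frac{1-\epsilon}{2} - \frac{1-\epsilon}{2} = 0, while if A(y) = −b it is \frac{1+\epsilon}{2} - \frac{1-\epsilon}{2} = \epsilon.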
Fix A. For b ∈ {±1}, let Y^b = {y ∈ {±1}^m : A(y) ≠ b}. The distribution D_b induces a probability P_b over {±1}^m. Hence,
\[
P\bigl[L_{D_b}(A(y)) - L_{D_b}(h_b) \ge \epsilon\bigr] = P_b[Y^b] = \sum_{y} P_b[y]\, 1_{[A(y) \ne b]}.
\]
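This identity can be verified numerically for a concrete rule. The sketch below (my own check, not from the text; majority, m = 5, and ε = 0.2 are arbitrary choices) takes A to be the majority-vote rule, computes the exact sum Σ_y P_b[y] 1_{[A(y)≠b]} by enumerating {±1}^m, and compares it with a Monte Carlo estimate of the probability that A(y) ≠ b.

import itertools
import random

def p_b(y, b, epsilon):
    """Probability of the label vector y under P_b (m i.i.d. draws from D_b)."""
    prob = 1.0
    for y_i in y:
        prob *= (1 + epsilon) / 2 if y_i == b else (1 - epsilon) / 2
    return prob

def majority(y):
    """A concrete choice of A: predict the majority label (ties broken to +1)."""
    return +1 if sum(y) >= 0 else -1

m, b, epsilon = 5, +1, 0.2

exact = sum(p_b(y, b, epsilon)
            for y in itertools.product([+1, -1], repeat=m)
            if majority(y) != b)

rng = random.Random(0)
trials = 200_000
hits = 0
for _ in range(trials):
    y = [b if rng.random() < (1 + epsilon) / 2 else -b for _ in range(m)]
    hits += majority(y) != b
print(exact, hits / trials)  # the two values agree up to Monte Carlo error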
Denote N^+ = {y : |{i : y_i = 1}| ≥ m/2} and N^- = {±1}^m \ N^+. Note that for any y ∈ N^+ we have P_+[y] ≥ P_−[y], and for any y ∈ N^- we have P_−[y] ≥ P_+[y]: writing k = |{i : y_i = 1}|, the ratio P_+[y]/P_−[y] equals ((1+ε)/(1−ε))^{2k−m}, which is at least 1 exactly when k ≥ m/2, and the case y ∈ N^- is symmetric. Therefore,
\[
\begin{aligned}
\max_{b\in\{\pm 1\}} P\bigl[L_{D_b}(A(y)) - L_{D_b}(h_b) \ge \epsilon\bigr]
&= \max_{b\in\{\pm 1\}} \sum_{y} P_b[y]\, 1_{[A(y)\ne b]} \\
&\ge \frac{1}{2}\sum_{y} P_+[y]\, 1_{[A(y)\ne +]} + \frac{1}{2}\sum_{y} P_-[y]\, 1_{[A(y)\ne -]} \\
&= \frac{1}{2}\sum_{y\in N^+} \Bigl(P_+[y]\, 1_{[A(y)\ne +]} + P_-[y]\, 1_{[A(y)\ne -]}\Bigr) \\
&\quad+ \frac{1}{2}\sum_{y\in N^-} \Bigl(P_+[y]\, 1_{[A(y)\ne +]} + P_-[y]\, 1_{[A(y)\ne -]}\Bigr)
\end{aligned}
\]
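Two remarks on the steps above (added commentary, not from the text): the inequality holds because a maximum over b ∈ {±1} is at least the average of the two terms, and the final equality simply splits each sum over the disjoint sets N^+ and N^-. The sketch below (my own check; majority, m = 5, and ε = 0.2 are arbitrary) evaluates both sides of the inequality exactly for the majority-vote rule by enumerating {±1}^m.

import itertools

def p_b(y, b, epsilon):
    """Probability of the label vector y under P_b (i.i.d. draws from D_b)."""
    prob = 1.0
    for y_i in y:
        prob *= (1 + epsilon) / 2 if y_i == b else (1 - epsilon) / 2
    return prob

def majority(y):
    """A concrete choice of A: predict the majority label (ties broken to +1)."""
    return +1 if sum(y) >= 0 else -1

m, epsilon = 5, 0.2
vectors = list(itertools.product([+1, -1], repeat=m))

failure = {b: sum(p_b(y, b, epsilon) for y in vectors if majority(y) != b)
           for b in (+1, -1)}
lhs = max(failure.values())                  # max_b  sum_y P_b[y] 1[A(y) != b]
rhs = 0.5 * failure[+1] + 0.5 * failure[-1]  # the averaged lower bound
print(lhs, rhs, lhs >= rhs)                  # lhs is at least rhs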