Page 38 - Understanding Machine Learning
A Gentle Start
Figure 2.1. Each point in the large circle represents a possible m-tuple of instances. Each
colored oval represents the set of “misleading” m-tuples of instances for some “bad” predictor
h ∈ H_B. The ERM can potentially overfit whenever it gets a misleading training set
S. That is, for some h ∈ H_B we have L_S(h) = 0. Equation (2.9) guarantees that for each
individual bad hypothesis, h ∈ H_B, at most a (1 − ε)^m fraction of the training sets would be
misleading. In particular, the larger m is, the smaller each of these colored ovals becomes.
The union bound formalizes the fact that the area representing the training sets that are
misleading with respect to some h ∈ H_B (that is, the training sets in M) is at most the
sum of the areas of the colored ovals. Therefore, it is bounded by |H_B| times the maximum
size of a colored oval. Any sample S outside the colored ovals cannot cause the ERM rule
to overfit.
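The decay described in the caption can be checked numerically. Below is a minimal Monte Carlo sketch with made-up parameters (the error rate eps and the sample sizes are illustrative choices, not values from the text): each of the m instances is drawn i.i.d., the bad hypothesis errs on any one of them with probability eps, and a training set is "misleading" when the hypothesis makes no error on it.

```python
import random

# A "bad" hypothesis h errs on a fraction eps of the domain (assumed
# value below). A size-m training set is "misleading" for h when h makes
# no error on any of the m sampled points, which happens with
# probability (1 - eps)**m. We estimate that probability by simulation.
random.seed(0)
eps = 0.1          # true error of the bad hypothesis (hypothetical)
trials = 50_000    # number of simulated training sets per m

for m in (5, 20, 50):
    misleading = sum(
        all(random.random() > eps for _ in range(m))  # no error on all m points
        for _ in range(trials)
    )
    print(m, misleading / trials, (1 - eps) ** m)
```

The printed pairs show the empirical fraction of misleading training sets shrinking alongside the theoretical value (1 − eps)^m as m grows, which is exactly why the colored ovals in the figure shrink.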
Then, for any labeling function, f, and for any distribution, D, for which the realizability
assumption holds (that is, for some h ∈ H, L_{(D,f)}(h) = 0), with probability of
at least 1 − δ over the choice of an i.i.d. sample S of size m, we have that for every
ERM hypothesis, h_S, it holds that

L_{(D,f)}(h_S) ≤ ε.
The preceding corollary tells us that for a sufficiently large m, the ERM_H rule
over a finite hypothesis class will be probably (with confidence 1 − δ) approximately
(up to an error of ε) correct. In the next chapter we formally define the model of
Probably Approximately Correct (PAC) learning.
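The corollary's sufficient sample size, m ≥ log(|H|/δ)/ε, can be turned into a quick calculator. This is a sketch of that condition only; the concrete numbers in the example call are illustrative choices, not values from the book.

```python
import math

def sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Smallest integer m satisfying m >= log(|H| / delta) / eps,
    the sufficient sample size from the corollary for a finite class."""
    return math.ceil(math.log(h_size / delta) / eps)

# Hypothetical instance: a class of one million hypotheses, target
# error eps = 1%, failure probability delta = 5%.
print(sample_complexity(10**6, 0.01, 0.05))
```

Note the pleasant scaling: m grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε, so tightening the accuracy requirement is far more expensive than enlarging the hypothesis class.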
2.4 EXERCISES
2.1 Overfitting of polynomial matching: We have shown that the predictor defined in
Equation (2.3) leads to overfitting. While this predictor seems to be very unnatural,
the goal of this exercise is to show that it can be described as a thresholded poly-
nomial. That is, show that given a training set S = {(x_i, f(x_i))}_{i=1}^m ⊆ (R^d × {0,1})^m,
there exists a polynomial p_S such that h_S(x) = 1 if and only if p_S(x) ≥ 0, where h_S
is as defined in Equation (2.3). It follows that learning the class of all thresholded
polynomials using the ERM rule may lead to overfitting.
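One possible construction for this exercise (a sketch, not necessarily the intended solution): take p_S(x) = −∏_{i : f(x_i)=1} ‖x − x_i‖², which is a polynomial in the coordinates of x, vanishes exactly on the positive training points, and is strictly negative everywhere else, so p_S(x) ≥ 0 holds precisely on the positive examples that the memorizing predictor h_S labels 1. The toy points below are hypothetical.

```python
def p_S(x, positives):
    """Sketch of a thresholded polynomial for the memorizing predictor:
    p_S(x) = -prod_i ||x - x_i||^2 over the positive training points.
    It is 0 on those points and strictly negative elsewhere."""
    prod = 1.0
    for xi in positives:
        prod *= sum((a - b) ** 2 for a, b in zip(x, xi))
    return -prod

positives = [(0.0, 1.0), (2.0, 3.0)]    # hypothetical positive examples
print(p_S((0.0, 1.0), positives) >= 0)  # on a positive point: True
print(p_S((5.0, 5.0), positives) >= 0)  # elsewhere: False
```

Since a product of the squared-distance polynomials is itself a polynomial in x's coordinates, thresholding it at 0 reproduces h_S, which is the overfitting predictor the exercise asks about.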
2.2 Let H be a class of binary classifiers over a domain X. Let D be an unknown distri-
bution over X, and let f be the target hypothesis in H. Fix some h ∈ H. Show that
the expected value of L_S(h) over the choice of S|_x equals L_{(D,f)}(h), namely,

E_{S|_x ∼ D^m}[L_S(h)] = L_{(D,f)}(h).
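A quick simulation makes the claim of Exercise 2.2 concrete. Everything here (the domain, the uniform distribution, the target f, and the hypothesis h) is a made-up toy instance, not from the text: averaging the empirical risk L_S(h) over many independently drawn samples should approach the true risk L_{(D,f)}(h), by linearity of expectation.

```python
import random

# Toy instance: uniform distribution over {0,...,9}, a hypothetical
# target f and a fixed hypothesis h. We average the empirical risk
# over many i.i.d. samples and compare it with the true risk.
random.seed(1)
domain = list(range(10))
f = lambda x: x % 2               # hypothetical target function
h = lambda x: 1 if x < 5 else 0   # hypothetical fixed hypothesis

true_risk = sum(h(x) != f(x) for x in domain) / len(domain)

m, reps = 20, 10_000
avg_emp = sum(
    sum(h(x) != f(x) for x in (random.choice(domain) for _ in range(m))) / m
    for _ in range(reps)
) / reps
print(true_risk, round(avg_emp, 3))
```

The averaged empirical risk lands within Monte Carlo noise of the true risk, as the identity in the exercise predicts; note this holds because h is fixed before the sample is drawn, unlike the ERM output h_S.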
2.3 Axis aligned rectangles: An axis aligned rectangle classifier in the plane is a classi-
fier that assigns the value 1 to a point if and only if it is inside a certain rectangle.