
11.3 What to Do If Learning Fails


                 The estimation error of the class does depend on the sample size. Therefore,
              if we have a large estimation error we can make an effort to obtain more training
              examples. We can also consider reducing the hypothesis class. However, it doesn’t
              make sense to enlarge the hypothesis class in that case.


Error Decomposition Using Validation

We see that understanding whether our problem is due to approximation error or estimation error is very useful for finding the best remedy. In the previous section we saw how to estimate $L_{\mathcal{D}}(h_S)$ using the empirical risk on a validation set. However, it is more difficult to estimate the approximation error of the class. Instead, we give a different error decomposition, one that can be estimated from the train and validation sets:

\[
L_{\mathcal{D}}(h_S) = \big(L_{\mathcal{D}}(h_S) - L_V(h_S)\big) + \big(L_V(h_S) - L_S(h_S)\big) + L_S(h_S).
\]

The first term, $\big(L_{\mathcal{D}}(h_S) - L_V(h_S)\big)$, can be bounded quite tightly using Theorem 11.1. Intuitively, when the second term, $\big(L_V(h_S) - L_S(h_S)\big)$, is large we say that our algorithm suffers from "overfitting," while when the empirical risk term, $L_S(h_S)$, is large we say that our algorithm suffers from "underfitting." Note that these two terms are not necessarily good estimates of the estimation and approximation errors. To illustrate this, consider the case in which $\mathcal{H}$ is a class of VC-dimension $d$, and $\mathcal{D}$ is a distribution such that the approximation error of $\mathcal{H}$ with respect to $\mathcal{D}$ is $1/4$. As long as the size of our training set is smaller than $d$ we will have $L_S(h_S) = 0$ for every ERM hypothesis. Therefore, the training risk, $L_S(h_S)$, and the approximation error, $L_{\mathcal{D}}(h^\star)$ (where $h^\star \in \operatorname{argmin}_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$), can be significantly different. Nevertheless, as we show later, the values of $L_S(h_S)$ and $\big(L_V(h_S) - L_S(h_S)\big)$ still provide us useful information.
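To make the diagnostic concrete, here is a minimal sketch that estimates $L_S(h_S)$ and the gap $L_V(h_S) - L_S(h_S)$ for a given learner. The estimator interface (scikit-learn-style `fit`/`predict`), the 0-1 loss, the 70/30 split, the threshold `tol`, and the synthetic data are illustrative assumptions, not part of the text.

```python
# Sketch: estimate the terms of the train/validation decomposition.
# Assumptions (not from the text): scikit-learn-style estimators,
# 0-1 loss, a 70/30 split, and an arbitrary threshold `tol`.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def diagnose(model, X, y, tol=0.05, seed=0):
    """Return (L_S(h_S), L_V(h_S) - L_S(h_S)) and a rough verdict."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model.fit(X_tr, y_tr)
    train_err = np.mean(model.predict(X_tr) != y_tr)   # L_S(h_S)
    val_err = np.mean(model.predict(X_val) != y_val)   # L_V(h_S)
    gap = val_err - train_err                          # L_V(h_S) - L_S(h_S)
    if train_err > tol:
        verdict = "underfitting: empirical risk is large"
    elif gap > tol:
        verdict = "overfitting: validation-train gap is large"
    else:
        verdict = "both terms are small"
    return train_err, gap, verdict

# An unrestricted decision tree drives L_S(h_S) to 0 on noisy data,
# so any failure shows up in the validation gap, not the training risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # noisy labels
print(diagnose(DecisionTreeClassifier(random_state=0), X, y))
```

This also illustrates the caveat above: the tree attains zero training risk, so $L_S(h_S)$ tells us nothing about the approximation error, while the validation gap remains informative.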
Consider first the case in which $L_S(h_S)$ is large. We can write

\[
L_S(h_S) = \big(L_S(h_S) - L_S(h^\star)\big) + \big(L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big) + L_{\mathcal{D}}(h^\star).
\]

When $h_S$ is an ERM$_{\mathcal{H}}$ hypothesis we have that $L_S(h_S) - L_S(h^\star) \le 0$. In addition, since $h^\star$ does not depend on $S$, the term $\big(L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big)$ can be bounded quite tightly (as in Theorem 11.1). The last term is the approximation error. It follows that if $L_S(h_S)$ is large then so is the approximation error, and the remedy to the failure of our algorithm should be tailored accordingly (as discussed previously).
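To spell out the argument, rearrange the decomposition and apply the two observations; here $\epsilon$ is our shorthand (not from the text) for the high-probability bound on $\big|L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big|$ given by Theorem 11.1:

\[
L_{\mathcal{D}}(h^\star)
= L_S(h_S)
- \underbrace{\big(L_S(h_S) - L_S(h^\star)\big)}_{\le\, 0 \text{ for ERM}}
- \underbrace{\big(L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big)}_{\le\, \epsilon \text{ w.h.p.}}
\;\ge\; L_S(h_S) - \epsilon.
\]

Thus a large training risk forces a large approximation error, up to the small estimation term $\epsilon$.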
Remark 11.1. It is possible that the approximation error of our class is small, yet the value of $L_S(h_S)$ is large. For example, maybe we had a bug in our ERM implementation, and the algorithm returns a hypothesis $h_S$ that is not an ERM. It may also be the case that finding an ERM hypothesis is computationally hard, and our algorithm applies some heuristic trying to find an approximate ERM. In some cases, it is hard to know how good $h_S$ is relative to an ERM hypothesis. But sometimes it is at least possible to know whether there are better hypotheses. For example, in the next chapter we will study convex learning problems, in which there are optimality conditions that can be checked to verify whether our optimization algorithm has converged to an ERM solution. In other cases, the solution may depend on randomness in initializing the algorithm, so we can try different randomly selected initial points to see whether better solutions pop out.
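The last suggestion can be sketched in a few lines: restart a local optimizer from random initial points and keep the best training risk found. The toy non-convex objective and the use of SciPy's BFGS optimizer are illustrative assumptions, not the book's method; any local training procedure could take their place.

```python
# Sketch: random restarts as in Remark 11.1. The toy objective `f`
# and SciPy's BFGS optimizer are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def f(w):
    # Stand-in for a training risk with several local minima.
    return np.sin(3.0 * w[0]) + (w[0] - 0.5) ** 2

rng = np.random.default_rng(0)
best = None
for _ in range(20):
    w0 = rng.uniform(-3.0, 3.0, size=1)    # random initial point
    res = minimize(f, w0, method="BFGS")   # local optimization from w0
    if best is None or res.fun < best.fun:
        best = res                         # keep the lowest risk found

# If restarts keep improving the training risk, the single-run
# solution was not (close to) an ERM hypothesis.
print("best w:", best.x, "best risk:", best.fun)
```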