
11.3 What to Do If Learning Fails


                 The estimation error of the class does depend on the sample size. Therefore,
              if we have a large estimation error we can make an effort to obtain more training
              examples. We can also consider reducing the hypothesis class. However, it doesn’t
              make sense to enlarge the hypothesis class in that case.


Error Decomposition Using Validation

We see that understanding whether our problem is due to approximation error or estimation error is very useful for finding the best remedy. In the previous section we saw how to estimate $L_{\mathcal{D}}(h_S)$ using the empirical risk on a validation set. However, it is more difficult to estimate the approximation error of the class. Instead, we give a different error decomposition, one that can be estimated from the train and validation sets:

\[
L_{\mathcal{D}}(h_S) = \big(L_{\mathcal{D}}(h_S) - L_V(h_S)\big) + \big(L_V(h_S) - L_S(h_S)\big) + L_S(h_S).
\]

The first term, $\big(L_{\mathcal{D}}(h_S) - L_V(h_S)\big)$, can be bounded quite tightly using Theorem 11.1. Intuitively, when the second term, $\big(L_V(h_S) - L_S(h_S)\big)$, is large we say that our algorithm suffers from "overfitting," while when the empirical risk term, $L_S(h_S)$, is large we say that our algorithm suffers from "underfitting." Note that these two terms are not necessarily good estimates of the estimation and approximation errors. To illustrate this, consider the case in which $\mathcal{H}$ is a class of VC-dimension $d$, and $\mathcal{D}$ is a distribution such that the approximation error of $\mathcal{H}$ with respect to $\mathcal{D}$ is $1/4$. As long as the size of our training set is smaller than $d$ we will have $L_S(h_S) = 0$ for every ERM hypothesis. Therefore, the training risk, $L_S(h_S)$, and the approximation error, $L_{\mathcal{D}}(h^\star)$ (where $h^\star \in \operatorname{argmin}_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$), can be significantly different. Nevertheless, as we show later, the values of $L_S(h_S)$ and $\big(L_V(h_S) - L_S(h_S)\big)$ still provide us useful information.
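To make the diagnostic concrete, here is a minimal sketch that estimates $L_S(h_S)$ and the gap $L_V(h_S) - L_S(h_S)$ for a given learner. The estimator interface (scikit-learn-style `fit`/`predict`), the 0-1 loss, the 70/30 split, the threshold `tol`, and the synthetic data are illustrative assumptions, not part of the text.

```python
# Sketch: estimate the terms of the train/validation decomposition.
# Assumptions (not from the text): scikit-learn-style estimators,
# 0-1 loss, a 70/30 split, and an arbitrary threshold `tol`.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def diagnose(model, X, y, tol=0.05, seed=0):
    """Return (L_S(h_S), L_V(h_S) - L_S(h_S)) and a rough verdict."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model.fit(X_tr, y_tr)
    train_err = np.mean(model.predict(X_tr) != y_tr)   # L_S(h_S)
    val_err = np.mean(model.predict(X_val) != y_val)   # L_V(h_S)
    gap = val_err - train_err                          # L_V(h_S) - L_S(h_S)
    if train_err > tol:
        verdict = "underfitting: empirical risk is large"
    elif gap > tol:
        verdict = "overfitting: validation-train gap is large"
    else:
        verdict = "both terms are small"
    return train_err, gap, verdict

# An unrestricted decision tree drives L_S(h_S) to 0 on noisy data,
# so any failure shows up in the validation gap, not the training risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # noisy labels
print(diagnose(DecisionTreeClassifier(random_state=0), X, y))
```

This also illustrates the caveat above: the tree attains zero training risk, so $L_S(h_S)$ tells us nothing about the approximation error, while the validation gap remains informative.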
Consider first the case in which $L_S(h_S)$ is large. We can write

\[
L_S(h_S) = \big(L_S(h_S) - L_S(h^\star)\big) + \big(L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big) + L_{\mathcal{D}}(h^\star).
\]

When $h_S$ is an ERM$_{\mathcal{H}}$ hypothesis we have that $L_S(h_S) - L_S(h^\star) \le 0$. In addition, since $h^\star$ does not depend on $S$, the term $\big(L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big)$ can be bounded quite tightly (as in Theorem 11.1). The last term is the approximation error. It follows that if $L_S(h_S)$ is large then so is the approximation error, and the remedy to the failure of our algorithm should be tailored accordingly (as discussed previously).
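To spell out the argument, rearrange the decomposition and apply the two observations; here $\epsilon$ is our shorthand (not from the text) for the high-probability bound on $\big|L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big|$ given by Theorem 11.1:

\[
L_{\mathcal{D}}(h^\star)
= L_S(h_S)
- \underbrace{\big(L_S(h_S) - L_S(h^\star)\big)}_{\le\, 0 \text{ for ERM}}
- \underbrace{\big(L_S(h^\star) - L_{\mathcal{D}}(h^\star)\big)}_{\le\, \epsilon \text{ w.h.p.}}
\;\ge\; L_S(h_S) - \epsilon.
\]

Thus a large training risk forces a large approximation error, up to the small estimation term $\epsilon$.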
Remark 11.1. It is possible that the approximation error of our class is small, yet the value of $L_S(h_S)$ is large. For example, maybe we had a bug in our ERM implementation, and the algorithm returns a hypothesis $h_S$ that is not an ERM. It may also be the case that finding an ERM hypothesis is computationally hard, and our algorithm applies some heuristic trying to find an approximate ERM. In some cases, it is hard to know how good $h_S$ is relative to an ERM hypothesis. But sometimes it is at least possible to know whether there are better hypotheses. For example, in the next chapter we will study convex learning problems, in which there are optimality conditions that can be checked to verify whether our optimization algorithm has converged to an ERM solution. In other cases, the solution may depend on randomness in initializing the algorithm, so we can try different randomly selected initial points to see whether better solutions pop out.
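The last suggestion can be sketched in a few lines: restart a local optimizer from random initial points and keep the best training risk found. The toy non-convex objective and the use of SciPy's BFGS optimizer are illustrative assumptions, not the book's method; any local training procedure could take their place.

```python
# Sketch: random restarts as in Remark 11.1. The toy objective `f`
# and SciPy's BFGS optimizer are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

def f(w):
    # Stand-in for a training risk with several local minima.
    return np.sin(3.0 * w[0]) + (w[0] - 0.5) ** 2

rng = np.random.default_rng(0)
best = None
for _ in range(20):
    w0 = rng.uniform(-3.0, 3.0, size=1)    # random initial point
    res = minimize(f, w0, method="BFGS")   # local optimization from w0
    if best is None or res.fun < best.fun:
        best = res                         # keep the lowest risk found

# If restarts keep improving the training risk, the single-run
# solution was not (close to) an ERM hypothesis.
print("best w:", best.x, "best risk:", best.fun)
```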