[Figure 11.1: Examples of learning curves, plotting the train error and the validation error against the training set size m. Left: the scenario in which the number of examples is always smaller than the VC-dimension of the class. Right: the scenario in which the approximation error is zero and the number of examples is larger than the VC-dimension of the class.]
Next consider the case in which L_S(h_S) is small. As we argued before, this does
not necessarily imply that the approximation error is small. Indeed, consider two
scenarios, in both of which we are trying to learn a hypothesis class of VC-dimension
d using the ERM learning rule. In the first scenario, we have a training set of m < d
examples and the approximation error of the class is high. In the second scenario,
we have a training set of m > 2d examples and the approximation error of the class
is zero. In both cases L_S(h_S) = 0. How can we distinguish between the two cases?
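To make this indistinguishability concrete, here is a minimal numerical sketch (not from the text): it uses homogeneous linear separators in R^d, a class of VC-dimension d, with a plain perceptron standing in for the ERM learner; the dimension, sample sizes, and the margin filter in the second scenario are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # homogeneous linear separators in R^d; this class has VC-dimension d

def perceptron_erm(X, y, max_epochs=10_000):
    """Run the perceptron until it makes no mistakes on (X, y); on linearly
    separable data this returns a hypothesis with zero training error."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w) <= 0:
                w = w + y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return w

def train_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

# Scenario 1: m < d examples with purely random labels, so the
# approximation error of the class is high.
m1 = d // 2
X1 = rng.standard_normal((m1, d))
y1 = rng.choice([-1.0, 1.0], size=m1)

# Scenario 2: m > 2d examples labeled by a separator from the class, so the
# approximation error is zero (near-boundary points are dropped only so that
# the perceptron converges quickly).
w_star = rng.standard_normal(d)
X2_raw = rng.standard_normal((4 * d, d))
scores = X2_raw @ w_star
keep = np.abs(scores) > 1.0
X2, y2 = X2_raw[keep], np.sign(scores[keep])

print(train_error(perceptron_erm(X1, y1), X1, y1))  # expected: 0.0
print(train_error(perceptron_erm(X2, y2), X2, y2))  # expected: 0.0
```

Both runs are expected to report zero training error, even though only the second scenario can possibly generalize well, which is exactly why the training error alone cannot tell the two situations apart.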
Learning Curves
One possible way to distinguish between the two cases is by plotting learning curves.
To produce a learning curve we train the algorithm on prefixes of the data of
increasing sizes. For example, we can first train the algorithm on the first 10% of the
examples, then on 20% of them, and so on. For each prefix we calculate the training
error (on the prefix the algorithm is being trained on) and the validation error (on
a predefined validation set). Such learning curves can help us distinguish between
the two aforementioned scenarios. In the first scenario we expect the validation
error to be approximately 1/2 for all prefixes, as we didn’t really learn anything.
In the second scenario the validation error will start as a constant but then should
start decreasing (it must start decreasing once the training set size is larger than the
VC-dimension). An illustration of the two cases is given in Figure 11.1.
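As a concrete sketch of this procedure (not from the text): the helper below assumes a learner with scikit-learn-style fit and predict methods, a factory make_learner that creates a fresh learner for each prefix, and the 0-1 loss; these names and the prefix fractions are illustrative.

```python
import numpy as np

def learning_curve(make_learner, X_train, y_train, X_val, y_val,
                   fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Train on growing prefixes of the training data and record, for each
    prefix, the 0-1 error on that prefix and on a fixed validation set."""
    sizes, train_errors, val_errors = [], [], []
    for frac in fractions:
        m = max(1, int(frac * len(y_train)))
        X_pre, y_pre = X_train[:m], y_train[:m]
        h = make_learner().fit(X_pre, y_pre)   # train on the prefix only
        sizes.append(m)
        train_errors.append(np.mean(h.predict(X_pre) != y_pre))
        val_errors.append(np.mean(h.predict(X_val) != y_val))
    return sizes, train_errors, val_errors
```

Plotting train_errors and val_errors against sizes yields curves of the kind sketched in Figure 11.1; scikit-learn also provides a similar ready-made utility in sklearn.model_selection.learning_curve.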
In general, as long as the approximation error is greater than zero we expect the
training error to grow with the sample size, as a larger number of data points makes
it harder to provide an explanation for all of them. On the other hand, the validation
error tends to decrease with the increase in sample size. If the VC-dimension is
finite, when the sample size goes to infinity, the validation and train errors converge
to the approximation error. Therefore, by extrapolating the training and validation
curves we can try to guess the value of the approximation error, or at least to get a
rough estimate on an interval in which the approximation error resides.
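In the notation used above, writing L_V(h_S) for the validation error and L_D for the true risk, this convergence claim can be summarized as follows (the limits hold with high probability over the choice of the training and validation sets):

    lim_{m→∞} L_S(h_S) = lim_{m→∞} L_V(h_S) = min_{h∈H} L_D(h),

where the right-hand side is the approximation error of H.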
Getting back to the problem of finding the best remedy for the failure of our
algorithm, if we observe that L_S(h_S) is small while the validation error is large, then
in any case we know that the size of our training set is not sufficient for learning
the class H. We can then plot a learning curve. If we see that the validation error is
starting to decrease then the best solution is to increase the number of examples (if
we can afford to enlarge the data). Another reasonable solution is to decrease the