Page 136 - Understanding Machine Learning

Model Selection and Validation

                    To illustrate how validation is useful for model selection, consider again the
                 example of fitting a one dimensional polynomial as described in the beginning of
                 this chapter. In the following we depict the same training set, with ERM polynomials
                 of degree 2, 3, and 10, but this time we also depict an additional validation set
                 (marked as red, unfilled circles). The polynomial of degree 10 has minimal training
                 error, yet the polynomial of degree 3 has the minimal validation error, and hence it
                 will be chosen as the best model.
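
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (a hypothetical cubic target with Gaussian noise, not the training set from the figure), the loss is assumed to be the squared loss so that a least-squares fit plays the role of ERM, and the names `sample`, `mse`, and `best_degree` are made up for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily


def sample(n):
    # Hypothetical data source (an assumption): cubic target plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x**3 - 0.5 * x + rng.normal(scale=0.1, size=n)
    return x, y


def mse(p, x, y):
    # Average squared loss of predictor p on the sample (x, y).
    return float(np.mean((p(x) - y) ** 2))


x_train, y_train = sample(15)
x_val, y_val = sample(15)  # held out: never used for fitting

degrees = [2, 3, 10]
train_err, val_err = {}, {}
for d in degrees:
    # Least-squares fit: ERM for the squared loss over degree-d polynomials.
    p = Polynomial.fit(x_train, y_train, deg=d)
    train_err[d] = mse(p, x_train, y_train)
    val_err[d] = mse(p, x_val, y_val)

# Model selection: keep the degree with the smallest validation error.
best_degree = min(degrees, key=val_err.get)

for d in degrees:
    print(f"degree {d:2d}: train={train_err[d]:.4f}  validation={val_err[d]:.4f}")
print("selected degree:", best_degree)
```

The training error can only go down as the degree grows, because the hypothesis classes are nested, so it cannot be used to choose the degree. The validation set is what separates a good fit from an overfit one. Which degree is selected depends on the particular sample drawn.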

















                 11.2.3 The Model-Selection Curve

                 The model-selection curve shows the training error and validation error as a function
                 of the complexity of the model considered. For example, for the polynomial fitting
                 problem mentioned previously, the curve will look like:


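                 [Figure: model-selection curve. Train and Validation error (vertical axis,
                 0 to 0.4) plotted against the polynomial degree d (horizontal axis, 2 to 10).]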

                 As the curve shows, the training error is monotonically decreasing as we increase the
                 polynomial degree (which is the complexity of the model in our case). On the other
                 hand, the validation error first decreases but then starts to increase, which indicates
                 that we are starting to suffer from overfitting.
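
The numbers behind such a curve could be computed along the following lines. Again this is a hedged sketch, not the procedure used to produce the figure: the helper `model_selection_curve`, the synthetic sine-plus-noise data, and the use of the squared loss are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial


def model_selection_curve(x_train, y_train, x_val, y_val, degrees):
    """Return (train_errors, val_errors), one entry per degree in `degrees`."""
    train_errors, val_errors = [], []
    for d in degrees:
        # Least-squares polynomial fit on the training set only.
        p = Polynomial.fit(x_train, y_train, deg=d)
        train_errors.append(float(np.mean((p(x_train) - y_train) ** 2)))
        val_errors.append(float(np.mean((p(x_val) - y_val) ** 2)))
    return train_errors, val_errors


# Hypothetical data (an assumption): noisy samples of a smooth target.
rng = np.random.default_rng(1)
x_tr = rng.uniform(-1.0, 1.0, size=20)
y_tr = np.sin(3.0 * x_tr) + rng.normal(scale=0.2, size=20)
x_va = rng.uniform(-1.0, 1.0, size=20)
y_va = np.sin(3.0 * x_va) + rng.normal(scale=0.2, size=20)

degrees = list(range(1, 11))
curve_train, curve_val = model_selection_curve(x_tr, y_tr, x_va, y_va, degrees)

# Print the two curves as a table; plotting them against d gives the picture.
for d, tr, va in zip(degrees, curve_train, curve_val):
    print(f"d={d:2d}  train={tr:.4f}  validation={va:.4f}")
```

With a least-squares fit the training column is guaranteed to be non-increasing in d (up to floating-point error). The validation column carries no such guarantee, and the degree at which it bottoms out depends on the sample.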
                    Plotting such curves can help us understand whether we are searching the correct
                 regime of our parameter space. Often, there may be more than a single parameter
                 to tune, and the possible number of values each parameter can take might be quite
                 large. For example, in Chapter 13 we describe the concept of regularization, in which
                 the parameter of the learning algorithm is a real number. In such cases, we start