Page 110 - Data Science Algorithms in a Week

94                              Fred K. Gruber

                       Final Variation of the GA

                          Based on the results of the previous experiments, we selected the parameters shown
                       in Table 5.

                                       Table 5. Parameters in the final genetic algorithm

                        Parameters                    Value
                        Population                    10
                        Generations                   20
                        Prob. of crossover            0.95
                        Prob. of mutation             0.05
                        Fitness function              10-fold crossvalidation
                        Selection                     2-Tournament selection
                        Crossover types               Diagonal with 4 parents
                        Mutation type                 Fixed rate
                        Others                        Elitist strategy
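
As an illustration only, the settings in Table 5 could be collected in a small configuration object, together with a sketch of 2-tournament selection. The names `GAParams` and `tournament_select` are hypothetical and do not come from the original implementation, and the sketch assumes that a higher fitness value is better.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class GAParams:
    """Values copied from Table 5; the field names are illustrative."""
    population: int = 10
    generations: int = 20
    p_crossover: float = 0.95
    p_mutation: float = 0.05
    cv_folds: int = 10          # fitness: 10-fold crossvalidation
    tournament_size: int = 2    # 2-tournament selection
    crossover_parents: int = 4  # diagonal crossover with 4 parents
    elitism: bool = True        # elitist strategy


def tournament_select(population: Sequence, fitness: List[float],
                      rng: random.Random, size: int = 2):
    """Draw `size` distinct individuals at random and return the fittest.

    Assumption: higher fitness is better.
    """
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


params = GAParams()
rng = random.Random(0)
parent = tournament_select(["a", "b", "c"], [0.1, 0.9, 0.5], rng,
                           size=params.tournament_size)
```

With a tournament size of 2, the selection pressure is mild: the worst individual in the population can never win a tournament, but every other individual still has a chance of being chosen.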

                          The activity diagram of the final genetic algorithm is shown in Figure 18. The most
                       important difference between this final model and the one used in the previous section is
                       related to the random split of the data. Instead of using only one split of the data for the
                       complete run of the GA, every time the fitness of the population is calculated, we use a
                       different random split (see Figure 19).
                          As a result, all individuals at a particular generation are measured under the same
                       conditions. Using only one random split throughout the whole run of the GA carries the
                       danger that the generalization error estimate for one particular model may look better
                       than for other models because of the particular random selection and not because the
                       model is really better in general. Using a different random split before calculating the
                       fitness of every individual carries a similar danger: an apparent difference in
                       performance may be due to the particular random split and not due to the different
                       values of the parameters.
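
The scheme described above can be sketched as follows: one new random k-fold split is drawn each time the population is evaluated, and every individual in that generation is scored on exactly those folds. This is a minimal sketch and not the author's code. The function names and the `train_and_score` callback (which would train a model with the individual's parameters on the training indices and return its score on the validation indices) are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Fold = Tuple[List[int], List[int]]  # (training indices, validation indices)


def make_folds(n_samples: int, k: int, rng: random.Random) -> List[Fold]:
    """Shuffle the sample indices once and cut them into k folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = []
    for i in range(k):
        val = indices[i::k]  # every k-th shuffled index goes to fold i
        val_set = set(val)
        train = [j for j in indices if j not in val_set]
        folds.append((train, val))
    return folds


def cv_fitness(individual, folds: Sequence[Fold],
               train_and_score: Callable) -> float:
    """Average the score of one individual over the given folds."""
    scores = [train_and_score(individual, train, val) for train, val in folds]
    return sum(scores) / len(scores)


def evaluate_generation(population: Sequence, n_samples: int, k: int,
                        rng: random.Random,
                        train_and_score: Callable) -> List[float]:
    """Draw one new random split, then score every individual on it."""
    folds = make_folds(n_samples, k, rng)  # shared by the whole generation
    return [cv_fitness(ind, folds, train_and_score) for ind in population]
```

Calling `evaluate_generation` once per generation gives the behaviour described in the text: the split changes from one generation to the next, but within a generation all individuals are compared on identical folds.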
                          While repeating the estimate several times and averaging the results would probably
                       improve the estimate, the increase in computational requirements makes this approach
                       prohibitive. For example, if we have 10 individuals and we use 10-fold crossvalidation,
                       we would have to do 100 trainings per generation. If, in addition, we repeat every
                       estimate 10 times to get an average, we would have to do 1000 trainings per generation.
                       Clearly, for real-world problems this is not a good solution.
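
The cost argument is simple arithmetic, reproduced here as a short sketch. The function name is hypothetical, and the sketch assumes that every individual is re-evaluated in every generation.

```python
def trainings_per_generation(population_size: int, n_folds: int,
                             repeats: int = 1) -> int:
    """Number of model trainings needed to score one generation."""
    return population_size * n_folds * repeats


single = trainings_per_generation(10, 10)        # 10 individuals x 10 folds
repeated = trainings_per_generation(10, 10, 10)  # each estimate repeated 10 times
print(single, repeated)  # prints: 100 1000
```

Under the same assumption, the 20 generations listed in Table 5 would require 2000 trainings without repetition and 20000 with tenfold repetition.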
                          Using the same random split for all individuals within a generation has an
                       interesting analogy with natural evolution. In nature, the environment (represented by
                       a fitness function in GAs) is likely to vary with time; however, at any particular time
                       all individuals are competing under the same conditions.