Page 195 - Data Science Algorithms in a Week
P. 195

Predictive Analytics using Genetic Programming            179

                          GP  assigns  the  computer  the  ability  to  develop  computer  programs/models
                       (represented as tree structures). GP is based on the “Darwinian ideas of survival of the
                       fittest" and the operators of crossover, mutation, and reproduction (Koza, 1994; Koza et
                       al., 2003). Figure 6 (modified from Koza (1994)) where Gen is the current generation. M
                       is  the  population  size  and  i  is  the  current  individual  in  the  population.  The  initial
                       population of size M is created and Gen is initialized to 0. As stated by Ratner (2008),
                       “The process begins with a fitness function and a set of user-selectable mathematical and
                       logical functions” from which a predictive model can be formed. “A first generation of as
                       many  as  250  -  1000  models  is  randomly  generated  using  the  functions  and  variables
                       available;  the  fitness  of  each  model  is  evaluated  using  collected  data”  (Ratner,  2008).
                       Each individual i of the population is evaluated from the objective function viewpoint
                       and  their  relative  fitness  calculated.  Then,  the  highest  values  are  compared  with  the
                       objectives  of  the  project/session  and  it  can  be  decided  to  stop  if  met,  otherwise  the
                       evolutionary process will continue.
                          The  next  generations  of  models  are  created  following  the  processes  of  mating
                       (crossover),  reproduction,  and  mutation.  Crossover  (as  explained  above)  is  when  two
                       computer programs pair off. The resulting offspring are blends of the parents’ genetic
                       material. Reproduction is just the cloning of the best individuals that evolution should
                       maintain for the next generation. On the other hand, mutation in GP is considered as a
                       secondary  operator.  Piszcz  &  Soule  (2005)  have  shown  that  mutation  can  improve
                       performance when combined with crossover. There are several mutation realizations in
                       GP environments. However, the most utilized ones are: node based mutation (i.e., a rate
                       specifies the node’s probability – Figure 7) and tree based mutation (i.e., a rate gives the
                       frequency that individuals are selected). When applying mutation in GP, “the mutation
                       rate is set to 1/C, where C denotes the size of the tree” (Piszcz & Soule, 2005). After a
                       number of generations, GP provides a predictive model adapted to the objective.
                          GP is considered a non-statistical methodology. Its major use is predictive modeling.
                       However,  GP  can  also  be  used  for  knowledge  discovery  as  explained  by  Koza  et  al.
                       (2003).  Unfortunately,  the  current  articles  in  predictive  analytics  mention  genetic
                       algorithms but not GP.


                                                       CASE STUDY

                          NASA's  Space  Shuttle  was  the  first  orbital  spacecraft  that  was  a  reusable  launch
                       vehicle. At launch, it consisted of the following major systems: external tank, solid rocket
                       boosters, and orbiter (Figure 8). After 2003, there were three orbiters: Atlantis, Discovery
                       and Endeavour. Discovery completed its final mission on March 9, 2011 and Endeavour
   190   191   192   193   194   195   196   197   198   199   200