Page 195 - Data Science Algorithms in a Week
P. 195
Predictive Analytics using Genetic Programming 179
GP assigns the computer the ability to develop computer programs/models
(represented as tree structures). GP is based on the “Darwinian ideas of survival of the
fittest" and the operators of crossover, mutation, and reproduction (Koza, 1994; Koza et
al., 2003). Figure 6 (modified from Koza (1994)) where Gen is the current generation. M
is the population size and i is the current individual in the population. The initial
population of size M is created and Gen is initialized to 0. As stated by Ratner (2008),
“The process begins with a fitness function and a set of user-selectable mathematical and
logical functions” from which a predictive model can be formed. “A first generation of as
many as 250 - 1000 models is randomly generated using the functions and variables
available; the fitness of each model is evaluated using collected data” (Ratner, 2008).
Each individual i of the population is evaluated from the objective function viewpoint
and their relative fitness calculated. Then, the highest values are compared with the
objectives of the project/session and it can be decided to stop if met, otherwise the
evolutionary process will continue.
The next generations of models are created following the processes of mating
(crossover), reproduction, and mutation. Crossover (as explained above) is when two
computer programs pair off. The resulting offspring are blends of the parents’ genetic
material. Reproduction is just the cloning of the best individuals that evolution should
maintain for the next generation. On the other hand, mutation in GP is considered as a
secondary operator. Piszcz & Soule (2005) have shown that mutation can improve
performance when combined with crossover. There are several mutation realizations in
GP environments. However, the most utilized ones are: node based mutation (i.e., a rate
specifies the node’s probability – Figure 7) and tree based mutation (i.e., a rate gives the
frequency that individuals are selected). When applying mutation in GP, “the mutation
rate is set to 1/C, where C denotes the size of the tree” (Piszcz & Soule, 2005). After a
number of generations, GP provides a predictive model adapted to the objective.
GP is considered a non-statistical methodology. Its major use is predictive modeling.
However, GP can also be used for knowledge discovery as explained by Koza et al.
(2003). Unfortunately, the current articles in predictive analytics mention genetic
algorithms but not GP.
CASE STUDY
NASA's Space Shuttle was the first orbital spacecraft that was a reusable launch
vehicle. At launch, it consisted of the following major systems: external tank, solid rocket
boosters, and orbiter (Figure 8). After 2003, there were three orbiters: Atlantis, Discovery
and Endeavour. Discovery completed its final mission on March 9, 2011 and Endeavour