Predictive Analytics using Genetic Programming 185
attributes X and Y. Therefore, GP can contribute not only a complete solution but also
synthetic attributes.
Deciles
The historical data is randomly split into two groups: one to build the model and the
other to test and confirm the accuracy of the prediction model. This two-group approach
can be used with a variety of AI algorithms to find the best set of predictors.
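The random split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name split_train_test, the 30% test fraction, and the fixed seed are assumptions made for the example and do not come from the study.

```python
import random


def split_train_test(records, test_fraction=0.5, seed=0):
    """Randomly split records into a model-building group and a test group."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled records form the test group, the rest build the model.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: 100 record identifiers standing in for historical cases.
build, test = split_train_test(range(100), test_fraction=0.3)
print(len(build), len(test))
```

Every record lands in exactly one of the two groups, so the test group stays unseen while the model is built.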
Most machine learning schemes use the confusion matrix to measure performance on
the test data. The confusion matrix counts the “individuals” for which the prediction
was accurate. With the decile table, on the other hand, it is possible to identify the
specific individuals who have better performance. The decile table measures the
accuracy of a predictive model against a prediction made without modeling
(Ratner, 2011).
To build the decile table, the test sample is scored on a scale of 1 to 100 based upon
the characteristics identified by the algorithm, depending on the problem context. The
list of individuals in the test sample is then rank-ordered by score and split into 10
groups, called deciles. The top 10 percent of scores is decile one, the next 10 percent is
decile two, and so forth. The deciles separate and order the individuals on an ordinal
scale. Each decile contains 10% of the individuals in the test sample. The actual number
of responses in each decile is then listed. From these counts, other analyses such as
response rate, cumulative response rate, and predictability (based on the cumulative
response rate) can be performed. The performance in each decile can be used as an
objective function for machine learning algorithms.
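The construction of a decile table can be sketched as follows. This is an illustrative sketch, not the procedure used by any particular tool: the function name decile_table and the synthetic scores are assumptions, and the "cumulative lift" column (cumulative response rate divided by the overall response rate, times 100) is the usual convention for comparing a model against a prediction without modeling, which is assumed here to correspond to the predictability measure mentioned above.

```python
def decile_table(scores, responses):
    """Build a decile table from model scores and actual 0/1 responses."""
    if not scores or len(scores) != len(responses):
        raise ValueError("scores and responses must be non-empty and equal in length")
    # Rank individuals from the highest score to the lowest.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(responses) / n   # response rate without any model
    rows, cum_n, cum_resp = [], 0, 0
    for d in range(10):
        # Integer boundaries keep the ten groups as equal in size as possible.
        group = ranked[d * n // 10:(d + 1) * n // 10]
        size = len(group)
        resp = sum(r for _, r in group)
        cum_n += size
        cum_resp += resp
        cum_rate = cum_resp / cum_n if cum_n else 0.0
        rows.append({
            "decile": d + 1,
            "size": size,
            "responders": resp,
            "response_rate": resp / size if size else 0.0,
            "cum_response_rate": cum_rate,
            "cum_lift": 100 * cum_rate / overall_rate if overall_rate else 0.0,
        })
    return rows


# Synthetic test sample: scores 100..1, where only the 20 top-scored individuals respond.
scores = list(range(100, 0, -1))
responses = [1 if s > 80 else 0 for s in scores]
table = decile_table(scores, responses)
for row in table:
    print(row)
```

A cumulative lift above 100 in the top deciles indicates that the ranking concentrates responders better than random selection would; by the last decile the lift always returns to 100.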
Genetic Programming Software Environment
The GenIQ System (Ratner, 2008; 2009), based on GP, is used to provide
predictive models. GenIQ lets the data define the model, performs variable selection, and
then specifies the model equation.
The GenIQ System develops the model by evolving successive generations of models so
as to optimize the decile table. As explained by Ratner [16], “Operationally, optimizing the
decile table is creating the best possible descending ranking of the target variable
(outcome) values. Thus, GenIQ's prediction is that of identifying individuals, who are
most-likely to least-likely to respond (for a binary outcome), or who contribute large
profits to small profits (for a continuous outcome).”
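To illustrate what using the decile table as an objective means, the sketch below scores a few candidate model equations by how many responders they place in the top deciles. It is not GenIQ's actual fitness function, which is not described here: the measure top_decile_capture, the candidate formulas, and the synthetic data over two attributes x and y are all assumptions chosen for the example.

```python
def top_decile_capture(scores, responses, top_deciles=3):
    """Fraction of all responders that fall in the top-scoring deciles."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    cutoff = len(ranked) * top_deciles // 10
    total = sum(responses)
    if total == 0:
        return 0.0
    return sum(r for _, r in ranked[:cutoff]) / total


# Synthetic data: two attributes (x, y); an individual responds when x * y is large.
records = [(x, y) for x in range(1, 11) for y in range(1, 11)]
responses = [1 if x * y >= 50 else 0 for x, y in records]

# Hypothetical candidate equations, as a GP run might propose them.
candidates = {
    "x": lambda x, y: x,
    "y": lambda x, y: y,
    "x * y": lambda x, y: x * y,
}

# Fitness of each candidate: how well its descending ranking concentrates responders.
fitness = {
    name: top_decile_capture([f(x, y) for x, y in records], responses)
    for name, f in candidates.items()
}
best = max(fitness, key=fitness.get)
print(fitness, best)
```

In this toy setting the candidate that combines both attributes produces the best descending ranking, which is the sense in which a synthetic attribute can improve the decile table.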
We decided to use a file with information about thermography and some selected
flights from Atlantis, Discovery, and Endeavour from the different databases available in