Page 201 - Data Science Algorithms in a Week
P. 201

Predictive Analytics using Genetic Programming            185

                       attributes X and Y. Therefore, GP can contribute not only to a complete solution but also
                       providing synthetic attributes.


                       Deciles

                          The historical data is randomly split in two groups: one to build the model and the
                       other to test and confirm the accuracy of the prediction model. The approach of using two
                       groups of data can be used in a variety of AI algorithms to find the best set of predictors.
                          The majority of the schemes in machine learning use the confusion matrix as a way
                       to measure the performance using the test data. The confusion matrix finds the number of
                       “individuals” for which the prediction was accurate. On the other hand, with the decile
                       table it’s possible to identify the specific individuals which have better performance. The
                       decile  tables  measures  the  accuracy  of  a  predictive model  versus  a  prediction  without
                       modeling (Ratner, 2011).
                          The decile table is use to score the test sample on a scale of 1 to 100 based upon the
                       characteristics identified by the algorithm, depending on the problem context. The list of
                       individuals  in  the  test  sample  is  then  rank  ordered  by  score  and  split  into  10  groups,
                       called deciles. The top 10 percent of scores was decile one, the next 10 percent was decile
                       two, and so forth. Decile separates and orders the individuals on an ordinal scale. Each
                       decile has a number of individuals; it is the 10% of the total size of the sample test. Then
                       the  actual  number  of  responses  in  each  decile  is  listed.  Then,  other  analysis  such  as
                       response  rate,  cumulative  response  rate,  and  predictability  (based  on  the  cumulative
                       response  rate)  can  be  performed.  The  performance  in  each  decile  can  be  used  as  an
                       objective function for machine learning algorithms.


                       Genetic Programming Software Environment

                          The  GenIQ  System  (Ratner,  2008;  2009),  based  on  GP,  is  utilized  to  provide
                       predictive models. GenIQ lets the data define the model, performs variable selection, and
                       then specifies the model equation.
                          The GenIQ System develops the model by performing generations of models so as to
                       optimize  the  decile  table.  As  explained  by  Ratner  [16]  “Operationally,  optimizing  the
                       decile  table  is  creating  the  best  possible  descending  ranking  of  the  target  variable
                       (outcome)  values.  Thus,  GenIQs  prediction  is  that  of  identifying  individuals,  who  are
                       most-likely  to  least-likely  to  respond  (for  a  binary  outcome),  or  who  contribute  large
                       profits to small profits (for a continuous outcome).”
                          We  decided to  use  a  file with  information  about  thermography  and  some  selected
                       flights from Atlantis, Discovery, and Endeavour from the different databases available in
   196   197   198   199   200   201   202   203   204   205   206