
    b.  Experiments and Visits – The data science team must understand the different experiments and the data they produce. How do the experiments relate to each other? How were the equipment and surveys calibrated and designed? Who owns the data?
    c.  Organizational/Cultural/Political factors and the ecosystem – The problem's ecosystem must be investigated. Do the participants understand the goals, objectives, and procedures of the data science task? Is there an institutional culture of sharing ideas, information, and data? Is top management championing the data science team?
2.  Gather Information from current databases/files and servers/clusters: This step is very important. Complex problems in large or global organizations involve distributed databases, servers, and other repositories of data and information in different formats, on different computing/IT platforms, both structured and unstructured, and at varying levels of detail and accuracy.
3.  Develop a map of databases and clusters from the different points in the life-cycle: It is important to have a clear picture of the different variables, experiments, and data available. Such a map provides the flexibility to integrate different databases and clusters and to create new ones; a minimal sketch of one appears after this list. Enterprise data hubs and ontologies are very important (if the budget and sophistication of the project permit) for increasing agility, capacity planning, and interoperability.
4.  Develop a map of “models” (analytical and empirical) from the different points in the life-cycle: This step is usually omitted from the data science task (it was difficult to find an article on data mining/data science with this philosophy). Traditional data miners go directly to the database and start playing with the data and the variables. Not only are the results from experiments very important for the data mining task, but so are previously developed models based on statistics, non-statistical techniques, finite element analysis, simulations, and first-principle models. These models carry important information, and we must be able to explore their fusion with the predictive models to be developed by the data science task; a sketch of one such fusion scheme follows this list.
5.  Build databases from current ones (if required): Now that we know the goals/objectives of the different environments, we can create comprehensive databases with the relevant data and variables. Different procedures can be used to start preparing the data for the modeling efforts by the advanced analytics team (see the consolidation sketch after this list).
6.  Knowledge Discovery and Predictive Modeling: Develop the different models and discover relationships according to the goals/objectives of the data science task. It is important to explore the information fusion of the different models developed; a minimal modeling and fusion sketch appears after this list.
7.  Deployment of the models developed: This includes not only the development of a user interface but also the interpretation of the models’ answers in the corresponding technical language; a minimal sketch of such a wrapper follows this list. An integrity management plan must be implemented with the appropriate documentation.
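
The sketches below, in Python, illustrate how several of these steps might be supported in code. They are minimal examples under stated assumptions, not prescriptions from the methodology itself. The first corresponds to steps 2 and 3: a small catalog of data sources grouped by life-cycle stage. Every source name, platform, owner, and stage shown is a hypothetical placeholder to be replaced by the inventory gathered in step 2.

from dataclasses import dataclass
from collections import defaultdict

@dataclass
class DataSource:
    name: str             # table, file share, or cluster path
    platform: str         # e.g., "Oracle", "HDFS", "CSV on a file server"
    structure: str        # "structured", "semi-structured", or "unstructured"
    owner: str            # person or group responsible for the data
    lifecycle_stage: str  # point in the life-cycle the data covers

def build_map(sources):
    """Group sources by life-cycle stage so gaps and overlaps become visible."""
    stage_map = defaultdict(list)
    for source in sources:
        stage_map[source.lifecycle_stage].append(source)
    return stage_map

catalog = [
    DataSource("lab_experiments", "Oracle", "structured", "R&D", "design"),
    DataSource("field_surveys", "CSV on a file server", "semi-structured", "Operations", "service"),
    DataSource("inspection_reports", "SharePoint", "unstructured", "Quality", "service"),
]
for stage, sources in build_map(catalog).items():
    print(stage, "->", [s.name for s in sources])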
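
For step 4, one way to fuse a previously developed analytical or first-principle model with a data-driven model is residual (hybrid) modeling: the empirical model learns only what the analytical model cannot explain. The minimal sketch below assumes scikit-learn is available; the placeholder physics function and the synthetic data are illustrative assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def first_principles_model(X):
    """Stand-in analytical/first-principles prediction (placeholder physics)."""
    return 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = first_principles_model(X) + 0.3 * np.sin(5.0 * X[:, 2]) + rng.normal(0.0, 0.05, 500)

# Fit the empirical model on the residual the analytical model cannot explain.
residual_model = GradientBoostingRegressor(random_state=0)
residual_model.fit(X, y - first_principles_model(X))

def fused_predict(X_new):
    """Fused prediction = analytical baseline + learned residual correction."""
    return first_principles_model(X_new) + residual_model.predict(X_new)

print("fused prediction for one new point:", fused_predict(X[:1]))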
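
For step 5, a consolidated analysis table can be built by joining the relevant variables from the mapped sources and applying basic preparation. The sketch below uses pandas and SQLite purely as an example; the table names, columns, and unit conversion are invented.

import sqlite3
import pandas as pd

experiments = pd.DataFrame({"unit_id": [1, 2, 3], "temperature_K": [310.0, 305.5, 298.2]})
inspections = pd.DataFrame({"unit_id": [1, 2, 3], "defect_found": [0, 1, 0]})

# Join on the shared key and apply simple preparation (units, derived variables, ...).
analysis = experiments.merge(inspections, on="unit_id", how="inner")
analysis["temperature_C"] = analysis["temperature_K"] - 273.15

# Persist the consolidated table so the modeling team works from a single source.
with sqlite3.connect("analysis.db") as conn:
    analysis.to_sql("analysis_table", conn, if_exists="replace", index=False)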
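
For step 6, the following minimal sketch develops and compares candidate predictive models with cross-validation and then applies a naive information fusion (averaging the models' predictions). Synthetic data stands in for the consolidated table of step 5; in practice the fusion scheme would be chosen to match the goals of the task.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the consolidated analysis table of step 5.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

models = {"ridge": Ridge(), "random_forest": RandomForestRegressor(random_state=0)}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {score:.3f}")

# Naive information fusion: average the predictions of the fitted models.
fitted = [model.fit(X, y) for model in models.values()]
fused = np.mean([model.predict(X[:5]) for model in fitted], axis=0)
print("fused predictions for the first five rows:", fused.round(1))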
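
For step 7, the deployed model can be wrapped so that its numeric output is delivered together with an interpretation in the users' technical language. The thresholds, wording, and toy classifier in the sketch below are illustrative assumptions; the real interpretation rules would come from domain experts and the integrity management plan.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the deployed predictive model.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

def interpret(probability_of_failure):
    """Translate a model score into the terminology the end users work with."""
    if probability_of_failure >= 0.7:
        return "High integrity risk: schedule an inspection within 30 days."
    if probability_of_failure >= 0.3:
        return "Moderate risk: include in the next planned maintenance window."
    return "Low risk: continue routine monitoring."

def predict_and_explain(features):
    """Deployment-facing entry point: prediction plus its interpretation."""
    p = float(model.predict_proba([features])[0, 1])
    return {"probability_of_failure": round(p, 3), "recommendation": interpret(p)}

print(predict_and_explain(X[0]))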