b. Experiments and Visits – The data science team must understand the different experiments and the data. How do they relate to each other? How were the equipment and surveys calibrated and designed? Who owns the data?
c. Organizational/Cultural/Political and the ecosystem – The problem
ecosystem must be investigated. Do the participants understand the
goals/objectives and procedures of the data science task? Is there an
institutional culture of sharing ideas, information, and data? Is top
management championing the data science team?
2. Gather Information from current databases/files and servers/clusters: This step is very important. Complex problems in large/global organizations involve distributed databases, servers, and other repositories of data and information in different formats, on different computing/IT platforms, structured and unstructured, and at varying levels of detail and accuracy.
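As a minimal illustration of this gathering step, the sketch below (in Python with pandas; the file names, database, and table are hypothetical placeholders) pulls records from a CSV export, an Excel survey file, and a SQL table into a common in-memory form, then inventories what each repository actually contains.

    import pandas as pd
    import sqlite3

    # Hypothetical sources: a CSV export, an Excel survey file, and a SQL table.
    sensors = pd.read_csv("plant_sensors.csv")          # structured flat file
    surveys = pd.read_excel("field_surveys.xlsx")       # semi-structured spreadsheet
    with sqlite3.connect("operations.db") as conn:      # relational repository
        logs = pd.read_sql_query("SELECT * FROM maintenance_log", conn)

    # A first inventory: size and columns of every repository.
    for name, frame in [("sensors", sensors), ("surveys", surveys), ("logs", logs)]:
        print(name, frame.shape, list(frame.columns))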
3. Develop map of databases and clusters from the different points in the life-cycle: It is important to have a clear picture of the different variables, experiments, and data available. Such a map provides the flexibility to integrate different databases and clusters and to create new ones. Enterprise data hubs and ontologies (if the budget and sophistication of the project permit) are very important for increasing agility, capacity planning, and interoperability.
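Short of a full enterprise data hub, such a map can begin as a simple machine-readable catalog. The sketch below (Python; every repository name and field is an invented placeholder) records owner, platform, format, shared keys, and life-cycle stage so that repositories can later be located and joined.

    # A minimal machine-readable map of repositories across the life-cycle.
    data_map = [
        {"name": "plant_sensors.csv", "owner": "Process Engineering",
         "platform": "file share", "format": "CSV",
         "lifecycle_stage": "operation", "keys": ["unit_id", "timestamp"]},
        {"name": "maintenance_log", "owner": "Maintenance",
         "platform": "SQL server", "format": "relational",
         "lifecycle_stage": "maintenance", "keys": ["unit_id", "work_order"]},
    ]

    def repositories_for(stage):
        """Return every known repository for a given life-cycle stage."""
        return [d for d in data_map if d["lifecycle_stage"] == stage]

    print(repositories_for("operation"))

The shared keys recorded in the map (here unit_id) are what later make it practical to integrate databases and create new ones.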
4. Develop map of “models” (analytical and empirical) from the different points in the life-cycle: This step is usually forgotten in the data science task (it was difficult to find an article on data mining/data science with this philosophy). Traditional data miners go directly to the database and start playing with the data and the variables. However, not only are the results from experiments very important for the data mining task, but so are previously developed models based on statistics, non-statistical techniques, finite element analysis, simulations, and first-principles models. These models contain important information, and we must be able to explore their fusion with the predictive models to be developed by the data science task.
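One common way to realize such fusion, offered here only as a sketch and not as the chapter's prescribed method, is residual (hybrid) modeling: let the existing first-principles model make its prediction and train a data-driven model on what it misses. In the Python example below, physics_model and the synthetic data are stand-ins for a legacy analytical model and real measurements.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def physics_model(X):
        # Stand-in for a legacy first-principles model (e.g., an energy balance).
        return 2.0 * X[:, 0] + 0.5 * X[:, 1]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = physics_model(X) + np.sin(X[:, 2]) + rng.normal(0.0, 0.1, 500)  # synthetic target

    # Fit a data-driven model on the residuals the analytical model cannot explain.
    residuals = y - physics_model(X)
    correction = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, residuals)

    # Fused prediction = analytical model + learned correction.
    y_hat = physics_model(X) + correction.predict(X)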
5. Build databases from current ones (if required): Now that we know the goals/objectives of the different environments, we can create comprehensive databases with the relevant data and variables. Different procedures can be used to start preparing the data for the modeling efforts of the advanced analytics team.
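As an illustration of that preparation, the sketch below (Python/pandas; the columns and values are hypothetical) joins extracts from two repositories on their shared key, unifies units, aggregates to one row per unit, and imputes a missing value before modeling.

    import pandas as pd

    # Hypothetical extracts from two existing repositories.
    sensors = pd.DataFrame({"unit_id": [1, 1, 2], "temp_F": [212.0, 208.4, None]})
    logs = pd.DataFrame({"unit_id": [1, 2], "failures": [0, 3]})

    prepared = (
        sensors
        .assign(temp_C=lambda df: (df["temp_F"] - 32.0) * 5.0 / 9.0)  # unify units
        .drop(columns="temp_F")
        .groupby("unit_id", as_index=False).mean()    # one row per unit
        .merge(logs, on="unit_id", how="left")        # integrate the two sources
    )
    prepared["temp_C"] = prepared["temp_C"].fillna(prepared["temp_C"].mean())  # impute
    print(prepared)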
6. Knowledge Discovery and Predictive Modeling: Develop the different models and discover relationships according to the goals/objectives of the data science task. It is important to explore the information fusion of the different models developed.
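A minimal sketch of this step, assuming scikit-learn and using synthetic data in place of the databases built in step 5: train two different model families and fuse their outputs, here by simple averaging, so the fusion idea from step 4 carries into the modeling itself.

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Ridge
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error

    # Synthetic stand-in for the prepared database of step 5.
    X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    linear = Ridge().fit(X_tr, y_tr)
    boosted = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

    # Information fusion: average the two predictors' outputs.
    fused = (linear.predict(X_te) + boosted.predict(X_te)) / 2.0
    print("fused MAE:", mean_absolute_error(y_te, fused))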
7. Deployment of the models developed: This includes not only the development of a user interface but also the interpretation of the models’ answers in the corresponding technical language. An integrity management plan must be implemented with the appropriate documentation.
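In its simplest form, deployment can be sketched as persisting the trained model with its documentation and wrapping it in an interface that translates raw predictions into the users' technical language. The Python example below uses joblib for persistence; the stand-in model, threshold, and wording of the interpretation are assumptions for illustration, not the authors' design.

    import joblib
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    # Stand-in model; in practice, the model delivered by step 6.
    X, y = make_regression(n_samples=100, n_features=6, random_state=0)
    model = Ridge().fit(X, y)

    # Persist the model with integrity-management metadata and documentation.
    joblib.dump({"model": model,
                 "trained_on": "prepared database (illustrative)",
                 "documentation": "see integrity management plan"},
                "deployed_model.joblib")

    def predict_with_interpretation(features):
        """User-facing wrapper: raw score plus a domain-language reading."""
        bundle = joblib.load("deployed_model.joblib")
        score = float(bundle["model"].predict([features])[0])
        reading = "above nominal range" if score > 0.0 else "within nominal range"
        return {"score": score, "interpretation": reading}

    print(predict_with_interpretation([0.1] * 6))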