Particularly important is the continuing development of AI, machine learning, and deep learning. Neural networks, which rose to prominence in the 1980s, lie at the core. These highly parameterized models and algorithms, inspired by the architecture of the human brain, can often develop powerful predictive models when they are trained with enough high‐quality data.
We are also encouraged by recent technical advances in statistical learning. This area is motivated by the increasing ability to make large quantities of measurements automatically, producing “wide data sets”: even when the total volume of data is huge, the number of independent variables far exceeds the number of observations. In text analytics, for example, a document is represented by counts of the words in a dictionary. This leads to document‐term matrices with 20,000 columns, one for each distinct word in the vocabulary. Each document is represented by a row, and a cell contains the number of times a word appears in that document. Most entries are zero.
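To make the representation concrete, the following is a minimal sketch in Python. The three‐sentence corpus is invented purely for illustration; a real vocabulary would, of course, be far larger, and the matrix correspondingly sparser.

```python
# A minimal sketch of a document-term matrix, built from a tiny invented corpus
# (three short "documents") rather than a real 20,000-word vocabulary.
from collections import Counter

docs = [
    "the model predicts the outcome",
    "data quality drives the model",
    "quality data and good decisions",
]

# One column per distinct word across the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document; each cell counts how often that word appears in the document.
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

zero_cells = sum(cell == 0 for row in matrix for cell in row)
print(vocabulary)
for row in matrix:
    print(row)
print(f"{zero_cells} of {len(docs) * len(vocabulary)} entries are zero")
```

Even with this toy corpus, most entries are zero; with thousands of columns, the sparsity becomes extreme.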
In many studies, data scientists have hundreds of independent variables that can be used
as predictors in a regression model (say). It is likely that a subset will do the job well,
whereas including all independent variables will degrade the model. Hence, identifying a
good subset of variables is essential. Statistical learning, advanced by Efron, Friedman,
Hastie, Tibshirani, and others, has made it possible to handle such data sets. Efron and
Hastie (2016) provide a beautiful account of computer‐age statistical inference from past,
present, and future perspectives. Statistical learning describes a wide range of computer‐
intensive data analytic algorithms now on the data scientist’s workbench. Short descriptions
of some specific methods follow.
First is the least absolute shrinkage and selection operator (LASSO). It is a regression‐based
method that performs both variable selection and regularization to enhance the prediction
accuracy and interpretability of the statistical model. Other approaches based on decision trees
are random forests and boosting. Decision trees create a model that predicts the value of a target
variable based on several input variables. Trees are “learned” by splitting the data into subsets
based on an input variable. This is repeated on each derived subset in a recursive manner called
“recursive partitioning.” The recursion is complete when the subset at a node has all the same
value of the target variable or when splitting no longer adds value to the predictions. In random forests, one grows many decision trees on randomized versions of the training data and averages their predictions. In boosting, one repeatedly grows trees, building up an additive model consisting of a sum of trees. For more, see Hastie et al. (2009) and James et al. (2013). Further, for a discussion of current trends in data science, see https://mathesia.com/home/Mathesia_Outlook_2019.pdf.
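As a brief, hedged illustration of these methods, the sketch below fits a LASSO regression, a random forest, and a boosted ensemble of trees to the same synthetic data using scikit‐learn. The data set, its dimensions, and all tuning parameters are illustrative assumptions, not recommendations drawn from the text.

```python
# A minimal sketch, assuming scikit-learn and a synthetic regression problem
# with many candidate predictors, only a few of which carry signal.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Wide-ish data: 200 observations, 100 candidate predictors, 10 informative ones.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LASSO: shrinks coefficients toward zero; the surviving nonzero coefficients
# identify a subset of variables, performing selection and regularization at once.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("LASSO keeps", np.sum(lasso.coef_ != 0), "of 100 predictors;",
      "test R^2 =", round(lasso.score(X_test, y_test), 3))

# Random forest: many trees grown on randomized (bootstrapped) versions of the
# training data, with predictions averaged across trees.
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
print("random forest test R^2 =", round(forest.score(X_test, y_test), 3))

# Boosting: trees are grown repeatedly, each one added to a running sum that
# corrects the errors of the trees fitted before it.
boost = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                  random_state=0).fit(X_train, y_train)
print("boosting test R^2 =", round(boost.score(X_test, y_test), 3))
```

The relative performance of the three methods on real problems depends on the data; the point of the sketch is only to show how little code separates the data scientist from trying all of them.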
It is clear that the great data scientists we discussed in Chapter 2 have exciting futures as
they continually learn and leverage new statistical and other methods.