• The variation in data quality is enormous, with individual data quality evaluations on a
0–100% scale, ranging from 0 to 99%. Still, deeper analyses revealed no important industry
(e.g. health care, tech), data type (e.g. people data, customer data), organization size, or
public/private differences.
Thus, no sector, government agency, or department is immune to the ravages of extremely
poor data quality. And importantly, these results do not include issues such as duplicates,
inconsistencies between systems, and so forth. Further, they focus only on the most recent and
important data that is under an organization’s control. They do not include older or less‐used
data or data not under the organization's control. Thus, bad as they are, these results represent
an upper bound on the quality of the data you have to deal with in data science; the full picture
is likely even worse.
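One common way to produce a 0–100% evaluation like those above is to take a sample of recent,
important records and count the share that are completely error free. The sketch below, in
Python, shows the idea; the dataset, column names, and validation rules are made up for
illustration and are not taken from the study.

    import pandas as pd

    # Hypothetical sample of recent, important records.
    records = pd.DataFrame({
        "customer_id": [101, 102, None, 104],
        "email": ["a@x.com", "not-an-email", "c@x.com", "d@x.com"],
        "order_total": [250.0, 99.5, 30.0, -10.0],
    })

    # Hypothetical per-field checks; a record counts only if every field passes.
    checks = {
        "customer_id": lambda s: s.notna(),
        "email": lambda s: s.str.contains("@", na=False),
        "order_total": lambda s: s.notna() & s.ge(0),
    }

    record_ok = pd.Series(True, index=records.index)
    for column, check in checks.items():
        record_ok &= check(records[column])

    # Data quality score on a 0-100% scale: share of records with no errors at all.
    dq_score = 100 * record_ok.mean()
    print(f"Error-free records: {record_ok.sum()} of {len(records)}; score {dq_score:.0f}%")

With real data, the checks would come from the data's definitions and business rules, and the
score would be tracked over time rather than computed once.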
“The right data” standards are murkier.
After all, what is just right for one analysis may not suit another. Data scientists usually
consider anything they can get their hands on, but that may not be good enough, as reports on
bias in data used for facial recognition (Lohr 2018) and criminal justice (Tashea 2017) attest.
We don't know of any definitive study on data presentation either. Still, issues come up from
time to time. For example, handwritten notes and local acronyms may have complicated IBM's
efforts to apply AI (e.g. Watson) to cancer treatment (Ross 2017).
Importantly, increasingly complex problems demand not just more data but more diverse,
comprehensive data, and with this come more quality problems. For example, dealing with subtle
differences in the definitions of data from different sources is increasingly challenging.

Data Quality and the Internet of Things

Those who deal with automated measurement (e.g. the Internet of Things (IoT)) are sometimes
tempted to dismiss these results, thinking they stem from human error. Doing so is unwise.
First, while some errors are human related, most are not. Second, our experience, although
anecdotal, convinces us that automated measurement is no better, although the failure modes
may be different. For example, a meter in the electric grid may simply shut down, or sand may
clog an anemometer and cause intermittent failures. So until a specific device is proven
correct, you should assume it produces data of no higher quality.
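As a concrete illustration of this point, here is a minimal sketch of the kind of device-level
screening the sidebar implies, assuming a stream of timestamped readings from a single sensor.
The thresholds, field layout, and sample values are illustrative assumptions, not part of the
original text.

    from datetime import datetime, timedelta

    # Hypothetical readings from one anemometer: (timestamp, wind speed in m/s).
    readings = [
        (datetime(2024, 1, 1, 0, 0), 5.2),
        (datetime(2024, 1, 1, 0, 10), 5.2),
        (datetime(2024, 1, 1, 0, 20), 5.2),
        (datetime(2024, 1, 1, 2, 0), 0.0),   # long silence, then a zero reading
    ]

    MAX_GAP = timedelta(minutes=15)   # expected reporting interval (assumed)
    MAX_REPEATS = 3                   # identical values in a row -> possibly stuck

    issues = []
    repeats = 1
    for (t_prev, v_prev), (t_curr, v_curr) in zip(readings, readings[1:]):
        if t_curr - t_prev > MAX_GAP:
            issues.append(f"gap of {t_curr - t_prev} before {t_curr}: possible shutdown")
        repeats = repeats + 1 if v_curr == v_prev else 1
        if repeats >= MAX_REPEATS:
            issues.append(f"value {v_curr} repeated {repeats} times up to {t_curr}: possibly stuck")

    print("\n".join(issues) if issues else "no obvious device-level problems")

Checks like these do not prove a device correct, but they surface the shutdowns and stuck
sensors that would otherwise flow silently into an analysis.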
Of course, the caustic observation “garbage in, garbage out” has plagued analytics and
decision‐making for generations. The concern today is “big garbage in, big garbage out.”
Data scientists bear special responsibility here; after all, the caliber of your recommendations
depends on high‐quality data!
AI and some predictive analyses exacerbate our concerns. Bad data can rear its ugly head
twice – first in the historical data used in training a predictive model (PM) and second in the
new data used by that PM going forward. Consider an organization seeking productivity gains
with its machine learning efforts. Although the data science team that developed the PM may
have done a solid job cleansing the training data, the PM will still be compromised by bad data
going forward. Again, it takes people, lots of them, to find and correct the errors. This in turn
subverts the hoped-for productivity gains.
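One defensive pattern this suggests is to validate incoming records against expectations learned
from the (already cleansed) training data before the PM scores them, flagging anything outside
those expectations for human review. A minimal sketch, with hypothetical column names and
ranges:

    import pandas as pd

    def learn_expectations(train: pd.DataFrame) -> dict:
        """Record simple per-column expectations from the cleansed training data."""
        return {
            col: {"min": train[col].min(), "max": train[col].max()}
            for col in train.select_dtypes("number").columns
        }

    def screen_new_data(new: pd.DataFrame, expectations: dict) -> pd.Series:
        """Return a boolean mask of records that look consistent with the training data."""
        ok = pd.Series(True, index=new.index)
        for col, exp in expectations.items():
            ok &= new[col].between(exp["min"], exp["max"]) & new[col].notna()
        return ok

    # Hypothetical example: the training data was cleansed, the new data is not.
    train = pd.DataFrame({"age": [25, 40, 60], "income": [30_000, 55_000, 90_000]})
    new = pd.DataFrame({"age": [35, -1, 200], "income": [45_000, 50_000, None]})

    expectations = learn_expectations(train)
    ok = screen_new_data(new, expectations)
    print(f"{(~ok).sum()} of {len(new)} new records flagged for review before scoring")

Screening of this kind does not remove the need for people to find and correct errors, but it
makes the second exposure to bad data visible rather than silent.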
Finally, there is the possibility of “cascades.” A cascade occurs when a minor error in one
prediction or decision grows larger in subsequent steps. The financial crisis that started in late
2007 is one example. Erroneous data in mortgage applications led to incorrect predictions of