Page 38 - The Real Work Of Data Science Turning Data Into Information, Better Decisions, And Stronger Organizations by Ron S. Kenett, Thomas C. Redman (z-lib.org)



Sorry, but You Can’t Trust the Data

This chapter is about data quality, which is a plague on data scientists and CAOs (and practically
everyone else, for that matter). Dealing with it takes up to 80% of data scientists’ time (Wilder‐James 2016),
and it is the problem they complain about most (Kaggle 2017). Worse, you can never be sure
you’ve found all the errors and, worse still, the issues grow larger and more impactful with AI.
The data needed for data science can come from primary sources (created with similar objectives)
or secondary sources (someone else collected it, often for a different objective, and
processed it before it reaches you).
While there is no panacea, we can:
1. help you understand the extent of the problem;
2. provide some structure for dealing with immediate issues;
3. advise you to push your company to get in front of recurring data quality issues.

Most Data Is Untrustworthy

Without delving too deeply into details, to be judged of high quality, data must meet three
distinct criteria (Redman 2016):

• It must be “right:” correct, properly labeled, de‐duped, and so forth.
• It must be “the right data:” unbiased, comprehensive, relevant to the task at hand.
• It must be “(re)presented in the right way.” For example, people can’t read bar codes, locally
used acronyms may confuse others, and so forth.

Regarding the first criterion, the most comprehensive study of data quality statistics that we
know of was conducted in Ireland in 2014–2016 (Nagle et al. 2017). It made use of the “Friday
Afternoon Measurement” (summarized in the next section), focused on the most important
and recent data, and concluded the following:

• On average, 47% of newly created data records have at least one critical (e.g. work‐impacting) error.
• Only 3% of the data quality evaluations can be rated “acceptable” using the loosest‐possible
standard.
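
To make the first statistic concrete, here is a minimal sketch of how a record-level error rate of this kind could be computed: the share of records with at least one flagged error. It is not the Friday Afternoon Measurement itself, and the record layout, the field names, and the function name share_with_error are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical layout: each record maps a field name to True when a
# reviewer flagged that field as containing a critical error.
Record = Dict[str, bool]


def share_with_error(records: List[Record]) -> float:
    """Return the fraction of records that have at least one flagged field."""
    if not records:
        raise ValueError("need at least one record")
    # A record counts once, no matter how many of its fields are flagged.
    flawed = sum(1 for rec in records if any(rec.values()))
    return flawed / len(records)


# Made-up sample data, not taken from the study cited above.
sample = [
    {"name": False, "address": False, "amount": False},
    {"name": False, "address": True, "amount": False},
    {"name": True, "address": True, "amount": False},
    {"name": False, "address": False, "amount": False},
]

rate = share_with_error(sample)
print(f"{rate:.0%} of records have at least one flagged error")
```

Note that the rate counts records, not fields: a record with two flagged fields weighs the same as a record with one, which matches the "at least one critical error" wording above.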

The Real Work of Data Science: Turning Data into Information, Better Decisions, and Stronger Organizations,
First Edition. Ron S. Kenett and Thomas C. Redman.
© 2019 Ron S. Kenett and Thomas C. Redman. Published 2019 by John Wiley & Sons Ltd.
Companion website: www.wiley.com/go/kenett-redman/datascience
