
6

Sorry, but You Can't Trust the Data






             This chapter is about data quality, which is a plague on data scientists and CAOs (and practically
             everyone else for that matter). It takes up to 80% of data scientists’ time (Wilder‐James 2016)
             and is the problem they complain about most (Kaggle 2017). Worse, you can never be sure
             you’ve found all the errors and, worse still, the issues grow larger and more impactful with AI.
The data needed for data science can come from primary sources (created with similar objectives) or secondary sources (someone else collected it, often for a different objective, and processed it before it reaches you).
               While there is no panacea, we can:
             1.  help you understand the extent of the problem;
             2.  provide some structure for dealing with immediate issues;
3.  advise you to push your company to get in front of recurring data quality issues.

             Most Data Is Untrustworthy

             Without delving too deeply into details, to be judged of high quality, data must meet three
             distinct criteria (Redman 2016):

• It must be "right:" correct, properly labeled, de-duplicated, and so forth.
                • It must be “the right data:” unbiased, comprehensive, relevant to the task at hand.
                • It must be “(re)presented in the right way.” For example, people can’t read bar codes, locally
               used acronyms may confuse others, and so forth.

Regarding the first criterion, the most comprehensive study of data quality statistics that we
know of was conducted in Ireland in 2014–2016 (Nagle et al. 2017). It made use of the "Friday
             Afternoon Measurement” (summarized in the next section), focused on the most important
             and recent data, and concluded the following:

                • On average, 47% of newly created data records have at least one critical (e.g. work‐impacting) error.
                • Only 3% of the data quality evaluations can be rated “acceptable” using the loosest‐possible
               standard.
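The record-level tally behind statistics like the 47% figure can be sketched in a few lines. This is a minimal illustration, not the study's actual procedure: it assumes each reviewed record is represented as a list of per-attribute flags, where a truthy value means a reviewer marked that attribute as having a critical error.

```python
def error_record_rate(records):
    """Fraction of records with at least one flagged critical error.

    Each record is a sequence of per-attribute error flags
    (hypothetical layout; 1/True = reviewer flagged a critical error).
    """
    flagged = sum(1 for flags in records if any(flags))
    return flagged / len(records)

# Hypothetical reviewed sample of four records, three attributes each.
sample = [
    [0, 0, 0],  # clean record
    [1, 0, 0],  # one bad attribute -> whole record counts as erroneous
    [0, 0, 0],
    [0, 1, 1],
]
print(error_record_rate(sample))  # -> 0.5: half the records have an error
```

Note that a record counts as erroneous if *any* attribute is flagged, which is why record-level error rates (like the 47% found in the Irish study) run far higher than attribute-level rates.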


             The Real Work of Data Science: Turning Data into Information, Better Decisions, and Stronger Organizations,
             First Edition. Ron S. Kenett and Thomas C. Redman.
             © 2019 Ron S. Kenett and Thomas C. Redman. Published 2019 by John Wiley & Sons Ltd.
             Companion website: www.wiley.com/go/kenett-redman/datascience