
Sorry, but You Can’t Trust the Data


Once you’ve identified the data that you can trust or use with caution, integrate disparate data sources. You need to do three things (see the sketch after this list):

• Identification: verify that the Courtney Smith in one data set is the same Courtney Smith in others.
• Alignment of units of measure and data definitions: make sure Courtney’s purchases and prices paid, expressed in “pallets” and “dollars” in one set, are aligned with “units” and “euros” in another.
• De-duplication: check that the Courtney Smith record does not appear multiple times in different ways (say, as C. Smith or Courtney E. Smith).
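To make the three steps concrete, here is a minimal pandas sketch. Everything in it is an illustrative assumption: the column names, the units-per-pallet conversion, the exchange rate, and the alias table. Real identification usually requires fuzzy matching or shared identifiers rather than the simple name normalization shown here.

```python
import pandas as pd

# Hypothetical purchase records from two source systems; all names,
# factors, and rates below are assumptions for illustration.
a = pd.DataFrame({"customer_name": ["Courtney Smith"], "quantity": [3],
                  "unit": ["pallet"], "price": [1200.0], "currency": ["USD"]})
b = pd.DataFrame({"customer_name": ["C. Smith"], "quantity": [150],
                  "unit": ["unit"], "price": [980.0], "currency": ["EUR"]})

UNITS_PER_PALLET = 48   # assumed conversion factor
EUR_TO_USD = 1.08       # assumed exchange rate
ALIASES = {"c smith": "courtney smith",
           "courtney e smith": "courtney smith"}  # curated variant table

def integrate(df):
    df = df.copy()
    # 1. Identification: normalize names, then map known variants to one key.
    key = (df["customer_name"].str.lower()
           .str.replace(r"[^a-z ]", "", regex=True).str.strip())
    df["customer_key"] = key.replace(ALIASES)
    # 2. Alignment: express every record in "units" and US dollars.
    pallets = df["unit"].eq("pallet")
    df.loc[pallets, "quantity"] = df.loc[pallets, "quantity"] * UNITS_PER_PALLET
    df["unit"] = "unit"
    eur = df["currency"].eq("EUR")
    df.loc[eur, "price"] = df.loc[eur, "price"] * EUR_TO_USD
    df["currency"] = "USD"
    return df

combined = pd.concat([integrate(a), integrate(b)], ignore_index=True)

# 3. De-duplication: drop rows that now resolve to the same customer record.
deduped = combined.drop_duplicates(subset=["customer_key", "quantity", "price"])
print(deduped)
```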

               At this point, you’re ready to do your analyses. When possible, conduct analyses using the
             “trusted data” and the “use with caution data” in parallel. Pay particular attention when you
             get different results based on “use with caution” and “trusted” data. Both great insights and
             great traps lie here. When a result looks intriguing, isolate the data and repeat the steps above,
             making more detailed measurements, scrubbing the data, and improving wash routines.
             As you do so, develop a feel for how deeply you should trust this data. Please note that this
             comparison is not possible if the only data you truly trust is the scrubbed 1,000‐record sample
             and you’re using AI. One thousand trusted records is simply not enough to train a PM.
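Running the analyses in parallel can be as simple as computing the same statistic on both slices and flagging divergence. The sketch below assumes a hypothetical `trust` tag on each record and uses synthetic data; the two-standard-error threshold is just one crude way to decide a gap is worth isolating.

```python
import numpy as np
import pandas as pd

# Synthetic data: a 'trust' column tags each record as vetted or not.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.normal(100, 15, 2000),
    "trust": rng.choice(["trusted", "caution"], size=2000, p=[0.3, 0.7]),
})

# Run the same analysis on both slices and compare side by side.
print(df.groupby("trust")["revenue"].agg(["count", "mean", "std"]))

# Flag divergence: a crude two-sample comparison of the means.
t = df.loc[df["trust"] == "trusted", "revenue"]
c = df.loc[df["trust"] == "caution", "revenue"]
gap = abs(t.mean() - c.mean())
se = (t.var() / len(t) + c.var() / len(c)) ** 0.5
if gap > 2 * se:  # roughly a 95% signal; tune to your tolerance
    print("Results diverge - isolate the data and repeat the quality checks.")
```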
               Maintain an audit trail, from original data, to steps you take to deal with quality issues, to
             the data you use in final analyses. This is simply good practice, although we find that some
             skip it.
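One lightweight way to keep that audit trail is sketched below. The `log_step` helper is invented for illustration: it stamps each cleaning step with a time, a row count, and a fingerprint of the intermediate data, so a final result can be traced back to the original extract.

```python
import datetime
import hashlib

import pandas as pd

audit_log = []

def log_step(df, description):
    """Record a timestamped entry tying one cleaning step to a data fingerprint."""
    digest = hashlib.sha256(
        pd.util.hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()[:12]
    audit_log.append({
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "step": description,
        "rows": len(df),
        "fingerprint": digest,
    })
    return df

# Usage: thread every transformation through the log, original data first.
raw = pd.DataFrame({"price": [10.0, None, 12.5]})
df = log_step(raw, "loaded original extract")
df = log_step(df.dropna(subset=["price"]), "dropped rows with missing price")
for entry in audit_log:
    print(entry)
```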
Understanding where you can trust the data allows you to push that data to its limits. Data doesn’t have to be perfect to yield new insights, but you must exercise caution by understanding where the flaws lie, working around errors, cleaning them up, and backing off when the data simply isn’t good enough.
No matter what, don’t get overconfident. Even with all these steps, the data will not be perfect, as cleansing neither detects nor corrects all the errors. Finally, be transparent about the weaknesses in the data and how these might impact your analyses.


             Getting in Front of Tomorrow’s Data Quality Issues
             Of course, today’s data quality problems are bad enough. Even worse is failing to take steps
             so they don’t recur. If you’re experiencing a 20% error rate now, you can feel confident that
             you will experience a 20% error rate in the future (within statistical limits, of course). And
             growing data volumes mean more data errors, and more cleanup, in the future.
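Those statistical limits are easy to make concrete with a binomial interval; the sample size below is assumed for illustration.

```python
import math

# If 200 of 1,000 audited records are flawed, the observed error rate is 20%.
# A normal-approximation 95% interval shows the range to expect next period,
# absent any improvement to the underlying process.
n, errors = 1000, 200
p = errors / n
se = math.sqrt(p * (1 - p) / n)
lo, hi = p - 1.96 * se, p + 1.96 * se
print(f"error rate {p:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
# -> error rate 20.0%, 95% CI [17.5%, 22.5%]
```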
Worst of all, bad data affects everything your entire organization does. After all, much of the data you use in analytics is also used by others in basic operations. For example, an incorrect address may slow one of your analyses, but it also means that someone’s package was not delivered on time, or at all. While there is considerable company-to-company variation, best estimates are that poor data quality costs a typical organization 20% of revenue (Redman 2017c).
The only solution is to find and eliminate the root causes of error (Redman 2016). Leading efforts to attack data quality across the entire company is beyond the scope of most data scientists’ roles today. Still, data scientists may have the best, widest view of data quality issues. Summarizing those issues, including their costs and what should be done about them, must be squarely in the data scientists’ and CAOs’ wheelhouses!
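As a sketch of what such a summary might look like, the snippet below assumes a hypothetical error log in which each defect has been traced to a root cause and given a rough cost; the Pareto-style ranking points improvement effort at the few causes responsible for most of the cost.

```python
import pandas as pd

# Hypothetical error log: one row per defect, traced to a root cause,
# with a rough rework-cost estimate in dollars.
errors = pd.DataFrame({
    "root_cause": ["manual entry", "manual entry", "stale reference table",
                   "interface mapping", "manual entry", "interface mapping"],
    "cost_usd": [40, 55, 300, 120, 35, 150],
})

# Pareto view: which few causes account for most of the cost?
summary = (errors.groupby("root_cause")["cost_usd"]
           .agg(count="count", total_cost="sum")
           .sort_values("total_cost", ascending=False))
summary["cost_share"] = summary["total_cost"] / summary["total_cost"].sum()
print(summary)
```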