Sorry, but You Can’t Trust the Data
Once you’ve identified the data that you can trust or use with caution, integrate disparate
data sources. You need to do three things:
• Identification: verify that the Courtney Smith in one data set is the same Courtney Smith in
others.
• Alignment of units of measure and data definitions: make sure Courtney’s purchases
and prices paid, expressed in “pallets” and “dollars” in one set, are aligned with “units” and
“euros” in another.
• De‐duplication: check that the Courtney Smith record does not appear multiple times in
different ways (say, as C. Smith or Courtney E. Smith).
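The three integration steps can be sketched in code. This is a minimal illustration, not the authors' method; the record fields, the pallet-to-unit conversion factor, and the exchange rate are all assumptions made for the example.

```python
# Hypothetical sketch of the three integration steps: identification,
# alignment of units and definitions, and de-duplication.
UNITS_PER_PALLET = 40   # assumed conversion factor
USD_PER_EUR = 1.10      # assumed exchange rate

def identification_key(name):
    """Crude identification: first initial + last name, so
    'C. Smith' and 'Courtney E. Smith' match 'Courtney Smith'."""
    parts = name.lower().replace(".", "").split()
    return (parts[0][0], parts[-1])

def align_units(record):
    """Express all purchases in 'units' and all prices in 'euros'."""
    qty, unit = record["quantity"], record["unit"]
    price, cur = record["price"], record["currency"]
    if unit == "pallets":
        qty, unit = qty * UNITS_PER_PALLET, "units"
    if cur == "dollars":
        price, cur = round(price / USD_PER_EUR, 2), "euros"
    return {**record, "quantity": qty, "unit": unit,
            "price": price, "currency": cur}

def deduplicate(records):
    """Keep one aligned record per identification key."""
    seen = {}
    for rec in records:
        seen.setdefault(identification_key(rec["name"]), align_units(rec))
    return list(seen.values())

raw = [
    {"name": "Courtney Smith", "quantity": 2, "unit": "pallets",
     "price": 220.0, "currency": "dollars"},
    {"name": "C. Smith", "quantity": 80, "unit": "units",
     "price": 200.0, "currency": "euros"},
    {"name": "Courtney E. Smith", "quantity": 1, "unit": "pallets",
     "price": 110.0, "currency": "dollars"},
]
merged = deduplicate(raw)  # all three Courtney records collapse to one
```

In practice each step is far harder than this sketch suggests: identification usually requires fuzzy matching, and unit alignment requires agreed data definitions, not just conversion factors.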
At this point, you’re ready to do your analyses. When possible, conduct analyses using the
“trusted data” and the “use with caution data” in parallel. Pay particular attention when you
get different results based on “use with caution” and “trusted” data. Both great insights and
great traps lie here. When a result looks intriguing, isolate the data and repeat the steps above,
making more detailed measurements, scrubbing the data, and improving wash routines.
As you do so, develop a feel for how deeply you should trust this data. Please note that this
comparison is not possible if the only data you truly trust is the scrubbed 1,000‐record sample
and you’re using AI. One thousand trusted records is simply not enough to train a predictive model (PM).
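Running analyses on the two data grades in parallel can be as simple as computing the same statistic on each and flagging divergence. A minimal sketch, in which the field name, the sample values, and the 10% tolerance are all assumptions:

```python
# Illustrative parallel analysis on "trusted" vs "use with caution" data.
def mean_spend(records):
    return sum(r["spend"] for r in records) / len(records)

def compare(trusted, caution, rel_tol=0.10):
    """Return both results and whether they diverge beyond rel_tol."""
    t, c = mean_spend(trusted), mean_spend(caution)
    return t, c, abs(t - c) > rel_tol * abs(t)

trusted = [{"spend": 100}, {"spend": 110}, {"spend": 90}]
caution = [{"spend": 100}, {"spend": 180}, {"spend": 95}]

t, c, diverges = compare(trusted, caution)
if diverges:
    # A gap this large may hide a great insight or a great trap:
    # isolate the offending records and repeat the cleaning steps.
    print(f"investigate: trusted={t:.1f}, caution={c:.1f}")
```

The point of the flag is not to automate judgment but to tell you where to look: divergence marks exactly the spots where either insight or trap awaits.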
Maintain an audit trail, from original data, to steps you take to deal with quality issues, to
the data you use in final analyses. This is simply good practice, although we find that some
skip it.
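An audit trail need not be elaborate. One lightweight approach (an assumption for illustration, not the authors' tooling) is to wrap each cleaning step so that the original data, every transformation, and the final analysis input are all recorded:

```python
# Minimal audit-trail sketch: each cleaning step logs what it did.
audit_log = []

def step(name, func, data):
    """Apply a cleaning step and record its effect on row counts."""
    result = func(data)
    audit_log.append({"step": name, "rows_in": len(data),
                      "rows_out": len(result)})
    return result

original = [{"id": 1, "age": 34}, {"id": 2, "age": -5}, {"id": 1, "age": 34}]

deduped = step(
    "drop exact duplicates",
    lambda rows: [dict(t) for t in {tuple(sorted(r.items())) for r in rows}],
    original,
)
cleaned = step(
    "drop impossible ages",
    lambda rows: [r for r in rows if r["age"] >= 0],
    deduped,
)
# audit_log now shows the path from 3 original rows to 1 analysis row.
```

Keeping the untouched original alongside the log means any final result can be traced back through every quality decision that produced it.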
Understanding where you can trust the data allows you to push that data to its limits. Data
doesn’t have to be perfect to yield new insights, but you must exercise caution by understanding where the flaws lie, working around errors, cleaning them up, and backing off when
the data simply isn’t good enough.
No matter what, don’t get overconfident. Even with all these steps, the data will not be
perfect, as cleansing neither detects nor corrects all the errors. Finally, be transparent about the
weaknesses in the data and how these might impact your analyses.
Getting in Front of Tomorrow’s Data Quality Issues
Of course, today’s data quality problems are bad enough. Even worse is failing to take steps
so they don’t recur. If you’re experiencing a 20% error rate now, you can feel confident that
you will experience a 20% error rate in the future (within statistical limits, of course). And
growing data volumes mean more data errors, and more cleanup, in the future.
Worst of all, bad data affects everything your entire organization does. After all, much of the
data you use in analytics is used by others in basic operations. For example, an incorrect
address may slow one of your analyses, but it also means that someone’s package was not delivered on time, or at all. While there is considerable company‐to‐company variation, best estimates are that poor data quality costs a typical organization 20% of revenue (Redman 2017c).
The only solution is to find and eliminate root causes of error (Redman 2016). Leading
efforts to attack data quality across the entire company is beyond the scope of most data
scientists today. Still, data scientists may have the best, widest view of data quality issues.
But summarizing those issues, including their costs and what should be done about them, must be in the data scientists’ and CAOs’ wheelhouses!