default rates. These, in turn, impacted the performance of securities (e.g. collateralized debt obligations) built on mortgages. And on and on. We are especially concerned that, as AI technologies penetrate organizations, the output of one model will feed the next, and the next, and so on, even crossing company boundaries.
It is tempting to ignore these issues, trust the data, and jump into your work. If the data falls
short, you can always go back and deal with quality issues. After all, “innocent until proven
guilty.”
But both the “facts” (i.e. quality is low) and the “consequences” (“garbage in, garbage out”)
advise against it. And decision‐makers are well aware of data quality issues. In a recent survey,
only 16% agreed that they trust the data (Harvard Business Review 2013). Therefore, we
recommend (and experienced data scientists know) that you adopt the position that “the data
is not to be trusted, until proven otherwise.”
Dealing with Immediate Issues
Not surprisingly, we recommend an all‐out attack on data quality from the very beginning. This section focuses on dealing with immediate issues; the next, on getting in front of them over the longer term.
The first step, as we discussed in detail in Chapter 5, is to visit the places where the data is created. There is so much going on that you can't understand it any other way.
Second, evaluate quality for yourself. If the data was created in accordance with a first‐rate data quality program, you can trust it. Such programs feature clear accountabilities for managers to create data correctly, controls on data inputs, and efforts to find and eliminate the root causes of error (Redman 2015). You won't have to opine on whether the data is good – data quality statistics will tell you. You'll find a human being who will be happy to explain what you may expect and answer your questions. If the data quality stats look good and the conversation goes well, trust the data. Please note that this is the "gold standard" against which the other steps below should be calibrated.
You should also develop your own data quality statistics using the "Friday Afternoon Measurement" (Redman 2016), the method employed in the study noted above. Briefly, you lay out 10 or
15 important data elements for 100 data records on a spreadsheet (best if you do so for the
100 most recently created records). If the new data involves customer purchases, such data
elements may include “customer name,” “purchased item,” and “price.” Then work record
by record, taking a hard look at each data element. The obvious errors will jump out at
you – customer names will be misspelled, the purchased item will be an item you don’t sell,
the price may be missing. Mark these obvious errors with a red pen. Then simply count up
the fraction of records with no errors. In many cases you’ll see quite a bit of red – don’t trust
this data! If you see only a little red – say, less than 5% of records with an obvious error – you
can use this data with caution.
Look, too, at patterns of the errors. If, for example, there are 25 total errors, 24 of which
occur in the price, eliminate that data element going forward. But if the rest of the data looks
pretty good, use it with caution.
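If the records live in a spreadsheet or table, the counting is easy to script. The sketch below, in Python with pandas, is one way to do it; the file name, column names, error checks, and valid_items catalog are all illustrative stand‐ins for your own data and red‐pen criteria.

```python
import pandas as pd

# Illustrative: the 100 most recently created purchase records.
df = pd.read_csv("recent_purchases.csv").tail(100)

# Stand-in checks for "obvious" errors -- the ones you would mark in red.
valid_items = {"widget", "gadget", "gizmo"}  # hypothetical product catalog
errors = pd.DataFrame({
    "customer_name": df["customer_name"].isna()
        | (df["customer_name"].str.strip() == ""),
    "purchased_item": ~df["purchased_item"].isin(valid_items),
    "price": df["price"].isna() | (df["price"] <= 0),
})

# The Friday Afternoon Measurement: fraction of records with no errors.
print(f"Error-free records: {(~errors.any(axis=1)).mean():.0%}")

# The pattern check: which data elements account for the errors?
print(errors.sum().sort_values(ascending=False))
```

By the rule of thumb above, 95% or more error‐free means you can use the data with caution, and a single element dominating the per‐column counts is a candidate for elimination.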
Third, work through the “rinse, wash, scrub” cycles. “Rinse” replaces obvious errors
with “missing value” or corrects them if doing so is very easy; “scrub” involves deep study,
even making corrections one at a time, by hand, if necessary; and “wash” occupies a middle
ground.
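The "wash" and "scrub" passes typically require human judgment, but "rinse" is often mechanical enough to script. Continuing the hypothetical purchase data above, a minimal sketch; which corrections count as "very easy" is, of course, a judgment call.

```python
import numpy as np
import pandas as pd

def rinse(purchases: pd.DataFrame, valid_items: set) -> pd.DataFrame:
    """'Rinse' pass: correct errors only where doing so is very easy;
    replace the remaining obvious errors with missing values, leaving
    them for the later "wash" and "scrub" passes."""
    df = purchases.copy()
    # Very easy correction: normalize case and whitespace before validating.
    df["purchased_item"] = df["purchased_item"].str.strip().str.lower()
    # Obvious errors we can't trivially fix become missing values.
    df.loc[~df["purchased_item"].isin(valid_items), "purchased_item"] = np.nan
    df.loc[df["price"] <= 0, "price"] = np.nan
    return df
```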