Sorry, but You Can’t Trust the Data


default rates. These, in turn, impacted the performance of securities (e.g. collateralized debt obligations) built on mortgages. And on and on. We are especially concerned that, as AI technologies penetrate organizations, the output of one model will feed the next, and the next, and so on, even crossing company boundaries.
It is tempting to ignore these issues, trust the data, and jump into your work. If the data falls short, you can always go back and deal with quality issues. After all, “innocent until proven guilty.”
But both the “facts” (i.e. quality is low) and the “consequences” (“garbage in, garbage out”) advise against it. And decision-makers are well aware of data quality issues: in a recent survey, only 16% agreed that they trust the data (Harvard Business Review 2013). Therefore, we recommend (and experienced data scientists know) that you adopt the position that “the data is not to be trusted until proven otherwise.”

Dealing with Immediate Issues

Not surprisingly, we recommend an all-out attack on data quality from the very beginning. This section focuses on dealing with immediate issues, and the next on getting in front of them over the longer term.
The first step, as we discussed in detail in Chapter 5, is to visit the places where the data is created. There is so much going on that you can’t understand it any other way.
Second, evaluate quality for yourself. If the data was created in accordance with a first-rate data quality program, you can trust it. Such programs feature clear accountabilities for managers to create data correctly, input controls, and efforts to find and eliminate the root causes of error (Redman 2015). You won’t have to opine on whether the data is good – data quality statistics will tell you. You’ll find a human being who will be happy to explain what you may expect and answer your questions. If the data quality stats look good and the conversation goes well, trust the data. Please note that this is the “gold standard” against which the other steps below should be calibrated.
You should also develop your own data quality statistics, the “Friday Afternoon Measurement” (Redman 2016), as used in the study noted above. Briefly, you lay out 10 or 15 important data elements for 100 data records on a spreadsheet (best if you do so for the 100 most recently created records). If the new data involves customer purchases, such data elements may include “customer name,” “purchased item,” and “price.” Then work record by record, taking a hard look at each data element. The obvious errors will jump out at you – customer names will be misspelled, the purchased item will be an item you don’t sell, the price may be missing. Mark these obvious errors with a red pen. Then simply count up the fraction of records with no errors. In many cases you’ll see quite a bit of red – don’t trust this data! If you see only a little red – say, less than 5% of records with an obvious error – you can use this data with caution.
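The tally is easy to automate once the errors have been marked. Here is a minimal sketch in Python, assuming the records sit in a pandas DataFrame; the file name, column names, rule-based checks, and the KNOWN_ITEMS catalog are illustrative stand-ins for your own red-pen judgments, not part of the method itself.

```python
import pandas as pd

# Hypothetical input: the 100 most recently created purchase records,
# with 10-15 important data elements (only three are checked here).
records = pd.read_csv("recent_purchases.csv").head(100)  # assumed file name

KNOWN_ITEMS = {"widget", "gadget", "gizmo"}  # assumed catalog of items you actually sell

# Rule-based stand-ins for the red pen: each check returns True where the value looks OK.
checks = {
    "customer_name": lambda s: s.notna() & (s.str.strip() != ""),
    "purchased_item": lambda s: s.isin(KNOWN_ITEMS),
    "price": lambda s: pd.to_numeric(s, errors="coerce").gt(0),
}

# Flag every cell that fails its check.
flags = pd.DataFrame({col: ~ok(records[col]) for col, ok in checks.items()})

# Friday Afternoon Measurement: the fraction of records with no obvious error.
error_free = (~flags.any(axis=1)).mean()
print(f"{error_free:.0%} of records are free of obvious errors")
```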
Look, too, at patterns of the errors. If, for example, there are 25 total errors, 24 of which occur in the price, eliminate that data element going forward. But if the rest of the data looks pretty good, use it with caution.
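A per-field tally makes such patterns easy to spot. The sketch below fabricates a flag table matching the 24-of-25 example; in practice you would reuse the flags computed in the measurement above.

```python
import pandas as pd

# Illustrative flag table: True marks a cell caught by the red pen.
# It matches the example above: 24 of the 25 total errors fall in "price".
flags = pd.DataFrame({
    "customer_name": [False] * 100,
    "purchased_item": [True] + [False] * 99,
    "price": [True] * 24 + [False] * 76,
})

errors_per_field = flags.sum().sort_values(ascending=False)
print(errors_per_field)
# "price" dominates with 24 of 25 errors -> drop it going forward,
# and use the remaining fields with caution.
```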
Third, work through the “rinse, wash, scrub” cycles. “Rinse” replaces obvious errors with “missing value” or corrects them if doing so is very easy; “scrub” involves deep study, even making corrections one at a time, by hand, if necessary; and “wash” occupies a middle ground.
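As a concrete illustration of the “rinse” pass only, the sketch below replaces obvious errors with missing values using a few vectorized rules; the thresholds and the assumed product catalog are placeholders, and anything that needs real investigation is deferred to the “wash” and “scrub” passes.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase records containing a few obvious errors.
df = pd.DataFrame({
    "customer_name": ["Ann Lee", "   ", "Carla Diaz"],
    "purchased_item": ["widget", "widget", "flying car"],  # "flying car" is not something you sell
    "price": [19.99, -5.00, 12.50],                        # a negative price is impossible
})

KNOWN_ITEMS = {"widget", "gadget", "gizmo"}  # assumed catalog

# "Rinse": replace obvious errors with missing values, or fix them when trivial.
df.loc[df["price"] <= 0, "price"] = np.nan
df.loc[~df["purchased_item"].isin(KNOWN_ITEMS), "purchased_item"] = np.nan
df["customer_name"] = df["customer_name"].str.strip().replace("", np.nan)

# Anything that still looks wrong is left for the "wash" and "scrub" passes,
# which require study and, if necessary, record-by-record correction.
```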