The Real Work of Data Science: Turning Data into Information, Better Decisions, and Stronger Organizations, by Ron S. Kenett and Thomas C. Redman



             Even if time is short, scrub a small random sample (say, 1,000 records), making it as
           pristine as you possibly can. Your goal is to arrive at a sample of data you know you can
           trust. Employ all possible means of scrubbing and be ruthless! Eliminate erroneous data
           records and data elements that you cannot correct, and mark data as “uncertain” when
           applicable.
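
             As a minimal sketch of this step, assuming records are held as Python dictionaries,
           the sample can be drawn at random and then scrubbed. The function names, the "price"
           field, and the two checks (looks_wrong, is_uncertain) are hypothetical; in practice the
           scrubbing is largely manual and the checks depend on the data at hand.

```python
import random


def draw_scrub_sample(records, sample_size=1000, seed=42):
    """Draw a simple random sample of records to scrub by hand."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    records = list(records)
    k = min(sample_size, len(records))
    return rng.sample(records, k)


def scrub(sample, looks_wrong, is_uncertain):
    """Drop records that cannot be corrected and tag doubtful ones.

    looks_wrong and is_uncertain are caller-supplied checks (assumptions
    for this sketch, not rules taken from the text).
    """
    kept = []
    for record in sample:
        if looks_wrong(record):
            continue  # eliminate records that cannot be corrected
        tagged = dict(record)
        tagged["uncertain"] = bool(is_uncertain(record))
        kept.append(tagged)
    return kept


# Toy data: a negative price is treated as uncorrectable, a zero price as doubtful.
records = [{"id": i, "price": float(i % 50) - 1.0} for i in range(5000)]
sample = draw_scrub_sample(records)
clean = scrub(
    sample,
    looks_wrong=lambda r: r["price"] < 0,
    is_uncertain=lambda r: r["price"] == 0,
)
```

             Comparing len(clean) with len(sample) gives a rough sense of how much of the sample
           survived the scrub.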
             When you are done, take a hard look. When the scrubbing has gone really well (and
           you’ll know it if it has), you’ve created a data set that rates high on the trustworthy scale.
           It’s OK to move forward using this data. Sometimes the scrubbing is less satisfying. If
           you’ve done the best you can, but still feel uncertain, put this data in the “use with caution”
           category. If the scrubbing goes poorly (for example, too many prices just look wrong, and
           you can’t make corrections), you must rate this data, and all like it, as untrustworthy.
           The sample strongly suggests that none of the data should be used in any analyses going
           forward.
             After the initial scrub, move on to the second cleaning exercise: washing the remaining data
           that was not in the scrubbing sample. Because scrubbing can be a time‐consuming, manual
           process, the wash allows you to make corrections using more automatic processes. For
           example, one wash technique involves “imputing” missing values by statistical means
           (see Wikipedia 2018a). Multiple imputation is a statistical technique for analyzing incomplete
           data sets, that is, data sets for which some entries are missing. Application of the technique
           requires three steps: imputation, analysis, and pooling (Rubin 1987). An up‐to‐date account of
           multiple imputation, as well as code and examples using the mice package in R, can be found
           in van Buuren (2012). If the washing goes well, put this data into the “use with caution”
           category.
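
             The three steps can be sketched in a few lines of Python. This is an illustration
           under simplifying assumptions, not the method of the mice package: missing entries (None)
           are filled by random draws from the observed values, the analysis is just the sample
           mean, and the pooling step uses Rubin's rules (average the estimates, then add the
           within-imputation variance to (1 + 1/m) times the between-imputation variance). All
           function names are hypothetical.

```python
import random
import statistics


def impute_once(values, rng):
    """Step 1, imputation: fill each missing entry with a random observed value."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


def analyze(completed):
    """Step 2, analysis: estimate the mean and the variance of that estimate."""
    n = len(completed)
    return statistics.mean(completed), statistics.variance(completed) / n


def pool(estimates, variances):
    """Step 3, pooling: combine the m analyses with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # variance across the m estimates
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


def multiple_imputation(values, m=5, seed=0):
    """Run imputation, analysis, and pooling on m completed copies of the data."""
    rng = random.Random(seed)
    results = [analyze(impute_once(values, rng)) for _ in range(m)]
    estimates, variances = zip(*results)
    return pool(estimates, variances)


prices = [10.0, 12.5, None, 11.0, None, 9.5, 13.0, 10.5]
pooled_mean, total_variance = multiple_imputation(prices)
```

             The between-imputation term is what distinguishes this from filling in each gap
           once: it carries the extra uncertainty due to the missing entries into the final
           variance.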
             The following flow chart (Figure 6.1) summarizes this logic.





           Raw data
             Was the data created in accordance with a first-rate DQ program?
               Yes: Trust this data.
               No:  Can you identify data of high quality through your research?
                 No:  Do not trust this data.
                 Yes: “Scrub” a small sample. Did the “scrubbing” go well?
                   Yes:      Trust this data.
                   Somewhat: Use this data with caution.
                   No:       Do not trust this data.
                 Then “wash” the remaining data using automated techniques.
                 Did the “washing” go well?
                   Yes: Use this data with caution.
                   No:  Do not trust this data.

                     Figure 6.1  Process for evaluating data’s trustworthiness. DQ: data quality.
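
             The logic of Figure 6.1 can also be written as a small function. This is a sketch of
           one reading of the figure and the surrounding text: the function name, argument names,
           and the "well" / "somewhat" / "poorly" labels are hypothetical, and the rule that a
           poor scrub condemns the remaining data as well follows the statement that such data,
           "and all like it," must be rated untrustworthy.

```python
TRUST = "trust this data"
CAUTION = "use this data with caution"
DO_NOT_TRUST = "do not trust this data"


def rate_data(first_rate_dq_program, high_quality_identified=False,
              scrub_result=None, washing_went_well=False):
    """Return (sample_rating, remainder_rating) for a data set.

    scrub_result is one of "well", "somewhat", or "poorly" (labels assumed
    for this sketch); any other value is treated as a poor scrub.
    """
    if first_rate_dq_program:
        return TRUST, TRUST
    if not high_quality_identified:
        return DO_NOT_TRUST, DO_NOT_TRUST

    # Rate the hand-scrubbed sample.
    if scrub_result == "well":
        sample_rating = TRUST
    elif scrub_result == "somewhat":
        sample_rating = CAUTION
    else:
        # A poor scrub suggests none of the data should be used.
        return DO_NOT_TRUST, DO_NOT_TRUST

    # Washed data never rates higher than "use with caution".
    remainder_rating = CAUTION if washing_went_well else DO_NOT_TRUST
    return sample_rating, remainder_rating


sample_rating, remainder_rating = rate_data(
    first_rate_dq_program=False,
    high_quality_identified=True,
    scrub_result="well",
    washing_went_well=True,
)
```

             Returning separate ratings for the scrubbed sample and the washed remainder reflects
           that the two portions of the data earn different levels of trust.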