

• The variation in data quality is enormous: individual data quality evaluations, on a 0–100% scale, range from 0 to 99%. Still, deeper analyses revealed no important differences by industry (e.g. health care, tech), data type (e.g. people data, customer data), organization size, or public/private status.

             Thus, no sector, government agency, or department is immune to the ravages of extremely
           poor data quality. And importantly, these results do not include issues such as duplicates,
           inconsistencies between systems, and so forth. Further, they focus only on the most recent and
           important data that is under an organization’s control. They do not include older or less‐used
data or data not under the organization's control. Thus, bad as they are, these results represent an upper bound on the quality of the data you will have to work with in data science.
Data Quality and the Internet of Things

Those who deal with automated measurement (e.g. the Internet of Things (IoT)) are sometimes tempted to dismiss these results, thinking they stem from human error. Doing so is unwise. First, while some errors are human related, most are not. Second, our experience, although anecdotal, convinces us that automated measurement is no better, although the failure modes may be different. For example, a meter in the electric grid may simply shut down, or sand may clog an anemometer and cause intermittent failures. So until a specific device is proven correct, you should assume it produces data of no higher quality.

"The right data" standards are murkier. After all, what is just right for one analysis may not suit another. Data scientists usually consider anything they can get their hands on, but that may not be good enough, as reports on bias in data used for facial recognition (Lohr 2018) and criminal justice (Tashea 2017) attest.

We don't know of any definitive study on data presentation either. Still, issues come up from time to time. For example, handwritten notes and local acronyms have complicated IBM's efforts to apply AI (e.g. Watson) to cancer treatment (Ross 2017).

Importantly, increasingly complex problems demand not just more data but more diverse, comprehensive data, and with this come more quality problems. For example, dealing with subtle differences in the definitions of data from different sources is increasingly challenging.
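To make the point concrete, here is a deliberately simple, hypothetical sketch (the sources, field names, and units are invented for illustration, not drawn from any study cited here). Two plants both report a field called "temperature," but one records degrees Celsius and the other degrees Fahrenheit. Merging the sources without reconciling the definitions yields a statistic that looks plausible and means nothing; harmonizing the definition first does not.

```python
# Hypothetical sketch of the "different definitions" problem: two sources share
# a field name but not a definition. Harmonize before merging.

def harmonize(record, source):
    """Convert each source's notion of 'temperature' to a common definition (Celsius)."""
    value = record["temperature"]
    if source == "plant_a":          # assumed to report Celsius already
        return {"temperature_c": value}
    elif source == "plant_b":        # assumed to report Fahrenheit
        return {"temperature_c": (value - 32) * 5.0 / 9.0}
    raise ValueError(f"Unknown source: {source}")

plant_a = [{"temperature": 21.5}, {"temperature": 22.0}]
plant_b = [{"temperature": 70.7}, {"temperature": 71.6}]

# Naive merge: mixes two definitions under one field name.
naive = [r["temperature"] for r in plant_a + plant_b]

# Harmonized merge: one definition throughout.
harmonized = [harmonize(r, "plant_a")["temperature_c"] for r in plant_a] + \
             [harmonize(r, "plant_b")["temperature_c"] for r in plant_b]

print(sum(naive) / len(naive))            # ~46.5 "degrees" -- meaningless
print(sum(harmonized) / len(harmonized))  # ~21.8 degrees C -- interpretable
```

Nothing in the naive merge fails loudly; the damage shows up only later, in a number that no one can interpret. That is what makes definitional differences so treacherous.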
             Of course, the caustic observation “garbage in, garbage out” has plagued analytics and
           decision‐making for generations. The concern today is “big garbage in, big garbage out.”
           Data scientists bear special responsibility here; after all, the caliber of your recommendations
           depends on high‐quality data!
             AI and some predictive analyses exacerbate our concerns. Bad data can rear its ugly head
           twice – first in the historical data used in training a predictive model (PM) and second in the
           new data used by that PM going forward. Consider an organization seeking productivity gains
           with its machine learning efforts. Although the data science team that developed the PM may
           have done a solid job cleansing the training data, the PM will still be compromised by bad data
going forward. Again, it takes people, lots of them, to find and correct the errors. This in turn subverts the hoped‐for productivity gains.
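The implication for practice can be sketched in a few lines of code. The following is a minimal, hypothetical illustration (the field names, rules, and toy model are invented for this sketch): the same validation rules used to cleanse the training data are reapplied to every record the deployed PM is asked to score, so bad data arriving in production is flagged for people to repair rather than silently turned into a bad prediction.

```python
# Hypothetical validation rules shared by training-time cleansing and
# prediction-time screening. Field names and thresholds are illustrative only.
VALIDATION_RULES = {
    "age":            lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "monthly_income": lambda v: isinstance(v, (int, float)) and v >= 0,
    "region":         lambda v: v in {"north", "south", "east", "west"},
}

def validate(record):
    """Return the list of fields that are missing or fail their rule."""
    return [field for field, rule in VALIDATION_RULES.items()
            if field not in record or not rule(record[field])]

def clean_training_data(rows):
    """At training time: keep only rows that pass every rule."""
    return [r for r in rows if not validate(r)]

def score(model, record):
    """At prediction time: reject bad records instead of scoring them."""
    problems = validate(record)
    if problems:
        # Route to a person or a repair process; do not let the model guess.
        return {"status": "rejected", "problems": problems}
    return {"status": "ok", "prediction": model(record)}

# Toy usage with a stand-in "model".
toy_model = lambda r: r["monthly_income"] * 0.01 + r["age"]
good = {"age": 42, "monthly_income": 5200, "region": "north"}
bad  = {"age": -3, "monthly_income": 5200, "region": "mars"}
print(score(toy_model, good))   # scored
print(score(toy_model, bad))    # rejected, with the failing fields listed
```

Even with such screening in place, someone still has to repair the rejected records, which is exactly the labor that eats into the hoped‐for productivity gains.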
             Finally, there is the possibility of “cascades.” A cascade occurs when a minor error in one
           prediction or decision grows larger in subsequent steps. The financial crisis that started in late
2007 is one example. Erroneous data in mortgage applications led to incorrect predictions of