The Real Work of Data Science: Turning Data Into Information, Better Decisions, and Stronger Organizations, by Ron S. Kenett and Thomas C. Redman
The Difference Between a Good Data Scientist and a Great One
It rarely works out that way. As Jeff Hooper, of Bell Labs, liked to say, “Data do not give
up their secrets easily. They must be tortured to confess.”
This is a really big deal. Even under the best of circumstances, too much data is poorly
defined and simply wrong, and most turns out to be irrelevant to the problem at hand.
Sifting through this noisy data is arduous, frustrating work. Even good data scientists may
move on to the next problem. Great data scientists stick with it.
Great data scientists also persist in making themselves heard. Dealing with a recalcitrant
bureaucracy can be even more frustrating than dealing with noisy data. Continuing
the vignette from above, the intern spent his summer defending his discovery. Whichever
group made the error took great offense, even attacking him personally. Others reacted
with glee as they celebrated the ignorance of their peers. And he was caught in the
middle. Great data scientists know how to handle such situations, persisting through
thick and thin.
4. Finally, they have raw statistical muscle. The abilities to access and analyze data using all
the newest tools (including classic packages and newer ones such as machine learning)
are obviously important. But these can be learned – of bigger concern is the ability to bring
statistical rigor to bear. At the risk of oversimplifying, there are two kinds of analyses –
descriptive and predictive. Descriptive analyses are tough enough. But the really profitable
analyses involve prediction, which is inherently uncertain (Shmueli 2010).
Great data scientists embrace uncertainty. They recognize when a prediction rests on
solid foundations and when it is merely wishful thinking. They are simply outstanding in
describing what has to go right for the prediction to hold, what will really foul it up, and
what are the unknowns that keep them awake at night. When they can, they quantify the
uncertainty, and they are good at suggesting simple experiments to confirm or deny
assumptions, reduce uncertainty, explore the next set of questions, etc.
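
One common way to quantify the uncertainty around an estimate is a percentile bootstrap interval. The sketch below is only an illustration of that idea, not a method taken from this chapter: the function name `bootstrap_interval`, the 90% level, and the weekly demand figures are all invented for the example.

```python
import random
import statistics

def bootstrap_interval(sample, stat=statistics.mean, n_resamples=2000,
                       level=0.90, seed=0):
    """Percentile bootstrap interval for `stat` computed on `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical weekly demand figures, invented for illustration only.
weekly_demand = [102, 98, 110, 95, 130, 101, 99, 107, 93, 104]
point = statistics.mean(weekly_demand)
low, high = bootstrap_interval(weekly_demand)
print(f"point estimate {point:.1f}, 90% interval [{low:.1f}, {high:.1f}]")
```

Reporting the interval alongside the point estimate makes it harder to mistake a shaky prediction for a solid one. The interval only reflects sampling variability in the data at hand; it says nothing about assumptions that may fail in the future.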
To say this in a different way, there are some who opine that, for big data, it is enough to
understand “correlation” without getting into the complexities of “causation.” There are
surely some problems for which this is true. But not the really important ones! Understanding
causation leads to better predictions. The great data scientists will work to establish the
causative links.
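
A small simulation can make the gap between correlation and causation concrete. In the sketch below, which uses entirely simulated data and hypothetical variable names, a hidden factor `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The two still appear strongly correlated until the data are split by `z`.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
# Simulated data: z drives both x and y; x has no effect on y.
z = [i % 2 for i in range(2000)]
x = [3.0 * zi + rng.gauss(0.0, 1.0) for zi in z]
y = [3.0 * zi + rng.gauss(0.0, 1.0) for zi in z]

overall = pearson(x, y)
# Correlation within each level of z, where the common cause is held fixed.
within = [
    pearson([xi for xi, zi in zip(x, z) if zi == g],
            [yi for yi, zi in zip(y, z) if zi == g])
    for g in (0, 1)
]
print(f"overall r = {overall:.2f}")
print("within-group r =", [round(r, 2) for r in within])
```

A model built on the overall correlation would predict well only as long as `z` keeps behaving as it did in the sample. Intervening on `x` would not move `y`, which is exactly the kind of failure that an understanding of the causal structure helps to anticipate.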
This requires them to generalize on a higher level. Focusing only on the data at hand can
lead to “overfitting,” yielding models that are too complex for future use. Scientific
generalization invokes domain‐specific knowledge, general principles, and intuition, far
beyond cross‐validations or comparison of training‐set and hold‐out‐set results (Kenett and
Shmueli 2016a).
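
The training-set versus hold-out-set comparison mentioned above can be sketched in a few lines. The example is a simplified illustration under stated assumptions: the data are simulated from an assumed straight line plus noise, and a nearest-neighbor predictor stands in for "a model." With `k=1` the predictor memorizes the training data, so its training error is exactly zero, yet that tells us nothing about how it does on new points.

```python
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN predictor on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(1)

def noisy_line(xs):
    # Assumed ground truth for the illustration: y = 2x plus Gaussian noise.
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

train = noisy_line([i / 40 for i in range(40)])
holdout = noisy_line([(i + 0.5) / 40 for i in range(40)])

for k in (1, 9):
    print(f"k={k}: train MSE {mse(train, train, k):.3f}, "
          f"hold-out MSE {mse(train, holdout, k):.3f}")
```

A gap between training and hold-out error is a warning sign of overfitting. As the passage stresses, passing this check is only a starting point: the hold-out set here comes from the same simulated process as the training set, so it cannot reveal whether the model will generalize to conditions that differ from those sampled.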
To be clear, this ability is not “that certain quantitative knack.” It is trained, sophisticated,
disciplined inferential horsepower, practiced and honed by both success and failure.
Some of this is covered in data science curricula (De Veaux et al. 2017; Coleman and
Kenett 2017). Most is not.
Implications
To conclude, the real work of a data scientist is to continually become more effective. You
probably cannot teach yourself “that certain quantitative knack.” But you can work to develop
outside interests, read extensively, build a wider, more diverse network, develop a thick skin,
and study statistical inference. You should start doing so immediately.