The 9 Pitfalls of Data Science
Published by Oxford University Press (ISBN 9780198844396, 9780191879937)

Author(s): Gary Smith, Jay Cordes

Researchers seeking fame and funding may be tempted to go on fishing expeditions (p-hacking) or to torture the data to find novel, provocative results that will be picked up by the popular media. Provocative findings are provocative because they are novel and unexpected, and they are often novel and unexpected because they are simply not true. The publication effect (or file drawer effect) keeps the failures hidden and has created a replication crisis. Research that gets reported in the popular media is often wrong, which fools people and undermines the credibility of scientific research.
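
A minimal simulation sketches why fishing expeditions “work” (the number of tests, the sample sizes, and the significance threshold below are illustrative assumptions): run enough hypothesis tests on pure noise and some will clear the significance bar by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
significant = 0
for _ in range(100):
    # Two samples from the SAME distribution: any apparent effect is noise.
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        significant += 1

# Expect roughly 5 false positives. Publish those, leave the rest in the
# file drawer, and a replication crisis follows.
print(f"{significant} of 100 null comparisons were 'significant' at p < 0.05")
```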


Author(s): Gary Smith, Jay Cordes

Computer programs, particularly deep neural networks and Monte Carlo simulations, are extremely useful for the specific tasks they have been designed to do, and they will get even better, much better. However, we should not assume that computers are smarter than us just because they can tell us the first 2000 digits of pi or show us a street map of every city in the world. One of the paradoxical things about computers is that they can excel at things that humans consider difficult (like calculating square roots) while failing at things that humans consider easy (like recognizing stop signs). They can’t pass simple tests like the Winograd Schema Challenge because they do not understand the world the way humans do. They have neither common sense nor wisdom. They are our tools, not our masters.
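
As a hedged illustration of the kind of narrow task software excels at, here is a Monte Carlo estimate of pi (the sample size is an arbitrary assumption). It is fast and accurate, yet it reflects no understanding of what a circle is.

```python
import random

random.seed(42)
n = 1_000_000
# Points land uniformly in the unit square; the fraction falling inside
# the quarter circle of radius 1 approximates pi/4.
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
print(f"pi is approximately {4 * inside / n}")
```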


Author(s): Gary Smith, Jay Cordes

Data-mining tools tend to be mathematically sophisticated, yet they often make implausible assumptions. For example, analysts often assume a normal distribution and disregard the fat tails that warn of “black swans.” Too often, the assumptions are hidden in the math, and the people who use the tools are more impressed by the math than curious about the assumptions. Instead of being blinded by math, good data scientists use explanatory variables that make sense. Good data scientists use math, but do not worship it. They know that math is an invaluable tool, but it is not a substitute for common sense, wisdom, or expertise.
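
A small sketch makes the fat-tail point concrete (the 4-sigma threshold and the Student-t distribution with 3 degrees of freedom are illustrative choices, not anything from the book): a normal model says extreme days almost never happen, while a fat-tailed model with the same variance says they are routine.

```python
import numpy as np
from scipy import stats

# Probability of a move beyond 4 standard deviations, in either direction.
p_normal = 2 * stats.norm.sf(4)
# Student-t with df=3 has variance 3, so rescale to compare at unit variance.
p_fat = 2 * stats.t.sf(4 * np.sqrt(3), df=3)

print(f"normal model:     {p_normal:.1e}  (~1 day in {1 / p_normal:,.0f})")
print(f"fat-tailed model: {p_fat:.1e}  (~1 day in {1 / p_fat:,.0f})")
# Black swans live in the gap between those two numbers.
```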


Author(s): Gary Smith, Jay Cordes

We are predisposed to discount the role of luck in our lives, believing that successes are earned and failures deserved. We misinterpret the temporary as permanent and invent theories to explain noise. We overreact when the unexpected happens, and we are too quick to make the unexpected the new expected. The key to understanding regression toward the mean is to look behind the data and recognize that when we see something remarkable, luck was most likely involved, so the underlying phenomenon is not as remarkable as it seems. Regression toward the mean should not be confused with the gambler’s fallacy, the belief that good luck must be balanced by bad luck; it says only that extremely good luck is generally followed by less extreme luck. The Sports Illustrated jinx is nothing more than this. Whenever there is uncertainty, people often make flawed decisions because they do not sufficiently appreciate regression toward the mean.
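
A minimal simulation (all parameters are illustrative assumptions) shows the mechanism: performance is skill plus luck, and the top performers in one season are typically lucky as well as good, so they fall back the next season even though their skill is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
skill = rng.normal(0, 1, n)            # persistent ability
season1 = skill + rng.normal(0, 1, n)  # ability plus luck, year one
season2 = skill + rng.normal(0, 1, n)  # same ability, fresh luck

top = np.argsort(season1)[-50:]        # the 50 "cover story" performers
print(f"top-50 average, season 1: {season1[top].mean():.2f}")
print(f"same players, season 2:   {season2[top].mean():.2f}")
# Season 2 is still above average (skill persists) but well below season 1
# (the luck does not): the Sports Illustrated jinx in miniature.
```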


Author(s): Gary Smith, Jay Cordes

The traditional statistical analysis of data follows what has come to be known as the scientific method: collecting reliable data to test plausible theories. Data mining goes in the other direction, analyzing data without being motivated or encumbered by theories. The fundamental problem with data mining is simple: we think that data patterns are unusual and therefore meaningful, when patterns are in fact inevitable and therefore meaningless. This is why data mining is usually not knowledge discovery but noise discovery. Finding correlations is easy. Good data scientists are not seduced by discovered patterns because they do not put data before theory. They do not commit the Texas Sharpshooter Fallacy or fall into the Feynman Trap.
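
A short sketch shows just how easy finding correlations is (the counts are arbitrary assumptions): correlate one noise series against a thousand other noise series and report the best match.

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=50)
candidates = rng.normal(size=(1000, 50))

r = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.abs(r).argmax()
# A correlation this strong looks publishable, yet it is noise by
# construction: the sharpshooter painted the target after shooting.
print(f"best of 1,000 random 'predictors': r = {r[best]:.2f}")
```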


Author(s): Gary Smith, Jay Cordes

Good data scientists consider the reliability of the data, while data clowns don’t. Reported data sometimes systematically misrepresent the phenomena being recorded. Data can also be distorted by outliers: extremely unusual values that may be clerical errors, measurement errors, or flukes, and that can mislead us if not corrected. Other times, outliers are valuable data. We should always consider whether the data are skewed by unusual events or distorted by unreported “silent data.” If something is surprising about top-ranked groups, look at the bottom-ranked groups. Consider the possibility of survivorship bias and self-selection bias. Incomplete, inaccurate, or unreliable data can make clowns out of anyone.
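
A tiny example (with invented figures) shows how a single outlier, say a misplaced decimal point, can distort a summary statistic:

```python
import numpy as np

salaries = np.array([52, 48, 55, 50, 49, 51, 53, 47], dtype=float)  # $000s
print(f"clean data:   mean {salaries.mean():.1f}, median {np.median(salaries):.1f}")

with_typo = np.append(salaries, 510.0)  # 51.0 keyed in as 510
print(f"with outlier: mean {with_typo.mean():.1f}, median {np.median(with_typo):.1f}")
# The mean roughly doubles while the median barely moves. Investigate
# outliers before trusting a summary, but do not discard them blindly;
# sometimes the outlier is the story.
```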


Author(s): Gary Smith, Jay Cordes

In the 1970s, banks began selling mortgages to public and private mortgage funds that sold shares to investors. In the late 1990s and early 2000s, many mortgages to “subprime” borrowers with low credit ratings and modest incomes were approved because banks and mortgage brokers made money by making loans and then selling them, and didn’t care whether borrowers defaulted. Matters were complicated by financial engineering and compliant rating agencies. The Great Recession resulted from many people falling into several of the pitfalls of data science: they fooled themselves, they worshipped mathematics, they used bad data, they tortured data, and they did harm.


Author(s): Gary Smith, Jay Cordes

There is a hierarchy of predictive value that can be extracted from data. At the top of the hierarchy are causal relationships that can be confirmed with a randomized and controlled experiment or a natural experiment. Next best is to establish known or hypothesized relationships ahead of time and then test them and estimate their relative importance. One notch lower are associations found in historical data that are tested on fresh data after considering whether or not they make sense. At the bottom of the hierarchy, with little or no value, are associations found in historical data that are not confirmed by expert opinion or tested with fresh data. Data scientists who use a “correlations are enough” approach should remember that the more data and the more searches, the more likely it is that a discovered statistical relationship is coincidental and useless.
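
A brief sketch of the bottom rung (the sample sizes and the in-sample/out-of-sample split are illustrative assumptions): the strongest correlation mined from historical noise evaporates when tested on fresh data.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=100)
X = rng.normal(size=(500, 100))  # 500 candidate predictors, all pure noise

history, fresh = slice(0, 50), slice(50, 100)
in_sample = np.array([np.corrcoef(y[history], x[history])[0, 1] for x in X])
best = np.abs(in_sample).argmax()  # the mined "discovery"

out_of_sample = np.corrcoef(y[fresh], X[best][fresh])[0, 1]
print(f"in-sample:     r = {in_sample[best]:.2f}")
print(f"on fresh data: r = {out_of_sample:.2f}")
# The discovered relationship was coincidental, so it has no predictive
# value; testing on fresh data is what exposes it.
```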


Author(s): Gary Smith, Jay Cordes

A 2012 article in the Harvard Business Review named data scientist the “sexiest job of the 21st century.” Governments and businesses are scrambling to hire data scientists, and workers are clamoring to become data scientists, or at least to label themselves as such. Many colleges and universities now offer data science degrees, but their curricula differ wildly. Many businesses have data science divisions, but place few restrictions on what those divisions do. Many people say they are data scientists, but they may have simply taken some online programming courses and don’t know what they don’t know. The result is that the analyses produced by data scientists are sometimes spectacular and other times disastrous. In the rush to learn technical skills, the crucial principles of data science are often neglected....


Author(s): Gary Smith, Jay Cordes

An unfortunate reality of the age of big data is that Big Brother is monitoring us incessantly. Big Brother is indeed watching, but it is big business as well as big government that is collecting detailed information about everything we do in order to predict our actions and manipulate our behavior. Big business and big government monitor our credit cards, checking accounts, computers, and telephones; watch us on surveillance cameras; and purchase data from firms dedicated to finding out everything they can about each and every one of us. Good data scientists proceed cautiously, respectful of our rights and our privacy. The Golden Rule applies to data science: treat others as you would like to be treated.

