The 9 Pitfalls of Data Science
Published by Oxford University Press (ISBN 9780198844396, 9780191879937)

Author(s): Gary Smith, Jay Cordes

Researchers seeking fame and funding may be tempted to go on fishing expeditions (p-hacking) or to torture the data to find novel, provocative results that will be picked up by the popular media. Provocative findings are provocative because they are novel and unexpected, and they are often novel and unexpected because they are simply not true. The publication effect (or file drawer effect) keeps the failures hidden and has created a replication crisis. Research that gets reported in the popular media is often wrong, which fools people and undermines the credibility of scientific research.
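
A minimal simulation sketches why fishing expeditions “work” (the number of tests, the sample sizes, and the significance threshold below are illustrative assumptions): run enough hypothesis tests on pure noise and some will clear the significance bar by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
significant = 0
for _ in range(100):
    # Two samples from the SAME distribution: any apparent effect is noise.
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        significant += 1

# Expect roughly 5 false positives. Publish those, leave the rest in the
# file drawer, and a replication crisis follows.
print(f"{significant} of 100 null comparisons were 'significant' at p < 0.05")
```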


Author(s): Gary Smith, Jay Cordes

Computer programs, particularly deep neural networks and Monte Carlo simulations, are extremely useful for the specific tasks they have been designed to do, and they will get even better, much better. However, we should not assume that computers are smarter than us just because they can tell us the first 2000 digits of pi or show us a street map of every city in the world. One of the paradoxical things about computers is that they can excel at things that humans consider difficult (like calculating square roots) while failing at things that humans consider easy (like recognizing stop signs). They can’t pass simple tests like the Winograd Schema Challenge because they do not understand the world the way humans do. They have neither common sense nor wisdom. They are our tools, not our masters.
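
As a hedged illustration of the kind of narrow task software excels at, here is a Monte Carlo estimate of pi (the sample size is an arbitrary assumption). It is fast and accurate, yet it reflects no understanding of what a circle is.

```python
import random

random.seed(42)
n = 1_000_000
# Points land uniformly in the unit square; the fraction falling inside
# the quarter circle of radius 1 approximates pi/4.
inside = sum(
    1 for _ in range(n)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
print(f"pi is approximately {4 * inside / n}")
```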


Author(s): Gary Smith, Jay Cordes

Data-mining tools tend to be mathematically sophisticated, yet they often make implausible assumptions. For example, analysts often assume a normal distribution and disregard the fat tails that warn of “black swans.” Too often, the assumptions are hidden in the math, and the people who use the tools are more impressed by the math than curious about the assumptions. Instead of being blinded by math, good data scientists use explanatory variables that make sense. Good data scientists use math, but do not worship it. They know that math is an invaluable tool, but it is not a substitute for common sense, wisdom, or expertise.
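
A small sketch makes the fat-tail point concrete (the 4-sigma threshold and the Student-t distribution with 3 degrees of freedom are illustrative choices, not anything from the book): a normal model says extreme days almost never happen, while a fat-tailed model with the same variance says they are routine.

```python
import numpy as np
from scipy import stats

# Probability of a move beyond 4 standard deviations, in either direction.
p_normal = 2 * stats.norm.sf(4)
# Student-t with df=3 has variance 3, so rescale to compare at unit variance.
p_fat = 2 * stats.t.sf(4 * np.sqrt(3), df=3)

print(f"normal model:     {p_normal:.1e}  (~1 day in {1 / p_normal:,.0f})")
print(f"fat-tailed model: {p_fat:.1e}  (~1 day in {1 / p_fat:,.0f})")
# Black swans live in the gap between those two numbers.
```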


Author(s): Gary Smith, Jay Cordes

We are predisposed to discount the role of luck in our lives, believing that successes are earned and failures deserved. We misinterpret the temporary as permanent and invent theories to explain noise. We overreact when the unexpected happens, and we are too quick to make the unexpected the new expected. The key to understanding regression toward the mean is to look behind the data and recognize that when we see something remarkable, luck was most likely involved, so the underlying phenomenon is not as remarkable as it seems. Regression toward the mean should not be confused with the gambler’s fallacy, the belief that good luck must be balanced by bad luck; it says only that extremely good luck is generally followed by less extreme luck. The Sports Illustrated jinx is nothing more than this. Whenever there is uncertainty, people often make flawed decisions because they do not sufficiently appreciate regression toward the mean.
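
A minimal simulation (all parameters are illustrative assumptions) shows the mechanism: performance is skill plus luck, and the top performers in one season are typically lucky as well as good, so they fall back the next season even though their skill is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
skill = rng.normal(0, 1, n)            # persistent ability
season1 = skill + rng.normal(0, 1, n)  # ability plus luck, year one
season2 = skill + rng.normal(0, 1, n)  # same ability, fresh luck

top = np.argsort(season1)[-50:]        # the 50 "cover story" performers
print(f"top-50 average, season 1: {season1[top].mean():.2f}")
print(f"same players, season 2:   {season2[top].mean():.2f}")
# Season 2 is still above average (skill persists) but well below season 1
# (the luck does not): the Sports Illustrated jinx in miniature.
```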


Author(s): Gary Smith, Jay Cordes

The traditional statistical analysis of data follows what has come to be known as the scientific method: collecting reliable data to test plausible theories. Data mining goes in the other direction, analyzing data without being motivated or encumbered by theories. The fundamental problem with data mining is simple: we think that data patterns are unusual and therefore meaningful, when patterns are in fact inevitable and therefore meaningless. This is why data mining is usually not knowledge discovery but noise discovery. Finding correlations is easy. Good data scientists are not seduced by discovered patterns because they do not put data before theory. They do not commit the Texas Sharpshooter Fallacy or fall into the Feynman Trap.
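
A short sketch shows just how easy finding correlations is (the counts are arbitrary assumptions): correlate one noise series against a thousand other noise series and report the best match.

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.normal(size=50)
candidates = rng.normal(size=(1000, 50))

r = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.abs(r).argmax()
# A correlation this strong looks publishable, yet it is noise by
# construction: the sharpshooter painted the target after shooting.
print(f"best of 1,000 random 'predictors': r = {r[best]:.2f}")
```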


Author(s): Gary Smith, Jay Cordes

Good data scientists consider the reliability of the data, while data clowns don’t. Reported data sometimes systematically misrepresent the phenomena being recorded. Data can also be distorted by outliers: extremely unusual values that may be clerical errors, measurement errors, or flukes, and that can mislead us if not corrected. Other times, outliers are valuable data. We should always consider whether the data are skewed by unusual events or distorted by unreported “silent data.” If something is surprising about top-ranked groups, look at the bottom-ranked groups. Consider the possibility of survivorship bias and self-selection bias. Incomplete, inaccurate, or unreliable data can make clowns out of anyone.
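
A tiny example (with invented figures) shows how a single outlier, say a misplaced decimal point, can distort a summary statistic:

```python
import numpy as np

salaries = np.array([52, 48, 55, 50, 49, 51, 53, 47], dtype=float)  # $000s
print(f"clean data:   mean {salaries.mean():.1f}, median {np.median(salaries):.1f}")

with_typo = np.append(salaries, 510.0)  # 51.0 keyed in as 510
print(f"with outlier: mean {with_typo.mean():.1f}, median {np.median(with_typo):.1f}")
# The mean roughly doubles while the median barely moves. Investigate
# outliers before trusting a summary, but do not discard them blindly;
# sometimes the outlier is the story.
```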


Author(s): Gary Smith, Jay Cordes

In the 1970s, banks began selling mortgages to public and private mortgage funds that sold shares to investors. In the late 1990s and early 2000s, many mortgages to “subprime” borrowers with low credit ratings and modest incomes were approved because banks and mortgage brokers made money by making loans and then selling them, and didn’t care whether borrowers defaulted. Matters were complicated by financial engineering and compliant rating agencies. The Great Recession resulted from many people falling into several of the pitfalls of data science: they fooled themselves, they worshipped mathematics, they used bad data, they tortured data, and they did harm.


Author(s): Gary Smith, Jay Cordes

There is a hierarchy of predictive value that can be extracted from data. At the top of the hierarchy are causal relationships that can be confirmed with a randomized and controlled experiment or a natural experiment. Next best is to establish known or hypothesized relationships ahead of time and then test them and estimate their relative importance. One notch lower are associations found in historical data that are tested on fresh data after considering whether or not they make sense. At the bottom of the hierarchy, with little or no value, are associations found in historical data that are not confirmed by expert opinion or tested with fresh data. Data scientists who use a “correlations are enough” approach should remember that the more data and the more searches, the more likely it is that a discovered statistical relationship is coincidental and useless.
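
A brief sketch of the bottom rung (the sample sizes and the in-sample/out-of-sample split are illustrative assumptions): the strongest correlation mined from historical noise evaporates when tested on fresh data.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=100)
X = rng.normal(size=(500, 100))  # 500 candidate predictors, all pure noise

history, fresh = slice(0, 50), slice(50, 100)
in_sample = np.array([np.corrcoef(y[history], x[history])[0, 1] for x in X])
best = np.abs(in_sample).argmax()  # the mined "discovery"

out_of_sample = np.corrcoef(y[fresh], X[best][fresh])[0, 1]
print(f"in-sample:     r = {in_sample[best]:.2f}")
print(f"on fresh data: r = {out_of_sample:.2f}")
# The discovered relationship was coincidental, so it has no predictive
# value; testing on fresh data is what exposes it.
```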


Author(s): Gary Smith, Jay Cordes

A 2012 article in the Harvard Business Review named data scientist the “sexiest job of the 21st century.” Governments and businesses are scrambling to hire data scientists, and workers are clamoring to become data scientists, or at least to label themselves as such. Many colleges and universities now offer data science degrees, but their curricula differ wildly. Many businesses have data science divisions, but place few restrictions on what those divisions do. Many people say they are data scientists, but they may have simply taken some online programming courses and don’t know what they don’t know. The result is that the analyses produced by data scientists are sometimes spectacular and other times disastrous. In the rush to learn technical skills, the crucial principles of data science are often neglected....


Author(s): Gary Smith, Jay Cordes

An unfortunate reality of the age of big data is that Big Brother is monitoring us incessantly. Big Brother is indeed watching, but it is big business as well as big government that is collecting detailed information about everything we do in order to predict our actions and manipulate our behavior. Big business and big government monitor our credit cards, checking accounts, computers, and telephones; watch us on surveillance cameras; and purchase data from firms dedicated to finding out everything they can about each and every one of us. Good data scientists proceed cautiously, respectful of our rights and our privacy. The Golden Rule applies to data science: treat others as you would like to be treated.

