The Phantom Pattern Problem
Published By Oxford University Press

9780198864165, 9780191896514

2020, pp. 181-206. Authors: Gary Smith, Jay Cordes

Patterns need not be combinations of numbers. For example, employees—ranging from clerks to CEOs—who do their jobs extremely well are often less successful when they are promoted to new positions—a disappointment immortalized by the Peter Principle: “managers rise to the level of their incompetence.” Patterns in observational data can be misleading because of self-selection bias, in that observed differences among people making different choices may be due to the type of people making such choices. When compelled to use observational data, it is important that the theories to be tested are specified before looking at the data. Otherwise, we are likely to be fooled by phantom patterns.
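The self-selection problem described above can be illustrated with a small simulation (a hypothetical sketch, not taken from the book; the "ambition" trait and "training course" scenario are invented for illustration). An unobserved trait drives both the choice to take a course and the outcome itself, so course-takers score higher even though the course has no causal effect at all.

```python
import random

random.seed(1)

# Hypothetical illustration: unobserved "ambition" drives both the choice
# to take a training course and job performance. The course itself has
# ZERO causal effect, yet takers outperform non-takers.
people = [{"ambition": random.gauss(0, 1)} for _ in range(10_000)]
for p in people:
    p["took_course"] = p["ambition"] + random.gauss(0, 1) > 0.5
    p["outcome"] = p["ambition"] + random.gauss(0, 1)  # course absent here

takers = [p["outcome"] for p in people if p["took_course"]]
others = [p["outcome"] for p in people if not p["took_course"]]
gap = sum(takers) / len(takers) - sum(others) / len(others)
print(f"observed gap: {gap:.2f}")  # positive, despite zero causal effect
```

Randomly assigning people to the course instead of letting them choose would break the link between ambition and group membership, and the gap would shrink toward zero.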



2020, pp. 137-152. Authors: Gary Smith, Jay Cordes

Attempts to replicate reported studies often fail because the research relied on data mining—searching through data for patterns without any pre-specified, coherent theories. The perils of data mining can be exacerbated by data torturing—slicing, dicing, and otherwise mangling data to create patterns. If there is no underlying reason for a pattern, it is likely to disappear when someone attempts to replicate the study. Big data and powerful computers are part of the problem, not the solution, in that they can easily identify an essentially unlimited number of phantom patterns and relationships, which vanish when confronted with fresh data. If a researcher will benefit from a claim, it is likely to be biased. If a claim sounds implausible, it is probably misleading. If the statistical evidence sounds too good to be true, it probably is.
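A short simulation (a sketch of the general phenomenon, not an example from the book) shows why mined patterns fail to replicate: ransack enough pure-noise "predictors" and one of them will correlate impressively with the target in-sample, yet the correlation vanishes in fresh data.

```python
import random

random.seed(0)
n, k = 100, 200  # 100 observations, 200 candidate predictors -- all pure noise

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

target = [random.gauss(0, 1) for _ in range(n)]
predictors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

# Data mining: keep whichever predictor correlates best in-sample.
best = max(predictors, key=lambda p: abs(corr(p, target)))
print(f"in-sample |r|  = {abs(corr(best, target)):.2f}")

# "Replication": fresh draws of the same variables -- the pattern vanishes.
fresh_target = [random.gauss(0, 1) for _ in range(n)]
fresh_best = [random.gauss(0, 1) for _ in range(n)]
print(f"fresh-data |r| = {abs(corr(fresh_best, fresh_target)):.2f}")
```

The in-sample correlation was never evidence of anything; it was simply the most extreme of 200 coin flips.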



Authors: Gary Smith, Jay Cordes

We are hard-wired to notice, seek, and be influenced by patterns. Sometimes these turn out to be useful; other times, they dupe and deceive us. Our affinity for patterns is powerful—no doubt, aided and abetted by selective recall and confirmation bias. We remember when a pattern persists and confirms our belief, and we forget or explain away times when it doesn’t. We are still under the spell of silly superstitions and captivated by numerical coincidences. We still think that some numbers are lucky, and others unlucky, even though the numbers deemed lucky and unlucky vary from culture to culture. We still think some numbers are special and notice them all around us. We still turn numerical patterns into laws and extrapolate flukes into confident predictions. The allure of patterns is hard to ignore. The temptation is hard to resist.



2020, pp. 121-136. Authors: Gary Smith, Jay Cordes

The Internet provides a firehose of data that researchers can use to understand and predict people’s behavior. However, unless A/B tests are used, these data are not from randomized controlled trials that allow us to rule out confounding influences. In addition, the people using the Internet in general, and social media in particular, are surely unrepresentative and their activities should be used cautiously for drawing conclusions about the general population. Things we read or see on the Internet are not necessarily true. Things we do on the Internet are not necessarily informative. An unrestrained scrutiny of searches, updates, tweets, hashtags, images, videos, or captions is certain to turn up an essentially unlimited number of phantom patterns that are entirely coincidental, and completely worthless.



Authors: Gary Smith, Jay Cordes

Patterns are inevitable and we should not be surprised by them. Streaks, clusters, and correlations are the norm, not the exception. In a large number of coin flips, there are likely to be coincidental clusters of heads and tails. In nationwide data on cancer, crime, or test scores, there are likely to be flukey clusters. When the data are separated into smaller geographic units like cities, the most extreme results are likely to be found in the smallest cities. In athletic competitions between well-matched teams, the outcome of a small number of games is almost meaningless. Our challenge is to overcome our inherited inclination to think that all patterns are meaningful; for example, thinking that clustering in large data sets or differences among small data sets must be something real that needs to be explained. Often, it is just meaningless happenstance.
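The coin-flip claim is easy to check with a quick simulation (an illustrative sketch; the specific numbers are chosen here, not taken from the book). In 1,000 fair flips, a run of eight or more identical outcomes is not an anomaly to be explained; it is what randomness normally looks like.

```python
import random

random.seed(42)

def longest_streak(flips):
    """Length of the longest run of identical outcomes."""
    best = run = 1
    for a, b in zip(flips, flips[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

# How often does a sequence of 1,000 fair flips contain a streak of 8+?
trials = 1_000
hits = sum(
    longest_streak([random.random() < 0.5 for _ in range(1_000)]) >= 8
    for _ in range(trials)
)
print(f"{hits / trials:.0%} of sequences contain a streak of 8 or more")
```

Anyone watching such a streak unfold would be tempted to invent an explanation for it, which is exactly the inclination the chapter warns against.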



2020, pp. 101-120. Authors: Gary Smith, Jay Cordes

The scientific method tests theories with data. Data mining dispenses with theories and searches through data for patterns, often aided by torturing the data with rearrangements, manipulations, and omissions. It is tempting to believe that big data increases the power of data mining. However, the paradox of big data is that the greater the amount of data we ransack for patterns, the more likely it is that what we find will be worthless or misleading. Reserving some of the data for out-of-sample tests is a good idea. However, data mining with in-sample and out-of-sample data is still data mining, and it is still subject to the same pitfalls. Some nuisance variables are likely to survive both the in-sample and out-of-sample tests, even though they are useless, and some true variables are likely to be overlooked and discarded. Data do not speak for themselves, and up is not always up.
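The point that out-of-sample tests do not rescue data mining can be demonstrated directly (a hypothetical sketch with invented thresholds, not the book's own experiment): screen enough pure-noise variables and some will clear a correlation cutoff in both the in-sample and the out-of-sample data.

```python
import random

random.seed(7)

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

n, k, cutoff = 100, 1_000, 0.17   # all k predictors are pure noise
y_in = [random.gauss(0, 1) for _ in range(n)]
y_out = [random.gauss(0, 1) for _ in range(n)]

survivors = 0
for _ in range(k):
    x_in = [random.gauss(0, 1) for _ in range(n)]
    x_out = [random.gauss(0, 1) for _ in range(n)]
    # A nuisance variable "survives" if it passes both screens by luck.
    if abs(corr(x_in, y_in)) > cutoff and abs(corr(x_out, y_out)) > cutoff:
        survivors += 1
print(f"{survivors} of {k} pure-noise variables passed both tests")
```

Each survivor would look like a validated discovery, yet every one is useless by construction.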



2020, pp. 153-180. Authors: Gary Smith, Jay Cordes. Keywords: The Past

Data are undeniably useful for answering many interesting and important questions, but data alone are not enough. Data without theory have been the source of a large (and growing) number of data miscues, missteps, and mishaps. We should resist the temptation to believe that data can answer all questions, and that more data means more reliable answers. Data can have errors and omissions or be irrelevant. In addition, patterns discovered in the past will vanish in the future unless there is an underlying reason for the pattern. Backtesting models in the stock market is particularly pernicious because it is so easy to find coincidental patterns that turn out to be expensive mistakes. This endemic problem has now spread far and wide because there are so many data that can be used by academic, business, and government researchers to discover phantom patterns.
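The backtesting trap can be sketched in a few lines (a hypothetical simulation with invented parameters, not a strategy from the book): test hundreds of random trading rules on past returns, pick the most profitable, and watch it earn nothing going forward.

```python
import random

random.seed(3)

def run(strategy_seed, returns):
    """Apply a random long-or-flat rule (fixed by its seed) to daily returns."""
    rng = random.Random(strategy_seed)
    return sum(r for r in returns if rng.random() < 0.5)

# 250 past and 250 future days of purely random returns.
past = [random.gauss(0, 0.01) for _ in range(250)]
future = [random.gauss(0, 0.01) for _ in range(250)]

# Backtest 500 coin-flip strategies and keep the best performer.
best_seed = max(range(500), key=lambda s: run(s, past))
print(f"backtest profit: {run(best_seed, past):+.3f}")
print(f"forward profit:  {run(best_seed, future):+.3f}")
```

The winning rule's backtest looks like skill, but it is only the most extreme of 500 lucky draws, so its forward performance reverts to chance.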



Authors: Gary Smith, Jay Cordes

Pattern recognition prowess served our ancestors well. Today, we are hard-wired to notice patterns, and this innate search for patterns can help us understand our world and make better decisions. However, patterns are not infallible. Sometimes, they are an illusion (like images of the Virgin Mary on a grilled cheese sandwich). Sometimes, they are a meaningless coincidence whose importance we exaggerate (like giving birth at 7:11 on July 11). Sometimes, they are harmful (like buying lottery tickets based on a serendipitous pattern). We are tossed and turned by a deluge of data. The number of possible patterns that can be identified relative to the number that are genuinely useful has grown exponentially—which means that the chance that a discovered pattern is useful is rapidly approaching zero. We can easily be fooled by phantom patterns.



2020, pp. 207-216. Authors: Gary Smith, Jay Cordes

In the eighteenth century, a Presbyterian minister named Thomas Bayes wrestled with a daunting question—the probability that God exists. Reverend Bayes was not only a minister; he had also studied logic and, most likely, mathematics at the University of Edinburgh (Presbyterians were not allowed to attend English universities)....
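The rule that bears Bayes' name updates a prior probability in light of evidence. A standard worked example (the disease-screening numbers below are illustrative assumptions, not figures from the book) shows how a rare condition stays improbable even after a positive test:

```python
# Bayes' rule with hypothetical numbers: a test that is 99% sensitive and
# 95% specific for a condition affecting 1 in 1,000 people.
prior = 0.001          # P(condition)
sensitivity = 0.99     # P(positive | condition)
false_pos = 0.05       # P(positive | no condition)

# P(positive) = P(pos | condition)P(condition) + P(pos | none)P(none)
p_positive = sensitivity * prior + false_pos * (1 - prior)

# P(condition | positive) = P(pos | condition)P(condition) / P(positive)
posterior = sensitivity * prior / p_positive
print(f"P(condition | positive) = {posterior:.1%}")  # about 1.9%
```

The low base rate dominates: most positives come from the large healthy population, so the posterior is far smaller than the test's 99% sensitivity suggests.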



Authors: Gary Smith, Jay Cordes

Coincidental correlations are useless for making predictions. In order to predict something, it has to be predictable; there must be an underlying causal structure—a real reason for the correlation. Correlation without causation means predictions without hope. Causation can be demonstrated by a randomized controlled trial (RCT) in which there is both a treatment group and a control group, and in which the subjects are randomly assigned to the two groups. There should also be enough data to draw meaningful conclusions. A/B tests are essentially RCTs for the Internet. Unfortunately, we often cannot do RCTs. We have to make do with observational data. A valid study specifies the theory to be tested before looking at the data. Finding a pattern after looking at the data is treacherous, and likely to end badly—with a worthless, temporary coincidental correlation.
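A minimal A/B-test sketch (the visitor counts and conversion rates are invented for illustration) shows the logic: because visitors are randomly assigned, the two groups differ only by treatment and chance, and a two-proportion z-test measures whether the observed lift exceeds what chance alone would produce.

```python
import random

random.seed(11)

def simulate(n, rate):
    """Number of conversions among n visitors with a given true rate."""
    return sum(random.random() < rate for _ in range(n))

n = 100_000
conv_a = simulate(n, 0.050)   # control: 5.0% true conversion rate
conv_b = simulate(n, 0.056)   # treatment: true lift of 0.6 points

p_a, p_b = conv_a / n, conv_b / n

# Two-proportion z-test under the pooled null of no difference.
p_pool = (conv_a + conv_b) / (2 * n)
se = (2 * p_pool * (1 - p_pool) / n) ** 0.5
z = (p_b - p_a) / se
print(f"lift = {p_b - p_a:+.3%}, z = {z:.2f}")
```

A large z-score here is trustworthy evidence precisely because the theory (treatment B converts better) was specified before the data were collected, and randomization rules out self-selection.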


