emagnification: A tool for estimating effect-size magnification and performing design calculations in epidemiological studies

Artificial effect-size magnification (ESM) may occur in underpowered studies, where effects are reported only because they or their associated p-values have passed some threshold. Ioannidis (2008, Epidemiology 19: 640–648) and Gelman and Carlin (2014, Perspectives on Psychological Science 9: 641–651) have suggested that the plausibility of findings for a specific study can be evaluated by computation of ESM, which requires statistical simulation. In this article, we present a new command called emagnification that allows straightforward implementation of such simulations in Stata. The commands automate these simulations for epidemiological studies and enable the user to assess ESM routinely for published studies using user-selected, study-specific inputs that are commonly reported in published literature. The intention of the command is to allow a wider community to use ESMs as a tool for evaluating the reliability of reported effect sizes and to put an observed statistically significant effect size into a fuller context with respect to potential implications for study conclusions.

Download Full-text

There are no ‘Small’ or ‘Large’ Effects: A Reply to Götz et al. (2021)

10.31234/osf.io/6s8bj ◽

2021 ◽

Author(s):

Maximilian Primbs ◽

Charlotte Rebecca Pennington ◽

Daniel Lakens ◽

Miguel Alejandro Silan ◽

Dwayne Sean Noah Lieck ◽

...

Keyword(s):

Best Practices ◽

Empirical Evidence ◽

Effect Size ◽

Effect Sizes ◽

P Values ◽

Psychological Science ◽

Size Estimates

Götz et al. (2021) argue that small effects are the indispensable foundation for a cumulative psychological science. Whilst we applaud their efforts to bring this important discussion to the forefront, we argue that their core arguments do not hold up under scrutiny, and if left uncorrected have the potential to undermine best practices in reporting and interpreting effect size estimates. Their article can be used as a convenient blanket defense to justify ‘small’ effects as meaningful. In our reply, we first argue that comparisons between psychological science and genetics are fundamentally flawed because these disciplines have vastly different goals and methodology. Second, we argue that p-values, not effect sizes, are the main currency for publication in psychology, meaning that any biases in the literature are caused by this pressure to publish statistically significant results, not a pressure to publish large effects. Third, we contend that claims regarding small effects as important and consequential must be supported by empirical evidence, or at least require a falsifiable line of reasoning. Finally, we propose that researchers should evaluate effect sizes in relative, not absolute terms, and provide several approaches of how this can be achieved.

Download Full-text

The Ouroboros of Psychological Methodology: The Case of Effect Sizes (Mechanical Objectivity vs. Expertise)

Review of General Psychology ◽

10.1037/gpr0000154 ◽

2018 ◽

Vol 22 (4) ◽

pp. 469-476 ◽

Cited By ~ 4

Author(s):

Ian J. Davidson

Keyword(s):

Effect Size ◽

Null Hypothesis ◽

Significance Test ◽

Effect Sizes ◽

Circular Path ◽

Bottom Up ◽

Psychological Science ◽

Null Hypothesis Significance Test ◽

Theoretical Tool ◽

Mechanical Objectivity

The reporting and interpretation of effect sizes is often promoted as a panacea for the ramifications of institutionalized statistical rituals associated with the null-hypothesis significance test. Mechanical objectivity—conflating the use of a method with the obtainment of truth—is a useful theoretical tool for understanding the possible failure of effect size reporting ( Porter, 1995 ). This article helps elucidate the ouroboros of psychological methodology. This is the cycle of improved tools to produce trustworthy knowledge, leading to their institutionalization and adoption as forms of thinking, leading to methodologists eventually admonishing researchers for relying too heavily on rituals, finally leading to the production of more new improved quantitative tools that may follow along this circular path. Despite many critiques and warnings, research psychologists’ superficial adoption of effect sizes might preclude expert interpretation much like in the null-hypothesis significance test as widely received. One solution to this situation is bottom-up: promoting a balance of mechanical objectivity and expertise in the teaching of methods and research. This would require the acceptance and encouragement of expert interpretation within psychological science.

Download Full-text

Estimating the Reproducibility of Psychological Science

10.31219/osf.io/447b3 ◽

2016 ◽

Cited By ~ 4

Author(s):

Brian A. Nosek ◽

Johanna Cohoon ◽

Mallory Kidwell ◽

Jeffrey Robert Spies

Keyword(s):

Confidence Interval ◽

Effect Size ◽

Effect Sizes ◽

Original Result ◽

Psychological Science ◽

Substantial Decline

Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Download Full-text

Evaluating the practical relevance of observed effect sizes in psychological research

10.31234/osf.io/g3vtr ◽

2021 ◽

Author(s):

Farid Anvari ◽

Rogier Kievit ◽

Daniel Lakens ◽

Andrew K Przybylski ◽

Leonid Tiokhin ◽

...

Keyword(s):

Effect Size ◽

Psychological Research ◽

Effect Sizes ◽

Practical Relevance ◽

Psychological Science ◽

The World

Psychological researchers currently lack guidance for how to evaluate the practical relevance of observed effect sizes, i.e. whether a finding will have impact when translated to a different context of application. Although psychologists have recently highlighted theoretical justifications for why small effect sizes might be practically relevant, such justifications are simplistic and fail to provide the information necessary for evaluation and falsification. Claims about whether an observed effect size is practically relevant need to consider both the mechanisms amplifying and counteracting practical relevance, as well as the assumptions underlying each mechanism at play. To provide guidance for systematically evaluating whether an observed effect size is practically relevant, we present examples of widely applicable mechanisms and the key assumptions needed for justifying whether an observed effect size can be expected to generalize to different contexts. Routine use of these mechanisms to justify claims about practical relevance has the potential to make researchers’ claims about generalizability substantially more transparent. This transparency can help move psychological science towards a more rigorous assessment of when psychological findings can be applied in the world.

Download Full-text

Corrigendum: Beyond Purity: Moral Disgust Toward Bad Character

Psychological Science ◽

10.1177/0956797617736343 ◽

2017 ◽

Vol 28 (12) ◽

pp. 1871-1871

Keyword(s):

Effect Size ◽

T Test ◽

Effect Sizes ◽

Psychological Science ◽

Paired Samples ◽

Moral Disgust ◽

The One ◽

D Value

Original article: Giner-Sorolla, R., & Chapman, H. A. (2017). Beyond purity: Moral disgust toward bad character. Psychological Science, 28, 80–91. doi:10.1177/0956797616673193 In this article, some effect sizes in the Results section for Study 1 were reported incorrectly and are now being corrected. In the section titled Manipulation Checks: Act and Character Ratings, we reported a d value of 0.32 for the one-sample t test comparing participants’ act ratings with the midpoint of the scale; the correct value is 0.30. The sentence should read as follows: Follow-up one-sample t tests using the midpoint of the scale as a test value (because participants compared John with Robert) indicated that the cat beater’s actions were judged to be less wrong than the woman beater’s actions, t(86) = −2.82, p = .006, d = 0.30. In the section titled Emotion Ratings, we reported a d value of 0.42 for the paired-samples t test comparing relative ratings of facial disgust and facial anger; the correct value is 0.34. In addition, the effect-size statistic is dz rather than d. The sentence should read as follows: As predicted, a paired-samples t test indicated that relative facial-disgust ratings ( M = 4.36, SE = 0.21) were significantly different from relative facial-anger ratings ( M = 3.63, SE = 0.20), t(86) = −3.12, p = .002, dz = 0.34; this indicates that the cat-beater and woman-beater scenarios differentially evoked disgust and anger. Later in that section, we reported a d value of 0.21 for the one-sample t test comparing ratings of facial disgust with the midpoint of the scale; the correct value is 0.20. In the same sentence, we reported a d value of 0.21 for the one-sample t test comparing ratings of facial anger with the midpoint of the scale; the correct value is 0.19. The sentence should read as follows: Follow-up one-sample t tests against the midpoint of the scale showed trends in the predicted directions, with higher disgust for the cat beater compared with the woman beater, t(86) = 1.7, p = .088, d = 0.20, and higher anger for the woman beater compared with the cat beater, t(86) = −1.82, p = .072, d = 0.19 (see Fig. 1). These errors do not affect the significance of the results or the overall conclusions for Study 1.

Download Full-text

A Comparison of Published Effect Sizes in Communication Research: Evaluation Against Population Effects and Cohen’s Conventions

10.31219/osf.io/w86b5 ◽

2020 ◽

Author(s):

Eric Jamaal Cooks ◽

Scott Parrott ◽

Danielle Deavours

Keyword(s):

Effect Size ◽

Research Evaluation ◽

Effect Sizes ◽

Communication Research ◽

Psychological Science ◽

Population Effects

Interpretations of effect size are typically made either through comparison against previous studies or established benchmarks. This study examines the distribution of published effects among studies with and without preregistration in a set of 22 communication journals. Building from previous research in psychological science, 440 effects were randomly drawn from past publications without preregistration and compared against 35 effects from preregistered studies, and against Cohen’s conventions for effect size. Reported effects from studies without preregistration (median r = .33) were larger compared to those with a preregistration plan (median r = 0.24). The magnitude of effects from studies without preregistration was greater across conventions for “small,” and “large” effects. Differences were also found based on communication subdiscipline. These findings suggest that studies without preregistration may overestimate population effects, and that global conventions may not be applicable in communication science.

Download Full-text

Clinically Meaningful Change

Methodology ◽

10.1027/1614-2241/a000168 ◽

2019 ◽

Vol 15 (3) ◽

pp. 97-105

Author(s):

Rodrigo Ferrer ◽

Antonio Pardo

Keyword(s):

Effect Size ◽

False Negative ◽

False Negative Rate ◽

Point Of View ◽

Skewed Distribution ◽

Effect Sizes ◽

False Negatives ◽

Large Size ◽

Before And After ◽

Post Test

Abstract. In a recent paper, Ferrer and Pardo (2014) tested several distribution-based methods designed to assess when test scores obtained before and after an intervention reflect a statistically reliable change. However, we still do not know how these methods perform from the point of view of false negatives. For this purpose, we have simulated change scenarios (different effect sizes in a pre-post-test design) with distributions of different shapes and with different sample sizes. For each simulated scenario, we generated 1,000 samples. In each sample, we recorded the false-negative rate of the five distribution-based methods with the best performance from the point of view of the false positives. Our results have revealed unacceptable rates of false negatives even with effects of very large size, starting from 31.8% in an optimistic scenario (effect size of 2.0 and a normal distribution) to 99.9% in the worst scenario (effect size of 0.2 and a highly skewed distribution). Therefore, our results suggest that the widely used distribution-based methods must be applied with caution in a clinical context, because they need huge effect sizes to detect a true change. However, we made some considerations regarding the effect size and the cut-off points commonly used which allow us to be more precise in our estimates.

Download Full-text

How Do We Choose Our Giants? Perceptions of Replicability in Psychological Science

Advances in Methods and Practices in Psychological Science ◽

10.1177/25152459211018199 ◽

2021 ◽

Vol 4 (2) ◽

pp. 251524592110181

Author(s):

Manikya Alister ◽

Raine Vickers-Jones ◽

David K. Sewell ◽

Timothy Ballard

Keyword(s):

Individual Differences ◽

Sample Size ◽

Statistical Methods ◽

Scientific Progress ◽

Research Practices ◽

Specific Study ◽

Psychological Science ◽

Average Confidence

Judgments regarding replicability are vital to scientific progress. The metaphor of “standing on the shoulders of giants” encapsulates the notion that progress is made when new discoveries build on previous findings. Yet attempts to build on findings that are not replicable could mean a great deal of time, effort, and money wasted. In light of the recent “crisis of confidence” in psychological science, the ability to accurately judge the replicability of findings may be more important than ever. In this Registered Report, we examine the factors that influence psychological scientists’ confidence in the replicability of findings. We recruited corresponding authors of articles published in psychology journals between 2014 and 2018 to complete a brief survey in which they were asked to consider 76 specific study attributes that might bear on the replicability of a finding (e.g., preregistration, sample size, statistical methods). Participants were asked to rate the extent to which information regarding each attribute increased or decreased their confidence in the finding being replicated. We examined the extent to which each research attribute influenced average confidence in replicability. We found evidence for six reasonably distinct underlying factors that influenced these judgments and individual differences in the degree to which people’s judgments were influenced by these factors. The conclusions reveal how certain research practices affect other researchers’ perceptions of robustness. We hope our findings will help encourage the use of practices that promote replicability and, by extension, the cumulative progress of psychological science.

Download Full-text