A Comparison of Published Effect Sizes in Communication Research: Evaluation Against Population Effects and Cohen’s Conventions

2020 ◽  
Author(s):  
Eric Jamaal Cooks ◽  
Scott Parrott ◽  
Danielle Deavours

Interpretations of effect size are typically made either through comparison with previous studies or against established benchmarks. This study examines the distribution of published effects among studies with and without preregistration in a set of 22 communication journals. Building on previous research in psychological science, 440 effects were randomly drawn from past publications without preregistration and compared against 35 effects from preregistered studies, and against Cohen’s conventions for effect size. Reported effects from studies without preregistration (median r = .33) were larger than those from studies with a preregistration plan (median r = .24). The magnitude of effects from studies without preregistration was greater across conventions for “small” and “large” effects. Differences were also found based on communication subdiscipline. These findings suggest that studies without preregistration may overestimate population effects, and that global conventions may not be applicable in communication science.
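As an illustration of the kind of benchmark comparison described above, the sketch below bins correlation coefficients against Cohen’s conventional cut-offs for r (roughly .10 small, .30 medium, .50 large) and compares group medians. The example values are hypothetical placeholders, not the study’s data.

```python
# Illustrative only: bin correlations against Cohen's (1988) conventions for r
# (small ~ .10, medium ~ .30, large ~ .50) and compare group medians.
from statistics import median

def cohen_label(r: float) -> str:
    """Return the conventional label for an absolute correlation."""
    r = abs(r)
    if r < 0.10:
        return "trivial"
    if r < 0.30:
        return "small"
    if r < 0.50:
        return "medium"
    return "large"

# Hypothetical effect sizes; the article's actual samples were 440 effects from
# non-preregistered studies and 35 from preregistered studies.
non_prereg = [0.12, 0.28, 0.33, 0.41, 0.55]
prereg = [0.08, 0.19, 0.24, 0.31]

for name, effects in [("non-preregistered", non_prereg), ("preregistered", prereg)]:
    labels = [cohen_label(r) for r in effects]
    print(name, "median r =", round(median(effects), 2), labels)
```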

2018 ◽  
Vol 22 (4) ◽  
pp. 469-476 ◽  
Author(s):  
Ian J. Davidson

The reporting and interpretation of effect sizes is often promoted as a panacea for the ramifications of institutionalized statistical rituals associated with the null-hypothesis significance test. Mechanical objectivity—conflating the use of a method with the obtainment of truth—is a useful theoretical tool for understanding the possible failure of effect size reporting (Porter, 1995). This article helps elucidate the ouroboros of psychological methodology: the cycle in which improved tools for producing trustworthy knowledge are institutionalized and adopted as forms of thinking, methodologists eventually admonish researchers for relying too heavily on rituals, and new and improved quantitative tools are produced that may follow the same circular path. Despite many critiques and warnings, research psychologists’ superficial adoption of effect sizes might preclude expert interpretation, much as has happened with the null-hypothesis significance test as it is widely practiced. One solution to this situation is bottom-up: promoting a balance of mechanical objectivity and expertise in the teaching of methods and research. This would require the acceptance and encouragement of expert interpretation within psychological science.


Author(s):  
David J. Miller ◽  
James T. Nguyen ◽  
Matteo Bottai

Artificial effect-size magnification (ESM) may occur in underpowered studies, where effects are reported only because they or their associated p-values have passed some threshold. Ioannidis (2008, Epidemiology 19: 640–648) and Gelman and Carlin (2014, Perspectives on Psychological Science 9: 641–651) have suggested that the plausibility of findings for a specific study can be evaluated by computing ESM, which requires statistical simulation. In this article, we present a new command, emagnification, that allows straightforward implementation of such simulations in Stata. The command automates these simulations for epidemiological studies and enables the user to assess ESM routinely for published studies using user-selected, study-specific inputs that are commonly reported in the published literature. The intention of the command is to allow a wider community to use ESM as a tool for evaluating the reliability of reported effect sizes and to put an observed statistically significant effect size into a fuller context with respect to potential implications for study conclusions.
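The emagnification command itself is Stata software and its syntax is not reproduced here; as a language-agnostic sketch of the underlying idea (with illustrative inputs, not the command’s actual implementation), the simulation below shows how conditioning on p < .05 in an underpowered two-group comparison magnifies the average reported effect relative to the assumed true effect, in the spirit of Gelman and Carlin’s (2014) design calculations.

```python
# Illustrative simulation of effect-size magnification (ESM): among
# statistically significant results from an underpowered two-group design,
# the average observed effect exceeds the true effect that generated the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n_per_group, n_sims, alpha = 0.2, 20, 10_000, 0.05  # assumed inputs

significant = []
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_d, 1.0, n_per_group)
    _, p_value = stats.ttest_ind(b, a)
    observed_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if p_value < alpha:
        significant.append(abs(observed_d))  # magnitude of the "publishable" effect

magnification = np.mean(significant) / true_d
print(f"Mean significant |d| = {np.mean(significant):.2f} vs true d = {true_d}; "
      f"magnification ~ {magnification:.1f}x")
```

Under these assumed inputs, the statistically significant results overestimate the true effect several-fold, which is the kind of magnification the command is designed to quantify for published studies.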


2016 ◽  
Author(s):  
Brian A. Nosek ◽  
Johanna Cohoon ◽  
Mallory Kidwell ◽  
Jeffrey Robert Spies

Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.


2021 ◽  
Author(s):  
Farid Anvari ◽  
Rogier Kievit ◽  
Daniel Lakens ◽  
Andrew K Przybylski ◽  
Leonid Tiokhin ◽  
...  

Psychological researchers currently lack guidance on how to evaluate the practical relevance of observed effect sizes, i.e., whether a finding will have impact when translated to a different context of application. Although psychologists have recently highlighted theoretical justifications for why small effect sizes might be practically relevant, such justifications are simplistic and fail to provide the information necessary for evaluation and falsification. Claims about whether an observed effect size is practically relevant need to consider both the mechanisms that amplify and those that counteract practical relevance, as well as the assumptions underlying each mechanism at play. To provide guidance for systematically evaluating whether an observed effect size is practically relevant, we present examples of widely applicable mechanisms and the key assumptions needed to justify whether an observed effect size can be expected to generalize to different contexts. Routine use of these mechanisms to justify claims about practical relevance has the potential to make researchers’ claims about generalizability substantially more transparent. This transparency can help move psychological science towards a more rigorous assessment of when psychological findings can be applied in the world.


2017 ◽  
Vol 28 (12) ◽  
pp. 1871-1871

Original article: Giner-Sorolla, R., & Chapman, H. A. (2017). Beyond purity: Moral disgust toward bad character. Psychological Science, 28, 80–91. doi:10.1177/0956797616673193

In this article, some effect sizes in the Results section for Study 1 were reported incorrectly and are now being corrected.

In the section titled Manipulation Checks: Act and Character Ratings, we reported a d value of 0.32 for the one-sample t test comparing participants’ act ratings with the midpoint of the scale; the correct value is 0.30. The sentence should read as follows: Follow-up one-sample t tests using the midpoint of the scale as a test value (because participants compared John with Robert) indicated that the cat beater’s actions were judged to be less wrong than the woman beater’s actions, t(86) = −2.82, p = .006, d = 0.30.

In the section titled Emotion Ratings, we reported a d value of 0.42 for the paired-samples t test comparing relative ratings of facial disgust and facial anger; the correct value is 0.34. In addition, the effect-size statistic is dz rather than d. The sentence should read as follows: As predicted, a paired-samples t test indicated that relative facial-disgust ratings (M = 4.36, SE = 0.21) were significantly different from relative facial-anger ratings (M = 3.63, SE = 0.20), t(86) = −3.12, p = .002, dz = 0.34; this indicates that the cat-beater and woman-beater scenarios differentially evoked disgust and anger.

Later in that section, we reported a d value of 0.21 for the one-sample t test comparing ratings of facial disgust with the midpoint of the scale; the correct value is 0.20. In the same sentence, we reported a d value of 0.21 for the one-sample t test comparing ratings of facial anger with the midpoint of the scale; the correct value is 0.19. The sentence should read as follows: Follow-up one-sample t tests against the midpoint of the scale showed trends in the predicted directions, with higher disgust for the cat beater compared with the woman beater, t(86) = 1.7, p = .088, d = 0.20, and higher anger for the woman beater compared with the cat beater, t(86) = −1.82, p = .072, d = 0.19 (see Fig. 1).

These errors do not affect the significance of the results or the overall conclusions for Study 1.
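As a reader’s check (not part of the correction itself), the corrected values are consistent with the standard relation d (or dz) = t/√n for one-sample and paired-samples t tests, where t(86) implies n = 87:

```python
# Check the corrected effect sizes using the standard conversion
# d (or dz) = t / sqrt(n) for one-sample and paired-samples t tests.
import math

n = 87  # df = 86, so n = 87
for label, t in [("act ratings d", -2.82), ("disgust vs anger dz", -3.12)]:
    print(label, round(abs(t) / math.sqrt(n), 2))
# -> 0.30 and 0.33, consistent with the corrected values of 0.30 and 0.34
#    given that the reported t statistics are themselves rounded.
```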


2021 ◽  
Author(s):  
Maximilian Primbs ◽  
Charlotte Rebecca Pennington ◽  
Daniel Lakens ◽  
Miguel Alejandro Silan ◽  
Dwayne Sean Noah Lieck ◽  
...  

Götz et al. (2021) argue that small effects are the indispensable foundation for a cumulative psychological science. Whilst we applaud their efforts to bring this important discussion to the forefront, we argue that their core arguments do not hold up under scrutiny and, if left uncorrected, have the potential to undermine best practices in reporting and interpreting effect size estimates. Their article can be used as a convenient blanket defense to justify ‘small’ effects as meaningful. In our reply, we first argue that comparisons between psychological science and genetics are fundamentally flawed because these disciplines have vastly different goals and methodologies. Second, we argue that p-values, not effect sizes, are the main currency for publication in psychology, meaning that any biases in the literature are caused by the pressure to publish statistically significant results, not by a pressure to publish large effects. Third, we contend that claims regarding small effects as important and consequential must be supported by empirical evidence, or at least by a falsifiable line of reasoning. Finally, we propose that researchers should evaluate effect sizes in relative rather than absolute terms, and we outline several approaches for doing so.


Methodology ◽  
2019 ◽  
Vol 15 (3) ◽  
pp. 97-105
Author(s):  
Rodrigo Ferrer ◽  
Antonio Pardo

In a recent paper, Ferrer and Pardo (2014) tested several distribution-based methods designed to assess when test scores obtained before and after an intervention reflect a statistically reliable change. However, we still do not know how these methods perform with respect to false negatives. For this purpose, we simulated change scenarios (different effect sizes in a pre-post-test design) with distributions of different shapes and with different sample sizes. For each simulated scenario, we generated 1,000 samples. In each sample, we recorded the false-negative rate of the five distribution-based methods with the best performance from the point of view of false positives. Our results reveal unacceptable rates of false negatives even for very large effects, ranging from 31.8% in an optimistic scenario (effect size of 2.0 and a normal distribution) to 99.9% in the worst scenario (effect size of 0.2 and a highly skewed distribution). Therefore, our results suggest that the widely used distribution-based methods must be applied with caution in clinical contexts, because they require huge effect sizes to detect a true change. We also offer some considerations regarding effect sizes and the commonly used cut-off points that allow for more precise estimates.
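The abstract does not name the five methods; as a hedged illustration of this kind of false-negative assessment, the sketch below estimates the miss rate of one widely used distribution-based criterion, the Jacobson–Truax reliable change index (RCI), in a simulated pre–post design where every case truly changes. The reliability, noise, and sample-size values are assumptions for illustration, not the article’s simulation settings.

```python
# Illustrative simulation: false-negative rate of the Jacobson-Truax reliable
# change index (RCI) when every simulated case truly changes. The RCI flags a
# case as changed when |post - pre| / SEdiff > 1.96.
import numpy as np

rng = np.random.default_rng(0)
n, n_samples, effect_size = 50, 1000, 2.0   # true change in pretest SD units
reliability = 0.8                            # assumed test-retest reliability

miss_rates = []
for _ in range(n_samples):
    pre = rng.normal(0.0, 1.0, n)
    post = pre + effect_size + rng.normal(0.0, 0.5, n)  # true change plus noise
    se_measure = pre.std(ddof=1) * np.sqrt(1 - reliability)
    se_diff = np.sqrt(2) * se_measure
    rci = (post - pre) / se_diff
    miss_rates.append(np.mean(np.abs(rci) <= 1.96))  # changed cases not flagged

print(f"Estimated false-negative rate: {np.mean(miss_rates):.1%}")
```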


2021 ◽  
pp. 174077452098487
Author(s):  
Brian Freed ◽  
Brian Williams ◽  
Xiaolu Situ ◽  
Victoria Landsman ◽  
Jeehyoung Kim ◽  
...  

Background: Blinding aims to minimize biases arising from what participants and investigators know or believe. Randomized controlled trials, despite being the gold standard for evaluating treatment effects, do not generally assess the success of blinding. We investigated the extent of blinding in back pain trials and the associations between participants’ guesses and treatment effects. Methods: We conducted a review using PubMed/Ovid MEDLINE, 2000–2019. Eligibility criteria were back pain trials with data available on treatment effect and participants’ guesses of treatment. For blinding, a blinding index was used as a chance-corrected measure of excessive correct guessing (0 indicating random guessing). For treatment effects, within- or between-arm effect sizes were used. Exploratory analyses by investigators’ guesses/blinding and by treatment modality were also performed. Results: Forty trials (3899 participants) were included. The active and sham treatment groups had mean blinding indexes of 0.26 (95% confidence interval: 0.12, 0.41) and 0.01 (−0.11, 0.14), respectively, meaning that 26% of participants in the active arm believed they received active treatment, whereas only 1% in the sham arm believed they received sham treatment, beyond chance (i.e., beyond random guessing). A greater belief of receiving active treatment was associated with a larger within-arm effect size in both arms, and ideal blinding (namely, “random guessing” and “wishful thinking,” in which both groups believe they received active treatment) was associated with smaller effect sizes; the correlation between effect size and summary blinding indexes was 0.35 (p = 0.028) for the between-arm comparison. We observed uniformly large sham treatment effects across all modalities and a larger correlation for investigators’ (un)blinding, 0.53 (p = 0.046). Conclusion: Participants in active treatment arms of back pain trials guessed their treatment identity more correctly than chance, while those in sham arms tended to be successfully blinded. Excessive correct guessing (which could reflect weaker blinding and/or noticeable effects) by participants and investigators was associated with larger effect sizes. Blinding and sham treatment effects on back pain deserve due consideration in individual trials and meta-analyses.
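The abstract describes the blinding index only as a chance-corrected measure of excessive correct guessing (0 for random guessing). A minimal sketch of one common formulation of such an index, in the spirit of Bang et al. (2004), is given below; the counts are hypothetical.

```python
# Illustrative chance-corrected blinding index for one trial arm:
# (proportion of correct guesses) - (proportion of incorrect guesses),
# with "don't know" responses contributing zero. 0 ~ random guessing,
# 1 ~ complete unblinding, -1 ~ opposite ("wishful") guessing.
def blinding_index(n_correct: int, n_incorrect: int, n_dont_know: int = 0) -> float:
    total = n_correct + n_incorrect + n_dont_know
    return (n_correct - n_incorrect) / total

# Hypothetical counts for an active arm of 100 participants:
print(blinding_index(n_correct=58, n_incorrect=32, n_dont_know=10))  # -> 0.26
```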


2013 ◽  
Vol 2013 ◽  
pp. 1-9 ◽  
Author(s):  
Liansheng Larry Tang ◽  
Michael Caudy ◽  
Faye Taxman

Multiple meta-analyses may use similar search criteria and focus on the same topic of interest, but they may yield different or sometimes discordant results. The lack of statistical methods for synthesizing these findings makes it challenging to properly interpret the results from multiple meta-analyses, especially when their results are conflicting. In this paper, we first introduce a method to synthesize the meta-analytic results when multiple meta-analyses use the same type of summary effect estimates. When meta-analyses use different types of effect sizes, the meta-analysis results cannot be directly combined. We propose a two-step frequentist procedure to first convert the effect size estimates to the same metric and then summarize them with a weighted mean estimate. Our proposed method offers several advantages over existing methods by Hemming et al. (2012). First, different types of summary effect sizes are considered. Second, our method provides the same overall effect size as conducting a meta-analysis on all individual studies from multiple meta-analyses. We illustrate the application of the proposed methods in two examples and discuss their implications for the field of meta-analysis.
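A minimal sketch of the two-step idea, using standard conversion formulas rather than necessarily the authors’ exact estimators: each meta-analytic summary is first converted to a common metric (here a correlation is converted to Cohen’s d, with a delta-method variance), and the converted summaries are then pooled with an inverse-variance weighted mean. The input summaries are hypothetical.

```python
# Illustrative two-step synthesis of meta-analytic summaries reported in
# different metrics: (1) convert each summary effect to a common metric
# (Cohen's d here), (2) pool with an inverse-variance weighted mean.
import math

def r_to_d(r: float) -> float:
    """Standard conversion from a correlation to Cohen's d."""
    return 2 * r / math.sqrt(1 - r ** 2)

# Hypothetical meta-analytic summaries: (effect, variance, metric)
summaries = [
    (0.45, 0.010, "d"),   # meta-analysis 1 reported Cohen's d
    (0.20, 0.002, "r"),   # meta-analysis 2 reported a correlation
]

converted = []
for effect, var, metric in summaries:
    if metric == "r":
        r = effect
        effect = r_to_d(r)
        var = 4 * var / (1 - r ** 2) ** 3  # delta-method variance of the d conversion
    converted.append((effect, var))

weights = [1 / var for _, var in converted]
pooled = sum(w * e for (e, _), w in zip(converted, weights)) / sum(weights)
print(f"Pooled effect (d metric): {pooled:.3f}")
```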


Author(s):  
H. S. Steyn ◽  
S. M. Ellis

The determination of the significance of differences in means and of relationships between variables is important in many empirical studies. Usually only statistical significance is reported, which does not necessarily indicate an important (practically significant) difference or relationship. In studies based on probability samples, effect size indices should be reported in addition to statistical significance tests in order to comment on practical significance. Where complete populations or convenience samples are used, the determination of statistical significance is, strictly speaking, no longer relevant, while effect size indices can be used as a basis for judging practical significance. In this article, attention is paid to the use of effect size indices to establish practical significance. It is also shown how these indices are utilized in a few fields of statistical application and how they receive attention in the statistical literature and in computer packages. The use of effect sizes is illustrated by a few examples from the research literature.
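As a brief, hypothetical illustration of the kind of effect size index discussed here, Cohen’s d for the difference between two group means can be computed from summary statistics and then judged for practical significance:

```python
# Illustrative effect size index for practical significance: Cohen's d for the
# difference between two group means, using the pooled standard deviation.
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical group summaries:
d = cohens_d(mean1=105.0, mean2=100.0, sd1=15.0, sd2=14.0, n1=60, n2=60)
print(f"Cohen's d = {d:.2f}")  # magnitude is then judged for practical significance
```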

