Statistics Notes: Interaction 2: compare effect sizes not P values

BMJ ◽  
1996 ◽  
Vol 313 (7060) ◽  
pp. 808-808 ◽  
Author(s):  
J. N S Matthews ◽  
D. G Altman


2021 ◽  
Vol 103 (3) ◽  
pp. 43-47
Author(s):  
David Steiner

Education leaders know that they should use research when choosing interventions for their schools, but they don’t always know how to read the research that is available. David Steiner explains some of the reasons that reading research is a low priority for educators on the front lines and offers some guidance for determining whether research results are meaningful without an extensive background in statistics. Ideally, education decision makers should look for randomized controlled trials with large effect sizes and small p-values.


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values can tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) itself is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will conflict, in terms of significance, in one third of cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis; they cannot be interpreted as supporting the null hypothesis, and concluding from them that 'there is no effect' is a false conclusion. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
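
A quick back-of-the-envelope check of the replication figures quoted in this abstract, assuming two independent studies of a true effect with the same power and treating power as the per-study probability of p ≤ 0.05 (the function below is illustrative, not code from the paper):

```python
# Sketch: replication arithmetic for two independent studies of a true effect.
# "Power" is assumed to be each study's probability of reaching p <= 0.05.

def replication_rates(power: float) -> tuple[float, float]:
    """Return (P(both studies significant), P(studies conflict in significance))."""
    both_significant = power * power
    conflicting = 2 * power * (1 - power)   # exactly one of the two is significant
    return both_significant, conflicting

for power in (0.40, 0.80):
    both, conflict = replication_rates(power)
    print(f"power={power:.0%}: both significant {both:.0%}, conflicting {conflict:.0%}")

# power=40%: both significant 16% (about one in six), conflicting 48%
# power=80%: both significant 64%, conflicting 32% (about one third)
```

The 16% and 32% figures match the "one in six" and "one third" claims in the abstract.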


2019 ◽  
Vol 33 (1) ◽  
pp. 50-55
Author(s):  
Daniela Dunkler ◽  
Maria Haller ◽  
Rainer Oberbauer ◽  
Georg Heinze

2018 ◽  
Vol 8 (1) ◽  
pp. 3-19 ◽  
Author(s):  
Yuanyuan Zhou ◽  
Susan Troncoso Skidmore

Historically, ANOVA has been the most prevalent statistical method in educational and psychological research, and it continues to be widely used today. A comprehensive review published in 1998 examined several APA journals and found persistent concerns in ANOVA reporting practices. The present authors examined all articles published in 2012 in three APA journals (Journal of Applied Psychology, Journal of Counseling Psychology, and Journal of Personality and Social Psychology) to review ANOVA reporting practices, including p values and effect sizes. Results indicated that ANOVA continues to be prevalent in the reviewed journals, both as a test of the primary research question and as a test of conditional assumptions prior to the primary analysis. Still, ANOVA reporting practices are essentially unchanged from what was previously reported. However, effect size reporting has improved.
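
For readers unfamiliar with the reporting practice under review, a minimal sketch of a one-way ANOVA reported with both a p value and an eta-squared effect size might look like the following (invented data; a standard computation, not taken from the reviewed articles):

```python
# Sketch: one-way ANOVA reported with a p value and an eta-squared effect size.
import numpy as np
from scipy import stats

groups = [
    np.array([4.1, 5.0, 4.7, 5.3, 4.9]),   # hypothetical group scores
    np.array([5.8, 6.1, 5.5, 6.4, 6.0]),
    np.array([4.6, 4.9, 5.2, 4.4, 5.0]),
]

f_stat, p_value = stats.f_oneway(*groups)

# Eta-squared = SS_between / SS_total, computed from the raw data.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F(2, {len(all_values) - 3}) = {f_stat:.2f}, p = {p_value:.4f}, "
      f"eta^2 = {eta_squared:.2f}")
```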


2016 ◽  
Vol 156 (6) ◽  
pp. 978-980 ◽  
Author(s):  
Peter M. Vila ◽  
Melanie Elizabeth Townsend ◽  
Neel K. Bhatt ◽  
W. Katherine Kao ◽  
Parul Sinha ◽  
...  

Effect sizes and confidence intervals are underreported in the current biomedical literature. The objective of this article is to discuss the recent paradigm shift encouraging the reporting of effect sizes and confidence intervals. Whereas P values inform us about whether an observed effect may be due to chance, effect sizes inform us about the magnitude of the effect (clinical significance), and confidence intervals inform us about the range of plausible estimates for the general population mean (precision). Reporting effect sizes and confidence intervals is a necessary addition to the biomedical literature, and these concepts are reviewed in this article.
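
A minimal sketch of the three quantities this abstract distinguishes, computed for a hypothetical two-group comparison (the data are invented for illustration; this is not code from the article):

```python
# Sketch: p value (chance), effect size (magnitude), confidence interval (precision).
import numpy as np
from scipy import stats

treatment = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.9, 24.8])
control   = np.array([21.0, 22.3, 20.8, 23.1, 21.9, 22.6, 21.4, 22.0])

# p value from an independent-samples t test
t_stat, p_value = stats.ttest_ind(treatment, control)

# Effect size: Cohen's d with a pooled standard deviation
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

# 95% confidence interval for the mean difference
diff = treatment.mean() - control.mean()
se_diff = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(f"p = {p_value:.4f}, d = {cohens_d:.2f}, "
      f"95% CI for the difference = ({ci[0]:.2f}, {ci[1]:.2f})")
```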


2009 ◽  
Vol 217 (1) ◽  
pp. 15-26 ◽  
Author(s):  
Geoff Cumming ◽  
Fiona Fidler

Most questions across science call for quantitative answers, ideally a single best estimate plus information about the precision of that estimate. A confidence interval (CI) expresses both efficiently. Early experimental psychologists sought quantitative answers, but for the last half century psychology has been dominated by the nonquantitative, dichotomous thinking of null hypothesis significance testing (NHST). The authors argue that psychology should rejoin mainstream science by asking better questions – those that demand quantitative answers – and using CIs to answer them. They explain CIs and a range of ways to think about them and use them to interpret data, especially by considering CIs as prediction intervals, which provide information about replication. They explain how to calculate CIs on means, proportions, correlations, and standardized effect sizes, and illustrate symmetric and asymmetric CIs. They also argue that the information provided by CIs is more useful than that provided by p values, or by values of Killeen’s p_rep, the probability of replication.
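
To illustrate two of the interval types the authors discuss, the following sketch computes a symmetric t-based CI on a mean and an asymmetric CI on a correlation via the Fisher z transform (invented data; not code from the article):

```python
# Sketch: a symmetric CI on a mean and an asymmetric CI on a correlation.
import numpy as np
from scipy import stats

# Symmetric 95% CI on a mean
sample = np.array([10.2, 11.5, 9.8, 10.9, 11.1, 10.4, 9.6, 10.8])
mean = sample.mean()
sem = stats.sem(sample)                       # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
ci_mean = (mean - t_crit * sem, mean + t_crit * sem)

# Asymmetric 95% CI on a Pearson correlation via the Fisher z transform
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 8.2])
r, _ = stats.pearsonr(x, y)
z = np.arctanh(r)                             # Fisher z
se_z = 1 / np.sqrt(len(x) - 3)
ci_r = tuple(np.tanh((z - 1.96 * se_z, z + 1.96 * se_z)))

print(f"mean = {mean:.2f}, 95% CI = ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"r = {r:.2f}, 95% CI = ({ci_r[0]:.2f}, {ci_r[1]:.2f})")
```

The interval on r is asymmetric around the point estimate because the Fisher z transform is nonlinear, which is the contrast with the symmetric interval on the mean.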


Author(s):  
Michael D. Jennions ◽  
Christopher J. Lortie ◽  
Julia Koricheva

This chapter begins with a brief review of why effect sizes and their variances are more informative than P-values. It then discusses how meta-analysis promotes “effective thinking” that can change approaches to several commonplace problems. Specifically, it addresses the issues of (1) exemplar studies versus average trends, (2) resolving “conflict” between specific studies, (3) presenting results, (4) deciding on the level at which to replicate studies, (5) understanding the constraints imposed by low statistical power, and (6) asking broad-scale questions that cannot be resolved in a single study. The chapter focuses on estimating effect sizes as a key outcome of meta-analysis, but acknowledges that other outcomes might be of more interest in other situations.
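
A minimal sketch of the idea that effect sizes and their variances carry the information meta-analysis needs: a fixed-effect, inverse-variance-weighted pooled estimate (the study-level values below are hypothetical, not taken from the chapter):

```python
# Sketch: fixed-effect meta-analysis with inverse-variance weights.
import numpy as np

# Hypothetical per-study standardized effect sizes and their sampling variances
effects   = np.array([0.30, 0.55, 0.12, 0.41, 0.25])
variances = np.array([0.04, 0.09, 0.02, 0.06, 0.03])

weights = 1.0 / variances                      # more precise studies count for more
pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
ci = (pooled_effect - 1.96 * pooled_se, pooled_effect + 1.96 * pooled_se)

print(f"pooled effect = {pooled_effect:.2f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```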


2020 ◽  
Author(s):  
Antonia Krefeld-Schwalb ◽  
Benjamin Scheibehenne

Following a lively discussion about the replicability of published findings, researchers have demanded increased efforts to improve research practices in empirical social science. Consequently, journals publishing consumer research implemented new measures to increase the replicability of published work. Nonetheless, no systematic empirical analysis of a large sample has investigated whether published consumer research has changed along with the discussion. To address this need, we surveyed three indicators of the replicability of published consumer research over time. We used text mining to quantify sample sizes, effect sizes, and the distribution of published p-values from a sample of N = 923 articles published between 2011 and 2018 in the Journal of Marketing Research, the Journal of Consumer Psychology, and the Journal of Consumer Research. To test developments over time, we focused on a subsample of hand-coded articles and identified the central hypothesis tests therein. Results show a trend toward increased sample sizes and decreased effect sizes across all three journals, in the subsample as well as the entire set of articles.
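
A rough sketch of the kind of text mining this abstract describes, pulling reported p values out of article text with a regular expression (the pattern and the example sentence are illustrative assumptions, not the authors' actual pipeline):

```python
# Sketch: extract reported p values from article text with a regular expression.
import re

text = ("The effect of condition was significant, F(1, 118) = 6.42, p = .013, "
        "while the interaction was not, F(1, 118) = 0.95, p > .25; "
        "a follow-up test yielded t(118) = 2.10, p < 0.05.")

# Match "p", a comparison operator, and a decimal such as .013 or 0.05
p_value_pattern = re.compile(r"p\s*([<>=])\s*(0?\.\d+)", re.IGNORECASE)

reported = [(op, float(val)) for op, val in p_value_pattern.findall(text)]
print(reported)   # [('=', 0.013), ('>', 0.25), ('<', 0.05)]
```

A production pipeline would also need to handle exact zeros, scientific notation, and "ns" annotations, but the sketch shows the basic extraction step.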

