Out with .05, in with Replication and Measurement: Isolating and Working with the Particular Effect Sizes that are Troublesome for Inferential Statistics

2017 ◽  
Vol 144 (4) ◽  
pp. 309-316
Author(s):  
Michael T. Bradley ◽  
Andrew Brand


2020 ◽  
Author(s):  
Zoltan Dienes

Obtaining evidence that something does not exist requires knowing how big it would be were it to exist. Testing a theory that predicts an effect thus entails specifying the range of effect sizes consistent with the theory, in order to know when the evidence counts against the theory. Indeed, a theoretically relevant effect size must be specified for power calculations, equivalence testing, and Bayes factors in order that the inferential statistics test the theory. Specifying relevant effect sizes for power, or the equivalence region for equivalence testing, or the scale factor for Bayes factors, is necessary for many journal formats, such as registered reports, and should be necessary for all articles that use hypothesis testing. Yet there is little systematic advice on how to approach this problem. This article offers some principles and practical advice for specifying theoretically relevant effect sizes for hypothesis testing.
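To make the abstract's point concrete, a power calculation only tests a theory once a theoretically relevant effect size has been specified. The short Python sketch below illustrates this; the smallest effect of interest (d = 0.4) and the target power are illustrative assumptions, not values from the article.

```python
# Minimal power-analysis sketch: the theory-derived smallest effect of
# interest is what turns a sample-size calculation into a test of the theory.
from statsmodels.stats.power import TTestIndPower

smallest_effect_of_interest = 0.4   # hypothetical Cohen's d predicted by the theory
analysis = TTestIndPower()

# Sample size per group needed to detect d = 0.4 with 90% power at alpha = .05
n_per_group = analysis.solve_power(effect_size=smallest_effect_of_interest,
                                   alpha=0.05, power=0.90,
                                   alternative='two-sided')
print(f"n per group: {n_per_group:.0f}")
```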


1998 ◽  
Vol 21 (2) ◽  
pp. 222-223
Author(s):  
Bruce A. Thyer

Chow's defense of NHSTP is masterful. His dismissal of including effect sizes (ES), however, is misplaced, and his failure to discuss the additional practice of reporting proportions of variance explained (PVE) is an important omission. The reporting of inferential statistics would be greatly enhanced by including ES and PVE once results are first determined to be statistically significant.
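As an illustration of the recommended reporting practice, the sketch below computes an effect size (Cohen's d) and a proportion of variance explained (eta-squared) alongside an ordinary t test. The data are simulated; the example is not drawn from Chow's or Thyer's text.

```python
# Report ES and PVE together with the t statistic and p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 40)   # simulated control scores
group_b = rng.normal(0.5, 1.0, 40)   # simulated treatment scores

t, p = stats.ttest_ind(group_a, group_b)

# Cohen's d with a pooled standard deviation
pooled_sd = np.sqrt(((len(group_a) - 1) * group_a.var(ddof=1) +
                     (len(group_b) - 1) * group_b.var(ddof=1)) /
                    (len(group_a) + len(group_b) - 2))
d = (group_b.mean() - group_a.mean()) / pooled_sd

# Eta-squared from t: proportion of total variance explained by group membership
df = len(group_a) + len(group_b) - 2
eta_sq = t**2 / (t**2 + df)

print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}, eta^2 = {eta_sq:.2f}")
```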


Author(s):  
Valentin Amrhein ◽  
David Trafimow ◽  
Sander Greenland

Statistical inference often fails to replicate. One reason is that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to biased reported effect sizes. Nonetheless, considerable non-replication is to be expected even without selective reporting, and generalizations from single studies are rarely if ever warranted. Honestly reported results must vary from replication to replication because of varying assumption violations and random variation; excessive agreement itself would suggest deeper problems, such as failure to publish results in conflict with group expectations or desires. A general perception of a "replication crisis" may thus reflect failure to recognize that statistical tests test not only hypotheses but also countless assumptions and the entire environment in which research takes place. Because of all the uncertain and unknown assumptions that underpin statistical inferences, we should treat inferential statistics as highly unstable local descriptions of relations between assumptions and data, rather than as generalizable inferences about hypotheses or models. And that means we should treat statistical results as being much more incomplete and uncertain than is currently the norm. Acknowledging this uncertainty could help reduce the allure of selective reporting: Since a small P-value could be large in a replication study, and a large P-value could be small, there is simply no need to selectively report studies based on statistical results. Rather than focusing our study reports on uncertain conclusions, we should thus focus on describing accurately how the study was conducted, what problems occurred, what data were obtained, what analysis methods were used and why, and what output those methods produced.
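The claim that a small P-value could be large in a replication study (and vice versa) can be checked with a simple simulation. The sketch below is illustrative only; the true effect size, sample size, and number of replications are assumptions, not values from the article.

```python
# With the same true effect and honest reporting, p-values still vary
# enormously across replications of the same design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, reps = 0.3, 50, 10_000   # hypothetical settings

p_values = []
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    p_values.append(stats.ttest_ind(a, b).pvalue)
p_values = np.array(p_values)

print(f"share of replications with p < .05: {np.mean(p_values < 0.05):.2f}")
print(f"p-value quartiles: {np.percentile(p_values, [25, 50, 75]).round(3)}")
```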


1986 ◽  
Vol 17 (2) ◽  
pp. 83-99 ◽  
Author(s):  
Ray Hembree ◽  
Donald J. Dessart

The findings of 79 research reports were integrated by meta-analysis to assess the effects of calculators on student achievement and attitude. Effect sizes were derived by the method invented by Glass and tested for consistency and significance with inferential statistics provided by Hedges. At all grades but Grade 4, the use of calculators in concert with traditional mathematics instruction apparently improves the average student's basic skills with paper and pencil, both in working exercises and in problem solving. Sustained calculator use in Grade 4 appears to hinder the development of basic skills in average students. Across all grade and ability levels, students using calculators have a better attitude toward mathematics and, especially, a better self-concept in mathematics than students not using calculators.
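For readers unfamiliar with the machinery, the sketch below shows the general shape of the analysis described: Glass-style effect sizes (mean difference divided by the control-group standard deviation) combined with inverse-variance weights and a homogeneity test in the spirit of Hedges. The study values are invented for illustration and are not taken from the 79 reports.

```python
# Glass's delta per study, fixed-effect weighted mean, and a Q statistic
# for consistency across studies.
import numpy as np

# Each study: (mean_treat, mean_ctrl, sd_ctrl, n_treat, n_ctrl) -- hypothetical
studies = [(52.0, 48.0, 10.0, 30, 30),
           (75.0, 70.0, 12.0, 45, 40),
           (33.0, 32.0,  8.0, 25, 28)]

deltas, weights = [], []
for m_t, m_c, sd_c, n_t, n_c in studies:
    delta = (m_t - m_c) / sd_c                       # Glass's delta
    # Large-sample approximation to the sampling variance of delta
    var = (n_t + n_c) / (n_t * n_c) + delta**2 / (2 * (n_c - 1))
    deltas.append(delta)
    weights.append(1.0 / var)

deltas, weights = np.array(deltas), np.array(weights)
mean_delta = np.sum(weights * deltas) / np.sum(weights)   # fixed-effect mean
q = np.sum(weights * (deltas - mean_delta)**2)            # homogeneity Q
print(f"weighted mean delta = {mean_delta:.2f}, Q = {q:.2f} (df = {len(deltas) - 1})")
```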


2018 ◽  
Author(s):  
Valentin Amrhein ◽  
David Trafimow ◽  
Sander Greenland

Statistical inference often fails to replicate. One reason is that many results may be selected for drawing inference because some threshold of a statistic like the P-value was crossed, leading to biased effect sizes. Nonetheless, considerable non-replication is to be expected even without selective reporting, and generalizations from single studies are rarely if ever warranted. Honestly reported results must vary from replication to replication because of varying assumption violations and random variation; excessive agreement itself would suggest deeper problems, such as failure to publish results in conflict with group expectations or desires. A general perception of a "replication crisis" may thus reflect failure to recognize that statistical tests test not only hypotheses but also countless assumptions and the entire environment in which research takes place. Because of all the uncertain and unknown assumptions that underpin statistical inferences, we should treat inferential statistics as highly unstable local descriptions of relations between assumptions and data, rather than as generalizable inferences about hypotheses or models. And that means we should treat statistical results as being much more incomplete and uncertain than is currently the norm. Acknowledging this uncertainty could help reduce the allure of selective reporting: Since a small P-value could be large in a replication study, and a large P-value could be small, there is simply no need to selectively report studies based on statistical results. Rather than focusing our study reports on uncertain conclusions, we should thus focus on describing accurately how the study was conducted, what data resulted, what analysis methods were used and why, and what problems occurred.


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Zoltan Dienes

Obtaining evidence that something does not exist requires knowing how big it would be were it to exist. Testing a theory that predicts an effect thus entails specifying the range of effect sizes consistent with the theory, in order to know when the evidence counts against the theory. Indeed, a theoretically relevant effect size must be specified for power calculations, equivalence testing, and Bayes factors in order that the inferential statistics test the theory. Specifying relevant effect sizes for power, or the equivalence region for equivalence testing, or the scale factor for Bayes factors, is necessary for many journal formats, such as registered reports, and should be necessary for all articles that use hypothesis testing. Yet there is little systematic advice on how to approach this problem. This article offers some principles and practical advice for specifying theoretically relevant effect sizes for hypothesis testing.
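Complementing the power-analysis sketch earlier in this listing, the example below shows the equivalence-testing case: a theory-derived equivalence region is what allows two one-sided tests (TOST) to count evidence against a predicted effect. The region of ±2 raw units and the simulated data are illustrative assumptions, not values from the article.

```python
# TOST equivalence test against a pre-specified equivalence region.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(7)
treatment = rng.normal(0.2, 5.0, 80)   # simulated scores, hypothetical
control   = rng.normal(0.0, 5.0, 80)

# Two one-sided tests against the bounds of the equivalence region (-2, +2)
p_equiv, lower_test, upper_test = ttost_ind(treatment, control, -2.0, 2.0)
print(f"TOST p-value for equivalence within +/-2 units: {p_equiv:.3f}")
```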


1981 ◽  
Vol 41 (4) ◽  
pp. 993-1000 ◽  
Author(s):  
David L. Ronis

It is often interesting to compare the size of treatment effects in analysis of variance designs. Many researchers, however, draw the conclusion that one independent variable has more impact than another without testing the null hypothesis that their impact is equal. Most often, investigators compute the proportion of variance each factor accounts for and infer population characteristics from these values. Because such analyses are based on descriptive rather than inferential statistics, they never justify the conclusion that one factor has more impact than the other. This paper presents a novel technique for testing the relative magnitude of effects. It is recommended that researchers interested in comparing effect sizes apply this technique rather than basing their conclusions solely on descriptive statistics.
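The descriptive comparison the abstract warns about looks like the sketch below: eta-squared values for two factors can be computed and compared, but that comparison is not a test of the null hypothesis that the factors have equal impact. The article's own inferential technique is not reproduced here; the data and factor names are simulated for illustration.

```python
# Proportion of variance explained (eta-squared) per factor in a 2x2 design.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
a = np.repeat(["a1", "a2"], 60)
b = np.tile(np.repeat(["b1", "b2"], 30), 2)
y = 0.8 * (a == "a2") + 0.4 * (b == "b2") + rng.normal(0, 1, 120)
data = pd.DataFrame({"A": a, "B": b, "y": y})

model = smf.ols("y ~ C(A) * C(B)", data=data).fit()
table = anova_lm(model, typ=2)

# Eta-squared per effect: its sum of squares over the total sum of squares.
# Comparing these values across factors is descriptive, not inferential.
table["eta_sq"] = table["sum_sq"] / table["sum_sq"].sum()
print(table[["sum_sq", "eta_sq"]].round(3))
```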


2020 ◽  
Vol 29 (3) ◽  
pp. 1574-1595
Author(s):  
Chaleece W. Sandberg ◽  
Teresa Gray

Purpose: We report on a study that replicates previous treatment studies using Abstract Semantic Associative Network Training (AbSANT), which was developed to help persons with aphasia improve their ability to retrieve abstract words, as well as thematically related concrete words. We hypothesized that previous results would be replicated; that is, when abstract words are trained using this protocol, improvement would be observed for both abstract and concrete words in the same context-category, but when concrete words are trained, no improvement for abstract words would be observed. We then frame the results of this study with the results of previous studies that used AbSANT to provide better evidence for the utility of this therapeutic technique. We also discuss proposed mechanisms of AbSANT.
Method: Four persons with aphasia completed one phase of concrete word training and one phase of abstract word training using the AbSANT protocol. Effect sizes were calculated for each word type for each phase. Effect sizes for this study are compared with the effect sizes from previous studies.
Results: As predicted, training abstract words resulted in both direct training and generalization effects, whereas training concrete words resulted in only direct training effects. The reported results are consistent across studies. Furthermore, when the data are compared across studies, there is a distinct pattern of the added benefit of training abstract words using AbSANT.
Conclusion: Treatment for word retrieval in aphasia is most often aimed at concrete words, despite the usefulness and pervasiveness of abstract words in everyday conversation. We show the utility of AbSANT as a means of improving not only abstract word retrieval but also concrete word retrieval and hope this evidence will help foster its application in clinical practice.

