Scientific Self-Correction: The Bayesian Way

2019
Author(s): Felipe Romero, Jan Sprenger

The enduring replication crisis in many scientific disciplines casts doubt on the ability of science to self-correct its findings and to produce reliable knowledge. Amongst a variety of possible methodological, social, and statistical reforms to address the crisis, we focus on replacing null hypothesis significance testing (NHST) with Bayesian inference. On the basis of a simulation study of meta-analytic aggregation of effect sizes, we study the relative advantages of this Bayesian reform and its interaction with widespread limitations in experimental research. Moving to Bayesian statistics will not solve the replication crisis single-handedly, but it would eliminate important sources of effect size overestimation under the conditions we study.
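
A minimal Python sketch of the overestimation mechanism at issue (an illustration of the general idea, not the authors' simulation code; the true effect, per-group sample size, and study count are invented):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_studies = 0.2, 20, 10_000
d_all, d_sig = [], []

for _ in range(n_studies):
    treat = rng.normal(true_d, 1.0, n)
    ctrl = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(treat, ctrl).pvalue
    # Standardized effect size estimate (Cohen's d) for this study.
    d = (treat.mean() - ctrl.mean()) / np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    d_all.append(d)
    if p < 0.05:        # significance filter: only "positive" studies kept
        d_sig.append(d)

print(f"true effect:                    {true_d}")
print(f"mean estimate, all studies:     {np.mean(d_all):.2f}")
print(f"mean estimate, significant only: {np.mean(d_sig):.2f}")

With these settings the aggregate over all studies recovers the true effect, while the aggregate over significant studies alone overshoots it severalfold, which is the kind of effect size overestimation a reform that reports all estimates, rather than only significant ones, can eliminate.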

2019
Author(s): Jan Sprenger

The replication crisis poses an enormous challenge to the epistemic authority of science and to the logic of statistical inference in particular. Two prominent features of Null Hypothesis Significance Testing (NHST) arguably contribute to the crisis: the lack of guidance for interpreting non-significant results and the impossibility of quantifying support for the null hypothesis. In this paper, I argue that popular alternatives to NHST, such as confidence intervals and Bayesian inference, also fail to provide a satisfactory logic for evaluating hypothesis tests. As an alternative, I motivate and explicate the concept of corroboration of the null hypothesis. Finally, I show how degrees of corroboration provide an interpretation of non-significant results, combat publication bias, and mitigate the replication crisis.


2015
Vol. 37 (4), pp. 449-461
Author(s): Andreas Ivarsson, Mark B. Andersen, Andreas Stenling, Urban Johnson, Magnus Lindwall

Null hypothesis significance testing (NHST) is like an immortal horse that some researchers have been trying to beat to death for over 50 years but without any success. In this article, we discuss the flaws in NHST, the historical background in relation to both Fisher’s and Neyman and Pearson’s statistical ideas, the common misunderstandings of what p < .05 actually means, and the 2010 APA publication manual’s clear, but most often ignored, instructions to report effect sizes and to interpret what they mean in the real world. In addition, we discuss how Bayesian statistics can be used to overcome some of the problems with NHST. We then analyze quantitative articles published over the past three years (2012–2014) in two top-rated sport and exercise psychology journals to determine whether we have learned what we should have learned decades ago about our use and meaningful interpretation of statistics.
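
To make the "what p < .05 actually means" point concrete, here is a hedged Python sketch (the effect size and sample sizes are invented) showing that p depends heavily on sample size while a standardized effect size such as Cohen's d does not:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3  # one fixed, modest true effect (Cohen's d)

for n in (20, 200, 2000):  # per-group sample sizes
    x = rng.normal(true_d, 1.0, n)
    y = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(x, y).pvalue
    d = (x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    print(f"n = {n:4d}   p = {p:.4f}   d = {d:.2f}")

The same modest effect is "non-significant" at small n and highly "significant" at large n, which is why the APA manual's instruction to report and interpret effect sizes matters.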


2018
Vol. 47 (1), pp. 435-453
Author(s): Erik Otárola-Castillo, Melissa G. Torquato

Null hypothesis significance testing (NHST) is the most common statistical framework used by scientists, including archaeologists. Owing to increasing dissatisfaction, however, Bayesian inference has become an alternative to these methods. In this article, we review the application of Bayesian statistics to archaeology. We begin with a simple example to demonstrate the differences in applying NHST and Bayesian inference to an archaeological problem. Next, we formally define NHST and Bayesian inference, provide a brief historical overview of their development, and discuss the advantages and limitations of each method. A review of Bayesian inference and archaeology follows, highlighting the applications of Bayesian methods to chronological, bioarchaeological, zooarchaeological, ceramic, lithic, and spatial analyses. We close by considering the future applications of Bayesian statistics to archaeological research.
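
As a hedged stand-in for the kind of simple example the authors describe (the sherd counts and reference proportion below are hypothetical), the two frameworks can be compared on one archaeological question, the proportion of decorated sherds at a site:

from scipy import stats

k, n = 62, 100  # hypothetical: 62 decorated sherds in a sample of 100
p0 = 0.5        # reference proportion under the null hypothesis

# NHST: exact two-sided binomial test of H0: p = 0.5.
p_value = stats.binomtest(k, n, p0).pvalue

# Bayesian: a flat Beta(1, 1) prior updates to Beta(1 + k, 1 + n - k).
posterior = stats.beta(1 + k, 1 + n - k)
prob_gt_half = 1 - posterior.cdf(p0)    # posterior P(p > 0.5)
lo, hi = posterior.ppf([0.025, 0.975])  # 95% credible interval

print(f"NHST:  p = {p_value:.3f}")
print(f"Bayes: P(p > 0.5 | data) = {prob_gt_half:.3f}, 95% CrI [{lo:.2f}, {hi:.2f}]")

NHST returns a p value for the null; the Bayesian analysis returns a full posterior, from which one can read off the probability of the hypothesis of interest and a credible interval.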


2009
Vol. 217 (1), pp. 15-26
Author(s): Geoff Cumming, Fiona Fidler

Most questions across science call for quantitative answers: ideally, a single best estimate plus information about the precision of that estimate. A confidence interval (CI) expresses both efficiently. Early experimental psychologists sought quantitative answers, but for the last half century psychology has been dominated by the nonquantitative, dichotomous thinking of null hypothesis significance testing (NHST). The authors argue that psychology should rejoin mainstream science by asking better questions – those that demand quantitative answers – and using CIs to answer them. They explain CIs and a range of ways to think about them and use them to interpret data, especially by considering CIs as prediction intervals, which provide information about replication. They explain how to calculate CIs on means, proportions, correlations, and standardized effect sizes, and illustrate symmetric and asymmetric CIs. They also argue that the information provided by CIs is more useful than that provided by p values, or by values of Killeen’s p_rep, the probability of replication.
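
A short Python sketch of two of the calculations discussed (the sample and the counts are invented): a symmetric t-based CI on a mean, and an asymmetric Wilson score CI on a proportion.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10.0, 2.0, 25)  # hypothetical sample of 25 measurements

# Symmetric 95% CI on a mean: estimate +/- t_crit * standard error.
m, se = x.mean(), stats.sem(x)
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
print(f"mean: {m:.2f}, 95% CI [{m - t_crit * se:.2f}, {m + t_crit * se:.2f}]")

# Asymmetric 95% Wilson score CI on a proportion (k successes in n trials).
k, n = 4, 20
z = stats.norm.ppf(0.975)
centre = (k + z**2 / 2) / (n + z**2)
half = z * np.sqrt(k * (n - k) / n + z**2 / 4) / (n + z**2)
print(f"proportion: {k / n:.2f}, 95% CI [{centre - half:.3f}, {centre + half:.3f}]")

The proportion interval is asymmetric around the point estimate because the sampling distribution is skewed near the 0 and 1 boundaries.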


2021
Author(s): Mark Rubin

Scientists often adjust their significance threshold (alpha level) during null hypothesis significance testing in order to take into account multiple testing and multiple comparisons. This alpha adjustment has become particularly relevant in the context of the replication crisis in science. The present article considers the conditions in which this alpha adjustment is appropriate and the conditions in which it is inappropriate. A distinction is drawn between three types of multiple testing: disjunction testing, conjunction testing, and individual testing. It is argued that alpha adjustment is only appropriate in the case of disjunction testing, in which at least one test result must be significant in order to reject the associated joint null hypothesis. Alpha adjustment is inappropriate in the case of conjunction testing, in which all relevant results must be significant in order to reject the joint null hypothesis. Alpha adjustment is also inappropriate in the case of individual testing, in which each individual result must be significant in order to reject each associated individual null hypothesis. The conditions under which each of these three types of multiple testing is warranted are examined. It is concluded that researchers should not automatically (mindlessly) assume that alpha adjustment is necessary during multiple testing. Illustrations are provided in relation to joint studywise hypotheses and joint multiway ANOVAwise hypotheses.
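
A minimal sketch of the three decision rules, using Bonferroni adjustment as one common adjustment method (the p values below are invented):

# The p values are invented; the adjustment shown is Bonferroni.
p_values = [0.030, 0.040, 0.020]
alpha, m = 0.05, len(p_values)

# Disjunction testing: reject the joint null if AT LEAST ONE result is
# significant, so alpha is adjusted to control the familywise error rate.
disjunction_reject = any(p < alpha / m for p in p_values)

# Conjunction testing: reject the joint null only if ALL results are
# significant; on the article's account, no adjustment is appropriate.
conjunction_reject = all(p < alpha for p in p_values)

# Individual testing: each null hypothesis is assessed on its own, at the
# unadjusted alpha.
individual_rejects = [p < alpha for p in p_values]

print(disjunction_reject)   # False: no p value clears 0.05 / 3
print(conjunction_reject)   # True: every p value clears 0.05
print(individual_rejects)   # [True, True, True]

Here the disjunction-wise joint null survives the adjusted threshold, while the conjunction and individual tests, which need no adjustment on the article's account, all reject.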


2021
Author(s): Erik Otárola-Castillo, Melissa G. Torquato, Caitlin E. Buck

Archaeologists often use data and quantitative statistical methods to evaluate their ideas. Although there are various statistical frameworks for decision-making in archaeology and science in general, in this chapter, we provide a simple explanation of Bayesian statistics. To contextualize the Bayesian statistical framework, we briefly compare it to the more widespread null hypothesis significance testing (NHST) approach. We also provide a simple example to illustrate how archaeologists use data and the Bayesian framework to compare hypotheses and evaluate their uncertainty. We then review how archaeologists have applied Bayesian statistics to solve research problems related to radiocarbon dating and chronology, lithic, ceramic, zooarchaeological, bioarchaeological, and spatial analyses. Because recent work has reviewed Bayesian applications in archaeology from the 1990s up to 2017, this work considers the relevant literature published since 2017.
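
To illustrate the chronological case in miniature, here is a hedged grid-approximation sketch (the measurement, its error, and the stratigraphic bounds are invented, and real applications would use a calibration curve via tools such as OxCal or BCal):

import numpy as np
from scipy import stats

grid = np.arange(2800, 3301)  # candidate true ages (years BP)

# Prior: stratigraphy (hypothetically) bounds the context to 2950-3200 BP.
prior = ((grid >= 2950) & (grid <= 3200)).astype(float)
prior /= prior.sum()

# Likelihood: a (hypothetical) lab measurement of 3000 +/- 40 BP.
likelihood = stats.norm.pdf(3000, loc=grid, scale=40)

posterior = prior * likelihood
posterior /= posterior.sum()

mean_age = (grid * posterior).sum()
cdf = posterior.cumsum()
lo, hi = grid[cdf >= 0.025][0], grid[cdf >= 0.975][0]
print(f"posterior mean {mean_age:.0f} BP, 95% interval [{lo}, {hi}] BP")

The stratigraphic constraint enters the analysis formally as the prior, something the NHST framework has no mechanism for.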


2020
Author(s): Thomas Edward Gladwin

This evolving document is my combination essay-tutorial-manifesto on foundational concepts of statistics for experimental research. It is primarily meant to help strengthen statistical thinking by using programming and simulated experiments, rather than formal mathematics, to make concepts concrete. It further aims to explain and justify the role of null hypothesis significance testing in experimental research. It’s not an introductory textbook, but rather something to read alongside or after undergraduate modules. It also provides an introduction to data analysis and simulation using Python and NumPy.
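
In that spirit, a small simulated experiment (my own sketch, not taken from the document; SciPy is used alongside NumPy for the test itself):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_experiments, n = 10_000, 30
p_values = np.empty(n_experiments)

# Simulate many two-group experiments in which the null is exactly true.
for i in range(n_experiments):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    p_values[i] = stats.ttest_ind(a, b).pvalue

print(f"fraction with p < .05: {(p_values < 0.05).mean():.3f}")  # about 0.05

Under a true null, p values are uniformly distributed, so about 5% of experiments land below .05, which is exactly what the significance threshold is supposed to guarantee.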


Author(s): Freddy A. Paniagua

Ferguson (2015) observed that the proportion of studies supporting the experimental hypothesis and rejecting the null hypothesis is very high. This paper argues that the reason for this scenario is that researchers in the behavioral sciences have learned that the null hypothesis can always be rejected if one knows the statistical tricks to reject it (e.g., the probability of rejecting the null hypothesis increases with p = 0.05 compared to p = 0.01). Examples of the advancement of science without the need to formulate the null hypothesis are also discussed, as well as alternatives to null hypothesis significance testing (NHST), such as effect sizes, and the importance of distinguishing the statistical significance of results from their practical significance.
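
One such trick can be made concrete with a hedged simulation of optional stopping (an illustration of my own choosing, not an example from the paper): re-test after every batch of new participants and stop as soon as p < .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments, rejections = 2_000, 0

for _ in range(n_experiments):
    a = list(rng.normal(0.0, 1.0, 10))  # both groups drawn from the same
    b = list(rng.normal(0.0, 1.0, 10))  # distribution: the null is true
    for _ in range(20):  # up to 20 rounds of adding data and re-testing
        if stats.ttest_ind(a, b).pvalue < 0.05:
            rejections += 1
            break
        a.extend(rng.normal(0.0, 1.0, 10))
        b.extend(rng.normal(0.0, 1.0, 10))

print(f"false positive rate with peeking: {rejections / n_experiments:.2f}")

Even with no true effect, the stop-when-significant rule rejects the null far more often than the nominal 5%.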


2021
Vol. 15
Author(s): Ruslan Masharipov, Irina Knyazeva, Yaroslav Nikolaev, Alexander Korotkov, Michael Didur, ...

Classical null hypothesis significance testing is limited to the rejection of the point-null hypothesis; it does not allow the interpretation of non-significant results. This leads to a bias against the null hypothesis. Herein, we discuss statistical approaches to ‘null effect’ assessment, focusing on Bayesian parameter inference (BPI). Although Bayesian methods have been theoretically elaborated and implemented in common neuroimaging software packages, they are not widely used for ‘null effect’ assessment. BPI considers the posterior probability of finding the effect within or outside the region of practical equivalence to the null value. It can be used to find both ‘activated/deactivated’ and ‘not activated’ voxels, or to indicate that the obtained data are insufficient, using a single decision rule. It also allows one to evaluate the data as the sample size increases and to stop the experiment once the obtained data are sufficient to make a confident inference. To demonstrate the advantages of using BPI for fMRI data group analysis, we compare it with classical null hypothesis significance testing on empirical data. We also use simulated data to show how BPI performs under different effect sizes, noise levels, noise distributions, and sample sizes. Finally, we consider the problem of defining the region of practical equivalence for BPI and discuss possible applications of BPI in fMRI studies. To facilitate ‘null effect’ assessment for fMRI practitioners, we provide a Statistical Parametric Mapping 12 based toolbox for Bayesian inference.
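
A toy Python sketch of a ROPE-style decision rule in the spirit of BPI (this is not the authors' SPM12 toolbox; the posterior summaries, ROPE half-width, and probability threshold are invented):

from scipy import stats

def rope_decision(mu, sd, rope=0.1, threshold=0.95):
    """Three-way decision for a Normal(mu, sd) posterior over an effect."""
    post = stats.norm(mu, sd)
    p_inside = post.cdf(rope) - post.cdf(-rope)  # P(effect inside the ROPE)
    if p_inside >= threshold:
        return "not activated (practically equivalent to null)"
    if 1 - p_inside >= threshold:
        return "activated/deactivated"
    return "data insufficient"

print(rope_decision(0.50, 0.10))  # posterior mass well outside the ROPE
print(rope_decision(0.01, 0.02))  # posterior mass concentrated inside
print(rope_decision(0.08, 0.10))  # too spread out to decide either way

A single rule yields three outcomes: practically null, activated/deactivated, or data insufficient, which gives non-significant-like results a positive interpretation instead of a mere failure to reject.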

