Scientific Self-Correction: The Bayesian Way

2019
Author(s): Felipe Romero, Jan Sprenger

The enduring replication crisis in many scientific disciplines casts doubt on the ability of science to self-correct its findings and to produce reliable knowledge. Amongst a variety of possible methodological, social, and statistical reforms to address the crisis, we focus on replacing null hypothesis significance testing (NHST) with Bayesian inference. On the basis of a simulation study of meta-analytic aggregation of effect sizes, we study the relative advantages of this Bayesian reform and its interaction with widespread limitations in experimental research. Moving to Bayesian statistics will not solve the replication crisis single-handedly, but it would eliminate important sources of effect size overestimation under the conditions we study.
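
A minimal Python sketch of the overestimation mechanism at issue (an illustration of the general idea, not the authors' simulation code; the true effect, per-group sample size, and study count are invented):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_studies = 0.2, 20, 10_000
d_all, d_sig = [], []

for _ in range(n_studies):
    treat = rng.normal(true_d, 1.0, n)
    ctrl = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(treat, ctrl).pvalue
    # Standardized effect size estimate (Cohen's d) for this study.
    d = (treat.mean() - ctrl.mean()) / np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    d_all.append(d)
    if p < 0.05:        # significance filter: only "positive" studies kept
        d_sig.append(d)

print(f"true effect:                    {true_d}")
print(f"mean estimate, all studies:     {np.mean(d_all):.2f}")
print(f"mean estimate, significant only: {np.mean(d_sig):.2f}")

With these settings the aggregate over all studies recovers the true effect, while the aggregate over significant studies alone overshoots it severalfold, which is the kind of effect size overestimation a reform that reports all estimates, rather than only significant ones, can eliminate.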

2019
Author(s): Jan Sprenger

The replication crisis poses an enormous challenge to the epistemic authority of science and to the logic of statistical inference in particular. Two prominent features of Null Hypothesis Significance Testing (NHST) arguably contribute to the crisis: the lack of guidance for interpreting non-significant results and the impossibility of quantifying support for the null hypothesis. In this paper, I argue that popular alternatives to NHST, such as confidence intervals and Bayesian inference, also fail to provide a satisfactory logic for evaluating hypothesis tests. As an alternative, I motivate and explicate the concept of corroboration of the null hypothesis. Finally, I show how degrees of corroboration provide an interpretation of non-significant results, combat publication bias, and mitigate the replication crisis.


2015
Vol. 37 (4), pp. 449-461
Author(s): Andreas Ivarsson, Mark B. Andersen, Andreas Stenling, Urban Johnson, Magnus Lindwall

Null hypothesis significance testing (NHST) is like an immortal horse that some researchers have been trying to beat to death for over 50 years but without any success. In this article, we discuss the flaws in NHST, the historical background in relation to both Fisher’s and Neyman and Pearson’s statistical ideas, the common misunderstandings of what p < .05 actually means, and the 2010 APA publication manual’s clear, but most often ignored, instructions to report effect sizes and to interpret what they mean in the real world. In addition, we discuss how Bayesian statistics can be used to overcome some of the problems with NHST. We then analyze quantitative articles published over the past three years (2012–2014) in two top-rated sport and exercise psychology journals to determine whether we have learned what we should have learned decades ago about our use and meaningful interpretation of statistics.
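
To make the "what p < .05 actually means" point concrete, here is a hedged Python sketch (the effect size and sample sizes are invented) showing that p depends heavily on sample size while a standardized effect size such as Cohen's d does not:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3  # one fixed, modest true effect (Cohen's d)

for n in (20, 200, 2000):  # per-group sample sizes
    x = rng.normal(true_d, 1.0, n)
    y = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(x, y).pvalue
    d = (x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    print(f"n = {n:4d}   p = {p:.4f}   d = {d:.2f}")

The same modest effect is "non-significant" at small n and highly "significant" at large n, which is why the APA manual's instruction to report and interpret effect sizes matters.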


2018
Vol. 47 (1), pp. 435-453
Author(s): Erik Otárola-Castillo, Melissa G. Torquato

Null hypothesis significance testing (NHST) is the most common statistical framework used by scientists, including archaeologists. Owing to increasing dissatisfaction, however, Bayesian inference has become an alternative to these methods. In this article, we review the application of Bayesian statistics to archaeology. We begin with a simple example to demonstrate the differences in applying NHST and Bayesian inference to an archaeological problem. Next, we formally define NHST and Bayesian inference, provide a brief historical overview of their development, and discuss the advantages and limitations of each method. A review of Bayesian inference and archaeology follows, highlighting the applications of Bayesian methods to chronological, bioarchaeological, zooarchaeological, ceramic, lithic, and spatial analyses. We close by considering the future applications of Bayesian statistics to archaeological research.
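
As a hedged stand-in for the kind of simple example the authors describe (the sherd counts and reference proportion below are hypothetical), the two frameworks can be compared on one archaeological question, the proportion of decorated sherds at a site:

from scipy import stats

k, n = 62, 100  # hypothetical: 62 decorated sherds in a sample of 100
p0 = 0.5        # reference proportion under the null hypothesis

# NHST: exact two-sided binomial test of H0: p = 0.5.
p_value = stats.binomtest(k, n, p0).pvalue

# Bayesian: a flat Beta(1, 1) prior updates to Beta(1 + k, 1 + n - k).
posterior = stats.beta(1 + k, 1 + n - k)
prob_gt_half = 1 - posterior.cdf(p0)    # posterior P(p > 0.5)
lo, hi = posterior.ppf([0.025, 0.975])  # 95% credible interval

print(f"NHST:  p = {p_value:.3f}")
print(f"Bayes: P(p > 0.5 | data) = {prob_gt_half:.3f}, 95% CrI [{lo:.2f}, {hi:.2f}]")

NHST returns a p value for the null; the Bayesian analysis returns a full posterior, from which one can read off the probability of the hypothesis of interest and a credible interval.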


2009
Vol. 217 (1), pp. 15-26
Author(s): Geoff Cumming, Fiona Fidler

Most questions across science call for quantitative answers: ideally, a single best estimate plus information about the precision of that estimate. A confidence interval (CI) expresses both efficiently. Early experimental psychologists sought quantitative answers, but for the last half century psychology has been dominated by the nonquantitative, dichotomous thinking of null hypothesis significance testing (NHST). The authors argue that psychology should rejoin mainstream science by asking better questions – those that demand quantitative answers – and using CIs to answer them. They explain CIs and a range of ways to think about them and use them to interpret data, especially by considering CIs as prediction intervals, which provide information about replication. They explain how to calculate CIs on means, proportions, correlations, and standardized effect sizes, and illustrate symmetric and asymmetric CIs. They also argue that the information provided by CIs is more useful than that provided by p values, or by values of Killeen’s p_rep, the probability of replication.
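
A short Python sketch of two of the calculations discussed (the sample and the counts are invented): a symmetric t-based CI on a mean, and an asymmetric Wilson score CI on a proportion.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10.0, 2.0, 25)  # hypothetical sample of 25 measurements

# Symmetric 95% CI on a mean: estimate +/- t_crit * standard error.
m, se = x.mean(), stats.sem(x)
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
print(f"mean: {m:.2f}, 95% CI [{m - t_crit * se:.2f}, {m + t_crit * se:.2f}]")

# Asymmetric 95% Wilson score CI on a proportion (k successes in n trials).
k, n = 4, 20
z = stats.norm.ppf(0.975)
centre = (k + z**2 / 2) / (n + z**2)
half = z * np.sqrt(k * (n - k) / n + z**2 / 4) / (n + z**2)
print(f"proportion: {k / n:.2f}, 95% CI [{centre - half:.3f}, {centre + half:.3f}]")

The proportion interval is asymmetric around the point estimate because the sampling distribution is skewed near the 0 and 1 boundaries.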


2021
Author(s): Mark Rubin

Scientists often adjust their significance threshold (alpha level) during null hypothesis significance testing in order to take into account multiple testing and multiple comparisons. This alpha adjustment has become particularly relevant in the context of the replication crisis in science. The present article considers the conditions in which this alpha adjustment is appropriate and the conditions in which it is inappropriate. A distinction is drawn between three types of multiple testing: disjunction testing, conjunction testing, and individual testing. It is argued that alpha adjustment is only appropriate in the case of disjunction testing, in which at least one test result must be significant in order to reject the associated joint null hypothesis. Alpha adjustment is inappropriate in the case of conjunction testing, in which all relevant results must be significant in order to reject the joint null hypothesis. Alpha adjustment is also inappropriate in the case of individual testing, in which each individual result must be significant in order to reject each associated individual null hypothesis. The conditions under which each of these three types of multiple testing is warranted are examined. It is concluded that researchers should not automatically (mindlessly) assume that alpha adjustment is necessary during multiple testing. Illustrations are provided in relation to joint studywise hypotheses and joint multiway ANOVAwise hypotheses.
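
A minimal sketch of the three decision rules, using Bonferroni adjustment as one common adjustment method (the p values below are invented):

# The p values are invented; the adjustment shown is Bonferroni.
p_values = [0.030, 0.040, 0.020]
alpha, m = 0.05, len(p_values)

# Disjunction testing: reject the joint null if AT LEAST ONE result is
# significant, so alpha is adjusted to control the familywise error rate.
disjunction_reject = any(p < alpha / m for p in p_values)

# Conjunction testing: reject the joint null only if ALL results are
# significant; on the article's account, no adjustment is appropriate.
conjunction_reject = all(p < alpha for p in p_values)

# Individual testing: each null hypothesis is assessed on its own, at the
# unadjusted alpha.
individual_rejects = [p < alpha for p in p_values]

print(disjunction_reject)   # False: no p value clears 0.05 / 3
print(conjunction_reject)   # True: every p value clears 0.05
print(individual_rejects)   # [True, True, True]

Here the disjunction-wise joint null survives the adjusted threshold, while the conjunction and individual tests, which need no adjustment on the article's account, all reject.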


2021
Author(s): Erik Otárola-Castillo, Melissa G. Torquato, Caitlin E. Buck

Archaeologists often use data and quantitative statistical methods to evaluate their ideas. Although there are various statistical frameworks for decision-making in archaeology and science in general, in this chapter, we provide a simple explanation of Bayesian statistics. To contextualize the Bayesian statistical framework, we briefly compare it to the more widespread null hypothesis significance testing (NHST) approach. We also provide a simple example to illustrate how archaeologists use data and the Bayesian framework to compare hypotheses and evaluate their uncertainty. We then review how archaeologists have applied Bayesian statistics to solve research problems related to radiocarbon dating and chronology, lithic, ceramic, zooarchaeological, bioarchaeological, and spatial analyses. Because recent work has reviewed Bayesian applications in archaeology from the 1990s up to 2017, this work considers the relevant literature published since 2017.
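
To illustrate the chronological case in miniature, here is a hedged grid-approximation sketch (the measurement, its error, and the stratigraphic bounds are invented, and real applications would use a calibration curve via tools such as OxCal or BCal):

import numpy as np
from scipy import stats

grid = np.arange(2800, 3301)  # candidate true ages (years BP)

# Prior: stratigraphy (hypothetically) bounds the context to 2950-3200 BP.
prior = ((grid >= 2950) & (grid <= 3200)).astype(float)
prior /= prior.sum()

# Likelihood: a (hypothetical) lab measurement of 3000 +/- 40 BP.
likelihood = stats.norm.pdf(3000, loc=grid, scale=40)

posterior = prior * likelihood
posterior /= posterior.sum()

mean_age = (grid * posterior).sum()
cdf = posterior.cumsum()
lo, hi = grid[cdf >= 0.025][0], grid[cdf >= 0.975][0]
print(f"posterior mean {mean_age:.0f} BP, 95% interval [{lo}, {hi}] BP")

The stratigraphic constraint enters the analysis formally as the prior, something the NHST framework has no mechanism for.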


2020
Author(s): Thomas Edward Gladwin

This evolving document is my combination essay-tutorial-manifesto on foundational concepts of statistics for experimental research. It is primarily meant to help strengthen statistical thinking by using programming and simulated experiments, rather than formal mathematics, to make concepts concrete. It further aims to explain and justify the role of null hypothesis significance testing in experimental research. It’s not an introductory textbook, but rather something to read alongside or after undergraduate modules. It also provides an introduction to data analysis and simulation using Python and NumPy.
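
In that spirit, a small simulated experiment (my own sketch, not taken from the document; SciPy is used alongside NumPy for the test itself):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_experiments, n = 10_000, 30
p_values = np.empty(n_experiments)

# Simulate many two-group experiments in which the null is exactly true.
for i in range(n_experiments):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    p_values[i] = stats.ttest_ind(a, b).pvalue

print(f"fraction with p < .05: {(p_values < 0.05).mean():.3f}")  # about 0.05

Under a true null, p values are uniformly distributed, so about 5% of experiments land below .05, which is exactly what the significance threshold is supposed to guarantee.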


Author(s): Freddy A. Paniagua

Ferguson (2015) observed that the proportion of studies supporting the experimental hypothesis and rejecting the null hypothesis is very high. This paper argues that the reason for this scenario is that researchers in the behavioral sciences have learned that the null hypothesis can always be rejected if one knows the statistical tricks to reject it (e.g., the probability of rejecting the null hypothesis increases with p = 0.05 compared to p = 0.01). Examples of the advancement of science without the need to formulate the null hypothesis are also discussed, as well as alternatives to null hypothesis significance testing (NHST), such as effect sizes, and the importance of distinguishing the statistical significance of results from their practical significance.
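
One such trick can be made concrete with a hedged simulation of optional stopping (an illustration of my own choosing, not an example from the paper): re-test after every batch of new participants and stop as soon as p < .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments, rejections = 2_000, 0

for _ in range(n_experiments):
    a = list(rng.normal(0.0, 1.0, 10))  # both groups drawn from the same
    b = list(rng.normal(0.0, 1.0, 10))  # distribution: the null is true
    for _ in range(20):  # up to 20 rounds of adding data and re-testing
        if stats.ttest_ind(a, b).pvalue < 0.05:
            rejections += 1
            break
        a.extend(rng.normal(0.0, 1.0, 10))
        b.extend(rng.normal(0.0, 1.0, 10))

print(f"false positive rate with peeking: {rejections / n_experiments:.2f}")

Even with no true effect, the stop-when-significant rule rejects the null far more often than the nominal 5%.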


2021
Vol. 15
Author(s): Ruslan Masharipov, Irina Knyazeva, Yaroslav Nikolaev, Alexander Korotkov, Michael Didur, ...

Classical null hypothesis significance testing is limited to the rejection of the point-null hypothesis; it does not allow the interpretation of non-significant results. This leads to a bias against the null hypothesis. Herein, we discuss statistical approaches to ‘null effect’ assessment, focusing on Bayesian parameter inference (BPI). Although Bayesian methods have been theoretically elaborated and implemented in common neuroimaging software packages, they are not widely used for ‘null effect’ assessment. BPI considers the posterior probability of finding the effect within or outside the region of practical equivalence to the null value. It can be used to find both ‘activated/deactivated’ and ‘not activated’ voxels, or to indicate that the obtained data are insufficient, using a single decision rule. It also allows one to evaluate the data as the sample size increases and to stop the experiment once the obtained data are sufficient to make a confident inference. To demonstrate the advantages of using BPI for fMRI data group analysis, we compare it with classical null hypothesis significance testing on empirical data. We also use simulated data to show how BPI performs under different effect sizes, noise levels, noise distributions, and sample sizes. Finally, we consider the problem of defining the region of practical equivalence for BPI and discuss possible applications of BPI in fMRI studies. To facilitate ‘null effect’ assessment for fMRI practitioners, we provide a Statistical Parametric Mapping 12 based toolbox for Bayesian inference.
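
A toy Python sketch of a ROPE-style decision rule in the spirit of BPI (this is not the authors' SPM12 toolbox; the posterior summaries, ROPE half-width, and probability threshold are invented):

from scipy import stats

def rope_decision(mu, sd, rope=0.1, threshold=0.95):
    """Three-way decision for a Normal(mu, sd) posterior over an effect."""
    post = stats.norm(mu, sd)
    p_inside = post.cdf(rope) - post.cdf(-rope)  # P(effect inside the ROPE)
    if p_inside >= threshold:
        return "not activated (practically equivalent to null)"
    if 1 - p_inside >= threshold:
        return "activated/deactivated"
    return "data insufficient"

print(rope_decision(0.50, 0.10))  # posterior mass well outside the ROPE
print(rope_decision(0.01, 0.02))  # posterior mass concentrated inside
print(rope_decision(0.08, 0.10))  # too spread out to decide either way

A single rule yields three outcomes: practically null, activated/deactivated, or data insufficient, which gives non-significant-like results a positive interpretation instead of a mere failure to reject.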

