Perhaps Psychology’s Replication Crisis is a Theoretical Crisis that is Only Masquerading as a Statistical One

Author(s):  
Christopher D. Green

The “replication crisis” may well be the single most important challenge facing empirical psychological research today. It appears that highly trained scientists, often without understanding the potentially dire long-term implications, have been mishandling standard statistical procedures in the service of attaining statistical “significance.” Exacerbating the problem, most academic journals do not publish research that has not produced a “significant” result. This toxic combination has resulted in journals apparently publishing many Type I errors and declining to publish many true failures to reject H0. In response, there has been an urgent call from some psychologists that studies be registered in advance, so that their rationales, hypotheses, variables, sample sizes, and statistical analyses are fixed before data collection, leaving less room for post hoc manipulation. In this chapter, I argue that this “open science” approach, though laudable, will prove insufficient because the null hypothesis significance test (NHST) is a poor criterion for scientific truth, even when it is handled correctly. The root of the problem is that, whatever statistical problems psychology may have, the discipline never developed the requisite theoretical maturity. For decades it has been satisfied with testing weak theories that predict, at best, only the direction of an effect rather than its size. Indeed, uncritical acceptance of NHST by the discipline may have served to stunt psychology’s theoretical growth by giving researchers a way of building a successful career without having to develop models that make precise predictions. Improving our statistical “hygiene” would be a good thing, to be sure, but it is unlikely to resolve psychology’s growing credibility problem until our theoretical practices mature considerably.

1998 ◽  
Vol 21 (2) ◽  
pp. 207-208 ◽  
Author(s):  
Lester E. Krueger

Chow pays lip service (but not much more!) to Type I errors and thus opts for a hard (all-or-none) .05 level of significance (Superego of Neyman/Pearson theory; Gigerenzer 1993). Most working scientists disregard Type I errors and thus utilize a soft .05 level (Ego of Fisher; Gigerenzer 1993), which lets them report gradations of significance (e.g., p < .001).


2021 ◽  
Author(s):  
Benjamin Erb ◽  
Christoph Bösch ◽  
Cornelia Herbert ◽  
Frank Kargl ◽  
Christian Montag

The open science movement has taken up the important challenge to increase the transparency of statistical analyses, to facilitate the reproducibility of studies, and to enhance the reusability of data sets. To counter the replication crisis in the psychological and related sciences, the movement also urges researchers to publish their primary data sets alongside their articles. While such data publications represent a desirable improvement in terms of transparency and are also helpful for future research (e.g., subsequent meta-analyses or replication studies), we argue that this practice can worsen existing privacy issues that have so far received insufficient consideration in this context. Recent advances in de-anonymization and re-identification techniques render privacy protection increasingly difficult, as prevalent anonymization mechanisms for handling participants' data may no longer be adequate. When publicly shared primary data sets are exploited, data from multiple studies can be linked with contextual data and, eventually, participants can be de-anonymized. Such attacks can either re-identify specific individuals of interest or de-anonymize entire participant cohorts. The threat of de-anonymization attacks can undermine the perceived confidentiality of participants' responses and, ultimately, lower potential participants' trust in the research process due to privacy concerns.
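As an illustration of the linkage risk described above, the following minimal Python sketch uses entirely fabricated toy data; the data frames, column names, and quasi-identifiers are assumptions for illustration only, not an attack reported in the article. It shows how a shared primary data set can be joined with contextual data on a few shared attributes, tying sensitive responses back to named individuals.

```python
import pandas as pd

# Hypothetical "anonymized" study export that still carries quasi-identifiers.
study_data = pd.DataFrame({
    "age": [34, 41, 29],
    "gender": ["f", "m", "f"],
    "zip3": ["105", "112", "100"],
    "depression_score": [12, 7, 19],   # sensitive response
})

# Hypothetical public contextual data (e.g., a profile dump) with the same attributes.
public_context = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee"],
    "age": [34, 41, 29],
    "gender": ["f", "m", "f"],
    "zip3": ["105", "112", "100"],
})

# A simple linkage attack: join the two sources on the shared quasi-identifiers.
linked = study_data.merge(public_context, on=["age", "gender", "zip3"])
print(linked[["name", "depression_score"]])   # sensitive scores now tied to names
```

Even coarse attributes such as age, gender, and a partial postal code can be enough to single out participants once several sources are combined, which is why conventional anonymization of the shared data set alone may not suffice.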


2020 ◽  
Vol 16 (11) ◽  
pp. e1008286
Author(s):  
Howard Bowman ◽  
Joseph L. Brooks ◽  
Omid Hajilou ◽  
Alexia Zoumpoulaki ◽  
Vladimir Litvak

There has been considerable debate and concern as to whether there is a replication crisis in the scientific literature. A likely cause of poor replication is the multiple comparisons problem. An important way in which this problem can manifest in the M/EEG context is through post hoc tailoring of analysis windows (a.k.a. regions-of-interest, ROIs) to landmarks in the collected data. Post hoc tailoring of ROIs is used because it allows researchers to adapt to inter-experiment variability and discover novel differences that fall outside of windows defined by prior precedent, thereby reducing Type II errors. However, this approach can dramatically inflate Type I error rates. One way to avoid this problem is to tailor windows according to a contrast that is orthogonal (strictly parametrically orthogonal) to the contrast being tested. A key approach of this kind is to identify windows on a fully flattened average. On the basis of simulations, this approach has been argued to be safe for post hoc tailoring of analysis windows under many conditions. Here, we present further simulations and mathematical proofs to show exactly why the Fully Flattened Average approach is unbiased, providing a formal grounding to the approach, clarifying the limits of its applicability and resolving published misconceptions about the method. We also provide a statistical power analysis, which shows that, in specific contexts, the fully flattened average approach provides higher statistical power than Fieldtrip cluster inference. This suggests that the Fully Flattened Average approach will enable researchers to identify more effects from their data without incurring an inflation of the false positive rate.
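A minimal sketch of the idea, assuming per-subject ERP matrices and a simple peak-based window rule (the function names, data shapes, and window rule are illustrative assumptions, not the authors' implementation): the analysis window is selected from the average collapsed across the two conditions, which is orthogonal to the condition contrast that is subsequently tested inside that window.

```python
import numpy as np
from scipy import stats

def flattened_average_roi(cond_a, cond_b, half_width=5):
    """Choose an analysis window from the average collapsed across conditions.

    cond_a, cond_b : arrays of shape (n_subjects, n_timepoints) of ERP amplitudes.
    The window centre is the peak of the grand average over *both* conditions,
    which is orthogonal to the A-minus-B contrast tested afterwards.
    """
    grand_avg = (cond_a + cond_b).mean(axis=0) / 2.0    # flatten: collapse conditions
    centre = int(np.argmax(np.abs(grand_avg)))          # landmark in the flattened data
    lo = max(0, centre - half_width)
    hi = min(cond_a.shape[1], centre + half_width + 1)
    return lo, hi

def test_contrast(cond_a, cond_b, half_width=5):
    """Test the condition difference inside the orthogonally selected window."""
    lo, hi = flattened_average_roi(cond_a, cond_b, half_width)
    a = cond_a[:, lo:hi].mean(axis=1)                   # per-subject window means
    b = cond_b[:, lo:hi].mean(axis=1)
    return stats.ttest_rel(a, b)                        # paired test on the contrast

# Pure-noise example: the selection step should not inflate the false positive rate.
rng = np.random.default_rng(0)
noise_a = rng.normal(size=(20, 200))
noise_b = rng.normal(size=(20, 200))
print(test_contrast(noise_a, noise_b))
```

The key design point is that the window is never chosen from the difference being tested; had the window been centred on the peak of the A-minus-B difference instead, the same paired test would be biased toward significance.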


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Riko Kelter

Abstract Objectives The data presented herein represent the simulated datasets of a recently conducted larger study which investigated the behaviour of Bayesian indices of significance and effect size as alternatives to traditional p-values. The study considered the setting of Student’s and Welch’s two-sample t-tests, which are often used in medical research. It investigated the influence of sample size, noise, and the selected prior hyperparameters, as well as the sensitivity to Type I errors. The posterior indices used included the Bayes factor, the region of practical equivalence (ROPE), the probability of direction, the MAP-based p-value, and the e-value in the Full Bayesian Significance Test. The simulation study was conducted in the statistical programming language R. Data description The R script files for simulating the datasets used in the study are presented in this article. These script files can both simulate the raw datasets and run the analyses. As researchers may be faced with different effect sizes, noise levels, or priors in their domain than the ones studied in the original paper, the scripts extend the original results by allowing researchers to recreate all analyses of interest in different contexts. Therefore, they should be relevant to other researchers.
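The original scripts are written in R; as a rough illustration of two of the indices listed above, the following Python sketch computes the probability of direction and the proportion of the posterior inside a region of practical equivalence from posterior draws of a mean difference. The flat-prior normal approximation to the posterior and all numeric values are assumptions made for brevity, not the article's models or data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated two-sample data (illustrative only, not the article's R scripts).
group1 = rng.normal(loc=0.3, scale=1.0, size=50)
group2 = rng.normal(loc=0.0, scale=1.0, size=50)

# Crude posterior for the mean difference: normal approximation centred on the
# observed difference with the Welch standard error (a flat-prior shortcut).
diff = group1.mean() - group2.mean()
se = np.sqrt(group1.var(ddof=1) / len(group1) + group2.var(ddof=1) / len(group2))
posterior = rng.normal(loc=diff, scale=se, size=100_000)

# Probability of direction: share of the posterior on its dominant side of zero.
pd = max((posterior > 0).mean(), (posterior < 0).mean())

# ROPE: share of the posterior inside a region of practical equivalence around zero.
rope_low, rope_high = -0.1, 0.1
rope = ((posterior > rope_low) & (posterior < rope_high)).mean()

print(f"Probability of direction: {pd:.3f}")
print(f"Proportion in ROPE [{rope_low}, {rope_high}]: {rope:.3f}")
```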


2019 ◽  
Author(s):  
Matthew C. Makel ◽  
Kendal N. Smith ◽  
Matthew McBee ◽  
Scott J. Peters ◽  
Erin Miller

Concerns about the replication crisis and false findings have spread through a number of fields, including educational and psychological research. In some pockets, education has begun to adopt open science reforms that have proven useful in other fields. These include preregistration, open materials and data, and registered reports. These reforms are necessary and offer education research a path to increased credibility and social impact. But they all operate at the level of individual researchers’ behavior. In this paper, we discuss models of large-scale collaborative research practices and how they can be applied to educational research. The combination of large-scale collaboration with open and transparent research practices offers education researchers an exciting new method for falsifying theories, verifying what we know, resolving disagreements, and exploring new questions.


1975 ◽  
Vol 19 (2) ◽  
pp. 209-212
Author(s):  
Lyle Hamm

The purpose of this project was to determine what effects, if any, lighting and density of product had upon the inspection process. There were three levels of lighting and three different groupings, i.e., one, two, or three resistors. Four subjects were used, each completing the test twice. The inspection procedure consisted of identifying bad resistors on a moving belt conveyor. Subjects were required only to call out the number of bad resistors in each group; it was not necessary for the subject to sort the good from the bad. The color coding of the resistors provided the basis for the discrimination. The subjects' performances were scored in three ways: Type I errors, in which a bad resistor was accepted; Type II errors, in which a good resistor was rejected; and the combination, or sum, of the two. The results of the experiment showed a definite difference between the effect of the lowest light level and that of the two higher levels. The number of resistors in a group showed consistent statistical significance only between the first and last levels, i.e., those containing one and three resistors, respectively.


Methodology ◽  
2015 ◽  
Vol 11 (3) ◽  
pp. 110-115 ◽  
Author(s):  
Rand R. Wilcox ◽  
Jinxia Ma

Abstract. The paper compares methods that allow both within-group and between-group heteroscedasticity when performing all pairwise comparisons of the least squares lines associated with J independent groups. The methods are based on a simple extension of results derived by Johansen (1980) and Welch (1938) in conjunction with the HC3 and HC4 estimators. The probability of one or more Type I errors is controlled using the improvement on the Bonferroni method derived by Hochberg (1988). Results are illustrated using data from the Well Elderly 2 study, which motivated this paper.
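Hochberg's (1988) improvement on the Bonferroni method mentioned above is a step-up procedure applied to the family of pairwise p-values. A minimal Python sketch follows; the example p-values are hypothetical, and the surrounding slope comparisons from the paper are not reproduced here.

```python
import numpy as np

def hochberg(pvalues, alpha=0.05):
    """Hochberg (1988) step-up procedure.

    Order the p-values p_(1) <= ... <= p_(m) and find the largest k with
    p_(k) <= alpha / (m - k + 1); reject the hypotheses with the k smallest
    p-values. This controls the probability of one or more Type I errors.
    Returns a boolean rejection array in the original order.
    """
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices of p-values in ascending order
    reject = np.zeros(m, dtype=bool)
    for k in range(m, 0, -1):                  # step up from the largest p-value
        if p[order[k - 1]] <= alpha / (m - k + 1):
            reject[order[:k]] = True           # reject this and all smaller p-values
            break
    return reject

# Hypothetical p-values from all pairwise comparisons of the group regression lines.
pvals = [0.001, 0.012, 0.020, 0.045, 0.300]
print(hochberg(pvals))   # -> [ True  True False False False] at alpha = .05
```

Because the procedure steps up rather than applying a single Bonferroni cut-off, it is uniformly less conservative than Bonferroni while still controlling the familywise Type I error rate.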


2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
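The following toy simulation is a hedged sketch with illustrative parameters, not the authors' simulation design. It shows the underlying problem that the evaluated detection tools target: when significant results in the expected direction are preferentially published, a naive meta-analytic average overestimates the true effect, here for a true effect of zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def simulate_literature(true_d=0.0, n_studies=200, n_per_group=30, p_publish_null=0.1):
    """Simulate two-group studies and apply a simple publication filter:
    significant results in the expected direction are always 'published',
    other results only with probability p_publish_null (illustrative values)."""
    effects, weights = [], []
    for _ in range(n_studies):
        a = rng.normal(true_d, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        if (p < 0.05 and d > 0) or rng.random() < p_publish_null:
            effects.append(d)
            weights.append(n_per_group)        # crude precision weight
    return np.average(effects, weights=weights), len(effects)

est, k = simulate_literature(true_d=0.0)
print(f"True effect: 0.00; naive estimate from {k} 'published' studies: {est:.2f}")
```

Varying true_d, the between-study heterogeneity, and p_publish_null in such a simulation is, in spirit, how the Type I error rates and power of the bias-detection methods were assessed, although the article's actual conditions and methods are more extensive.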


2019 ◽  
Author(s):  
Jennifer L Tackett ◽  
Josh Miller

As psychological research comes under increasing fire for the crisis of replicability, attention has turned to methods and practices that facilitate (or hinder) a more replicable and veridical body of empirical evidence. These trends have focused on “open science” initiatives, including an emphasis on replication, transparency, and data sharing. Despite this broader movement in psychology, clinical psychologists and psychiatrists have been largely absent from the broader conversation on documenting the extent of existing problems as well as generating solutions to problematic methods and practices in our area (Tackett et al., 2017). The goal of the current special section was to bring together psychopathology researchers to explore these and related areas as they pertain to the types of research conducted in clinical psychology and allied disciplines.

