Publication bias and statistical power in gerontological psychology

2018 ◽  
Author(s):  
Christopher Brydges

Objectives: Research has found evidence of publication bias, questionable research practices (QRPs), and low statistical power in published psychological journal articles. Isaacowitz’s (2018) editorial in the Journals of Gerontology Series B, Psychological Sciences called for investigation of these issues in gerontological research. The current study presents meta-research findings to explore whether there is evidence of these practices in published gerontological research. Method: 14,481 test statistics and p values were extracted from articles published in eight top gerontological psychology journals since 2000. Frequentist and Bayesian caliper tests were used to test for publication bias and QRPs (specifically, p-hacking and incorrect rounding of p values). A z-curve analysis was used to estimate average statistical power across studies. Results: Strong evidence of publication bias was observed, and average statistical power was approximately .70, below the recommended .80 level. Evidence of p-hacking was mixed, and evidence of incorrect rounding of p values was inconclusive. Discussion: Gerontological research is not immune to publication bias, QRPs, and low statistical power. Researchers, journals, institutions, and funding bodies are encouraged to adopt open and transparent research practices and to use Registered Reports as an alternative article type, in order to minimize publication bias and QRPs and to increase statistical power.
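For readers unfamiliar with the caliper test, the sketch below illustrates its basic logic in Python: count test statistics falling just below versus just above the significance threshold and test whether the split departs from 50/50, since a surplus just above the threshold is consistent with selective publication or p-hacking. This is only an illustration under assumed parameter choices (caliper width, simulated z values); it is not the authors' analysis code, and the data are not the 14,481 extracted statistics.

```python
# Minimal sketch of a frequentist caliper test (illustrative, not the study's code):
# compare how many z statistics fall just below vs. just above the 1.96 threshold.
# Under no publication bias, a narrow caliper should be roughly evenly populated.
import numpy as np
from scipy import stats

def caliper_test(z_values, critical=1.96, caliper=0.10):
    """Binomial test on counts just above vs. just below `critical`."""
    z = np.abs(np.asarray(z_values))
    below = np.sum((z >= critical - caliper) & (z < critical))
    above = np.sum((z >= critical) & (z < critical + caliper))
    n = int(below + above)
    if n == 0:
        raise ValueError("No z values fall inside the caliper window.")
    # Under H0 (no bias), a z value inside the window is equally likely
    # to land on either side of the threshold.
    result = stats.binomtest(int(above), n=n, p=0.5, alternative="greater")
    return int(above), int(below), result.pvalue

# Illustrative data only: hypothetical z statistics with simulated selection
# against nonsignificant results, not values extracted from the journals.
rng = np.random.default_rng(1)
z_all = rng.normal(loc=1.8, scale=0.5, size=5000)
keep = (np.abs(z_all) >= 1.96) | (rng.random(5000) < 0.3)  # suppress many nonsignificant z
print(caliper_test(z_all[keep]))
```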

2019 ◽  
Vol 6 (12) ◽  
pp. 190738 ◽  
Author(s):  
Jerome Olsen ◽  
Johanna Mosen ◽  
Martin Voracek ◽  
Erich Kirchler

The replicability of research findings has recently been disputed across multiple scientific disciplines. In constructive reaction, the research culture in psychology is facing fundamental changes, but investigations of research practices that led to these improvements have almost exclusively focused on academic researchers. By contrast, we investigated the statistical reporting quality and selected indicators of questionable research practices (QRPs) in psychology students' master's theses. In a total of 250 theses, we investigated utilization and magnitude of standardized effect sizes, along with statistical power, the consistency and completeness of reported results, and possible indications of p-hacking and further testing. Effect sizes were reported for 36% of focal tests (median r = 0.19), and only a single formal power analysis was reported for sample size determination (median observed power 1 − β = 0.67). Statcheck revealed inconsistent p-values in 18% of cases, while 2% led to decision errors. There were no clear indications of p-hacking or further testing. We discuss our findings in the light of promoting open science standards in teaching and student supervision.
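A simplified illustration of the kind of consistency check that statcheck automates is sketched below: recompute the p value from a reported t statistic and its degrees of freedom, then compare it with the reported p. The tolerance and the example values are assumptions for illustration only; statcheck itself applies more elaborate, rounding-aware rules.

```python
# Simplified, statcheck-style consistency check (illustrative only): recompute the
# two-tailed p value from a reported t statistic and degrees of freedom, flag
# reports whose recomputed p disagrees with the reported p, and note whether the
# inconsistency changes the significance decision at alpha = .05.
from scipy import stats

def check_t_report(t, df, reported_p, alpha=0.05, tol=0.01):
    recomputed_p = 2 * stats.t.sf(abs(t), df)          # two-tailed p from t(df)
    inconsistent = abs(recomputed_p - reported_p) > tol
    decision_error = inconsistent and (
        (recomputed_p <= alpha) != (reported_p <= alpha)
    )
    return round(recomputed_p, 4), inconsistent, decision_error

# Hypothetical reported results, e.g. "t(28) = 2.10, p = .04"
print(check_t_report(t=2.10, df=28, reported_p=0.04))   # consistent
print(check_t_report(t=1.50, df=28, reported_p=0.03))   # inconsistent, decision error
```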


2021 ◽  
Author(s):  
Kleber Neves ◽  
Pedro Batista Tan ◽  
Olavo Bohrer Amaral

Diagnostic screening models for the interpretation of null hypothesis significance test (NHST) results have been influential in highlighting the effect of selective publication on the reproducibility of the published literature, leading to John Ioannidis’ much-cited claim that most published research findings are false. These models, however, are typically based on the assumption that hypotheses are dichotomously true or false, without considering that effect sizes differ across hypotheses. To address this limitation, we develop a simulation model that represents effect sizes explicitly, using different continuous distributions, while retaining other aspects of previous models such as publication bias and the pursuit of statistical significance. Our results show that the combination of selective publication, bias, low statistical power and unlikely hypotheses consistently leads to high proportions of false positives, irrespective of the effect size distribution assumed. Using continuous effect sizes also allows us to evaluate the degree of effect size overestimation and the prevalence of estimates with the wrong sign in the literature, showing that the same factors that drive false-positive results also lead to errors in estimating effect size direction and magnitude. Nevertheless, the relative influence of these factors on different metrics varies depending on the distribution assumed for effect sizes. The model is made available as an R ShinyApp interface, allowing one to explore features of the literature under various scenarios.
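The minimal Python sketch below conveys the general idea of such a model under assumptions chosen purely for illustration (the proportion of true hypotheses, the effect size distribution, and the publication probabilities are mine); it is not the authors' simulation code or their R ShinyApp.

```python
# Illustrative sketch: simulate a literature in which some hypotheses are null and
# the rest have continuous effect sizes, run two-group t-tests, "publish" mainly
# the significant results, and summarize false positives, effect size inflation,
# and wrong-sign estimates among published significant findings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_studies, n_per_group = 5000, 20
p_true = 0.3                 # assumed proportion of non-null hypotheses
pub_prob_nonsig = 0.1        # assumed chance a nonsignificant study is published anyway

is_true = rng.random(n_studies) < p_true
true_d = np.where(is_true, rng.normal(0.4, 0.2, n_studies), 0.0)  # continuous effects

records = []
for d in true_d:
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(d, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    observed_d = treated.mean() - control.mean()   # population SDs are 1, so ~ Cohen's d
    published = (p < 0.05) or (rng.random() < pub_prob_nonsig)
    records.append((d, observed_d, p, published))

d_true, d_obs, pvals, pub = map(np.array, zip(*records))
sig_pub = pub & (pvals < 0.05)
nonnull = sig_pub & (d_true != 0)
print("False positives among published significant results:",
      round(float(np.mean(d_true[sig_pub] == 0)), 2))
print("Effect size inflation (mean observed / mean true, non-null only):",
      round(float(d_obs[nonnull].mean() / d_true[nonnull].mean()), 2))
print("Wrong-sign estimates among published significant non-null effects:",
      round(float(np.mean(np.sign(d_obs[nonnull]) != np.sign(d_true[nonnull]))), 2))
```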


Author(s):  
Holly L. Storkel ◽  
Frederick J. Gallun

Purpose: This editorial introduces the new registered reports article type for the Journal of Speech, Language, and Hearing Research. The goal of registered reports is to create a structural solution to address issues of publication bias toward results that are unexpected and sensational, questionable research practices that are used to produce novel results, and a peer-review process that occurs at the end of the research process when changes in fundamental design are difficult or impossible to implement. Conclusion: Registered reports can be a positive addition to scientific publications by addressing issues of publication bias, questionable research practices, and the late influence of peer review. This article type does so by requiring reviewers and authors to agree in advance that the experimental design is solid, the questions are interesting, and the results will be publishable regardless of the outcome. This procedure ensures that replication studies and null results make it into the published literature and that authors are not incentivized to alter their analyses based on the results that they obtain. Registered reports represent an ongoing commitment to research integrity and finding structural solutions to structural problems inherent in a research and publishing landscape in which publications are such a high-stakes aspect of individual and institutional success.


2021 ◽  
pp. 39-55
Author(s):  
R. Barker Bausell

This chapter explores three empirical concepts (the p-value, the effect size, and statistical power) integral to the avoidance of false-positive scientific findings. Their relationship to reproducibility is explained in a nontechnical manner, without formulas or statistical jargon, with p-values and statistical power presented in terms of probabilities from zero to 1.0; the values of most interest to scientists are 0.05 (synonymous with a positive, hence publishable, result) and 0.80 (the most commonly recommended probability that a positive result will be obtained if the hypothesis that generated it is correct and the study is properly designed and conducted). Unfortunately, many scientists circumvent both by artifactually inflating the 0.05 criterion, overstating the available statistical power, and engaging in a number of other questionable research practices. These issues are discussed via statistical models from the genetic and psychological fields and then extended to a number of different p-values, statistical power levels, effect sizes, and prevalences of “true” effects expected to exist in the research literature. Among the basic conclusions of these modeling efforts is that employing more stringent p-values and larger sample sizes constitutes the most effective statistical approach for increasing the reproducibility of published results in all empirically based scientific literatures. This chapter thus lays the necessary foundation for understanding and appreciating the effects of appropriate p-values, sufficient statistical power, realistic effect sizes, and the avoidance of questionable research practices upon the production of reproducible results.
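The kind of modeling described here, which combines a significance criterion, statistical power, and the prevalence of true effects, is commonly summarized as a positive predictive value. The sketch below works through that arithmetic for illustrative parameter values (the alpha levels, power levels, prevalences, and the function name are choices made here, not taken from the chapter).

```python
# Screening-model arithmetic (Ioannidis-style positive predictive value): given
# the significance criterion alpha, statistical power, and the prior prevalence
# of true hypotheses, what proportion of significant findings reflects a real effect?
def positive_predictive_value(alpha, power, prevalence):
    true_positives = power * prevalence
    false_positives = alpha * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Illustrative parameter choices: stricter alpha and higher power raise the PPV.
for alpha in (0.05, 0.005):
    for power in (0.5, 0.8):
        for prev in (0.1, 0.5):
            ppv = positive_predictive_value(alpha, power, prev)
            print(f"alpha={alpha:<6} power={power:<4} prevalence={prev:<4} PPV={ppv:.2f}")
```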


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S849-S850
Author(s):  
Christopher Brydges ◽  
Laura Gaeta

Abstract A recently published systematic review (Hein et al., 2019) found that consumption of blueberries could improve memory, executive function, and psychomotor function in healthy children and adults, as well as in adults with mild cognitive impairment. However, attention to questionable research practices (QRPs; such as selective reporting of results and/or performing analyses on data until statistical significance is achieved) has grown in recent years. The purpose of this study was to examine the results of the studies included in the review for potential publication bias and/or QRPs. A p-curve analysis and the test of insufficient variance (TIVA) were conducted on the 22 reported p values to test for evidential value of the published research and for publication bias and QRPs, respectively. The p-curve analyses revealed that the studies did not contain any evidential value for the effect of blueberries on cognitive ability, and the TIVAs suggested that there was evidence of publication bias and/or QRPs in the studies. Although these findings do not indicate that there is no relationship between blueberries and cognitive ability, more high-quality research that is pre-registered and appropriately powered is needed to determine whether a relationship exists at all and, if so, the strength of the evidence supporting this association.
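As a rough illustration of how TIVA works (the p values below are hypothetical, not the 22 values analyzed here): convert the reported p values to z scores and ask how likely a variance this small would be if the z scores truly varied with variance of at least 1, as expected in the absence of selection.

```python
# Sketch of the Test of Insufficient Variance (TIVA), as commonly described
# (illustrative, not the authors' script): two-sided p values are converted to
# |z| scores, and (k-1)*var(z) is compared against a chi-square(k-1) distribution.
import numpy as np
from scipy import stats

def tiva(p_values):
    z = stats.norm.isf(np.asarray(p_values) / 2)       # two-sided p -> |z|
    k = len(z)
    var = np.var(z, ddof=1)
    chi2_stat = (k - 1) * var / 1.0                    # variance under H0 is 1
    p_insufficient = stats.chi2.cdf(chi2_stat, df=k - 1)  # left tail: too little variance
    return var, p_insufficient

# Hypothetical p values clustered just under .05:
demo_p = [0.049, 0.041, 0.032, 0.046, 0.028, 0.044, 0.038, 0.047]
print(tiva(demo_p))   # small variance and small left-tail p suggest bias and/or QRPs
```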


2019 ◽  
Vol 227 (4) ◽  
pp. 261-279 ◽  
Author(s):  
Frank Renkewitz ◽  
Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
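As a rough illustration of the kind of simulation involved (all parameter values and the choice of detector are assumptions made here, not the authors' design), the sketch below generates a small literature subject to publication bias and applies one widely used bias detector, Egger's regression test.

```python
# Illustrative sketch: simulate a meta-analytic literature with a given true effect
# in which most nonsignificant studies are suppressed, then test for small-study
# effects with Egger's regression (intercept of standardized effect on precision).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def simulate_literature(k, true_d, pub_prob_nonsig=0.2, n_range=(15, 80)):
    effects, ses = [], []
    while len(effects) < k:
        n = rng.integers(*n_range)                # per-group sample size
        se = np.sqrt(2 / n)                       # approximate SE of Cohen's d
        d_obs = rng.normal(true_d, se)            # observed standardized effect
        significant = abs(d_obs / se) > 1.96
        if significant or rng.random() < pub_prob_nonsig:
            effects.append(d_obs)
            ses.append(se)
    return np.array(effects), np.array(ses)

def egger_test(effects, ses):
    # Regress z = d/SE on precision = 1/SE; an intercept that differs from zero
    # signals small-study effects consistent with publication bias.
    z, precision = effects / ses, 1 / ses
    res = stats.linregress(precision, z)
    t = res.intercept / res.intercept_stderr
    p = 2 * stats.t.sf(abs(t), df=len(z) - 2)
    return res.intercept, p

effects, ses = simulate_literature(k=30, true_d=0.2)
print(egger_test(effects, ses))
```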


2019 ◽  
Author(s):  
Gregory Francis ◽  
Evelina Thunell

Based on findings from six experiments, Dallas, Liu & Ubel (2019) concluded that placing calorie labels to the left of menu items influences consumers to choose lower-calorie food options. Contrary to previously reported findings, they suggested that calorie labels do influence food choices, but only when placed to the left, because in that case they are read first. If true, these findings have important implications for the design of menus and may help address the obesity pandemic. However, an analysis of the reported results indicates that they seem too good to be true. We show that if the effect sizes in Dallas et al. (2019) are representative of the populations, a replication of the six studies (with the same sample sizes) has a probability of only 0.014 of producing uniformly significant outcomes. Such a low success rate suggests that the original findings might be the result of questionable research practices or publication bias. We therefore caution readers and policy makers to be skeptical about the results and conclusions reported by Dallas et al. (2019).
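The general logic behind such a calculation can be sketched as follows: estimate each experiment's power from its observed effect size and sample size, then multiply the powers to obtain the probability that independent replications of all six studies would all be significant. The effect sizes, sample sizes, and two-sample design below are placeholders for illustration, not values or analyses taken from Dallas et al. (2019).

```python
# Sketch of the "all studies significant" probability from per-study power estimates.
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for standardized effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)            # noncentrality parameter
    t_crit = stats.t.isf(alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# Hypothetical per-study observed effects and per-group sample sizes:
studies = [(0.35, 50), (0.40, 45), (0.30, 60), (0.45, 40), (0.38, 55), (0.33, 50)]
powers = [two_sample_power(d, n) for d, n in studies]
print([round(p, 2) for p in powers])
print("P(all six significant) =", round(np.prod(powers), 3))
```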


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values can tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance itself (p ≤ 0.05) is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that, with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. Yet current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis or as showing that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, for example a sample average, together with a measure of uncertainty such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
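The two replication figures in this abstract (roughly one in six at 40% power, and one third conflicting at 80% power) follow from simple probability arithmetic, assuming two independent studies of a true effect run at the stated power levels; a minimal sketch:

```python
# Arithmetic behind the two figures quoted above, assuming two independent studies
# of a true effect, each run at the stated statistical power.
def p_both_significant(power):
    return power * power                  # both studies reach p <= 0.05

def p_conflicting(power):
    return 2 * power * (1 - power)        # exactly one of the two is significant

print(round(p_both_significant(0.4), 2))  # 0.16, roughly one in six
print(round(p_conflicting(0.8), 2))       # 0.32, roughly one third of cases
```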


2020 ◽  
Vol 7 (4) ◽  
pp. 181351 ◽  
Author(s):  
Sarahanne M. Field ◽  
E.-J. Wagenmakers ◽  
Henk A. L. Kiers ◽  
Rink Hoekstra ◽  
Anja F. Ernst ◽  
...  

The crisis of confidence has undermined the trust that researchers place in the findings of their peers. In order to increase trust in research, initiatives such as preregistration have been suggested, which aim to prevent various questionable research practices. As it stands, however, no empirical evidence exists that preregistration does increase perceptions of trust. The picture may be complicated by a researcher's familiarity with the author of the study, regardless of the preregistration status of the research. This registered report presents an empirical assessment of the extent to which preregistration increases the trust of 209 active academics in the reported outcomes, and how familiarity with another researcher influences that trust. Contrary to our expectations, we report ambiguous Bayes factors and conclude that we do not have strong evidence towards answering our research questions. Our findings are presented along with evidence that our manipulations were ineffective for many participants, leading to the exclusion of 68% of complete datasets, and an underpowered design as a consequence. We discuss other limitations and confounds which may explain why the findings of the study deviate from a previously conducted pilot study. We reflect on the benefits of using the registered report submission format in light of our results. The OSF page for this registered report and its pilot can be found here: http://dx.doi.org/10.17605/OSF.IO/B3K75.


2021 ◽  
Author(s):  
Taym Alsalti

Concern has been mounting over the reproducibility of findings in psychology and other empirical sciences. Large-scale replication attempts found worrying results. The high rate of false findings in the published research has been partly attributed to scientists’ engagement in questionable research practices (QRPs). I discuss reasons for and solutions to this problem. Employing a content analysis of empirical studies published in the years 2007 and 2017, I found a decrease in the prevalence of QRPs over the investigated decade. I subsequently discuss possible explanations for the improvement as well as further potential contributors to the high rate of false findings in science. Most scientists agree that a change towards more open and transparent scientific practice on the part of both scientists and publishers is necessary. Debate exists as to how this should be achieved.

