DO BLUEBERRIES ACTUALLY IMPROVE COGNITIVE PERFORMANCE? AN ANALYSIS OF PUBLICATION BIAS IN PUBLISHED RESEARCH

2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S849-S850
Author(s):  
Christopher Brydges ◽  
Laura Gaeta

A recently published systematic review (Hein et al., 2019) found that consumption of blueberries could improve memory, executive function, and psychomotor function in healthy children and adults, as well as adults with mild cognitive impairment. However, attention to questionable research practices (QRPs; such as selective reporting of results and/or performing analyses on data until statistical significance is achieved) has grown in recent years. The purpose of this study was to examine the results of the studies included in the review for potential publication bias and/or QRPs. A p-curve analysis and the test of insufficient variance (TIVA) were conducted on the 22 reported p values to test for evidential value of the published research, and for publication bias and QRPs, respectively. The p-curve analyses revealed that the studies did not contain any evidential value for the effect of blueberries on cognitive ability, and the TIVAs suggested that there was evidence of publication bias and/or QRPs in the studies. Although these findings do not indicate that there is no relationship between blueberries and cognitive ability, more high-quality research that is pre-registered and appropriately powered is needed to determine whether a relationship exists at all, and if so, the strength of the evidence to support this association.
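
For readers unfamiliar with TIVA, the following is a minimal sketch of the test as it is commonly described: two-tailed p values are converted to z-scores, and the variance of those z-scores is compared with the value of 1 expected for independent, unselected results using a left-tailed chi-square test. The function names and the example p values are hypothetical illustrations and are not the 22 values analyzed in the study; the chi-square CDF is computed with a simple series so that only the standard library is needed.

```python
import math
from statistics import NormalDist, variance


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a          # first term of the series
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def tiva(p_values):
    """Return (variance of z-scores, left-tailed p value) for two-tailed p values."""
    z = [NormalDist().inv_cdf(1 - p / 2) for p in p_values]
    k = len(z)
    var_z = variance(z)                 # sample variance of the z-scores
    statistic = (k - 1) * var_z         # chi-square distributed with k - 1 df
    return var_z, chi2_cdf(statistic, k - 1)


# Hypothetical p values clustered just below .05 (illustration only).
example_p = [0.04, 0.03, 0.045, 0.02, 0.049, 0.01]
var_z, p_left = tiva(example_p)
print(f"variance of z = {var_z:.3f}, left-tailed p = {p_left:.4f}")
```

A variance well below 1 with a small left-tailed p value is read as a sign that the reported results vary less than sampling error alone would produce, which is what selective reporting tends to cause.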

2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. 
Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
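
The "one third" figure can be checked with simple arithmetic. The sketch below assumes two independent studies of the same true effect that share the same statistical power; the function name is a hypothetical illustration.

```python
def replication_outcomes(power):
    """Outcome probabilities for two independent studies of a true effect."""
    return {
        "both_significant": power * power,
        "conflicting": 2 * power * (1 - power),   # one significant, one not
        "both_nonsignificant": (1 - power) ** 2,
    }


outcomes = replication_outcomes(0.80)
print(round(outcomes["conflicting"], 2))
```

At 80% power the conflicting share is 2 × 0.8 × 0.2 = 0.32, which is the roughly one-in-three rate quoted in the abstract.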


2018 ◽  
Author(s):  
Christopher Brydges

Objectives: Research has found evidence of publication bias, questionable research practices (QRPs), and low statistical power in published psychological journal articles. Isaacowitz’s (2018) editorial in the Journals of Gerontology Series B, Psychological Sciences called for investigation of these issues in gerontological research. The current study presents meta-research findings based on published research to explore whether there is evidence of these practices in gerontological research. Method: A total of 14,481 test statistics and p values were extracted from articles published in eight top gerontological psychology journals since 2000. Frequentist and Bayesian caliper tests were used to test for publication bias and QRPs (specifically, p-hacking and incorrect rounding of p values). A z-curve analysis was used to estimate average statistical power across studies. Results: Strong evidence of publication bias was observed, and average statistical power was approximately .70, below the recommended .80 level. Evidence of p-hacking was mixed. Evidence of incorrect rounding of p values was inconclusive. Discussion: Gerontological research is not immune to publication bias, QRPs, and low statistical power. Researchers, journals, institutions, and funding bodies are encouraged to adopt open and transparent research practices and to use Registered Reports as an alternative article type to minimize publication bias and QRPs and to increase statistical power.
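
A frequentist caliper test can be sketched as follows: count the test statistics that fall in a narrow band just above the critical value and those in an equally wide band just below it, then use a one-sided binomial test to ask whether the "just significant" side is over-represented. The band width, the function name, and the example z values are assumptions made for illustration; the study's exact caliper widths and its Bayesian variant are not reproduced here.

```python
from math import comb


def caliper_test(z_values, critical=1.96, width=0.10):
    """Return (n_over, n_under, one-sided binomial p) for a caliper around `critical`."""
    over = sum(1 for z in z_values if critical <= abs(z) < critical + width)
    under = sum(1 for z in z_values if critical - width <= abs(z) < critical)
    n = over + under
    # P(X >= over) when X ~ Binomial(n, 0.5), i.e. no excess above the threshold
    p_value = sum(comb(n, i) for i in range(over, n + 1)) / 2 ** n
    return over, under, p_value


# Hypothetical z values: nine just above 1.96, one just below, two outside the caliper.
example_z = [1.97, 1.98, 1.99, 2.00, 2.01, 2.02, 2.03, 2.04, 2.05, 1.90, 0.50, 3.20]
print(caliper_test(example_z))
```

Values outside the caliper are ignored on purpose, since the test only compares results that are nearly identical in strength but fall on opposite sides of the significance threshold.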


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. 
Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
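
The claim that significant effect sizes are biased upwards at less than ideal power can be illustrated with a small simulation. This is a sketch under stated assumptions: every study estimates the same true effect, estimates are normally distributed with a standard error of 1, and the true effect is chosen so that power is about 40% at a two-sided 5% threshold. The function name, seed, and number of studies are hypothetical choices.

```python
import random
from statistics import NormalDist, fmean


def simulate_significance_filter(power=0.40, n_studies=100_000, seed=1):
    """Return (share of significant studies, mean significant estimate / true effect)."""
    z_crit = NormalDist().inv_cdf(0.975)               # two-sided 5% threshold
    # True effect (in standard-error units) giving roughly the requested power;
    # the negligible chance of significance in the wrong direction is ignored.
    true_z = z_crit + NormalDist().inv_cdf(power)
    rng = random.Random(seed)
    estimates = [rng.gauss(true_z, 1.0) for _ in range(n_studies)]
    significant = [z for z in estimates if abs(z) > z_crit]
    share = len(significant) / n_studies
    exaggeration = fmean(significant) / true_z
    return share, exaggeration


share, exaggeration = simulate_significance_filter()
print(f"significant share = {share:.3f}, exaggeration ratio = {exaggeration:.2f}")
```

Under these assumptions roughly 40% of the simulated studies reach significance, and the average of those significant estimates is clearly larger than the true effect, which is the upward bias the abstract describes.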


Author(s):  
Abhaya Indrayan

Background: Small P-values have been conventionally considered as evidence to reject a null hypothesis in empirical studies. However, there is now widespread criticism of P-values, and the threshold used for statistical significance is questioned. Methods: This communication takes a contrarian view and explains why the P-value and its threshold are still useful for ruling out sampling fluctuation as a source of the findings. Results: The problem is not with P-values themselves but with their misuse, abuse, and overuse, including the dominant role they have assumed in empirical results. False results may be mostly because of errors in design, invalid data, inadequate analysis, inappropriate interpretation, accumulation of Type-I error, and selective reporting, and not because of P-values per se. Conclusion: A threshold of P-values such as 0.05 for statistical significance is helpful in making a binary inference for practical application of the result. However, a lower threshold can be suggested to reduce the chance of false results. Also, the emphasis should be on detecting a medically significant effect and not zero effect.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3544 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. 
Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.


1996 ◽  
Vol 21 (4) ◽  
pp. 299-332 ◽  
Author(s):  
Larry V. Hedges ◽  
Jack L. Vevea

When there is publication bias, studies yielding large p values, and hence small effect estimates, are less likely to be published, which leads to biased estimates of effects in meta-analysis. We investigate a selection model based on one-tailed p values in the context of a random effects model. The procedure both models the selection process and corrects for the consequences of selection on estimates of the mean and variance of effect parameters. A test of the statistical significance of selection is also provided. The small sample properties of the method are evaluated by means of simulations, and the asymptotic theory is found to be reasonably accurate under correct model specification and plausible conditions. The method substantially reduces bias due to selection when model specification is correct, but the variance of estimates is increased; thus mean squared error is reduced only when selection produces substantial bias. The robustness of the method to violations of assumptions about the form of the distribution of the random effects is also investigated via simulation, and the model-corrected estimates of the mean effect are generally found to be much less biased than the uncorrected estimates. The significance test for selection bias, however, is found to be highly nonrobust, rejecting at up to 10 times the nominal rate when there is no selection but the distribution of the effects is incorrectly specified.


2021 ◽  
Author(s):  
Kleber Neves ◽  
Pedro Batista Tan ◽  
Olavo Bohrer Amaral

Diagnostic screening models for the interpretation of null hypothesis significance test (NHST) results have been influential in highlighting the effect of selective publication on the reproducibility of the published literature, leading to John Ioannidis’ much-cited claim that most published research findings are false. These models, however, are typically based on the assumption that hypotheses are dichotomously true or false, without considering that effect sizes for different hypotheses are not the same. To address this limitation, we develop a simulation model that represents effect sizes explicitly using different continuous distributions, while retaining other aspects of previous models such as publication bias and the pursuit of statistical significance. Our results show that the combination of selective publication, bias, low statistical power and unlikely hypotheses consistently leads to high proportions of false positives, irrespective of the effect size distribution assumed. Using continuous effect sizes also allows us to evaluate the degree of effect size overestimation and the prevalence of estimates with the wrong sign in the literature, showing that the same factors that drive false-positive results also lead to errors in estimating effect size direction and magnitude. Nevertheless, the relative influence of these factors on different metrics varies depending on the distribution assumed for effect sizes. The model is made available as an R ShinyApp interface, allowing one to explore features of the literature in various scenarios.
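
The general idea of simulating a literature with continuous effect sizes can be sketched in a few lines. This is not the authors' model or their R Shiny app: it is a simplified illustration that assumes true effects drawn from a normal distribution centred on zero, one noisy estimate per hypothesis, and publication of significant results only. The parameter values, function name, and summary metrics are hypothetical.

```python
import random


def simulate_literature(tau=0.5, se=1.0, n_studies=100_000, z_crit=1.96, seed=1):
    """Return (published share, wrong-sign share, magnitude exaggeration) among published results."""
    rng = random.Random(seed)
    published = wrong_sign = 0
    sum_abs_estimate = sum_abs_true = 0.0
    for _ in range(n_studies):
        true_effect = rng.gauss(0.0, tau)            # continuous true effect
        estimate = rng.gauss(true_effect, se)        # noisy study estimate
        if abs(estimate / se) <= z_crit:
            continue                                 # nonsignificant: left unpublished
        published += 1
        if (estimate > 0) != (true_effect > 0):
            wrong_sign += 1                          # published direction is wrong
        sum_abs_estimate += abs(estimate)
        sum_abs_true += abs(true_effect)
    return (
        published / n_studies,
        wrong_sign / published,
        sum_abs_estimate / sum_abs_true,
    )


print(simulate_literature())
```

With small true effects relative to the standard error, only a minority of studies are published in this sketch, a noticeable share of them point in the wrong direction, and published magnitudes are on average much larger than the underlying true effects.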


2021 ◽  
Author(s):  
Jean Alexander ◽  
James A Green

Purpose: This research examined the evidential value of research in Speech, Language, and Hearing (SLH), and the extent to which there is publication bias in reported findings. We also looked at the prevalence of good research practices, including those that work to minimize publication bias. Method: We extracted statistical results from 51 articles reported in four meta-analyses. These were then analyzed with two recent tests for evidential value and publication bias: the p-curve and the Z-curve. These articles were also coded for pre-registration, data access statements, and whether they were replication studies. Results: P-curves were right-skewed, indicating evidential value and ruling out selective reporting as the sole reason for the significant findings. The Z-curve similarly found evidential value but detected a relative absence of null results, suggesting there is some publication bias. No studies were pre-registered, no studies had a data access statement, and no studies were full replication studies (3 studies were partial replications). Conclusions: Findings indicate SLH research has evidential value. This means that decision-makers and clinicians can continue to rely on the SLH research evidence base to influence service and clinical decisions. However, the presence of publication bias means that meta-analytic estimates of effectiveness may be exaggerated. Thus, we encourage SLH researchers to engage in study pre-registration, make result data accessible, conduct replication studies, and document null findings.
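
One common formulation of the p-curve right-skew test can be sketched as follows. Under the null hypothesis of no true effect, significant p values are uniformly distributed between 0 and .05, so each is rescaled to a "pp value" (p divided by .05), converted to a z-score, and the z-scores are combined with Stouffer's method. A strongly negative combined z indicates right skew, meaning an excess of very small p values. The function name and example p values are hypothetical, and this sketch is not claimed to match the exact procedure or software used in the study.

```python
import math
from statistics import NormalDist


def p_curve_right_skew(p_values, alpha=0.05):
    """Return (Stouffer z, p value) for right skew among significant p values."""
    significant = [p for p in p_values if 0 < p < alpha]
    if not significant:
        raise ValueError("no significant p values to analyze")
    pp_values = [p / alpha for p in significant]          # uniform on (0, 1) under the null
    z_scores = [NormalDist().inv_cdf(pp) for pp in pp_values]
    stouffer_z = sum(z_scores) / math.sqrt(len(z_scores))
    return stouffer_z, NormalDist().cdf(stouffer_z)       # small p value means right skew


# Hypothetical sets: one dominated by very small p values, one spread evenly below .05.
skewed = [0.001, 0.002, 0.004, 0.01, 0.003, 0.02]
flat = [0.005, 0.015, 0.025, 0.035, 0.045]
print(p_curve_right_skew(skewed))
print(p_curve_right_skew(flat))
```

The first set yields a clearly negative combined z and a small p value, consistent with evidential value, while the evenly spread set gives a combined z near zero and no indication of right skew.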

