Examination of CIs in health and medical journals from 1976 to 2019: an observational study

BMJ Open ◽  
2019 ◽  
Vol 9 (11) ◽  
pp. e032506 ◽  
Author(s):  
Adrian Gerard Barnett ◽  
Jonathan D Wren

Objectives: Previous research has shown clear biases in the distribution of published p values, with an excess below the 0.05 threshold due to a combination of p-hacking and publication bias. We aimed to examine the bias for statistical significance using published confidence intervals. Design: Observational study. Setting: Papers published in Medline since 1976. Participants: Over 968 000 confidence intervals extracted from abstracts and over 350 000 intervals extracted from the full text. Outcome measures: Cumulative distributions of lower and upper confidence interval limits for ratio estimates. Results: We found an excess of statistically significant results, with a glut of lower intervals just above 1 and upper intervals just below 1. These excesses have not improved in recent years. The excesses did not appear in a set of over 100 000 confidence intervals that were not subject to p-hacking or publication bias. Conclusions: The huge excesses of published confidence intervals that are just below the statistically significant threshold are not statistically plausible. Large improvements in research practice are needed to provide more results that better reflect the truth.
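As a rough illustration of the kind of analysis described above (a sketch only, not the authors' pipeline), one can pull ratio confidence limits out of abstract text with a regular expression and inspect the cumulative distribution of lower limits around 1. The abstract snippets and the regular expression below are hypothetical stand-ins for the Medline extraction.

```python
# Sketch: extract "x to y" style 95% CI limits for ratio estimates from abstract
# text, then look at how the lower limits pile up just above 1.
import re
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical abstract snippets standing in for Medline records.
abstracts = [
    "The hazard ratio was 1.45 (95% CI 1.02 to 2.06).",
    "Odds ratio 0.82 (95% CI 0.68 to 0.99) for the treated group.",
    "Relative risk 1.10 (95% CI 0.91 to 1.33).",
]

ci_pattern = re.compile(r"95% CI (\d+\.\d+) to (\d+\.\d+)")
lowers, uppers = [], []
for text in abstracts:
    for lo, hi in ci_pattern.findall(text):
        lowers.append(float(lo))
        uppers.append(float(hi))
# (The upper limits would be examined the same way for an excess just below 1.)

# Cumulative distribution of lower limits; an excess just above 1 would show up
# as a jump in this curve immediately to the right of the dashed line.
lowers = np.sort(lowers)
plt.step(lowers, np.arange(1, len(lowers) + 1) / len(lowers), where="post")
plt.axvline(1, linestyle="--", color="grey")
plt.xlabel("Lower 95% confidence limit")
plt.ylabel("Cumulative proportion")
plt.show()
```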

2019 ◽  
Author(s):  
Adrian Barnett ◽  
Jonathan Wren

Results that are statistically significant are more likely to be reported by authors and more likely to be accepted by journals. These common biases warp the published evidence and undermine the ability of research to improve health by giving an incomplete body of evidence. We aimed to show the effect of the bias towards statistical significance on the evidence base using published confidence intervals. We examined over 968,000 ratios and their confidence intervals in the abstracts of health and medical journals from Medline between 1976 and January 2019. We plotted the distributions of lower and upper confidence interval limits to visually show the strong bias for statistically significant results. There was a striking change in the distributions around 1, which is the statistically significant threshold for ratios. There was an excess of lower intervals just above 1, corresponding to a statistically significant increase in risk. There was a similar excess of upper intervals just below 1, corresponding to a statistically significant decrease in risk. These biases have not improved in recent years. The huge excesses of confidence intervals that are just above and below the statistically significant threshold are not statistically plausible. Large changes in research practice are needed to provide more results that better reflect the truth.


2019 ◽  
Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically non-significant at least at the alpha level around which the confidence intervals are constructed. For models with statistical significance levels determined via randomization models of inference and for which there is no standard error or confidence interval for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value and its associated confidence interval in relation to a specified alpha level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.


Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically nonsignificant at least at the alpha level around which the confidence intervals are constructed. For models with statistical significance levels determined via randomization models of inference and for which there is no standard error or confidence intervals for the estimate itself, these plots appear less useful. In this article, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate’s p-value and its associated confidence interval in relation to a specified alpha level. These plots can help the analyst interpret and report the statistical and substantive significances of their models. I illustrate using a nonprobability sample of activists and participants at a 1962 anticommunism school.
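A minimal sketch of the underlying idea (not the article's own implementation): a permutation p-value for a single-predictor association, using the absolute correlation as a stand-in test statistic, together with a normal-approximation confidence interval for that p-value, both of which the plot would display against a chosen alpha level. All data and settings below are illustrative.

```python
# Sketch: permutation p-value plus a confidence interval for the p-value itself,
# which is an estimate based on a finite number of permutations.
import numpy as np

rng = np.random.default_rng(0)
n, n_perm, alpha = 200, 2000, 0.05

x = rng.normal(size=n)
y = 0.15 * x + rng.normal(size=n)     # modest true effect, for illustration

def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

observed = abs_corr(x, y)
perm = np.array([abs_corr(x, rng.permutation(y)) for _ in range(n_perm)])

# Standard permutation p-value (adding 1 to numerator and denominator).
p_hat = (np.sum(perm >= observed) + 1) / (n_perm + 1)

# Normal-approximation 95% CI for the permutation p-value.
se = np.sqrt(p_hat * (1 - p_hat) / n_perm)
ci = (max(p_hat - 1.96 * se, 0.0), min(p_hat + 1.96 * se, 1.0))

print(f"p = {p_hat:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}, alpha = {alpha}")
# In the plot version, each estimate gets a point at p_hat with this interval
# and a reference line at alpha; intervals crossing alpha are ambiguous calls.
```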


2021 ◽  
Vol 35 (3) ◽  
pp. 157-174
Author(s):  
Guido W. Imbens

The use of statistical significance and p-values has become a matter of substantial controversy in various fields using statistical methods. This has gone as far as some journals banning the use of indicators for statistical significance, or even any reports of p-values, and, in one case, any mention of confidence intervals. I discuss three of the issues that have led to these often-heated debates. First, I argue that in many cases, p-values and indicators of statistical significance do not answer the questions of primary interest. Such questions typically involve making (recommendations on) decisions under uncertainty. In that case, point estimates and measures of uncertainty, in the form of confidence intervals or, even better, Bayesian intervals, are often more informative summary statistics. In fact, in that case, the presence or absence of statistical significance is essentially irrelevant, and including it in the discussion may confuse the matter at hand. Second, I argue that there are also cases where testing null hypotheses is a natural goal and where p-values are reasonable and appropriate summary statistics. I conclude that banning them in general is counterproductive. Third, I discuss that the overemphasis in empirical work on statistical significance has led to abuse of p-values in the form of p-hacking and publication bias. The use of pre-analysis plans and replication studies, in combination with lowering the emphasis on statistical significance, may help address these problems.


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Statistical significance (p≤0.05) is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that, with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis or as grounds for falsely concluding that 'there is no effect'. Information on the possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
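The two replication figures quoted above follow from a simple independence assumption: if each study reaches significance with probability equal to its power whenever there is a true effect, then two studies are both significant with probability power squared, and they conflict with probability 2 × power × (1 − power). A quick check of that arithmetic:

```python
# Verify the replication figures quoted in the abstract, assuming each study
# independently reaches significance with probability equal to its power
# whenever a true effect exists.
def both_significant(power):
    return power * power

def conflicting(power):
    # one study significant, the other not (either order)
    return 2 * power * (1 - power)

print(f"40% power: both studies significant in {both_significant(0.4):.2f} of pairs (about 1 in 6)")
print(f"80% power: conflicting significance in {conflicting(0.8):.2f} of pairs (about 1 in 3)")
```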


2009 ◽  
Vol 217 (1) ◽  
pp. 15-26 ◽  
Author(s):  
Geoff Cumming ◽  
Fiona Fidler

Most questions across science call for quantitative answers, ideally, a single best estimate plus information about the precision of that estimate. A confidence interval (CI) expresses both efficiently. Early experimental psychologists sought quantitative answers, but for the last half century psychology has been dominated by the nonquantitative, dichotomous thinking of null hypothesis significance testing (NHST). The authors argue that psychology should rejoin mainstream science by asking better questions – those that demand quantitative answers – and using CIs to answer them. They explain CIs and a range of ways to think about them and use them to interpret data, especially by considering CIs as prediction intervals, which provide information about replication. They explain how to calculate CIs on means, proportions, correlations, and standardized effect sizes, and illustrate symmetric and asymmetric CIs. They also argue that information provided by CIs is more useful than that provided by p values, or by values of Killeen’s prep, the probability of replication.
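As a minimal illustration of two of the interval types discussed (a sketch with made-up data, not taken from the article): a symmetric t-based confidence interval on a mean and an asymmetric Wilson interval on a proportion.

```python
# Sketch: a symmetric CI on a mean and an asymmetric Wilson CI on a proportion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Symmetric 95% t-based CI on a mean (30 made-up observations).
sample = rng.normal(loc=10.0, scale=2.0, size=30)
m, se = sample.mean(), stats.sem(sample)
half_width = stats.t.ppf(0.975, df=len(sample) - 1) * se
print(f"mean {m:.2f}, 95% CI [{m - half_width:.2f}, {m + half_width:.2f}]")

# Asymmetric 95% Wilson CI on a proportion (7 successes out of 20).
k, n, z = 7, 20, 1.96
p = k / n
centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
margin = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
print(f"proportion {p:.2f}, 95% Wilson CI [{centre - margin:.2f}, {centre + margin:.2f}]")
```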


2021 ◽  
pp. bmjebm-2020-111603
Author(s):  
John Ferguson

Commonly accepted statistical advice dictates that large-sample size and highly powered clinical trials generate more reliable evidence than trials with smaller sample sizes. This advice is generally sound: treatment effect estimates from larger trials tend to be more accurate, as witnessed by tighter confidence intervals in addition to reduced publication biases. Consider then two clinical trials testing the same treatment which result in the same p values, the trials being identical apart from differences in sample size. Assuming statistical significance, one might at first suspect that the larger trial offers stronger evidence that the treatment in question is truly effective. Yet, often precisely the opposite will be true. Here, we illustrate and explain this somewhat counterintuitive result and suggest some ramifications regarding interpretation and analysis of clinical trial results.
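A hedged numerical illustration of the point (not taken from the paper): two hypothetical two-arm trials with the same just-significant two-sided p-value but very different sizes. Working back from the shared p-value to the implied effect estimate shows that the larger trial must have observed a much smaller effect; the sample sizes and outcome standard deviation below are assumptions chosen purely for illustration.

```python
# Sketch: same p-value, different sample sizes, very different implied effects.
import math
from scipy import stats

p_value = 0.049                        # just significant in both hypothetical trials
z = stats.norm.ppf(1 - p_value / 2)    # z statistic implied by the shared p-value

for n_per_arm in (100, 10_000):
    se = math.sqrt(2 / n_per_arm)      # SE of the mean difference with outcome SD = 1
    effect = z * se                    # estimated difference implied by z = effect / SE
    print(f"n per arm = {n_per_arm:>6}: estimated effect = {effect:.3f} (SE {se:.3f})")

# The small trial's just-significant result corresponds to a sizeable effect, the
# large trial's to a tiny one; which offers stronger evidence of a worthwhile
# benefit is exactly the question the paper unpacks.
```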


1996 ◽  
Vol 21 (4) ◽  
pp. 299-332 ◽  
Author(s):  
Larry V. Hedges ◽  
Jack L. Vevea

When there is publication bias, studies yielding large p values, and hence small effect estimates, are less likely to be published, which leads to biased estimates of effects in meta-analysis. We investigate a selection model based on one-tailed p values in the context of a random effects model. The procedure both models the selection process and corrects for the consequences of selection on estimates of the mean and variance of effect parameters. A test of the statistical significance of selection is also provided. The small sample properties of the method are evaluated by means of simulations, and the asymptotic theory is found to be reasonably accurate under correct model specification and plausible conditions. The method substantially reduces bias due to selection when model specification is correct, but the variance of estimates is increased; thus mean squared error is reduced only when selection produces substantial bias. The robustness of the method to violations of assumptions about the form of the distribution of the random effects is also investigated via simulation, and the model-corrected estimates of the mean effect are generally found to be much less biased than the uncorrected estimates. The significance test for selection bias, however, is found to be highly nonrobust, rejecting at up to 10 times the nominal rate when there is no selection but the distribution of the effects is incorrectly specified.
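A small simulation can show the problem this selection model is designed to correct, though it does not implement the authors' estimator: if studies with larger one-tailed p-values are less likely to be 'published', the naive mean of published effects overstates the true mean effect. All parameter values below are arbitrary choices for illustration.

```python
# Sketch: one-tailed-p-value selection inflates the naive mean effect estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_mean, tau, n_studies = 0.2, 0.1, 5_000

# Random-effects setup: each study has its own true effect and sampling error.
se = rng.uniform(0.05, 0.3, size=n_studies)          # per-study standard errors
theta = rng.normal(true_mean, tau, size=n_studies)   # study-level true effects
est = rng.normal(theta, se)                          # observed effect estimates

p_one_tailed = 1 - stats.norm.cdf(est / se)

# Crude selection rule: significant studies always published, the rest rarely.
published = (p_one_tailed < 0.05) | (rng.random(n_studies) < 0.1)

print(f"mean of all simulated effects:    {est.mean():.3f}")
print(f"mean of 'published' effects only: {est[published].mean():.3f}")
```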


2007 ◽  
Vol 97 (2) ◽  
pp. 165-170 ◽  
Author(s):  
Garry T. Allison

There is a well-known phenomenon of publication bias toward manuscripts that report statistically significant differences. The clinical implications of these statistically significant differences are not always clear because the magnitude of the changes may be clinically meaningless. This article relates the critical P value threshold to the magnitude of the actual observed change and provides a rationale for reporting confidence intervals in clinical studies. Strategies for improving statistical power and reducing the magnitude of the confidence interval range for clinical trials are also described. (J Am Podiatr Med Assoc 97(2): 165–170, 2007)
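A brief sketch of the statistical-versus-clinical distinction (not from the article): in a large trial, a mean difference of about one point is highly statistically significant with a tight confidence interval, yet the whole interval can sit below a clinically meaningful change, here assumed, purely for illustration, to be five points.

```python
# Sketch: statistically significant but clinically trivial difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
clinically_meaningful = 5.0                   # assumed threshold, in scale points

control = rng.normal(50, 10, size=2_000)
treated = rng.normal(51, 10, size=2_000)      # true difference of only 1 point

t, p = stats.ttest_ind(treated, control)
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"difference {diff:.2f}, p = {p:.4f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"the entire CI lies below the assumed {clinically_meaningful}-point clinically meaningful change")
```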

