CONFIDENCE INTERVALS, NOT P VALUES

PEDIATRICS ◽  
1996 ◽  
Vol 97 (2) ◽  
pp. A42-A42
Author(s):  
Student

To evaluate the extent of prediction error we must discard hypothesis testing in favor of estimation ... The use of confidence intervals as summaries of the effect of an intervention enables the correct conclusions to be drawn from meta-analyses; reliance on whether a P value is more or less than 0.05 is a dangerous way of making decisions ...

PEDIATRICS ◽  
1996 ◽  
Vol 98 (6) ◽  
pp. A22-A22
Author(s):  
Student

When we are told that "there's no evidence that A causes B," we should first ask whether absence of evidence means simply that there is no information at all. If there are data, we should look for quantification of the association rather than just a P value. Where risks are small, P values may well mislead: confidence intervals are likely to be wide, indicating considerable uncertainty.
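To make the point concrete, here is a minimal sketch (my numbers, not Student's) of how a "non-significant" comparison of small risks yields a confidence interval so wide that it is compatible with anything from a substantial protective effect to a tenfold increase in risk:

```python
# A minimal sketch: why a wide confidence interval is more informative than
# "P > 0.05" when risks are small. Hypothetical data: 4/1000 events among
# the exposed, 2/1000 among the unexposed.
import math

a, n1 = 4, 1000   # exposed: events, total
b, n2 = 2, 1000   # unexposed: events, total

rr = (a / n1) / (b / n2)                        # risk ratio point estimate
se_log_rr = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)  # SE of log(RR), delta method
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
# Roughly: RR = 2.00, 95% CI 0.37 to 10.9 -- "not significant", yet the
# interval is compatible with a 63% risk reduction and an 11-fold increase.
```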


2016 ◽  
Vol 156 (6) ◽  
pp. 978-980 ◽  
Author(s):  
Peter M. Vila ◽  
Melanie Elizabeth Townsend ◽  
Neel K. Bhatt ◽  
W. Katherine Kao ◽  
Parul Sinha ◽  
...  

Effect sizes and confidence intervals are under-reported in the current biomedical literature. The objective of this article is to discuss the recent paradigm shift encouraging the reporting of effect sizes and confidence intervals. Whereas P values inform us only about whether an observed effect is plausibly due to chance, effect sizes inform us about the magnitude of the effect (clinical significance), and confidence intervals inform us about the range of plausible estimates for the population mean (precision). Reporting effect sizes and confidence intervals is a necessary addition to the biomedical literature, and these concepts are reviewed in this article.
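As a hedged illustration of the three quantities the article distinguishes, the following Python sketch (hypothetical data; numpy and scipy assumed available) reports a P value, an effect size (Cohen's d), and a 95% confidence interval for the mean difference side by side:

```python
# A minimal sketch (not from the article): reporting a P value, an effect
# size, and a confidence interval together, using two hypothetical samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treated = rng.normal(10.5, 2.0, 40)   # hypothetical outcome scores
control = rng.normal(10.0, 2.0, 40)

t, p = stats.ttest_ind(treated, control)   # P value: is chance plausible?

# Effect size (clinical significance): Cohen's d from the pooled SD.
n1, n2 = len(treated), len(control)
sp = np.sqrt(((n1 - 1) * treated.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = (treated.mean() - control.mean()) / sp

# Precision: 95% CI for the raw mean difference.
diff = treated.mean() - control.mean()
se = sp * np.sqrt(1/n1 + 1/n2)
half = stats.t.ppf(0.975, n1 + n2 - 2) * se
print(f"P = {p:.3f}, d = {d:.2f}, mean difference {diff:.2f} "
      f"(95% CI {diff - half:.2f} to {diff + half:.2f})")
```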


2021 ◽  
Vol 8 (1) ◽  
pp. 118-127
Author(s):  
Flavio Martinez-Morales ◽  
Saray Aranda Romo ◽  
Othoniel Hugo Aragon-Martinez

To date, there has been no meta-analytic synthesis of the clinical reports that used a cacao bean husk extract (CBHE) solution as an anticariogenic mouth rinse. The aim of this study was therefore to evaluate that evidence through a systematic review and meta-analysis conducted in accordance with PRISMA guidelines. Scientific databases were searched for studies published up to June 2021; inclusion and exclusion criteria were applied to the studies found, and their data were then analyzed. Across the five selected studies, 36.6%, 58.5%, and 4.9% of risk-of-bias ratings were low, unclear, and high, respectively. With acceptable heterogeneity (I2 values from 0 to 65%, p values > 0.09) and no evidence of reporting bias (symmetrical funnel plots), the meta-analyses showed that a CBHE mouth rinse reduced the salivary count of Streptococcus mutans (Z values from 2.45 to 10.61, p values < 0.01), performed similarly to a chlorhexidine rinse (Z value = 0.55, p value = 0.58), and was not associated with a significant excess of adverse events (Z value = 0.92, p value = 0.36) in children and adults, all relative to volunteers using an ethanol rinse or to their pretest conditions. In conclusion, the CBHE mouth rinse reduced a cariogenic bacterium with an acceptable safety profile, but more high-quality clinical studies measuring more parameters are needed.
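For readers unfamiliar with the statistics quoted above, a minimal fixed-effect meta-analysis sketch (illustrative numbers, not the CBHE data) shows where the pooled Z test and the I2 heterogeneity statistic come from:

```python
# A minimal fixed-effect meta-analysis sketch: inverse-variance pooling,
# the Z test of the pooled effect, and Cochran's Q / I^2 for heterogeneity.
import math

# Hypothetical per-study effects (e.g., mean differences) and standard errors.
effects = [-0.8, -1.1, -0.6, -0.9]
ses     = [0.30, 0.45, 0.35, 0.40]

w = [1 / se**2 for se in ses]                  # inverse-variance weights
pooled = sum(wi * ei for wi, ei in zip(w, effects)) / sum(w)
se_pooled = math.sqrt(1 / sum(w))
z = pooled / se_pooled                         # Z test of pooled effect = 0

# Cochran's Q and I^2 quantify between-study heterogeneity.
q = sum(wi * (ei - pooled)**2 for wi, ei in zip(w, effects))
df = len(effects) - 1
i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"pooled = {pooled:.2f} (SE {se_pooled:.2f}), Z = {z:.2f}, I^2 = {i2:.0f}%")
```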


2019 ◽  
Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate that crosses zero is statistically non-significant at least at the alpha level at which the confidence intervals are constructed. For models whose statistical significance levels are determined via randomization models of inference, and for which there is no standard error or confidence interval for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value, and a confidence interval for that p-value, in relation to a specified alpha level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.
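The key observation can be sketched as follows (my Python illustration, not Taylor's implementation): a permutation p-value estimated from B random permutations is a binomial proportion, so it carries its own confidence interval, and both can be compared to the alpha level:

```python
# A minimal sketch: a permutation p-value with a Clopper-Pearson confidence
# interval for the p-value itself, which is what gets plotted against alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 50)
y = 0.4 * x + rng.normal(0, 1, 50)     # hypothetical predictor and outcome

obs = abs(np.corrcoef(x, y)[0, 1])     # observed test statistic

B = 2000
hits = 0
for _ in range(B):
    perm = rng.permutation(y)          # break the x-y pairing under the null
    if abs(np.corrcoef(x, perm)[0, 1]) >= obs:
        hits += 1

p_hat = (hits + 1) / (B + 1)           # standard add-one permutation p-value
# The p-value is a binomial proportion over B permutations, so it has an
# exact (Clopper-Pearson) 95% CI of its own.
lo, hi = stats.binomtest(hits, B).proportion_ci(0.95, method="exact")
print(f"p = {p_hat:.4f} (95% CI {lo:.4f} to {hi:.4f}); compare both ends to alpha")
```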


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Agustín Ciapponi ◽  
José M. Belizán ◽  
Gilda Piaggio ◽  
Sanni Yaya

Abstract: This article challenges the "tyranny of the P-value" and promotes more valuable and applicable interpretations of the results of research on health care delivery. After presenting the current state of the debate within the scientific community, we provide solid arguments for retiring statistical significance as the sole way to interpret results. Instead, we promote reporting the much more informative confidence intervals, supplemented where relevant by exact P-values. We also provide some clues for integrating statistical and clinical significance, by referring to minimal important differences and by combining the effect size of an intervention with the certainty of the evidence, ideally using the GRADE approach. We argue against interpreting or reporting results as statistically significant or statistically non-significant. When point estimates are compatible with important benefits, or even with important harms, we recommend showing them with their confidence intervals: it seems fair to report the point estimate and the more likely values along with a very clear statement of the implications of the extremes of the interval. We recommend drawing conclusions that consider multiple factors besides P-values, such as the certainty of the evidence for each outcome, net benefit, economic considerations, and values and preferences. We use several examples and figures to illustrate different scenarios, and we suggest wording to standardize reporting. Several statistical measures have a role in the scientific communication of studies, but it is time to understand that there is life beyond statistical significance. There is a great opportunity for improvement towards a more complete interpretation and more standardized reporting.
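One way to operationalize the integration of statistical and clinical significance described above (my sketch; the MID value and the classification rules are illustrative assumptions, not the article's) is to classify a confidence interval against a minimal important difference rather than against P = 0.05:

```python
# A minimal sketch: interpret a 95% CI relative to a minimal important
# difference (MID) instead of asking only whether it excludes zero.
def interpret(lo, hi, mid):
    """Classify a 95% CI for a benefit (positive = better) against the MID."""
    if lo >= mid:
        return "important benefit (whole CI beyond the MID)"
    if hi <= -mid:
        return "important harm (whole CI beyond the MID, other direction)"
    if -mid < lo and hi < mid:
        return "no important effect (CI confined to the trivial-effect range)"
    return "uncertain: CI compatible with both trivial and important effects"

# Hypothetical intervals, with an assumed MID of 1.0 outcome units.
for ci in [(1.2, 3.0), (-0.4, 0.6), (-0.9, 2.8)]:
    print(ci, "->", interpret(*ci, mid=1.0))
```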


2019 ◽  
Author(s):  
Don van Ravenzwaaij ◽  
John P A Ioannidis

Abstract Background: Until recently, a typical rule used for the endorsement of new medications by the Food and Drug Administration has been the existence of at least two statistically significant clinical trials favoring the new medication. This rule has consequences for the true positive rate (endorsement of an effective treatment) and the false positive rate (endorsement of an ineffective treatment). Methods: In this paper, we compare true positive and false positive rates for different evaluation criteria through simulations that rely on (1) conventional p-values; (2) confidence intervals based on meta-analyses assuming fixed or random effects; and (3) Bayes factors. We varied threshold levels for statistical evidence, thresholds for what constitutes a clinically meaningful treatment effect, and the number of trials conducted. Results: Our results show that Bayes factors, meta-analytic confidence intervals, and p-values often have similar performance. Bayes factors may perform better when the number of trials conducted is high, when trials have small sample sizes, and when clinically meaningful effects are not small, particularly in fields where the number of non-zero effects is relatively large. Conclusions: Thinking about realistic effect sizes in conjunction with desirable levels of statistical evidence, as well as quantifying statistical evidence with Bayes factors, may help improve decision-making in some circumstances.
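In the same spirit, a much-simplified simulation sketch (my construction, not the authors' code; the sample size, effect size, and trial count are illustrative) estimates the true and false positive rates of the "two significant trials" rule:

```python
# A minimal sketch: true/false positive rates of the rule "endorse if two
# independent trials are both significant", via Monte Carlo simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def endorse(effect, n_per_arm=50, trials=2, alpha=0.05):
    """Endorse only if every trial yields a significant one-sided t-test."""
    for _ in range(trials):
        treat = rng.normal(effect, 1.0, n_per_arm)
        ctrl = rng.normal(0.0, 1.0, n_per_arm)
        t, p = stats.ttest_ind(treat, ctrl, alternative="greater")
        if p >= alpha:
            return False
    return True

sims = 2000
tpr = np.mean([endorse(effect=0.5) for _ in range(sims)])  # effective drug
fpr = np.mean([endorse(effect=0.0) for _ in range(sims)])  # ineffective drug
print(f"true positive rate ~ {tpr:.3f}, false positive rate ~ {fpr:.4f}")
# With two trials at alpha = 0.05, the false positive rate is roughly 0.0025.
```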


2021 ◽  
Author(s):  
Willem M Otte ◽  
Christiaan H Vinkers ◽  
Philippe Habets ◽  
David G P van IJzendoorn ◽  
Joeri K Tijdink

Abstract Objective: To quantitatively map how non-significant outcomes have been reported in randomised controlled trials (RCTs) over the last thirty years. Design: Quantitative analysis of the English full texts of 567,758 RCTs recorded in PubMed (81.5% of all published RCTs). Methods: We determined the exact presence of 505 pre-defined phrases denoting results that do not reach formal statistical significance (P < 0.05) in 567,758 RCT full texts published between 1990 and 2020. Phrase data were modeled with Bayesian linear regression, and evidence for temporal change was obtained through Bayes-factor analysis. In a randomly sampled subset, the associated P values were manually extracted. Results: We identified 61,741 phrases indicating close-to-significant results in 49,134 RCTs (8.65%; 95% confidence interval (CI): 8.58-8.73). The overall prevalence of these phrases remained stable over time, the most prevalent being 'marginally significant' (in 7,735 RCTs), 'all but significant' (7,015), 'a nonsignificant trend' (3,442), 'failed to reach statistical significance' (2,578) and 'a strong trend' (1,700). The strongest evidence for a temporal increase in prevalence was found for 'a numerical trend', 'a positive trend', 'an increasing trend' and 'nominally significant'. The phrases 'all but significant', 'approaches statistical significance', 'did not quite reach statistical significance', 'difference was apparent', 'failed to reach statistical significance' and 'not quite significant' decreased over time. In the randomly sampled subset, 68.1% of the 11,926 identified P values fell between 0.05 and 0.15 (CI: 67.3-69.0; median 0.06). Conclusions: Our results demonstrate that phrases describing marginally significant results are regularly used in RCTs to report P values close to, but above, the dominant 0.05 cut-off. Phrase prevalence remained stable over time, despite all efforts to shift the focus from P < 0.05 to reporting effect sizes and corresponding confidence intervals. To improve transparency and encourage responsible interpretation of RCT results, researchers, clinicians, reviewers, and editors need to abandon the focus on formal statistical significance thresholds and instead report exact P values with corresponding effect sizes and confidence intervals.

Significance statement: The power of language to shape how readers interpret biomedical results should not be underestimated. Misreporting and misinterpretation are urgent problems in RCT output, and they may be at least partly related to the statistical paradigm of the 0.05 significance threshold. Clinical researchers may resort to inventive phrasing, describing their results as 'almost significant', to get their data published and to convince readers of the value of their work. Since 2005 there has been increasing concern that most published research findings are false, and researchers have been advised to switch from null hypothesis significance testing to effect sizes, estimation, and the cumulation of evidence. Whether this 'new statistics' approach has worked out should be reflected in the phrases used to describe non-significant RCT results, in particular in changing patterns for describing P values just above 0.05. We searched over half a million published RCTs for more than five hundred phrases potentially suited to reporting or discussing non-significant results. The overall prevalence of these phrases (10.87%; CI: 10.79-10.96; N = 61,741), with associated P values close to 0.05, was stable over the last three decades, although individual phrases describing near-significant results showed strong increases or decreases. The pressure to pass the scientific peer-review barrier may act as an incentive to use effective phrases that mask non-significant results in RCTs. This, however, keeps researchers preoccupied with hypothesis testing rather than with presenting outcome estimates and their uncertainty. The effect of language on getting RCT results published should ideally be minimal, to steer evidence-based medicine away from the overselling of research results and unsubstantiated claims about the efficacy of certain RCTs, and to prevent over-reliance on P value cut-offs. Our exhaustive search suggests that presenting RCT findings remains a struggle when P values approach the carved-in-stone threshold of 0.05.
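A minimal sketch of the text-mining step (a hypothetical four-phrase list and regex, not the authors' 505 phrases or their pipeline) illustrates how such phrases and their accompanying P values can be pulled from full texts:

```python
# A minimal sketch: count predefined near-significance phrases in an article's
# full text and extract P values appearing near each match.
import re

PHRASES = ["marginally significant", "all but significant",
           "a nonsignificant trend", "failed to reach statistical significance"]
P_VALUE = re.compile(r"[pP]\s*[=<>]\s*(0?\.\d+)")

def scan(full_text):
    """Return (phrase, nearby P values) pairs found in one article's text."""
    hits = []
    for phrase in PHRASES:
        for m in re.finditer(re.escape(phrase), full_text, re.IGNORECASE):
            window = full_text[max(0, m.start() - 100): m.end() + 100]
            hits.append((phrase, [float(v) for v in P_VALUE.findall(window)]))
    return hits

text = ("The reduction was marginally significant (P = 0.06), while the "
        "secondary outcome failed to reach statistical significance (P = 0.12).")
print(scan(text))
```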


Author(s):  
Peter Wills ◽  
Emanuel Knill ◽  
Kevin Coakley ◽  
Yanbao Zhang

Given a composite null hypothesis H0, test supermartingales are non-negative supermartingales with respect to H0 with an initial value of 1. Large values of a test supermartingale provide evidence against H0. As a result, test supermartingales are an effective tool for rejecting H0, particularly when the p-values obtained are very small and can serve as certificates against the null hypothesis. Examples include the rejection of local realism as an explanation of Bell test experiments in the foundations of physics and the certification of entanglement in quantum information science. Test supermartingales have the advantage of being adaptable during an experiment and of allowing arbitrary stopping rules; by inversion of acceptance regions, they can also be used to determine confidence sets. We use an example to compare the performance of test supermartingales for computing p-values and confidence intervals against Chernoff-Hoeffding bounds and the "exact" p-value. The example is the problem of inferring the probability of success in a sequence of Bernoulli trials. There is a cost to using a technique that places no restriction on stopping rules, and, for a particular test supermartingale, our study quantifies this cost.
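A minimal sketch of the Bernoulli setting (my construction of a likelihood-ratio test supermartingale; the parameter choices are illustrative, not the paper's) shows how the running product yields an any-time-valid p-value certificate via Ville's inequality:

```python
# A minimal sketch: a likelihood-ratio test supermartingale for the composite
# null H0 "success probability <= p0". Under H0 the product T is a
# non-negative supermartingale starting at 1, so by Ville's inequality
# min(1, 1/T) is a valid p-value under ANY stopping rule.
import numpy as np

rng = np.random.default_rng(7)
p0, q = 0.5, 0.75          # null bound and a fixed alternative (q > p0)
true_p = 0.7               # hypothetical data-generating probability

T = 1.0
for n, x in enumerate(rng.random(200) < true_p, start=1):
    T *= (q / p0) if x else ((1 - q) / (1 - p0))   # likelihood-ratio update
    if T >= 100:           # stop as soon as the certificate reaches p <= 0.01
        break

print(f"stopped after n = {n} trials; p-value certificate = {min(1.0, 1/T):.4g}")
```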

