Statistical Significance, p-Values, and the Reporting of Uncertainty

2021 ◽  
Vol 35 (3) ◽  
pp. 157-174
Author(s):  
Guido W. Imbens

The use of statistical significance and p-values has become a matter of substantial controversy in various fields using statistical methods. This has gone as far as some journals banning the use of indicators for statistical significance, or even any reports of p-values, and, in one case, any mention of confidence intervals. I discuss three of the issues that have led to these often-heated debates. First, I argue that in many cases, p-values and indicators of statistical significance do not answer the questions of primary interest. Such questions typically involve making (recommendations on) decisions under uncertainty. In that case, point estimates and measures of uncertainty in the form of confidence intervals or, even better, Bayesian intervals, are often more informative summary statistics. In fact, in that case, the presence or absence of statistical significance is essentially irrelevant, and including it in the discussion may confuse the matter at hand. Second, I argue that there are also cases where testing null hypotheses is a natural goal and where p-values are reasonable and appropriate summary statistics. I conclude that banning them in general is counterproductive. Third, I discuss how the overemphasis in empirical work on statistical significance has led to abuse of p-values in the form of p-hacking and publication bias. The use of pre-analysis plans and replication studies, in combination with lowering the emphasis on statistical significance, may help address these problems.
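As a minimal sketch of the contrast the abstract draws (not taken from the paper; the estimate, standard error, and prior are invented), the same hypothetical result can be reported as a bare p-value, as a point estimate with a 95% confidence interval, or with a simple normal-normal Bayesian credible interval:

```python
# Hedged illustration: hypothetical treatment-effect estimate summarized three ways.
import numpy as np
from scipy import stats

estimate, se = 1.8, 1.0          # hypothetical point estimate and standard error
z = estimate / se

# 1. p-value against the null of no effect
p_value = 2 * stats.norm.sf(abs(z))

# 2. 95% confidence interval
ci = (estimate - 1.96 * se, estimate + 1.96 * se)

# 3. Bayesian 95% credible interval with an assumed N(0, 2^2) prior on the effect
prior_var, like_var = 2.0 ** 2, se ** 2
post_var = 1 / (1 / prior_var + 1 / like_var)
post_mean = post_var * (estimate / like_var)     # prior mean is 0
cred = stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))

print(f"p = {p_value:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), "
      f"95% credible interval = ({cred[0]:.2f}, {cred[1]:.2f})")
```

Here the result is not significant at the 5% level, yet the intervals still convey where the effect plausibly lies, which is the information a decision-maker actually needs.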

BMJ Open ◽  
2019 ◽  
Vol 9 (11) ◽  
pp. e032506 ◽  
Author(s):  
Adrian Gerard Barnett ◽  
Jonathan D Wren

Objectives: Previous research has shown clear biases in the distribution of published p values, with an excess below the 0.05 threshold due to a combination of p-hacking and publication bias. We aimed to examine the bias for statistical significance using published confidence intervals. Design: Observational study. Setting: Papers published in Medline since 1976. Participants: Over 968 000 confidence intervals extracted from abstracts and over 350 000 intervals extracted from the full text. Outcome measures: Cumulative distributions of lower and upper confidence interval limits for ratio estimates. Results: We found an excess of statistically significant results, with a glut of lower intervals just above 1 and upper intervals just below 1. These excesses have not improved in recent years. The excesses did not appear in a set of over 100 000 confidence intervals that were not subject to p-hacking or publication bias. Conclusions: The huge excesses of published confidence intervals that are just below the statistically significant threshold are not statistically plausible. Large improvements in research practice are needed to provide more results that better reflect the truth.
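A rough sketch of the idea behind the study (my own simplification with simulated limits, not the authors' extraction pipeline): collect the lower confidence limits of ratio estimates and compare the share falling just below 1 with the share falling just above 1.

```python
# Hedged illustration: in the absence of selection on significance, the two
# shares should be similar; a large excess just above 1 is the signature of
# p-hacking and publication bias described in the abstract.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lower limits of 95% CIs for ratio estimates (e.g., odds ratios).
lower_limits = rng.lognormal(mean=0.0, sigma=0.3, size=10_000)

just_below = np.mean((lower_limits > 0.95) & (lower_limits <= 1.00))
just_above = np.mean((lower_limits > 1.00) & (lower_limits <= 1.05))

print(f"share in (0.95, 1.00]: {just_below:.3f}, share in (1.00, 1.05]: {just_above:.3f}")
```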


Author(s):  
Karl Schmedders ◽  
Charlotte Snyder ◽  
Ute Schaedel

Wall Street hedge fund manager Kim Meyer is considering investing in an SFA (slate financing arrangement) in Hollywood. Dave Griffith, a Hollywood producer, is pitching for the investment and has conducted a broad analysis of recent movie data to determine the important drivers of a movie’s success. In order to convince Meyer to invest in an SFA, Griffith must anticipate possible questions to maximize his persuasiveness. Students will analyze the factors driving a movie’s revenue using various statistical methods, including calculating point estimates, computing confidence intervals, conducting hypothesis tests, and developing regression models (in which they must both choose the relevant set of independent variables and determine an appropriate functional form for the regression equation). The case also requires the interpretation of the quantitative findings in the context of the application.
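A minimal sketch of the kind of analysis the case asks students to perform, using made-up data and illustrative variable names (budget and a franchise indicator are assumptions, not the case's actual variables); the log functional form stands in for the modeling choice the case requires.

```python
# Hedged illustration: point estimates, confidence intervals, and hypothesis
# tests from a regression of (log) movie revenue on hypothetical drivers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
budget = rng.uniform(5, 200, n)                      # production budget ($M), simulated
sequel = rng.integers(0, 2, n)                       # 1 if part of a franchise, simulated
log_revenue = 1.0 + 0.8 * np.log(budget) + 0.4 * sequel + rng.normal(0, 0.5, n)

X = sm.add_constant(np.column_stack([np.log(budget), sequel]))
fit = sm.OLS(log_revenue, X).fit()

print(fit.params)          # point estimates
print(fit.conf_int(0.05))  # 95% confidence intervals
print(fit.pvalues)         # hypothesis tests of zero coefficients
```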


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Statistical significance (p ≤ 0.05) is itself hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values, too, offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis or as a basis for falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
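The quoted replication rates follow from simple probability arithmetic, reproduced here as a quick check under the abstract's own setup (two independent studies with the same power against a true effect):

```python
# If each study detects a true effect with probability equal to its power,
# then P(both significant) = power**2 and
# P(exactly one significant, i.e. "conflicting" results) = 2 * power * (1 - power).
for power in (0.40, 0.80):
    both_significant = power ** 2
    conflicting = 2 * power * (1 - power)
    print(f"power {power:.0%}: both significant {both_significant:.2f}, "
          f"conflicting {conflicting:.2f}")

# power 40%: both significant 0.16 (about one in six, as quoted)
# power 80%: conflicting 0.32 (about one third, as quoted)
```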


2019 ◽  
Vol 81 (8) ◽  
pp. 535-542
Author(s):  
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like this in their science classes.
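A minimal classroom-style sketch of the independent-samples t-test the article walks through; the beak-depth numbers below are invented for illustration and are not the Grants' actual measurements.

```python
# Hedged illustration: comparing mean beak depth in two hypothetical samples.
from scipy import stats

before_drought = [8.9, 9.2, 8.7, 9.5, 9.0, 8.8, 9.3, 9.1]    # beak depth (mm), hypothetical
after_drought = [9.6, 9.9, 9.4, 10.1, 9.7, 9.8, 9.5, 10.0]   # hypothetical

t_stat, p_value = stats.ttest_ind(before_drought, after_drought)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value leads us to reject the null hypothesis that mean beak depth
# is the same in the two groups.
```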


2019 ◽  
Vol 12 (1) ◽  
pp. 205979911982651
Author(s):  
Michael Wood

In many fields of research, null hypothesis significance tests and p values are the accepted way of assessing the degree of certainty with which research results can be extrapolated beyond the sample studied. However, there are very serious concerns about the suitability of p values for this purpose. An alternative approach is to cite confidence intervals for a statistic of interest, but this does not directly tell readers how certain a hypothesis is. Here, I suggest how the framework used for confidence intervals could easily be extended to derive confidence levels, or “tentative probabilities,” for hypotheses. I also outline four quick methods for estimating these. This allows researchers to state their confidence in a hypothesis as a direct probability, instead of circuitously by p values referring to a hypothetical null hypothesis—which is usually not even stated explicitly. The inevitable difficulties of statistical inference mean that these probabilities can only be tentative, but probabilities are the natural way to express uncertainties, so, arguably, researchers using statistical methods have an obligation to estimate how probable their hypotheses are by the best available method. Otherwise, misinterpretations will fill the void.
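As a minimal sketch of the general idea (assuming a normal approximation; the numbers and hypotheses are invented, and this is not necessarily one of Wood's four specific methods), the same machinery that produces a confidence interval can yield a tentative probability that a hypothesis about the effect is true:

```python
# Hedged illustration: treat the estimate and its standard error as defining a
# normal "confidence distribution" and read off tail areas as tentative
# probabilities for directional hypotheses.
from scipy import stats

estimate, se = 0.30, 0.18          # hypothetical effect estimate and standard error

prob_effect_positive = stats.norm.cdf(estimate / se)            # tentative P(effect > 0)
prob_effect_above_0_1 = stats.norm.cdf((estimate - 0.1) / se)   # tentative P(effect > 0.1)

print(f"Tentative P(effect > 0)   = {prob_effect_positive:.2f}")
print(f"Tentative P(effect > 0.1) = {prob_effect_above_0_1:.2f}")
```

The probabilities are tentative in exactly the sense the abstract describes, but they address the hypothesis of interest directly rather than a null hypothesis that was never the question.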


2009 ◽  
Vol 8 (4) ◽  
pp. 545-557 ◽  
Author(s):  
JAMES J. FETZER

This paper examines how to make inferences from econometric models prepared for antidumping, countervailing duty, and safeguard investigations. Analysis of these models has typically entailed drawing inferences from point estimates that are significantly different from zero at a fixed level of confidence. This paper suggests a more flexible approach of drawing inferences using confidence intervals at various significance levels and reporting p-values for the relevant test of injury. Use of confidence intervals and p-values to identify insights and data patterns would have more impact on USITC trade remedy determinations than definitive conclusions about injury based on whether estimates are statistically significant.
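A minimal sketch of the style of reporting the paper advocates (my own construction; the coefficient, its standard error, and the variable it describes are hypothetical): give the p-value for the injury test together with intervals at several confidence levels rather than a single significant/not-significant verdict.

```python
# Hedged illustration: one hypothetical coefficient reported several ways.
from scipy import stats

estimate, se = -0.042, 0.025   # hypothetical effect of subject imports on domestic prices

p_value = 2 * stats.norm.sf(abs(estimate / se))
print(f"p-value for the test of no injury effect: {p_value:.3f}")

for level in (0.80, 0.90, 0.95, 0.99):
    z = stats.norm.ppf(0.5 + level / 2)
    print(f"{level:.0%} CI: ({estimate - z * se:.3f}, {estimate + z * se:.3f})")
```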


2020 ◽  
Author(s):  
Simon Turner ◽  
Amalia Karahalios ◽  
Andrew Forbes ◽  
Monica Taljaard ◽  
Jeremy Grimshaw ◽  
...  

Background: The Interrupted Time Series (ITS) is a quasi-experimental design commonly used in public health to evaluate the impact of interventions or exposures. Multiple statistical methods are available to analyse data from ITS studies, but no empirical investigation has examined how the different methods compare when applied to real-world datasets. Methods: A random sample of 200 ITS studies identified in a previous methods review was included. Time series data from each of these studies were sought. Each dataset was re-analysed using six statistical methods. Point and confidence interval estimates for level and slope changes, standard errors, p-values and estimates of autocorrelation were compared between methods. Results: From the 200 ITS studies, which included 230 time series, 190 datasets were obtained. We found that the choice of statistical method can materially affect the level and slope change point estimates, their standard errors, the width of confidence intervals and p-values. Statistical significance (categorised at the 5% level) often differed across the pairwise comparisons of methods, ranging from 4% to 25% disagreement. Estimates of autocorrelation differed depending on the method used and the length of the series. Conclusions: The choice of statistical method in ITS studies can lead to substantially different conclusions about the impact of the interruption. Pre-specification of the statistical method is encouraged, and naive conclusions based on statistical significance should be avoided.
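As a minimal sketch of one candidate method for such an analysis (segmented ordinary least squares on a simulated series; the abstract's six methods, which likely include alternatives such as restricted maximum likelihood or ARIMA models, are not reproduced here), the quantities being compared across methods are the level-change and slope-change estimates, their intervals and p-values, and a check on residual autocorrelation:

```python
# Hedged illustration: segmented OLS for an interrupted time series.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
n_points, interruption = 48, 24
time = np.arange(n_points)
post = (time >= interruption).astype(float)               # level-change indicator
time_since = np.where(post == 1, time - interruption, 0)  # slope-change term

# Simulated outcome: baseline trend plus a jump and a steeper post-interruption slope.
y = 10 + 0.1 * time + 2.0 * post + 0.3 * time_since + rng.normal(0, 1, n_points)

X = sm.add_constant(np.column_stack([time, post, time_since]))
fit = sm.OLS(y, X).fit()

level_change, slope_change = fit.params[2], fit.params[3]
print(f"level change {level_change:.2f}, slope change {slope_change:.2f}")
print(fit.conf_int(0.05)[2:])   # 95% CIs for the level and slope changes
print(fit.pvalues[2:])          # corresponding p-values
print(f"Durbin-Watson (residual autocorrelation check): {durbin_watson(fit.resid):.2f}")
```

Swapping this OLS fit for a method that models autocorrelation differently can shift the estimates, intervals, and p-values, which is exactly the sensitivity the study measures.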


2021 ◽  
pp. bmjebm-2020-111603
Author(s):  
John Ferguson

Commonly accepted statistical advice dictates that large, highly powered clinical trials generate more reliable evidence than trials with smaller sample sizes. This advice is generally sound: treatment effect estimates from larger trials tend to be more accurate, as witnessed by tighter confidence intervals in addition to reduced publication biases. Consider, then, two clinical trials testing the same treatment that result in the same p values, the trials being identical apart from differences in sample size. Assuming statistical significance, one might at first suspect that the larger trial offers stronger evidence that the treatment in question is truly effective. Yet, often precisely the opposite will be true. Here, we illustrate and explain this somewhat counterintuitive result and suggest some ramifications regarding interpretation and analysis of clinical trial results.
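A small worked example of one piece of the intuition (my own construction, not the paper's analysis; the p-value, outcome standard deviation, and sample sizes are assumed): because the larger trial has a much smaller standard error, the same p-value corresponds to a much smaller implied treatment effect.

```python
# Hedged illustration: back out the effect estimate implied by a fixed p-value
# in two two-arm trials that differ only in sample size.
from math import sqrt
from scipy import stats

p_value, sd = 0.04, 1.0                     # same two-sided p; outcome SD assumed known
z = stats.norm.ppf(1 - p_value / 2)         # |z| implied by the p-value

for n_per_arm in (100, 10_000):
    se = sd * sqrt(2 / n_per_arm)           # SE of a difference in means
    implied_effect = z * se
    print(f"n per arm = {n_per_arm:>6}: implied treatment effect = {implied_effect:.3f}")

# The larger trial's implied effect is ten times smaller, so identical p-values
# do not carry identical evidence of a clinically meaningful benefit.
```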


2019 ◽  
Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they visualize confidence intervals around the estimates and generally center the plot around zero, meaning that any estimate whose interval crosses zero is not statistically significant at the alpha level used to construct the confidence intervals. For models whose statistical significance is determined via randomization-based inference, and for which there is no standard error or confidence interval for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value and its associated confidence interval in relation to a specified alpha level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.
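A minimal sketch of the ingredients such a plot needs (my own construction with simulated data, not the paper's models or its activist sample): a permutation p-value for an association and a rough confidence interval for that p-value reflecting Monte Carlo error in the permutations.

```python
# Hedged illustration: permutation p-value plus a normal-approximation interval
# for the Monte Carlo uncertainty in that p-value.
import numpy as np

rng = np.random.default_rng(7)
n, n_perm = 60, 5000
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

observed = abs(np.corrcoef(x, y)[0, 1])          # stand-in test statistic
perm_stats = np.array([abs(np.corrcoef(rng.permutation(x), y)[0, 1])
                       for _ in range(n_perm)])
exceed = np.sum(perm_stats >= observed)

p_hat = (exceed + 1) / (n_perm + 1)              # permutation p-value
se_p = np.sqrt(p_hat * (1 - p_hat) / n_perm)     # Monte Carlo standard error
ci = (max(0.0, p_hat - 1.96 * se_p), p_hat + 1.96 * se_p)

print(f"permutation p = {p_hat:.4f}, approximate 95% Monte Carlo CI = "
      f"({ci[0]:.4f}, {ci[1]:.4f})")
```

Plotting each coefficient's p-value with such an interval against the chosen alpha level is the visual analogue of the conventional coefficient plot described in the abstract.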

