Comparing statistical significance based on P-values with the probability of replicating a result less extreme than the null hypothesis to make evidence more replicable

Author(s):  
Huw Llewelyn


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
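A minimal sketch of the arithmetic behind the replication figures quoted above (one in six at 40% power, one third conflicting at 80%), assuming two independent studies with identical statistical power and a true effect; the function name is purely illustrative.

```python
# Probability that two independent studies agree or conflict in terms of
# "significance", assuming a true effect and identical power in both studies.
# Sketch of the arithmetic behind the figures quoted in the abstract.

def replication_of_significance(power):
    both_significant = power * power       # both studies reach p <= 0.05
    conflicting = 2 * power * (1 - power)  # one significant, the other not
    return both_significant, conflicting

for power in (0.40, 0.80):
    both, conflict = replication_of_significance(power)
    print(f"power={power:.0%}: both significant={both:.2f}, conflicting={conflict:.2f}")

# power=40%: both significant = 0.16 (about one in six), conflicting = 0.48
# power=80%: both significant = 0.64, conflicting = 0.32 (about one third)
```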


2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is also hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
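A minimal sketch, with invented data, of reporting a point estimate together with an interval estimate (a 95% confidence interval) instead of a significance verdict, as the abstract recommends; the group sizes, the Welch-style standard error and the rough degrees of freedom are illustrative choices.

```python
# Report a point estimate and a 95% confidence interval for a mean difference
# instead of a significant/nonsignificant verdict. Data are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=25)
group_b = rng.normal(loc=11.0, scale=2.0, size=25)

diff = group_b.mean() - group_a.mean()                 # point estimate
se = np.sqrt(group_a.var(ddof=1) / len(group_a) +
             group_b.var(ddof=1) / len(group_b))       # Welch-style standard error
df = len(group_a) + len(group_b) - 2                   # rough df, for illustration only
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"estimated difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# The interval shows the range of true effect sizes most compatible with the data,
# information that a bare significant/nonsignificant label would hide.
```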


Author(s):  
Jackie Sham ◽  
BCIT School of Health Sciences, Environmental Health ◽  
Vanessa Karakilic ◽  
Kevin Soulsbury ◽  
Fred Shaw

Background and Purpose: Electronic cigarettes are gaining vast popularity because they are perceived as a safer alternative to conventional smoking (Belluz, 2015). As a result, more teenagers are switching to electronic cigarettes, either as a smoking cessation tool or for recreational use. However, the evidence review supports that there is nicotine mislabeling, i.e., a discrepancy between what the manufacturer has labeled and the actual nicotine content of the liquids (Goniewicz et al., 2012). This is a critical health concern for teenagers and recreational users because they are exposed to nicotine, a neurotoxin that creates the addiction to smoking. As a result, over time, recreational electronic cigarette users have a higher chance of switching to conventional smoking (Bach, 2015). Hence, the purpose of this research is to determine whether nicotine can be found in nicotine-free electronic cigarette liquids. Methods: The nicotine content in the electronic cigarette liquids was determined using Gas Chromatography-Mass Spectrometry. Inferential statistics, in the form of a one-tailed t-test, were carried out using Microsoft Excel and SAS to see whether nicotine could be detected in nicotine-free electronic cigarette liquids and whether there was a statistically significant difference. Results: The two p-values from the parametric test were 0.2811 and 0.2953. The p-value threshold to reject the null hypothesis was set at 0.05. Because the p-values from the inferential statistics were greater than 0.05, the null hypothesis was not rejected, and the actual nicotine content was taken to be equal to what the manufacturer had labeled as nicotine free. Discussion: Although the inferential statistics indicated no statistically significant difference in nicotine concentration, two out of the ten nicotine-free electronic cigarette liquids measured nicotine levels above 0 ppm. Conclusion: There was no significant difference in nicotine concentration found in the electronic cigarette liquids, and the actual nicotine concentration was judged equal to the labeled concentration. However, because a sample size of only ten is too small, there is potential for Type II error. Also, the samples came from only two manufacturers. Therefore, the results from this research are not representative of all electronic cigarette liquids. More research should be conducted to provide scientific evidence to discourage recreational use of electronic cigarettes, as these could act as a stepping-stone towards smoking conventional cigarettes. Teenagers who start smoking at an early age will be more
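A minimal sketch of the kind of one-tailed, one-sample t-test described above, testing measured nicotine concentration against the labeled value of zero; the measurement values and sample size are invented for illustration and are not the study's data.

```python
# One-tailed one-sample t-test of measured nicotine concentration against the
# labeled value of 0 (nicotine free). The measurements are invented, not the study's data.
import numpy as np
from scipy import stats

measured_ppm = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.7])

# H0: mean concentration = 0; H1: mean concentration > 0 (one-tailed)
t_stat, p_value = stats.ttest_1samp(measured_ppm, popmean=0.0,
                                    alternative='greater')
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")

# For these toy data, p stays above the preset 0.05 threshold, so H0 is not
# rejected; with n = 10 the test has little power, so a Type II error cannot
# be ruled out even though two samples show nonzero readings.
```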


Author(s):  
Jean-Frédéric Morin ◽  
Christian Olsson ◽  
Ece Özlem Atikcan

This chapter highlights statistical significance. The key question in quantitative analysis is whether a pattern observed in a sample also holds for the population from which the sample was drawn. A positive answer to this question implies that the result is ‘statistically significant’, i.e. it was not produced by a random variation from sample to sample but instead reflects the pattern that exists in the population. The null hypothesis statistical test (NHST) has been a widely used approach for testing whether inference from a sample to the population is valid. Seeking to test whether valid inferences about the population could be made based on the results from a single sample, a researcher should consider a wide variety of approaches and take into account not only p-values, but also the sampling process, sample size, the quality of measurement, and other factors that may influence the reliability of estimates.
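A minimal simulation sketch of the chapter's core question, namely whether a pattern observed in a single sample could have been produced by random sample-to-sample variation alone; the population model, sample size and observed value below are invented for illustration.

```python
# Could a pattern observed in one sample be produced by random sample-to-sample
# variation alone? All numbers below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 40                 # hypothetical sample size
observed_mean = 0.30   # hypothetical pattern seen in the one sample at hand

# Null hypothesis: no pattern in the population (mean 0, sd 1). Draw many samples
# under the null and see how often random variation alone produces a sample mean
# at least as extreme as the observed one.
null_means = rng.normal(loc=0.0, scale=1.0, size=(10_000, n)).mean(axis=1)
p_value = np.mean(np.abs(null_means) >= observed_mean)
print(f"simulated two-sided p-value = {p_value:.3f}")

# A small p-value says such a pattern rarely arises by chance under the null,
# but, as the chapter stresses, the sampling process, sample size and quality of
# measurement also bear on whether the inference to the population is valid.
```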


2019 ◽  
Author(s):  
Estibaliz Gómez-de-Mariscal ◽  
Alexandra Sneider ◽  
Hasini Jayatilaka ◽  
Jude M. Phillip ◽  
Denis Wirtz ◽  
...  

Biomedical research has come to rely on p-values to determine potential translational impact. The p-value is routinely compared with a threshold commonly set to 0.05 to assess the significance of the null hypothesis. Whenever a large enough dataset is available, this threshold is easily reachable. This phenomenon is known as p-hacking and it leads to spurious conclusions. Herein, we propose a systematic and easy-to-follow protocol that models the p-value as an exponential function to test the existence of real statistical significance. This new approach provides a robust assessment of the null hypothesis with accurate values for the minimum data size needed to reject it. An in-depth study of the model is carried out in both simulated and experimentally-obtained data. Simulations show that under controlled data, our assumptions are true. The results of our analysis in the experimental datasets reflect the large scope of this approach in common decision-making processes.
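A minimal sketch of the general idea, assuming an exponential decay p(n) ≈ a·exp(-c·n) fitted to p-values computed at increasing sample sizes, with the crossing point of a threshold read off the fitted curve; the simulated two-group data, the Mann-Whitney test and the fitting choices are illustrative, not the authors' published protocol.

```python
# Compute p-values at increasing sample sizes, fit an exponential decay
# p(n) ~ a * exp(-c * n), and read off the sample size at which the fitted
# curve crosses a threshold. Illustration only; data are simulated.
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

rng = np.random.default_rng(42)
sizes = np.arange(10, 210, 10)

# Simulated two-group comparison with a small true effect.
p_values = []
for n in sizes:
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.3, 1.0, n)
    p_values.append(stats.mannwhitneyu(a, b, alternative='two-sided').pvalue)
p_values = np.array(p_values)

def exp_decay(n, amp, rate):
    return amp * np.exp(-rate * n)

(amp, rate), _ = curve_fit(exp_decay, sizes, p_values, p0=(1.0, 0.01))

alpha = 0.05
n_min = np.log(amp / alpha) / rate   # size where the fitted curve crosses alpha
print(f"fitted p(n) = {amp:.2f} * exp(-{rate:.3f} * n); p < {alpha} near n = {n_min:.0f}")
```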


2017 ◽  
Author(s):  
Jan Peter De Ruiter

Benjamin et al. (2017) proposed to improve the reproducibility of findings in psychological research by lowering the alpha level of our conventional Null Hypothesis Significance Tests from .05 to .005, because findings with p-values close to .05 represent insufficient evidence. This proposal was criticized and rejected in a commentary by Lakens et al. (2017), who argued that (a) the empirical evidence for the effectiveness of such a policy is weak, (b) the theoretical arguments for the effectiveness of such a policy are weak, and (c) the proposal also has negative consequences for reproducibility. In this contribution, I argue that the arguments by Lakens et al. are either unconvincing or are in fact arguments in favor of the proposal by Benjamin et al.
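A rough normal-approximation sketch of one practical point in this exchange: how the sample size required for a fixed power grows when alpha is lowered from .05 to .005. The effect size and power are illustrative choices, not values taken from either paper.

```python
# Back-of-the-envelope (normal approximation): required n per group for a
# two-sample comparison at fixed power when alpha drops from .05 to .005.
# Effect size d and power are illustrative, not values from the papers.
from scipy.stats import norm

def n_per_group(effect_size, alpha, power, two_sided=True):
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

d, power = 0.5, 0.80
for alpha in (0.05, 0.005):
    print(f"alpha={alpha}: about {n_per_group(d, alpha, power):.0f} per group")

# In this example, lowering alpha from .05 to .005 raises the required
# n per group by roughly 70% at the same power.
```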


Genes ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 1160
Author(s):  
Atsuko Okazaki ◽  
Sukanya Horpaopan ◽  
Qingrun Zhang ◽  
Matthew Randesi ◽  
Jurg Ott

Some genetic diseases (“digenic traits”) are due to the interaction between two DNA variants, which presumably reflects biochemical interactions. For example, certain forms of Retinitis Pigmentosa, a type of blindness, occur in the presence of two mutant variants, one each in the ROM1 and RDS genes, while the occurrence of only one such variant results in a normal phenotype. Detecting variant pairs underlying digenic traits by standard genetic methods is difficult and is downright impossible when individual variants alone have minimal effects. Frequent pattern mining (FPM) methods are known to detect patterns of items. We make use of FPM approaches to find pairs of genotypes (from different variants) that can discriminate between cases and controls. Our method is based on genotype patterns of length two, and permutation testing allows assigning p-values to genotype patterns, where the null hypothesis refers to equal pattern frequencies in cases and controls. We compare different interaction search approaches and their properties on the basis of published datasets. Our implementation of FPM for case-control studies is freely available.
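A minimal sketch of a permutation test for a single genotype pattern of length two, comparing its frequency in cases and controls under the null hypothesis of equal pattern frequencies; the toy genotype data are invented and this is not the authors' FPM implementation.

```python
# Permutation test for one genotype pattern of length two: is the pattern
# (variant 1 == genotype 2 AND variant 2 == genotype 2) more frequent in cases
# than in controls? Toy data; not the authors' FPM implementation.
import numpy as np

rng = np.random.default_rng(7)

# Genotypes coded 0/1/2 (copies of the minor allele) for two variants,
# 200 cases followed by 200 controls (invented data).
n_cases, n_controls = 200, 200
geno1 = rng.integers(0, 3, size=n_cases + n_controls)
geno2 = rng.integers(0, 3, size=n_cases + n_controls)
is_case = np.array([True] * n_cases + [False] * n_controls)

pattern = (geno1 == 2) & (geno2 == 2)      # the length-two genotype pattern

def freq_diff(case_mask):
    return pattern[case_mask].mean() - pattern[~case_mask].mean()

observed = freq_diff(is_case)

# Null hypothesis: equal pattern frequency in cases and controls.
# Permute case/control labels to build the null distribution.
n_perm = 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    perm_diffs[i] = freq_diff(rng.permutation(is_case))

p_value = (np.sum(np.abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
print(f"observed frequency difference = {observed:.3f}, permutation p = {p_value:.4f}")
```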


2021 ◽  
pp. 1-10
Author(s):  
Mansour H. Al-Askar ◽  
Fahad A. Abdullatif ◽  
Abdulmonem A. Alshihri ◽  
Asma Ahmed ◽  
Darshan Devang Divakar ◽  
...  

BACKGROUND AND OBJECTIVE: The aim of this study was to compare the efficacy of photobiomodulation therapy (PBMT) and photodynamic therapy (PDT) as adjuncts to mechanical debridement (MD) for the treatment of peri-implantitis. The present study is based on the null hypothesis that there is no difference in the peri-implant inflammatory parameters (modified plaque index [mPI], modified gingival index [mGI], probing depth [PD]) and crestal bone loss (CBL) following MD with either PBMT or PDT in patients with peri-implantitis. METHODS: Forty-nine patients with peri-implantitis were randomly categorized into three groups. In Groups 1 and 2, patients underwent MD with adjunct PBMT and PDT, respectively. In Group 3, patients underwent MD alone (controls). Peri-implant inflammatory parameters were measured at baseline and at 3-month follow-up. P-values < 0.01 were considered statistically significant. RESULTS: At baseline, peri-implant clinicoradiographic parameters were comparable in all groups. Compared with baseline, there was a significant reduction in mPI (P < 0.001), mGI (P < 0.001) and PD (P < 0.001) in Groups 1 and 2 at 3-month follow-up. In Group 3, there was no difference in the scores of mPI, mGI and PD at follow-up. At 3-month follow-up, there was no difference in mPI, mGI and PD between Groups 1 and 2. The mPI (P < 0.001), mGI (P < 0.001) and PD (P < 0.001) were significantly higher in Group 3 than in Groups 1 and 2. The CBL was comparable in all groups at follow-up. CONCLUSION: PBMT and PDT seem to be useful adjuncts to MD for the treatment of peri-implant soft-tissue inflammation among patients with peri-implantitis.


Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 603
Author(s):  
Leonid Hanin

I uncover previously underappreciated systematic sources of false and irreproducible results in the natural, biomedical and social sciences that are rooted in statistical methodology. They include the inevitably occurring deviations from basic assumptions behind statistical analyses and the use of various approximations. I show through a number of examples that (a) arbitrarily small deviations from distributional homogeneity can lead to arbitrarily large deviations in the outcomes of statistical analyses; (b) samples of random size may violate the Law of Large Numbers and thus are generally unsuitable for conventional statistical inference; (c) the same is true, in particular, when random sample size and observations are stochastically dependent; and (d) the use of the Gaussian approximation based on the Central Limit Theorem has dramatic implications for p-values and statistical significance, essentially making the pursuit of small significance levels and p-values for a fixed sample size meaningless. The latter is proven rigorously in the case of the one-sided Z test. This article could serve as cautionary guidance to scientists and practitioners employing statistical methods in their work.
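A minimal simulation sketch of point (d): for a skewed population, the one-sided Z test's Gaussian-approximation p-value can drift away from the true tail probability exactly in the small-p region. The exponential population, sample size and cutoffs are illustrative choices, not the article's rigorous argument.

```python
# Point (d), illustrated by simulation: for a skewed population, the nominal
# p-value of a one-sided Z test based on the Gaussian (CLT) approximation can
# differ sharply from the true tail probability in the small-p region.
# Population, sample size and cutoffs are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 30            # sample size per study
n_sim = 200_000   # number of simulated studies

# Skewed population with mean 1 and variance 1 (standard exponential).
samples = rng.exponential(scale=1.0, size=(n_sim, n))
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))   # Z statistic under H0

for z_cut in (1.645, 2.326, 3.090):   # nominal one-sided p = .05, .01, .001
    nominal_p = 1 - stats.norm.cdf(z_cut)
    actual_p = np.mean(z >= z_cut)    # true tail probability by simulation
    print(f"nominal p = {nominal_p:.4f}, actual tail probability = {actual_p:.4f}")

# The relative discrepancy grows as the nominal p shrinks, which is the sense in
# which chasing very small p-values at fixed n outruns the Gaussian approximation.
```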

