Statistical Significance

Author(s):  
Jean-Frédéric Morin ◽  
Christian Olsson ◽  
Ece Özlem Atikcan

This chapter highlights statistical significance. The key question in quantitative analysis is whether a pattern observed in a sample also holds for the population from which the sample was drawn. A positive answer to this question implies that the result is 'statistically significant', i.e. it was not produced by random variation from sample to sample but instead reflects a pattern that exists in the population. The null hypothesis significance test (NHST) has been a widely used approach for testing whether inference from a sample to the population is valid. When assessing whether valid inferences about the population can be drawn from a single sample, a researcher should consider a wide variety of approaches and take into account not only p-values but also the sampling process, the sample size, the quality of measurement, and other factors that may influence the reliability of estimates.
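As a minimal illustration of the NHST logic described above, the sketch below tests whether a hypothetical sample mean is compatible with an assumed population value; the data, the hypothesized value of 100, and the use of a one-sample t-test are illustrative assumptions, not part of the chapter.

```python
# Minimal NHST sketch on hypothetical data: does the sample give evidence
# that the population mean differs from a hypothesized value of 100?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=102.0, scale=15.0, size=50)   # hypothetical sample of n = 50

t_stat, p_value = stats.ttest_1samp(sample, popmean=100.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the observed pattern is unlikely under the null,
# but the inference is only as trustworthy as the sampling process,
# the sample size, and the quality of measurement behind the data.
```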

Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 603
Author(s):  
Leonid Hanin

I uncover previously underappreciated systematic sources of false and irreproducible results in the natural, biomedical and social sciences that are rooted in statistical methodology. They include the inevitably occurring deviations from the basic assumptions behind statistical analyses and the use of various approximations. I show through a number of examples that (a) arbitrarily small deviations from distributional homogeneity can lead to arbitrarily large deviations in the outcomes of statistical analyses; (b) samples of random size may violate the Law of Large Numbers and thus are generally unsuitable for conventional statistical inference; (c) the same is true, in particular, when random sample size and observations are stochastically dependent; and (d) the use of the Gaussian approximation based on the Central Limit Theorem has dramatic implications for p-values and statistical significance, essentially making the pursuit of small significance levels and p-values at a fixed sample size meaningless. The latter is proven rigorously for the one-sided Z test. This article could serve as cautionary guidance for scientists and practitioners employing statistical methods in their work.
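A hedged simulation of point (d), not taken from the article: for a skewed parent distribution, the normal approximation to the standardized sample mean can be badly off exactly in the tail region where small p-values are computed. The exponential parent, the sample size of 40, and the cut-offs are assumptions for illustration.

```python
# Compare the simulated tail probability of a standardized sample mean with
# the one-sided normal (Z-test) approximation, for a skewed parent distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 40, 200_000
# Standardized means of exponential(1) samples (true mean = true SD = 1).
z = np.sqrt(n) * (rng.exponential(1.0, size=(reps, n)).mean(axis=1) - 1.0)

for z0 in (2.0, 2.5, 3.0):
    true_tail = (z >= z0).mean()        # simulated "exact" tail probability
    approx_tail = stats.norm.sf(z0)     # p-value from the normal approximation
    print(f"z0 = {z0}: simulated {true_tail:.5f} vs normal approximation {approx_tail:.5f}")
# The relative error grows as the nominal p-value shrinks, so chasing ever
# smaller p-values at a fixed sample size can outrun the accuracy of the
# approximation used to compute them.
```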


2016 ◽  
Vol 11 (4) ◽  
pp. 551-554 ◽  
Author(s):  
Martin Buchheit

The first sport-science-oriented and comprehensive paper on magnitude-based inferences (MBI) was published 10 years ago in the first issue of this journal. While debate continues, MBI is today well established in sport science and in other fields, particularly clinical medicine, where practical/clinical significance often takes priority over statistical significance. In this commentary, some reasons why both academics and sport scientists should abandon null-hypothesis significance testing and embrace MBI are reviewed. Apparent limitations and future areas of research are also discussed. The following arguments are presented: P values and, in turn, study conclusions are sample-size dependent, irrespective of the size of the effect; significance does not convey the magnitude of an effect, yet magnitude is what matters most; MBI allows authors to be honest about their sample size and to better acknowledge trivial effects; examining magnitudes per se helps produce better research questions; MBI can be applied to assess changes in individuals; MBI improves data visualization; and MBI is supported by spreadsheets freely available on the Internet. Finally, recommendations for defining the smallest important effect and improving the presentation of standardized effects are presented.
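The first argument, that p-values are sample-size dependent irrespective of the size of the effect, can be checked directly; the standardized effect of d = 0.2 and the sample sizes below are assumptions for illustration, not values from the commentary.

```python
# A fixed standardized effect (Cohen's d) paired with growing sample sizes:
# the effect never changes, but the p-value does.
import numpy as np
from scipy import stats

d = 0.2  # small, fixed standardized effect (assumed for illustration)
for n in (20, 50, 200, 1000, 5000):
    t = d * np.sqrt(n)                  # one-sample t statistic implied by d
    p = 2 * stats.t.sf(t, df=n - 1)     # two-sided p-value
    print(f"n = {n:5d}  d = {d}  p = {p:.4f}")
# The same magnitude drifts from "nonsignificant" to "highly significant"
# purely because the sample grows, which is why significance alone says
# little about whether the effect is practically meaningful.
```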


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values can tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is itself hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g. from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
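The two replication figures quoted above follow from simple probability (0.4 × 0.4 ≈ 1/6 of studies both significant; 2 × 0.8 × 0.2 ≈ 1/3 of pairs conflicting) and can be checked by simulation; the Z-test framing and the parameters below are assumptions for illustration.

```python
# Simulate pairs of independent studies of the same true effect and count how
# often significance does or does not replicate at a given statistical power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, reps = 0.05, 200_000
z_crit = stats.norm.isf(alpha / 2)                  # two-sided critical value

def replication_rates(power):
    # Noncentrality giving (approximately) the requested power for a two-sided Z test.
    delta = z_crit - stats.norm.isf(power)
    z1 = rng.normal(delta, 1.0, reps)
    z2 = rng.normal(delta, 1.0, reps)
    sig1, sig2 = np.abs(z1) > z_crit, np.abs(z2) > z_crit
    return (sig1 & sig2).mean(), (sig1 ^ sig2).mean()

both, _ = replication_rates(0.40)
print(f"power 40%: both studies significant in {both:.2f} of cases")       # about 1 in 6
_, conflict = replication_rates(0.80)
print(f"power 80%: conflicting significance in {conflict:.2f} of cases")   # about 1 in 3
```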


2021 ◽  
pp. bmjebm-2020-111603
Author(s):  
John Ferguson

Commonly accepted statistical advice dictates that clinical trials with large sample sizes and high power generate more reliable evidence than trials with smaller sample sizes. This advice is generally sound: treatment effect estimates from larger trials tend to be more accurate, as reflected in tighter confidence intervals and reduced publication bias. Consider, then, two clinical trials testing the same treatment that result in the same p value, the trials being identical apart from differences in sample size. Assuming statistical significance, one might at first suspect that the larger trial offers stronger evidence that the treatment in question is truly effective. Yet often precisely the opposite will be true. Here, we illustrate and explain this somewhat counterintuitive result and suggest some ramifications regarding the interpretation and analysis of clinical trial results.
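One way to see why the result is plausible, not taken from the article itself: for a fixed p-value, the effect estimate implied by a trial shrinks as the sample size grows. The one-sample Z-test framing, the unit outcome SD, and the numbers below are assumptions for illustration.

```python
# Two trials reporting the same two-sided p-value but very different sample
# sizes imply very different estimated treatment effects.
from scipy import stats

p = 0.04                                   # identical p-value for both trials (assumed)
z = stats.norm.isf(p / 2)                  # |Z| statistic implied by that p-value
for n in (100, 10_000):
    effect = z / n ** 0.5                  # estimated effect = z * SE = z / sqrt(n)
    print(f"n = {n:6d}: estimated effect ≈ {effect:.3f} SD units")
# The smaller trial's estimate is ten times larger, so for a fixed p-value the
# smaller trial points to a more substantial treatment effect, while the larger
# trial may be "significant" for an effect of little clinical relevance.
```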


2017 ◽  
Vol 16 (3) ◽  
pp. 1
Author(s):  
Laura Badenes-Ribera ◽  
Dolores Frias-Navarro

'Evidence-based practice' requires professionals to critically assess the results of psychological research. However, incorrect interpretations of p values are abundant and recurrent. These misconceptions affect professional decisions and compromise the quality of interventions and the accumulation of valid scientific knowledge. Identifying the types of fallacy that underlie statistical decisions is fundamental for approaching and planning statistical education strategies designed to correct these misinterpretations. Therefore, the aim of this study is to analyze the interpretation of the p value among psychology undergraduates and academic psychologists. The sample consisted of 161 participants (43 academics and 118 students). The mean length of service as an academic was 16.7 years (SD = 10.07). The mean age of the students was 21.59 years (SD = 1.3). The findings suggest that neither students nor academics know the correct interpretation of the p value. The inverse probability fallacy presents the greatest comprehension problems. In addition, statistical significance is confused with practical or clinical significance. These results highlight the need for statistical education and re-education.
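The inverse probability fallacy mentioned above treats a p-value as the probability that the null hypothesis is true. A hedged simulation, with the prior proportion of true effects and the effect size as illustrative assumptions, shows how different the two quantities can be.

```python
# Among results that are "just significant", how often is the null actually true?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reps = 400_000
prior_true = 0.5          # assumed: half of the tested hypotheses are real effects
effect_z = 2.0            # assumed mean Z shift when the effect is real

is_real = rng.random(reps) < prior_true
z = rng.normal(np.where(is_real, effect_z, 0.0), 1.0)
p = 2 * stats.norm.sf(np.abs(z))

just_significant = (p > 0.03) & (p < 0.05)          # p-values hovering just below .05
frac_null = np.mean(~is_real[just_significant])
print(f"Among p-values in (0.03, 0.05), the null is true in {frac_null:.0%} of cases")
# This fraction is far above 3-5%, so reading "p < .05" as "the null hypothesis
# has less than a 5% probability of being true" is exactly the fallacy at issue.
```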


2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is itself hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g. from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
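The claim that, with anything but ideal power, significant effect sizes will be biased upwards can be illustrated by simulation; the true effect, sample size, and outcome SD below are assumptions, not values from the preprint.

```python
# Significance filtering inflates effect sizes: average only the estimates that
# cross the p <= 0.05 threshold and compare with the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_effect, sd, n, reps = 0.2, 1.0, 50, 100_000

se = sd / np.sqrt(n)                            # standard error of each estimate
est = rng.normal(true_effect, se, reps)         # simulated effect estimates
p = 2 * stats.norm.sf(np.abs(est) / se)         # two-sided p-values
sig = p <= 0.05

print(f"true effect:                   {true_effect}")
print(f"mean of all estimates:         {est.mean():.3f}")
print(f"mean of significant estimates: {est[sig].mean():.3f}")
# A significance-filtered literature effectively averages only the last line,
# overstating the effect whenever power is well below 100%.
```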


Author(s):  
Holly Schaafsma ◽  
Holly Laasanen ◽  
Jasna Twynstra ◽  
Jamie A. Seabrook

Despite the widespread use of statistical techniques in quantitative research, methodological flaws and inadequate statistical reporting persist. The objective of this study is to evaluate the quality of statistical reporting and procedures in all original, quantitative articles published in the Canadian Journal of Dietetic Practice and Research (CJDPR) from 2010 to 2019, using a checklist created by our research team. In total, 107 articles were independently evaluated by 2 raters. The hypothesis or objective(s) was clearly stated in 97.2% of the studies. Over half (51.4%) of the articles reported the study design, and 57.9% adequately described the statistical techniques used. Only 21.2% of the studies that required a prestudy sample size calculation reported one. Of the 281 statistical tests conducted, 88.3% were correct. P values between 0.05 and 0.10 were reported as “statistically significant” and/or a “trend” in 11.4% of studies. While this evaluation reveals both strengths and areas for improvement in the quality of statistical reporting in CJDPR, we encourage dietitians to pursue additional statistical training and/or seek the assistance of a statistician. Future research should consider validating this new checklist and using it to evaluate the statistical quality of studies published in other nutrition journals and disciplines.
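For context on the prestudy sample size calculation that most of the reviewed studies omitted, here is a minimal sketch of what such a calculation can look like; the effect size, alpha, and power targets are assumptions, not values from the review.

```python
# How many participants per group does a two-group comparison need to detect
# an anticipated effect with a chosen power and significance level?
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,   # anticipated standardized difference (Cohen's d), assumed
    alpha=0.05,        # two-sided significance level
    power=0.80,        # desired power
    ratio=1.0,         # equal group sizes
)
print(f"required sample size: about {n_per_group:.0f} participants per group")
```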


Author(s):  
Jackie Sham ◽  
BCIT School of Health Sciences, Environmental Health ◽  
Vanessa Karakilic ◽  
Kevin Soulsbury ◽  
Fred Shaw

Background and Purpose: Electronic cigarettes are gaining vast popularity because they are perceived as a safer alternative to conventional smoking (Belluz, 2015). As a result, more teenagers are switching to electronic cigarettes, either as a smoking-cessation tool or for recreational use. However, the evidence review supports that nicotine is frequently mislabeled, with the actual nicotine content of the liquids differing from what the manufacturer has labeled (Goniewicz et al., 2012). This is a critical health concern for teenagers and recreational users because they are exposed to nicotine, a neurotoxin that drives smoking addiction. As a result, over time, recreational electronic cigarette users have a higher chance of switching to conventional smoking (Bach, 2015). Hence, the purpose of this research is to determine whether nicotine can be found in nicotine-free electronic cigarette liquids. Methods: The nicotine content of the electronic cigarette liquids was determined using gas chromatography-mass spectrometry. Inferential statistics, namely a one-tailed t-test, were performed using Microsoft Excel and SAS to determine whether nicotine can be detected in nicotine-free electronic cigarette liquids and whether there is a statistically significant difference. Results: The two p-values from the parametric test were 0.2811 and 0.2953. The threshold for rejecting the null hypothesis was set at 0.05. Because the p-values from the inferential statistics were greater than 0.05, the null hypothesis was not rejected, and the measured nicotine content was consistent with the manufacturer's nicotine-free label. Discussion: Although the inferential statistics indicated no statistically significant difference in nicotine concentration, two of the ten nicotine-free electronic cigarette liquids measured nicotine levels above 0 ppm. Conclusion: There was no significant difference between the nicotine concentration found in the electronic cigarette liquids and the labeled concentration. However, because the sample size of only ten is small, there is a potential for a Type II error. Also, the samples came from only two manufacturers. Therefore, the results of this research are not representative of all electronic cigarette liquids. More research should be conducted to provide scientific evidence to deter recreational electronic cigarette use, as these products could act as a stepping stone towards smoking conventional cigarettes. Teenagers who start smoking at an early age will be more
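A hedged sketch of the inferential step described in the Methods, using hypothetical measurements rather than the study's data: a one-tailed, one-sample t-test of measured nicotine concentrations against the labeled value of 0 ppm.

```python
# Do "nicotine-free" liquids contain measurably more than 0 ppm nicotine?
import numpy as np
from scipy import stats

# Hypothetical measured concentrations (ppm) for ten nicotine-free liquids.
measured_ppm = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2, 0.8])

t_stat, p_value = stats.ttest_1samp(measured_ppm, popmean=0.0, alternative="greater")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
# Failing to reject the null only means the data are compatible with the label;
# with n = 10 the test has little power, so a Type II error remains plausible.
```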


2019 ◽  
Vol 3 ◽  
Author(s):  
Jessica K. Witt

What is the best criterion for determining statistical significance? In psychology, the criterion has been p < .05. This criterion has been criticized since its inception, and the criticisms have been rejuvenated by recent failures to replicate studies published in top psychology journals. Several replacement criteria have been suggested, including reducing the alpha level to .005 or switching to other types of criteria such as Bayes factors or effect sizes. Here, various decision criteria for statistical significance were evaluated using signal detection analysis on the outcomes of simulated data. The signal detection measure of area under the curve (AUC) is a measure of discriminability, with a value of 1 indicating perfect discriminability and 0.5 indicating chance performance. Applied to criteria for statistical significance, it provides an estimate of the decision criterion’s performance in discriminating real effects from null effects. AUCs were high (M = .96, median = .97) for p values, suggesting merit in using p values to discriminate significant effects. AUCs can be used to assess methodological questions such as how much improvement will be gained with increased sample size, how much discriminability will be lost with questionable research practices, and whether it is better to run a single high-powered study or a study plus a replication at lower power. AUCs were also used to compare performance across p values, Bayes factors, and effect size (Cohen’s d). AUCs were equivalent for p values and Bayes factors and were slightly higher for effect size. Signal detection analysis provides separate measures of discriminability and bias. With respect to bias, the specific thresholds that produced maximal utility depended on sample size, although this dependency was particularly notable for p values and less so for Bayes factors. The application of signal detection theory to the issue of statistical significance highlights the need to focus on both false alarms and misses, rather than false alarms alone.
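A hedged re-creation of the basic idea, not the article's own simulation: generate p-values from simulated studies with and without a true effect, then score how well they discriminate the two cases with AUC. Sample size, effect size, and repetition counts are assumptions.

```python
# AUC of p-values as a discriminator between real and null effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps, true_d = 50, 5_000, 0.5

def p_values(d):
    """Two-sided one-sample t-test p-values for `reps` simulated studies."""
    x = rng.normal(d, 1.0, size=(reps, n))
    return stats.ttest_1samp(x, popmean=0.0, axis=1).pvalue

p_null, p_real = p_values(0.0), p_values(true_d)

# AUC = probability that a randomly chosen real-effect study yields a smaller
# p-value than a randomly chosen null-effect study.
auc = (p_real[:, None] < p_null[None, :]).mean()
print(f"AUC for p-values discriminating real from null effects: {auc:.3f}")
```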

