P values: from suggestion to superstition

2016 ◽  
Vol 64 (7) ◽  
pp. 1166-1171 ◽  
Author(s):  
John Concato ◽  
John A Hartigan

A threshold probability value of ‘p≤0.05’ is commonly used in clinical investigations to indicate statistical significance. To allow clinicians to better understand evidence generated by research studies, this review defines the p value, summarizes the historical origins of the p value approach to hypothesis testing, describes various applications of p≤0.05 in the context of clinical research, and discusses the emergence of p≤5×10⁻⁸ and other values as thresholds for genomic statistical analyses. Corresponding issues include a conceptual approach of evaluating whether data do not conform to a null hypothesis (ie, no exposure–outcome association). Importantly, and in the historical context of when p≤0.05 was first proposed, the 1-in-20 chance of a false-positive inference (ie, falsely concluding the existence of an exposure–outcome association) was offered only as a suggestion. In current usage, however, p≤0.05 is often misunderstood as a rigid threshold, sometimes with a misguided ‘win’ (p≤0.05) or ‘lose’ (p>0.05) approach. Also, in contemporary genomic studies, a threshold of p≤10⁻⁸ has been endorsed as a boundary for statistical significance when analyzing numerous genetic comparisons for each participant. A value of p≤0.05, or other thresholds, should not be employed reflexively to determine whether a clinical research investigation is trustworthy from a scientific perspective. Rather, and in parallel with conceptual issues of validity and generalizability, quantitative results should be interpreted using a combined assessment of strength of association, p values, CIs, and sample size.
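As a rough illustration of that combined assessment, the hedged Python sketch below (invented counts, not data from the review) reports the strength of association, its confidence interval, the sample size and an exact p-value together for a hypothetical 2×2 exposure–outcome table.

```python
# A minimal sketch with invented counts (not data from the review): report the
# strength of association, its CI, the sample size and the p-value together,
# rather than relying on the p <= 0.05 threshold alone.
import numpy as np
from scipy import stats

#                 outcome+  outcome-
table = np.array([[30,       70],    # exposed
                  [15,       85]])   # unexposed

a, b = table[0]
c, d = table[1]

odds_ratio = (a * d) / (b * c)                        # strength of association
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)            # Wald SE of log(OR)
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
_, p_value = stats.fisher_exact(table)                # exact two-sided p-value

print(f"n={table.sum()}, OR={odds_ratio:.2f}, "
      f"95% CI {ci_low:.2f}-{ci_high:.2f}, p={p_value:.3f}")
```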

Stroke ◽  
2021 ◽  
Vol 52 (Suppl_1) ◽  
Author(s):  
Sarah E Wetzel-Strong ◽  
Shantel M Weinsheimer ◽  
Jeffrey Nelson ◽  
Ludmila Pawlikowska ◽  
Dewi Clark ◽  
...  

Objective: Circulating plasma protein profiling may aid in the identification of cerebrovascular disease signatures. This study aimed to identify circulating angiogenic and inflammatory proteins that may serve as biomarkers to differentiate patients with sporadic brain arteriovenous malformation (bAVM) from patients with other conditions involving brain AVMs, including hereditary hemorrhagic telangiectasia (HHT). Methods: The Quantibody Human Angiogenesis Array 1000 (Raybiotech), a multiplex ELISA panel, was used to assess the levels of 60 proteins related to angiogenesis and inflammation in heparin plasma samples from 13 sporadic unruptured bAVM patients (69% male, mean age 51 years) and 37 patients with HHT (40% male, mean age 47 years, n=19 (51%) with bAVM). The Quantibody Q-Analyzer tool was used to calculate biomarker concentrations from the standard curve for each marker, and log-transformed marker levels were evaluated for associations between disease states using a multivariable interval regression model adjusted for age, sex, ethnicity and collection site. Statistical significance was based on Bonferroni correction for multiple testing of 60 biomarkers (P<8.3×10⁻⁴). Results: Circulating levels of two plasma proteins differed significantly between sporadic bAVM and HHT patients: PDGF-BB (P=2.6×10⁻⁴, PI=3.37, 95% CI 1.76-6.46) and CCL5 (P=6.0×10⁻⁶, PI=3.50, 95% CI 2.04-6.03). When considering markers with a nominal p-value of less than 0.01, MMP1 and angiostatin levels also differed between patients with sporadic bAVM and HHT. Markers with nominal p-values less than 0.05 when comparing sporadic bAVM and HHT patients also included angiostatin, IL2, VEGF, GRO, CXCL16, ITAC, and TGFB3. Among HHT patients, circulating levels of UPAR and IL6 were elevated in patients with documented bAVMs when considering markers with nominal p-values less than 0.05. Conclusions: This study identified two promising plasma biomarkers whose differential expression distinguishes patients with sporadic bAVM from patients with HHT. Furthermore, the study allowed us to evaluate markers associated with the presence of bAVMs in HHT patients, which may offer insight into the mechanisms underlying bAVM pathophysiology.
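For readers unfamiliar with the correction, the short Python sketch below (not the authors' code; the MMP1 value is a placeholder) shows how the Bonferroni threshold of 8.3×10⁻⁴ follows from dividing α=0.05 by the 60 biomarkers tested, and how the two reported p-values clear it.

```python
# A minimal sketch (assumed, not the authors' pipeline): derive the Bonferroni
# threshold used in the abstract and apply it to illustrative marker p-values.
alpha = 0.05
n_tests = 60
bonferroni_threshold = alpha / n_tests        # = 8.3e-4, as reported above
print(f"Bonferroni threshold: {bonferroni_threshold:.1e}")

# PDGF-BB and CCL5 values are from the abstract; the MMP1 value is a placeholder
# standing in for a marker that is only nominally significant.
p_values = {"PDGF-BB": 2.6e-4, "CCL5": 6.0e-6, "MMP1": 8.0e-3}
significant = {m: p for m, p in p_values.items() if p < bonferroni_threshold}
print(significant)   # only PDGF-BB and CCL5 pass the corrected threshold
```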


Author(s):  
Abhaya Indrayan

Background: Small P-values have conventionally been considered evidence for rejecting a null hypothesis in empirical studies. However, P-values are now widely criticized, and the threshold we use for statistical significance is being questioned. Methods: This communication takes a contrarian view and explains why the P-value and its threshold are still useful for ruling out sampling fluctuation as a source of the findings. Results: The problem is not with P-values themselves but with their misuse, abuse, and over-use, including the dominant role they have assumed in empirical results. False results may arise mostly from errors in design, invalid data, inadequate analysis, inappropriate interpretation, accumulation of Type-I error, and selective reporting, and not from P-values per se. Conclusion: A P-value threshold such as 0.05 for statistical significance is helpful in making a binary inference for practical application of the result. However, a lower threshold can be suggested to reduce the chance of false results. Also, the emphasis should be on detecting a medically significant effect rather than merely a non-zero effect.
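The "accumulation of Type-I error" mentioned above can be made concrete with a few lines of Python; the sketch below shows how quickly the family-wise false-positive probability grows when many independent tests are each run at α=0.05, which is one reason a lower per-test threshold is sometimes suggested.

```python
# A minimal sketch: family-wise probability of at least one false positive
# when k independent tests are each performed at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 20, 100):
    familywise_error = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests: P(at least one false positive) = {familywise_error:.2f}")
```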


2019 ◽  
Author(s):  
Marshall A. Taylor

Coefficient plots are a popular tool for visualizing regression estimates. The appeal of these plots is that they display confidence intervals around the estimates and generally center the plot at zero, so that any estimate whose interval crosses zero is statistically non-significant at least at the alpha-level used to construct the intervals. For models whose significance levels are determined via randomization inference, and for which there is no standard error or confidence interval for the estimate itself, these plots appear less useful. In this paper, I illustrate a variant of the coefficient plot for regression models with p-values constructed using permutation tests. These visualizations plot each estimate's p-value and its associated confidence interval in relation to a specified alpha-level. These plots can help the analyst interpret and report both the statistical and substantive significance of their models. Illustrations are provided using a nonprobability sample of activists and participants at a 1962 anti-Communism school.
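A hedged sketch of the quantities such a plot would display is given below: a permutation p-value for an OLS slope together with a binomial (Monte Carlo) interval around that p-value, using simulated data rather than the activist sample analysed in the paper.

```python
# A minimal sketch (assumed, not the paper's code): permutation p-value for an
# OLS slope plus a Monte Carlo (binomial) interval around that p-value, i.e.
# the kind of quantities plotted against a chosen alpha-level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.15 * x + rng.normal(size=200)          # simulated data with a modest true effect

def slope(x, y):
    return np.polyfit(x, y, 1)[0]            # OLS slope

observed = slope(x, y)
n_perm = 5000
perm_slopes = np.array([slope(x, rng.permutation(y)) for _ in range(n_perm)])
p_value = (np.sum(np.abs(perm_slopes) >= abs(observed)) + 1) / (n_perm + 1)

# Binomial interval reflecting Monte Carlo error in the permutation p-value
ci_low, ci_high = stats.binom.interval(0.95, n_perm, p_value)
print(f"slope={observed:.3f}, permutation p={p_value:.4f}, "
      f"95% MC interval ({ci_low/n_perm:.4f}, {ci_high/n_perm:.4f})")
```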


EP Europace ◽  
2020 ◽  
Vol 22 (Supplement_1) ◽  
Author(s):  
D A Radu ◽  
C N Iorgulescu ◽  
S N Bogdan ◽  
A I Deaconu ◽  
A Badiul ◽  
...  

Abstract Background Left ventricular non-compaction (LVNC) is a structural cardiomyopathy (SC) with a high probability of LV systolic dysfunction. Left bundle branch block (LBBB) frequently occurs in SCs. Purpose We sought to analyse the evolution of LVNC-CRT (LC) patients in general and to compare it with that of the non-LVNC-CRT group (nLC). Methods We analysed 40 patients with contrast-MRI-documented LVNC (concomitant positive Petersen and Jacquier criteria) implanted with CRT devices in CEHB. The follow-up included 7 hospital visits per patient (between baseline and 3 years). Demographics, risk factors, usual serum levels, pre-procedural planning factors, and clinical, ECG, TTE and biochemical markers were recorded. Statistical analysis was performed with standard statistical software. Notable differences are reported either as p-values from crosstabs (discrete variables) or as mean differences, p-values and confidence intervals from t-tests (continuous variables). A p-value of .05 was chosen as the threshold for statistical significance (SS). Results Subjects in LC were younger (−7.52 years; p<.001; CI −11.440 to −3.617), with no sex predominance, more often obese (45.9 vs. 28.3%; p<0.24) and had less ischaemic disease (17.9 vs. 39.7%; p<.007). LC implants were usually CRT-Ds (91 vs. 49.5%; p<.001) and were more frequently MPP-ready (35.8 vs. 8.4%; p<.001). At baseline, sinus rhythm was predominant in LC (97.4 vs. 79.8%; p<.007) and permitted frequent use of optimal fusion CRT (75.5 vs. 46.6%; p<.002). Although initial LVEFs were similar, LCs had much larger EDVs (+48.91 ml; p<.020; CI +7.705 to +90.124) and ESVs (+34.91 ml; p<.05; CI +1.657 to +71.478). After an encouraging initial evolution of roughly 1 year, the performance of the LC-CRT group deteriorated in terms of both LVEF and volumes. Thus, at 1-year follow-up, when compared to nLCs, LVEFs were far lower (−22.02%; p<.001; CI −32.29 to −11.76) while EDVs and ESVs were much higher (+70.8 ml; p<.037; CI +49.27 to +189.65, and +100.13 ml; p<.039; CI +5.25 to +195, respectively) in LCs, in spite of similarly corrected dyssynchrony. The mean mitral regurgitation (MR) degree at 1 year was much higher in LCs (+1.8 classes; p<.002; CI +0.69 to +2.97), certainly contributing to the poor results. The cumulative super-responder/responder (SR/R) rates were consistently lower and decreasing at both 1 year (37.5 vs. 72.4%; p<.040) and 2 years of follow-up (10.1 vs. 80%; NS). Conclusions CRT candidates with LVNC are significantly more severe at the time of implant. After an initial short-term improvement (probably due to acute correction of dyssynchrony), most patients fail to respond in the long term. Severe dilation with important secondary MR probably plays an important role.
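As a generic illustration of the reporting format used above (mean difference, p-value and confidence interval from a t-test), the Python sketch below uses invented volume data, not the study's measurements.

```python
# A minimal sketch with invented data (not the study's): Welch t-test reported
# as a mean difference with p-value and confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=230, scale=60, size=38)    # hypothetical end-diastolic volumes (ml)
group_b = rng.normal(loc=185, scale=55, size=120)   # hypothetical comparison group (ml)

t_res = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
mean_diff = group_a.mean() - group_b.mean()

# Welch-Satterthwaite confidence interval for the mean difference
v1, v2 = group_a.var(ddof=1) / len(group_a), group_b.var(ddof=1) / len(group_b)
df = (v1 + v2) ** 2 / (v1 ** 2 / (len(group_a) - 1) + v2 ** 2 / (len(group_b) - 1))
margin = stats.t.ppf(0.975, df) * np.sqrt(v1 + v2)
print(f"mean difference = {mean_diff:+.1f} ml; p = {t_res.pvalue:.3f}; "
      f"CI ({mean_diff - margin:+.1f}; {mean_diff + margin:+.1f})")
```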


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e20720-e20720 ◽  
Author(s):  
Benjamin Oren Spieler ◽  
Diana Saravia ◽  
Gilberto Lopes ◽  
Gregory Azzam ◽  
Deukwoo Kwon ◽  
...  

e20720 Background: Targeted therapies are ineffective in most NSCLC patients, and response rates remain <20% for patients with advanced NSCLC on immuno-monotherapy. Predictive models that distinguish responders from non-responders to immunotherapy could help guide clinical practice. Texture analysis is a data-mining tool used to identify intensity patterns in diagnostic imaging. We hypothesized that texture features on pre-immunotherapy CT imaging would be associated with clinical outcomes for patients with advanced NSCLC treated with nivolumab. Methods: From an IRB-approved database containing 159 patients with advanced NSCLC treated with nivolumab monotherapy, the 20 patients with the longest overall survival (OS) and the 20 with the shortest were selected for retrospective analysis. Patient characteristics were compared using paired t-tests. The last pre-immunotherapy PET/CT for each patient was transferred to MIM software for segmentation. All FDG-avid intrathoracic tumors were delineated on the CT scan per RTOG contouring guidelines. Ninety-two texture features within each tumor were analyzed for association with the primary endpoint, OS. OS time was dichotomized as less than 1 year vs. more than 1 year. A univariate logistic regression model was used to estimate the odds ratio (OR), 95% confidence interval and p-value for each feature. Multiple-testing adjustment was performed using the false discovery rate. Results: Eleven of 92 texture features showed a significant association with OS time (p-values from 0.009 to 0.044), of which 7 exhibited a large effect (OR <0.5 or >1.5). Fifteen additional texture features trended toward statistical significance, with p-values from 0.05 to 0.10. In all, 26 of the 92 texture features showed a significant association with OS duration or trended toward significance. Conclusions: This preliminary study suggests that texture features on pre-immunotherapy CT imaging may help predict OS duration for patients with advanced NSCLC treated with nivolumab monotherapy. We are in the process of validating a multivariate predictive model. Future directions include expansion of this study across the full database, survival analyses, and correlation of texture features with tissue biology.
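A minimal Python sketch of the analysis pipeline described above (univariate logistic regression per feature followed by false-discovery-rate adjustment) is given below; it uses simulated features and outcomes, not the study data.

```python
# A minimal sketch (assumed, not the authors' pipeline): univariate logistic
# regression screen of features against a dichotomized outcome, with
# Benjamini-Hochberg false-discovery-rate adjustment of the p-values.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_patients, n_features = 40, 92
X = rng.normal(size=(n_patients, n_features))     # simulated texture features
outcome = rng.integers(0, 2, size=n_patients)     # OS < 1 year vs >= 1 year (simulated)

p_values, odds_ratios = [], []
for j in range(n_features):
    model = sm.Logit(outcome, sm.add_constant(X[:, j])).fit(disp=0)
    odds_ratios.append(np.exp(model.params[1]))   # OR for the feature
    p_values.append(model.pvalues[1])

rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} features significant after FDR adjustment")
```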


Author(s):  
Afshin Fayyaz movaghar ◽  
Sabine Mercier ◽  
Louis Ferré

We propose an approximate distribution for the gapped local score of a two-sequence comparison. Our method combines an adapted scoring scheme that accounts for gaps with an approximate distribution of the ungapped local score of two independent sequences of i.i.d. random variables. The new scoring scheme is defined on h-tuples of the sequences, using the gapped global score. The influence of h and the accuracy of the p-value are studied numerically and compared with the p-values obtained from BLAST. The numerical experiments show that our approximate p-values outperform those of BLAST, particularly for short sequences, both simulated and real.
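As a point of reference for what such approximations target, the Python sketch below estimates a p-value for the ungapped local score of two i.i.d. sequences by Monte Carlo simulation; it is not the authors' method, only a baseline illustration of the quantity being approximated.

```python
# A minimal sketch (not the authors' method): Monte Carlo p-value for the
# ungapped local score of two i.i.d. sequences (match +1, mismatch -1).
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = np.arange(4)                      # a 4-letter, DNA-like alphabet

def local_score(a, b):
    """Best ungapped segment score over all diagonals of the comparison."""
    best, n, m = 0, len(a), len(b)
    for offset in range(-(n - 1), m):
        i0, j0 = max(0, -offset), max(0, offset)
        length = min(n - i0, m - j0)
        scores = np.where(a[i0:i0 + length] == b[j0:j0 + length], 1, -1)
        running, high = 0, 0
        for s in scores:                      # Kadane-style maximal segment sum
            running = max(0, running + s)
            high = max(high, running)
        best = max(best, high)
    return best

x = rng.choice(ALPHABET, size=50)
y = rng.choice(ALPHABET, size=50)
observed = local_score(x, y)

n_sim = 500
null_scores = [local_score(rng.choice(ALPHABET, size=50), rng.choice(ALPHABET, size=50))
               for _ in range(n_sim)]
p_value = (sum(s >= observed for s in null_scores) + 1) / (n_sim + 1)
print(f"observed local score = {observed}, Monte Carlo p = {p_value:.3f}")
```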


2017 ◽  
Vol 16 (3) ◽  
pp. 1
Author(s):  
Laura Badenes-Ribera ◽  
Dolores Frias-Navarro

"Evidence-Based Practice" requires professionals to critically appraise the results of psychological research. However, incorrect interpretations of p values are abundant and recurrent. These misconceptions affect professional decisions and compromise the quality of interventions and the accumulation of valid scientific knowledge. Identifying the types of fallacies that underlie statistical decisions is fundamental for planning statistical education strategies aimed at correcting these misinterpretations. Therefore, the aim of this study was to analyze the interpretation of the p value among psychology undergraduates and academic psychologists. The sample comprised 161 participants (43 academics and 118 students). The mean number of years as an academic was 16.7 (SD = 10.07). The mean age of the students was 21.59 years (SD = 1.3). The findings suggest that neither students nor academics know the correct interpretation of p values. The inverse probability fallacy presents the greatest comprehension problems. In addition, statistical significance and practical or clinical significance are confused. These results highlight the need for statistical education and re-education.
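The inverse probability fallacy highlighted above can be illustrated numerically: the Python sketch below (assumed values for power and prior plausibility) shows that the probability that the null hypothesis is true given a "significant" result is not the p-value and can be large.

```python
# A minimal sketch with assumed inputs: the p-value is P(data at least this
# extreme | H0), not P(H0 | data). The share of "significant" results that are
# false positives depends on prior plausibility and power.
alpha, power = 0.05, 0.80
for prior_true in (0.5, 0.1, 0.01):          # share of tested hypotheses that are true
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    p_h0_given_significant = false_positives / (true_positives + false_positives)
    print(f"prior P(effect) = {prior_true:>4}: "
          f"P(H0 true | significant) = {p_h0_given_significant:.2f}")
```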


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Agustín Ciapponi ◽  
José M. Belizán ◽  
Gilda Piaggio ◽  
Sanni Yaya

This article challenges the "tyranny of the P-value" and promotes more valuable and applicable interpretations of the results of research on health care delivery. After presenting the current state of the debate within the scientific community, we provide solid arguments for retiring statistical significance as the sole way to interpret results. Instead, we promote reporting the much more informative confidence intervals, possibly supplemented with exact P-values. We also provide some guidance on integrating statistical and clinical significance by referring to minimal important differences and by combining the effect size of an intervention with the certainty of the evidence, ideally using the GRADE approach. We argue against interpreting or reporting results as statistically significant or statistically non-significant. We recommend reporting important clinical benefits with their confidence intervals, including cases in which the point estimates are compatible with both benefits and important harms. It seems fair to report the point estimate and the more likely values along with a very clear statement of the implications of the extremes of the intervals. We recommend drawing conclusions by considering multiple factors besides P-values, such as the certainty of the evidence for each outcome, net benefit, economic considerations, and values and preferences. We use several examples and figures to illustrate different scenarios and suggest wording to standardize reporting. Several statistical measures have a role in the scientific communication of studies, but it is time to recognize that there is life beyond statistical significance. There is a great opportunity for improvement towards a more complete interpretation and more standardized reporting.
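As a sketch of the kind of interpretation the authors recommend, the Python snippet below (hypothetical numbers and a hypothetical helper, interpret) classifies a confidence interval against an assumed minimal important difference rather than against zero.

```python
# A minimal sketch with hypothetical numbers: read a confidence interval against
# a minimal important difference (MID) instead of testing against zero.
def interpret(ci_low, ci_high, mid):
    """Classify an effect estimate's CI relative to a minimal important difference."""
    if ci_low >= mid:
        return "benefit clearly exceeds the minimal important difference"
    if ci_high <= 0:
        return "compatible with no benefit or with harm"
    if 0 < ci_low and ci_high < mid:
        return "benefit likely real but smaller than the minimal important difference"
    return "interval too wide to support a clear conclusion"

# Example: risk differences in percentage points, MID assumed to be 2 points
for ci in [(3.0, 8.0), (0.5, 1.8), (-1.0, 6.0), (-4.0, 1.0)]:
    print(ci, "->", interpret(*ci, mid=2.0))
```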


2019 ◽  
Author(s):  
Don van den Bergh ◽  
Johnny van Doorn ◽  
Maarten Marsman ◽  
Tim Draws ◽  
Erik-Jan van Kesteren ◽  
...  

Analysis of variance (ANOVA) is the standard procedure for statistical inference in factorial designs. Typically, ANOVAs are executed using frequentist statistics, where p-values determine statistical significance in an all-or-none fashion. In recent years, the Bayesian approach to statistics has increasingly been viewed as a legitimate alternative to the p-value. However, the broad adoption of Bayesian statistics – and Bayesian ANOVA in particular – is frustrated by the fact that Bayesian concepts are rarely taught in applied statistics courses. Consequently, practitioners may be unsure how to conduct a Bayesian ANOVA and interpret the results. Here we provide a guide for executing and interpreting a Bayesian ANOVA with JASP, an open-source statistical software program with a graphical user interface. We explain the key concepts of the Bayesian ANOVA using two empirical examples.
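For readers without JASP at hand, the Python sketch below approximates a Bayes factor for a one-way ANOVA effect from BIC values (Wagenmakers' 2007 approximation); JASP's Bayesian ANOVA uses different default priors, so this is only an illustration of the underlying idea of model comparison.

```python
# A minimal sketch (not JASP itself): approximate Bayes factor for a one-way
# ANOVA effect via the BIC approximation BF10 ~= exp((BIC_null - BIC_alt) / 2),
# using simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "y": np.concatenate([rng.normal(0.0, 1, 30),
                         rng.normal(0.5, 1, 30),
                         rng.normal(0.2, 1, 30)]),
})

null_model = smf.ols("y ~ 1", data=df).fit()          # intercept-only model
alt_model = smf.ols("y ~ C(group)", data=df).fit()    # model with the group effect

bf10 = np.exp((null_model.bic - alt_model.bic) / 2)   # evidence for a group effect
print(f"approximate BF10 = {bf10:.2f}")
```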


Author(s):  
David Trafimow ◽  
Valentin Amrhein ◽  
Corson N. Areshenkoff ◽  
Carlos Barrera-Causil ◽  
Eric J. Beh ◽  
...  

We argue that depending on p-values to reject null hypotheses, including a recent call for changing the canonical alpha level for statistical significance from .05 to .005, is deleterious for the finding of new discoveries and the progress of science. Given that blanket and variable criterion levels both are problematic, it is sensible to dispense with significance testing altogether. There are alternatives that address study design and determining sample sizes much more directly than significance testing does; but none of the statistical tools should replace significance testing as the new magic method giving clear-cut mechanical answers. Inference should not be based on single studies at all, but on cumulative evidence from multiple independent studies. When evaluating the strength of the evidence, we should consider, for example, auxiliary assumptions, the strength of the experimental design, or implications for applications. To boil all this down to a binary decision based on a p-value threshold of .05, .01, .005, or anything else, is not acceptable.

