The Conundrum of P-Values: Statistical Significance is Unavoidable but Need Medical Significance Too

Author(s):  
Abhaya Indrayan

Background: Small P-values have conventionally been considered evidence for rejecting a null hypothesis in empirical studies. However, P-values are now widely criticized, and the threshold used for statistical significance is being questioned. Methods: This communication takes a contrarian view and explains why the P-value and its threshold are still useful for ruling out sampling fluctuation as a source of the findings. Results: The problem is not with P-values themselves but with their misuse, abuse, and over-use, including the dominant role they have assumed in empirical results. False results arise mostly from errors in design, invalid data, inadequate analysis, inappropriate interpretation, accumulation of Type I error, and selective reporting, not from P-values per se. Conclusion: A threshold for P-values such as 0.05 for statistical significance is helpful in making a binary inference for practical application of the result, although a lower threshold can be suggested to reduce the chance of false results. Also, the emphasis should be on detecting a medically significant effect, not a zero effect.

2018, Vol 8 (2), pp. 58-71
Author(s):
Richard L. Gorsuch, Curtis Lehmann

Approximations for Chi-square and F distributions can both be computed to provide a p-value, or probability of Type I error, to evaluate statistical significance. Although Chi-square has traditionally been used for tests of count data and nominal or categorical criterion variables (such as contingency tables) and F ratios for tests of non-nominal or continuous criterion variables (such as regression and analysis of variance), we demonstrate that either statistic can be applied in both situations. We used data simulation studies to examine when one statistic may be more accurate than the other for estimating Type I error rates across different types of analysis (count data/contingencies, dichotomous, and non-nominal) and across sample sizes (Ns) ranging from 20 to 160, using 25,000 replications to simulate p-values derived from either Chi-squares or F ratios. Our results showed that p-values derived from F ratios were generally closer to nominal Type I error rates than those derived from Chi-squares. The p-values derived from F ratios were also more consistent for contingency table count data than those derived from Chi-squares. For Ns below 100, the smaller the N, the more the p-values derived from Chi-squares departed from the nominal p-value; only when N was greater than 80 did the p-values from Chi-square tests become as accurate as those derived from F ratios in reproducing the nominal p-values. Thus, there was no evidence of any need for special treatment of dichotomous dependent variables. The most accurate and/or consistent p-values were derived from F ratios. We conclude that Chi-square should generally be replaced with the F ratio as the statistic of choice and that the Chi-square test should be taught only as history.
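
The comparison described above can be reproduced in miniature. The sketch below (Python with NumPy/SciPy) is not the authors' simulation code: it assumes a single sample size of N = 40, two equal groups, and a dichotomous outcome with no true effect, and simply counts how often a chi-square test and an F test (one-way ANOVA on the 0/1 outcome) reject at the 0.05 level.

```python
# Minimal sketch (not the authors' exact simulation): empirical Type I error of a
# chi-square test versus an F test on a dichotomous outcome under a true null.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_reps, N, alpha = 25_000, 40, 0.05            # replication count follows the abstract
group = np.repeat([0, 1], N // 2)              # two equal groups
rej_chi2 = n_chi2 = rej_F = 0

for _ in range(n_reps):
    y = rng.integers(0, 2, size=N)             # dichotomous outcome, no true effect
    # Chi-square test on the 2x2 table of group x outcome
    table = np.array([[np.sum((group == g) & (y == v)) for v in (0, 1)] for g in (0, 1)])
    if table.min() > 0:                        # skip degenerate tables
        _, p_chi2, _, _ = stats.chi2_contingency(table, correction=False)
        rej_chi2 += p_chi2 < alpha
        n_chi2 += 1
    # F test treating the 0/1 outcome as a continuous criterion
    _, p_F = stats.f_oneway(y[group == 0], y[group == 1])
    rej_F += p_F < alpha

print(f"empirical Type I error  chi-square: {rej_chi2 / n_chi2:.3f}   F: {rej_F / n_reps:.3f}")
```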


1996, Vol 1 (1), pp. 25-28
Author(s):
Martin A. Weinstock

Background: Accurate understanding of certain basic statistical terms and principles is key to critical appraisal of published literature. Objective: This review describes type I error, type II error, null hypothesis, p value, statistical significance, α, two-tailed and one-tailed tests, effect size, alternate hypothesis, statistical power, β, publication bias, confidence interval, standard error, and standard deviation, with examples from reports of dermatologic studies. Conclusion: The application of the results of published studies to individual patients should be informed by an understanding of certain basic statistical concepts.
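
Many of the terms in this review can be illustrated with one worked two-group comparison. The sketch below uses invented data; the effect-size, standard-error, and confidence-interval formulas are the standard large-sample ones, not anything specific to the review.

```python
# Illustrative sketch (invented numbers): p value, two-tailed test, effect size,
# standard error, and confidence interval for a simple two-group comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(loc=1.2, scale=2.0, size=30)   # hypothetical outcome scores
control = rng.normal(loc=0.0, scale=2.0, size=30)

t, p_two_tailed = stats.ttest_ind(treated, control)                 # two-tailed p value
diff = treated.mean() - control.mean()                              # effect size (difference in means)
se = np.sqrt(treated.var(ddof=1) / 30 + control.var(ddof=1) / 30)   # standard error of the difference
ci_95 = (diff - 1.96 * se, diff + 1.96 * se)                        # approximate 95% confidence interval

print(f"t = {t:.2f}, two-tailed p = {p_two_tailed:.4f}")
print(f"effect size = {diff:.2f}, 95% CI = ({ci_95[0]:.2f}, {ci_95[1]:.2f})")
# Rejecting the null when p < alpha risks a Type I error; failing to reject when a
# real effect exists is a Type II error (probability beta), and power = 1 - beta.
```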


Methodology, 2016, Vol 12 (2), pp. 44-51
Author(s):
José Manuel Caperos, Ricardo Olmos, Antonio Pardo

Abstract. Correlation analysis is one of the most widely used methods to test hypotheses in the social and health sciences; however, its use is not completely error free. We explored the frequency of inconsistencies between reported p-values and the associated test statistics in 186 papers published in four Spanish journals of psychology (1,950 correlation tests); we also collected information about the use of one- versus two-tailed tests in the presence of directional hypotheses, and about the use of any adjustment to control Type I errors due to simultaneous inference. Most reported correlation tests (83.8%) are incomplete, and 92.5% include an inexact p-value. Gross inconsistencies, which are liable to alter the statistical conclusions, appear in 4% of the reviewed tests, and 26.9% of the inconsistencies found were large enough to bias the results of a meta-analysis. The use of one-tailed tests and of adjustments to control the Type I error rate is negligible. We therefore urge authors, reviewers, and editorial boards to pay particular attention to these issues in order to prevent inconsistencies in statistical reports.
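
The consistency check described above can be illustrated in a few lines. The sketch below is a generic reconstruction, not the authors' procedure: it recomputes the two-tailed p-value implied by a reported correlation coefficient and sample size (all values here are hypothetical) and compares it with the p-value stated in the paper.

```python
# Sketch of a reported-vs-recomputed p-value check for a correlation test.
import numpy as np
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p value for H0: rho = 0, via the t transformation of r."""
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

reported_r, reported_n, reported_p = 0.23, 120, 0.020   # hypothetical reported triplet
recomputed_p = p_from_r(reported_r, reported_n)
print(f"recomputed p = {recomputed_p:.4f}  (reported p = {reported_p})")
# A mismatch large enough to move the result across the 0.05 threshold would count
# as a "gross inconsistency" in the sense used by the authors.
```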


2019
Author(s):  
Dimitri Marques Abramov

Abstract. Background: Methods for p-value correction are criticized for either increasing Type II error or improperly reducing Type I error. This problem is worse when dealing with hundreds or thousands of paired comparisons between waves or images performed point-to-point. This text considers patterns in the probability vectors resulting from multiple point-to-point comparisons between two ERP waves (mass univariate analysis) in order to correct p-values. These patterns (probability waves) mirror ERP waveshapes and might be indicators of consistency in statistical differences. New method: To compute and analyze these patterns, we convolved the decimal logarithm of the probability vector (p′) with a Gaussian vector of size compatible with the ERP periods observed. To verify the consistency of this method, we also calculated mean amplitudes of late ERPs from the Pz (P300 wave) and O1 electrodes in two samples, of typical and ADHD subjects respectively. Results: The present method reduces the range of p′-values that do not show covariance with their neighbors (that is, those likely to be random differences, Type I errors), while preserving the amplitude of the probability waves, in accordance with the difference between the respective mean amplitudes. Comparison with existing methods: The positive-FDR procedure resulted in a different profile of corrected p-values, which is not consistent with the expected results or with the differences between the mean amplitudes of the analyzed ERPs. Conclusion: The present method appears to be biologically and statistically more suitable for correcting p-values in mass univariate analysis of ERP waves.
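
A rough illustration of the smoothing idea, not the author's implementation: the sketch below runs point-to-point paired t tests on two simulated ERP-like waves, takes the decimal logarithm of the resulting probability vector, and convolves it with a Gaussian kernel whose width (here 25 samples) stands in for the ERP period assumed in the paper.

```python
# Sketch: smooth the log10 of a point-by-point probability vector with a Gaussian kernel.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy import stats

rng = np.random.default_rng(2)
n_subj, n_points = 20, 500                     # hypothetical sample and epoch length
wave_a = rng.normal(size=(n_subj, n_points))
wave_b = rng.normal(size=(n_subj, n_points))
wave_b[:, 200:300] += 0.8                      # injected "component" difference

# Point-to-point paired t tests -> vector of p values
_, p = stats.ttest_rel(wave_a, wave_b, axis=0)

log_p = np.log10(p)                            # p' in the abstract's notation
sigma_samples = 25                             # assumed kernel width, tuned to the ERP period
smoothed = gaussian_filter1d(log_p, sigma=sigma_samples)

# Isolated small p values (log10 p far below 0) are pulled back toward 0 by the
# smoothing, while sustained runs of small p values survive as a "probability wave".
print(smoothed[240:260].round(2))
```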


2021
Author(s):  
Marcos A. Antezana

Abstract. When a data matrix (DM) has many independent variables (IVs), it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable (DV) of the DM, because the number of subsets explodes combinatorially as IVs increase; yet model selection and correction for multiple tests are complex even with few IVs. DMs in genomics will soon summarize millions of markers (mutations) and genomes. Searching exhaustively in such DMs for mutations that, alone or synergistically with others, are associated with a trait is computationally tractable only for 1- and 2-mutation effects. This is also why population geneticists study mainly 2-marker combinations.

I present a computationally tractable, fully parallelizable Participation in Association Score (PAS) that, in a DM with markers, detects one by one every column that is strongly associated in any way with others. PAS does not examine column subsets, and its computational cost grows linearly with the number of columns, remaining reasonable even when DMs have millions of columns. PAS P values are readily obtained by permutation and accurately Sidak-corrected for multiple tests, bypassing model selection. The P values of a column's PASs and dvPASs for different orders of association are i.i.d. and easily turned into a single P value.

PAS exploits the fact that associations of markers in the rows of a DM cause associations of matches in the pairwise comparisons of the rows. For every such comparison with a match at a tested column, PAS computes the matches at other columns by modifying the comparison's total matches (scored once per DM), yielding a distribution of conditional matches that reacts diagnostically to the associations of the tested column. Equally tractable is dvPAS, which flags DV-associated IVs by also probing the matches at the DV.

Simulations show that (i) PAS and dvPAS generate uniform-(0,1)-distributed Type I error in null DMs and (ii) they detect randomly encountered binary and trinary models of significant n-column association and of n-IV association to a binary DV, respectively, with power on the order of magnitude of exhaustive evaluation's and with false positives that are uniform-(0,1)-distributed or can be straightforwardly tuned to be so. Power to detect 2-way associations that extend over 100+ columns is non-parametrically ultimate, but power to detect pure n-column associations and pure n-IV DV associations sinks exponentially with increasing n. Importantly for geneticists, dvPAS power increases about twofold in trinary vs. binary DMs and by orders of magnitude with markers linked like mutations in chromosomes, especially in trinary DMs, where dvPAS furthermore fine-maps with the highest resolution.
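
The PAS statistic itself is not reproduced here, but the permutation-and-Sidak machinery the abstract relies on is generic and easy to sketch. The code below assumes a hypothetical column score (absolute difference in DV means between marker states) purely for illustration.

```python
# Generic sketch: permutation P value for an arbitrary column score, then a
# Sidak correction for testing many columns (not the PAS statistic itself).
import numpy as np

rng = np.random.default_rng(3)

def permutation_p(score_fn, column, dv, n_perm=2000):
    """One-sided permutation P value for association between a column and the DV."""
    observed = score_fn(column, dv)
    null = np.array([score_fn(column, rng.permutation(dv)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

def sidak(p, n_tests):
    """Sidak correction assuming n_tests independent tests."""
    return 1 - (1 - p) ** n_tests

# Toy score: absolute difference in DV mean between the two marker states
score = lambda col, dv: abs(dv[col == 1].mean() - dv[col == 0].mean())
column = rng.integers(0, 2, 200)               # one binary marker
dv = rng.integers(0, 2, 200)                   # binary dependent variable, no true effect
p_raw = permutation_p(score, column, dv)
print(f"raw P = {p_raw:.3f}, Sidak-corrected for 1,000 columns = {sidak(p_raw, 1000):.3f}")
```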


2021, pp. 1-2
Author(s):
Sukhvinder Singh Oberoi, Mansi Atri

The interpretation of the p-value has long been an arena for discussion, making it difficult for many researchers. The p-value was introduced by Pearson in 1900. It is, however, difficult to comment on the demerits of p-values and significance testing, which have gone largely undiscussed for a long time because of their practical application as a measure of interpretation in clinical research. The use of confidence intervals around sample statistics and of effect sizes should be given more importance than reliance solely upon statistical significance. Researchers should consult a statistician in the initial stages of planning a study to avoid misinterpretation of the P-value, especially if they are using statistical software for their data analysis.


Genetics, 2002, Vol 160 (3), pp. 1113-1122
Author(s):
A F McRae, J C McEwan, K G Dodds, T Wilson, A M Crawford, ...

Abstract. The last decade has seen a dramatic increase in the number of livestock QTL mapping studies. The next challenge awaiting livestock geneticists is to determine the actual genes responsible for variation in economically important traits. With the advent of high-density single nucleotide polymorphism (SNP) maps, it may be possible to fine-map genes by exploiting linkage disequilibrium between genes of interest and adjacent markers. However, the extent of linkage disequilibrium (LD) is generally unknown for livestock populations. In this article, microsatellite genotype data are used to assess the extent of LD in two populations of domestic sheep. High levels of LD were found to extend over tens of centimorgans and to decline as a function of marker distance. However, LD was also frequently observed between unlinked markers. The prospects for LD mapping in livestock appear encouraging, provided that Type I error can be minimized. Properties of the multiallelic LD coefficient D′ were also explored. D′ was found to be significantly related to marker heterozygosity, although the relationship did not appear to unduly influence the overall conclusions. Of potentially greater concern was the observation that D′ may be skewed when rare alleles are present. It is recommended that the statistical significance of LD be used in conjunction with coefficients such as D′ to determine the true extent of LD.
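
For readers unfamiliar with D′, the sketch below computes it for the simple biallelic case from a table of haplotype counts (the counts are invented); the paper itself uses the multiallelic extension, which averages the standardized coefficient over allele pairs.

```python
# Sketch: standardized LD coefficient D' for two biallelic loci.
import numpy as np

def d_prime(hap_counts):
    """hap_counts: 2x2 array of haplotype counts for alleles (A,a) x (B,b)."""
    freqs = hap_counts / hap_counts.sum()
    pA, pB = freqs[0].sum(), freqs[:, 0].sum()
    D = freqs[0, 0] - pA * pB                  # raw LD coefficient
    if D >= 0:
        d_max = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        d_max = min(pA * pB, (1 - pA) * (1 - pB))
    return D / d_max if d_max > 0 else 0.0

# Hypothetical haplotype counts: rows = alleles A/a, columns = alleles B/b
counts = np.array([[60, 10],
                   [15, 65]])
print(f"D' = {d_prime(counts):.2f}")
```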


Stroke, 2021, Vol 52 (Suppl_1)
Author(s):
Sarah E Wetzel-Strong, Shantel M Weinsheimer, Jeffrey Nelson, Ludmila Pawlikowska, Dewi Clark, ...

Objective: Circulating plasma protein profiling may aid in the identification of cerebrovascular disease signatures. This study aimed to identify circulating angiogenic and inflammatory biomarkers that may serve to differentiate sporadic brain arteriovenous malformation (bAVM) patients from patients with other conditions involving brain AVMs, including hereditary hemorrhagic telangiectasia (HHT). Methods: The Quantibody Human Angiogenesis Array 1000 (Raybiotech), an ELISA multiplex panel, was used to assess the levels of 60 proteins related to angiogenesis and inflammation in heparin plasma samples from 13 sporadic unruptured bAVM patients (69% male, mean age 51 years) and 37 patients with HHT (40% male, mean age 47 years, n=19 (51%) with bAVM). The Quantibody Q-Analyzer tool was used to calculate biomarker concentrations based on the standard curve for each marker, and log-transformed marker levels were evaluated for associations between disease states using a multivariable interval regression model adjusted for age, sex, ethnicity, and collection site. Statistical significance was based on Bonferroni correction for multiple testing of 60 biomarkers (P < 8.3×10⁻⁴). Results: Circulating levels of two plasma proteins differed significantly between sporadic bAVM and HHT patients: PDGF-BB (P = 2.6×10⁻⁴, PI = 3.37, 95% CI: 1.76-6.46) and CCL5 (P = 6.0×10⁻⁶, PI = 3.50, 95% CI: 2.04-6.03). When considering markers with a nominal p-value of less than 0.01, MMP1 and angiostatin levels also differed between patients with sporadic bAVM and HHT. Markers with nominal p-values less than 0.05 in this comparison also included angiostatin, IL2, VEGF, GRO, CXCL16, ITAC, and TGFB3. Among HHT patients, circulating levels of UPAR and IL6 were elevated in patients with documented bAVMs when considering markers with nominal p-values less than 0.05. Conclusions: This study identified differential expression of two promising plasma biomarkers that differentiate sporadic bAVMs from HHT. Furthermore, it allowed us to evaluate markers that are associated with the presence of bAVMs in HHT patients, which may offer insight into mechanisms underlying bAVM pathophysiology.
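
The significance threshold quoted above follows directly from the Bonferroni rule; a two-line check (using the CCL5 value reported in the abstract) is shown below.

```python
# Sketch: Bonferroni threshold for a panel of 60 biomarkers at family-wise alpha = 0.05.
n_markers, alpha = 60, 0.05
threshold = alpha / n_markers
print(f"Bonferroni threshold = {threshold:.1e}")   # ~8.3e-04, matching the abstract

p_ccl5 = 6.0e-6        # CCL5 P value reported in the abstract
print("CCL5 significant after correction:", p_ccl5 < threshold)
```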


2017
Author(s):
František Váša, Edward T. Bullmore, Ameera X. Patel

Abstract. Functional connectomes are commonly analysed as sparse graphs, constructed by thresholding cross-correlations between regional neurophysiological signals. Thresholding generally retains the strongest edges (correlations), either by retaining edges surpassing a given absolute weight or by constraining the edge density. The latter (more widely used) method risks inclusion of false positive edges at high edge densities and exclusion of true positive edges at low edge densities. Here we apply new wavelet-based methods, which enable construction of probabilistically thresholded graphs controlled for type I error, to a dataset of resting-state fMRI scans of 56 patients with schizophrenia and 71 healthy controls. By thresholding connectomes to a fixed edge-specific P value, we found that functional connectomes of patients with schizophrenia were more dysconnected than those of healthy controls, exhibiting a lower edge density and a higher number of (dis)connected components. Furthermore, many participants' connectomes could not be built up to the fixed edge densities commonly studied in the literature (~5-30%) while controlling for type I error. Additionally, we showed that the topological randomisation previously reported in the schizophrenia literature is likely attributable to "non-significant" edges added when thresholding connectomes to a fixed density based on correlation. Finally, by explicitly comparing connectomes thresholded by increasing P value and by decreasing correlation, we showed that probabilistically thresholded connectomes exhibit decreased randomness and increased consistency across participants. Our results have implications for future analyses of functional connectivity using graph theory, especially within datasets exhibiting heterogeneous distributions of edge weights (correlations) between groups or across participants.
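
A generic contrast between the two thresholding strategies, not the authors' wavelet-based method: the sketch below builds a surrogate connectome from random time series, then thresholds it once by an edge-specific P value and once by a fixed edge density, showing how the density rule can admit edges whose correlations are not individually significant.

```python
# Sketch: probabilistic (P-value based) versus density-based connectome thresholding.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_regions, n_timepoints = 90, 200
ts = rng.normal(size=(n_timepoints, n_regions))        # surrogate regional time series

n_edges = n_regions * (n_regions - 1) // 2
r = np.zeros(n_edges)
p = np.zeros(n_edges)
k = 0
for i in range(n_regions):
    for j in range(i + 1, n_regions):
        r[k], p[k] = stats.pearsonr(ts[:, i], ts[:, j])
        k += 1

# Probabilistic threshold: keep edges with an edge-specific P below alpha
alpha = 0.01
adj_prob = p < alpha

# Density threshold: keep the strongest 10% of correlations regardless of P
density = 0.10
cutoff = np.quantile(np.abs(r), 1 - density)
adj_dens = np.abs(r) >= cutoff

print(f"edge density (P < {alpha}): {adj_prob.mean():.3f}")
print(f"edges admitted by the 10% density rule but not significant at P < {alpha}: "
      f"{(adj_dens & ~adj_prob).sum()}")
```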


Author(s):  
Oliver Gutiérrez-Hernández, Luis Ventura García

Multiplicity arises when data analysis involves multiple simultaneous inferences, increasing the chance of spurious findings. It is a widespread problem that is frequently ignored by researchers. In this paper, we perform an exploratory analysis of the Web of Science database for COVID-19 observational studies. We examined the 100 top-cited COVID-19 peer-reviewed articles based on p-values, which included up to 7,100 simultaneous tests; 50% included more than 34 tests and 20% more than 100 tests. We found that the larger the number of tests performed, the larger the number of significant results (r = 0.87, p < 10⁻⁶). The number of p-values in the abstracts was not related to the number of p-values in the papers. However, the number of highly significant results (p < 0.001) in the abstracts was strongly correlated (r = 0.61, p < 10⁻⁶) with the number of p < 0.001 significances in the papers. Furthermore, the abstracts included a higher proportion of significant results (0.91 vs. 0.50), and 80% reported only significant results. Only one reviewed paper addressed multiplicity-induced Type I error inflation, pointing to potentially spurious results bypassing the peer-review process. We conclude that special attention needs to be paid to the increased chance of false discoveries in observational studies, including non-replicated striking discoveries with a potentially large social impact. We propose some easy-to-implement measures to assess and limit the effects of multiplicity.
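
The inflation described above follows from elementary probability: if m independent tests are each run at α = 0.05 under the null, the chance of at least one false positive is 1 − (1 − α)^m. A quick check with test counts echoing those in the abstract:

```python
# Sketch: family-wise false-positive probability as the number of tests grows.
alpha = 0.05
for m in (1, 34, 100, 7100):          # test counts echoing those in the abstract
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:5d} tests -> P(at least one false positive) = {fwer:.3f}")
# With 34 tests the family-wise error rate already exceeds 0.80;
# with 100 or more tests it is essentially 1.
```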

