Confronting p-hacking: addressing p-value dependence on sample size

2019 ◽  
Author(s):  
Estibaliz Gómez-de-Mariscal ◽  
Alexandra Sneider ◽  
Hasini Jayatilaka ◽  
Jude M. Phillip ◽  
Denis Wirtz ◽  
...  

ABSTRACT: Biomedical research has come to rely on p-values to determine potential translational impact. The p-value is routinely compared with a threshold, commonly set to 0.05, to assess the significance of the null hypothesis. Whenever a large enough dataset is available, this threshold is easily reached. This phenomenon is known as p-hacking, and it leads to spurious conclusions. Herein, we propose a systematic and easy-to-follow protocol that models the p-value as an exponential function of the sample size to test whether real statistical significance exists. This approach provides a robust assessment of the null hypothesis, with accurate values for the minimum data size needed to reject it. An in-depth study of the model is carried out on both simulated and experimentally obtained data. Simulations show that our assumptions hold under controlled data. The results of our analysis of the experimental datasets reflect the broad applicability of this approach to common decision-making processes.
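The abstract's central claim, that a fixed threshold is eventually crossed once enough data are available, is easy to reproduce. Below is a minimal simulation (a toy sketch, not the authors' protocol; the effect size and sample sizes are illustrative) showing a two-sample t-test p-value collapsing as n grows even when the true difference is negligible.

```python
# Toy demonstration: a fixed 0.05 threshold is reached by sheer sample
# size even when the true group difference is trivially small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tiny_effect = 0.05  # standardized mean difference; negligible in practice

for n in [50, 500, 5_000, 50_000]:
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(tiny_effect, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>6}: p = {p:.4f}")
# With enough data, p drops below 0.05 although the effect is trivial.
```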

10.2196/21345 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e21345 ◽  
Author(s):  
Marcus Bendtsen

When should a trial stop? Such a seemingly innocent question evokes concerns about type I and II errors among those who believe that certainty can be the product of uncertainty, and among researchers who have been told that they need to carefully calculate sample sizes, consider multiplicity, and not spend P values on interim analyses. However, the endeavor to dichotomize evidence into significant and nonsignificant has caused the basic driving force of science, namely uncertainty, to take a back seat. In this viewpoint we argue that if testing the null hypothesis is the ultimate goal of science, then we need not worry about writing protocols, considering ethics, applying for funding, or running any experiments at all: all null hypotheses will be rejected at some point, because everything has an effect. The job of science should be to unearth the uncertainties of the effects of treatments, not to test their difference from zero. We also show the fickleness of P values: how they may one day point to statistically significant results, only for the once significant effect to disappear after a few more participants have been recruited. We show plots that we hope will intuitively highlight that all assessments of evidence fluctuate over time. Finally, we discuss the remedy in the form of Bayesian methods, in which uncertainty leads and which allow continuous decisions to stop or continue recruitment as new data from a trial accumulate.
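To make the closing point concrete, here is a hedged sketch of continuous Bayesian monitoring under a beta-binomial model; the priors, response rates, batch size, and stopping rule are all illustrative assumptions, not prescriptions from the viewpoint.

```python
# Sketch: update Beta posteriors on two arms' response rates as
# participants accrue, and stop once the posterior probability that
# the treatment is better is decisively high or low.
import numpy as np

rng = np.random.default_rng(1)
true_rate_ctrl, true_rate_trt = 0.30, 0.40  # assumed true response rates
a_c = b_c = a_t = b_t = 1.0                 # Beta(1, 1) priors for each arm

for batch in range(1, 51):                  # recruit up to 50 batches of 20
    ctrl = rng.binomial(1, true_rate_ctrl, size=10)
    trt = rng.binomial(1, true_rate_trt, size=10)
    a_c, b_c = a_c + ctrl.sum(), b_c + (1 - ctrl).sum()
    a_t, b_t = a_t + trt.sum(), b_t + (1 - trt).sum()

    # Posterior probability that the treatment response rate is higher,
    # estimated by Monte Carlo draws from the two Beta posteriors.
    p_better = (rng.beta(a_t, b_t, 10_000) > rng.beta(a_c, b_c, 10_000)).mean()
    if p_better > 0.99 or p_better < 0.01:  # illustrative stopping rule
        print(f"stop after {batch * 20} participants: "
              f"P(treatment better) = {p_better:.3f}")
        break
```

Unlike a P value spent at each interim look, the posterior here can be inspected after every batch without penalty; the uncertainty itself drives the decision to continue or stop.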


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Estibaliz Gómez-de-Mariscal ◽  
Vanesa Guerrero ◽  
Alexandra Sneider ◽  
Hasini Jayatilaka ◽  
Jude M. Phillip ◽  
...  

Abstract: Biomedical research has come to rely on p-values as a deterministic measure for data-driven decision-making. In the widely used framework of null hypothesis significance testing for identifying statistically significant differences among groups of observations, a single p-value is computed from sample data. It is then routinely compared with a threshold, commonly set to 0.05, to assess the evidence against the null hypothesis of no significant differences among groups. Because the estimated p-value tends to decrease as the sample size increases, applying this methodology to datasets with large sample sizes results in rejection of the null hypothesis, rendering it uninformative in this situation. We propose a new approach to detect differences based on the dependence of the p-value on the sample size. We introduce new descriptive parameters that overcome the effect of sample size on the interpretation of the p-value in datasets with large sample sizes, reducing the uncertainty in the decision about the existence of biological differences between the compared experiments. The methodology enables graphical and quantitative characterization of the differences between the compared experiments, guiding researchers in the decision process. An in-depth study of the methodology is carried out on simulated and experimental data. Code is available at https://github.com/BIIG-UC3M/pMoSS.
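As a rough illustration of the fitting idea, here is a minimal sketch assuming the simple decay form p(n) = a*exp(-b*n); the p-values below are invented, and the pMoSS repository linked above contains the authors' actual model and implementation.

```python
# Fit an assumed exponential decay of the p-value with sample size and
# read off the smallest n at which the fitted curve crosses alpha.
import numpy as np
from scipy.optimize import curve_fit

def p_model(n, a, b):
    # Assumed form: p(n) = a * exp(-b * n)
    return a * np.exp(-b * n)

# p-values measured at increasing sample sizes (invented numbers).
n_obs = np.array([20, 50, 100, 200, 400, 800])
p_obs = np.array([0.62, 0.41, 0.22, 0.09, 0.02, 0.001])

(a_hat, b_hat), _ = curve_fit(p_model, n_obs, p_obs, p0=(1.0, 0.001))

# Smallest n at which the fitted curve crosses alpha = 0.05.
alpha = 0.05
n_min = np.log(a_hat / alpha) / b_hat
print(f"p(n) = {a_hat:.3f} * exp(-{b_hat:.5f} * n), n_min = {n_min:.0f}")
# A decay rate near zero (n_min diverging) would suggest that no
# realistic sample size yields significance, i.e. no real difference.
```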


Stroke ◽  
2021 ◽  
Vol 52 (Suppl_1) ◽  
Author(s):  
Sarah E Wetzel-Strong ◽  
Shantel M Weinsheimer ◽  
Jeffrey Nelson ◽  
Ludmila Pawlikowska ◽  
Dewi Clark ◽  
...  

Objective: Circulating plasma protein profiling may aid in the identification of cerebrovascular disease signatures. This study aimed to identify circulating angiogenic and inflammatory proteins that may serve as biomarkers to differentiate sporadic brain arteriovenous malformation (bAVM) patients from patients with other conditions involving brain AVMs, including hereditary hemorrhagic telangiectasia (HHT). Methods: The Quantibody Human Angiogenesis Array 1000 (Raybiotech), an ELISA multiplex panel, was used to assess the levels of 60 proteins related to angiogenesis and inflammation in heparin plasma samples from 13 sporadic unruptured bAVM patients (69% male, mean age 51 years) and 37 patients with HHT (40% male, mean age 47 years, n=19 (51%) with bAVM). The Quantibody Q-Analyzer tool was used to calculate biomarker concentrations based on the standard curve for each marker, and log-transformed marker levels were evaluated for associations between disease states using a multivariable interval regression model adjusted for age, sex, ethnicity and collection site. Statistical significance was based on Bonferroni correction for multiple testing of 60 biomarkers (P < 8.3x10^-4). Results: Circulating levels of two plasma proteins differed significantly between sporadic bAVM and HHT patients: PDGF-BB (P = 2.6x10^-4, PI = 3.37, 95% CI: 1.76-6.46) and CCL5 (P = 6.0x10^-6, PI = 3.50, 95% CI: 2.04-6.03). When considering markers with a nominal p-value of less than 0.01, MMP1 and angiostatin levels also differed between patients with sporadic bAVM and HHT. Markers with nominal p-values less than 0.05 in this comparison also included angiostatin, IL2, VEGF, GRO, CXCL16, ITAC, and TGFB3. Among HHT patients, circulating levels of UPAR and IL6 were elevated in patients with documented bAVMs at nominal p-values less than 0.05. Conclusions: This study identified differential expression of two promising plasma biomarkers that differentiate sporadic bAVM patients from patients with HHT. Furthermore, it allowed us to evaluate markers associated with the presence of bAVMs in HHT patients, which may offer insight into the mechanisms underlying bAVM pathophysiology.
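For reference, the Bonferroni threshold quoted above follows directly from the panel size; this short snippet reproduces the arithmetic and checks the two reported markers against it.

```python
# Reproducing the abstract's multiple-testing arithmetic.
n_tests = 60
alpha_family = 0.05
alpha_per_test = alpha_family / n_tests
print(f"per-test Bonferroni threshold: {alpha_per_test:.1e}")  # 8.3e-04

for marker, p in [("PDGF-BB", 2.6e-4), ("CCL5", 6.0e-6)]:
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"{marker}: P = {p:.1e} -> {verdict}")
```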


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values can tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p≤0.05) itself is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will conflict, in terms of significance, in one third of cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values, too, offer some evidence against the null hypothesis; they cannot be interpreted as supporting it, and to conclude from them that 'there is no effect' is a fallacy. Information on the possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g. from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
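The two replication figures quoted above follow from treating each study's significance as an independent event occurring with probability equal to the statistical power; this short check reproduces both numbers.

```python
# Probability that two independent studies of a true effect are both
# significant, and probability that they conflict, as functions of power.
for power in (0.40, 0.80):
    both = power * power                 # both studies significant
    conflict = 2 * power * (1 - power)   # exactly one significant
    print(f"power {power:.0%}: both significant = {both:.2f} "
          f"(~1 in {1 / both:.0f}), conflicting = {conflict:.2f}")
```

At 40% power, both studies are significant with probability 0.16, about one in six; at 80% power, exactly one is significant with probability 0.32, about one third, matching the abstract.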


BJPsych Open ◽  
2021 ◽  
Vol 7 (S1) ◽  
pp. S233-S233
Author(s):  
Hamid Alhaj ◽  
Rahaf Abughosh ◽  
Batool Aldaher ◽  
Asma Elhewairis ◽  
Ahmed Ali ◽  
...  

Aims: Midwakh, which involves smoking an Arabian tobacco blend typically mixed with herbs and spices, has recently become a major health concern due to its spreading popularity among adolescents and young adults in the United Arab Emirates (UAE). It is known to contain a higher nicotine content than cigarettes, potentially increasing the risk of addiction, despite popular belief to the contrary among young smokers. Yet little is known about attitudes and decision-making processes involving this emerging smoking behaviour. The aim of this study was to ascertain the knowledge, attitudes and practices of Midwakh use among adult males in the UAE. Method: A cross-sectional study was conducted among male adults in Abu Dhabi, Dubai and Sharjah. A total of 500 participants completed self-administered validated questionnaires consisting of 30 questions that targeted the public's understanding, perception and use of Midwakh. Data were analysed using SPSS 23. Percentages and means were calculated for demographic data, and the chi-square test was used to measure relations between categorical variables. Odds ratios (OR) were used to estimate how strongly a predictor was associated with an outcome. A p-value less than 0.05 was considered statistically significant. Result: The prevalence of smoking Midwakh was 34.8% in the study sample. Males aged 26 to 35 were 4.48 times (95% CI: 1.59-12.66) more likely to be current Midwakh smokers than other age groups (P = 0.01). Emiratis in the study were 5.92 times (95% CI: 2.83-12.35) more likely to smoke Midwakh than expats. 65% of respondents reported willingness to smoke Midwakh if it was offered to them. Adults with 3-4 close friends who smoke Midwakh were 6.8 times (95% CI: 2.08-22.41) more likely to smoke Midwakh themselves. Knowledge that Midwakh is unsafe was cited by 62% of participants as a reason for quitting within two years. Conclusion: Our results demonstrate a significant impact of peer pressure on the decision to smoke Midwakh. The high prevalence among young male residents warrants a multi-agency public health approach. Culturally sensitive campaigns raising awareness of the harmful effects of Midwakh, including its addictiveness, appear essential. Further research investigating the effects of targeted Midwakh-smoking cessation approaches is warranted.
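For readers unfamiliar with the statistics reported here, below is a hedged sketch of this kind of 2x2 analysis; the counts are invented for illustration, since the study's raw tables are not given in the abstract.

```python
# Chi-square test and odds ratio with 95% CI on an invented 2x2 table.
import numpy as np
from scipy.stats import chi2_contingency

# rows: Emirati / expat; columns: Midwakh smoker / non-smoker (invented)
table = np.array([[80, 70],
                  [60, 290]])

chi2, p, dof, expected = chi2_contingency(table)

# Odds ratio with a 95% CI from the log-odds normal approximation.
(a, b), (c, d) = table
or_hat = (a * d) / (b * c)
se_log = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se_log)
print(f"chi-square p = {p:.2e}, OR = {or_hat:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```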


PEDIATRICS ◽  
1989 ◽  
Vol 84 (6) ◽  
pp. A30-A30
Author(s):  
Student

Often investigators report many P values in the same study. The expected number of P values smaller than 0.05 is 1 in 20 tests of true null hypotheses; therefore the probability that at least one P value will be smaller than 0.05 increases with the number of tests, even when the null hypothesis is correct for each test. This increase is known as the "multiple-comparisons" problem...One reasonable way to correct for multiplicity is simply to multiply the P value by the number of tests. Thus, with five tests, an original 0.05 level for each is increased, perhaps to a value as high as 0.25 for the set. To achieve a level of not more than 0.05 for the set, we need to choose a level of 0.05/5 = 0.01 for the individual tests. This adjustment is conservative: we know only that the probability does not exceed 0.05 for the set.
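The passage's numbers are easy to verify; the snippet below contrasts the Bonferroni bound k x 0.05 with the exact family-wise error under independence, 1 - 0.95^k, which is slightly lower and is why the adjustment is described as conservative.

```python
# Family-wise error for k tests of true nulls at level 0.05: the
# Bonferroni bound is k * 0.05; the exact value under independence is
# 1 - 0.95**k.
alpha = 0.05
for k in (1, 5, 20):
    bound = min(k * alpha, 1.0)
    exact = 1 - (1 - alpha) ** k
    print(f"k = {k:>2}: Bonferroni bound = {bound:.2f}, "
          f"exact = {exact:.3f}, per-test level = {alpha / k:.3f}")
# k = 5 reproduces the passage: bound 0.25, per-test level 0.01.
```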


1996 ◽  
Vol 1 (1) ◽  
pp. 25-28 ◽  
Author(s):  
Martin A. Weinstock

Background: Accurate understanding of certain basic statistical terms and principles is key to critical appraisal of the published literature. Objective: This review describes type I error, type II error, null hypothesis, p value, statistical significance, α, two-tailed and one-tailed tests, effect size, alternate hypothesis, statistical power, β, publication bias, confidence interval, standard error, and standard deviation, with examples from reports of dermatologic studies. Conclusion: The application of the results of published studies to individual patients should be informed by an understanding of these basic statistical concepts.
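As a small worked complement to the glossary, the sketch below ties together three of the reviewed terms, standard deviation, standard error, and the 95% confidence interval, on invented sample values.

```python
import numpy as np

# Hypothetical lesion counts from ten patients (invented values).
x = np.array([4, 7, 5, 9, 6, 8, 5, 7, 6, 8])
mean = x.mean()
sd = x.std(ddof=1)           # standard deviation of the sample
se = sd / np.sqrt(len(x))    # standard error of the mean
# 1.96 uses the normal approximation; a t-multiplier (~2.26 for n = 10)
# would be more exact for a sample this small.
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}, "
      f"95% CI = ({lo:.2f}, {hi:.2f})")
```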


Author(s):  
Abhaya Indrayan

Background: Small P-values have conventionally been considered evidence to reject a null hypothesis in empirical studies. However, P-values now face widespread criticism, and the threshold we use for statistical significance is being questioned. Methods: This communication takes a contrarian view and explains why the P-value and its threshold are still useful for ruling out sampling fluctuation as a source of the findings. Results: The problem is not with P-values themselves but with their misuse, abuse, and over-use, including the dominant role they have assumed in empirical results. False results arise mostly from errors in design, invalid data, inadequate analysis, inappropriate interpretation, accumulation of Type-I error, and selective reporting, and not from P-values per se. Conclusion: A threshold of P-values such as 0.05 for statistical significance is helpful in making a binary inference for practical application of the result. However, a lower threshold can be suggested to reduce the chance of false results. Also, the emphasis should be on detecting a medically significant effect, not merely a nonzero effect.


2017 ◽  
Vol 27 (12) ◽  
pp. 3560-3576 ◽  
Author(s):  
Albert Vexler ◽  
Jihnhee Yu ◽  
Yang Zhao ◽  
Alan D Hutson ◽  
Gregory Gurevich

Many statistical studies report p-values for inferential purposes. In several scenarios, the stochastic aspect of p-values is neglected, which may contribute to drawing wrong conclusions in real data experiments. The stochastic nature of p-values makes it difficult to use them to examine the performance of given testing procedures or the associations between investigated factors. We turn our focus to the modern statistical literature to address the expected p-value (EPV) as a measure of the performance of decision-making rules. In the course of our study, we prove that the EPV can be considered in the context of receiver operating characteristic (ROC) curve analysis, a well-established biostatistical methodology. The ROC-based framework provides a new and efficient methodology for investigating and constructing statistical decision-making procedures, including: (1) evaluation and visualization of the properties of testing mechanisms, considering, e.g., partial EPVs; (2) development of optimal tests via the minimization of EPVs; (3) creation of novel methods for optimally combining multiple test statistics. We demonstrate that the proposed EPV-based approach allows us to maximize the integrated power of testing algorithms with respect to various significance levels. In an application, we use the proposed method to construct the optimal test and analyze a myocardial infarction disease dataset. We outline the usefulness of the "EPV/ROC" technique for evaluating different decision-making procedures, their construction and properties, with an eye towards practical applications.
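A hedged sketch of the EPV as a performance measure: under a fixed alternative, the EPV is the mean p-value of the test, which can be estimated by Monte Carlo. The effect size, sample size, and choice of test below are illustrative, not the paper's construction.

```python
# Estimate the expected p-value (EPV) of a two-sample t-test under a
# fixed alternative by averaging p-values over simulated experiments.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, effect, reps = 30, 0.5, 5_000  # illustrative settings

pvals = np.empty(reps)
for i in range(reps):
    x = rng.normal(0.0, 1.0, size=n)
    y = rng.normal(effect, 1.0, size=n)
    pvals[i] = stats.ttest_ind(x, y).pvalue

print(f"estimated EPV of the two-sample t-test: {pvals.mean():.4f}")
```

A smaller EPV indicates a better-performing test at the given alternative, and comparing EPVs across candidate tests mirrors comparing their ROC-based performance, the connection the paper develops.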

