"MULTIPLE-COMPARISONS" PROBLEM

PEDIATRICS ◽  
1989 ◽  
Vol 84 (6) ◽  
pp. A30-A30
Author(s):  
Student

Often investigators report many P values in the same study. The expected number of P values smaller than 0.05 is 1 in 20 tests of true null hypotheses; therefore the probability that at least one P value will be smaller than 0.05 increases with the number of tests, even when the null hypothesis is correct for each test. This increase is known as the "multiple-comparisons" problem...One reasonable way to correct for multiplicity is simply to multiply the P value by the number of tests. Thus, with five tests, an original 0.05 level for each is increased, perhaps to a value as high as 0.25 for the set. To achieve a level of not more than 0.05 for the set, we need to choose a level of 0.05/5 = 0.01 for the individual tests. This adjustment is conservative. We know only that the probability does not exceed 0.05 for the set.
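The arithmetic in this passage can be checked directly; a minimal sketch (the function names are ours, not the author's):

```python
# With k independent tests of true null hypotheses at level alpha, the chance
# of at least one "significant" result grows as 1 - (1 - alpha)^k, and the
# Bonferroni correction caps the set-wise level at alpha by testing each
# hypothesis at alpha / k.

def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one false positive among k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_level(alpha: float, k: int) -> float:
    """Per-test level so the set-wise error does not exceed alpha."""
    return alpha / k

print(round(familywise_error(0.05, 5), 3))  # 0.226, bounded above by 5 * 0.05 = 0.25
print(bonferroni_level(0.05, 5))            # 0.01
```

The simple multiplication (5 × 0.05 = 0.25) is the conservative upper bound the note describes; the exact probability for independent tests is slightly lower.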

2004 ◽  
Vol 22 (19) ◽  
pp. 3965-3972 ◽  
Author(s):  
Volkert B. Wreesmann ◽  
Weiji Shi ◽  
Howard T. Thaler ◽  
Ashok Poluri ◽  
Dennis H. Kraus ◽  
...  

Purpose The goal of this study was to identify chromosomal aberrations associated with poor outcome in patients with head and neck squamous cell carcinoma (HNSCC). Patients and Methods We assessed the global genomic composition of 82 HNSCCs from previously untreated patients with comparative genomic hybridization (CGH). The CGH data were subcategorized into individual cytogenetic bands. Only genomic aberrations occurring in more than 5% of cases were analyzed, and redundancies were eliminated. Each aberration was submitted to univariate analysis to assess its relationship with disease-specific survival (DSS). We used Monte Carlo simulations (MCS) to adjust P values for the log-rank approximate χ2 statistics for each abnormality and further applied the Hochberg-Benjamini procedure to adjust the P values for multiple testing of the large number of abnormalities. We then submitted abnormalities whose univariate tests resulted in an adjusted P value of less than .15 together with significant demographic/clinical variables to stepwise Cox proportional hazards regression. We again verified and adjusted P values for the χ2 approximation of the final model by MCS. Results CGH analysis revealed a recurrent pattern of chromosomal aberrations typical for HNSCC. Univariate analysis revealed 38 abnormalities that were correlated with DSS. After controlling for multiple comparisons and confounding effects of stage, five chromosomal aberrations were significantly associated with outcome, including amplification at 11q13, gain of 12q24, and losses at 5q11, 6q14, and 21q11 (MCS adjusted P = .0009 to P = .01). Conclusion HNSCC contains a complex pattern of chromosomal aberrations. A sequential approach to control for multiple comparisons and effect of confounding variables allows the identification of clinically relevant aberrations. The significance of each individual abnormality merits further consideration.
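The Benjamini-Hochberg step-up procedure the authors apply can be sketched in a few lines (a generic implementation, not the authors' code; the p-values in the example are invented):

```python
def benjamini_hochberg(pvals, q):
    """Benjamini-Hochberg step-up: return indices rejected at FDR level q.

    Sort p-values ascending; find the largest rank k with p_(k) <= (k / m) * q,
    then reject the hypotheses with the k smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Invented p-values for four abnormalities, screened at the abstract's 0.15 level.
print(benjamini_hochberg([0.01, 0.50, 0.02, 0.03], q=0.15))  # [0, 2, 3]
```

Unlike the Bonferroni correction, this controls the false discovery rate rather than the family-wise error rate, which is why it tolerates a lenient screening level such as 0.15 before the multivariable modeling step.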


Author(s):  
David McGiffin ◽  
Geoff Cumming ◽  
Paul Myles

Null hypothesis significance testing (NHST) and p-values are widespread in the cardiac surgical literature but are frequently misunderstood and misused. The purpose of the review is to discuss major disadvantages of p-values and suggest alternatives. We describe diagnostic tests, the prosecutor’s fallacy in the courtroom, and NHST, which involve inter-related conditional probabilities, to help clarify the meaning of p-values, and discuss the enormous sampling variability, or unreliability, of p-values. Finally, we use a cardiac surgical database and simulations to explore further issues involving p-values. In clinical studies, p-values provide a poor summary of the observed treatment effect, whereas the three-number summary provided by effect estimates and confidence intervals is more informative and minimises over-interpretation of a “significant” result. P-values are an unreliable measure of strength of evidence; if used at all they give only, at best, a very rough guide to decision making. Researchers should adopt Open Science practices to improve the trustworthiness of research and, where possible, use estimation (three-number summaries) or other better techniques.
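The "three-number summary" the authors advocate (point estimate flanked by confidence limits) is straightforward to compute; a minimal sketch using a normal approximation for the difference of two group means (the data are invented):

```python
import statistics as st

def three_number_summary(a, b, z=1.96):
    """Mean difference between groups a and b with an approximate 95% CI."""
    diff = st.mean(a) - st.mean(b)
    se = (st.variance(a) / len(a) + st.variance(b) / len(b)) ** 0.5
    return diff - z * se, diff, diff + z * se

treated = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
control = [4.6, 4.4, 5.0, 4.7, 4.5, 4.8]
low, est, high = three_number_summary(treated, control)
print(f"{est:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Reporting all three numbers keeps the magnitude and precision of the effect visible, which a lone p-value discards.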


2020 ◽  
Vol 30 (5) ◽  
pp. 305-317
Author(s):  
Emil Riis Abrahamsen ◽  
Regitze Kuhr Skals ◽  
Dan Dupont Hougaard

BACKGROUND: It has not yet been tested whether averaged gain values and the presence of pathological saccades are significantly altered by manual data selection, or whether data selection performed solely by the incorporated software detection algorithms provides a reliable data set following v-HIT testing. OBJECTIVE: The primary endpoint was to evaluate whether the averaged gain values of all six SCCs are significantly altered by manual data selection with two different v-HIT systems. METHOD: 120 subjects with no prior vestibular or neurological disorders underwent four separate tests of all six SCCs with either EyeSeeCam® or ICS Impulse®. All v-HIT test reports underwent manual data selection by an experienced ENT specialist, with deletion of any noise and/or artifacts. Generalized estimating equations were used to compare averaged gain values based on the unsorted data with averaged gain values based on the sorted data. RESULTS: EyeSeeCam®: Horizontal SCCs: The estimates and p-values (in parentheses) for the right lateral SCC and the left lateral SCC were 0.00004 (0.95) and 0.00087 (0.70), respectively. Vertical SCCs: The estimates varied from –0.00858 to 0.00634, with p-values ranging from 0.31 to 0.78. ICS Impulse®: Horizontal SCCs: The estimates and p-values for the right lateral SCC and the left lateral SCC were 0.00159 (0.18) and 0.00071 (0.38), respectively. Vertical SCCs: The estimates varied from 0.00217 to 0.01357, with p-values ranging from 0.00 to 0.17. Based upon the averaged gain value of each individual SCC tested, 148 tests before and 127 tests after manual data selection were considered pathological. CONCLUSION: Neither of the two v-HIT systems revealed any clinically important effects of manual data selection. However, 21 fewer tests were considered pathological after manual data selection.


PLoS ONE ◽  
2021 ◽  
Vol 16 (6) ◽  
pp. e0252991
Author(s):  
Werner A. Stahel

The p-value has been debated extensively over the last decades, attracting fierce critique but also finding some advocates. The fundamental issue with its misleading interpretation stems from its common use for testing the unrealistic null hypothesis of an effect that is precisely zero. A meaningful question asks instead whether the effect is relevant. It is then unavoidable that a threshold for relevance be chosen. Considerations that can lead to agreeable conventions for this choice are presented for several commonly used statistical situations. Based on the threshold, a simple quantitative measure of relevance emerges naturally. Statistical inference for the effect should be based on the confidence interval for the relevance measure. A classification of results that goes beyond a simple distinction like “significant / non-significant” is proposed. On the other hand, if desired, a single number called the “secured relevance” may summarize the result, as the p-value does, but with a scientifically meaningful interpretation.
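A toy version of the kind of classification the abstract describes, assuming (hypothetically) that the relevance measure is the effect rescaled by the chosen threshold and that the "secured relevance" is the lower confidence bound of that measure; the labels below are illustrative, not the author's exact terminology:

```python
def relevance_summary(ci_low, ci_high, threshold):
    """Classify an effect's confidence interval against a relevance threshold.

    Returns (secured_relevance, label). Labels are illustrative:
      'relevant'   - the whole CI lies above the threshold
      'negligible' - the whole CI lies below the threshold
      'ambiguous'  - the CI straddles the threshold
    """
    secured = ci_low / threshold          # lower confidence bound, rescaled
    if ci_low >= threshold:
        label = "relevant"
    elif ci_high < threshold:
        label = "negligible"
    else:
        label = "ambiguous"
    return secured, label

print(relevance_summary(0.12, 0.48, 0.10))   # secured relevance 1.2, 'relevant'
```

The point of such a scheme is that the verdict depends on where the interval sits relative to a scientifically chosen threshold, not on whether it excludes an exact zero.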


2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
David C Wheeler ◽  
Matthew Weir ◽  
Jagadish Gogate ◽  
Vlado Perkovic ◽  
Kenneth W Mahaffey

Abstract Background and Aims People with type 2 diabetes mellitus (T2DM) have a greater risk of cardiovascular (CV) disease and major adverse CV events (MACE), a risk that increases as renal function declines. The sodium glucose co-transporter 2 (SGLT2) inhibitor canagliflozin reduced the risk of MACE (CV death, nonfatal myocardial infarction [MI], and nonfatal stroke) in patients with T2DM and high CV risk or nephropathy in the CANVAS Program and CREDENCE trials, respectively. Method This post hoc analysis included integrated, pooled data from the CANVAS Program and the CREDENCE trial. The effects of canagliflozin compared with placebo on MACE were assessed in subgroups defined by baseline urinary albumin:creatinine ratio (UACR; <30, 30-300, and >300 mg/g). Hazard ratios (HRs) and 95% confidence intervals (CIs) were estimated using stratified (by study) Cox regression models, with subgroup by treatment interaction terms added to test for heterogeneity. Interaction P values were calculated by including the terms of treatment group, baseline UACR, and their interaction in the model. Results A total of 14,543 participants from the CANVAS Program (N = 10,142) and CREDENCE (N = 4,401) were included, with mean estimated glomerular filtration rate of 70.3 mL/min/1.73 m2 and median (interquartile range) UACR of 501.0 (8.4-523.6) mg/g. Among participants with baseline UACR measurements, 7038 (48.8%), 2762 (19.1%), and 4634 (32.1%) participants had baseline UACR <30, 30-300, and >300 mg/g, respectively. Rates of MACE and its components increased as UACR increased (Figure). Canagliflozin reduced the risk of MACE compared with placebo in the overall population (HR, 0.83; 95% CI, 0.75, 0.92), with consistent effects observed across UACR subgroups (interaction P value = 0.42).
Canagliflozin also reduced the risk of the individual components of CV death (HR, 0.84; 95% CI, 0.72, 0.97), nonfatal MI (HR, 0.83; 95% CI, 0.70, 0.99), and nonfatal stroke (HR, 0.84; 95% CI, 0.69, 1.03), independent of baseline UACR (interaction P values = 0.40, 0.88, and 0.69, respectively). Canagliflozin was generally well tolerated in the CANVAS Program and the CREDENCE trial, with consistent results on safety outcomes across UACR subgroups. Conclusion Event rates of MACE and its components increased with higher UACR. Canagliflozin reduced the risk of MACE and its components in participants with T2DM and high CV risk or CKD in the CANVAS Program and CREDENCE trial, with consistent benefits observed regardless of baseline UACR.


2017 ◽  
Author(s):  
Jose D. Perezgonzalez

Wagenmakers et al. addressed the illogical use of p-values in 'Psychological Science under Scrutiny'. While historical criticisms mostly deal with the illogical nature of null hypothesis significance testing (NHST), Wagenmakers et al. generalize such argumentation to the p-value itself. Unfortunately, Wagenmakers et al. misinterpret the formal logic basis of tests of significance (and, by extension, of tests of acceptance). This article highlights three instances where such logical interpretation fails and provides plausible corrections and further clarification.


10.2196/21345 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e21345 ◽  
Author(s):  
Marcus Bendtsen

When should a trial stop? Such a seemingly innocent question evokes concerns about type I and II errors among those who believe that certainty can be the product of uncertainty, and among researchers who have been told that they need to carefully calculate sample sizes, consider multiplicity, and not spend P values on interim analyses. However, the endeavor to dichotomize evidence into significant and nonsignificant has caused the basic driving force of science, namely uncertainty, to take a back seat. In this viewpoint we argue that if testing the null hypothesis is the ultimate goal of science, then we need not worry about writing protocols, considering ethics, applying for funding, or running any experiments at all: all null hypotheses will be rejected at some point, because everything has an effect. The job of science should be to unearth the uncertainties of the effects of treatments, not to test their difference from zero. We also show the fickleness of P values: they may point to a statistically significant result one day, and after a few more participants have been recruited, the once statistically significant effect suddenly disappears. We show plots that we hope intuitively highlight that all assessments of evidence will fluctuate over time. Finally, we discuss the remedy in the form of Bayesian methods, where uncertainty leads, and which allow for continuous decisions to stop or continue recruitment as new data from a trial accumulate.
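The fluctuation described here is easy to reproduce with a simulation under a true null effect; a minimal sketch using a one-sample z-test with known unit variance (all numbers are invented illustration, not the authors' data):

```python
import math
import random

def z_test_p(xs):
    """Two-sided z-test p-value for mean 0, standard deviation known to be 1."""
    z = sum(xs) / math.sqrt(len(xs))
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
data, trail = [], []
for _ in range(200):                      # recruit participants one at a time
    data.append(random.gauss(0, 1))       # the true effect is exactly zero
    if len(data) >= 10:
        trail.append(z_test_p(data))      # "peek" at the p-value after each recruit

# The interim minimum can dip far below the final value, inviting an early
# declaration of significance that later evaporates.
print(f"min interim p = {min(trail):.3f}, final p = {trail[-1]:.3f}")
```

Running this with different seeds shows the p-value trail wandering across any fixed threshold, which is the fickleness the viewpoint illustrates with plots.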


2009 ◽  
Vol 33 (2) ◽  
pp. 81-86 ◽  
Author(s):  
Douglas Curran-Everett

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This second installment of Explorations in Statistics delves into test statistics and P values, two concepts fundamental to the test of a scientific null hypothesis. The essence of a test statistic is that it compares what we observe in the experiment to what we expect to see if the null hypothesis is true. The P value associated with the magnitude of that test statistic answers this question: if the null hypothesis is true, what proportion of possible values of the test statistic are at least as extreme as the one I got? Although statisticians continue to stress the limitations of hypothesis tests, there are two realities we must acknowledge: hypothesis tests are ingrained within science, and the simple test of a null hypothesis can be useful. As a result, it behooves us to explore the notions of hypothesis tests, test statistics, and P values.
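The definition in this abstract ("what proportion of possible values of the test statistic are at least as extreme as the one I got?") can be explored directly by simulation, in the spirit of the series; the observed value below is invented:

```python
import random

random.seed(42)
n, reps = 25, 10_000
observed_mean = 0.45   # hypothetical experimental result; population sd taken as 1

# Simulate the test statistic (here, the sample mean) under the null hypothesis.
null_means = [sum(random.gauss(0, 1) for _ in range(n)) / n for _ in range(reps)]

# The P value is the proportion of null statistics at least as extreme as observed.
p = sum(abs(m) >= abs(observed_mean) for m in null_means) / reps
print(f"simulated P value: {p:.3f}")
```

The simulated proportion approximates the two-sided P value of the corresponding z-test (about 0.02 here), making concrete the idea that a P value is computed from the distribution of the test statistic when the null hypothesis is true.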


2016 ◽  
Vol 77 (3) ◽  
pp. 529-539 ◽  
Author(s):  
Maarten Marsman ◽  
Eric-Jan Wagenmakers

P values have been critiqued on several grounds but remain entrenched as the dominant inferential method in the empirical sciences. In this article, we elaborate on the fact that in many statistical models, the one-sided P value has a direct Bayesian interpretation as the approximate posterior mass for values lower than zero. The connection between the one-sided P value and posterior probability mass reveals three insights: (1) P values can be interpreted as Bayesian tests of direction, to be used only when the null hypothesis is known from the outset to be false; (2) as a measure of evidence, P values are biased against a point null hypothesis; and (3) with N fixed and effect size variable, there is an approximately linear relation between P values and Bayesian point null hypothesis tests.
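The correspondence described here is easy to verify numerically for a normal model: with a flat prior, the posterior for the effect is N(estimate, se²), and its mass below zero equals the one-sided P value. A minimal check (the estimate and standard error are invented):

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

est, se = 0.30, 0.15                         # hypothetical effect estimate and SE
z = est / se

one_sided_p = phi(-z)                        # P value against H1: effect > 0
posterior_below_zero = phi((0 - est) / se)   # posterior N(est, se^2), flat prior

print(one_sided_p == posterior_below_zero)   # the two quantities coincide
```

This is the "test of direction" reading: the number answers how plausible a negative effect is, not whether the effect is exactly zero.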


2021 ◽  
Vol 3 ◽  
Author(s):  
A.M. Pyatnitskiy ◽  
V.M. Gukasov ◽  
A.S. Smirnov

The article continues a series of publications developing a new, statistically motivated approach to data clustering. The proposed method searches for clusters of increased or decreased event frequencies in sets of neighboring cells of two-dimensional tessellations of the plane. Such cells may correspond to administrative regions, counties, etc. The case of simple frequency tables (histograms) with rectangular cells was considered earlier. The observed distribution of event frequencies across cells can be compared either with an expected distribution (for instance, uniform) or with the distribution at a previous moment in time. Groups of neighboring cells with the same direction of change are unified into clusters, which are then checked for statistical significance with an adjustment for multiple comparisons. Each group of cells is characterized by two parameters: its size (the number of cells) and the intensity of change. If the size of a group and/or its intensity is sufficiently pronounced, the group is considered a statistically significant cluster. There are no a priori assumptions about the number, size, or shape of potentially existing clusters. The method can be used for clustering any multidimensional array of p-values that are independent and uniformly distributed under the null hypothesis, against the alternative that there are sets of neighboring cells where p-values are close to 0 or to 1.
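A toy version of the neighborhood-grouping step on a rectangular grid (a generic sketch of grouping adjacent low-p cells, not the authors' algorithm, and without the significance adjustment the article describes):

```python
def low_p_clusters(grid, thresh=0.05):
    """Group 4-connected cells whose p-values fall below thresh (flood fill)."""
    rows, cols = len(grid), len(grid[0])
    seen, clusters = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or grid[r][c] >= thresh:
                continue
            stack, comp = [(r, c)], []
            seen.add((r, c))
            while stack:
                x, y = stack.pop()
                comp.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < rows and 0 <= ny < cols
                            and (nx, ny) not in seen and grid[nx][ny] < thresh):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
            clusters.append(comp)
    return clusters

grid = [[0.01, 0.02, 0.90],
        [0.50, 0.03, 0.80],
        [0.70, 0.60, 0.04]]
print([len(c) for c in low_p_clusters(grid)])  # [3, 1]
```

In the article's framework, each such group would then be judged by its size and intensity against what chance alone would produce, with the multiplicity of possible groups taken into account.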

