Use of the p-values as a size-dependent function to address practical differences when analyzing large datasets

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Estibaliz Gómez-de-Mariscal ◽  
Vanesa Guerrero ◽  
Alexandra Sneider ◽  
Hasini Jayatilaka ◽  
Jude M. Phillip ◽  
...  

Abstract Biomedical research has come to rely on p-values as a deterministic measure for data-driven decision-making. In the widely used null hypothesis significance testing framework for identifying statistically significant differences among groups of observations, a single p-value is computed from the sample data and routinely compared with a threshold, commonly set to 0.05, to assess the evidence against the null hypothesis of no difference among the groups. Because the estimated p-value tends to decrease as the sample size increases, applying this methodology to datasets with large sample sizes almost invariably rejects the null hypothesis, rendering the test uninformative in that setting. We propose a new approach to detect differences based on the dependence of the p-value on the sample size. We introduce new descriptive parameters that overcome the effect of sample size on the interpretation of the p-value in datasets with large sample sizes, reducing the uncertainty in the decision about the existence of biological differences between the compared experiments. The methodology enables the graphical and quantitative characterization of the differences between the compared experiments, guiding researchers through the decision process. An in-depth study of the methodology is carried out on simulated and experimental data. Code is available at https://github.com/BIIG-UC3M/pMoSS.
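As a rough illustration of the kind of analysis described above (a sketch in the spirit of the method, not the authors' pMoSS implementation), one can estimate the p-value as a function of the subsample size n by repeated subsampling, fit a generic exponential-decay model to the resulting curve, and read off the sample size at which the fitted p-value crosses 0.05. The data, model form, and fitting choices below are illustrative assumptions.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)

# Two large simulated experiments with a small real difference (illustrative data).
group_a = rng.normal(0.00, 1.0, 50_000)
group_b = rng.normal(0.05, 1.0, 50_000)

# Estimate p(n): the average Mann-Whitney p-value over repeated subsamples of size n.
sizes = np.unique(np.logspace(1.5, 3.5, 15).astype(int))
p_of_n = np.array([
    np.mean([stats.mannwhitneyu(rng.choice(group_a, n, replace=False),
                                rng.choice(group_b, n, replace=False)).pvalue
             for _ in range(50)])
    for n in sizes
])

# Fit a generic exponential decay p(n) = a * exp(-b * n**c) to the averaged curve.
def decay(n, a, b, c):
    return a * np.exp(-b * n**c)

(a, b, c), _ = optimize.curve_fit(decay, sizes, p_of_n, p0=(0.5, 0.01, 0.5),
                                  bounds=(1e-9, [1.0, 10.0, 2.0]), maxfev=20_000)

# Minimum sample size at which the fitted curve drops below alpha = 0.05.
alpha = 0.05
n_alpha = (np.log(a / alpha) / b) ** (1.0 / c)
print(f"fitted decay: a={a:.3f}, b={b:.4f}, c={c:.3f}; n_alpha ~ {n_alpha:,.0f}")
```

A very large n_alpha relative to realistic experiment sizes suggests that the detected difference, although eventually "significant", is practically negligible.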

2009 ◽  
Vol 40 (4) ◽  
pp. 415-427 ◽  
Author(s):  
Lee-Shen Chen ◽  
Ming-Chung Yang

This article considers the problem of testing marginal homogeneity in $2 \times 2$ contingency tables under the multinomial sampling scheme. From the frequentist perspective, McNemar's exact $p$-value ($p_{ME}$) is the most commonly used $p$-value in practice, but it can be conservative for small to moderate sample sizes. From the Bayesian perspective, one can construct Bayesian $p$-values using proper prior and posterior distributions, namely the prior predictive $p$-value ($p_{prior}$) and the posterior predictive $p$-value ($p_{post}$). Another Bayesian $p$-value, the partial posterior predictive $p$-value ($p_{ppost}$), first proposed by [2], avoids the double use of the data that occurs in $p_{post}$. For the preceding problem, we derive $p_{prior}$, $p_{post}$, and $p_{ppost}$ based on the noninformative uniform prior. Under the criterion of uniformity in the frequentist sense, comparisons among $p_{prior}$, $p_{ME}$, $p_{post}$, and $p_{ppost}$ are given. Numerical results show that $p_{ppost}$ has the best performance for small to moderately large sample sizes.
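For intuition, here is a minimal Monte Carlo sketch (my own illustration under stated assumptions, not the article's derivations) of McNemar's exact p-value and of a commonly used posterior predictive p-value based on the full Dirichlet posterior under a uniform prior. The table counts are hypothetical, and the two-sided exact p-value is computed by simple doubling.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired 2x2 table: cells are (n11, n12, n21, n22).
n11, n12, n21, n22 = 32, 9, 3, 16
n = np.array([n11, n12, n21, n22])

# McNemar's exact p-value: given d = n12 + n21 discordant pairs, n12 ~ Binomial(d, 1/2)
# under marginal homogeneity; a two-sided p-value obtained by doubling the smaller tail.
d = n12 + n21
p_me = min(1.0, 2 * min(stats.binom.cdf(n12, d, 0.5), stats.binom.sf(n12 - 1, d, 0.5)))

# One common posterior predictive p-value: draw cell probabilities from the Dirichlet
# posterior under a uniform Dirichlet(1,1,1,1) prior, simulate replicate tables, and
# compare the McNemar chi-square discrepancy with the observed one.
def discrepancy(t):
    return 0.0 if t[1] + t[2] == 0 else (t[1] - t[2]) ** 2 / (t[1] + t[2])

t_obs, reps, exceed = discrepancy(n), 20_000, 0
for _ in range(reps):
    p = rng.dirichlet(n + 1)
    exceed += discrepancy(rng.multinomial(n.sum(), p)) >= t_obs
p_post = exceed / reps

print(f"McNemar exact p = {p_me:.4f}, posterior predictive p = {p_post:.4f}")
```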


2019 ◽  
Author(s):  
Estibaliz Gómez-de-Mariscal ◽  
Alexandra Sneider ◽  
Hasini Jayatilaka ◽  
Jude M. Phillip ◽  
Denis Wirtz ◽  
...  

Abstract Biomedical research has come to rely on p-values to determine potential translational impact. The p-value is routinely compared with a threshold, commonly set to 0.05, to assess the evidence against the null hypothesis. Whenever a large enough dataset is available, this threshold is easily reached; treating such rejections as meaningful, a practice related to p-hacking, leads to spurious conclusions. Herein, we propose a systematic and easy-to-follow protocol that models the p-value as an exponential function of the sample size to test whether statistically significant differences are also real ones. This new approach provides a robust assessment of the null hypothesis, together with an accurate estimate of the minimum data size needed to reject it. An in-depth study of the model is carried out on both simulated and experimentally obtained data. Simulations show that, under controlled data, our assumptions hold. The results of our analysis of the experimental datasets reflect the broad scope of this approach in common decision-making processes.
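The key empirical fact behind the protocol is that p-values shrink with sample size only when a real difference exists. A minimal simulation (an illustration under simple Gaussian assumptions, not the authors' code) makes the contrast visible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def mean_p(delta, n, reps=200):
    """Average two-sample t-test p-value at per-group size n for a mean shift delta."""
    return np.mean([stats.ttest_ind(rng.normal(0.0, 1.0, n),
                                    rng.normal(delta, 1.0, n)).pvalue
                    for _ in range(reps)])

for n in (100, 1_000, 10_000):
    print(f"n={n:>6}:  no real difference p~{mean_p(0.0, n):.3f}   "
          f"small real difference p~{mean_p(0.05, n):.3f}")
# With no real difference the average p-value stays near 0.5 at every n; with a
# small real difference it decays toward zero as n grows, which is the behaviour
# the exponential model in the abstract is designed to capture.
```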


2016 ◽  
Vol 11 (4) ◽  
pp. 551-554 ◽  
Author(s):  
Martin Buchheit

The first sport-science-oriented and comprehensive paper on magnitude-based inferences (MBI) was published 10 y ago in the first issue of this journal. While debate continues, MBI is today well established in sport science and in other fields, particularly clinical medicine, where practical/clinical significance often takes priority over statistical significance. In this commentary, some reasons why both academics and sport scientists should abandon null-hypothesis significance testing and embrace MBI are reviewed. Apparent limitations and future areas of research are also discussed. The following arguments are presented: P values and, in turn, study conclusions are sample-size dependent, irrespective of the size of the effect; significance does not inform on the magnitude of an effect, yet magnitude is what matters most; MBI allows authors to be honest with their sample size and better acknowledge trivial effects; examining magnitudes per se helps frame better research questions; MBI can be applied to assess changes in individuals; MBI improves data visualization; and MBI is supported by spreadsheets freely available on the Internet. Finally, recommendations to define the smallest important effect and to improve the presentation of standardized effects are presented.
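For readers unfamiliar with the mechanics, the core MBI-style calculation resembles the sketch below: given an effect estimate, its standard error, and a smallest worthwhile change, compute the probabilities that the true effect is beneficial, trivial, or harmful under a normal approximation. The numbers are illustrative assumptions, and this is not the published spreadsheets' code.

```python
from scipy import stats

# Illustrative inputs (assumptions, not from the commentary): an observed
# standardized effect of 0.30, standard error 0.20, smallest worthwhile change 0.20.
effect, se, swc = 0.30, 0.20, 0.20

# Spreadsheet-style magnitude-based probabilities from a normal approximation
# to the sampling distribution of the effect.
p_beneficial = stats.norm.sf(swc, loc=effect, scale=se)
p_harmful = stats.norm.cdf(-swc, loc=effect, scale=se)
p_trivial = 1.0 - p_beneficial - p_harmful

print(f"beneficial {p_beneficial:.0%}, trivial {p_trivial:.0%}, harmful {p_harmful:.0%}")
# ~69% beneficial, ~30% trivial, ~1% harmful here: "possibly beneficial" in MBI
# terms, a judgement about magnitude that a bare p-value does not provide.
```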


PEDIATRICS ◽  
1989 ◽  
Vol 84 (6) ◽  
pp. A30-A30
Author(s):  
Student

Often investigators report many P values in the same study. The expected number of P values smaller than 0.05 is 1 in 20 tests of true null hypotheses; therefore the probability that at least one P value will be smaller than 0.05 increases with the number of tests, even when the null hypothesis is correct for each test. This increase is known as the "multiple-comparisons" problem... One reasonable way to correct for multiplicity is simply to multiply the P value by the number of tests. Thus, with five tests, an original 0.05 level for each is increased, perhaps to a value as high as 0.25 for the set. To achieve a level of not more than 0.05 for the set, we need to choose a level of 0.05/5 = 0.01 for the individual tests. This adjustment is conservative: we know only that the probability does not exceed 0.05 for the set.
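The two adjustments described above (multiplying each P value by the number of tests, or testing each at 0.05 divided by the number of tests) can be written in a few lines; the P values below are illustrative.

```python
# Bonferroni-style adjustment: with m tests, either multiply each p-value by m
# (capped at 1) or compare each against alpha / m.
m, alpha = 5, 0.05
p_values = [0.012, 0.034, 0.051, 0.20, 0.47]       # illustrative values

adjusted = [min(1.0, p * m) for p in p_values]      # p multiplied by the number of tests
per_test_level = alpha / m                          # 0.05 / 5 = 0.01

for p, adj in zip(p_values, adjusted):
    print(f"raw p={p:.3f}  adjusted p={adj:.3f}  "
          f"significant at family level {alpha}: {p < per_test_level}")
```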


2019 ◽  
Vol 9 (4) ◽  
pp. 813-850 ◽  
Author(s):  
Jay Mardia ◽  
Jiantao Jiao ◽  
Ervin Tánczos ◽  
Robert D Nowak ◽  
Tsachy Weissman

Abstract We study concentration inequalities for the Kullback–Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method-of-types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss applications of our results to obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and to the difference between concentration around the expectation and concentration around zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distributions, and demonstrate that its behaviour differs quantitatively depending on whether the sample size is small or large relative to the alphabet size.
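A small Monte Carlo experiment (an illustration, not the paper's bounds) shows the quantities being bounded: the empirical KL divergence, its mean and variance, and how its tail behaviour shifts between the small-sample and large-sample regimes relative to the alphabet size.

```python
import numpy as np

rng = np.random.default_rng(3)

def kl_draws(p, n, reps=5_000):
    """Monte Carlo draws of D(p_hat_n || p) for i.i.d. samples of size n."""
    counts = rng.multinomial(n, p, size=reps)
    p_hat = counts / n
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_hat > 0, p_hat * np.log(p_hat / p), 0.0)
    return terms.sum(axis=1)

k = 100
p = np.full(k, 1.0 / k)                    # uniform true distribution on k symbols
for n in (50, 500, 5_000):
    d = kl_draws(p, n)
    print(f"n={n:>5}, k={k}:  mean={d.mean():.4f}  var={d.var():.6f}  "
          f"P(D > 2*mean)={(d > 2 * d.mean()).mean():.3f}")
# In the large-sample regime (n >> k) the mean is roughly (k-1)/(2n) and D
# concentrates tightly around it; for n comparable to or smaller than k the
# variance and tail behaviour are quite different, which is the contrast the
# paper's bounds quantify.
```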


Author(s):  
David McGiffin ◽  
Geoff Cumming ◽  
Paul Myles

Null hypothesis significance testing (NHST) and p-values are widespread in the cardiac surgical literature but are frequently misunderstood and misused. The purpose of this review is to discuss the major disadvantages of p-values and to suggest alternatives. We describe diagnostic tests, the prosecutor's fallacy in the courtroom, and NHST, all of which involve inter-related conditional probabilities, to help clarify the meaning of p-values, and we discuss the enormous sampling variability, or unreliability, of p-values. Finally, we use a cardiac surgical database and simulations to explore further issues involving p-values. In clinical studies, p-values provide a poor summary of the observed treatment effect, whereas the three-number summary provided by effect estimates and confidence intervals is more informative and minimises over-interpretation of a "significant" result. P-values are an unreliable measure of strength of evidence; if used at all, they give only a very rough guide to decision making. Researchers should adopt Open Science practices to improve the trustworthiness of research and, where possible, use estimation (three-number summaries) or other better techniques.
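The contrast between a bare p-value and the three-number summary is easy to see on simulated data (illustrative numbers, not drawn from the review's cardiac surgical database):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated ICU length-of-stay data (days) for two strategies; the numbers are
# assumptions for this sketch, not values from the review's surgical database.
control = rng.normal(loc=5.2, scale=1.8, size=120)
treatment = rng.normal(loc=4.7, scale=1.8, size=120)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / treatment.size + control.var(ddof=1) / control.size)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se    # normal-approximation 95% CI
p = stats.ttest_ind(treatment, control).pvalue

# The three-number summary (estimate plus 95% CI) conveys the size and precision
# of the effect; the p-value alone does not.
print(f"effect estimate {diff:.2f} days, 95% CI [{ci_low:.2f}, {ci_high:.2f}], p = {p:.3f}")
```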


2012 ◽  
Vol 9 (5) ◽  
pp. 561-569 ◽  
Author(s):  
KK Gordan Lan ◽  
Janet T Wittes

Background: Traditional calculations of sample size do not formally incorporate uncertainty about the likely effect size. Use of a normal prior to express that uncertainty, as recently recommended, can lead to power that does not approach 1 as the sample size approaches infinity.
Purpose: To provide approaches for calculating sample size and power that formally incorporate uncertainty about effect size. The relevant formulas should ensure that power approaches 1 as sample size increases indefinitely and should be easy to calculate.
Methods: We examine normal, truncated normal, and gamma priors for effect size computationally and demonstrate analytically an approach to approximating the power for a truncated normal prior. We also propose a simple compromise method that requires a moderately larger sample size than the one derived from the fixed-effect method.
Results: Use of a realistic prior distribution instead of a fixed treatment effect is likely to increase the sample size required for a Phase 3 trial. The standard fixed-effect method for moving from estimates of effect size obtained in a Phase 2 trial to the sample size of a Phase 3 trial ignores the variability inherent in the estimate from Phase 2. Truncated normal priors appear to require unrealistically large sample sizes, while gamma priors appear to place too much probability on large effect sizes and therefore produce unrealistically high power.
Limitations: The article deals with a few examples and a limited range of parameters. It does not deal explicitly with binary or time-to-failure data.
Conclusions: Use of the standard fixed approach to sample size calculation often yields a sample size leading to lower power than desired. Other natural parametric priors lead either to unacceptably large sample sizes or to unrealistically high power. We recommend an approach that is a compromise between assuming a fixed effect size and assigning a normal prior to the effect size.
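A simplified numerical sketch of the central issue (illustrative effect estimate, a two-sample z-test approximation, and simple truncation at zero rather than the authors' exact formulation) shows how averaging power over a prior on the effect size lowers it relative to the fixed-effect calculation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

sigma, alpha = 1.0, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

def power(delta, n_per_arm):
    """Approximate power of a two-sample z-test for a standardized mean difference delta."""
    ncp = delta * np.sqrt(n_per_arm / 2) / sigma
    return stats.norm.sf(z_crit - ncp)

# Illustrative Phase 2 estimate of the effect and its standard error.
delta_hat, se_hat = 0.30, 0.12
n_per_arm = 235                 # sized for ~90% power at the fixed Phase 2 estimate

# Power averaged over priors centred at the Phase 2 estimate.
draws_normal = rng.normal(delta_hat, se_hat, 100_000)
draws_trunc = draws_normal[draws_normal > 0]            # crude truncation at zero

print(f"fixed-effect power:          {power(delta_hat, n_per_arm):.3f}")
print(f"normal-prior average power:  {power(draws_normal, n_per_arm).mean():.3f}")
print(f"truncated-normal avg power:  {power(draws_trunc, n_per_arm).mean():.3f}")
# Averaging over a realistic prior pulls power below the fixed-effect value
# (and with an untruncated normal prior it cannot reach 1 even as n grows),
# which is why the planned sample size typically needs to increase.
```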


2021 ◽  
Vol 2 (4) ◽  
Author(s):  
R Mukherjee ◽  
N Muehlemann ◽  
A Bhingare ◽  
G W Stone ◽  
C Mehta

Abstract
Background: Cardiovascular trials increasingly require large sample sizes and long follow-up periods. Several approaches have been developed to optimize sample size, such as adaptive group sequential trials, sample size re-estimation based on the promising zone, and the win ratio. Traditionally, the log-rank test or the Cox proportional hazards model is used to test for treatment effects, based on a constant hazard rate and proportional hazards alternatives, which, however, may not always hold. Large sample sizes and/or long follow-up periods are especially challenging for trials evaluating the efficacy of acute care interventions.
Purpose: We propose an adaptive design wherein, using interim data, Bayesian computation of predictive power guides the increase in sample size and/or the minimum follow-up duration. These computations do not depend on the constant hazard rate and proportional hazards assumptions, thus yielding more robust interim decision making for the future course of the trial.
Methods: PROTECT IV is designed to evaluate mechanical circulatory support with the Impella CP device vs. standard of care during high-risk PCI. The primary endpoint is a composite of all-cause death, stroke, MI, or hospitalization for cardiovascular causes, with an initial minimum follow-up of 12 months and initial enrolment of 1252 patients, with expected recruitment in 24 months. The study will employ an adaptive increase in sample size and/or minimum follow-up at the interim analysis, when ∼80% of patients have been enrolled. The adaptations utilize extensive simulations to choose a new sample size of up to 2500 and a new minimum follow-up time of up to 36 months that provide a Bayesian predictive power of 85%. Bayesian calculations are based on patient-level information rather than summary statistics, thereby enabling more reliable interim decisions. Constant or proportional hazard assumptions are not required for this approach because two separate piecewise constant hazard models with Gamma priors are fitted to the interim data. Bayesian predictive power is then calculated using Monte Carlo methodology. Via extensive simulations, we have examined the utility of the proposed design for situations with time-varying hazards and non-proportional hazard ratios, such as a delayed treatment effect (Figure 1) and crossing of survival curves. The heat map of Bayesian predictive power obtained when the interim Kaplan-Meier curves reflected a delayed response shows that, for this scenario, an optimal combination of increased sample size and increased follow-up time would be needed to attain 85% predictive power.
Conclusion: The proposed adaptive design, with sample size and minimum follow-up adaptation based on Bayesian predictive power at interim looks, allows the trial to be de-risked against uncertainties regarding the effect size in terms of the control-arm outcome rate, hazard ratio, and recruitment rate.
Funding Acknowledgement: Type of funding sources: Private company. Main funding source(s): Abiomed, Inc.
Figure 1: heat map of Bayesian predictive power under a delayed treatment effect scenario.
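As a deliberately simplified sketch of the interim machinery (a normal approximation to the treatment effect in place of the piecewise constant hazard models with Gamma priors described above, illustrative interim numbers, and the reuse of interim data at the final analysis ignored), Bayesian predictive power can be computed by averaging the probability of final-analysis success over posterior draws of the effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

# Illustrative interim state: enrolled patients, effect estimate, and its standard error.
n_interim, theta_hat, se_interim = 1_000, 0.10, 0.06

def predictive_power(n_final, draws=100_000):
    """P(final z-statistic exceeds z_crit), averaged over posterior draws of the effect
    (flat prior, normal approximation, interim-data reuse ignored for simplicity)."""
    theta = rng.normal(theta_hat, se_interim, draws)          # posterior draws
    se_final = se_interim * np.sqrt(n_interim / n_final)      # SE shrinks with more data
    z_final = rng.normal(theta / se_final, 1.0)               # simulated final z-statistics
    return (z_final > z_crit).mean()

for n_final in (1_252, 1_800, 2_500):
    print(f"final sample size {n_final}: Bayesian predictive power ~ {predictive_power(n_final):.2f}")
# The adaptive rule described above searches over candidate sample sizes (and
# minimum follow-up durations) until the predictive power reaches its target.
```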


Author(s):  
Adam Chan ◽  
BCIT School of Health Sciences, Environmental Health ◽  
Helen Heacock

Cantaloupe melon was the source of a lethal outbreak of Listeria in 2011. This research investigated whether washing a contaminated cantaloupe rind was sufficient to prevent the transfer of Escherichia coli to the flesh. The null hypothesis for this study was that there is no association between washing a contaminated cantaloupe melon and the presence of contamination in the flesh. In this study, 10 cantaloupes were used to produce a sample size of 20 for each of the washed and unwashed treatments. Each sample was transferred to EC broth to determine the presence or absence of Escherichia coli (E. coli), the indicator organism that acted as the "outbreak contaminant." The results showed E. coli transferred into the flesh of 100% of the unwashed melons and 80% of the washed melons. A chi-square analysis produced a p-value of 0.035, and the study concluded that there was a statistically significant association between washing a melon and the presence of E. coli in the melon flesh. The author recommends washing melon rind as a means to prevent foodborne illness caused by surface contaminants.
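The reported p-value of 0.035 is consistent with an uncorrected chi-square test on the 2x2 table implied by the results (20/20 unwashed and 16/20 washed samples positive); a quick check:

```python
from scipy.stats import chi2_contingency

# 2x2 table implied by the reported results:
# rows = unwashed / washed, columns = E. coli present / absent.
table = [[20, 0],
         [16, 4]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")   # ~4.44, 1, ~0.035
```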

