New relevance and significance measures to replace p-values

The p-value has been debated exorbitantly in the last decades, experiencing fierce critique, but also finding some advocates. The fundamental issue with its misleading interpretation stems from its common use for testing the unrealistic null hypothesis of an effect that is precisely zero. A meaningful question asks instead whether the effect is relevant. It is then unavoidable that a threshold for relevance is chosen. Considerations that can lead to agreeable conventions for this choice are presented for several commonly used statistical situations. Based on the threshold, a simple quantitative measure of relevance emerges naturally. Statistical inference for the effect should be based on the confidence interval for the relevance measure. A classification of results that goes beyond a simple distinction like “significant / non-significant” is proposed. On the other hand, if desired, a single number called the “secured relevance” may summarize the result, like the p-value does it, but with a scientifically meaningful interpretation.

Download Full-text

"MULTIPLE-COMPARISONS" PROBLEM

PEDIATRICS ◽

10.1542/peds.84.6.a30a ◽

1989 ◽

Vol 84 (6) ◽

pp. A30-A30

Author(s):

Student

Keyword(s):

Null Hypothesis ◽

Multiple Comparisons ◽

P Value ◽

Expected Number ◽

P Values ◽

A Value ◽

True Null Hypotheses ◽

The Individual

Often investigators report many P values in the same study. The expected number of P values smaller than 0.05 is 1 in 20 tests of true null hypotheses; therefore the probability that at least one P value will be smaller than 0.05 increases with the number of tests, even when the null hypothesis is correct for each test. This increase is known as the "multiple-comparisons" problem...One reasonable way to correct for multiplicity is simply to multiply the P value by the number of tests. Thus, with five tests, an orignal 0.05 level for each is increased, perhaps to a value as high as 0.25 for the set. To achieve a level of not more than 0.05 for the set, we need to choose a level of 0.05/5 = 0.01 for the individual tests. This adjustment is conservative. We know only that the probability does not exceed 0.05 for the set.

Download Full-text

The frequent insignificance of a “significant” P-value

10.22541/au.163250082.20225291/v1 ◽

2021 ◽

Author(s):

David McGiffin ◽

Geoff Cumming ◽

Paul Myles

Keyword(s):

Diagnostic Tests ◽

Null Hypothesis ◽

Open Science ◽

Significance Testing ◽

P Value ◽

Conditional Probabilities ◽

Null Hypothesis Significance Testing ◽

P Values ◽

Science Practices ◽

Strength Of Evidence

Null hypothesis significance testing (NHST) and p-values are widespread in the cardiac surgical literature but are frequently misunderstood and misused. The purpose of the review is to discuss major disadvantages of p-values and suggest alternatives. We describe diagnostic tests, the prosecutor’s fallacy in the courtroom, and NHST, which involve inter-related conditional probabilities, to help clarify the meaning of p-values, and discuss the enormous sampling variability, or unreliability, of p-values. Finally, we use a cardiac surgical database and simulations to explore further issues involving p-values. In clinical studies, p-values provide a poor summary of the observed treatment effect, whereas the three- number summary provided by effect estimates and confidence intervals is more informative and minimises over-interpretation of a “significant” result. P-values are an unreliable measure of strength of evidence; if used at all they give only, at best, a very rough guide to decision making. Researchers should adopt Open Science practices to improve the trustworthiness of research and, where possible, use estimation (three-number summaries) or other better techniques.

Download Full-text

EVALUATION OF THE ARROWS METHOD FOR CLASSIFICATION OF DATA

Asia Pacific Journal of Operational Research ◽

10.1142/s0217595910002600 ◽

2010 ◽

Vol 27 (01) ◽

pp. 121-142 ◽

Cited By ~ 1

Author(s):

LANTING LU ◽

CHRISTINE S. M. CURRIE

Keyword(s):

Goodness Of Fit ◽

Assembly Line ◽

P Value ◽

Practical Application ◽

P Values ◽

Conflicting Objectives ◽

Technical Details ◽

Simulation Input ◽

Von Mises

We evaluate the Arrows Classification Method (ACM) for grouping objects based on the similarity of their data. This is a new method, which aims to achieve a balance between the conflicting objectives of maximizing internal cohesion and external isolation in the output groups. The method is widely applicable, especially in simulation input and output modelling, and has previously been used for grouping machines on an assembly line, based on data on time-to-repair; and hospital procedures, based on length-of-stay data. The similarity of the data from a pair of objects is measured using the two-sample Cramér-von-Mises goodness of fit statistic, with bootstrapping employed to find the significance or p-value of the calculated statistic. The p-values coming from the paired comparisons serve as inputs to the ACM, and allow the objects to be classified such that no pair of objects that are grouped together have significantly different data. In this article, we give the technical details of the method and evaluate its use through testing with specially generated samples. We will also demonstrate its practical application with two real examples.

Download Full-text

The logic of p-values

10.31234/osf.io/z9ua2 ◽

2017 ◽

Author(s):

Jose D. Perezgonzalez

Keyword(s):

Null Hypothesis ◽

Formal Logic ◽

Significance Testing ◽

P Value ◽

Null Hypothesis Significance Testing ◽

P Values ◽

Logical Interpretation ◽

Psychological Science ◽

Tests Of Significance

Wagenmakers et al. addressed the illogic use of p-values in 'Psychological Science under Scrutiny'. While historical criticisms mostly deal with the illogical nature of null hypothesis significance testing (NHST), Wagenmakers et al. generalize such argumentation to the p-value itself. Unfortunately, Wagenmakers et al. misinterpret the formal logic basis of tests of significance (and, by extension, of tests of acceptance). This article highlights three instances where such logical interpretation fails and provides plausible corrections and further clarification.

Download Full-text

The P Value Line Dance: When Does the Music Stop?

Journal of Medical Internet Research ◽

10.2196/21345 ◽

2020 ◽

Vol 22 (8) ◽

pp. e21345 ◽

Cited By ~ 3

Author(s):

Marcus Bendtsen

Keyword(s):

Decision Making ◽

Bayesian Methods ◽

Null Hypothesis ◽

Driving Force ◽

P Value ◽

Type I ◽

P Values ◽

Interim Analyses ◽

Back Seat ◽

Over Time

When should a trial stop? Such a seemingly innocent question evokes concerns of type I and II errors among those who believe that certainty can be the product of uncertainty and among researchers who have been told that they need to carefully calculate sample sizes, consider multiplicity, and not spend P values on interim analyses. However, the endeavor to dichotomize evidence into significant and nonsignificant has led to the basic driving force of science, namely uncertainty, to take a back seat. In this viewpoint we discuss that if testing the null hypothesis is the ultimate goal of science, then we need not worry about writing protocols, consider ethics, apply for funding, or run any experiments at all—all null hypotheses will be rejected at some point—everything has an effect. The job of science should be to unearth the uncertainties of the effects of treatments, not to test their difference from zero. We also show the fickleness of P values, how they may one day point to statistically significant results; and after a few more participants have been recruited, the once statistically significant effect suddenly disappears. We show plots which we hope would intuitively highlight that all assessments of evidence will fluctuate over time. Finally, we discuss the remedy in the form of Bayesian methods, where uncertainty leads; and which allows for continuous decision making to stop or continue recruitment, as new data from a trial is accumulated.

Download Full-text

Explorations in statistics: hypothesis tests and P values

AJP Advances in Physiology Education ◽

10.1152/advan.90218.2008 ◽

2009 ◽

Vol 33 (2) ◽

pp. 81-86 ◽

Cited By ~ 29

Author(s):

Douglas Curran-Everett

Keyword(s):

Null Hypothesis ◽

P Value ◽

Test Statistics ◽

Simple Test ◽

Hypothesis Tests ◽

Test Statistic ◽

P Values ◽

The One

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This second installment of Explorations in Statistics delves into test statistics and P values, two concepts fundamental to the test of a scientific null hypothesis. The essence of a test statistic is that it compares what we observe in the experiment to what we expect to see if the null hypothesis is true. The P value associated with the magnitude of that test statistic answers this question: if the null hypothesis is true, what proportion of possible values of the test statistic are at least as extreme as the one I got? Although statisticians continue to stress the limitations of hypothesis tests, there are two realities we must acknowledge: hypothesis tests are ingrained within science, and the simple test of a null hypothesis can be useful. As a result, it behooves us to explore the notions of hypothesis tests, test statistics, and P values.

Download Full-text

Three Insights from a Bayesian Interpretation of the One-Sided P Value

Educational and Psychological Measurement ◽

10.1177/0013164416669201 ◽

2016 ◽

Vol 77 (3) ◽

pp. 529-539 ◽

Cited By ~ 34

Author(s):

Maarten Marsman ◽

Eric-Jan Wagenmakers

Keyword(s):

Statistical Models ◽

Null Hypothesis ◽

Posterior Probability ◽

P Value ◽

P Values ◽

Inferential Method ◽

Point Null Hypothesis ◽

Probability Mass ◽

The One ◽

Size Variable

P values have been critiqued on several grounds but remain entrenched as the dominant inferential method in the empirical sciences. In this article, we elaborate on the fact that in many statistical models, the one-sided P value has a direct Bayesian interpretation as the approximate posterior mass for values lower than zero. The connection between the one-sided P value and posterior probability mass reveals three insights: (1) P values can be interpreted as Bayesian tests of direction, to be used only when the null hypothesis is known from the outset to be false; (2) as a measure of evidence, P values are biased against a point null hypothesis; and (3) with N fixed and effect size variable, there is an approximately linear relation between P values and Bayesian point null hypothesis tests.

Download Full-text

Confronting p-hacking: addressing p-value dependence on sample size

10.1101/2019.12.17.878405 ◽

2019 ◽

Author(s):

Estibaliz Gómez-de-Mariscal ◽

Alexandra Sneider ◽

Hasini Jayatilaka ◽

Jude M. Phillip ◽

Denis Wirtz ◽

...

Keyword(s):

Decision Making ◽

Null Hypothesis ◽

Statistical Significance ◽

P Value ◽

P Values ◽

New Approach ◽

Minimum Data ◽

Large Scope ◽

Depth Study ◽

Decision Making Processes

ABSTRACTBiomedical research has come to rely on p-values to determine potential translational impact. The p-value is routinely compared with a threshold commonly set to 0.05 to assess the significance of the null hypothesis. Whenever a large enough dataset is available, this threshold is easily reachable. This phenomenon is known as p-hacking and it leads to spurious conclusions. Herein, we propose a systematic and easy-to-follow protocol that models the p-value as an exponential function to test the existence of real statistical significance. This new approach provides a robust assessment of the null hypothesis with accurate values for the minimum data-size needed to reject it. An in-depth study of the model is carried out in both simulated and experimentally-obtained data. Simulations show that under controlled data, our assumptions are true. The results of our analysis in the experimental datasets reflect the large scope of this approach in common decision-making processes.

Download Full-text

Performance of Test Supermartingale Confidence Intervals for the Success Probability of Bernoulli Trials

Journal of Research of the National Institute of Standards and Technology ◽

10.6028/jres.125.003 ◽

2020 ◽

Vol 125 ◽

Author(s):

Peter Wills ◽

Emanuel Knill ◽

Kevin Coakley ◽

Yanbao Zhang

Keyword(s):

Confidence Intervals ◽

Null Hypothesis ◽

Information Science ◽

Success Probability ◽

Stopping Rules ◽

P Value ◽

Bernoulli Trials ◽

P Values ◽

Science Test ◽

Hoeffding Bounds

Given a composite null hypothesis H0, test supermartingales are non-negative supermartingales with respect to H0 with an initial value of 1. Large values of test supermartingales provide evidence against H0. As a result, test supermartingales are an effective tool for rejecting H0, particularly when the p-values obtained are very small and serve as certificates against the null hypothesis. Examples include the rejection of local realism as an explanation of Bell test experiments in the foundations of physics and the certification of entanglement in quantum information science. Test supermartingales have the advantage of being adaptable during an experiment and allowing for arbitrary stopping rules. By inversion of acceptance regions, they can also be used to determine confidence sets. We used an example to compare the performance of test supermartingales for computing p-values and confidence intervals to Chernoff-Hoeffding bounds and the “exact” p-value. The example is the problem of inferring the probability of success in a sequence of Bernoulli trials. There is a cost in using a technique that has no restriction on stopping rules, and, for a particular test supermartingale, our study quantifies this cost.

Download Full-text

Preprint of: "Redefine or Justify? Comments on the alpha debate"

10.31234/osf.io/rbm8y ◽

2018 ◽

Author(s):

Jan Peter De Ruiter

Keyword(s):

Empirical Evidence ◽

Null Hypothesis ◽

Psychological Research ◽

P Value ◽

Significance Tests ◽

P Values ◽

Alpha Level ◽

Replication Crisis ◽

Crisis In Psychology

Benjamin et al. (2017) proposed improving the reproducibility of findings in psychological research by lowering the alpha level of our conventional Null Hypothesis Significance Tests from .05 to .005, because findings with p-values close to .05 represent insufficient empirical evidence. They argued that findings with a p-value between 0.005 and 0.05 should still be published, but not called “significant” anymore.This proposal was criticized and rejected in a response by Lakens et al. (2018), who argued that instead of lowering the traditional alpha threshold to .005, we should stop using the term “statistically significant”, and require researchers to determine and justify their alpha levels before they collect data.In this contribution, I argue that the arguments presented by Lakens et al. against the proposal by Benjamin et al (2017) are not convincing. Thus, given that it is highly unlikely that our field will abandon the NHST paradigm any time soon, lowering our alpha level to .005 is at this moment the best way to combat the replication crisis in psychology.

Download Full-text