The frequent insignificance of a “significant” P-value

Author(s):  
David McGiffin ◽  
Geoff Cumming ◽  
Paul Myles

Null hypothesis significance testing (NHST) and p-values are widespread in the cardiac surgical literature but are frequently misunderstood and misused. The purpose of this review is to discuss the major disadvantages of p-values and suggest alternatives. We describe diagnostic tests, the prosecutor’s fallacy in the courtroom, and NHST, which involve inter-related conditional probabilities, to help clarify the meaning of p-values, and discuss the enormous sampling variability, or unreliability, of p-values. Finally, we use a cardiac surgical database and simulations to explore further issues involving p-values. In clinical studies, p-values provide a poor summary of the observed treatment effect, whereas the three-number summary provided by effect estimates and confidence intervals is more informative and minimises over-interpretation of a “significant” result. P-values are an unreliable measure of strength of evidence; if used at all, they give at best a very rough guide to decision making. Researchers should adopt Open Science practices to improve the trustworthiness of research and, where possible, use estimation (three-number summaries) or other better techniques.
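The sampling variability of p-values mentioned in this abstract can be illustrated with a small simulation. This is only a sketch, not the authors' analysis: the true effect (0.5 standard deviations), the group size (32), the known-variance z-test, and all function names are assumptions chosen for illustration.

```python
import math
import random
import statistics


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(true_effect: float, n_per_group: int,
                      n_studies: int, seed: int = 1) -> list:
    """Repeat the same two-group study many times and collect the p-values.

    Assumption: unit-variance normal outcomes and a z-test with known sigma = 1.
    """
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of the mean difference
    p_values = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        p_values.append(two_sided_p_from_z(diff / se))
    return p_values


# Identical design, identical true effect, 1000 replications.
p_values = simulate_p_values(true_effect=0.5, n_per_group=32, n_studies=1000)
print("smallest p:", min(p_values))
print("median p:  ", statistics.median(p_values))
print("largest p: ", max(p_values))
print("share with p < 0.05:", sum(p < 0.05 for p in p_values) / len(p_values))
```

Under these assumptions the expected z statistic is 2.0, so roughly half of the replications fall below 0.05 while the rest do not, even though every simulated study samples from the same populations.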

2017 ◽  
Author(s):  
Jose D. Perezgonzalez

Wagenmakers et al. addressed the illogical use of p-values in 'Psychological Science under Scrutiny'. While historical criticisms mostly deal with the illogical nature of null hypothesis significance testing (NHST), Wagenmakers et al. generalize such argumentation to the p-value itself. Unfortunately, Wagenmakers et al. misinterpret the formal logic basis of tests of significance (and, by extension, of tests of acceptance). This article highlights three instances where such logical interpretation fails and provides plausible corrections and further clarification.


Econometrics ◽  
2019 ◽  
Vol 7 (2) ◽  
pp. 26 ◽  
Author(s):  
David Trafimow

There has been much debate about null hypothesis significance testing, p-values without null hypothesis significance testing, and confidence intervals. The first major section of the present article addresses some of the main reasons these procedures are problematic. The conclusion is that none of them are satisfactory. However, there is a new procedure, termed the a priori procedure (APP), that validly aids researchers in obtaining sample statistics that have acceptable probabilities of being close to their corresponding population parameters. The second major section provides a description and review of APP advances. Not only does the APP avoid the problems that plague other inferential statistical procedures, but it is easy to perform too. Although the APP can be performed in conjunction with other procedures, the present recommendation is that it be used alone.
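The abstract gives no formula, so the following is a hedged sketch of the simplest APP case as it is usually presented: one mean from a normal population, where the required sample size is n = (z_c / f)^2. Here f is the fraction of a standard deviation counted as "close" and c is the desired probability of being that close. The function name and parameter names are assumptions for illustration.

```python
import math
from statistics import NormalDist


def app_sample_size(f: float, c: float) -> int:
    """Smallest n such that the sample mean lies within f standard deviations
    of the population mean with probability at least c.

    Assumption: normal population, single mean, n = (z_c / f) ** 2.
    """
    if f <= 0 or not 0 < c < 1:
        raise ValueError("need f > 0 and 0 < c < 1")
    z_c = NormalDist().inv_cdf((1 + c) / 2)  # two-sided critical value
    return math.ceil((z_c / f) ** 2)


# Within 0.1 SD of the population mean with 95% probability.
print(app_sample_size(f=0.1, c=0.95))
# A looser precision requirement needs far fewer participants.
print(app_sample_size(f=0.2, c=0.95))
```

The calculation is done before any data are collected, which is what distinguishes it from procedures that evaluate a result after the fact.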


2021 ◽  
Author(s):  
Валерій Боснюк

For many years, psychological research has relied on null hypothesis significance testing (NHST), applied through specific statistical tests, to confirm study results. In most cases the p-value is treated as equivalent to the importance of the results and to the strength of the scientific evidence for the practical and theoretical effect of the study. Such incorrect use and interpretation of the p-value casts doubt on the use of statistics in general and threatens the development of psychology as a science. Equating statistical inference with scientific inference, focusing exclusively on novelty in research, ritual adherence to the 0.05 significance level, and reliance on categorical yes/no statistical decisions mean that psychology accumulates only findings about the presence of an effect, without regard to its magnitude or practical value. This paper analyzes the limitations of the p-value in interpreting the results of psychological research and the advantages of reporting effect sizes. Using effect sizes makes it possible to move from dichotomous thinking to estimation thinking, to judge the value of results independently of the level of statistical significance, and to make decisions more rationally and on firmer grounds. The paper argues that an author formulating the conclusions of a study should not be limited to the single indicator of statistical significance. Meaningful conclusions should rest on a reasonable balance between the p-value and other, equally important parameters, one of which is effect size. An effect (a difference, relationship, or association) can be statistically significant while its practical (clinical) value is negligible or trivial. “Statistically significant” does not mean “useful”, “important”, “valuable”, or “substantial”.
Psychologists should therefore be required to consider the observed effect size when interpreting research results.
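A small calculation makes the gap between statistical and practical significance concrete. This is a sketch under stated assumptions: two equal-sized groups, a common standard deviation, and a large-sample z approximation for the p-value. The numbers and the function name are hypothetical, not taken from the paper.

```python
import math


def two_group_summary(mean_diff: float, sd: float, n_per_group: int):
    """Return (Cohen's d, two-sided p-value) from summary statistics.

    Assumption: equal group sizes, common SD, large-sample z approximation.
    """
    d = mean_diff / sd                  # standardized effect size
    z = d * math.sqrt(n_per_group / 2)  # test statistic for the difference
    p = math.erfc(abs(z) / math.sqrt(2))
    return d, p


# A 0.2-point difference on a scale with SD = 10, with 100,000 people per group.
d, p = two_group_summary(mean_diff=0.2, sd=10.0, n_per_group=100_000)
print(f"Cohen's d = {d:.3f}, p = {p:.2e}")
```

With these inputs the p-value is far below 0.05 while d is only 0.02, an effect that would usually be considered trivial in practice.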


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S773-S773
Author(s):  
Christopher Brydges ◽  
Allison A Bielak

Abstract Objective: Non-significant p values derived from null hypothesis significance testing do not distinguish between true null effects and cases where the data are insensitive in distinguishing the hypotheses. This study aimed to investigate the prevalence of Bayesian analyses in gerontological psychology, a statistical technique that can distinguish between conclusive and inconclusive non-significant results, by using Bayes factors (BFs) to reanalyze non-significant results from published gerontological research. Method: Non-significant results mentioned in abstracts of articles published in 2017 volumes of ten top gerontological psychology journals were extracted (N = 409) and categorized based on whether Bayesian analyses were conducted. BFs were calculated from non-significant t-tests within this sample to determine how frequently the null hypothesis was strongly supported. Results: Non-significant results were directly tested with Bayes factors in 1.22% of studies. Bayesian reanalyses of 195 non-significant t-tests found that only 7.69% of the findings provided strong evidence in support of the null hypothesis. Conclusions: Bayesian analyses are rarely used in gerontological research, and a large proportion of null findings were deemed inconclusive when reanalyzed with BFs. Researchers are encouraged to use BFs to test the validity of non-significant results, and ensure that sufficient sample sizes are used so that the meaningfulness of null findings can be evaluated.
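The idea of converting a reported t statistic into a Bayes factor can be sketched as follows. The abstract does not say which Bayes factor the authors computed, so this example swaps in the BIC approximation for a one-sample (or paired) t-test, BF01 ≈ sqrt(n) · (1 + t²/(n − 1))^(−n/2). That choice, the function name, and the cutoff of 10 for "strong" evidence are assumptions.

```python
import math


def bf01_bic(t: float, n: int) -> float:
    """Approximate Bayes factor in favour of the null (BF01).

    Assumption: BIC approximation for a one-sample or paired t-test with
    n observations; this is not necessarily the method used in the study.
    """
    if n < 2:
        raise ValueError("need at least two observations")
    df = n - 1
    return math.sqrt(n) * (1 + t * t / df) ** (-n / 2)


def describe(bf01: float, strong: float = 10.0) -> str:
    """Label the evidence using a conventional (assumed) cutoff of 10."""
    if bf01 >= strong:
        return "strong evidence for the null"
    if bf01 <= 1 / strong:
        return "strong evidence against the null"
    return "inconclusive or only moderate evidence"


# A non-significant result: t = 1.0 with n = 50.
bf = bf01_bic(t=1.0, n=50)
print(round(bf, 2), "->", describe(bf))
```

A non-significant t-test can therefore favour the null without reaching the "strong" threshold, which is the kind of inconclusive result the abstract describes.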


2009 ◽  
Vol 217 (1) ◽  
pp. 15-26 ◽  
Author(s):  
Geoff Cumming ◽  
Fiona Fidler

Most questions across science call for quantitative answers, ideally, a single best estimate plus information about the precision of that estimate. A confidence interval (CI) expresses both efficiently. Early experimental psychologists sought quantitative answers, but for the last half century psychology has been dominated by the nonquantitative, dichotomous thinking of null hypothesis significance testing (NHST). The authors argue that psychology should rejoin mainstream science by asking better questions – those that demand quantitative answers – and using CIs to answer them. They explain CIs and a range of ways to think about them and use them to interpret data, especially by considering CIs as prediction intervals, which provide information about replication. They explain how to calculate CIs on means, proportions, correlations, and standardized effect sizes, and illustrate symmetric and asymmetric CIs. They also argue that information provided by CIs is more useful than that provided by p values, or by values of Killeen’s p_rep, the probability of replication.
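As one illustration of an asymmetric CI, the interval for a Pearson correlation can be computed with the Fisher z transformation. This is a standard textbook method offered as a sketch; the abstract does not specify the authors' own calculations, and the function name and example values are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_ci(r: float, n: int, level: float = 0.95):
    """Approximate CI for a Pearson correlation via the Fisher z transformation.

    Assumption: bivariate normal data and n > 3.
    """
    if not -1 < r < 1 or n <= 3:
        raise ValueError("need -1 < r < 1 and n > 3")
    z = math.atanh(r)            # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    crit = NormalDist().inv_cdf((1 + level) / 2)
    # Back-transform the symmetric z interval to the r scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


lower, upper = correlation_ci(r=0.6, n=30)
print(f"r = 0.60, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The interval is symmetric on the z scale but not on the r scale: the lower arm is longer than the upper arm because r is bounded at 1.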


Author(s):  
Tamás Ferenci ◽  
Levente Kovács

Null hypothesis significance testing dominates current biostatistical practice. However, this routine has many flaws; in particular, p-values are very often misused and misinterpreted. Several solutions have been suggested to remedy this situation, the application of Bayes Factors being perhaps the most well-known. Nevertheless, even Bayes Factors are very seldom applied in medical research. This paper investigates the application of Bayes Factors in the analysis of a realistic medical problem using actual data from a representative US survey, and compares the results to those obtained with traditional means. Linear regression is used as an example as it is one of the most basic tools in biostatistics. The effect of sample size and sampling variation is investigated (with resampling) as well as the impact of the choice of prior. Results show that there is a strong relationship between p-values and Bayes Factors, especially for large samples. The application of Bayes Factors should be encouraged even in spite of this, as the message they convey is much more instructive and scientifically correct than the current typical practice.


F1000Research ◽  
2017 ◽  
Vol 4 ◽  
pp. 621
Author(s):  
Cyril Pernet

Although thoroughly criticized, null hypothesis significance testing (NHST) remains the statistical method of choice used to provide evidence for an effect in the biological, biomedical and social sciences. In this short guide, I first summarize the concepts behind the method, distinguishing tests of significance (Fisher) from tests of acceptance (Neyman-Pearson), and point to common interpretation errors regarding the p-value. I then present the related concepts of confidence intervals and again point to common interpretation errors. Finally, I discuss what should be reported in which context. The goal is to clarify concepts to avoid interpretation errors and propose simple reporting practices.


2019 ◽  
Author(s):  
Christopher Brydges

Objective: Non-significant p values derived from null hypothesis significance testing do not distinguish between true null effects and cases where the data are insensitive in distinguishing the hypotheses. This study aimed to investigate the prevalence of Bayesian analyses in gerontological psychology, a statistical technique that can distinguish between conclusive and inconclusive non-significant results, by using Bayes factors (BFs) to reanalyze non-significant results from published gerontological research. Method: Non-significant results mentioned in abstracts of articles published in 2017 volumes of ten top gerontological psychology journals were extracted (N = 409) and categorized based on whether Bayesian analyses were conducted. BFs were calculated from non-significant t-tests within this sample to determine how frequently the null hypothesis was strongly supported. Results: Non-significant results were directly tested with Bayes factors in 1.22% of studies. Bayesian reanalyses of 195 non-significant t-tests found that only 7.69% of the findings provided strong evidence in support of the null hypothesis. Conclusions: Bayesian analyses are rarely used in gerontological research, and a large proportion of null findings were deemed inconclusive when reanalyzed with BFs. Researchers are encouraged to use BFs to test the validity of non-significant results, and ensure that sufficient sample sizes are used so that the meaningfulness of null findings can be evaluated.

