Why not to (over)emphasize statistical significance

2019 ◽  
Vol 181 (3) ◽  
pp. E1-E2 ◽  
Author(s):  
Olaf M Dekkers

P values should not merely be used to categorize results into significant and non-significant. This practice disregards clinical relevance, conflates non-significance with no effect, and underestimates the likelihood of false-positive results. Rather than using the P value as a dichotomizing instrument, P values and the confidence intervals around effect estimates can be used to put research findings in context, genuinely taking both clinical relevance and uncertainty into account.
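The contrast the abstract draws between a bare significant/non-significant verdict and an estimate-plus-interval report can be sketched as follows. All numbers (the effect, standard error, and the minimal clinically important difference) are hypothetical:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def summarize(effect, se, mcid, z=1.96):
    """Report the estimate, its 95% CI, and the p value instead of a bare
    significant/non-significant verdict. mcid is a hypothetical minimal
    clinically important difference."""
    lo, hi = effect - z * se, effect + z * se
    p = 2.0 * (1.0 - normal_cdf(abs(effect) / se))
    clinically_relevant = lo > mcid  # whole CI exceeds the relevance threshold
    return lo, hi, p, clinically_relevant

# Hypothetical trial: 3 mmHg blood-pressure reduction, SE 1.2, MCID 2 mmHg
lo, hi, p, relevant = summarize(3.0, 1.2, 2.0)
```

Here the result is statistically significant (p ≈ 0.012), yet the interval (≈ 0.6 to 5.4 mmHg) still includes differences below the 2 mmHg relevance threshold, so clinical relevance remains uncertain even though the dichotomized verdict would read "significant".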

2017 ◽  
Author(s):  
Aidan Coville ◽  
Eva Vivalt

Under-powered studies combined with low prior beliefs about intervention effects increase the chances that a positive result is overstated. We collect prior beliefs about intervention impacts from 125 experts to estimate the false positive and false negative report probabilities (FPRP and FNRP) as well as Type S (sign) and Type M (magnitude) errors for studies in development economics. We find that the large majority of studies in our sample are generally credible. We discuss how more systematic collection and use of prior expectations could help improve the literature.
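The FPRP and FNRP that the authors estimate from expert priors can be sketched with the standard formulas; the α, power, and prior π values below are hypothetical, not taken from the study:

```python
def fprp(alpha, power, prior):
    """False positive report probability:
    P(no true effect | significant result)."""
    return alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)

def fnrp(alpha, power, prior):
    """False negative report probability:
    P(true effect | non-significant result)."""
    return (1 - power) * prior / ((1 - power) * prior + (1 - alpha) * (1 - prior))

# Hypothetical 10% prior that an intervention works, 80% power, alpha = 0.05
risk = fprp(0.05, 0.80, 0.10)  # 0.36: over a third of "positives" are false
```

The sketch shows why low priors matter: with π = 0.10, over a third of significant results are expected to be false positives, while a more optimistic prior shrinks that risk considerably.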


Author(s):  
Crystal L. Matt ◽  
Nicola Di Girolamo ◽  
Ruth M. Hallman ◽  
Keith L. Bailey ◽  
Timothy J. O’Connell ◽  
...  

Abstract OBJECTIVE To determine the prevalence of pectoral girdle fractures in wild passerines found dead following presumed window collision and evaluate the diagnostic accuracy of various radiographic views for diagnosis of pectoral girdle fractures. SAMPLE Cadavers of 103 wild passerines that presumptively died as a result of window collisions. PROCEDURES Seven radiographic projections (ventrodorsal, dorsoventral, lateral, and 4 oblique views) were obtained for each cadaver. A necropsy was then performed, and each bone of the pectoral girdle (coracoid, clavicle, and scapula) was evaluated for fractures. Radiographs were evaluated in a randomized order by a blinded observer, and results were compared with results of necropsy. RESULTS Fifty-six of the 103 (54%) cadavers had ≥ 1 pectoral girdle fracture. Overall accuracy of using individual radiographic projections to diagnose pectoral girdle fractures ranged from 63.1% to 72.8%, sensitivity ranged from 21.3% to 51.1%, and specificity ranged from 85.7% to 100.0%. The sensitivity of using various combinations of radiographic projections to diagnose pectoral girdle fractures ranged from 51.1% to 66.0%; specificity ranged from 76.8% to 96.4%. CLINICAL RELEVANCE Radiography alone appeared to have limited accuracy for diagnosing fractures of the bones of the pectoral girdle in wild passerines after collision with a window. Both individual radiographic projections and combinations of projections resulted in numerous false negative but few false positive results.
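The diagnostic-accuracy measures reported above are computed from a confusion matrix against the necropsy gold standard. A minimal sketch, using hypothetical counts chosen only to fall within the ranges the abstract reports (not the study's actual data):

```python
def diagnostic_accuracy(tp, fp, tn, fn):
    """Sensitivity, specificity, and overall accuracy of a test
    against a gold standard (here, necropsy)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts: 56 fracture-positive and 47 fracture-negative cadavers,
# with radiography missing many fractures (high fn, low fp)
sens, spec, acc = diagnostic_accuracy(tp=24, fp=2, tn=45, fn=32)
```

The pattern of many false negatives but few false positives is exactly what drives low sensitivity alongside high specificity in the reported results.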


2013 ◽  
Vol 4 (4) ◽  
pp. 220-223 ◽  
Author(s):  
Eva Skovlund

Abstract Background Statistical analyses are used to help understand the practical significance of the findings in a clinical study. Many clinical researchers appear to have limited knowledge of how to perform appropriate statistical analyses and of what the results in fact mean. Methods This focal review is based on long experience in supervising clinicians on statistical analysis and advising editors of scientific journals on the quality of the statistical analyses applied in scientific reports evaluated for publication. Results Basic facts about elementary statistical analyses are presented, and common misunderstandings are elucidated. Efficacy estimates, the effect of sample size, and confidence intervals for effect estimates are reviewed, and the difference between statistical significance and clinical relevance is highlighted. The weaknesses of p-values and misunderstandings in how to interpret them are illustrated with practical examples. Conclusions and recommendations Some very important questions need to be answered before initiating a clinical trial. What is the research question? To which patients should the result be generalised? Is the number of patients sufficient to draw a valid conclusion? When data are analysed, the number of (preplanned) significance tests should be kept small and post hoc analyses should be avoided. It should also be remembered that the clinical relevance of a finding cannot be assessed by the p-value. Thus effect estimates and corresponding 95% confidence intervals should always be reported.
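The review's point about the effect of sample size can be made concrete with a short sketch: the same effect estimate crosses the significance threshold purely because n grows, while its clinical relevance is unchanged. The effect, SD, and sample sizes below are hypothetical:

```python
import math

def two_sided_p(effect, sd, n):
    """z test for a difference in means between two groups of size n:
    the same effect becomes 'significant' purely by raising n."""
    se = sd * math.sqrt(2.0 / n)  # standard error of the difference in means
    z = effect / se
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Hypothetical mean difference of 0.2 (SD 1.0) at two sample sizes
p_small = two_sided_p(0.2, 1.0, n=50)   # not significant
p_large = two_sided_p(0.2, 1.0, n=500)  # same small effect, now "significant"
```

This is why the review recommends always reporting effect estimates with 95% confidence intervals: the interval narrows with n and shows the precision directly, whereas the p-value conflates effect size with sample size.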


Author(s):  
Mariusz Maziarz ◽  
Adrian Stencel

Rationale, aims, and objectives The current strategy of searching for an effective drug to treat COVID-19 relies mainly on repurposing existing therapies developed to target other diseases. There are currently more than four thousand active studies assessing the efficacy of existing drugs as therapies for COVID-19. The number of ongoing trials and the urgent need for a treatment pose the risk that false-positive results will be incorrectly interpreted as evidence of treatments’ efficacy and as grounds for drug approval. Our purpose is to assess the risk of false-positive outcomes by analyzing the mechanistic evidence for the efficacy of exemplary candidates for repurposing, estimating the false discovery rate, and discussing solutions to the problem of excessive hypothesis testing. Methods We estimate the expected number of false-positive results and the probability of at least one false-positive result under the assumption that all tested compounds have no effect on the course of the disease. We then relax this assumption and analyze the sensitivity of the expected number of true-positive results to changes in the prior probability (π) that tested compounds are effective. Finally, we calculate the False Positive Report Probability and the expected numbers of false-positive and true-positive results for different thresholds of statistical significance, powers of studies, and ratios of effective to non-effective compounds. We also review the mechanistic evidence for the efficacy of two exemplary repurposing candidates (hydroxychloroquine and ACE2 inhibitors) and assess its quality in order to choose plausible values of the prior probability (π) that the tested compounds are effective against COVID-19.
Results Our analysis shows that, due to the excessive number of statistical tests in the field of drug repurposing for COVID-19 and the low prior probability (π) of the efficacy of tested compounds, positive results are far more likely to result from type-I error than to reflect genuine effects of pharmaceutical interventions.
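The first step of the authors' analysis, the expected number of false positives when every tested compound is assumed ineffective, is straightforward to sketch. The study count comes from the abstract; treating trials as independent is a simplifying assumption:

```python
def expected_false_positives(n_trials, alpha=0.05):
    """Under the global null (no tested compound works), each independent
    trial still has probability alpha of a false-positive result."""
    expected = n_trials * alpha
    prob_at_least_one = 1.0 - (1.0 - alpha) ** n_trials
    return expected, prob_at_least_one

# Roughly 4,000 active repurposing studies (figure from the abstract)
exp_fp, p_any = expected_false_positives(4000)  # ~200 expected false positives
```

Even under this idealized setup, about 200 "positive" trials would be expected by chance alone, and at least one false positive is essentially certain.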


Author(s):  
Ayad S Jazrawi ◽  
Amruthlal Jain ◽  
Sachin Kumar ◽  
Almuhannad Idris ◽  
Rony Gorges ◽  
...  

Objective: We sought to investigate whether coronary vessel dominance dictates false positive results in nuclear stress imaging. Background: Atherosclerotic coronary artery disease is the leading cause of morbidity and mortality in men and women in the United States. Myocardial perfusion imaging (MPI) has provided incremental diagnostic and prognostic information in the evaluation of patients with suspected or known coronary artery disease. It has a sensitivity of 87% but a specificity of 64%. Tissue attenuation secondary to the diaphragm or breast has been documented as a cause of false positive tests. It is also believed that patients with a left dominant coronary system are more prone to a false positive test; however, there is no conclusive literature supporting this. Methods: A retrospective analysis was performed of all patients who underwent coronary angiography from January 2006 to December 2008 and who had an intermediate to high probability of ischemia as assessed by MPI. The location and size of ischemia were documented. Patients were classified as either true positive (TP) or false positive (FP) based on the presence of significant coronary stenosis (>50%) on angiography. Furthermore, coronary vessel dominance (right, left, or mixed) was documented. Results: A total of 991 patients were included in the analysis; 901 patients had a TP test (91%). As expected, females had a significantly higher FP rate compared to males (16.1% vs. 4.6%, p<0.05). The population was divided into three groups by dominance: right, left, and co-dominant systems, with 855, 93, and 43 patients respectively. There was no difference in the FP rate between the right and co-dominant systems (8.1% vs. 7.5%, p=0.7). Patients with left dominant systems had a higher FP rate (12.9%); however, this did not reach statistical significance (p=0.1).
When the location of the lesion was taken into consideration, the vast majority (92%) of anterior wall FP cases were in females, regardless of system dominance. Conclusion: This preliminary study supports previous findings that females have a higher FP rate on MPI. Furthermore, patients with a left dominant system appear more prone to FP results than those with co-dominant or right dominant systems, especially among females.
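The abstract's sensitivity (87%) and specificity (64%) figures explain why false positives are common: the positive predictive value follows from Bayes' rule and drops sharply at lower disease prevalence. A minimal sketch, where the prevalence values are hypothetical:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule: with modest specificity,
    many positive tests in low-prevalence groups are false positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# MPI figures from the abstract; a 20% prevalence is assumed for illustration
low_prev = ppv(0.87, 0.64, 0.20)   # under 40% of positives are true
high_prev = ppv(0.87, 0.64, 0.50)  # PPV improves as prevalence rises
```

This is consistent with the study's observation of higher FP rates in groups (such as females with breast attenuation) where pre-test probability and image quality work against the test.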


2018 ◽  
Vol 31 (1) ◽  
pp. 119-131
Author(s):  
Mohammad Jahanzeb Khan ◽  
Per Christen Trønnes

ABSTRACT A focus on novel, confirmatory, and statistically significant results by journals that publish experimental audit research may result in substantial bias in the literature. We explore one type of bias known as p-hacking: a practice where researchers, whether knowingly or unknowingly, adjust their collection, analysis, and reporting of data and results, until nonsignificant results become significant. Examining experimental audit literature published in eight accounting and audit journals within the last three decades, we find an overabundance of p-values at or just below the conventional thresholds for statistical significance. The finding of too many “just significant” results is an indication that some of the results published in the experimental audit literature are potentially a consequence of p-hacking. We discuss potential remedies that, if adopted, may to some extent alleviate concerns regarding p-hacking and the publication of false positive results. JEL Classifications: M40.
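The "overabundance of p-values at or just below the conventional thresholds" that the authors detect is commonly tested with a caliper-style comparison of narrow bins on either side of the threshold. A minimal sketch; the p-value collection below is made up for illustration:

```python
import math

def caliper_test(p_values, threshold=0.05, width=0.005):
    """Count p values just below vs. just above the threshold. Absent
    p-hacking, the two narrow bins should hold roughly equal counts;
    the tail probability is a one-sided binomial test of that null."""
    below = sum(1 for p in p_values if threshold - width <= p < threshold)
    above = sum(1 for p in p_values if threshold <= p < threshold + width)
    n = below + above
    tail = sum(math.comb(n, k) for k in range(below, n + 1)) / 2.0 ** n
    return below, above, tail

# Hypothetical collection of reported p values with a pile-up just under 0.05
ps = [0.046, 0.047, 0.048, 0.049, 0.0495, 0.046, 0.048, 0.052, 0.031, 0.20]
below, above, tail = caliper_test(ps)
```

A lopsided count just below the threshold, as in this toy data, is exactly the signature of "just significant" results the abstract describes.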


Author(s):  
Scott B. Morris ◽  
Arash Shokri

To understand and communicate research findings, it is important for researchers to consider two types of information provided by research results: the magnitude of the effect and the degree of uncertainty in the outcome. Statistical significance tests have long served as the mainstream method for statistical inferences. However, the widespread misinterpretation and misuse of significance tests has led critics to question their usefulness in evaluating research findings and to raise concerns about the far-reaching effects of this practice on scientific progress. An alternative approach involves reporting and interpreting measures of effect size along with confidence intervals. An effect size is an indicator of magnitude and direction of a statistical observation. Effect size statistics have been developed to represent a wide range of research questions, including indicators of the mean difference between groups, the relative odds of an event, or the degree of correlation among variables. Effect sizes play a key role in evaluating practical significance, conducting power analysis, and conducting meta-analysis. While effect sizes summarize the magnitude of an effect, the confidence intervals represent the degree of uncertainty in the result. By presenting a range of plausible alternate values that might have occurred due to sampling error, confidence intervals provide an intuitive indicator of how strongly researchers should rely on the results from a single study.
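The pairing of an effect size with a confidence interval that the passage advocates can be sketched for the most common case, a standardized mean difference. The standard-error formula is the common large-sample approximation, and the group statistics are hypothetical:

```python
import math

def cohens_d_ci(mean1, mean2, sd_pooled, n1, n2, z=1.96):
    """Cohen's d with an approximate 95% CI, using the common
    large-sample standard-error approximation for d."""
    d = (mean1 - mean2) / sd_pooled
    se = math.sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Hypothetical two-group study (all numbers made up)
d, (lo, hi) = cohens_d_ci(10.5, 9.5, 2.0, 40, 40)
```

The point estimate (d = 0.5, a medium effect) conveys magnitude and direction, while the wide interval (roughly 0.05 to 0.95) conveys how much sampling error alone could move the result, the two kinds of information the passage argues should be reported together.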


2015 ◽  
Vol 112 (45) ◽  
pp. 13800-13804 ◽  
Author(s):  
Anthony Fowler ◽  
B. Pablo Montagnes

A recent, widely cited study [Healy AJ, Malhotra N, Mo CH (2010) Proc Natl Acad Sci USA 107(29):12804–12809] finds that college football games influence voting behavior. Victories within 2 weeks of an election reportedly increase the success of the incumbent party in presidential, senatorial, and gubernatorial elections in the home county of the team. We reassess the evidence and conclude that there is likely no such effect, despite the fact that Healy et al. followed the best practices in social science and used a credible research design. Multiple independent sources of evidence suggest that the original finding was spurious—reflecting bad luck for researchers rather than a shortcoming of American voters. We fail to estimate the same effect when we leverage situations where multiple elections with differing incumbent parties occur in the same county and year. We also find that the purported effect of college football games is stronger in counties where people are less interested in college football, just as strong when the incumbent candidate does not run for reelection, and just as strong in other parts of the state outside the home county of the team. Lastly, we detect no effect of National Football League games on elections, despite their greater popularity. We conclude with recommendations for evaluating surprising research findings and avoiding similar false-positive results.


1998 ◽  
Vol 36 (4) ◽  
pp. 1157-1159 ◽  
Author(s):  
John Merlino ◽  
Evanthia Tambosis ◽  
Duncan Veal

This report describes a new, modified, simple, and cost-effective method for the use of CHROMagar Candida (CHROMagar Company, Paris, France) for the presumptive identification of isolates as Candida albicans after preliminary growth. Sixty randomly selected clinical isolates were evaluated, including 38 of C. albicans. With incubation at 37°C for 24 h, the sensitivity and specificity appeared to be excellent and the test performed better than the traditional germ tube test. However, at earlier times, C. tropicalis isolates gave false-positive results.

