Too True to be Bad: When Sets of Studies with Significant and Non-Significant Findings Are Probably True

2017 ◽  
Vol 8 (8) ◽  
pp. 875-881 ◽  
Author(s):  
Daniël Lakens ◽  
Alexander J. Etz

Psychology journals rarely publish nonsignificant results. At the same time, it is often very unlikely (or “too good to be true”) that a set of studies yields exclusively significant results. Here, we use likelihood ratios to explain when sets of studies that contain a mix of significant and nonsignificant results are likely to be true or “too true to be bad.” As we show, mixed results are not only likely to be observed in lines of research but also, when observed, often provide evidence for the alternative hypothesis, given reasonable levels of statistical power and an adequately controlled low Type 1 error rate. Researchers should feel comfortable submitting such lines of research with an internal meta-analysis for publication. A better understanding of probabilities, accompanied by more realistic expectations of what real sets of studies look like, might be an important step in mitigating publication bias in the scientific literature.
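
The likelihood-ratio reasoning described above can be sketched in a few lines. The following is a minimal illustration of the general idea, not the authors' own analysis: the number of significant results in a set of studies is modeled as a binomial outcome, and its likelihood under the alternative hypothesis (each study significant with probability equal to the assumed power) is compared with its likelihood under the null (each study significant with probability equal to the Type 1 error rate). The power and error-rate values are hypothetical.

```python
from math import comb

def mixed_results_likelihood_ratio(n_studies, n_significant, power, alpha=0.05):
    """Likelihood ratio for observing n_significant out of n_studies
    significant results under H1 (per-study significance probability = power)
    versus H0 (per-study significance probability = alpha)."""
    k, n = n_significant, n_studies
    # Binomial likelihood of the observed mix under each hypothesis;
    # the binomial coefficient cancels in the ratio but is kept for clarity.
    likelihood_h1 = comb(n, k) * power**k * (1 - power)**(n - k)
    likelihood_h0 = comb(n, k) * alpha**k * (1 - alpha)**(n - k)
    return likelihood_h1 / likelihood_h0

# Two significant results out of three studies, assuming 80% power:
# the ratio is far above 1, so this mixed set favours H1 over H0.
print(mixed_results_likelihood_ratio(3, 2, power=0.80))
```

Under these assumptions the mixed set of results is roughly 50 times more likely under the alternative than under the null, which is the sense in which mixed results can be "too true to be bad."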


1986 ◽  
Vol 20 (2) ◽  
pp. 189-200 ◽  
Author(s):  
Kevin D. Bird ◽  
Wayne Hall

Statistical power is neglected in much psychiatric research, with the consequence that many studies do not provide a reasonable chance of detecting differences between groups if they exist in the population. This paper attempts to improve current practice by providing an introduction to the essential quantities required for performing a power analysis (sample size, effect size, type 1 and type 2 error rates). We provide simplified tables for estimating the sample size required to detect a specified size of effect with a type 1 error rate of α and a type 2 error rate of β, and for estimating the power provided by a given sample size for detecting a specified size of effect with a type 1 error rate of α. We show how to modify these tables to perform power analyses for multiple comparisons in univariate and some multivariate designs. Power analyses for each of these types of design are illustrated by examples.
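
As an illustration of how these four quantities interact, the sketch below computes the sample size per group for a two-group comparison of means using the common normal-approximation formula, rather than the paper's tables; the effect size, type 1 error rate, and power in the example are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means: n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

# Detecting a medium effect (d = 0.5) with alpha = 0.05 and 80% power
# requires roughly 63-64 subjects per group under this approximation.
print(n_per_group(0.5))
```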


2021 ◽  
Author(s):  
Patrick Turley ◽  
Alicia R. Martin ◽  
Grant Goldman ◽  
Hui Li ◽  
Masahiro Kanai ◽  
...  

We present a new method, Multi-Ancestry Meta-Analysis (MAMA), which combines genome-wide association study (GWAS) summary statistics from multiple populations to produce new summary statistics for each population, identifying novel loci that would not have been discovered in either set of GWAS summary statistics alone. In simulations, MAMA increases power with less bias and a generally lower type-1 error rate than other multi-ancestry meta-analysis approaches. We apply MAMA to 23 phenotypes in East-Asian- and European-ancestry populations and find substantial gains in power. In an independent sample, novel genetic discoveries from MAMA replicate strongly.
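
MAMA itself goes well beyond simple pooling of estimates; as a minimal point of reference for what combining per-population GWAS summary statistics looks like at its simplest, the sketch below performs a plain fixed-effect inverse-variance-weighted meta-analysis for a single variant. This is not the MAMA method, and the effect estimates and standard errors are hypothetical.

```python
import math

def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance-weighted meta-analysis of per-study
    effect estimates; returns the pooled estimate, its standard error,
    and the corresponding z statistic."""
    weights = [1.0 / se**2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se, pooled_beta / pooled_se

# Hypothetical per-ancestry estimates for one variant (e.g., one East-Asian-
# and one European-ancestry GWAS).
print(inverse_variance_meta(betas=[0.05, 0.03], standard_errors=[0.02, 0.01]))
```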


2014 ◽  
Vol 56 (4) ◽  
pp. 614-630 ◽  
Author(s):  
Alexandra C. Graf ◽  
Peter Bauer ◽  
Ekkehard Glimm ◽  
Franz Koenig

Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for claiming a scientific finding leads to considerable distortion of the scientific process (American Statistical Association; Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values can tell us little about the reliability of research, because they are hardly replicable even if the alternative hypothesis is true. Significance itself (p ≤ 0.05) is also hardly replicable: at a realistic statistical power of 40%, and given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, the results of two studies will conflict, in terms of significance, in one third of cases if there is a true effect. This means that a replication cannot be interpreted as having failed merely because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions about the replicability and practical importance of a finding can only be drawn from cumulative evidence across multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that, with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. Yet current incentives to hunt for significance create publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values, too, offer some evidence against the null hypothesis; they cannot be interpreted as supporting the null hypothesis, and concluding from them that 'there is no effect' is a mistake. Information on the possible true effect sizes that are compatible with the data must be obtained from the observed effect size (e.g., a sample average) together with a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease', or 'we need to get rid of p-values'.
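
The replication figures quoted in this abstract follow from elementary probability arithmetic, assuming two independent studies of a true effect; a minimal check of the "one in six" and "one third" claims:

```python
def p_both_significant(power):
    """Probability that two independent studies of a true effect are both significant."""
    return power * power

def p_conflicting(power):
    """Probability that two independent studies of a true effect disagree
    in significance (one significant, one not)."""
    return 2 * power * (1 - power)

# At 40% power, both studies are significant in only 16% of cases (~1 in 6).
print(p_both_significant(0.40))  # 0.16
# At 80% power, the two results conflict in significance about a third of the time.
print(p_conflicting(0.80))       # 0.32
```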


2016 ◽  
Vol 148 (8) ◽  
pp. 24-31 ◽  
Author(s):  
Kayode Ayinde ◽  
John Olatunde ◽  
Gbenga Sunday

1988 ◽  
Vol 8 (2) ◽  
pp. 125-128 ◽  
Author(s):  
D. N. Churchill ◽  
D. W. Taylor ◽  
S. I. Vas ◽  
J. Singer ◽  
M. L. Beecroft ◽  
...  

A double-blind randomized controlled trial compared the effectiveness of prophylactic oral trimethoprim/sulfamethoxazole (cotrimoxazole) with a placebo in preventing peritonitis in continuous ambulatory peritoneal dialysis (CAPD) patients. A daily trimethoprim/sulfamethoxazole dose of 160/800 mg gives a steady-state dialysate concentration of 1.07/4.35 mg/L in the final dwell of each dosing interval. Identifying a 40% reduction in peritonitis probability with 80% statistical power and a type 1 error probability of 0.05 required 52 subjects per group. With stratification by previous peritonitis, 56 patients were allocated to cotrimoxazole and 49 to placebo. In the cotrimoxazole group there were five deaths and seven catheter losses; in the placebo group there were three deaths and nine catheter losses. There were 20 withdrawals from the cotrimoxazole group and nine from the placebo group. With respect to time to peritonitis, there was no statistically significant difference between the cotrimoxazole and placebo groups (p = 0.19). At 6 months, 64.1% of cotrimoxazole patients and 62.5% of placebo patients were peritonitis free; at 12 months, 41.9% of cotrimoxazole patients and 35% of placebo patients were peritonitis free. There was no effect (p > 0.05) of age, sex, catheter care technique, spike versus Luer connection, or dialysate additives. Previous peritonitis increased the risk of peritonitis by a factor of 2.06 (95% CI, 1.18–3.61), and frequent (every six weeks) extension tubing changes increased the risk by a factor of 1.79 (95% CI, 1.02–3.04) compared with changes every six months. Cotrimoxazole appears ineffective in the prevention of CAPD peritonitis.
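
The sample-size statement in this abstract can be illustrated with the standard normal-approximation formula for comparing two proportions. The sketch below is illustrative only: the abstract does not report the planning assumptions, so the baseline peritonitis probability used here is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for detecting a difference between
    two proportions with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)

# Hypothetical baseline peritonitis probability of 0.65, reduced by 40%:
print(n_per_group_two_proportions(p_control=0.65, p_treatment=0.65 * 0.6))
```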


2008 ◽  
Vol 27 (3) ◽  
pp. 371-381 ◽  
Author(s):  
Steven Snapinn ◽  
Qi Jiang