Estimating Effect Size Under Publication Bias: Small Sample Properties and Robustness of a Random Effects Selection Model

1996 ◽  
Vol 21 (4) ◽  
pp. 299-332 ◽  
Author(s):  
Larry V. Hedges ◽  
Jack L. Vevea

When there is publication bias, studies yielding large p values, and hence small effect estimates, are less likely to be published, which leads to biased estimates of effects in meta-analysis. We investigate a selection model based on one-tailed p values in the context of a random effects model. The procedure both models the selection process and corrects for the consequences of selection on estimates of the mean and variance of effect parameters. A test of the statistical significance of selection is also provided. The small sample properties of the method are evaluated by means of simulations, and the asymptotic theory is found to be reasonably accurate under correct model specification and plausible conditions. The method substantially reduces bias due to selection when model specification is correct, but the variance of estimates is increased; thus mean squared error is reduced only when selection produces substantial bias. The robustness of the method to violations of assumptions about the form of the distribution of the random effects is also investigated via simulation, and the model-corrected estimates of the mean effect are generally found to be much less biased than the uncorrected estimates. The significance test for selection bias, however, is found to be highly nonrobust, rejecting at up to 10 times the nominal rate when there is no selection but the distribution of the effects is incorrectly specified.
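
The selection mechanism studied here (one-tailed p-values driving the probability of publication, layered on a random-effects model) is easy to mimic in a small simulation. The sketch below is not the authors' estimator; it only illustrates the bias their selection model is meant to correct. The true mean, between-study SD, per-arm sample size, and the step-function publication probabilities are all assumptions chosen for the example.

```python
# Minimal sketch: one-tailed p-value selection biasing a naive random-effects mean.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, tau = 0.2, 0.2                 # true mean effect and between-study SD (assumed)
n_studies, n_per_arm = 5000, 20    # number of simulated studies and per-arm n (assumed)

theta = rng.normal(mu, tau, n_studies)        # study-specific true effects
se = np.sqrt(2.0 / n_per_arm)                 # rough SE of a standardized mean difference
d = rng.normal(theta, se)                     # observed effect estimates
p_one_tailed = 1.0 - norm.cdf(d / se)         # one-tailed p-values against effect = 0

# Step-function selection: significant studies always published, others rarely (assumed rule)
publish_prob = np.where(p_one_tailed < 0.05, 1.0, 0.2)
published = rng.random(n_studies) < publish_prob

print("true mean of effects:             ", mu)
print("naive mean over all studies:      ", round(float(d.mean()), 3))
print("naive mean over published studies:", round(float(d[published].mean()), 3))
```

Under these settings the published-only mean should noticeably overshoot the true mean of 0.2; removing that overshoot, while acknowledging the extra variance it costs, is exactly the trade-off the abstract describes.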

Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Statistical significance itself (p≤0.05) is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis; they cannot be interpreted as supporting the null hypothesis, and hence cannot justify concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease', or 'we need to get rid of p-values'.
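
The "one in six" and "one third" figures quoted above follow from simple probability arithmetic, under one plausible reading: two independent studies of a true effect are both significant with probability power squared, and exactly one of them is significant with probability 2 x power x (1 - power). The snippet below just checks that arithmetic; the power values are the ones named in the abstract.

```python
# Arithmetic behind the replication figures quoted in the abstract.
power_realistic = 0.40   # "realistic" power from the abstract
power_good = 0.80        # "good" power from the abstract

both_significant = power_realistic ** 2            # 0.16, i.e. about 1 in 6
conflicting = 2 * power_good * (1 - power_good)    # 0.32, i.e. about 1 in 3

print(round(both_significant, 2), round(conflicting, 2))
```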


2019 ◽  
Vol 32 (Supplement_1) ◽  
Author(s):  
S Birro ◽  
S Kelly ◽  
T Omari ◽  
U Krishnan

Abstract Background The aim of this study was to determine the effect of domperidone on gastric function in the EA-TEF (esophageal atresia/tracheoesophageal fistula) cohort. Methods Five participants with previously demonstrated abnormal gastric myoelectrical activity and/or delayed gastric emptying on electrogastrography (EGG) and 13C-octanoic acid breath test (OBT), respectively, were recruited. These participants were treated with domperidone (0.2 mg/kg/dose twice a day) for a minimum of 2 weeks, and the EGG and OBT investigations were repeated along with a validated PedsQL gastrointestinal symptom questionnaire. Baseline and follow-up ECGs were performed to check for potential QT interval prolongation. Results Mean gastric emptying half-time was 135.4 minutes off therapy and 277.01 minutes on therapy (p = NS), while the mean gastric emptying coefficients were 3.34 and 3.25, respectively (p = NS). All five participants’ gastric myoelectrical activity on EGG remained abnormal. Although the mean percentage of time that gastric slow waves spent in the normal frequency range decreased by 1.65%, the postprandial-to-resting power ratio increased by 8.452 (p = NS). Both parent- and child-reported overall PedsQL scores increased, as did the child-reported PedsQL scores for symptoms related to gastric function (p = NS). Conclusions Domperidone in standard doses did not result in a significant change in gastric emptying in EA-TEF children with delayed gastric emptying. This may be due to abnormalities in gastric innervation in children with EA-TEF. There was, however, an improvement in the power ratio on EGG. There was also an improvement in PedsQL scores. The lack of statistical significance may be due to our small sample size.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4995 ◽  
Author(s):  
Chase Meyer ◽  
Kaleb Fuller ◽  
Jared Scott ◽  
Matt Vassar

Background Publication bias is the tendency of investigators, reviewers, and editors to submit or accept manuscripts for publication based on the direction or strength of their findings. In this study, we investigated whether publication bias was present in gastroenterological research by evaluating abstracts presented at the Americas Hepato-Pancreato-Biliary Congresses from 2011 to 2013. Methods We searched Google, Google Scholar, and PubMed to locate the published reports of research described in these abstracts. If a publication was not found, a second investigator searched to verify nonpublication. If an abstract's publication status remained undetermined, authors were contacted regarding reasons for nonpublication. For articles reaching publication, the P value, study design, time to publication, citation count, and journals in which the published report appeared were recorded. Results Our study found that of 569 abstracts presented, 297 (52.2%) reported a P value. Of these, 254 (85.5%) contained P values supporting statistical significance. Abstracts reporting a statistically significant outcome were twice as likely to reach publication as abstracts with no significant findings (OR 2.10, 95% CI [1.06–4.14]). Overall, 243 (42.7%) abstracts reached publication. The mean time to publication was 14 months, and the median was nine months. Conclusion We found evidence of publication bias in gastroenterological research. Abstracts with significant P values had a higher probability of reaching publication. More than half of the abstracts presented from 2011 to 2013 failed to reach publication. Readers should take these findings into consideration when reviewing the medical literature.
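
The headline result is an odds ratio with a Wald-type confidence interval for publication by significance status. The sketch below shows how such a figure is computed from a 2x2 table; the cell counts are hypothetical placeholders, not the study's data, since the abstract does not report the full cross-tabulation.

```python
# Hypothetical 2x2 table (counts are NOT from the study) showing how an odds ratio
# and Wald 95% CI for "published vs. unpublished" by significance status are computed.
import math

# rows: significant / nonsignificant abstracts; columns: published, unpublished
a, b = 140, 114   # significant abstracts: published, unpublished (hypothetical)
c, d = 16, 27     # nonsignificant abstracts: published, unpublished (hypothetical)

or_hat = (a * d) / (b * c)                          # cross-product odds ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)        # SE of log(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)  # lower 95% limit
hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)  # upper 95% limit
print(f"OR = {or_hat:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```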


2021 ◽  
pp. 695-700
Author(s):  
Ruoyong Xu ◽  
Patrick Brown ◽  
Nancy Baxter ◽  
Anna M. Sawka

PURPOSE Health care priorities of individuals may change during a pandemic, which may, in turn, affect health services utilization. We compared Canadians' online relative search interest in five common solid tumors (breast, colon, lung, prostate, and thyroid) during the COVID-19 pandemic with that observed in the same months of the prior 5 years. METHODS We conducted a cross-sectional retrospective study using Google Trends aggregate anonymous online search data from Canada. We compared the respective relative search volumes for breast, colon, lung, prostate, and thyroid cancers for the months March-November 2020 with the mean for the same months in 2015-2019. Welch's two-sample t tests were performed, and the raw P values were then adjusted using the Benjamini-Hochberg procedure to correct for multiple comparisons. The level of statistical significance was defined by a false discovery rate of .05 for the primary analysis. RESULTS We observed temporary statistically significant reductions in Canadians' relative search volumes for various cancers, largely early in the pandemic, in the spring of 2020. Specifically, significant reductions (after adjustment for multiple comparisons) were observed for breast cancer in April, May, and October 2020; colon cancer in March and April of 2020; lung cancer in April and September 2020; and prostate cancer in April and May 2020. Thyroid cancer relative search volumes were not significantly different from those observed prior to the pandemic. CONCLUSION Although Canadians' online interest in various cancers temporarily waned early in the COVID-19 pandemic, recent relative search volumes for various cancers are largely not significantly different from those prior to the pandemic.
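
The analysis pipeline named here (Welch's two-sample t tests followed by Benjamini-Hochberg adjustment at FDR .05) can be sketched in a few lines. The numbers and the way the monthly series are grouped into two samples below are guesses for illustration only, not the study's Google Trends data or exact design.

```python
# Sketch of Welch t tests + Benjamini-Hochberg FDR adjustment on made-up search-volume data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
cancers = ["breast", "colon", "lung", "prostate", "thyroid"]

raw_p = []
for cancer in cancers:
    rsv_prior = rng.normal(70, 8, size=5)   # same-month relative search volumes, 2015-2019 (fake)
    rsv_2020 = rng.normal(60, 8, size=5)    # pandemic-period relative search volumes (fake)
    stat, p = ttest_ind(rsv_prior, rsv_2020, equal_var=False)  # Welch's two-sample t test
    raw_p.append(p)

# Benjamini-Hochberg step-up procedure at FDR = .05
p = np.asarray(raw_p)
m = len(p)
order = np.argsort(p)
thresholds = 0.05 * np.arange(1, m + 1) / m        # i/m * alpha for the i-th smallest p
passed = p[order] <= thresholds
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
rejected = np.zeros(m, dtype=bool)
rejected[order[:k]] = True                          # reject the k smallest p-values
print(dict(zip(cancers, rejected)))
```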


2017 ◽  
Vol 16 (3) ◽  
pp. 1
Author(s):  
Laura Badenes-Ribera ◽  
Dolores Frias-Navarro

Abstract "Evidence-Based Practice" requires professionals to critically assess the results of psychological research. However, incorrect interpretations of p values are abundant and recurrent. These misconceptions affect professional decisions and compromise the quality of interventions and the accumulation of valid scientific knowledge. Identifying the types of fallacies that underlie statistical decisions is fundamental for planning statistical education strategies designed to correct these misinterpretations. Therefore, the aim of this study is to analyze the interpretation of the p value among psychology undergraduates and academic psychologists. The sample was composed of 161 participants (43 academics and 118 students). The mean number of years as an academic was 16.7 (SD = 10.07). The mean age of the students was 21.59 years (SD = 1.3). The findings suggest that students and academics do not know the correct interpretation of p values. The inverse probability fallacy presents the greatest comprehension problems. In addition, statistical significance is confused with practical or clinical significance. These results highlight the need for statistical education and statistical re-education.
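
The inverse probability fallacy highlighted above is the habit of reading a p-value as the probability that the null hypothesis is true. One standard way to see the gap is a small Bayes calculation: the posterior probability of the null given a significant result depends on the prior and on power, and is generally nowhere near .05. The prior, power, and alpha below are assumptions chosen only for the example.

```python
# Illustration of the inverse-probability fallacy: p-values are P(data | H0),
# not P(H0 | data). All inputs here are assumed values for the example.
prior_h0 = 0.8   # prior probability that the null hypothesis is true (assumed)
alpha = 0.05     # significance threshold: P(significant | H0 true)
power = 0.80     # P(significant | H1 true) (assumed)

p_sig = alpha * prior_h0 + power * (1 - prior_h0)   # total probability of a significant result
p_h0_given_sig = alpha * prior_h0 / p_sig           # posterior probability of H0 given significance
print(round(p_h0_given_sig, 3))                     # 0.2 here: far above the 0.05 a naive reading suggests
```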


2017 ◽  
Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Statistical significance itself (p≤0.05) is also hardly replicable: at a good statistical power of 80%, two studies will be 'conflicting', meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication therefore cannot be interpreted as having failed merely because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis; they cannot be interpreted as supporting the null hypothesis, and hence cannot justify concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should instead be made more stringent, that sample sizes could decrease, or that p-values should be abandoned entirely. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
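
One specific claim in this abstract, that significant effect sizes are biased upwards whenever power is less than ideal, is easy to demonstrate with a short simulation. This is not the authors' analysis; the true effect, group size (chosen so that power is roughly 20%), and number of runs are assumptions for the example.

```python
# Sketch: conditioning on p < .05 inflates the average estimated effect at low power.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
true_effect, n, sims = 0.3, 30, 5000   # standardized effect, per-group n, runs (all assumed)
all_effects, sig_effects = [], []

for _ in range(sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_effect, 1.0, n)
    stat, p = ttest_ind(treatment, control)
    est = treatment.mean() - control.mean()   # effect estimate (SD = 1, so roughly standardized)
    all_effects.append(est)
    if p < 0.05:
        sig_effects.append(est)

print("true effect:", true_effect)
print("mean estimate, all runs:             ", round(float(np.mean(all_effects)), 3))
print("mean estimate, significant runs only:", round(float(np.mean(sig_effects)), 3))
```

The significant-only average should come out well above the true effect of 0.3, which is the inflation the abstract warns about when only significant results are reported and read.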

