Type M error can explain Weisburd's Paradox

2016 ◽  
Author(s):  
Andrew Gelman

Simple calculations seem to show that larger studies should have higher statistical power, yet empirical meta-analyses of published work in criminology have found zero or weak correlations between sample size and estimated statistical power. This is “Weisburd’s paradox,” which Weisburd, Petrosino, and Mason (1993) attributed to the difficulty of maintaining quality control as studies get larger, and which Nelson, Wooditch, and Dario (2014) attributed to a negative correlation between sample sizes and the underlying sizes of the effects being measured. We argue that neither explanation is necessary, suggesting instead that the apparent Weisburd paradox is an artifact of the systematic overestimation inherent in post-hoc power calculations, a bias that is large when n is small. Furthermore, we recommend abandoning the use of statistical power as a measure of the strength of a study, because implicit in the definition of power is the bad idea of statistical significance as a research goal.
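To see why post-hoc power is biased upward at small n, consider a minimal simulation sketch (an illustration only; the effect size, sample size, and replication count below are assumptions, not values from the paper): when the observed effect is plugged back into the power formula, the resulting "post-hoc power" is on average well above the true power.

```python
# Minimal sketch (assumed values): post-hoc power computed from the observed
# effect size overstates the true power when n per group is small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, alpha = 0.2, 25, 0.05          # small true effect, small per-group n
z_crit = stats.norm.ppf(1 - alpha / 2)

def power_from_d(d, n):
    """Approximate two-sided power of a two-sample test for standardized effect d."""
    ncp = abs(d) * np.sqrt(n / 2)
    return stats.norm.cdf(ncp - z_crit) + stats.norm.cdf(-ncp - z_crit)

true_power = power_from_d(true_d, n)

post_hoc = []
for _ in range(5000):
    a = rng.normal(0, 1, n)
    b = rng.normal(true_d, 1, n)
    d_obs = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    post_hoc.append(power_from_d(d_obs, n))

print(f"true power:                          {true_power:.2f}")
print(f"mean post-hoc power from observed d: {np.mean(post_hoc):.2f}")  # typically well above true power
```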

2007 ◽  
Vol 25 (18_suppl) ◽  
pp. 6516-6516
Author(s):  
P. Bedard ◽  
M. K. Krzyzanowska ◽  
M. Pintilie ◽  
I. F. Tannock

6516 Background: Underpowered randomized clinical trials (RCTs) may expose participants to the risks and burdens of research without scientific merit. We investigated the prevalence of underpowered RCTs presented at ASCO annual meetings. Methods: We surveyed all two-arm parallel phase III RCTs presented at the ASCO annual meeting from 1995–2003 for which differences in the primary endpoint were not statistically significant. Post hoc calculations were performed using a power of 80% and α = 0.05 (two-sided) to determine the sample size required to detect a small, medium, and large effect size between the two groups. For studies reporting a proportion or time to event as a primary endpoint, effect size was expressed as an odds ratio (OR) or hazard ratio (HR), respectively, with a small effect size defined as OR/HR = 1.3, a medium effect size as OR/HR = 1.5, and a large effect size as OR/HR = 2.0. Logistic regression was used to identify factors associated with lack of statistical power. Results: Of 423 negative RCTs for which post hoc sample size calculations could be performed, 45 (10.6%), 138 (32.6%), and 333 (78.7%) had adequate sample size to detect small, medium, and large effect sizes, respectively. Only 35 negative RCTs (7.1%) reported a reason for inadequate sample size. In a multivariable model, studies presented at plenary or oral sessions (p<0.0001) and multicenter studies supported by a co-operative group (p<0.0001) were more likely to have adequate sample size. Conclusion: Two-thirds of negative RCTs presented at the ASCO annual meeting do not have an adequate sample to detect a medium-sized treatment effect. Most underpowered negative RCTs do not report a sample size calculation or reasons for inadequate patient accrual. No significant financial relationships to disclose.
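For context, the following sketch shows the kind of calculation involved, using Schoenfeld's standard approximation for the number of events needed to detect a given hazard ratio with a log-rank test at 80% power and two-sided α = 0.05; this is a textbook formula, not necessarily the authors' exact method.

```python
# Sketch of a standard sample-size calculation for time-to-event endpoints:
# Schoenfeld's approximation for required events, 1:1 allocation.
import numpy as np
from scipy import stats

alpha, power = 0.05, 0.80
z_a = stats.norm.ppf(1 - alpha / 2)
z_b = stats.norm.ppf(power)

def events_needed(hr):
    """Required number of events under equal allocation (Schoenfeld)."""
    return 4 * (z_a + z_b) ** 2 / np.log(hr) ** 2

for label, hr in [("small", 1.3), ("medium", 1.5), ("large", 2.0)]:
    print(f"{label:6s} effect (HR = {hr}): ~{events_needed(hr):.0f} events")
```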


2020 ◽  
Author(s):  
Michael W. Beets ◽  
R. Glenn Weaver ◽  
John P.A. Ioannidis ◽  
Alexis Jones ◽  
Lauren von Klinggraeff ◽  
...  

Abstract Background: Pilot/feasibility studies, and studies with small sample sizes more generally, may be associated with inflated effects. This study explores the vibration of effect sizes (VoE) in meta-analyses when different inclusion criteria based on sample size or pilot/feasibility status are applied. Methods: Searches were conducted for meta-analyses of behavioral interventions on topics related to the prevention/treatment of childhood obesity published from 01-2016 to 10-2019. The computed summary effect sizes (ES) were extracted from each meta-analysis. Individual studies included in the meta-analyses were classified into one of four categories: self-identified pilot/feasibility studies, or, based on sample size, N≤100, N>100, and N>370 (the upper 75th percentile of sample size). The VoE was defined as the absolute difference (ABS) between re-estimations of the summary ES restricted to each study classification and the originally reported summary ES. Concordance (kappa) of statistical significance between summary ES was assessed. Fixed and random effects models and meta-regressions were estimated. Three case studies are presented to illustrate the impact of including pilot/feasibility and N≤100 studies on the estimated summary ES. Results: A total of 1,602 effect sizes, representing 145 reported summary ES, were extracted from 48 meta-analyses containing 603 unique studies (avg. 22 per meta-analysis, range 2-108) and including 227,217 participants. Pilot/feasibility and N≤100 studies comprised 22% (0-58%) and 21% (0-83%) of studies, respectively. Meta-regression indicated that the ABS between the re-estimated and original summary ES was 0.29 where ≥40% of the studies comprising a summary ES had N≤100. The ABS was 0.46 where pilot/feasibility and N≤100 studies together made up >80% of the studies. Where ≤40% of the studies comprising a summary ES had N>370, the ABS ranged from 0.20-0.30. Concordance was low when removing both pilot/feasibility and N≤100 studies (kappa=0.53) and when restricting analyses to only the largest studies (N>370, kappa=0.35), with 20% and 26% of the originally reported statistically significant ES rendered non-significant, respectively. Reanalysis of the three case study meta-analyses resulted in re-estimated ES that were either non-significant or roughly half the originally reported ES. Conclusions: When meta-analyses of behavioral interventions include a substantial proportion of both pilot/feasibility and N≤100 studies, summary ES can be affected markedly and should be interpreted with caution.
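The core re-estimation step can be illustrated with a short sketch using hypothetical effect sizes and sample sizes (none of these numbers come from the review, and only the fixed-effect version is shown for brevity; the review also fitted random-effects models): compute an inverse-variance summary ES with and without the small studies and take the absolute difference, i.e. the reported ABS.

```python
# Illustrative sketch with hypothetical data: re-estimating a fixed-effect
# inverse-variance summary ES after excluding small (N <= 100) studies, and
# computing the absolute difference (ABS) from the original summary.
import numpy as np

es = np.array([0.80, 0.65, 0.52, 0.12, 0.10])   # hypothetical per-study effect sizes
se = np.array([0.34, 0.30, 0.26, 0.11, 0.09])   # hypothetical standard errors
n  = np.array([ 35,   48,   60,  380,  520])    # hypothetical sample sizes

def fixed_effect_summary(es, se):
    w = 1 / se ** 2                              # inverse-variance weights
    return np.sum(w * es) / np.sum(w)

original   = fixed_effect_summary(es, se)
restricted = fixed_effect_summary(es[n > 100], se[n > 100])
print(f"original summary ES:       {original:.2f}")
print(f"summary ES without N<=100: {restricted:.2f}")
print(f"ABS (vibration of ES):     {abs(original - restricted):.2f}")
```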


2021 ◽  
Author(s):  
Benjamin J Burgess ◽  
Michelle C Jackson ◽  
David J Murrell

1. Most ecosystems are subject to co-occurring, anthropogenically driven changes, and understanding how these multiple stressors interact is a pressing concern. Stressor interactions are typically studied using null models, with the additive and multiplicative null expectations being the most widely applied. Such approaches classify interactions as synergistic, antagonistic, reversal, or indistinguishable from the null expectation. Despite their widespread use, there has been no thorough analysis of these null models, nor a systematic test of the robustness of their results to sample size or to sampling error in the estimates of the responses to stressors. 2. We use data simulated from food web models where the true stressor interactions are known, together with analytical results based on the null model equations, to uncover how (i) sample size, (ii) variation in biological responses to the stressors, and (iii) statistical significance affect the ability to detect non-null interactions. 3. Our analyses lead to three main results. Firstly, the additive and multiplicative null models are not directly comparable, and over one third of all simulated interactions had classifications that were model dependent. Secondly, both null models have weak power to correctly classify interactions at commonly implemented sample sizes (i.e., ≤6 replicates) unless data uncertainty is unrealistically low; this means all but the most extreme interactions are indistinguishable from the null expectation. Thirdly, increasing sample size increases the power to detect the true interactions, but only very slowly; the biggest gains come from increasing replicates from 3 up to 25, and we provide an R function for users to determine the sample sizes required to detect a critical effect size of biological interest under the additive model. 4. Our results will aid researchers in the design of their experiments and the subsequent interpretation of results. We find no clear statistical advantage to using one null model over the other and argue that the choice of null model should be based on biological relevance rather than statistical properties. However, there is a pressing need to increase experimental sample sizes; otherwise many biologically important synergistic and antagonistic stressor interactions will continue to be missed.
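The two null expectations can be made concrete with a toy example (hypothetical numbers, not values from the study): for a control mean C and single-stressor means A and B, the additive model predicts the combined response C + (A − C) + (B − C), while the multiplicative model predicts C·(A/C)·(B/C), and the same observation can be classified differently under the two models.

```python
# Toy sketch (hypothetical numbers) contrasting additive vs multiplicative null
# expectations for the combined effect of two stressors on, e.g., biomass.
C, A, B = 100.0, 70.0, 80.0        # control, stressor-A-only, stressor-B-only means
obs_AB  = 52.0                     # observed response under both stressors

additive_null       = C + (A - C) + (B - C)     # effects add on the raw scale
multiplicative_null = C * (A / C) * (B / C)     # proportional effects multiply

def classify(observed, expected, stressors_negative=True):
    """Crude classification that ignores sampling error; in practice a CI or
    test around the null expectation is needed before labelling an interaction."""
    if observed == expected:
        return "indistinguishable from null"
    worse = observed < expected if stressors_negative else observed > expected
    return "synergistic (worse than expected)" if worse else "antagonistic"

print(f"additive null:       {additive_null:.1f} -> {classify(obs_AB, additive_null)}")
print(f"multiplicative null: {multiplicative_null:.1f} -> {classify(obs_AB, multiplicative_null)}")
# The same observation is antagonistic under one model and synergistic under the other,
# illustrating why classifications can be model dependent.
```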


2020 ◽  
Vol 42 (4) ◽  
pp. 849-870
Author(s):  
Reza Norouzian

Abstract Researchers are traditionally advised to plan their required sample size so that a sufficient level of statistical power is ensured (Cohen, 1988). While this method helps distinguish statistically significant effects from nonsignificant ones, it does not help achieve the higher goal of accurately estimating the actual size of those effects in an intended study. Adopting an open-science approach, this article presents an alternative approach to sample size planning, accuracy in effect size estimation (AESE), which ensures that researchers obtain adequately narrow confidence intervals (CIs) for their effect sizes of interest, thereby ensuring accuracy in estimating the actual size of those effects. Specifically, I (a) compare the underpinnings of the power-analytic and AESE methods, (b) provide a practical definition of narrow CIs, (c) apply the AESE method to various research studies from the L2 literature, and (d) offer several flexible R programs to implement the methods discussed in this article.
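A minimal sketch of the AESE idea follows (in Python rather than the article's R programs, and using an assumed large-sample variance formula for Cohen's d): increase the per-group sample size until the expected width of the 95% CI for the effect size falls below a target width.

```python
# Sketch of accuracy-in-effect-size-estimation planning: smallest per-group n
# such that the expected 95% CI for a standardized mean difference d is no wider
# than a chosen target width. Not the article's implementation.
import numpy as np
from scipy import stats

def planned_n(d_expected, target_width, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    n = 4
    while True:
        # approximate large-sample variance of Cohen's d with equal group sizes n
        var_d = 2 / n + d_expected ** 2 / (4 * n)
        if 2 * z * np.sqrt(var_d) <= target_width:
            return n
        n += 1

# e.g. per-group n needed so that the CI around an anticipated d of 0.5 is at most 0.4 wide
print(planned_n(d_expected=0.5, target_width=0.4))
```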


2017 ◽  
Vol 4 (2) ◽  
pp. 160254 ◽  
Author(s):  
Estelle Dumas-Mallet ◽  
Katherine S. Button ◽  
Thomas Boraud ◽  
Francois Gonon ◽  
Marcus R. Munafò

Studies with low statistical power increase the likelihood that a statistically significant finding represents a false positive result. We conducted a review of meta-analyses of studies investigating the association of biological, environmental or cognitive parameters with neurological, psychiatric and somatic diseases, excluding treatment studies, in order to estimate the average statistical power across these domains. Taking the effect size indicated by a meta-analysis as the best estimate of the likely true effect size, and assuming a threshold for declaring statistical significance of 5%, we found that approximately 50% of studies have statistical power in the 0–10% or 11–20% range, well below the minimum of 80% that is often considered conventional. Studies with low statistical power appear to be common in the biomedical sciences, at least in the specific subject areas captured by our search strategy. However, we also observe evidence that this depends in part on research methodology, with candidate gene studies showing very low average power and studies using cognitive/behavioural measures showing high average power. This warrants further investigation.
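The opening claim can be made concrete with a standard positive-predictive-value calculation (the prior odds and power values below are assumptions chosen for illustration, not estimates from the review): as power falls, the probability that a statistically significant finding reflects a true effect falls with it.

```python
# Worked illustration (assumed inputs): positive predictive value of a
# significant finding as a function of power, for fixed alpha and prior odds.
def ppv(power, alpha=0.05, prior_odds=0.25):
    """P(effect is real | significant) = power*R / (power*R + alpha)."""
    return power * prior_odds / (power * prior_odds + alpha)

for p in (0.80, 0.20, 0.10):
    print(f"power {p:.0%}: PPV = {ppv(p):.2f}")   # PPV drops as power drops
```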


2021 ◽  
Vol 108 (Supplement_9) ◽  
Author(s):  
James Halle-Smith ◽  
Rupaly Pande ◽  
Lewis Hall ◽  
James Hodson ◽  
Keith J Roberts ◽  
...  

Abstract Background Many studies evaluate interventions to reduce postoperative pancreatic fistula (POPF) following pancreatoduodenectomy (PD), but often report conflicting results. Previous meta-analyses have generally included non-randomised trials and have not considered novel interventions. Aim To evaluate interventions to reduce POPF following PD using level 1 data. Methods A systematic review and meta-analysis assessed randomised controlled trials (RCTs) evaluating interventions to reduce all-POPF or clinically relevant (CR)-POPF after PD. A post-hoc analysis of negative RCTs assessed whether these had appropriate levels of statistical power. Results Among 22 interventions (n = 7,512 patients, 55 studies), 12 were assessed by multiple studies and subject to meta-analysis. Of these, external pancreatic duct drainage was the only intervention associated with significantly reduced rates of both CR- and all-POPF. In addition, ulinastatin was associated with significantly reduced rates of CR-POPF, whilst invagination (versus duct-to-mucosa) pancreatojejunostomy was associated with significantly reduced rates of all-POPF. Review of negative RCTs found the majority to be underpowered, with post-hoc power calculations indicating that interventions would need to reduce the POPF rate to ≤1% to achieve 80% power in 16/34 (all-POPF) and 19/25 (CR-POPF) studies, respectively. Conclusions Meta-analysis supports a role for several interventions to reduce POPF after PD, although the data are often inconsistent and/or based on small trials. The systematic review identifies other interventions which may benefit from further study. However, underpowered trials appear to be a fundamental problem, inherently more so for CR-POPF. Larger trials, or new directions for research, are required to further understanding in this field.
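A rough sketch of the post-hoc reasoning follows (the control rate and arm size are assumptions, not figures from the review): with a two-proportion normal-approximation power calculation, a small trial only approaches 80% power when the intervention is assumed to drive the POPF rate down to around 1%.

```python
# Sketch (assumed numbers): power of a two-proportion comparison, showing why a
# small trial needs an implausibly large reduction in POPF rate to reach 80%.
from math import sqrt
from scipy import stats

def two_prop_power(p_control, p_treat, n_per_arm, alpha=0.05):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    p_bar = (p_control + p_treat) / 2
    se0 = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)            # SE under H0
    se1 = sqrt(p_control * (1 - p_control) / n_per_arm
               + p_treat * (1 - p_treat) / n_per_arm)           # SE under H1
    z = (abs(p_control - p_treat) - z_crit * se0) / se1
    return stats.norm.cdf(z)

# hypothetical trial: 20% CR-POPF in controls, 40 patients per arm
for p_treat in (0.10, 0.05, 0.01):
    print(f"treatment rate {p_treat:.0%}: power = {two_prop_power(0.20, p_treat, 40):.2f}")
```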


2007 ◽  
Vol 25 (23) ◽  
pp. 3482-3487 ◽  
Author(s):  
Philippe L. Bedard ◽  
Monika K. Krzyzanowska ◽  
Melania Pintilie ◽  
Ian F. Tannock

Purpose To investigate the prevalence of underpowered randomized controlled trials (RCTs) presented at American Society of Clinical Oncology (ASCO) annual meetings. Methods We surveyed all two-arm phase III RCTs presented at ASCO annual meetings from 1995 to 2003 for which negative results were obtained. Post hoc calculations were performed using a power of 80% and an α level of .05 (two sided) to determine sample sizes required to detect small, medium, and large effect sizes. For studies reporting a proportion or time-to-event as primary end point, effect size was expressed as an odds ratio (OR) or hazard ratio (HR), respectively, with a small effect size defined as OR/HR ≥ 1.3, medium effect size defined as OR/HR ≥ 1.5, and large effect size defined as OR/HR ≥ 2.0. Logistic regression was used to identify factors associated with lack of statistical power. Results Of 423 negative RCTs for which post hoc sample size calculations could be performed, 45 (10.6%), 138 (32.6%), and 233 (55.1%) had adequate sample size to detect small, medium, and large effect sizes, respectively. Only 35 negative RCTs (7.1%) reported a reason for inadequate sample size. In a multivariable model, studies that were presented at oral sessions (P = .0038), multicenter studies supported by a cooperative group (P < .0001), and studies with time to event as primary outcome (P < .0001) were more likely to have adequate sample size. Conclusion More than half of negative RCTs presented at ASCO annual meetings do not have an adequate sample to detect a medium-size treatment effect.


2021 ◽  
Author(s):  
Blair Saunders ◽  
Michael Inzlicht

Recent years have witnessed calls for increased rigour and credibility in the cognitive and behavioural sciences, including psychophysiology. Many procedures exist to increase rigour, and among the most important is the need to increase statistical power. Achieving sufficient statistical power, however, is a considerable challenge for resource-intensive methodologies, particularly for between-subjects designs. Meta-analysis is one potential solution; yet the validity of such quantitative reviews is limited by potential bias both in the primary literature and in the meta-analysis itself. Here, we provide a non-technical overview and evaluation of open science methods that could be adopted to increase the transparency of novel meta-analyses. We also contrast post hoc statistical procedures that can be used to correct for publication bias in the primary literature. We suggest that traditional meta-analyses, as applied in ERP research, are exploratory in nature, providing a range of plausible effect sizes without necessarily having the ability to confirm (or disconfirm) existing hypotheses. To complement traditional approaches, we detail how prospective meta-analyses, combined with multisite collaboration, could be used to conduct statistically powerful, confirmatory ERP research.
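As one example of the post hoc bias checks referred to above (a commonly used procedure, not necessarily the one the authors recommend), an Egger-style regression tests for small-study effects by regressing the standardized effect on precision; the data below are hypothetical.

```python
# Hedged sketch of an Egger-style small-study test on hypothetical ERP effects:
# an intercept far from zero suggests small-study (publication-type) bias.
import numpy as np
import statsmodels.api as sm

es = np.array([0.55, 0.48, 0.40, 0.30, 0.22, 0.18, 0.15])   # hypothetical effect sizes
se = np.array([0.30, 0.26, 0.22, 0.15, 0.10, 0.08, 0.06])   # hypothetical standard errors

y = es / se                      # standardized effects
X = sm.add_constant(1 / se)      # precision plus an intercept term
fit = sm.OLS(y, X).fit()
print(fit.params)                # first entry is the intercept
print(fit.pvalues)
```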


2019 ◽  
Author(s):  
Francesco Margoni ◽  
Martin Shepperd

Infant research is making considerable progress. However, among infant researchers there is growing concern regarding the widespread habit of undertaking studies that have small sample sizes and employ tests with low statistical power (to detect a wide range of possible effects). For many researchers, issues of confidence may be partially resolved by relying on replications. Here, we bring further evidence that the classical logic of confirmation, according to which the result of a replication study confirms the original finding when it reaches statistical significance, could usefully be abandoned. With real examples taken from the infant literature and Monte Carlo simulations, we show that a very wide range of possible replication results would, in a formal statistical sense, constitute confirmation, because they can be explained simply by sampling error. Thus, often no useful conclusion can be drawn from a single replication study or from a small number of them. We suggest that, in order to accumulate and generate new knowledge, the dichotomous view of replication as confirmatory/disconfirmatory can be replaced by an approach that emphasizes the estimation of effect sizes via meta-analysis. Moreover, we discuss possible solutions for reducing problems affecting the validity of conclusions drawn from meta-analyses in infant research.
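A small Monte Carlo in the same spirit (the true effect and sample size below are assumptions, not values from the article) shows how widely replication estimates scatter around the true effect at typical infant-study sample sizes, which is why a single replication rarely settles much on its own.

```python
# Monte Carlo sketch (assumed parameters): spread of replication effect-size
# estimates due to sampling error alone, for a modest true effect and small n.
import numpy as np

rng = np.random.default_rng(7)
true_d, n, reps = 0.35, 24, 10_000        # assumed true effect, per-group n, simulations

d_obs = np.empty(reps)
for i in range(reps):
    a = rng.normal(0, 1, n)
    b = rng.normal(true_d, 1, n)
    sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_obs[i] = (b.mean() - a.mean()) / sp

lo, hi = np.percentile(d_obs, [2.5, 97.5])
print(f"95% of replication estimates fall in [{lo:.2f}, {hi:.2f}]")  # a very wide interval
```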

