What is the minimum number of effect sizes required in meta-regression? An estimation based on statistical power and estimation precision

2020 ◽  
Vol 28 (4) ◽  
pp. 673
Author(s):  
Junyan FANG ◽  
Minqiang ZHANG


2021 ◽  
Author(s):  
Ymkje Anna de Vries ◽  
Robert A Schoevers ◽  
Julian Higgins ◽  
Marcus Munafo ◽  
Jojanneke Bastiaansen

Background: Previous research has suggested that statistical power is suboptimal in many biomedical disciplines, but it is unclear whether power is better in trials for particular interventions, disorders, or outcome types. We therefore performed a detailed examination of power in trials of psychotherapy, pharmacotherapy, and complementary and alternative medicine (CAM) for mood, anxiety, and psychotic disorders. Methods: We extracted data from the Cochrane Database of Systematic Reviews (Mental Health). We focused on continuous efficacy outcomes and estimated power to detect standardized effect sizes (SMD=0.20-0.80, primary effect size SMD=0.40) and the meta-analytic effect size (ESMA). We performed meta-regression to estimate the influence of including underpowered studies in meta-analyses. Results: We included 216 reviews with 8,809 meta-analyses and 36,540 studies. Statistical power for continuous efficacy outcomes was very low across intervention and disorder types (overall median [IQR] power for SMD=0.40: 0.33 [0.19-0.54]; for ESMA: 0.15 [0.07-0.44]), only reaching conventionally acceptable levels (80%) for SMD=0.80. Median power to detect the ESMA was higher in treatment-as-usual (TAU)/waitlist-controlled (0.54-0.66) or placebo-controlled (0.15-0.40) trials than in trials comparing active treatments (0.07-0.10). Meta-regression indicated that adequately powered studies produced smaller effect sizes than underpowered studies (B=-0.06, p=0.008). Conclusions: Power to detect both fixed and meta-analytic effect sizes in clinical trials in psychiatry was low across all interventions and disorders examined. As underpowered studies produced larger effect sizes than adequately powered studies, these results confirm the need to increase sample sizes and to reduce reporting bias against studies reporting null results to improve the reliability of the published literature.
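As a rough illustration of the kind of calculation behind these power estimates (a minimal sketch under a normal approximation, not the authors' code), the power of a two-sided, two-sample comparison to detect a given standardized mean difference can be computed from the per-group sample size:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(smd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided, two-sample test to
    detect a standardized mean difference `smd` with equal group sizes."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality: smd divided by its approximate SE, sqrt(2/n)
    ncp = smd * sqrt(n_per_group / 2)
    return z.cdf(ncp - z_crit)  # the opposite-tail rejection region is negligible

# e.g., SMD = 0.40 with 64 participants per arm
print(round(approx_power(0.40, 64), 2))  # 0.62
```

Under these assumptions, reaching the conventional 80% power for SMD=0.40 requires roughly 100 participants per arm, which makes the reported median power of 0.33 easy to reconcile with the typically small trials in the sample.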


2019 ◽  
Vol 50 (5-6) ◽  
pp. 292-304 ◽  
Author(s):  
Mario Wenzel ◽  
Marina Lind ◽  
Zarah Rowland ◽  
Daniela Zahn ◽  
Thomas Kubiak

Abstract. Evidence on the existence of the ego depletion phenomenon, as well as on the size of its effects and its potential moderators and mediators, is ambiguous. Building on a crossover design that enables superior statistical power within a single study, we investigated the robustness of the ego depletion effect between and within subjects, along with moderating and mediating influences of the ego depletion manipulation checks. Our results, based on a sample of 187 participants, demonstrated that (a) the between- and within-subject ego depletion effects had only negligible effect sizes and that there was (b) large interindividual variability that (c) could not be explained by differences in ego depletion manipulation checks. We discuss the implications of these results and outline a future research agenda.


2001 ◽  
Vol 88 (3_suppl) ◽  
pp. 1194-1198 ◽  
Author(s):  
F. Stephen Bridges ◽  
C. Bennett Williamson ◽  
Donna Rae Jarvis

Of 75 letters “lost” in the Florida Panhandle, 33 (44%) were returned in the mail by the finders (the altruistic response). Addressees' affiliations were significantly associated with different rates of return; fewer letters addressed to the emotive Intercontinental Gay and Lesbian Outdoors Organization were returned than letters with nonemotive addressees. The technique for power analysis by Gillett (1996) was applied to data from an earlier study and indicated our sample of 75 subjects would still yield the desired power level, i.e., .80, for the likely effect sizes. Statistical power was .83, and the effect was medium in size at .34.


Author(s):  
Thomas Groß

Abstract. Background. In recent years, cyber security user studies have been appraised in meta-research, mostly focusing on the completeness of their statistical inferences and the fidelity of their statistical reporting. However, estimates of the field's distribution of statistical power and of its publication bias have not received much attention. Aim. In this study, we aim to estimate the effect sizes present and their standard errors, as well as the implications for statistical power and publication bias. Method. We built upon a published systematic literature review of 146 user studies in cyber security (2006–2016). We took into account 431 statistical inferences, including t-, χ²-, r-, one-way F-, and Z-tests. In addition, we coded the corresponding total sample sizes, group sizes, and test families. Given these data, we established the observed effect sizes and evaluated the overall publication bias. We further computed statistical power vis-à-vis parametrized population thresholds to gain unbiased estimates of the power distribution. Results. We obtained a distribution of effect sizes and their conversion into comparable log odds ratios, together with their standard errors. We further gained funnel-plot estimates of the publication bias present in the sample, as well as insights into the power distribution and its consequences. Conclusions. Through the lenses of power and publication bias, we shed light on the statistical reliability of the studies in the field. The upshot of this introspection is practical recommendations on conducting and evaluating studies to advance the field.
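The conversion of effect sizes into comparable log odds ratios that the abstract mentions is commonly done with the logistic-distribution approximation, log(OR) = d·π/√3; a minimal sketch of that standard conversion (not the paper's own code):

```python
from math import pi, sqrt

def d_to_log_odds(d: float, se_d: float) -> tuple[float, float]:
    """Convert Cohen's d and its standard error to a log odds ratio,
    via the logistic approximation log(OR) = d * pi / sqrt(3)."""
    factor = pi / sqrt(3)
    return d * factor, se_d * factor

log_or, se = d_to_log_odds(0.50, 0.10)
print(round(log_or, 2), round(se, 2))  # 0.91 0.18
```

Because the same multiplicative factor scales both the estimate and its standard error, funnel-plot asymmetry checks give the same picture on either scale.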


2021 ◽  
Vol 3 (1) ◽  
pp. 61-89
Author(s):  
Stefan Geiß

Abstract This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The ensuing procedure is analogous to power calculations for experimental designs. In most widespread sample size/effect size settings, the rule-of-thumb that chance-adjusted agreement should be ≥.800 or ≥.667 corresponds to the simulation results, yielding acceptable α and β error rates. However, this simulation allows making precise power calculations that can consider the specifics of each study’s context, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .800 to test a hypothesis with sufficient statistical power. In studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help in both evaluating and designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g. when constructs are hard to measure). I supply equations, easy-to-use tables and R functions to facilitate use of this framework, along with example code as online appendix.
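The trade-off the abstract describes can be illustrated with a back-of-envelope analogue (a hypothetical sketch, not the article's simulation): measurement error from imperfect coding attenuates an observed correlation by roughly the square root of the coding reliability, which then enters the power of the correlation test via the Fisher z approximation:

```python
from math import atanh, sqrt
from statistics import NormalDist

def corr_power(r_true: float, reliability: float, n: int,
               alpha: float = 0.05) -> float:
    """Approximate power to detect a correlation whose observed size is
    attenuated by sqrt(reliability) (classical attenuation), using the
    Fisher z normal approximation."""
    z = NormalDist()
    r_obs = r_true * sqrt(reliability)   # attenuated correlation
    ncp = atanh(r_obs) * sqrt(n - 3)
    return z.cdf(ncp - z.inv_cdf(1 - alpha / 2))

# Lower coding reliability costs power at a fixed n and effect size
print(round(corr_power(0.30, 1.00, 100), 2))  # 0.86
print(round(corr_power(0.30, 0.60, 100), 2))  # 0.64
```

This mirrors the article's qualitative conclusion: the same hypothesis test can be adequately or inadequately powered depending on coding quality, so sample size can compensate for borderline reliability and vice versa.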


Author(s):  
Valentin Amrhein ◽  
Fränzi Korner-Nievergelt ◽  
Tobias Roth

The widespread use of 'statistical significance' as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (American Statistical Association, Wasserstein & Lazar 2016). We review why degrading p-values into 'significant' and 'nonsignificant' contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but mistrust results with larger p-values. In either case, p-values tell us little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance itself (p≤0.05) is also hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds.
Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values, too, offer some evidence against the null hypothesis; they cannot be interpreted as supporting the null hypothesis, and doing so amounts to falsely concluding that 'there is no effect'. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as 'we need more stringent decision rules', 'sample sizes will decrease' or 'we need to get rid of p-values'.
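The replication figures quoted above follow from simple probability arithmetic, on the reading that two independent studies are each run at the stated power against a true effect; a quick check:

```python
def p_both_significant(power: float) -> float:
    """Probability that two independent studies are both significant."""
    return power * power

def p_conflicting(power: float) -> float:
    """Probability that exactly one of two independent studies is significant."""
    return 2 * power * (1 - power)

print(round(p_both_significant(0.40), 2))  # 0.16 -> "one in six"
print(round(p_conflicting(0.80), 2))       # 0.32 -> "one third of the cases"
```

The point generalizes: even with no questionable practices at all, dichotomizing at p≤0.05 guarantees frequent "failed" replications whenever power is below 100%.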


2018 ◽  
Author(s):  
Qianying Wang ◽  
Jing Liao ◽  
Kaitlyn Hair ◽  
Alexandra Bannach-Brown ◽  
Zsanett Bahor ◽  
...  

Abstract. Background. Meta-analysis is increasingly used to summarise the findings identified in systematic reviews of animal studies modelling human disease. Such reviews typically identify a large number of individually small studies, testing efficacy under a variety of conditions. This leads to substantial heterogeneity, and identifying potential sources of this heterogeneity is an important function of such analyses. However, the statistical performance of different approaches (normalised compared with standardised mean difference estimates of effect size; stratified meta-analysis compared with meta-regression) is not known. Methods. Using data from 3116 experiments in focal cerebral ischaemia, we constructed a linear model predicting observed improvement in outcome contingent on 25 independent variables. We used stochastic simulation to attribute these variables to simulated studies according to their prevalence. To ascertain the ability to detect an effect of a given variable, we additionally introduced a “variable of interest” of given prevalence and effect. To establish any impact of a latent variable on the apparent influence of the variable of interest, we also introduced a “latent confounding variable” with given prevalence and effect, and allowed the prevalence of the variable of interest to differ in the presence and absence of the latent variable. Results. Generally, the normalised mean difference (NMD) approach had higher statistical power than the standardised mean difference (SMD) approach. Even when the effect size and the number of studies contributing to the meta-analysis were small, there was good statistical power to detect the overall effect, with a low false positive rate. For detecting an effect of the variable of interest, stratified meta-analysis was associated with a substantial false positive rate with NMD estimates of effect size, while using an SMD estimate of effect size had very low statistical power.
Univariate and multivariable meta-regression performed substantially better, with low false positive rates for both NMD and SMD approaches; power was higher for NMD than for SMD. The presence or absence of a latent confounding variable only introduced an apparent effect of the variable of interest when there was substantial asymmetry in the prevalence of the variable of interest in the presence or absence of the confounding variable. Conclusions. In meta-analysis of data from animal studies, NMD estimates of effect size should be used in preference to SMD estimates, and meta-regression should, where possible, be chosen over stratified meta-analysis. The power to detect the influence of the variable of interest depends on the effect of the variable of interest and its prevalence, but unless effects are very large, adequate power is only achieved once at least 100 experiments are included in the meta-analysis.
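The two effect-size metrics compared above can be computed from group summary statistics as follows (a hedged sketch with hypothetical numbers, not the review's data; the NMD definition used here, percentage improvement relative to the control mean, is one common choice in preclinical meta-analysis):

```python
from math import sqrt

def smd(m_trt, m_ctl, sd_trt, sd_ctl, n_trt, n_ctl):
    """Standardised mean difference (Cohen's d with pooled SD);
    improvement coded as a reduction relative to control."""
    sd_pooled = sqrt(((n_trt - 1) * sd_trt**2 + (n_ctl - 1) * sd_ctl**2)
                     / (n_trt + n_ctl - 2))
    return (m_ctl - m_trt) / sd_pooled

def nmd(m_trt, m_ctl):
    """Normalised mean difference: % improvement relative to the control mean."""
    return 100 * (m_ctl - m_trt) / m_ctl

# e.g., infarct volume reduced from 100 to 80 units (SD 10, n = 10 per group)
print(round(smd(80, 100, 10, 10, 10, 10), 1), round(nmd(80, 100), 1))  # 2.0 20.0
```

The contrast makes the paper's comparison concrete: NMD depends only on group means, while SMD also depends on the within-study SDs, so noisy SD estimates from small animal studies propagate into SMD but not NMD.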


2020 ◽  
Vol 14 ◽  
Author(s):  
Aline da Silva Frost ◽  
Alison Ledgerwood

Abstract This article provides an accessible tutorial with concrete guidance for how to start improving research methods and practices in your lab. Following recent calls to improve research methods and practices within and beyond the borders of psychological science, resources have proliferated across book chapters, journal articles, and online media. Many researchers are interested in learning more about cutting-edge methods and practices but are unsure where to begin. In this tutorial, we describe specific tools that help researchers calibrate their confidence in a given set of findings. In Part I, we describe strategies for assessing the likely statistical power of a study, including when and how to conduct different types of power calculations, how to estimate effect sizes, and how to think about power for detecting interactions. In Part II, we provide strategies for assessing the likely type I error rate of a study, including distinguishing clearly between data-independent (“confirmatory”) and data-dependent (“exploratory”) analyses and thinking carefully about different forms and functions of preregistration.


Author(s):  
Liana R. Taylor ◽  
Avinash Bhati ◽  
Faye S. Taxman

The Washington State Institute for Public Policy (WSIPP) uses meta-analyses to help program administrators identify effective programs that reduce recidivism. The results are displayed as summary effect sizes. Yet many programs are grouped within a single category (such as Intensive Supervision or Correctional Education), even though their features suggest the programs may be very different. The following research question was examined: What program features are related to the effect size within a WSIPP program category? Researchers at ACE! at George Mason University reviewed the studies analyzed by WSIPP and their effect sizes. The meta-regression global models showed that recidivism decreased with certain program features, while other program features increased recidivism. A multivariate meta-regression showed substantial variation across Cognitive-Behavioral Therapy programs. These preliminary findings suggest the need to further research how differing program features contribute to client-level outcomes, and to develop a scheme to better classify programs.


1995 ◽  
Vol 55 (5) ◽  
pp. 773-776 ◽  
Author(s):  
Bernard S. Gorman ◽  
Louis H. Primavera ◽  
David B. Allison
