Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty

2017 · Vol 28 (11) · pp. 1547-1562
Author(s): Samantha F. Anderson, Ken Kelley, Scott E. Maxwell

The sample size necessary to obtain a desired level of statistical power depends in part on the population value of the effect size, which is, by definition, unknown. A common approach to sample-size planning uses the sample effect size from a prior study as an estimate of the population value of the effect to be detected in the future study. Although this strategy is intuitively appealing, effect-size estimates, taken at face value, are typically not accurate estimates of the population effect size because of publication bias and uncertainty. We show that the use of this approach often results in underpowered studies, sometimes to an alarming degree. We present an alternative approach that adjusts sample effect sizes for bias and uncertainty, and we demonstrate its effectiveness for several experimental designs. Furthermore, we discuss an open-source R package, BUCSS, and user-friendly Web applications that we have made available to researchers so that they can easily implement our suggested methods.
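A minimal sketch in R of the problem the article describes (illustrative numbers, not the BUCSS correction itself): a pilot study's effect size, published only when significant, overestimates the true effect, and a follow-up study planned on that estimate ends up with far less than the nominal 80% power.

set.seed(1)
true_d <- 0.3; n_pilot <- 25; reps <- 5000

# "Published" pilot effects: the sample d is retained only when p < .05
published_d <- replicate(reps, {
  x <- rnorm(n_pilot, 0, 1)
  y <- rnorm(n_pilot, true_d, 1)
  d <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)
  if (t.test(y, x, var.equal = TRUE)$p.value < .05) d else NA
})
published_d <- published_d[!is.na(published_d)]
mean(published_d)                                    # much larger than the true 0.3

# Plan the next study from the published estimate, then check its power
# against the true effect size
n_plan <- ceiling(power.t.test(delta = median(published_d), sd = 1, power = .80)$n)
power.t.test(n = n_plan, delta = true_d, sd = 1)$power   # well below .80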

2020 · Vol 42 (4) · pp. 849-870
Author(s): Reza Norouzian

Abstract Researchers are traditionally advised to plan their required sample size so that a sufficient level of statistical power is ensured (Cohen, 1988). While this method helps distinguish statistically significant effects from nonsignificant ones, it does not help achieve the higher goal of accurately estimating the actual size of those effects in an intended study. Adopting an open-science approach, this article presents an alternative approach to sample-size planning, accuracy in effect size estimation (AESE), which ensures that researchers obtain adequately narrow confidence intervals (CIs) for their effect sizes of interest, thereby ensuring accuracy in estimating the actual size of those effects. Specifically, I (a) compare the underpinnings of power-analytic and AESE methods, (b) provide a practical definition of narrow CIs, (c) apply the AESE method to various research studies from the L2 literature, and (d) offer several flexible R programs to implement the methods discussed in this article.
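As a rough companion to the AESE idea, the sketch below (plain R, assumed values, using the large-sample approximation Var(d) ≈ 2/n + d²/(4n) for two equal groups rather than exact noncentral-t intervals) searches for the per-group n that keeps the 95% CI for Cohen's d no wider than a target width.

# Smallest per-group n whose approximate CI full width is <= `width`
aese_n <- function(d, width = 0.40, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  n <- 10
  repeat {
    half_width <- z * sqrt(2 / n + d^2 / (4 * n))
    if (2 * half_width <= width) return(n)
    n <- n + 1
  }
}
aese_n(d = 0.5, width = 0.40)   # roughly 200 per group for a CI no wider than 0.40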


2021 · Vol 3 (1) · pp. 61-89
Author(s): Stefan Geiß

Abstract This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The ensuing procedure is analogous to power calculations for experimental designs. In the most common sample-size/effect-size settings, the rule of thumb that chance-adjusted agreement should be ≥ .800 or ≥ .667 corresponds to the simulation results, yielding acceptable α and β error rates. However, the simulation also allows precise power calculations that take the specifics of each study's context into account, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .800 to test a hypothesis with sufficient statistical power. In studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help in both evaluating and designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g., when constructs are hard to measure). I supply equations, easy-to-use tables, and R functions to facilitate use of this framework, along with example code as an online appendix.
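A minimal Monte Carlo sketch of the idea (assumed values, not the article's exact procedure): coder error dilutes the coded variable, the observed correlation is attenuated, and the rejection rate across simulated samples shows how statistical power drops as reliability falls.

set.seed(42)
power_sim <- function(n, rho, reliability, reps = 2000) {
  mean(replicate(reps, {
    x <- rnorm(n)                                    # true construct
    y <- rho * x + sqrt(1 - rho^2) * rnorm(n)        # correlated outcome
    x_coded <- sqrt(reliability) * x + sqrt(1 - reliability) * rnorm(n)  # noisy coding
    cor.test(x_coded, y)$p.value < .05
  }))
}
power_sim(n = 200, rho = 0.20, reliability = 1.00)   # about .80
power_sim(n = 200, rho = 0.20, reliability = 0.64)   # noticeably lower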


2017
Author(s): Clarissa F. D. Carneiro, Thiago C. Moulin, Malcolm R. Macleod, Olavo B. Amaral

Abstract Proposals to increase research reproducibility frequently call for focusing on effect sizes instead of p values, as well as for increasing the statistical power of experiments. However, it is unclear to what extent these two concepts are actually taken into account in basic biomedical science. To study this in a real-case scenario, we performed a systematic review of effect sizes and statistical power in studies on learning of rodent fear conditioning, a widely used behavioral task to evaluate memory. Our search criteria yielded 410 experiments comparing control and treated groups in 122 articles. Interventions had a mean effect size of 29.5%, and amnesia caused by memory-impairing interventions was nearly always partial. Mean statistical power to detect the average effect size observed in well-powered experiments with significant differences (37.2%) was 65%, and power was lower among studies with non-significant results. Only one article reported a sample size calculation, and our estimated sample size to achieve 80% power considering typical effect sizes and variances (15 animals per group) was reached in only 12.2% of experiments. Actual effect sizes correlated with the effect size inferences readers made from textual descriptions of results only when findings were non-significant, and neither effect size nor power correlated with study quality indicators, number of citations, or the impact factor of the publishing journal. In summary, effect sizes and statistical power have a wide distribution in the rodent fear conditioning literature, but do not seem to have a large influence on how results are described or cited. Failure to take these concepts into consideration might limit attempts to improve reproducibility in this field of science.
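A hedged companion calculation in base R (not taken from the article's data): it shows what standardized difference a 15-animals-per-group design can detect at 80% power and two-sided α = .05, and how the required group size grows for smaller effects.

power.t.test(n = 15, power = 0.80, sig.level = 0.05)$delta  # detectable d, about 1.05
power.t.test(delta = 0.75, power = 0.80)$n                   # about 29 animals per group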


2021
Author(s): Kleber Neves, Pedro Batista Tan, Olavo Bohrer Amaral

Diagnostic screening models for the interpretation of null hypothesis significance test (NHST) results have been influential in highlighting the effect of selective publication on the reproducibility of the published literature, leading to John Ioannidis’ much-cited claim that most published research findings are false. These models, however, are typically based on the assumption that hypotheses are dichotomously true or false, without considering that effect sizes differ across hypotheses. To address this limitation, we developed a simulation model that represents effect sizes explicitly, using different continuous distributions, while retaining other aspects of previous models such as publication bias and the pursuit of statistical significance. Our results show that the combination of selective publication, bias, low statistical power, and unlikely hypotheses consistently leads to high proportions of false positives, irrespective of the effect size distribution assumed. Using continuous effect sizes also allows us to evaluate the degree of effect size overestimation and the prevalence of estimates with the wrong sign in the literature, showing that the same factors that drive false-positive results also lead to errors in estimating effect size direction and magnitude. Nevertheless, the relative influence of these factors on different metrics varies depending on the distribution assumed for effect sizes. The model is made available as an R Shiny app, allowing one to explore features of the literature in various scenarios.
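A minimal sketch of the same mechanism (assumed parameters, far simpler than the authors' model): effect sizes drawn from a continuous distribution, normal-approximation tests, and publication of significant results only, which is enough to reproduce magnitude inflation and occasional sign errors among published estimates.

set.seed(7)
n <- 20; reps <- 20000
true_d <- rnorm(reps, mean = 0, sd = 0.2)               # continuous effect size distribution
obs_d  <- rnorm(reps, mean = true_d, sd = sqrt(2 / n))  # study-level estimates
p_val  <- 2 * pnorm(-abs(obs_d) / sqrt(2 / n))
published <- p_val < .05                                 # only significant results appear

mean(abs(obs_d[published])) / mean(abs(true_d[published]))  # magnitude inflation
mean(sign(obs_d[published]) != sign(true_d[published]))     # share with the wrong sign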


2007 · Vol 25 (23) · pp. 3482-3487
Author(s): Philippe L. Bedard, Monika K. Krzyzanowska, Melania Pintilie, Ian F. Tannock

Purpose To investigate the prevalence of underpowered randomized controlled trials (RCTs) presented at American Society of Clinical Oncology (ASCO) annual meetings. Methods We surveyed all two-arm phase III RCTs presented at ASCO annual meetings from 1995 to 2003 for which negative results were obtained. Post hoc calculations were performed using a power of 80% and an α level of .05 (two-sided) to determine the sample sizes required to detect small, medium, and large effect sizes. For studies reporting a proportion or a time-to-event as the primary end point, effect size was expressed as an odds ratio (OR) or hazard ratio (HR), respectively, with a small effect size defined as OR/HR ≥ 1.3, a medium effect size as OR/HR ≥ 1.5, and a large effect size as OR/HR ≥ 2.0. Logistic regression was used to identify factors associated with lack of statistical power. Results Of 423 negative RCTs for which post hoc sample size calculations could be performed, 45 (10.6%), 138 (32.6%), and 233 (55.1%) had an adequate sample size to detect small, medium, and large effect sizes, respectively. Only 35 negative RCTs (7.1%) reported a reason for the inadequate sample size. In a multivariable model, studies presented at oral sessions (P = .0038), multicenter studies supported by a cooperative group (P < .0001), and studies with time to event as the primary outcome (P < .0001) were more likely to have an adequate sample size. Conclusion More than half of the negative RCTs presented at ASCO annual meetings do not have an adequate sample size to detect a medium-sized treatment effect.
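For context, a hedged companion calculation (not part of the article): the Schoenfeld approximation for the number of events needed to detect a given hazard ratio with two-sided α = .05, 80% power, and 1:1 allocation, events = 4(z.975 + z.80)² / (log HR)².

events_needed <- function(hr, alpha = 0.05, power = 0.80) {
  ceiling(4 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / log(hr)^2)
}
sapply(c(small = 1.3, medium = 1.5, large = 2.0), events_needed)
# roughly 457, 191, and 66 events, respectively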


Author(s): Joseph P. Vitta, Christopher Nicklin, Stuart McLean

Abstract In this focused methodological synthesis, the sample construction procedures of 110 second language (L2) instructed vocabulary interventions were assessed in relation to effect size–driven sample-size planning, randomization, and multisite usage. These three areas were investigated because inferential tests support better generalizations when researchers consider them during the sample construction process. Only nine reports used effect sizes to plan or justify sample sizes in any fashion, and only one engaged in an a priori power procedure referencing vocabulary-centric effect sizes from previous research. Randomized assignment was observed in 56% of the reports, while no report involved randomized sampling. Approximately 15% of the samples were constructed from multiple sites, and none of these studies empirically investigated the effect of site clustering. Leveraging the synthesized findings, we conclude by offering suggestions for future L2 instructed vocabulary researchers to consider a priori effect size–driven sample planning, randomization, and multisite usage when constructing samples.
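A minimal sketch of the recommended practice (illustrative numbers only, not values from the synthesis): an a priori per-group sample size based on an effect size taken from prior vocabulary research, then inflated by a standard design effect when learners are nested in classes or sites.

d_prior <- 0.60                        # assumed effect size from earlier L2 vocabulary studies
n_flat  <- ceiling(power.t.test(delta = d_prior, power = 0.80)$n)  # per-group n, ignoring clustering

icc <- 0.10; cluster_size <- 20        # assumed intraclass correlation and class size
deff <- 1 + (cluster_size - 1) * icc   # design effect for clustered samples
ceiling(n_flat * deff)                 # per-group n after adjusting for clustering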


2015 · Vol 19 (2) · pp. 172-182
Author(s): Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, Jelte M. Wicherts

Replication is often viewed as the demarcation between science and nonscience. However, contrary to this commonly held view, we show that in the current (selective) publication system, replications may increase bias in effect size estimates. Specifically, we examine the effect of replication on bias in the estimated population effect size as a function of publication bias and the studies’ sample size or power. We show analytically that incorporating the results of published replication studies will in general not lead to less bias in the estimated population effect size. We therefore conclude that mere replication will not solve the problem of overestimation of effect sizes. We discuss the implications of our findings for interpreting the results of published and unpublished studies, and for conducting and interpreting meta-analyses. We also discuss solutions to the problem of overestimation of effect sizes, such as discarding and not publishing small studies with low power, and implementing practices that completely eliminate publication bias (e.g., study registration).
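A small simulation sketch of the argument (assumed numbers): when a replication is itself published only if significant, pooling it with the published original does not remove the overestimation.

set.seed(3)
true_d <- 0.2; n <- 30; se <- sqrt(2 / n); reps <- 50000
orig <- rnorm(reps, true_d, se)                   # original study estimates
repl <- rnorm(reps, true_d, se)                   # replication estimates
sig  <- function(d) d / se > qnorm(.975)          # selection on significance (one direction)
keep <- sig(orig) & sig(repl)                     # both studies get published
mean((orig[keep] + repl[keep]) / 2)               # pooled estimate still far above 0.2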


2019 · Vol 227 (4) · pp. 261-279
Author(s): Frank Renkewitz, Melanie Keiner

Abstract. Publication biases and questionable research practices are assumed to be two of the main causes of low replication rates. Both of these problems lead to severely inflated effect size estimates in meta-analyses. Methodologists have proposed a number of statistical tools to detect such bias in meta-analytic results. We present an evaluation of the performance of six of these tools. To assess the Type I error rate and the statistical power of these methods, we simulated a large variety of literatures that differed with regard to true effect size, heterogeneity, number of available primary studies, and sample sizes of these primary studies; furthermore, simulated studies were subjected to different degrees of publication bias. Our results show that across all simulated conditions, no method consistently outperformed the others. Additionally, all methods performed poorly when true effect sizes were heterogeneous or primary studies had a small chance of being published, irrespective of their results. This suggests that in many actual meta-analyses in psychology, bias will remain undiscovered no matter which detection method is used.
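As one concrete instance of the kind of tool evaluated above, the sketch below simulates a selectively published literature and applies an Egger-type regression test for funnel-plot asymmetry via the metafor package (parameters are assumed, and metafor's implementation is not necessarily among the six methods compared in the article).

# install.packages("metafor")
library(metafor)
set.seed(11)
k <- 200; true_d <- 0.2
n_i <- sample(20:100, k, replace = TRUE)
vi  <- 2 / n_i                                     # sampling variances
yi  <- rnorm(k, true_d, sqrt(vi))                  # observed effects
keep <- (yi / sqrt(vi) > qnorm(.975)) | (runif(k) < 0.3)  # significant, or lucky survivors
res <- rma(yi = yi[keep], vi = vi[keep])           # random-effects meta-analysis
regtest(res)                                       # Egger-type test for funnel asymmetry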


Author(s): Thomas Groß

Abstract Background. In recent years, cyber security user studies have been appraised in meta-research, mostly focusing on the completeness of their statistical inferences and the fidelity of their statistical reporting. However, estimates of the field’s distribution of statistical power and its publication bias have not received much attention. Aim. In this study, we aim to estimate the effect sizes present and their standard errors, as well as the implications for statistical power and publication bias. Method. We built upon a published systematic literature review of 146 user studies in cyber security (2006–2016). We took into account 431 statistical inferences, including t-, χ²-, r-, one-way F-, and Z-tests. In addition, we coded the corresponding total sample sizes, group sizes, and test families. Given these data, we established the observed effect sizes and evaluated the overall publication bias. We further computed the statistical power against parametrized population thresholds to obtain unbiased estimates of the power distribution. Results. We obtained a distribution of effect sizes and their conversion into comparable log odds ratios, together with their standard errors. We further gained funnel-plot estimates of the publication bias present in the sample, as well as insights into the power distribution and its consequences. Conclusions. Through the lenses of power and publication bias, we shed light on the statistical reliability of the studies in the field. The upshot of this introspection is practical recommendations on conducting and evaluating studies to advance the field.
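A minimal sketch of the effect size conversion mentioned in the Results (standard logistic-distribution formulas, not necessarily the authors' exact pipeline): Cohen's d and its variance mapped onto the log odds ratio scale so that heterogeneous test families can be compared on one metric.

d_to_log_or <- function(d, var_d) {
  list(log_or = d * pi / sqrt(3),            # d -> log odds ratio
       se     = sqrt(var_d * pi^2 / 3))      # matching standard error
}
d_to_log_or(d = 0.5, var_d = 0.04)           # log OR about 0.91, SE about 0.36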

