Some thoughts on sample size: A Bayesian-frequentist hybrid approach

2012 ◽  
Vol 9 (5) ◽  
pp. 561-569 ◽  
Author(s):  
KK Gordon Lan ◽  
Janet T Wittes

Background Traditional calculations of sample size do not formally incorporate uncertainty about the likely effect size. Use of a normal prior to express that uncertainty, as recently recommended, can lead to power that does not approach 1 as the sample size approaches infinity. Purpose To provide approaches for calculating sample size and power that formally incorporate uncertainty about effect size. The relevant formulas should ensure that power approaches 1 as the sample size increases indefinitely, and they should be easy to calculate. Methods We examine normal, truncated normal, and gamma priors for effect size computationally and demonstrate analytically an approach to approximating the power under a truncated normal prior. We also propose a simple compromise method that requires a moderately larger sample size than the one derived from the fixed-effect method. Results Use of a realistic prior distribution instead of a fixed treatment effect is likely to increase the sample size required for a Phase 3 trial. The standard fixed-effect method for moving from estimates of effect size obtained in a Phase 2 trial to the sample size of a Phase 3 trial ignores the variability inherent in the Phase 2 estimate. Truncated normal priors appear to require unrealistically large sample sizes, while gamma priors appear to place too much probability on large effect sizes and therefore produce unrealistically high power. Limitations The article deals with a few examples and a limited range of parameters. It does not deal explicitly with binary or time-to-failure data. Conclusions Use of the standard fixed-effect approach to sample size calculation often yields a sample size leading to lower power than desired. Other natural parametric priors lead either to unacceptably large sample sizes or to unrealistically high power. We recommend an approach that is a compromise between assuming a fixed effect size and assigning a normal prior to the effect size.
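To make the contrast concrete, here is a minimal sketch (not the authors' formulas) of "average power" when the effect size is given a prior rather than fixed: with a normal prior the average power plateaus below 1, whereas a truncated normal prior restricted to positive effects lets power approach 1 as the sample size grows. All prior parameters below are illustrative.

```python
import numpy as np
from scipy import stats

def fixed_effect_power(delta, n_per_arm, alpha=0.05):
    """Approximate power of a two-sample z-test for a fixed effect size delta (sd = 1)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = delta * np.sqrt(n_per_arm / 2)      # non-centrality of the test statistic
    return stats.norm.sf(z_crit - ncp)        # ignore the negligible lower tail

def average_power(prior_draws, n_per_arm, alpha=0.05):
    """Power averaged over Monte Carlo draws from a prior on the effect size."""
    return fixed_effect_power(prior_draws, n_per_arm, alpha).mean()

rng = np.random.default_rng(1)
# Normal prior centred on a Phase 2 estimate: average power plateaus below 1
normal_prior = rng.normal(loc=0.3, scale=0.15, size=100_000)
# Truncated normal prior restricted to positive effects: average power -> 1 as n grows
trunc_prior = stats.truncnorm.rvs(a=(0 - 0.3) / 0.15, b=np.inf,
                                  loc=0.3, scale=0.15, size=100_000, random_state=rng)

for n in (100, 500, 5000):
    print(n, average_power(normal_prior, n), average_power(trunc_prior, n))
```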

2021 ◽  
Vol 3 (1) ◽  
pp. 61-89
Author(s):  
Stefan Geiß

Abstract This study uses Monte Carlo simulation techniques to estimate the minimum required levels of intercoder reliability in content analysis data for testing correlational hypotheses, depending on sample size, effect size, and coder behavior under uncertainty. The ensuing procedure is analogous to power calculations for experimental designs. In the most widespread sample size/effect size settings, the rule of thumb that chance-adjusted agreement should be ≥.800 or ≥.667 is consistent with the simulation results, yielding acceptable α and β error rates. However, the simulation allows precise power calculations that can take the specifics of each study’s context into account, moving beyond one-size-fits-all recommendations. Studies with low sample sizes and/or low expected effect sizes may need coder agreement above .800 to test a hypothesis with sufficient statistical power. In studies with high sample sizes and/or high expected effect sizes, coder agreement below .667 may suffice. Such calculations can help both in evaluating and in designing studies. Particularly in pre-registered research, higher sample sizes may be used to compensate for low expected effect sizes and/or borderline coding reliability (e.g., when constructs are hard to measure). I supply equations, easy-to-use tables, and R functions to facilitate use of this framework, along with example code as an online appendix.
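A minimal Python sketch of the simulation logic described above (the article supplies R functions; this is not that code): items are coded with a given raw agreement level, and the attenuated correlation test yields an empirical power estimate. The "agreement" parameter here is simple per-item accuracy, a stand-in for chance-adjusted agreement; all numbers are illustrative.

```python
import numpy as np
from scipy import stats

def simulated_power(n_items, true_r, agreement, n_sims=2000, alpha=0.05, seed=0):
    """Estimate the power to detect a correlation when the coded variable is noisy."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=n_items)
        # latent coded variable correlated true_r with x, then dichotomized
        y_true = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n_items)
        y_coded = (y_true > 0).astype(float)
        # the coder flips the category with probability 1 - agreement
        flip = rng.random(n_items) > agreement
        y_obs = np.where(flip, 1 - y_coded, y_coded)
        _, p = stats.pearsonr(x, y_obs)
        rejections += p < alpha
    return rejections / n_sims

# Lower agreement attenuates the observed correlation and hence the power
for agreement in (1.0, 0.8, 0.667):
    print(agreement, simulated_power(n_items=200, true_r=0.2, agreement=agreement))
```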


2018 ◽  
Vol 52 (4) ◽  
pp. 341-350 ◽  
Author(s):  
Michael FW Festing

Scientists using laboratory animals are under increasing pressure to justify their sample sizes using a “power analysis”. In this paper, I review the three methods currently used to determine sample size: “tradition” or “common sense”, the “resource equation”, and the “power analysis”. I explain how, using the “KISS” approach, scientists can make a provisional choice of sample size using any method and then easily estimate the effect size likely to be detectable according to a power analysis. Should they want to be able to detect a smaller effect, they can increase the provisional sample size and recalculate the detectable effect size. This approach is simple, requires no software, and provides justification for the sample size in the terms used in a power analysis.
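A small sketch of this KISS-style calculation, assuming a two-group comparison: pick a provisional group size, then solve a standard power calculation for the smallest standardized effect size detectable at 80% power. The group sizes below are illustrative.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_group in (8, 12, 20):
    # Solve for the standardized effect size detectable with 80% power,
    # two-sided alpha = 0.05, given the provisional group size.
    d = analysis.solve_power(nobs1=n_per_group, alpha=0.05, power=0.8,
                             ratio=1.0, alternative='two-sided')
    print(f"n = {n_per_group} per group -> detectable effect size d ~ {d:.2f}")
```

If the detectable effect size is larger than the effect of interest, the provisional sample size is increased and the calculation repeated.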


2019 ◽  
Vol 9 (4) ◽  
pp. 813-850 ◽  
Author(s):  
Jay Mardia ◽  
Jiantao Jiao ◽  
Ervin Tánczos ◽  
Robert D Nowak ◽  
Tsachy Weissman

Abstract We study concentration inequalities for the Kullback–Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method-of-types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss the applications of our results in obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and the difference between concentration around the expectation and concentration around zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distributions, and demonstrate that their behaviour differs quantitatively depending on whether the sample size is small or large relative to the alphabet size.
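A brief Monte Carlo sketch (illustrative only, not the paper's analysis) of the quantity being bounded: the KL divergence between the empirical distribution of $n$ i.i.d. draws and the true distribution on an alphabet of size $k$, and how its mean and variance shift with $n$ and $k$.

```python
import numpy as np

def empirical_kl(n, k, n_sims=5000, seed=0):
    """Simulate KL(empirical || true) for n draws from a uniform distribution on k symbols."""
    rng = np.random.default_rng(seed)
    p = np.full(k, 1.0 / k)                 # true distribution (uniform, for illustration)
    kls = np.empty(n_sims)
    for i in range(n_sims):
        counts = rng.multinomial(n, p)
        q = counts / n                      # empirical distribution
        mask = q > 0                        # 0 * log(0 / p) = 0 by convention
        kls[i] = np.sum(q[mask] * np.log(q[mask] / p[mask]))
    return kls

for n, k in [(50, 10), (500, 10), (500, 100)]:
    kls = empirical_kl(n, k)
    print(f"n={n}, k={k}: mean KL ~ {kls.mean():.4f}, var ~ {kls.var():.6f}")
```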


2014 ◽  
Vol 45 (3) ◽  
pp. 209-215 ◽  
Author(s):  
David J. Johnson ◽  
Felix Cheung ◽  
M. Brent Donnellan

Schnall, Benton, and Harvey (2008) hypothesized that physical cleanliness reduces the severity of moral judgments. In support of this idea, they found that individuals make less severe judgments when they are primed with the concept of cleanliness (Exp. 1) and when they wash their hands after experiencing disgust (Exp. 2). We conducted direct replications of both studies using materials supplied by the original authors. We did not find evidence that physical cleanliness reduced the severity of moral judgments, using sample sizes that provided over .99 power to detect the original effect sizes. Our estimates of the overall effect size were much smaller than the original estimates for Experiment 1 (original d = −0.60, 95% CI [−1.23, 0.04], N = 40; replication d = −0.01, 95% CI [−0.28, 0.26], N = 208) and Experiment 2 (original d = −0.85, 95% CI [−1.47, −0.22], N = 43; replication d = 0.01, 95% CI [−0.34, 0.36], N = 126). These findings suggest that the population effect sizes are probably substantially smaller than the original estimates. Researchers investigating the connections between cleanliness and morality should therefore use large sample sizes to have the necessary power to detect subtle effects.
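A quick check of the power statement above, assuming a standard two-sample t-test with equal group sizes (an approximation of the replication designs):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# (original |d|, total replication N) for Experiments 1 and 2
for d, n_total in [(0.60, 208), (0.85, 126)]:
    power = analysis.power(effect_size=d, nobs1=n_total / 2, alpha=0.05,
                           ratio=1.0, alternative='two-sided')
    print(f"d = {d}, N = {n_total}: power ~ {power:.3f}")
```

Both calculations give power above .99 to detect effects of the originally reported magnitude.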


2009 ◽  
Vol 31 (4) ◽  
pp. 500-506 ◽  
Author(s):  
Robert Slavin ◽  
Dewi Smith

Research in fields other than education has found that studies with small sample sizes tend to have larger effect sizes than those with large samples. This article examines the relationship between sample size and effect size in education. It analyzes data from 185 studies of elementary and secondary mathematics programs that met the standards of the Best Evidence Encyclopedia. As predicted, there was a significant negative correlation between sample size and effect size. The differences in effect sizes between small and large experiments were much greater than those between randomized and matched experiments. Explanations for the effects of sample size on effect size are discussed.


2013 ◽  
Vol 112 (3) ◽  
pp. 835-844 ◽  
Author(s):  
M. T. Bradley ◽  
A. Brand

Tables of alpha values as a function of sample size, effect size, and desired power were presented. The tables indicated the expected alphas for small, medium, and large effect sizes given a variety of sample sizes. It was evident that sample sizes for most psychological studies are adequate for large effect sizes, defined as d = .8. The typical alpha level of .05 and desired power of 90% can be achieved with 70 participants in two groups. It is doubtful whether these ideal levels of alpha and power have generally been achieved for medium effect sizes in actual research, since 170 participants would be required. Small effect sizes have rarely been tested with an adequate number of participants or adequate power. Implications were discussed.
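The quoted sample sizes can be checked with a standard two-sample t-test power calculation; a minimal sketch, assuming equal group sizes and a two-sided test at α = .05 with 90% power:

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.9,
                                       ratio=1.0, alternative='two-sided')
    print(f"{label} effect (d = {d}): total N ~ {2 * ceil(n_per_group)}")
```

This reproduces the roughly 70 participants for a large effect and roughly 170 for a medium effect cited above, and over 1,000 for a small effect.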


2021 ◽  
Vol 2 (4) ◽  
Author(s):  
R Mukherjee ◽  
N Muehlemann ◽  
A Bhingare ◽  
G W Stone ◽  
C Mehta

Abstract Background Cardiovascular trials increasingly require large sample sizes and long follow-up periods. Several approaches have been developed to optimize sample size, such as adaptive group sequential trials, sample size re-estimation based on the promising zone, and the win ratio. Traditionally, the log-rank test or the Cox proportional hazards model is used to test for treatment effects, based on a constant hazard rate and proportional hazards alternatives, which, however, may not always hold. Large sample sizes and/or long follow-up periods are especially challenging for trials evaluating the efficacy of acute care interventions. Purpose We propose an adaptive design wherein, using interim data, Bayesian computation of predictive power guides the increase in sample size and/or the minimum follow-up duration. These computations do not depend on the constant hazard rate and proportional hazards assumptions, thus yielding more robust interim decision making for the future course of the trial. Methods PROTECT IV is designed to evaluate mechanical circulatory support with the Impella CP device vs. standard of care during high-risk PCI. The primary endpoint is a composite of all-cause death, stroke, MI, or hospitalization for cardiovascular causes, with an initial minimum follow-up of 12 months and initial enrolment of 1252 patients, with expected recruitment over 24 months. The study will employ an adaptive increase in sample size and/or minimum follow-up at the interim analysis, when ∼80% of patients have been enrolled. The adaptations use extensive simulations to choose a new sample size of up to 2500 and a new minimum follow-up time of up to 36 months that provide a Bayesian predictive power of 85%. Bayesian calculations are based on patient-level information rather than summary statistics, thereby enabling more reliable interim decisions. Constant or proportional hazard assumptions are not required for this approach because two separate piecewise constant hazard models with Gamma priors are fitted to the interim data. Bayesian predictive power is then calculated using Monte Carlo methodology. Via extensive simulations, we have examined the utility of the proposed design for situations with time-varying hazards and non-proportional hazard ratios, such as a delayed treatment effect (Figure) and crossing survival curves. The heat map of Bayesian predictive power obtained when the interim Kaplan-Meier curves reflected a delayed response shows that, for this scenario, an optimal combination of increased sample size and increased follow-up time would be needed to attain 85% predictive power. Conclusion The proposed adaptive design, with sample size and minimum follow-up adaptation based on Bayesian predictive power at interim looks, allows de-risking the trial of uncertainties regarding effect size in terms of control arm outcome rate, hazard ratio, and recruitment rate. Funding Acknowledgement Type of funding sources: Private company. Main funding source(s): Abiomed, Inc.
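The following is a heavily simplified sketch of Bayesian predictive power computed by Monte Carlo. It is not the PROTECT IV machinery: it uses a single constant-hazard piece per arm with a Gamma prior (the trial fits piecewise constant hazard models with Gamma priors), the final test is a simple Wald test on the log hazard-rate ratio rather than a log-rank test, and all interim numbers are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def predictive_power(interim, n_future_per_arm, extra_followup,
                     a0=0.5, b0=1.0, n_sims=5000, alpha=0.05):
    """interim = {arm: (events, total_exposure_years)} for arms 'ctl' and 'trt'."""
    z_crit = norm.ppf(1 - alpha / 2)
    successes = 0
    for _ in range(n_sims):
        events, exposure = {}, {}
        for arm, (d_obs, t_obs) in interim.items():
            # Gamma posterior for a constant hazard given exponential survival data
            lam = rng.gamma(a0 + d_obs, 1.0 / (b0 + t_obs))
            # simulate not-yet-enrolled patients, censored at the minimum follow-up
            times = rng.exponential(1.0 / lam, size=n_future_per_arm)
            events[arm] = d_obs + int(np.sum(times < extra_followup))
            exposure[arm] = t_obs + float(np.sum(np.minimum(times, extra_followup)))
        # final analysis: Wald test on the log hazard-rate ratio
        log_hr = np.log((events['trt'] / exposure['trt']) /
                        (events['ctl'] / exposure['ctl']))
        se = np.sqrt(1.0 / events['trt'] + 1.0 / events['ctl'])
        successes += (log_hr / se) < -z_crit
    return successes / n_sims

# Invented interim data: (events, patient-years of exposure) per arm.
interim_data = {'ctl': (90, 450.0), 'trt': (70, 460.0)}
for follow_up_years in (1.0, 2.0, 3.0):
    pp = predictive_power(interim_data, n_future_per_arm=250,
                          extra_followup=follow_up_years)
    print(f"minimum follow-up {follow_up_years:.0f}y: predictive power ~ {pp:.2f}")
```

Longer minimum follow-up yields more events among the future patients and therefore higher predictive power, mirroring the trade-off between sample size and follow-up described above.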


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
O Cotter ◽  
B A Davison ◽  
G Koch ◽  
S Senger ◽  
M Metra ◽  
...  

Abstract Aims All phase 3 studies in patients with acute heart failure (AHF) and HF with preserved ejection fraction (HFpEF) have failed in the last decades. We explore the likelihood that the negative results are due to chance and/or to study size and dilution of statistical power. Methods and results First, using simulations, we examined the probability that a positive finding in phase 2 would result in studying truly effective drugs in phase 3. We simulated phase 2 studies under six scenarios in which the range of true relative risk (RR) for an outcome of interest varied from 0.5 (major benefit) to 1.15 (some harm). The proportion of simulated studies in which the true RR was <0.8 (we assumed that a 20% or greater risk reduction reflects an effective drug) ranged from 6% to 42% across the six scenarios studied. To further simulate “real-life” clinical research, we simulated a continuous surrogate outcome that was linearly related to the true RR in each simulation of each scenario. Regardless of the criteria considered for a positive phase 2 trial, the results suggest that even in our worst-case scenario, where overall only 6% of drugs taken into phase 2 are effective, roughly 20% of phase 3 studies, if appropriately powered, should have yielded positive results. Given this, we then explored study size in AHF research as a potential explanation for the high failure rate in these studies. Comparison of published phase 2 and 3 clinical trials with registries in AHF suggests that the populations in both large and small trials differ from “real life”. Meta-regression models suggest that both control event rates and, in the serelaxin program as an example, treatment effects decline with increasing study size, greatly reducing power (figure). This effect dilution might be explained by an increasing proportion of enrolled patients who cannot benefit from the study drug. Figure 1. Power at a two-sided 0.05 significance level to detect an effect size of a hazard ratio of 0.65 (left) or 0.8 (right) with a placebo event rate of 10% (top) and 20% (bottom) at N=100, at various treatment effect dilutions with increasing sample size. Conclusion These data suggest that it is unlikely that the very high rate of negative AHF phase 3 trials can be explained by chance alone. Potentially, our tendency to increase sample size does not necessarily increase statistical power, because more heterogeneous populations lead to reduced event rates and treatment effects.
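A minimal simulation sketch in the spirit of the phase 2 exercise described above (the study's exact scenarios, surrogate outcome, and decision rules are not reproduced): drugs enter phase 2 with a true relative risk drawn from a range, a small binomial trial is run, and we ask how often a "positive" phase 2 result corresponds to a truly effective drug (true RR < 0.8). All parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n_drugs = 20_000
n_per_arm = 150            # illustrative phase 2 size
control_rate = 0.30        # illustrative control event rate

true_rr = rng.uniform(0.5, 1.15, size=n_drugs)            # one illustrative scenario
events_ctl = rng.binomial(n_per_arm, control_rate, size=n_drugs)
events_trt = rng.binomial(n_per_arm, control_rate * true_rr, size=n_drugs)
observed_rr = (events_trt + 0.5) / (events_ctl + 0.5)      # crude continuity correction

positive = observed_rr < 0.8       # a naive "go to phase 3" criterion
effective = true_rr < 0.8          # definition of a truly effective drug
print("P(effective)               =", effective.mean().round(3))
print("P(effective | positive)    =", effective[positive].mean().round(3))
print("P(positive | not effective)=", positive[~effective].mean().round(3))
```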


2012 ◽  
Vol 22 (1) ◽  
pp. 63-69 ◽  
Author(s):  
Alexia Iasonos ◽  
Paul Sabbatini ◽  
David R. Spriggs ◽  
Carol A. Aghajanian ◽  
Roisin E. O’Cearbhaill ◽  
...  

Objective Estimates of progression-free survival (PFS) from single-arm phase 2 consolidation/maintenance trials for recurrent ovarian cancer are usually interpreted in the context of historical controls. We illustrate how the duration of second-line therapy (SLT), the time on the investigational therapy (IT), and the patient enrollment plan can affect efficacy measures from maintenance trials and might result in underpowered studies. Methods Efficacy data from three published single-arm consolidation therapies in second remission in ovarian cancer were used for illustration. The studies were designed to show an increase in estimated median PFS from 9 to 13.5 months. We partitioned PFS as the sum of the duration of SLT, the treatment-free interval, and the duration of IT. We calculated the statistical power when IT is given concurrently with SLT or after SLT, by varying the start of IT. We compared the sample sizes required when PFS includes the time on SLT versus when PFS starts after SLT, at the initiation of IT. Results Required sample sizes varied with the duration of SLT. If IT starts with the initiation of SLT, only 34 patients are needed to provide 80% power to detect a 33% hazard reduction. In contrast, 104 patients are required for a single-arm study to have 80% power if IT begins 7.5 months after SLT initiation. Conclusions Designs of nonrandomized consolidation trials that aim to prolong PFS must consider the effect of the duration of SLT on the end point definition and on the required sample size. If IT is given concurrently with SLT and after SLT, then the SLT duration must be restricted by protocol eligibility so that a comparison with historical data from other single-arm phase 2 studies is unbiased. If IT is given after SLT, the duration of SLT should be taken into account at the design stage because it will affect statistical power and sample size.
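A back-of-envelope sketch of the kind of calculation involved: the one-sample log-rank approximation for the number of progression events needed to detect a given hazard reduction under an exponential PFS model. This is not the authors' design and does not reproduce the 34 vs. 104 figures, which additionally depend on when IT starts relative to SLT, accrual, and follow-up.

```python
from math import log, ceil
from scipy.stats import norm

def required_events(hazard_ratio, alpha=0.05, power=0.8):
    """Approximate events needed for a one-sample survival test at the given power."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil((z / log(hazard_ratio)) ** 2)

# The studies targeted an increase in median PFS from 9 to 13.5 months,
# i.e. a hazard ratio of 9 / 13.5 ~ 0.67 (a 33% hazard reduction).
print(required_events(hazard_ratio=9.0 / 13.5))
```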

