Comparison of a Two-Stage and Three-Stage Interim-Analysis Procedure

1992 ◽  
Vol 71 (1) ◽  
pp. 3-14 ◽  
Author(s):  
John E. Overall ◽  
Robert S. Atlas

A statistical model for combining p values from multiple tests of significance is used to define rejection and acceptance regions for two-stage and three-stage sampling plans. Type I error rates, power, frequencies of early termination decisions, and expected sample sizes are compared. Both the two-stage and three-stage procedures provide appropriate protection against Type I errors. The two-stage sampling plan with its single interim analysis entails minimal loss in power and provides substantial reduction in expected sample size as compared with a conventional single end-of-study test of significance for which power is in the adequate range. The three-stage sampling plan with its two interim analyses introduces somewhat greater reduction in power, but it compensates with greater reduction in expected sample size. Either interim-analysis strategy is more efficient than a single end-of-study analysis in terms of power per unit of sample size.
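
The combination-of-p-values machinery behind such plans is easy to prototype. Below is a minimal Monte Carlo sketch of a generic two-stage plan: stage 1 can stop early for efficacy or futility on its own p value, and otherwise the two stage-wise p values are pooled with Fisher's product statistic. The boundaries (a1, b1, c2), effect size, and stage sizes are illustrative choices, not Overall and Atlas's actual design; the simulation simply reports the attained Type I error rate, power, and expected sample size for whatever boundaries are plugged in.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative two-stage boundaries (not the paper's values): stop early
# for efficacy if p1 <= a1, for futility if p1 >= b1; otherwise reject at
# stage 2 when Fisher's combination -2*(ln p1 + ln p2) exceeds c2.
a1, b1 = 0.01, 0.50
c2 = stats.chi2.ppf(0.97, df=4)

def simulate(delta, n_per_stage=50, reps=20000):
    rejections, total_n = 0, 0
    for _ in range(reps):
        x1 = rng.normal(delta, 1, n_per_stage)
        p1 = stats.ttest_1samp(x1, 0, alternative='greater').pvalue
        if p1 <= a1:                      # early rejection
            rejections += 1
            total_n += n_per_stage
        elif p1 >= b1:                    # early acceptance (futility)
            total_n += n_per_stage
        else:                             # continue to stage 2
            x2 = rng.normal(delta, 1, n_per_stage)
            p2 = stats.ttest_1samp(x2, 0, alternative='greater').pvalue
            rejections += (-2 * (np.log(p1) + np.log(p2)) >= c2)
            total_n += 2 * n_per_stage
    return rejections / reps, total_n / reps

alpha, ess0 = simulate(delta=0.0)
power, ess1 = simulate(delta=0.35)
print(f"type I error {alpha:.3f}, E[N | H0] {ess0:.0f}")
print(f"power        {power:.3f}, E[N | H1] {ess1:.0f}")
```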

2015 ◽  
Vol 26 (4) ◽  
pp. 1671-1683 ◽  
Author(s):  
Cornelia U Kunz ◽  
James MS Wason ◽  
Meinhard Kieser

Phase II oncology trials are conducted to evaluate whether the tumour activity of a new treatment is promising enough to warrant further investigation. The most commonly used approach in this context is a two-stage single-arm design with a binary endpoint. As with all designs involving interim analyses, its efficiency depends strongly on the relationship between the recruitment rate and the follow-up time required to measure the patients’ outcomes. Usually, recruitment is paused once the first-stage sample size has been reached, until the outcomes of all first-stage patients are available. This can considerably lengthen the trial and thereby delay the drug development process. We propose a design where an intermediate endpoint is used in the interim analysis to decide whether or not the study is continued with a second stage. Optimal and minimax versions of this design are derived. The characteristics of the proposed design in terms of type I error rate, power, maximum and expected sample size, as well as trial duration, are investigated. Guidance is given on how to select the most appropriate design. Application is illustrated by a phase II oncology trial in patients with advanced angiosarcoma, which motivated this research.
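
For reference, the standard design that such proposals modify is Simon's two-stage single-arm design, whose operating characteristics follow exactly from binomial probabilities. The sketch below uses a commonly cited Simon optimal design for p0 = 0.10 versus p1 = 0.25; treat the specific numbers as illustrative rather than as the angiosarcoma trial's actual design.

```python
from scipy.stats import binom

def two_stage_oc(n1, r1, n, r, p):
    """Operating characteristics of a Simon-type two-stage design: stop
    for futility after n1 patients if responses <= r1; otherwise enrol
    n - n1 more and call the treatment promising if total responses > r.
    Returns (P(declare promising), P(early termination), E[N])."""
    pet = binom.cdf(r1, n1, p)
    reject = sum(binom.pmf(k, n1, p) * binom.sf(r - k, n - n1, p)
                 for k in range(r1 + 1, n1 + 1))
    return reject, pet, n1 + (1 - pet) * (n - n1)

# A commonly cited Simon optimal design for p0 = 0.10 vs p1 = 0.25
# (alpha = 0.05, beta = 0.20): r1/n1 = 2/18, r/n = 7/43.
for p, label in [(0.10, "type I error"), (0.25, "power       ")]:
    rej, pet, en = two_stage_oc(18, 2, 43, 7, p)
    print(f"{label}: {rej:.3f}   PET: {pet:.3f}   E[N]: {en:.1f}")
```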


Biostatistics ◽  
2019 ◽  
Author(s):  
Jon Arni Steingrimsson ◽  
Joshua Betz ◽  
Tianchen Qian ◽  
Michael Rosenblum

Summary: We consider the problem of designing a confirmatory randomized trial for comparing two treatments versus a common control in two disjoint subpopulations. The subpopulations could be defined in terms of a biomarker or disease severity measured at baseline. The goal is to determine which treatments benefit which subpopulations. We develop a new class of adaptive enrichment designs tailored to solving this problem. Adaptive enrichment designs involve a preplanned rule for modifying enrollment based on accruing data in an ongoing trial. At the interim analysis after each stage, for each subpopulation, the preplanned rule may decide to stop enrollment or to stop randomizing participants to one or more study arms. The motivation for this adaptive feature is that interim data may indicate that a subpopulation, such as those with lower disease severity at baseline, is unlikely to benefit from a particular treatment while uncertainty remains for the other treatment and/or subpopulation. We optimize these adaptive designs to have the minimum expected sample size under power and Type I error constraints. We compare the performance of the optimized adaptive design versus an optimized nonadaptive (single stage) design. Our approach is demonstrated in simulation studies that mimic features of a completed trial of a medical device for treating heart failure. The optimized adaptive design has 25% smaller expected sample size compared to the optimized nonadaptive design; however, the cost is that the optimized adaptive design has 8% greater maximum sample size. Open-source software that implements the trial design optimization is provided, allowing users to investigate the tradeoffs in using the proposed adaptive versus standard designs.
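
The flavor of such a design can be conveyed with a stripped-down simulation: one treatment versus control (rather than two), two subpopulations, a single interim futility rule per subpopulation, and fixed rather than optimized boundaries. All thresholds and effect sizes below are assumptions for illustration, not the paper's optimized design.

```python
import numpy as np

rng = np.random.default_rng(7)

def enrichment_trial(delta, n1=60, n2=60, z_futility=0.0, z_final=2.24):
    """One simulated two-stage adaptive enrichment trial: treatment vs
    control in two subpopulations.  A subpopulation whose interim
    z-statistic falls below z_futility stops enrolling; each surviving
    subpopulation is tested at the end (z_final is an illustrative
    multiplicity-adjusted critical value, not an optimized boundary)."""
    total_n, rejected = 0, [False, False]
    for s in (0, 1):
        t1, c1 = rng.normal(delta[s], 1, n1), rng.normal(0, 1, n1)
        total_n += 2 * n1
        z1 = (t1.mean() - c1.mean()) / np.sqrt(2 / n1)
        if z1 < z_futility:
            continue                    # enrollment stops for this subpopulation
        t2, c2 = rng.normal(delta[s], 1, n2), rng.normal(0, 1, n2)
        total_n += 2 * n2
        tm, cm = np.concatenate([t1, t2]), np.concatenate([c1, c2])
        z = (tm.mean() - cm.mean()) / np.sqrt(2 / (n1 + n2))
        rejected[s] = z >= z_final
    return rejected, total_n

reps, hits, ns = 10000, np.zeros(2), 0
for _ in range(reps):
    rej, n = enrichment_trial(delta=(0.4, 0.0))   # only subpopulation 1 benefits
    hits += rej
    ns += n
print("rejection rate per subpopulation:", hits / reps)
print("expected sample size:", ns / reps)
```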


2018 ◽  
Vol 28 (7) ◽  
pp. 2179-2195 ◽  
Author(s):  
Chieh Chiang ◽  
Chin-Fu Hsiao

Multiregional clinical trials have been accepted in recent years as a useful means of accelerating the development of new drugs and abridging their approval time. The statistical properties of multiregional clinical trials are being widely discussed. In practice, the variance of a continuous response may differ from region to region, so the assessment of the efficacy response becomes a Behrens–Fisher problem: no exact test or interval estimator exists for the mean difference under unequal variances. As a solution, this study applies interval estimations of the efficacy response based on Howe’s, Cochran–Cox’s, and Satterthwaite’s approximations, which have been shown to have well-controlled type I error rates. However, traditional sample size determination cannot be applied to these interval estimators, so a sample size determination that achieves a desired power based on them is presented. Moreover, the consistency criteria suggested by the Japanese Ministry of Health, Labour and Welfare guidance are also applied to decide whether the overall results obtained from the multiregional clinical trial via the proposed interval estimation hold across regions. A real example is used to illustrate the proposed method. The results of simulation studies indicate that the proposed method can correctly determine the required sample size and evaluate the assurance probability of the consistency criteria.
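
As one concrete piece of this machinery, a Satterthwaite-style interval for a mean difference under unequal variances, plus a power-based search for the per-group sample size, can be sketched as follows. The normal approximation to the power and all numerical inputs are illustrative assumptions, not the authors' exact procedure (which also covers Howe's and Cochran–Cox's approximations and regional consistency).

```python
import numpy as np
from scipy import stats

def satterthwaite_ci(m1, s1, n1, m2, s2, n2, alpha=0.05):
    """Two-sided CI for a mean difference under unequal variances,
    with Satterthwaite's degrees-of-freedom approximation."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(v1 + v2)
    return (m1 - m2) - half, (m1 - m2) + half

def approx_power(delta, s1, s2, n, alpha=0.05):
    """Normal approximation to P(lower CI bound > 0) when the true
    mean difference is delta and both groups have size n."""
    v = s1**2 / n + s2**2 / n
    df = v**2 / ((s1**2 / n)**2 / (n - 1) + (s2**2 / n)**2 / (n - 1))
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return stats.norm.sf(tcrit - delta / np.sqrt(v))

# Illustrative inputs: detect delta = 2 with sds 4 and 6 at 80% power
delta, s1, s2, target = 2.0, 4.0, 6.0, 0.80
n = next(n for n in range(5, 1000) if approx_power(delta, s1, s2, n) >= target)
print("required n per group:", n)
print("CI at that n:", satterthwaite_ci(delta, s1, n, 0.0, s2, n))
```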


2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Keisuke Ejima ◽  
Andrew Brown ◽  
Daniel Smith ◽  
Ufuk Beyaztas ◽  
David Allison

Objectives: Awareness of rigor, reproducibility and transparency (RRT) has expanded over the last decade. Although RRT can be improved in various ways, we focused on the type I error rates and power of statistical analyses commonly used to test mean differences between two groups, with small (n ≤ 5) to moderate sample sizes.

Methods: We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether the type I error rate could be affected by the choice of statistical test, we adjusted the empirical distributions of weights to enforce the null hypothesis (i.e., no mean difference) in two ways: Case 1) centering both weight distributions on the same mean weight; Case 2) combining data from the control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a ‘plasmode’ dataset. We performed five common tests (Student’s t-test, Welch’s t-test, Wilcoxon test, permutation test, and bootstrap test) on the plasmodes and computed type I error rates. Power was assessed using plasmodes in which the distribution of the control group was shifted by adding a constant value, as in Case 1, but so as to realize nominal effect sizes.

Results: In Case 1, type I error rates were inflated well above the nominal significance level for Student’s t-test, Welch’s t-test, and the permutation test, especially when the sample size was small; in Case 2, inflation was observed only for the permutation test. Deflation was noted for the bootstrap test with small samples. Increasing the sample size mitigated both inflation and deflation, except for the Wilcoxon test in Case 1, because heterogeneity of the weight distributions between groups violated its assumptions for the purpose of testing mean differences. For power, departures from the reference value were observed with small samples. Compared with the other tests, the bootstrap test was underpowered with small samples as a tradeoff for maintaining type I error rates.

Conclusions: With small samples (n ≤ 5), the bootstrap test avoided type I error rate inflation, but often at the cost of lower power. To avoid type I error rate inflation with the other tests, the sample size should be increased. The Wilcoxon test should be avoided here because of the heterogeneity of weight distributions between mutant and control mice.

Funding Sources: This study was supported in part by NIH and a Japan Society for the Promotion of Science (JSPS) KAKENHI grant.
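
A plasmode-style type I error simulation of this kind is straightforward to reproduce in outline. The sketch below substitutes recentred gamma draws for the real mouse weights (mimicking Case 1's recentring) and compares four of the five tests; the bootstrap test is omitted for brevity, and all distributional choices are stand-ins for the empirical data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-ins for the empirical weight distributions: two differently
# shaped gamma populations recentred to a common mean (the study's Case 1
# used the real mouse weights).
control = rng.gamma(shape=9.0, scale=3.0, size=500); control -= control.mean()
mutant  = rng.gamma(shape=2.0, scale=9.0, size=500); mutant  -= mutant.mean()

def one_rep(n=5):
    a, b = rng.choice(control, n), rng.choice(mutant, n)
    return (stats.ttest_ind(a, b).pvalue,                      # Student
            stats.ttest_ind(a, b, equal_var=False).pvalue,     # Welch
            stats.mannwhitneyu(a, b).pvalue,                   # Wilcoxon
            stats.permutation_test((a, b),
                                   lambda x, y: np.mean(x) - np.mean(y),
                                   n_resamples=999).pvalue)    # permutation

reps = 1000
ps = np.array([one_rep() for _ in range(reps)])
for name, col in zip(["Student", "Welch", "Wilcoxon", "permutation"], ps.T):
    print(f"{name:12s} type I error at n=5: {np.mean(col < 0.05):.3f}")
```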


2015 ◽  
Vol 9 (13) ◽  
pp. 1
Author(s):  
Tobi Kingsley Ochuko ◽  
Suhaida Abdullah ◽  
Zakiyah Binti Zain ◽  
Sharipah Soaad Syed Yahaya

This study centres on the comparison of independent-group tests in terms of power, using a parametric method, the Alexander-Govern (AG) test. The AG test uses the mean as its measure of central tendency. It is a good alternative to the Welch test, the James test, and ANOVA because it produces high power and good control of Type I error rates for normal data under variance heterogeneity; however, it is not robust for non-normal data. When the trimmed mean was applied as the test's central tendency measure under non-normality, the test was robust only in the two-group condition; as the number of groups increased beyond two, it was no longer robust. Consequently, a highly robust estimator known as the MOM estimator was applied as the test's central tendency measure. That version is unaffected by the number of groups but could not control Type I error rates under skewed, heavy-tailed distributions. In this study, the Winsorized MOM estimator was applied as the central tendency measure in the AG test. A total of 5,000 simulated data sets were generated and analysed using the SAS package. The analysis shows that, pairing the unbalanced sample sizes (15:15:20:30) with equal variances (1:1:1:1) and with unequal variances (1:1:1:36) at effect size index f = 0.8, only the AGWMOM test produced high power values, 0.9562 and 0.8336 respectively, compared with the AG test, the AGMOM test, and ANOVA, and the test is considered sufficient.
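
The classical Alexander-Govern statistic itself is compact: weight each group by its inverse squared standard error, form a t statistic per group against the weighted grand mean, normalize each t with Hill's transformation, and refer the sum of squared normalized values to a chi-square distribution. A sketch follows, using the ordinary mean (SciPy also ships this classical version as scipy.stats.alexandergovern); substituting a Winsorized MOM estimate and its standard error would follow the same outline. The sample sizes and variances echo the abstract's (15:15:20:30) and (1:1:1:36) conditions, but the data are synthetic.

```python
import numpy as np
from scipy import stats

def alexander_govern(*groups):
    """Classical Alexander-Govern test (means as the central tendency
    measure).  A Winsorized MOM version would replace the means and
    standard errors below with the robust estimates."""
    means = np.array([np.mean(g) for g in groups])
    ses = np.array([np.std(g, ddof=1) / np.sqrt(len(g)) for g in groups])
    dfs = np.array([len(g) - 1.0 for g in groups])
    w = (1 / ses**2) / np.sum(1 / ses**2)
    t = (means - np.sum(w * means)) / ses
    # Hill's normalizing transformation of each t statistic
    a = dfs - 0.5
    b = 48 * a**2
    c = np.sqrt(a * np.log1p(t**2 / dfs))
    z = (c + (c**3 + 3*c) / b
         - (4*c**7 + 33*c**5 + 240*c**3 + 855*c) / (10*b**2 + 8*b*c**4 + 1000*b))
    A = np.sum(z**2)
    return A, stats.chi2.sf(A, len(groups) - 1)

# Synthetic data echoing the (15:15:20:30) / (1:1:1:36) condition
rng = np.random.default_rng(3)
groups = [rng.normal(0, s, n) for s, n in zip((1, 1, 1, 6), (15, 15, 20, 30))]
stat, p = alexander_govern(*groups)
print(f"A = {stat:.3f}, p = {p:.4f}")
```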


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Vahid Ebrahimi ◽  
Zahra Bagheri ◽  
Zahra Shayan ◽  
Peyman Jafari

Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model highly depends on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, which is often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased with small samples. This study is aimed at proposing a new application of the elastic net regularized OLR model, as a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the powers and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions, including moderate and severe magnitudes of DIF (DIF = 0.4 and 0.8), sample size (N), sample size ratio (R), scale length (I), and weighting parameter (w). The simulation results revealed that for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) by approximately 35% and 21% for N = 100 and 150, respectively. Moreover, for I = 10 and severe uniform DIF (DIF = 0.8), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06, as compared with the nonregularized OLR model, increased by approximately 29.3% and 11.2% for N = 100 and 150, respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. In general, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model, especially in extremely small sample size groups. Furthermore, the present research provides a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.
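
The core idea, shrinking the DIF coefficient with an elastic net penalty, can be illustrated with a deliberately simplified stand-in: a binary item fit by penalized binary logistic regression rather than the paper's ordinal proportional-odds model. Everything below is an assumption for illustration: the data-generating values, the use of the latent trait itself as the matching variable (in practice a rest score), and the sklearn penalty settings, where l1_ratio loosely plays the role of the paper's weighting parameter w.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)

# Simulated uniform-DIF setting: the item response depends on the latent
# trait plus a group effect (the DIF).
n, dif = 100, 0.8
group = rng.integers(0, 2, n)            # 0 = reference, 1 = focal
theta = rng.normal(0, 1, n)
y = rng.random(n) < 1 / (1 + np.exp(-(1.2 * theta + dif * group - 0.3)))
X = np.column_stack([theta, group])

# Nonregularized (ML) vs elastic net fits; requires sklearn >= 1.2.
plain = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.1, C=1.0, max_iter=5000).fit(X, y)
print("DIF coefficient, ML fit:         ", round(plain.coef_[0][1], 3))
print("DIF coefficient, elastic net fit:", round(enet.coef_[0][1], 3))
```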


1988 ◽  
Vol 13 (3) ◽  
pp. 281-290 ◽  
Author(s):  
James Algina ◽  
Kezhen L. Tang

For Yao’s and James’ tests, Type I error rates were estimated for various combinations of the number of variables (p), sample-size ratio (n1:n2), sample-size-to-variables ratio, and degree of heteroscedasticity. These tests are alternatives to Hotelling’s T² and are intended for use when the variance-covariance matrices are not equal in a study using two independent samples. The performance of Yao’s test was superior to that of James’. Yao’s test had appropriate Type I error rates when p ≥ 10, (n1 + n2)/p ≥ 10, and 1:2 ≤ n1:n2 ≤ 2:1. When (n1 + n2)/p = 20, Yao’s test was robust when n1:n2 was 5:1, 3:1, and 4:1 and p was 2, 6, and 10, respectively.
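
One common statement of Yao's procedure is: with d the difference in mean vectors and S = S1/n1 + S2/n2, the statistic T² = d'S⁻¹d is referred to an F distribution whose degrees of freedom f are estimated from the data. The sketch below implements that statement; the df formula is quoted from memory, so verify it against the original source before serious use.

```python
import numpy as np
from scipy import stats

def yao_test(x, y):
    """Yao's approximate-df test for two multivariate means with unequal
    covariance matrices (one common statement of the procedure)."""
    n1, p = x.shape
    n2 = y.shape[0]
    d = x.mean(0) - y.mean(0)
    V1 = np.cov(x, rowvar=False) / n1
    V2 = np.cov(y, rowvar=False) / n2
    Sinv = np.linalg.inv(V1 + V2)
    T2 = d @ Sinv @ d
    inv_f = sum(((d @ Sinv @ Vi @ Sinv @ d) / T2) ** 2 / (ni - 1)
                for Vi, ni in ((V1, n1), (V2, n2)))
    f = 1 / inv_f
    F = T2 * (f - p + 1) / (f * p)
    return T2, stats.f.sf(F, p, f - p + 1)

rng = np.random.default_rng(5)
x = rng.multivariate_normal(np.zeros(3), np.eye(3), size=40)
y = rng.multivariate_normal(np.zeros(3), 4 * np.eye(3), size=20)
T2, pval = yao_test(x, y)
print(f"T2 = {T2:.3f}, p = {pval:.4f}")
```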


1996 ◽  
Vol 21 (2) ◽  
pp. 169-178 ◽  
Author(s):  
William T. Coombs ◽  
James Algina

Type I error rates for the Johansen test were estimated using simulated data for a variety of conditions. The design of the experiment was a 2 × 2 × 2 × 3 × 9 × 3 factorial. The factors were (a) type of distribution, (b) number of dependent variables, (c) number of groups, (d) ratio of the smallest sample size to the number of dependent variables, (e) sample size ratios, and (f) degree of heteroscedasticity. The results indicate that Type I error rates for the Johansen test depend heavily on the number of groups and the ratio of the smallest sample size to the number of dependent variables. Type I error rates depend to a lesser extent on the distribution types used in the study. Based on the results, sample size guidelines are presented.
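
Johansen's test generalizes Welch's approach to several multivariate-normal groups with unequal covariance matrices. The sketch below follows one common statement of the procedure (weights W_i = (S_i/n_i)⁻¹, a weighted grand mean, and an F approximation whose degrees of freedom depend on a correction term A); as with the Yao sketch above, the formulas are quoted from memory and should be checked against the original source.

```python
import numpy as np
from scipy import stats

def johansen_test(groups):
    """Johansen's generalization of Welch's test to g multivariate-normal
    groups with unequal covariance matrices (one common statement)."""
    p = groups[0].shape[1]
    g = len(groups)
    W = [len(x) * np.linalg.inv(np.cov(x, rowvar=False)) for x in groups]
    Wsum_inv = np.linalg.inv(sum(W))
    center = Wsum_inv @ sum(Wi @ x.mean(0) for Wi, x in zip(W, groups))
    T = sum((x.mean(0) - center) @ Wi @ (x.mean(0) - center)
            for Wi, x in zip(W, groups))
    A = sum((np.trace((np.eye(p) - Wsum_inv @ Wi) @ (np.eye(p) - Wsum_inv @ Wi))
             + np.trace(np.eye(p) - Wsum_inv @ Wi) ** 2) / (len(x) - 1)
            for Wi, x in zip(W, groups)) / 2
    df1 = p * (g - 1)
    c = df1 + 2 * A - 6 * A / (df1 + 2)
    df2 = df1 * (df1 + 2) / (3 * A)
    return T / c, stats.f.sf(T / c, df1, df2)

rng = np.random.default_rng(9)
groups = [rng.multivariate_normal(np.zeros(2), v * np.eye(2), size=n)
          for v, n in ((1, 30), (2, 40), (4, 50))]
stat, pval = johansen_test(groups)
print(f"F* = {stat:.3f}, p = {pval:.4f}")
```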


2015 ◽  
Vol 26 (6) ◽  
pp. 2812-2820 ◽  
Author(s):  
Songshan Yang ◽  
James A Cranford ◽  
Runze Li ◽  
Robert A Zucker ◽  
Anne Buu

This study proposes a time-varying effect model that can be used to characterize gender-specific trajectories of health behaviors and conduct hypothesis testing for gender differences. The motivating examples demonstrate that the proposed model is applicable to not only multi-wave longitudinal studies but also short-term studies that involve intensive data collection. The simulation study shows that the accuracy of estimation of trajectory functions improves as the sample size and the number of time points increase. In terms of the performance of the hypothesis testing, the type I error rates are close to their corresponding significance levels under all combinations of sample size and number of time points. Furthermore, the power increases as the alternative hypothesis deviates more from the null hypothesis, and the rate of this increasing trend is higher when the sample size and the number of time points are larger.
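
A time-varying effect model of this general kind can be approximated by expanding each coefficient function in a spline basis, which turns the fit into ordinary least squares. The sketch below is a minimal unpenalized version with made-up coefficient functions; the authors' actual estimation and hypothesis-testing machinery is richer than this.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(13)

# Simulated long-format data: y(t) = beta0(t) + beta1(t) * male + noise,
# with smooth, purely illustrative coefficient functions.
n_subj, n_times = 200, 10
t = np.tile(np.linspace(0, 1, n_times), n_subj)
male = np.repeat(rng.integers(0, 2, n_subj), n_times)
y = np.sin(2 * np.pi * t) + 0.8 * t * male + rng.normal(0, 0.5, t.size)

# Cubic B-spline design matrix for approximating each coefficient function
knots = np.concatenate([[0] * 4, np.linspace(0.2, 0.8, 4), [1] * 4])
n_basis = len(knots) - 4
B = BSpline.design_matrix(t, knots, 3).toarray()

# Time-varying intercept and time-varying gender effect: X = [B, male * B]
X = np.hstack([B, male[:, None] * B])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Recover the estimated gender-effect curve beta1(t) on a grid
grid = np.linspace(0, 1, 50)
Bg = BSpline.design_matrix(grid, knots, 3).toarray()
beta1_hat = Bg @ coef[n_basis:]
print("estimated gender effect at t ~ 0, 0.5, 1:",
      beta1_hat[[0, 24, -1]].round(2))   # truth: 0, 0.4, 0.8
```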


2012 ◽  
Vol 55 (5) ◽  
pp. 506-518
Author(s):  
M. Mendeş

This study was conducted to compare the Type I error and test power of the ANOVA, REML, and ML methods by Monte Carlo simulation under different experimental conditions. Simulation results indicated that the variance ratios, sample size, and number of groups were important factors in determining the appropriate method for estimating variance components. The ML method was found slightly superior to the ANOVA and REML methods, while ANOVA and REML generated similar results in general. As a result, regardless of distribution shape and number of groups, when n < 15 the ML and REML methods might be preferred to ANOVA; however, when either the number of groups or the sample size is increased (n ≥ 15), the ANOVA method may also be used along with ML and REML.
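
The three estimators being compared are easy to juxtapose on a single balanced one-way random-effects data set: the ANOVA (method-of-moments) estimator has a closed form, while ML and REML come from a random-intercept mixed model. The sketch below uses statsmodels' MixedLM; the true variance components and design sizes are arbitrary illustrations.

```python
import numpy as np
from statsmodels.regression.mixed_linear_model import MixedLM

rng = np.random.default_rng(21)

# Balanced one-way random effects model: y_ij = mu + a_i + e_ij
k, n, var_a, var_e = 10, 8, 2.0, 1.0        # groups, per-group n, true components
groups = np.repeat(np.arange(k), n)
y = 5 + rng.normal(0, np.sqrt(var_a), k)[groups] + rng.normal(0, np.sqrt(var_e), k * n)

# ANOVA (method-of-moments) estimator from the between/within mean squares
msb = n * np.var([y[groups == i].mean() for i in range(k)], ddof=1)
msw = np.mean([np.var(y[groups == i], ddof=1) for i in range(k)])
print("ANOVA: var_a =", round((msb - msw) / n, 3), " var_e =", round(msw, 3))

# ML and REML estimates via a random-intercept mixed model
for reml in (False, True):
    fit = MixedLM(y, np.ones((k * n, 1)), groups).fit(reml=reml)
    print("REML " if reml else "ML   ", ": var_a =",
          round(float(fit.cov_re.iloc[0, 0]), 3), " var_e =", round(fit.scale, 3))
```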

