A more efficient three-arm non-inferiority test based on pooled estimators of the homogeneous variance

2016 ◽  
Vol 27 (8) ◽  
pp. 2437-2446 ◽  
Author(s):  
Hezhi Lu ◽  
Hua Jin ◽  
Weixiong Zeng

Hida and Tango established a statistical testing framework for three-arm non-inferiority trials that include a placebo arm and a pre-specified non-inferiority margin, overcoming shortcomings of traditional two-arm non-inferiority trials (such as having to choose the margin). In this paper, we propose a new method that improves on their approach in two respects: we construct our test statistics from the best unbiased pooled estimator of the homogeneous variance, and we use the intersection-union principle to determine the rejection rule. We prove theoretically that our test is better than that of Hida and Tango for large sample sizes, and our simulation studies show that it also performs better for small and moderate sample sizes: although both tests control the type I error rate, theirs is more conservative and ours has higher statistical power.
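To make the two ingredients concrete, here is a minimal Python sketch (not the authors' code): the common variance is estimated by the usual degrees-of-freedom-weighted pooling of the three per-arm sample variances, and an intersection-union rule rejects overall only if both component tests (non-inferiority of test vs. reference, and superiority of reference over placebo) reject. The component hypotheses, margin handling, and function names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def three_arm_ni_test(x_t, x_r, x_p, margin, alpha=0.05):
    """Sketch of a three-arm non-inferiority test using a pooled variance
    estimate under homogeneity and an intersection-union rejection rule.
    Hypothetical illustration, not the authors' implementation."""
    groups = [np.asarray(g) for g in (x_t, x_r, x_p)]
    ns = np.array([len(g) for g in groups])
    mt, mr, mp = (g.mean() for g in groups)
    nt, nr, npl = ns
    # Unbiased pooled estimator of the common variance:
    # df-weighted average of the per-arm sample variances.
    df = ns.sum() - 3
    s2 = sum((n - 1) * g.var(ddof=1) for n, g in zip(ns, groups)) / df
    # Component 1: non-inferiority of test vs reference, H0: mt - mr <= -margin
    z1 = (mt - mr + margin) / np.sqrt(s2 * (1 / nt + 1 / nr))
    # Component 2: superiority of reference vs placebo (assay sensitivity)
    z2 = (mr - mp) / np.sqrt(s2 * (1 / nr + 1 / npl))
    crit = stats.t.ppf(1 - alpha, df)
    # Intersection-union rule: reject only if BOTH component tests reject.
    return (z1 > crit) and (z2 > crit)
```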

2021 ◽  
pp. 174077452110101
Author(s):  
Jennifer Proper ◽  
John Connett ◽  
Thomas Murray

Background: Bayesian response-adaptive designs, which adaptively shift the allocation ratio in favor of the better-performing treatment as data accrue, are often criticized for engendering a non-trivial probability of a subject imbalance in favor of the inferior treatment, inflating the type I error rate, and increasing sample size requirements. In the literature, implementations of these designs using Thompson sampling have generally assumed a simple beta-binomial probability model; however, the effect of this choice on the resulting design operating characteristics, relative to other reasonable alternatives, has not been fully examined. Motivated by the Advanced R²Eperfusion STrategies for Refractory Cardiac Arrest (ARREST) trial, we posit that a logistic probability model coupled with an urn or permuted block randomization method alleviates some of the practical limitations of the conventional implementation of a two-arm Bayesian response-adaptive design with binary outcomes. In this article, we discuss to what extent this solution works and when it does not. Methods: A computer simulation study was performed to evaluate the relative merits of a Bayesian response-adaptive design for the ARREST trial using Thompson sampling based on a logistic regression probability model coupled with either an urn or permuted block randomization method that limits deviations from the evolving target allocation ratio. The different implementations of the response-adaptive design were evaluated for type I error rate control across various null response rates and for power, among other performance metrics. Results: The logistic regression probability model engenders smaller average sample sizes with similar power, better control over the type I error rate, and more favorable treatment arm sample size distributions than the conventional beta-binomial probability model, and designs using the alternative randomization methods have a negligible chance of a sample size imbalance in the wrong direction. Conclusion: Pairing the logistic regression probability model with either of the alternative randomization methods results in a much-improved response-adaptive design with regard to important operating characteristics, including type I error rate control and the risk of a sample size imbalance in favor of the inferior treatment.
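For context, the conventional implementation that the abstract contrasts against can be sketched in a few lines: Thompson sampling under a simple beta-binomial model, with the allocation probability clipped to limit extreme imbalance. The response rates, uniform priors, and clipping bounds below are invented for illustration; the paper's proposal instead pairs a logistic regression model with urn or permuted block randomization.

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_allocation(succ, fail, n_draws=10_000, clip=(0.1, 0.9)):
    """Allocation probability for arm 1 under a beta-binomial model:
    P(p1 > p0 | data), estimated by Monte Carlo from Beta(1+s, 1+f)
    posteriors, then clipped to limit extreme imbalance.
    succ/fail hold (arm0, arm1) counts; priors are Beta(1, 1)."""
    p0 = rng.beta(1 + succ[0], 1 + fail[0], n_draws)
    p1 = rng.beta(1 + succ[1], 1 + fail[1], n_draws)
    return float(np.clip((p1 > p0).mean(), *clip))

# Toy trial: true response rates 0.12 (control) vs 0.25 (experimental).
truth, succ, fail = (0.12, 0.25), [0, 0], [0, 0]
for _ in range(200):
    arm = int(rng.random() < thompson_allocation(succ, fail))
    y = rng.random() < truth[arm]   # binary response for this subject
    succ[arm] += y
    fail[arm] += 1 - y
print(succ, fail)   # allocation drifts toward the better-performing arm
```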


2019 ◽  
Author(s):  
Rob Cribbie ◽  
Nataly Beribisky ◽  
Udi Alter

Many authorities recommend that a sample size planning procedure, such as a traditional NHST a priori power analysis, be conducted during the planning stages of a study. Power analysis allows the researcher to estimate how many participants are required to detect a minimally meaningful effect size at a specified level of power and Type I error rate. However, several drawbacks render the procedure "a mess": identifying the minimally meaningful effect size is often difficult yet unavoidable if the procedure is to be conducted properly, the procedure is not precision oriented, and it does not direct the researcher to collect as many participants as is feasible. In this study, we explore how these three theoretical issues are reflected in applied psychological research, in order to better understand whether they are concerns in practice. To investigate how power analysis is currently used, we reviewed the reporting of 443 power analyses in high-impact psychology journals in 2016 and 2017. We found that researchers rarely justify the chosen effect size in a power analysis as minimally meaningful. Further, precision-based approaches and collecting the maximum feasible sample size are almost never used in tandem with power analyses. In light of these findings, we suggest that researchers focus on tools beyond traditional power analysis when planning sample size, such as collecting the maximum sample size that is feasible.
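For reference, the traditional a priori power analysis discussed above is a one-liner with standard tooling; the effect size below is purely illustrative.

```python
# A priori power analysis for an independent-samples t-test: solve for n
# given a (minimally meaningful) effect size, alpha, and desired power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.4,  # Cohen's d, illustrative
                                   alpha=0.05, power=0.80,
                                   alternative='two-sided')
print(round(n_per_group))  # about 99 per group
```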


2020 ◽  
Vol 6 (2) ◽  
pp. 106-113
Author(s):  
A. M. Grjibovski ◽  
M. A. Gorbatova ◽  
A. N. Narkevich ◽  
K. A. Vinogradov

Sample size calculation at the planning stage is still uncommon in Russian research practice. This threatens the validity of conclusions and may introduce Type II errors, in which a false null hypothesis is accepted because the study lacks the statistical power to detect an existing difference between means. Comparing two means with unpaired Student's t-tests is the most common statistical procedure in the Russian biomedical literature, yet calculations of the minimal required sample size, or retrospective calculations of statistical power, appear in only a few publications. In this paper we demonstrate how to calculate the required sample size for comparing means in unpaired samples using the WinPepi and Stata software packages. In addition, we provide tables of minimal required sample sizes for studies comparing two means in which body mass index or blood pressure is the variable of interest. The tables were constructed for unpaired samples across different levels of statistical power, using standard deviations obtained from the literature.
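Tables of this kind rest on the textbook normal-approximation formula for comparing two independent means. A minimal Python sketch follows; the mean difference and standard deviation are illustrative stand-ins, not values from the paper's tables.

```python
import math
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Minimal required sample size per group for an unpaired two-sample
    comparison of means (equal variances, equal group sizes):
    n = 2 * sd^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (sd * z / delta) ** 2)

# Illustrative values only: detect a 2 kg/m^2 difference in mean BMI,
# assuming SD = 4.5 kg/m^2 taken from the literature.
print(n_per_group(delta=2, sd=4.5))  # 80 per group
```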


2017 ◽  
Vol 284 (1851) ◽  
pp. 20161850 ◽  
Author(s):  
Nick Colegrave ◽  
Graeme D. Ruxton

A common approach to the analysis of experimental data across much of the biological sciences is test-qualified pooling. Here, non-significant terms are dropped from a statistical model, effectively pooling the variation associated with each removed term into the error term used to test hypotheses (or estimate effect sizes). The pooling is carried out only if a statistical test of the term in a previous, more complicated model fitted to the same data motivates the simplification; hence the pooling is test-qualified. By pooling, the researcher increases the degrees of freedom of the error term, with the aim of increasing the statistical power of the tests of interest. Although this approach is widely adopted and explicitly recommended by some of the most widely cited statistical textbooks aimed at biologists, we argue here that (except in the highly specialized circumstances we identify) the hoped-for improvement in statistical power will be small or non-existent, and the reliability of the statistical procedures is likely to be much reduced through deviation of type I error rates from nominal levels. We thus call for greatly reduced use of test-qualified pooling across experimental biology, more careful justification of any use that continues, and a different philosophy for the initial selection of statistical models in light of this change in procedure.
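The procedure under critique can be sketched concretely. In this hypothetical two-factor example, the interaction term is dropped, and its variation pooled into the error term, only if it fails a significance test in the fuller model; the data and the alpha = 0.05 threshold are illustrative.

```python
# A minimal sketch of test-qualified pooling in a two-factor design.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": np.repeat(["a1", "a2"], 20),
    "b": np.tile(np.repeat(["b1", "b2"], 10), 2),
})
df["y"] = rng.normal(size=len(df)) + (df["a"] == "a2") * 0.8  # main effect of a

full = smf.ols("y ~ a * b", data=df).fit()
p_interaction = anova_lm(full, typ=2).loc["a:b", "PR(>F)"]
if p_interaction > 0.05:   # the "test" in test-qualified pooling
    model = smf.ols("y ~ a + b", data=df).fit()   # interaction pooled into error
else:
    model = full
print(anova_lm(model, typ=2))
```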


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Lukas Landler ◽  
Graeme D. Ruxton ◽  
E. Pascal Malkemper

Many biological variables are recorded on a circular scale and therefore need different statistical treatment. A common question asked of such circular data involves a comparison between two groups: are the populations from which the two samples are drawn distributed differently around the circle? We compared 18 tests for such situations by simulation, in terms of both the ability to control the Type I error rate near the nominal value and statistical power. We found that only eight tests offered good control of the Type I error in all our simulated situations. Of these eight, we identified Watson's U² test and a MANOVA approach, based on trigonometric functions of the data, as offering the best power in the overwhelming majority of our test circumstances. There was often little to choose between these two tests in terms of power, and no situation in which any of the remaining six tests offered substantially better power than either of them. Hence, we recommend the routine use of either Watson's U² test or the MANOVA approach when comparing two samples of circular data.
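The MANOVA approach is straightforward to sketch: each angle is embedded as its (cos, sin) coordinates and the two groups are compared with Hotelling's T², the two-group special case of MANOVA. The helper below is a hypothetical illustration, not the authors' code.

```python
import numpy as np
from scipy import stats

def hotelling_t2_circular(theta1, theta2):
    """Two-sample comparison of circular data via the MANOVA-type approach:
    embed each angle (radians) as (cos, sin) and apply Hotelling's T^2.
    Returns the F statistic and p-value; assumes a pooled covariance."""
    X = np.column_stack([np.cos(theta1), np.sin(theta1)])
    Y = np.column_stack([np.cos(theta2), np.sin(theta2)])
    n1, n2, p = len(X), len(Y), 2
    d = X.mean(axis=0) - Y.mean(axis=0)
    S = (((n1 - 1) * np.cov(X.T) + (n2 - 1) * np.cov(Y.T))
         / (n1 + n2 - 2))                      # pooled covariance matrix
    t2 = (n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S, d)
    f = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    return f, stats.f.sf(f, p, n1 + n2 - p - 1)

rng = np.random.default_rng(2)
a = rng.vonmises(mu=0.0, kappa=2.0, size=30)   # group 1
b = rng.vonmises(mu=0.8, kappa=2.0, size=30)   # group 2, shifted mean direction
print(hotelling_t2_circular(a, b))
```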


2020 ◽  
Author(s):  
Guosheng Yin ◽  
Chenyang Zhang ◽  
Huaqing Jin

BACKGROUND Recently, three randomized clinical trials of coronavirus disease (COVID-19) treatments were completed: one for lopinavir-ritonavir and two for remdesivir. One trial reported that remdesivir was superior to placebo in shortening the time to recovery, while the other two showed no benefit of the treatment under investigation. OBJECTIVE The aim of this paper is to identify, from a statistical perspective, several key issues in the design and analysis of the three COVID-19 trials and to reanalyze the data from the cumulative incidence curves in the three trials using more appropriate statistical methods. METHODS The lopinavir-ritonavir trial enrolled 39 additional patients, because of insignificant results after the sample size had reached the planned number, which led to inflation of the type I error rate. The remdesivir trial of Wang et al failed to reach the planned sample size due to a lack of eligible patients, so we used the bootstrap method to predict, both conditionally and unconditionally, the quantity of clinical interest had the trial continued to the originally planned sample size. Moreover, we used a terminal (or cure) rate model and a model-free metric, the restricted mean survival time or restricted mean time to improvement (RMTI), to analyze the reconstructed data. The remdesivir trial of Beigel et al reported the median recovery times of the remdesivir and placebo groups and the rate ratio for recovery; both quantities depend on a particular time point and thus convey only local information. We used the restricted mean time to recovery (RMTR) as a global and robust measure of efficacy. RESULTS For the lopinavir-ritonavir trial, the increase in sample size from 160 to 199 inflated the type I error rate from 0.05 to 0.071. The difference in RMTIs between the two groups evaluated at day 28 was –1.67 days (95% CI –3.62 to 0.28; P=.09), in favor of lopinavir-ritonavir but not statistically significant. For the remdesivir trial of Wang et al, the difference in RMTIs at day 28 was –0.89 days (95% CI –2.84 to 1.06; P=.37). The planned sample size was 453, yet only 236 patients were enrolled. The conditional prediction shows that the hazard ratio estimates would have reached statistical significance had the target sample size been maintained. For the remdesivir trial of Beigel et al, the difference in RMTRs between the remdesivir and placebo groups at day 30 was –2.7 days (95% CI –4.0 to –1.2; P<.001), confirming the superiority of remdesivir. The difference in recovery time at the 25th percentile (95% CI –3 to 0; P=.65) was insignificant, while the differences became more statistically significant at larger percentiles. CONCLUSIONS Based on the statistical issues and lessons learned from these three recent clinical trials of COVID-19 treatments, we suggest more appropriate approaches for the design and analysis of ongoing and future COVID-19 trials.
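The restricted mean measures used above are simply areas under time-to-event curves up to a fixed horizon. Below is a minimal from-scratch Python sketch on invented data; packages such as lifelines also provide Kaplan-Meier utilities.

```python
import numpy as np

def rmst(time, event, tau):
    """Restricted mean survival time up to tau: the area under the
    Kaplan-Meier curve on [0, tau]. event is 1 for an event, 0 for censored."""
    order = np.argsort(time)
    t, e = np.asarray(time)[order], np.asarray(event)[order]
    at_risk, surv = len(t), 1.0
    times, s_vals = [0.0], [1.0]
    for ti, ei in zip(t, e):
        if ei and ti <= tau:                 # KM drop at each event time
            surv *= 1 - 1 / at_risk
            times.append(float(ti))
            s_vals.append(surv)
        at_risk -= 1
    times.append(tau)
    # step-function integral: sum of interval width * preceding step height
    return sum((t1 - t0) * s for t0, t1, s in zip(times, times[1:], s_vals))

# Illustrative, made-up data: days to improvement within a 28-day window.
t_treat = [3, 5, 7, 9, 12, 20, 28]; e_treat = [1, 1, 1, 1, 1, 0, 0]
t_ctrl  = [6, 9, 11, 15, 22, 28, 28]; e_ctrl = [1, 1, 1, 1, 0, 0, 0]
print(rmst(t_treat, e_treat, 28) - rmst(t_ctrl, e_ctrl, 28))
```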


2011 ◽  
Vol 50 (03) ◽  
pp. 237-243 ◽  
Author(s):  
T. Friede ◽  
M. Kieser

Objectives: Analysis of covariance (ANCOVA) is widely applied in practice and its use is recommended by regulatory guidelines. However, the required sample size for ANCOVA depends on parameters that are usually uncertain in the planning phase of a study. Sample size recalculation within the internal pilot study design makes it possible to cope with this problem. From a regulatory viewpoint it is preferable that the treatment group allocation remain masked and that the type I error be controlled at the specified significance level. The characteristics of blinded sample size reassessment for ANCOVA in non-inferiority studies have not yet been investigated. We propose an appropriate method and evaluate its performance. Methods: In a simulation study, the characteristics of the proposed method with respect to type I error rate, power and sample size are investigated. A clinical trial example illustrates how strict control of the significance level can be achieved. Results: A slight excess of the type I error rate beyond the nominal significance level was observed; the extent of the exceedance increases with increasing non-inferiority margin and increasing correlation between outcome and covariate. The procedure assures the desired power over a wide range of scenarios, even if nuisance parameters affecting the sample size are initially mis-specified. Conclusions: The proposed blinded sample size recalculation procedure protects against insufficient sample sizes due to incorrect assumptions about nuisance parameters in the planning phase. The original procedure may lead to an elevated type I error rate, but methods are available to control the nominal significance level.
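A blinded reassessment of this kind can be sketched as follows, assuming the standard normal-approximation sample size formula for an ANCOVA-adjusted non-inferiority comparison; the helper function, margin, and interim SD are hypothetical illustrations, not the authors' procedure. Note that the blinded pooled SD includes any treatment difference, so it slightly overestimates the within-group SD.

```python
import math
from scipy.stats import norm

def ancova_ni_n(sd_blinded, rho, margin, alpha=0.025, power=0.80,
                true_diff=0.0):
    """Blinded sample size reassessment sketch for ANCOVA non-inferiority:
    re-estimate the outcome SD from pooled (blinded) interim data, then
    recompute n per group with the usual ANCOVA approximation
    n = 2 * sd^2 * (1 - rho^2) * (z_{1-a} + z_{1-b})^2 / (margin + diff)^2,
    where rho is the outcome-covariate correlation."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return math.ceil(2 * sd_blinded**2 * (1 - rho**2) * z**2
                     / (margin + true_diff) ** 2)

# Interim (blinded) SD turned out larger than the 8.0 assumed at planning:
print(ancova_ni_n(sd_blinded=10.0, rho=0.5, margin=4.0))  # n per group
```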


2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Keisuke Ejima ◽  
Andrew Brown ◽  
Daniel Smith ◽  
Ufuk Beyaztas ◽  
David Allison

Objectives: Awareness of rigor, reproducibility and transparency (RRT) has expanded over the last decade. Although RRT can be improved in various respects, we focused on the type I error rates and power of commonly used statistical analyses testing mean differences between two groups, using small (n ≤ 5) to moderate sample sizes. Methods: We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether the type I error rate could be affected by the choice of statistical test, we adjusted the empirical distributions of weights to ensure the null hypothesis (i.e., no mean difference) in two ways: Case 1) centering both weight distributions on the same mean weight; Case 2) combining data from the control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a 'plasmode' dataset. We performed five common tests (Student's t-test, Welch's t-test, Wilcoxon test, permutation test and bootstrap test) on the plasmodes and computed type I error rates. Power was assessed using plasmodes in which the distribution of the control group was shifted by adding a constant value, as in Case 1, but so as to realize nominal effect sizes. Results: Type I error rates were unreasonably higher than the nominal significance level (type I error rate inflation) for Student's t-test, Welch's t-test and the permutation test in Case 1, especially when the sample size was small, whereas inflation was observed only for the permutation test in Case 2. Deflation was noted for the bootstrap test with small samples. Increasing the sample size mitigated the inflation and deflation, except for the Wilcoxon test in Case 1, because the heterogeneity of the weight distributions between groups violated its assumptions for the purpose of testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, the bootstrap test was underpowered with small samples, as a tradeoff for maintaining type I error rates. Conclusions: With small samples (n ≤ 5), the bootstrap test avoided type I error rate inflation, but often at the cost of lower power. To avoid type I error rate inflation for the other tests, the sample size should be increased. The Wilcoxon test should be avoided because of the heterogeneity of weight distributions between mutant and control mice. Funding Sources: This study was supported in part by NIH and a Japan Society for the Promotion of Science (JSPS) KAKENHI grant.
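The null-construction logic of Case 1 can be sketched in a few lines: both empirical distributions are forced to share a mean, small samples are repeatedly drawn, and per-test rejection rates estimate the type I error. The distributions below are invented stand-ins for the actual mouse weights, and only three of the five tests are shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(25, 3, 500)            # stand-in "control" weights
mutant = rng.gamma(4, 2, 500) + 17          # skewed stand-in "mutant" weights
mutant += control.mean() - mutant.mean()    # force a common mean (the null)

n, reps, alpha = 5, 2000, 0.05
rej = {"student": 0, "welch": 0, "wilcoxon": 0}
for _ in range(reps):
    a = rng.choice(control, n, replace=True)   # resample a small "plasmode"
    b = rng.choice(mutant, n, replace=True)
    rej["student"] += stats.ttest_ind(a, b).pvalue < alpha
    rej["welch"] += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha
    # Wilcoxon rank-sum test (Mann-Whitney U):
    rej["wilcoxon"] += stats.mannwhitneyu(a, b).pvalue < alpha
print({k: v / reps for k, v in rej.items()})   # empirical type I error rates
```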

