Increasing the sample size during clinical trials witht-distributed test statistics without inflating the type I error rate

Background: Bayesian response-adaptive designs, which data adaptively alter the allocation ratio in favor of the better performing treatment, are often criticized for engendering a non-trivial probability of a subject imbalance in favor of the inferior treatment, inflating type I error rate, and increasing sample size requirements. The implementation of these designs using the Thompson sampling methods has generally assumed a simple beta-binomial probability model in the literature; however, the effect of these choices on the resulting design operating characteristics relative to other reasonable alternatives has not been fully examined. Motivated by the Advanced R2 Eperfusion STrategies for Refractory Cardiac Arrest trial, we posit that a logistic probability model coupled with an urn or permuted block randomization method will alleviate some of the practical limitations engendered by the conventional implementation of a two-arm Bayesian response-adaptive design with binary outcomes. In this article, we discuss up to what extent this solution works and when it does not. Methods: A computer simulation study was performed to evaluate the relative merits of a Bayesian response-adaptive design for the Advanced R2 Eperfusion STrategies for Refractory Cardiac Arrest trial using the Thompson sampling methods based on a logistic regression probability model coupled with either an urn or permuted block randomization method that limits deviations from the evolving target allocation ratio. The different implementations of the response-adaptive design were evaluated for type I error rate control across various null response rates and power, among other performance metrics. Results: The logistic regression probability model engenders smaller average sample sizes with similar power, better control over type I error rate, and more favorable treatment arm sample size distributions than the conventional beta-binomial probability model, and designs using the alternative randomization methods have a negligible chance of a sample size imbalance in the wrong direction. Conclusion: Pairing the logistic regression probability model with either of the alternative randomization methods results in a much improved response-adaptive design in regard to important operating characteristics, including type I error rate control and the risk of a sample size imbalance in favor of the inferior treatment.

Download Full-text

Statistical Issues and Lessons Learned From COVID-19 Clinical Trials With Lopinavir-Ritonavir and Remdesivir (Preprint)

10.2196/preprints.19538 ◽

2020 ◽

Author(s):

Guosheng Yin ◽

Chenyang Zhang ◽

Huaqing Jin

Keyword(s):

Clinical Trials ◽

Sample Size ◽

Error Rate ◽

Recovery Time ◽

Type I Error ◽

Lessons Learned ◽

Type I ◽

Statistical Issues ◽

The Difference ◽

Mean Time

BACKGROUND Recently, three randomized clinical trials on coronavirus disease (COVID-19) treatments were completed: one for lopinavir-ritonavir and two for remdesivir. One trial reported that remdesivir was superior to placebo in shortening the time to recovery, while the other two showed no benefit of the treatment under investigation. OBJECTIVE The aim of this paper is to, from a statistical perspective, identify several key issues in the design and analysis of three COVID-19 trials and reanalyze the data from the cumulative incidence curves in the three trials using more appropriate statistical methods. METHODS The lopinavir-ritonavir trial enrolled 39 additional patients due to insignificant results after the sample size reached the planned number, which led to inflation of the type I error rate. The remdesivir trial of Wang et al failed to reach the planned sample size due to a lack of eligible patients, and the bootstrap method was used to predict the quantity of clinical interest conditionally and unconditionally if the trial had continued to reach the originally planned sample size. Moreover, we used a terminal (or cure) rate model and a model-free metric known as the restricted mean survival time or the restricted mean time to improvement (RMTI) to analyze the reconstructed data. The remdesivir trial of Beigel et al reported the median recovery time of the remdesivir and placebo groups, and the rate ratio for recovery, while both quantities depend on a particular time point representing local information. We use the restricted mean time to recovery (RMTR) as a global and robust measure for efficacy. RESULTS For the lopinavir-ritonavir trial, with the increase of sample size from 160 to 199, the type I error rate was inflated from 0.05 to 0.071. The difference of RMTIs between the two groups evaluated at day 28 was –1.67 days (95% CI –3.62 to 0.28; P=.09) in favor of lopinavir-ritonavir but not statistically significant. For the remdesivir trial of Wang et al, the difference of RMTIs at day 28 was –0.89 days (95% CI –2.84 to 1.06; P=.37). The planned sample size was 453, yet only 236 patients were enrolled. The conditional prediction shows that the hazard ratio estimates would reach statistical significance if the target sample size had been maintained. For the remdesivir trial of Beigel et al, the difference of RMTRs between the remdesivir and placebo groups at day 30 was –2.7 days (95% CI –4.0 to –1.2; P<.001), confirming the superiority of remdesivir. The difference in the recovery time at the 25th percentile (95% CI –3 to 0; P=.65) was insignificant, while the differences became more statistically significant at larger percentiles. CONCLUSIONS Based on the statistical issues and lessons learned from the recent three clinical trials on COVID-19 treatments, we suggest more appropriate approaches for the design and analysis of ongoing and future COVID-19 trials.

Download Full-text

Sample Size Reassessment in Non-inferiority Trials

Methods of Information in Medicine ◽

10.3414/me09-01-0063 ◽

2011 ◽

Vol 50 (03) ◽

pp. 237-243 ◽

Cited By ~ 10

Author(s):

T. Friede ◽

M. Kieser

Keyword(s):

Sample Size ◽

Error Rate ◽

Type I Error ◽

Nuisance Parameters ◽

Type I ◽

Nominal Significance Level ◽

Significance Level ◽

Type I Error Rate ◽

Planning Phase ◽

Nominal Significance

SummaryObjectives: Analysis of covariance (ANCOVA) is widely applied in practice and its use is recommended by regulatory guidelines. However, the required sample size for ANCOVA depends on parameters that are usually uncertain in the planning phase of a study. Sample size recalculation within the internal pilot study design allows to cope with this problem. From a regulatory viewpoint it is preferable that the treatment group allocation remains masked and that the type I error is controlled at the specified significance level. The characteristics of blinded sample size reassessment for ANCOVA in non-inferiority studies have not been investigated yet. We propose an appropriate method and evaluate its performance.Methods: In a simulation study, the characteristics of the proposed method with respect to type I error rate, power and sample size are investigated. It is illustrated by a clinical trial example how strict control of the significance level can be achieved.Results: A slight excess of the type I error rate beyond the nominal significance level was observed. The extent of exceedance increases with increasing non-inferiority margin and increasing correlation between outcome and covariate. The procedure assures the desired power over a wide range of scenarios even if nuisance parameters affecting the sample size are initially mis-specified.Conclusions: The proposed blinded sample size recalculation procedure protects from insufficient sample sizes due to incorrect assumptions about nuisance parameters in the planning phase. The original procedure may lead to an elevated type I error rate, but methods are available to control the nominal significance level.

Download Full-text

Assessment of Type I Error Rates and Power of Common Analysis Methods in Murine Obesity-Related Study: ‘Plasmode-Based’ Simulation (P13-011-19)

Current Developments in Nutrition ◽

10.1093/cdn/nzz036.p13-011-19 ◽

2019 ◽

Vol 3 (Supplement_1) ◽

Author(s):

Keisuke Ejima ◽

Andrew Brown ◽

Daniel Smith ◽

Ufuk Beyaztas ◽

David Allison

Keyword(s):

Sample Size ◽

Error Rate ◽

Type I Error ◽

Error Rates ◽

T Test ◽

Small Samples ◽

Type I ◽

Type I Error Rates ◽

Type I Error Rate ◽

Weight Distributions

Abstract Objectives Rigor, reproducibility and transparency (RRT) awareness has expanded over the last decade. Although RRT can be improved from various aspects, we focused on type I error rates and power of commonly used statistical analyses testing mean differences of two groups, using small (n ≤ 5) to moderate sample sizes. Methods We compared data from five distinct, homozygous, monogenic, murine models of obesity with non-mutant controls of both sexes. Baseline weight (7–11 weeks old) was the outcome. To examine whether type I error rate could be affected by choice of statistical tests, we adjusted the empirical distributions of weights to ensure the null hypothesis (i.e., no mean difference) in two ways: Case 1) center both weight distributions on the same mean weight; Case 2) combine data from control and mutant groups into one distribution. From these cases, 3 to 20 mice were resampled to create a ‘plasmode’ dataset. We performed five common tests (Student's t-test, Welch's t-test, Wilcoxon test, permutation test and bootstrap test) on the plasmodes and computed type I error rates. Power was assessed using plasmodes, where the distribution of the control group was shifted by adding a constant value as in Case 1, but to realize nominal effect sizes. Results Type I error rates were unreasonably higher than the nominal significance level (type I error rate inflation) for Student's t-test, Welch's t-test and permutation especially when sample size was small for Case 1, whereas inflation was observed only for permutation for Case 2. Deflation was noted for bootstrap with small sample. Increasing sample size mitigated inflation and deflation, except for Wilcoxon in Case 1 because heterogeneity of weight distributions between groups violated assumptions for the purposes of testing mean differences. For power, a departure from the reference value was observed with small samples. Compared with the other tests, bootstrap was underpowered with small samples as a tradeoff for maintaining type I error rates. Conclusions With small samples (n ≤ 5), bootstrap avoided type I error rate inflation, but often at the cost of lower power. To avoid type I error rate inflation for other tests, sample size should be increased. Wilcoxon should be avoided because of heterogeneity of weight distributions between mutant and control mice. Funding Sources This study was supported in part by NIH and Japan Society for Promotion of Science (JSPS) KAKENHI grant.

Download Full-text

Simple procedures for blinded sample size adjustment that do not affect the type I error rate

Statistics in Medicine ◽

10.1002/sim.1585 ◽

2003 ◽

Vol 22 (23) ◽

pp. 3571-3581 ◽

Cited By ~ 77

Author(s):

Meinhard Kieser ◽

Tim Friede

Keyword(s):

Sample Size ◽

Error Rate ◽

Type I Error ◽

Type I ◽

Type I Error Rate ◽

Size Adjustment ◽

Sample Size Adjustment

Download Full-text

Maximum type I error rate inflation from sample size reassessment when investigators are blind to treatment labels

Statistics in Medicine ◽

10.1002/sim.6848 ◽

2015 ◽

Vol 35 (12) ◽

pp. 1972-1984 ◽

Cited By ~ 3

Author(s):

Magdalena Żebrowska ◽

Martin Posch ◽

Dominic Magirr

Keyword(s):

Sample Size ◽

Error Rate ◽

Type I Error ◽

Type I ◽

Type I Error Rate

Download Full-text

Sample size re-assessment leading to a raised sample size does not inflate type I error rate under mild conditions

BMC Medical Research Methodology ◽

10.1186/1471-2288-13-94 ◽

2013 ◽

Vol 13 (1) ◽

Cited By ~ 4

Author(s):

Per Broberg

Keyword(s):

Sample Size ◽

Error Rate ◽

Type I Error ◽

Type I ◽

Mild Conditions ◽

Type I Error Rate

Download Full-text

Interim analyses for monitoring clinical trials that do not materially affect the type I error rate

Statistics in Medicine ◽

10.1002/sim.4780110107 ◽

1992 ◽

Vol 11 (1) ◽

pp. 55-66 ◽

Cited By ~ 90

Author(s):

A. Lawrence Gould

Keyword(s):

Clinical Trials ◽

Error Rate ◽

Type I Error ◽

Type I ◽

Interim Analyses ◽

Type I Error Rate

Download Full-text

Adaptive designs in multi-reader multi-case clinical trials of imaging devices

Statistical Methods in Medical Research ◽

10.1177/0962280219869370 ◽

2019 ◽

Vol 29 (6) ◽

pp. 1592-1611

Author(s):

Zhipeng Huang ◽

Frank Samuelson ◽

Lucas Tcheuko ◽

Weijie Chen

Keyword(s):

Clinical Trials ◽

Error Rate ◽

Interim Analysis ◽

Type I Error ◽

Adaptive Methods ◽

Type I ◽

Type I Error Rate ◽

Power And Control ◽

Target Power ◽

And Control

Evaluation of medical imaging devices often involves clinical studies where multiple readers (MR) read images of multiple cases (MC) for a clinical task, which are often called MRMC studies. In addition to sizing patient cases as is required in most clinical trials, MRMC studies also require sizing readers, since both readers and cases contribute to the uncertainty of the estimated diagnostic performance, which is often measured by the area under the ROC curve (AUC). Due to limited prior information, sizing of such a study is often unreliable. It is desired to adaptively resize the study toward a target power after an interim analysis. Although adaptive methods are available in clinical trials where only the patient sample is sized, such methodologies have not been established for MRMC studies. The challenge lies in the fact that there is a correlation structure in MRMC data and the sizing involves both readers and cases. We develop adaptive MRMC design methodologies to enable study resizing. In particular, we resize the study and adjust the critical value for hypothesis testing simultaneously after an interim analysis to achieve a target power and control the type I error rate in comparing AUCs of two modalities. Analytical results have been derived. Simulations show that the type I error rate is controlled close to the nominal level and the power is adjusted toward the target value under a variety of simulation conditions. We demonstrate the use of our methods in a real-world application comparing two imaging modalities for breast cancer detection.

Download Full-text

Adding new experimental arms to randomised clinical trials: Impact on error rates

Clinical Trials ◽

10.1177/1740774520904346 ◽

2020 ◽

Vol 17 (3) ◽

pp. 273-284 ◽

Cited By ~ 1

Author(s):

Babak Choodari-Oskooei ◽

Daniel J Bratton ◽

Melissa R Gannon ◽

Angela M Meade ◽

Matthew R Sydes ◽

...

Keyword(s):

Error Rate ◽

Type I Error ◽

Controlled Trial ◽

Late Phase ◽

Pairwise Comparison ◽

Error Rates ◽

Pairwise Comparisons ◽

Type I ◽

Test Statistics ◽

Type I Error Rate

Background: Experimental treatments pass through various stages of development. If a treatment passes through early-phase experiments, the investigators may want to assess it in a late-phase randomised controlled trial. An efficient way to do this is adding it as a new research arm to an ongoing trial while the existing research arms continue, a so-called multi-arm platform trial. The familywise type I error rate is often a key quantity of interest in any multi-arm platform trial. We set out to clarify how it should be calculated when new arms are added to a trial some time after it has started. Methods: We show how the familywise type I error rate, any-pair and all-pairs powers can be calculated when a new arm is added to a platform trial. We extend the Dunnett probability and derive analytical formulae for the correlation between the test statistics of the existing pairwise comparison and that of the newly added arm. We also verify our analytical derivation via simulations. Results: Our results indicate that the familywise type I error rate depends on the shared control arm information (i.e. individuals in continuous and binary outcomes and primary outcome events in time-to-event outcomes) from the common control arm patients and the allocation ratio. The familywise type I error rate is driven more by the number of pairwise comparisons and the corresponding (pairwise) type I error rates than by the timing of the addition of the new arms. The familywise type I error rate can be estimated using Šidák’s correction if the correlation between the test statistics of pairwise comparisons is less than 0.30. Conclusions: The findings we present in this article can be used to design trials with pre-planned deferred arms or to add new pairwise comparisons within an ongoing platform trial where control of the pairwise error rate or familywise type I error rate (for a subset of pairwise comparisons) is required.

Download Full-text