Operating characteristics of sample size re-estimation with futility stopping based on conditional power

2006 ◽  
Vol 25 (19) ◽  
pp. 3348-3365 ◽  
Author(s):  
John M. Lachin


2021 ◽  
pp. 174077452110101
Author(s):  
Jennifer Proper ◽  
John Connett ◽  
Thomas Murray

Background: Bayesian response-adaptive designs, which data-adaptively alter the allocation ratio in favor of the better performing treatment, are often criticized for engendering a non-trivial probability of a subject imbalance in favor of the inferior treatment, inflating the type I error rate, and increasing sample size requirements. Implementations of these designs using Thompson sampling have generally assumed a simple beta-binomial probability model in the literature; however, the effect of these choices on the resulting design operating characteristics relative to other reasonable alternatives has not been fully examined. Motivated by the Advanced REperfusion STrategies for Refractory Cardiac Arrest (ARREST) trial, we posit that a logistic probability model coupled with an urn or permuted block randomization method will alleviate some of the practical limitations engendered by the conventional implementation of a two-arm Bayesian response-adaptive design with binary outcomes. In this article, we discuss to what extent this solution works and when it does not. Methods: A computer simulation study was performed to evaluate the relative merits of a Bayesian response-adaptive design for the ARREST trial using Thompson sampling based on a logistic regression probability model coupled with either an urn or permuted block randomization method that limits deviations from the evolving target allocation ratio. The different implementations of the response-adaptive design were evaluated for type I error rate control across various null response rates and for power, among other performance metrics. Results: The logistic regression probability model engenders smaller average sample sizes with similar power, better control over the type I error rate, and more favorable treatment arm sample size distributions than the conventional beta-binomial probability model, and designs using the alternative randomization methods have a negligible chance of a sample size imbalance in the wrong direction. Conclusion: Pairing the logistic regression probability model with either of the alternative randomization methods results in a much improved response-adaptive design with regard to important operating characteristics, including type I error rate control and the risk of a sample size imbalance in favor of the inferior treatment.
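For orientation, the sketch below shows the conventional beta-binomial Thompson-sampling allocation probability for a two-arm binary-outcome trial, combined with a permuted-block realization that limits deviation from the evolving target ratio. It is illustrative only: the article's logistic regression probability model would require posterior simulation (e.g. MCMC) and is not reproduced, and the function names, prior, and tuning exponent are assumptions.

```python
# Minimal sketch (assumed parameterization): Thompson-sampling allocation
# probability from a beta-binomial model, then a permuted-block assignment
# that bounds deviation from the target allocation within each block.
import numpy as np

rng = np.random.default_rng(2021)

def thompson_alloc_prob(success, failure, n_draws=10_000, kappa=0.5):
    """Target probability of assigning arm 1, tempered by exponent kappa.

    success/failure: observed counts per arm, flat Beta(1, 1) priors assumed.
    """
    draws = rng.beta(1 + np.asarray(success), 1 + np.asarray(failure),
                     size=(n_draws, 2))
    p1 = (draws[:, 1] > draws[:, 0]).mean()          # P(arm 1 has higher response rate)
    return p1 ** kappa / (p1 ** kappa + (1 - p1) ** kappa)

def blocked_assignment(target_prob, block_size=10):
    """Permuted block containing round(target_prob * block_size) arm-1 slots,
    shuffled, so the realized allocation cannot drift far from the target."""
    n1 = int(round(target_prob * block_size))
    block = np.array([1] * n1 + [0] * (block_size - n1))
    rng.shuffle(block)
    return block

# Example: after 40 subjects with 12/20 vs 15/20 responders
prob_arm1 = thompson_alloc_prob(success=[12, 15], failure=[8, 5])
print(prob_arm1, blocked_assignment(prob_arm1))
```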


Healthcare ◽  
2019 ◽  
Vol 7 (4) ◽  
pp. 137 ◽  
Author(s):  
J. Blackston ◽  
Andrew Chapple ◽  
James McGree ◽  
Suzanne McDonald ◽  
Jane Nikles

Background: N-of-1 trials offer an innovative approach to delivering personalized clinical care together with population-level research. While increasingly used, these methods have raised some statistical concerns in the healthcare community. Methods: We discuss concerns about selection bias, carryover effects from treatment, and trial data analysis conceptually, then rigorously evaluate concerns about effect sizes, power, and sample size through a simulation study. Four variance structures for patient heterogeneity and model error are considered in a series of 5000 simulated trials with 3 cycles, which compare aggregated N-of-1 trials to parallel randomized controlled trials (RCTs) and crossover trials. Results: Aggregated N-of-1 trials outperformed both traditional parallel RCT and crossover designs in terms of power and the sample size required to achieve a given power. N-of-1 designs resulted in a higher type I error probability than parallel RCT and crossover designs when moderate-to-strong carryover effects were not accounted for or in the presence of modeled selection bias. However, N-of-1 designs allowed better estimation of patient-level random effects. These results reinforce the need to account for these factors when planning N-of-1 trials. Conclusion: N-of-1 trial designs offer a rigorous method for advancing personalized medicine and healthcare with the potential to minimize costs and resources. Interventions can be tested with adequate power using far fewer patients than traditional RCT and crossover designs require, and operating characteristics compare favorably to both.
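The following sketch illustrates the simulation idea in miniature: aggregated N-of-1 trials with three cycles per patient, a patient-level random intercept, and an optional carryover term, analyzed with a random-intercept model. The effect sizes, variance components, and carryover magnitude are illustrative assumptions, not the values used in the article's study.

```python
# Minimal sketch (assumed parameter values): simulate aggregated N-of-1 trials
# and recover the treatment effect with a random-intercept mixed model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def simulate_nof1(n_patients=30, n_cycles=3, effect=0.5,
                  sd_patient=1.0, sd_error=1.0, carryover=0.0):
    rows = []
    for pid in range(n_patients):
        u = rng.normal(0, sd_patient)              # patient-level random intercept
        prev_treat = 0
        for cycle in range(n_cycles):
            for treat in rng.permutation([0, 1]):  # randomized order within each cycle
                y = (u + effect * treat
                     + carryover * prev_treat      # residual effect of the prior period
                     + rng.normal(0, sd_error))
                rows.append(dict(patient=pid, cycle=cycle, treat=treat, y=y))
                prev_treat = treat
    return pd.DataFrame(rows)

df = simulate_nof1(carryover=0.2)
fit = smf.mixedlm("y ~ treat", df, groups=df["patient"]).fit()
print(fit.params)   # fixed effects plus the patient-level variance component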


2018 ◽  
Vol 15 (5) ◽  
pp. 452-461 ◽  
Author(s):  
Satrajit Roychoudhury ◽  
Nicolas Scheuer ◽  
Beat Neuenschwander

Background Well-designed phase II trials must have acceptable error rates relative to a pre-specified success criterion, usually a statistically significant p-value. Such standard designs may not always suffice from a clinical perspective, because clinical relevance may call for more. For example, proof-of-concept in phase II often requires not only statistical significance but also a sufficiently large effect estimate. Purpose We propose dual-criterion designs to complement statistical significance with clinical relevance, discuss their methodology, and illustrate their implementation in phase II. Methods Clinical relevance requires the effect estimate to pass a clinically motivated threshold, the decision value (DV). In contrast to standard designs, the required effect estimate is an explicit design input, whereas study power is implicit. The sample size for a dual-criterion design needs careful consideration of the study’s operating characteristics (type I error, power). Results Dual-criterion designs are discussed for a randomized controlled and a single-arm phase II trial, including decision criteria, sample size calculations, decisions under various data scenarios, and operating characteristics. The designs facilitate GO/NO-GO decisions due to their complementary statistical–clinical criteria. Limitations While conceptually simple, implementing a dual-criterion design needs care. The clinical DV must be elicited carefully in collaboration with clinicians, and understanding the similarities and differences relative to a standard design is crucial. Conclusion To improve evidence-based decision-making, a formal yet transparent quantitative framework is important. Dual-criterion designs offer an appealing statistical–clinical compromise, which may be preferable to standard designs if evidence against the null hypothesis alone does not suffice for an efficacy claim.
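A dual-criterion GO decision can be made concrete with a single-arm binary endpoint: GO requires both a significant one-sided test against the null response rate p0 and an observed rate at least as large as the decision value. The sketch below evaluates the resulting GO probability by simulation; the values of p0, DV, alpha, and n are assumptions chosen for illustration, not the article's case studies.

```python
# Minimal sketch (assumed thresholds): dual-criterion GO rule and its
# operating characteristics for a single-arm binary-endpoint phase II trial.
import numpy as np
from scipy.stats import binomtest

def go_decision(responders, n, p0=0.15, dv=0.30, alpha=0.05):
    significant = binomtest(responders, n, p0, alternative="greater").pvalue < alpha
    relevant = responders / n >= dv          # clinically motivated decision value
    return significant and relevant

def go_probability(true_rate, n=40, n_sim=20_000, seed=0, **kwargs):
    """P(GO) under a given true response rate: type I error when true_rate == p0,
    power when true_rate equals the design alternative."""
    rng = np.random.default_rng(seed)
    responders = rng.binomial(n, true_rate, size=n_sim)
    return np.mean([go_decision(int(r), n, **kwargs) for r in responders])

print("GO prob at null   (0.15):", go_probability(0.15))
print("GO prob at target (0.35):", go_probability(0.35))
```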


2013 ◽  
Vol 31 (15_suppl) ◽  
pp. TPS6099-TPS6099
Author(s):  
David Ira Rosenthal ◽  
Qiang Zhang ◽  
Merrill S. Kies ◽  
Minh-Tam Truong ◽  
Richard Jordan ◽  
...  

TPS6099 Background: Obtaining phase II results to select an experimental treatment arm for a separate phase III comparison can require years. Cancer clinical trials also now target both survival and PRO/functional outcomes, especially in head and neck (HN) studies. We developed a seamless phase II/III trial design to save on sample size and trial duration. The initial multi-arm phase II trial selects the most effective regimen among multiple experimental arms by comparing each new treatment to a common control arm on chosen endpoints, such as progression-free survival (PFS). The winner is then tested for overall survival in the phase III study. Methods: We propose a phase II/III design to test the efficacy of experimental arms of postoperative radiation (RT) + docetaxel or RT + docetaxel + cetuximab in patients with HN squamous cancer. These are compared with the control arm of RT + cisplatin in the phase II part. Only one arm will be selected to go on to phase III, depending on efficacy (PFS), PRO, and safety outcomes: to be selected for phase III inclusion, an experimental arm must be sufficiently better than the common control arm, and the winner must not carry increased toxicity or functional cost. If no arm qualifies, the trial is halted for futility. Patients in the selected phase II arm and the control arm are included in the phase III testing. Group sequential methods are used to design each component. Separate interim efficacy and futility analyses are built in so that each endpoint can be monitored as in separate phase II and III trials. Once sample sizes are derived, operating characteristics for the seamless II/III design are evaluated through simulations under the null and various alternative hypotheses. Savings in sample size and time are compared to typical separate phase II and III designs and to a design testing only the RT + docetaxel + cetuximab arm in phase II. Conclusion: The phase II/III RTOG 1216 HNC trial offers cost-effectiveness, operational efficiency, and scientific innovation.
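The core pick-the-winner step can be sketched with the usual normal approximation for the log hazard ratio (variance roughly 4/d with d events per comparison). The selection alpha, event counts, and true hazard ratios below are illustrative assumptions; the RTOG 1216 design additionally incorporates PRO/safety criteria and group sequential monitoring, which are not reproduced here.

```python
# Minimal sketch (assumed design inputs): probability of selecting each
# experimental arm, or stopping for futility, in a two-arm-vs-control
# phase II selection step based on PFS.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def phase2_selection(true_hrs=(0.8, 0.65), events=150,
                     alpha_select=0.10, n_sim=50_000):
    """Returns P(select arm A), P(select arm B), P(stop for futility)."""
    se = np.sqrt(4 / events)                             # SE of log HR per comparison
    log_hr_hat = rng.normal(np.log(true_hrs), se, size=(n_sim, 2))
    z = log_hr_hat / se                                  # test of each arm vs control (HR = 1)
    passes = z < -norm.ppf(1 - alpha_select)             # sufficiently better than control
    best = np.argmin(log_hr_hat, axis=1)                 # apparent winner
    pick_a = (best == 0) & passes[:, 0]
    pick_b = (best == 1) & passes[:, 1]
    futility = ~(pick_a | pick_b)
    return pick_a.mean(), pick_b.mean(), futility.mean()

print(phase2_selection())
```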


2016 ◽  
Vol 27 (1) ◽  
pp. 158-171 ◽  
Author(s):  
Haolun Shi ◽  
Guosheng Yin

Conventional phase II clinical trials use either a single- or multi-arm comparison scheme to examine the therapeutic effects of the experimental drug. Both single- and multi-arm evaluations have their own merits; for example, single-arm phase II trials are easy to conduct and often require a smaller sample size, while multi-arm trials are randomized and typically lead to a more objective comparison. To bridge the single- and two-arm schemes in one trial, we propose a two-stage design in which the first stage takes a single-arm comparison of the experimental drug with the standard response rate (no concurrent treatment) and the second stage imposes a two-arm comparison by adding an active control arm. The design is calibrated using a new concept, the detectable treatment difference, to balance the trade-offs among futility termination, power, and sample size. We conduct extensive simulation studies to examine the operating characteristics of the proposed method and provide an illustrative example of our design.
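To make the two-stage structure concrete, the sketch below simulates a design that starts single-arm against a historical response rate and, if it clears a futility screen, adds a concurrent active-control arm for a randomized comparison. The thresholds, stage sizes, and response rates are assumptions for illustration; the article's calibration via the detectable treatment difference is not reproduced.

```python
# Minimal sketch (assumed thresholds and sample sizes): single-arm stage 1
# futility screen, then a pooled two-arm comparison in stage 2.
import numpy as np
from scipy.stats import binomtest, norm

rng = np.random.default_rng(42)

def run_trial(p_exp, p_ctrl, p0=0.20, n1=25, n2=50, alpha1=0.20, alpha2=0.05):
    # Stage 1: single-arm comparison with the historical rate p0.
    r1 = rng.binomial(n1, p_exp)
    if binomtest(int(r1), n1, p0, alternative="greater").pvalue >= alpha1:
        return "futility"
    # Stage 2: add an active control arm; pool stage 1 experimental data.
    r_exp = r1 + rng.binomial(n2, p_exp)
    r_ctrl = rng.binomial(n2, p_ctrl)
    n_exp, n_ctrl = n1 + n2, n2
    p_hat = (r_exp + r_ctrl) / (n_exp + n_ctrl)
    se = np.sqrt(p_hat * (1 - p_hat) * (1 / n_exp + 1 / n_ctrl))
    z = (r_exp / n_exp - r_ctrl / n_ctrl) / se
    return "success" if z > norm.ppf(1 - alpha2) else "failure"

results = [run_trial(p_exp=0.35, p_ctrl=0.20) for _ in range(5_000)]
print({k: results.count(k) / len(results) for k in ("futility", "failure", "success")})
```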


2014 ◽  
Vol 8 (1) ◽  
pp. 104-110 ◽  
Author(s):  
Jianping Xiang ◽  
Jihnhee Yu ◽  
Kenneth V Snyder ◽  
Elad I Levy ◽  
Adnan H Siddiqui ◽  
...  

Background: We previously established three logistic regression models for discriminating intracranial aneurysm rupture status based on morphological and hemodynamic analysis of 119 aneurysms. In this study, we tested whether these models remain stable with increasing sample size, and investigated the sample sizes required for various levels of confidence interval (CI) convergence. Methods: We augmented our previous dataset of 119 aneurysms into a new dataset of 204 samples by collecting an additional 85 consecutive aneurysms, on which we performed flow simulation and calculated morphological and hemodynamic parameters, as done previously. We performed univariate significance tests on these parameters, and multivariate logistic regression on the significant parameters. The new regression models were compared against the original models. Receiver operating characteristic analysis was applied to compare the performance of the regression models. Furthermore, we performed regression analysis based on bootstrap resampling statistical simulations to explore how many aneurysm cases are required to generate stable models. Results: Univariate tests of the 204 aneurysms generated an identical list of significant morphological and hemodynamic parameters as previously (from the analysis of 119 cases). Furthermore, multivariate regression analysis produced three parsimonious predictive models that were almost identical to the previous ones, with model coefficients that had narrower CIs than the original ones. Bootstrapping showed that 10%, 5%, 2%, and 1% CI convergence levels required 120, 200, 500, and 900 aneurysms, respectively. Conclusions: Our original hemodynamic–morphological rupture prediction models are stable and improve with increasing sample size. Results from the resampling statistical simulations provide guidance for designing future large multi-population studies.
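The bootstrap stability idea can be sketched as follows: refit the logistic model on resampled cohorts and track how the width of the coefficient confidence intervals shrinks as the number of cases grows. The synthetic predictors stand in for the morphological/hemodynamic parameters, and the effect sizes and convergence summary (plain CI width) are hypothetical simplifications of the article's analysis.

```python
# Minimal sketch (assumed data-generating model): bootstrap CI width of
# logistic-regression coefficients as a function of cohort size.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_cohort(n, beta=(1.0, -0.8)):
    X = rng.normal(size=(n, len(beta)))           # stand-ins for e.g. size ratio, WSS
    logit = -0.5 + X @ np.asarray(beta)
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    return X, y

def bootstrap_ci_width(n_cases, n_boot=500):
    X, y = make_cohort(n_cases)
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_cases, n_cases)   # resample cases with replacement
        if len(np.unique(y[idx])) < 2:
            continue                              # skip degenerate resamples
        fit = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])
        coefs.append(fit.coef_[0])
    lo, hi = np.percentile(coefs, [2.5, 97.5], axis=0)
    return hi - lo                                # 95% bootstrap CI width per coefficient

for n in (120, 200, 500, 900):
    print(n, np.round(bootstrap_ci_width(n), 3))
```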


2016 ◽  
Vol 14 (1) ◽  
pp. 48-58 ◽  
Author(s):  
Qiang Zhang ◽  
Boris Freidlin ◽  
Edward L Korn ◽  
Susan Halabi ◽  
Sumithra Mandrekar ◽  
...  

Background: Futility (inefficacy) interim monitoring is an important component in the conduct of phase III clinical trials, especially in life-threatening diseases. Desirable futility monitoring guidelines allow timely stopping if the new therapy is harmful or if it is unlikely to be shown sufficiently effective were the trial to continue to its final analysis. A number of analytical approaches are used to construct futility monitoring boundaries. The most common approaches are based on conditional power, sequential testing of the alternative hypothesis, or sequential confidence intervals. The resulting futility boundaries vary considerably with respect to the level of evidence required to recommend stopping the study. Purpose: We evaluate the performance of commonly used methods using event histories from completed phase III clinical trials of the Radiation Therapy Oncology Group, Cancer and Leukemia Group B, and North Central Cancer Treatment Group. Methods: We considered published superiority phase III trials with survival endpoints initiated after 1990; 52 studies from different disease sites were available for this analysis. The total sample size and maximum number of events (statistical information) for each study were calculated using the protocol-specified effect size and type I and type II error rates. In addition to the common futility approaches, we considered a recently proposed linear inefficacy boundary approach with an early harm look followed by several lack-of-efficacy analyses. For each futility approach, interim test statistics were generated for three schedules with different analysis frequencies, and early stopping was recommended if the interim result crossed a futility stopping boundary. For trials not demonstrating superiority, the impact of each rule is summarized as savings on the sample size, study duration, and information time scales. Results: For negative studies, our results show that the futility approaches based on testing the alternative hypothesis and on repeated confidence interval rules yielded smaller savings than the other two rules. These boundaries are too conservative, especially during the first half of the study (<50% of information). The conditional power rules are too aggressive during the second half of the study (>50% of information) and may stop a trial even when there is a clinically meaningful treatment effect. The linear inefficacy boundary with three or more interim analyses provided the best results. For positive studies, we demonstrated that none of the futility rules would have stopped the trials. Conclusion: The linear inefficacy boundary futility approach is attractive from statistical, clinical, and logistical standpoints in clinical trials evaluating new anti-cancer agents.
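For reference, the conditional power quantity underlying the first class of rules can be written under the standard Brownian-motion approximation and evaluated either at the originally hypothesized effect or at the current trend. The sketch below implements that formula; the interim values, design drift, and 10% futility threshold are illustrative assumptions, not the stopping rules evaluated in the article.

```python
# Minimal sketch (assumed interim data and threshold): conditional power at an
# interim analysis under the Brownian-motion approximation.
from scipy.stats import norm

def conditional_power(z_interim, info_frac, theta, alpha=0.025):
    """P(final z > z_{1-alpha} | interim z), where theta = E[final z] under the
    assumed effect; theta = z_interim / sqrt(info_frac) gives the current-trend
    version."""
    t = info_frac
    num = norm.ppf(1 - alpha) - z_interim * t ** 0.5 - theta * (1 - t)
    return 1 - norm.cdf(num / (1 - t) ** 0.5)

z, t = 0.4, 0.5                       # weak interim signal at half the information
theta_design = 3.24                   # drift giving ~90% unconditional power at alpha = 0.025
print("CP under design effect :", round(conditional_power(z, t, theta_design), 3))
print("CP under current trend :", round(conditional_power(z, t, z / t ** 0.5), 3))
print("stop for futility (CP < 0.10)?", conditional_power(z, t, z / t ** 0.5) < 0.10)
```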


2021 ◽  
pp. 174077452110329
Author(s):  
Martin Forster ◽  
Stephen Brealey ◽  
Stephen Chick ◽  
Ada Keding ◽  
Belen Corbacho ◽  
...  

Background/Aims: There is growing interest in the use of adaptive designs to improve the efficiency of clinical trials. We apply a Bayesian decision-theoretic model of a sequential experiment, using cost and outcome data from the ProFHER pragmatic trial, and assess the model’s potential for delivering value-based research. Methods: Using parameter values estimated from the ProFHER pragmatic trial, including the costs of carrying out the trial, we establish when the trial could have stopped had the model’s value-based stopping rule been used. We use a bootstrap analysis and a simulation study to assess a range of operating characteristics, which we compare with a fixed-sample-size design that does not allow early stopping. Results: We estimate that application of the model could have stopped the ProFHER trial early, reducing the sample size by about 14%, saving about 5% of the research budget, and resulting in the same technology recommendation as the trial. The bootstrap analysis suggests that the expected sample size would have been 38% lower, saving around 13% of the research budget, with a probability of 0.92 of making the same technology recommendation; it also shows a large degree of variability in the trial’s sample size. Conclusions: Benefits to trial cost stewardship may be achieved by monitoring trial data as they accumulate and using a stopping rule that balances the benefit of obtaining more information through continued recruitment against the cost of obtaining that information. We present recommendations for further research investigating the application of value-based sequential designs.
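The flavor of a value-based stopping rule can be conveyed with a one-step-lookahead check: recruit another block of patients only if the expected value of the information it provides for future patients exceeds the cost of collecting it. The normal-normal model, population size, and per-patient costs below are all illustrative assumptions; the model applied to ProFHER is considerably richer than this sketch.

```python
# Minimal sketch (assumed costs and priors): one-step-lookahead expected value
# of sampling information (EVSI) versus the cost of the next recruitment block.
import numpy as np
from scipy.stats import norm

def evsi_one_step(prior_mean, prior_sd, sigma, n_block, population):
    """EVSI of one more block of n_block observations for an adopt/reject
    decision affecting `population` future patients, with per-patient
    incremental net benefit ~ N(mu, sigma^2) and a normal prior on mu."""
    post_var = 1 / (1 / prior_sd**2 + n_block / sigma**2)
    preposterior_sd = np.sqrt(prior_sd**2 - post_var)    # sd of the updated posterior mean
    z = abs(prior_mean) / preposterior_sd
    # Expected gain from possibly reversing the current adopt/reject decision
    # (standard unit normal loss integral).
    return population * preposterior_sd * (norm.pdf(z) - z * (1 - norm.cdf(z)))

prior_mean, prior_sd = 300.0, 800.0      # current estimate of net benefit per patient (assumed)
sigma, n_block = 4000.0, 25              # outcome sd and size of the next block (assumed)
cost_of_block = 25 * 1500.0              # recruitment and treatment costs (assumed)
evsi = evsi_one_step(prior_mean, prior_sd, sigma, n_block, population=10_000)
print("EVSI:", round(evsi), "cost:", cost_of_block, "continue?", evsi > cost_of_block)
```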


Cancers ◽  
2021 ◽  
Vol 13 (12) ◽  
pp. 3088
Author(s):  
Federica Corso ◽  
Giulia Tini ◽  
Giuliana Lo Presti ◽  
Noemi Garau ◽  
Simone Pietro De Angelis ◽  
...  

Radiomics uses high-dimensional sets of imaging features to predict biological characteristics of tumors and clinical outcomes. The choice of the algorithm used to analyze radiomic features and perform predictions has a high impact on the results, so identifying adequate machine learning methods for radiomic applications is crucial. In this study we aim to identify suitable approaches of analysis for radiomic-based binary predictions, according to sample size, outcome balancing, and the strength of the features–outcome association. Simulated data were obtained by reproducing the correlation structure among 168 radiomic features extracted from Computed Tomography images of 270 Non-Small-Cell Lung Cancer (NSCLC) patients and their association with lymph node status. The performances of six classifiers combined with six feature selection (FS) methods were assessed on the simulated data using AUC (Area Under the Receiver Operating Characteristic Curve), sensitivity, and specificity. For all the FS methods and regardless of the association strength, the tree-based classifiers Random Forest and Extreme Gradient Boosting obtained good performances (AUC ≥ 0.73), showing the best trade-off between sensitivity and specificity. In small samples, performances were generally lower and more variable than in medium and large samples. FS methods generally did not improve performances. Thus, in radiomic studies, we suggest evaluating the choice of FS method and classifier in light of the specific sample size, outcome balance, and association strength.
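A small-scale version of this comparison can be sketched by simulating correlated "radiomic" features and cross-validating a few classifier/FS combinations on AUC. The correlation structure is randomly generated rather than estimated from the NSCLC CT features, sklearn's gradient boosting stands in for XGBoost, and the signal strength and class balance are assumptions for illustration.

```python
# Minimal sketch (assumed data-generating process): cross-validated AUC for a
# few classifier / feature-selection combinations on simulated correlated features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)

def simulate_radiomics(n=270, p=168, informative=10, effect=0.5, prevalence=0.4):
    """Correlated Gaussian features; the first `informative` ones shift with the outcome."""
    corr = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)-like correlation
    X = rng.multivariate_normal(np.zeros(p), corr, size=n)
    y = rng.binomial(1, prevalence, size=n)
    X[:, :informative] += effect * y[:, None]
    return X, y

X, y = simulate_radiomics()
models = {
    "logistic + top-20 FS": make_pipeline(SelectKBest(f_classif, k=20),
                                          LogisticRegression(max_iter=2000)),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:22s} AUC = {auc.mean():.3f} +/- {auc.std():.3f}")
```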

