Abstract 17185: Modest Performance of Heart Failure Clinical Prediction Models: A Systematic External Validation Study

Circulation ◽  
2018 ◽  
Vol 138 (Suppl_1) ◽  
Author(s):  
Jenica N Upshaw ◽  
Jason Nelson ◽  
Benjamin Wessler ◽  
Benjamin Koethe ◽  
Christine Lundquist ◽  
...  

Introduction: Most heart failure (HF) clinical prediction models (CPMs) have not been independently externally validated. We sought to test the performance of HF models in a diverse population using a systematic approach. Methods: A systematic review identified CPMs predicting outcomes for patients with HF. Individual patient data from 5 large publicly available clinical trials enrolling patients with chronic HF were matched to published CPMs based on similarity in populations and on the outcome and predictor variables available in the clinical trial databases. CPM performance was evaluated for discrimination (c-statistic, % relative change in c-statistic) and calibration (Harrell’s E and E90, the mean and the 90% quantile of the error distribution from the loess-smoothed observed values) for the original and recalibrated models. Results: Of 135 HF CPMs reviewed, we identified 45 CPM-trial pairs involving 13 unique CPMs. The outcome was mortality for all of the models with a trial match. In external validations, the median c-statistic was 0.595 (IQR, 0.563 to 0.630), a median relative decrease of -57% (IQR, -49% to -71%) compared with the c-statistic reported in the derivation cohort. Overall, the median Harrell’s E was 0.09 (IQR, 0.04 to 0.135) and E90 was 0.11 (IQR, 0.07 to 0.21). Recalibration substantially improved calibration, with a median change in Harrell’s E of -35% (IQR, 0 to -75%) for updating the intercept and -56% (IQR, -17% to -75%) for updating the intercept and slope. Refitting model covariates improved the median c-statistic by 38%, to 0.629 (IQR, 0.613 to 0.649). Conclusion: For HF CPMs, independent external validations demonstrate that CPMs perform substantially worse than originally reported, although with significant heterogeneity. Recalibration of the intercept and slope improved model calibration.
These results underscore the need to carefully consider the derivation cohort characteristics when using published CPMs.
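Recalibration of the kind described above is straightforward to implement. The sketch below is my own illustration of the standard logistic-recalibration technique, not the authors' code: it refits an intercept (and optionally a slope) on the logit of the original model's predictions, using only NumPy and SciPy.

```python
import numpy as np
from scipy.optimize import minimize

def recalibrate(p_orig, y, update_slope=True):
    """Logistic recalibration of predicted probabilities on new data.

    Fits y ~ a + b * logit(p_orig) by maximum likelihood;
    intercept-only updating fixes b = 1. Returns recalibrated
    probabilities.
    """
    lp = np.log(p_orig / (1 - p_orig))  # original linear predictor (logit scale)

    def nll(a, b):
        z = a + b * lp
        # numerically stable Bernoulli negative log-likelihood
        return np.sum(np.logaddexp(0.0, z) - y * z)

    if update_slope:
        a, b = minimize(lambda t: nll(t[0], t[1]), x0=[0.0, 1.0]).x
    else:
        a = minimize(lambda t: nll(t[0], 1.0), x0=[0.0]).x[0]
        b = 1.0
    return 1.0 / (1.0 + np.exp(-(a + b * lp)))
```

Intercept-only updating corrects the average predicted risk (calibration-in-the-large); updating the slope as well additionally shrinks or expands over- or under-dispersed predictions, which is why the abstract reports a larger calibration improvement for the intercept-and-slope update.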

2021 ◽  
Author(s):  
Jenica N. Upshaw ◽  
Jason Nelson ◽  
Benjamin Koethe ◽  
Jinny G. Park ◽  
Hannah McGinnes ◽  
...  

Background: Most heart failure (HF) clinical prediction models (CPMs) have not been externally validated. Methods: We performed a systematic review to identify CPMs predicting outcomes in HF, stratified into acute and chronic HF CPMs. External validations were performed using individual patient data from 8 large HF trials (1 acute, 7 chronic). CPM discrimination (c-statistic, % relative change in c-statistic), calibration (calibration slope, Harrell’s E, E90), and net benefit were evaluated for each CPM with and without recalibration. Results: Of 135 HF CPMs screened, 24 (18%) were compatible with the populations, predictors, and outcomes of the trials, and 42 external validations were performed (14 acute HF, 28 chronic HF). The median derivation c-statistic of acute HF CPMs was 0.76 (IQR, 0.75 to 0.8), the validation c-statistic was 0.67 (0.65 to 0.68), and the model-based c-statistic was 0.68 (0.66 to 0.76); hence, most of the apparent decrement in model performance was due to narrower case-mix in the validation cohorts compared with the development cohorts. The median derivation c-statistic for chronic HF CPMs was 0.76 (0.74 to 0.8), the validation c-statistic 0.61 (0.6 to 0.63), and the model-based c-statistic 0.68 (0.62 to 0.71), suggesting that the decrement in model performance was only partially due to case-mix heterogeneity. Calibration was generally poor: the median E (standardized by outcome rate) was 0.5 (0.4 to 2.2) for acute HF CPMs and 0.5 (0.3 to 0.7) for chronic HF CPMs. Updating the intercept alone led to a significant improvement in calibration in acute HF CPMs, but not in chronic HF CPMs. Net benefit analysis showed potential for harm in using CPMs when the decision threshold was not near the overall outcome rate, but this improved with model recalibration. Conclusions: Only a small minority of published CPMs contained variables and outcomes that were compatible with the clinical trial datasets.
For acute HF CPMs, discrimination is largely preserved after adjusting for case-mix; however, the risk of net harm is substantial without model recalibration for both acute and chronic HF CPMs.
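Harrell's E and E90 summarize the gap between predicted risk and a smoothed estimate of observed risk. A minimal sketch of the calculation (my own illustration, not the authors' code), assuming `statsmodels` is available for the lowess smoother:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def harrell_e(p, y, frac=2/3):
    """Harrell's E (mean) and E90 (90th percentile) calibration errors.

    Smooths the binary outcomes against predicted risk with lowess to
    estimate observed risk, then summarizes |predicted - observed|.
    """
    # return_sorted=False returns fitted values in input order
    observed = lowess(y, p, frac=frac, return_sorted=False)
    err = np.abs(p - observed)
    return err.mean(), np.quantile(err, 0.90)
```

Standardizing E by the outcome rate, as the abstract does, is then just `e_avg / y.mean()`, which makes calibration error comparable across cohorts with different event rates.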


2021 ◽  
Author(s):  
Benjamin S. Wessler ◽  
Jason Nelson ◽  
Jinny G. Park ◽  
Hannah McGinnes ◽  
Jenica Upshaw ◽  
...  

Abstract Purpose: It is increasingly recognized that clinical prediction models (CPMs) often do not perform as expected when they are tested on new databases. Independent external validations of CPMs are recommended but often not performed. Here we conduct independent external validations of acute coronary syndrome (ACS) CPMs. Methods: A systematic review identified CPMs predicting outcomes for patients with ACS. Independent external validations were performed by evaluating model performance using individual patient data from 5 large clinical trials. CPM performance with and without various recalibration techniques was evaluated, with a focus on CPM discrimination (c-statistic, % relative change in c-statistic) as well as calibration (Harrell’s Eavg, E90, net benefit). Results: Of 269 ACS CPMs screened, 23 (8.5%) were compatible with at least one of the trials, and 28 clinically appropriate external validations were performed. The median c-statistic of the CPMs in the derivation cohorts was 0.76 (IQR, 0.74 to 0.78). The median c-statistic in these external validations was 0.70 (IQR, 0.66 to 0.71), reflecting a 24% decrement in discrimination. However, this decrement was due mostly to narrower case-mix in the validation cohorts compared with the derivation cohorts, as reflected in the median model-based c-statistic of 0.71 (IQR, 0.66 to 0.75). The median calibration slope in external validations was 0.84 (IQR, 0.72 to 0.98) and the median Eavg (standardized by the outcome rate) was 0.4 (IQR, 0.3 to 0.8). Net benefit analysis indicated that most CPMs had a high risk of causing net harm when not recalibrated, particularly for decision thresholds not near the overall outcome rate. Conclusion: Independent external validations of published ACS CPMs demonstrate that models tested in our sample had relatively well-preserved discrimination but poor calibration when externally validated.
Applying ‘off-the-shelf’ CPMs often risks net harm unless models are recalibrated to the populations on which they are used.
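Net benefit weighs true positives against false positives at a chosen decision threshold. A hedged sketch of the standard decision-curve calculation (illustrative only, not the authors' code):

```python
import numpy as np

def net_benefit(p, y, threshold):
    """Net benefit of treating patients with predicted risk >= threshold.

    NB = TP/n - (pt / (1 - pt)) * FP/n, where pt is the threshold,
    so false positives are weighted by the odds of the threshold.
    """
    n = len(y)
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - (threshold / (1 - threshold)) * fp / n

def net_benefit_treat_all(y, threshold):
    """Net benefit of the 'treat everyone' default strategy."""
    prev = np.mean(y)
    return prev - (1 - prev) * threshold / (1 - threshold)
```

A model is only useful at thresholds where its net benefit exceeds both treat-all and treat-none (which has net benefit 0); miscalibrated models can dip below these defaults, which is the net harm the abstracts describe at thresholds far from the overall outcome rate.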


2021 ◽  
Author(s):  
Gaurav Gulati ◽  
Riley J Brazil ◽  
Jason Nelson ◽  
David van Klaveren ◽  
Christine M. Lundquist ◽  
...  

Abstract Background: Clinical prediction models (CPMs) are used to inform treatment decisions for the primary prevention of cardiovascular disease. We aimed to assess the performance of such CPMs in fully independent cohorts. Methods and Results: 63 models predicting outcomes for patients at risk of cardiovascular disease from the Tufts PACE CPM Registry were selected for external validation on publicly available data from up to 4 broadly inclusive primary prevention clinical trials. For each CPM-trial pair, we assessed model discrimination, calibration, and net benefit. Results were stratified based on the relatedness of derivation and validation cohorts, and net benefit was reassessed after updating the model intercept, the slope, or the full model via complete re-estimation. The median c-statistic of the CPMs decreased from 0.77 (IQR, 0.72-0.78) in the derivation cohorts to 0.63 (IQR, 0.58-0.66) when externally validated. The validation c-statistic was higher when derivation and validation cohorts were related than when they were distantly related (0.67 vs 0.60, p < 0.001). The calibration slope was also higher in related cohorts than in distantly related cohorts (0.69 vs 0.58, p < 0.001). Net benefit analysis suggested a substantial likelihood of harm when models were externally applied, but this likelihood decreased after model updating. Conclusions: Discrimination and calibration decrease significantly when CPMs for primary prevention of cardiovascular disease are tested in external populations, particularly when the population is only distantly related to the derivation population. Poorly calibrated predictions lead to poor decision making. Model updating can reduce the likelihood of harmful decision making and is needed to realize the full potential of risk-based decision making in new settings.


2020 ◽  
Vol 35 (1) ◽  
pp. 100-116 ◽  
Author(s):  
M B Ratna ◽  
S Bhattacharya ◽  
B Abdulrahim ◽  
D J McLernon

Abstract STUDY QUESTION What are the best-quality clinical prediction models in IVF (including ICSI) treatment to inform clinicians and their patients of their chance of success? SUMMARY ANSWER The review recommends the McLernon post-treatment model for predicting the cumulative chance of live birth over and up to six complete cycles of IVF. WHAT IS KNOWN ALREADY Prediction models in IVF have not found widespread use in routine clinical practice. This could be due to their limited predictive accuracy and clinical utility. A previous systematic review of IVF prediction models, published a decade ago and never updated, did not assess the methodological quality of existing models, nor did it provide recommendations for the best-quality models for use in clinical practice. STUDY DESIGN, SIZE, DURATION The electronic databases OVID MEDLINE, OVID EMBASE and the Cochrane Library were searched systematically for primary articles published from 1978 to January 2019 using search terms on the development and/or validation (internal and external) of models predicting pregnancy or live birth. No language or any other restrictions were applied. PARTICIPANTS/MATERIALS, SETTING, METHODS The PRISMA flowchart was used for the inclusion of studies after screening. All studies reporting on the development and/or validation of IVF prediction models were included. Articles reporting on women whose treatment involved donor eggs or sperm, or surrogacy, were excluded. The CHARMS checklist was used to extract and critically appraise the methodological quality of the included articles. We evaluated model performance by assessing c-statistics and calibration plots, and assessed correct reporting by calculating the percentage of the 22 TRIPOD checklist items met in each study. MAIN RESULTS AND THE ROLE OF CHANCE We identified 33 publications reporting on 35 prediction models. Seventeen articles had been published since the last systematic review.
The quality of models has improved over time with regard to clinical relevance, methodological rigour and utility. TRIPOD scores for the included studies ranged from 29% to 95%, and the c-statistics of the externally validated models ranged between 0.55 and 0.77. Most of the models predicted the chance of pregnancy/live birth for a single fresh cycle. Six models aimed to predict the chance of pregnancy/live birth per individual treatment cycle, and three predicted more clinically relevant outcomes such as cumulative pregnancy/live birth. The McLernon (pre- and post-treatment) models predict the cumulative chance of live birth over multiple complete cycles of IVF per woman, where a complete cycle includes all fresh and frozen embryo transfers from the same episode of ovarian stimulation. The McLernon models were developed using national UK data and had the highest TRIPOD scores, and the post-treatment model performed best on external validation. LIMITATIONS, REASONS FOR CAUTION To assess the reporting quality of all included studies we used the TRIPOD checklist, but many of the earlier IVF prediction models were developed and validated before the formal TRIPOD reporting guideline was published in 2015. It should also be noted that two of the authors of this systematic review are authors of the McLernon model article. However, we feel we have conducted our review and made our recommendations using a fair and transparent systematic approach. WIDER IMPLICATIONS OF THE FINDINGS This study provides a comprehensive picture of the evolving quality of IVF prediction models. Clinicians should use the most appropriate model to suit their patients’ needs. We recommend the McLernon post-treatment model as a counselling tool to inform couples of their predicted chance of success over and up to six complete cycles. However, it requires further external validation to assess applicability in countries with different IVF practices and policies.
STUDY FUNDING/COMPETING INTEREST(S) The study was funded by the Elphinstone Scholarship Scheme and the Assisted Reproduction Unit, University of Aberdeen. Both D.J.M. and S.B. are authors of the McLernon model article and S.B. is Editor in Chief of Human Reproduction Open. They have completed and submitted the ICMJE forms for Disclosure of potential Conflicts of Interest. The other co-authors have no conflicts of interest to declare. REGISTRATION NUMBER N/A


Endocrine ◽  
2021 ◽  
Author(s):  
Olivier Zanier ◽  
Matteo Zoli ◽  
Victor E. Staartjes ◽  
Federica Guaraldi ◽  
Sofia Asioli ◽  
...  

Abstract Purpose Biochemical remission (BR), gross total resection (GTR), and intraoperative cerebrospinal fluid (CSF) leaks are important metrics in transsphenoidal surgery for acromegaly, and prediction of their likelihood using machine learning would be clinically advantageous. We aim to develop and externally validate clinical prediction models for outcomes after transsphenoidal surgery for acromegaly. Methods Using data from two registries, we develop and externally validate machine learning models for GTR, BR, and CSF leaks after endoscopic transsphenoidal surgery in acromegalic patients. For model development, a registry from Bologna, Italy, was used. External validation was then performed using data from Zurich, Switzerland. Gender, age, prior surgery, and Hardy and Knosp classifications were used as input features. Discrimination and calibration metrics were assessed. Results The derivation cohort consisted of 307 patients (43.3% male; mean [SD] age, 47.2 [12.7] years). GTR was achieved in 226 (73.6%) and BR in 245 (79.8%) patients. In the external validation cohort of 46 patients, 31 (75.6%) achieved GTR and 31 (77.5%) achieved BR. Area under the curve (AUC) at external validation was 0.75 (95% confidence interval: 0.59–0.88) for GTR, 0.63 (0.40–0.82) for BR, and 0.77 (0.62–0.91) for intraoperative CSF leaks. While prior surgery was the most important variable for prediction of GTR, age and Hardy grading contributed most to the predictions of BR and CSF leaks, respectively. Conclusions Gross total resection, biochemical remission, and CSF leaks remain hard to predict, but machine learning offers potential in helping to tailor surgical therapy. We demonstrate the feasibility of developing and externally validating clinical prediction models for these outcomes after surgery for acromegaly and lay the groundwork for development of a multicenter model with more robust generalization.
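AUC confidence intervals like those reported above are commonly obtained by bootstrapping the validation cohort. A minimal sketch of a percentile bootstrap (my own illustration; the study does not state its exact interval method):

```python
import numpy as np

def auc(p, y):
    """Concordance probability: a random event case is ranked above
    a random non-event case, with ties counted as half."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def bootstrap_auc_ci(p, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y[idx].min() == y[idx].max():
            continue  # resample contains a single class; AUC undefined
        stats.append(auc(p[idx], y[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The wide intervals in the abstract (e.g. 0.40–0.82 for BR) are what this procedure produces when the validation cohort is small, as with the 46-patient Zurich cohort here.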


2019 ◽  
Vol 14 (4) ◽  
pp. 506-514 ◽  
Author(s):  
Pavan Kumar Bhatraju ◽  
Leila R. Zelnick ◽  
Ronit Katz ◽  
Carmen Mikacenic ◽  
Susanna Kosamo ◽  
...  

Background and objectives: Critically ill patients with worsening AKI are at high risk for poor outcomes. Predicting which patients will experience progression of AKI remains elusive. We sought to develop and validate a risk model for predicting severe AKI within 72 hours after intensive care unit admission. Design, setting, participants, & measurements: We applied least absolute shrinkage and selection operator (LASSO) regression methodology to two prospectively enrolled, critically ill cohorts of patients who met criteria for the systemic inflammatory response syndrome, enrolled within 24–48 hours after hospital admission. The risk models were derived and internally validated in 1075 patients and externally validated in 262 patients. Demographics and laboratory and plasma biomarkers of inflammation or endothelial dysfunction were used in the prediction models. Severe AKI was defined as Kidney Disease: Improving Global Outcomes (KDIGO) stage 2 or 3. Results: Severe AKI developed in 62 (8%) patients in the derivation, 26 (8%) patients in the internal validation, and 15 (6%) patients in the external validation cohorts. In the derivation cohort, a three-variable model (age, cirrhosis, and soluble TNF receptor-1 concentrations [ACT]) had a c-statistic of 0.95 (95% confidence interval [95% CI], 0.91 to 0.97). The ACT model performed well in the internal (c-statistic, 0.90; 95% CI, 0.82 to 0.96) and external (c-statistic, 0.93; 95% CI, 0.89 to 0.97) validation cohorts. The ACT model had moderate positive predictive values (0.50–0.95) and high negative predictive values (0.94–0.95) for severe AKI in all three cohorts. Conclusions: ACT is a simple, robust model that could be applied to improve risk prognostication and better target clinical trial enrollment in critically ill patients with AKI.
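Predictive values like those reported for the ACT model follow directly from a 2x2 classification at a risk threshold. A small sketch (illustrative only; the threshold in the test is hypothetical, not the study's):

```python
import numpy as np

def ppv_npv(p, y, threshold):
    """Positive and negative predictive values at a risk threshold."""
    pred_pos = p >= threshold
    tp = np.sum(pred_pos & (y == 1))
    fp = np.sum(pred_pos & (y == 0))
    tn = np.sum(~pred_pos & (y == 0))
    fn = np.sum(~pred_pos & (y == 1))
    ppv = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    npv = tn / (tn + fn) if (tn + fn) > 0 else float("nan")
    return ppv, npv
```

Unlike the c-statistic, PPV and NPV depend on outcome prevalence, which is why they must be reported per cohort; with only 6–8% of patients developing severe AKI, high NPVs are easier to achieve than high PPVs.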


Author(s):  
Marcus Taylor ◽  
Bartłomiej Szafron ◽  
Glen P Martin ◽  
Udo Abah ◽  
Matthew Smith ◽  
...  

Abstract OBJECTIVES National guidelines advocate the use of clinical prediction models to estimate perioperative mortality for patients undergoing lung resection. Several models have been developed that may potentially be useful but contemporary external validation studies are lacking. The aim of this study was to validate existing models in a multicentre patient cohort. METHODS The Thoracoscore, Modified Thoracoscore, Eurolung, Modified Eurolung, European Society Objective Score and Brunelli models were validated using a database of 6600 patients who underwent lung resection between 2012 and 2018. Models were validated for in-hospital or 30-day mortality (depending on intended outcome of each model) and also for 90-day mortality. Model calibration (calibration intercept, calibration slope, observed to expected ratio and calibration plots) and discrimination (area under receiver operating characteristic curve) were assessed as measures of model performance. RESULTS Mean age was 66.8 years (±10.9 years) and 49.7% (n = 3281) of patients were male. In-hospital, 30-day, perioperative (in-hospital or 30-day) and 90-day mortality were 1.5% (n = 99), 1.4% (n = 93), 1.8% (n = 121) and 3.1% (n = 204), respectively. Model area under the receiver operating characteristic curves ranged from 0.67 to 0.73. Calibration was inadequate in five models and mortality was significantly overestimated in five models. No model was able to adequately predict 90-day mortality. CONCLUSIONS Five of the validated models were poorly calibrated and had inadequate discriminatory ability. The modified Eurolung model demonstrated adequate statistical performance but lacked clinical validity. Development of accurate models that can be used to estimate the contemporary risk of lung resection is required.
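The observed-to-expected (O/E) ratio used above as a calibration measure is simple to compute from predicted risks and outcomes. A minimal sketch (illustrative, not the study's code):

```python
import numpy as np

def observed_expected_ratio(p, y):
    """O/E ratio: observed events divided by model-expected events.

    Values below 1 mean the model overestimates risk (as five of the
    models validated here overestimated mortality); values above 1
    mean it underestimates risk.
    """
    return np.sum(y) / np.sum(p)
```

With mortality rates of only 1–3%, as in this cohort, even small absolute miscalibration produces O/E ratios far from 1, which is one reason calibration was inadequate in most of the validated models.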

