The Framing of machine learning risk prediction models illustrated by evaluation of sepsis in general wards

2021, Vol 4 (1)
Author(s):  
Simon Meyer Lauritsen
Bo Thiesson
Marianne Johansson Jørgensen
Anders Hammerich Riis
Ulrick Skipper Espelund
...  

Abstract Problem framing is critical to developing risk prediction models, because all subsequent development work and evaluation take place within the context of how the problem has been framed; explicit documentation of framing choices also makes it easier to compare evaluation metrics between published studies. In this work, we introduce the basic concepts of framing, including prediction windows, observation windows, window shifts and event-triggers for a prediction; these choices strongly affect the risk of clinician fatigue caused by false positives. Building on this, we apply four different framing structures to the same generic dataset, using a sepsis risk prediction model as an example, and evaluate how framing affects model performance and learning. Our results show that an apparently good model with strong evaluation results in both discrimination and calibration is not necessarily clinically usable. Therefore, it is important to assess the results of objective evaluations within the context of more subjective evaluations of how a model is framed.
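To make the windowing terms concrete, here is a minimal sketch (hypothetical function and parameter names, not the authors' code) of how an observation window, a gap, a prediction window and a window shift carve one patient's hourly time series into labelled samples:

```python
import numpy as np

def frame_samples(values, events, obs_len, gap, pred_len, shift):
    """Slice one patient's hourly series into (observation, label) pairs.

    obs_len  : hours the model may observe
    gap      : hours between the end of observation and the prediction window
    pred_len : hours in which an onset counts as a positive label
    shift    : hours the whole frame slides forward between samples
    """
    samples = []
    start = 0
    while start + obs_len + gap + pred_len <= len(values):
        obs = values[start : start + obs_len]
        window = events[start + obs_len + gap : start + obs_len + gap + pred_len]
        samples.append((obs, bool(np.any(window))))
        start += shift
    return samples

# Hypothetical 48-hour admission: observe 6 h of vitals, predict onset 4-8 h
# ahead, re-score every hour. A shorter shift yields more, but more correlated,
# samples, and a shorter prediction window yields fewer positives to learn from.
rng = np.random.default_rng(0)
vitals = rng.normal(size=48)
onsets = rng.random(48) > 0.98
print(len(frame_samples(vitals, onsets, obs_len=6, gap=4, pred_len=4, shift=1)))
```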

Author(s):  
Byron C. Jaeger
Ryan Cantor
Venkata Sthanam
Rongbing Xie
James K. Kirklin
...  

Background: Risk prediction models play an important role in clinical decision making. When developing risk prediction models, practitioners often impute missing values to the mean. We evaluated the impact of applying other strategies to impute missing values on the prognostic accuracy of downstream risk prediction models, that is, models fitted to the imputed data. A secondary objective was to compare the accuracy of imputation methods based on artificially induced missing values. To complete these objectives, we used data from the Interagency Registry for Mechanically Assisted Circulatory Support. Methods: We applied 12 imputation strategies in combination with 2 different modeling strategies for mortality and transplant risk prediction following surgery to receive mechanical circulatory support. Model performance was evaluated using Monte-Carlo cross-validation and measured, based on outcomes 6 months following surgery, using the scaled Brier score, concordance index, and calibration error. We used Bayesian hierarchical models to compare model performance. Results: Multiple imputation with random forests emerged as a robust strategy to impute missing values, increasing model concordance by 0.0030 (25th–75th percentile: 0.0008–0.0052) compared with imputation to the mean for mortality risk prediction using a downstream proportional hazards model. The posterior probability that single and multiple imputation using random forests would improve concordance versus mean imputation was 0.464 and >0.999, respectively. Conclusions: Selecting an optimal strategy to impute missing values, such as random forests, and applying multiple imputation can improve the prognostic accuracy of downstream risk prediction models.
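The registry data are not public, so the following is only a minimal sketch of one common way to realize multiple imputation with random forests (a missForest-style chained imputer, seeded differently per imputation) using scikit-learn; the study's exact implementation may differ. A downstream risk model would be fitted to each completed dataset and the results pooled.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

def impute_multiply(X, n_imputations=5):
    """Produce several random-forest-based completions of X, one per seed."""
    completed = []
    for seed in range(n_imputations):
        imputer = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=50, random_state=seed),
            max_iter=10,
            random_state=seed,
        )
        completed.append(imputer.fit_transform(X))
    return completed

# Toy matrix with two missing cells; each completion fills them differently,
# and the spread across completions reflects imputation uncertainty.
X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [7.0, 8.0]])
for d in impute_multiply(X, n_imputations=3):
    print(d[1, 1], d[2, 0])
```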


2021
Author(s):  
Harvineet Singh
Vishwali Mhasawade
Rumi Chunara

Importance: Modern predictive models require large amounts of data for training and evaluation, which can result in models that are specific to certain locations, the populations in them, and local clinical practices. Yet, best practices and guidelines for clinical risk prediction models have not yet considered such challenges to generalizability. Objectives: To investigate changes in measures of predictive discrimination, calibration, and algorithmic fairness when transferring models for predicting in-hospital mortality across ICUs in different populations, and to study the reasons for the lack of generalizability in these measures. Design, Setting, and Participants: In this multi-center cross-sectional study, electronic health records from 179 hospitals across the US, covering 70,126 hospitalizations, were analyzed. Data collection ranged from 2014 to 2015. Main Outcomes and Measures: The main outcome is in-hospital mortality. The generalization gap, defined as the difference in model performance metrics across hospitals, is computed for discrimination and calibration metrics, namely the area under the receiver operating characteristic curve (AUC) and the calibration slope. To assess model performance by race, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm, "Fast Causal Inference" (FCI), that infers paths of causal influence while identifying potential influences associated with unmeasured variables. Results: In-hospital mortality rates differed in the range of 3.9%-9.3% (1st-3rd quartile) across hospitals. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st to 3rd quartile; median 0.801); the calibration slope from 0.725 to 0.983 (1st to 3rd quartile; median 0.853); and the disparity in false negative rates from 0.046 to 0.168 (1st to 3rd quartile; median 0.092). When transferring models across geographies, AUC ranged from 0.795 to 0.813 (1st to 3rd quartile; median 0.804); the calibration slope from 0.904 to 1.018 (1st to 3rd quartile; median 0.968); and the disparity in false negative rates from 0.018 to 0.074 (1st to 3rd quartile; median 0.040). The distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The distributions of the race variable and of some clinical variables (vitals, labs and surgery) shifted by hospital and region, and the race variable also mediated differences in the relationship between clinical variables and mortality across hospitals and regions. Conclusions and Relevance: Group-specific metrics should be assessed during generalizability checks to identify potential harms to groups. To develop methods that improve and guarantee the performance of prediction models in new environments for groups and individuals, a better understanding and provenance of health processes, as well as of the data-generating processes by sub-group, are needed to identify and mitigate sources of variation.
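As a rough sketch of the evaluation described above (hypothetical helper names; not the authors' pipeline), the per-hospital metrics whose across-site differences form the generalization gap could be computed like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def calibration_slope(y_true, p_pred):
    """Slope from regressing the outcome on the logit of predicted risk;
    a slope below 1 suggests predictions are too extreme at the new site."""
    p = np.clip(p_pred, 1e-6, 1 - 1e-6)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    return LogisticRegression(C=1e6).fit(logit, y_true).coef_[0, 0]

def fnr(y_true, p_pred, threshold=0.5):
    """False negative rate at a fixed decision threshold."""
    pos = y_true == 1
    return float(np.mean(p_pred[pos] < threshold))

def transfer_metrics(model, X_test, y_test, group):
    """Discrimination, calibration, and FNR disparity at a test hospital."""
    p = model.predict_proba(X_test)[:, 1]
    by_group = [fnr(y_test[group == g], p[group == g]) for g in np.unique(group)]
    return {"auc": roc_auc_score(y_test, p),
            "calibration_slope": calibration_slope(y_test, p),
            "fnr_disparity": max(by_group) - min(by_group)}

# Synthetic demo: train at one "hospital", evaluate at a covariate-shifted one.
rng = np.random.default_rng(1)
X_tr, X_te = rng.normal(size=(500, 3)), rng.normal(loc=0.3, size=(500, 3))
y_tr = (X_tr[:, 0] + rng.normal(size=500) > 0).astype(int)
y_te = (X_te[:, 0] + rng.normal(size=500) > 0).astype(int)
grp = rng.integers(0, 2, size=500)
print(transfer_metrics(LogisticRegression().fit(X_tr, y_tr), X_te, y_te, grp))
```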


2021, Vol 8 (Supplement_1), pp. S716–S717
Author(s):  
Andras Farkas
Arsheena Yassin
Hendrik Sy
Kristy Huang
Iana Stein
...  

Abstract Background Accurately predicting the presence of carbapenem-resistant Enterobacterales (CRE) in hospitalized patients would support timely initiation of CRE-active agents. The aim of this study is to determine how reliably existing risk prediction models identify patients likely to require empiric anti-CRE treatment; preliminary results are presented herein. Methods A systematic search identified all existing CRE prediction models for validation in our patient population. Medical records of hospitalized patients within the Mount Sinai Health System in New York were subsequently reviewed. Data were gathered on model predictors, baseline demographics, clinical information, microbiology results, antibiotic utilization history and index infection. Besides calculating the AUROC, the main outcome of our study was to establish optimal prediction score cutoffs and the false positive rates (FPR) at which model performance maintains a false negative rate (FNR) of < 10%, < 20% and < 30%, respectively. Results Twelve models were retained for validation. We identified 106 patients, 41 of whom were treated for a CRE infection. Previous admission, organ transplantation, CKD, infection type, and carbapenem use were baseline variables that differed significantly between the groups treated for a CRE or non-CRE related infection (Table 1). The models' ability to discriminate varied, as evidenced by the AUROC range of 0.5 to 0.77 (Figure 1), suggesting the Seligman et al. model as the overall best. When evaluated at the pre-specified FNR intervals of < 10%, < 20% and < 30%, the models by Lodise et al., Seligman et al., and Vazquez-Guillamet et al. produced the best FPR, respectively (Table 2). Table 1. Baseline characteristics Table 2. Model performance Figure 1. AUROCs Conclusion The discriminative ability of the risk prediction models varied. The model by Lodise et al. appears most useful when a low risk of missing a CRE case is required, while at a moderate to high tolerated risk of missing a CRE case (20% and 30% FNR), the methods by Seligman et al. and Vazquez-Guillamet et al. are most desirable, as they minimize the chance of over-treatment. Additional work to increase the sample size and to evaluate the models' inter-rater reliability is currently ongoing. Disclosures All Authors: No reported disclosures
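The cutoff analysis can be illustrated on synthetic scores (not the study's data): scan the ROC operating points for the cutoff that keeps the false negative rate under each target, and report the false positive rate paid there.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cutoff_at_fnr(y_true, scores, max_fnr):
    """Highest score cutoff whose false negative rate stays below max_fnr,
    together with the false positive rate paid at that cutoff."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    ok = (1 - tpr) < max_fnr           # FNR = 1 - sensitivity
    i = int(np.argmax(ok))             # thresholds are sorted high to low
    return thresholds[i], fpr[i]

# Hypothetical score distributions: 41 CRE-treated vs 65 non-CRE patients.
rng = np.random.default_rng(2)
y = np.r_[np.ones(41), np.zeros(65)]
s = np.r_[rng.normal(1.0, 1.0, 41), rng.normal(0.0, 1.0, 65)]
for target in (0.10, 0.20, 0.30):
    cut, f = cutoff_at_fnr(y, s, target)
    print(f"FNR < {target:.0%}: cutoff = {cut:.2f}, FPR = {f:.1%}")
```

Relaxing the tolerated FNR moves the cutoff up and the FPR down, which is exactly the trade-off between missed CRE cases and over-treatment discussed in the conclusion.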


2010, Vol 24 (6), pp. 602–607
Author(s):  
Sander M. J. Van Kuijk
Simone J. S. Sep
Patty J. Nelemans
Luc J. M. Smits

Circulation, 2014, Vol 129 (suppl_1)
Author(s):  
Mary E Lacy
Gregory Wellenius
Charles B Eaton
Eric B Loucks
Adolfo Correa
...  

Background: In 2010, the American Diabetes Association (ADA) updated its diagnostic criteria for diabetes to include hemoglobin A1c (A1c). However, the appropriateness of these criteria in African Americans (AAs) is unclear, as A1c may not reflect glycemic control as accurately in AAs as in whites. Moreover, existing diabetes risk prediction models were developed in populations composed primarily of whites. Our objectives were to (1) examine the predictive power of existing diabetes risk prediction models in the Jackson Heart Study (JHS), a prospective cohort of 5,301 AA adults, and (2) explore the impact of incorporating A1c into these models. Methods: We selected 3 widely used diabetes risk prediction models and examined their ability to predict 5-year diabetes risk among 3,185 JHS participants who were free of diabetes at baseline and returned for the 5-year follow-up visit. Incident diabetes was identified at follow-up based on current antidiabetic medications, fasting glucose ≥126 mg/dl, or A1c ≥6.5%. We evaluated model performance using model discrimination (C-statistic) and reclassification (net reclassification index (NRI) and integrated discrimination improvement (IDI)). For each of the 3 models, performance in JHS was evaluated using (1) the covariates identified in the original published model and (2) the published covariates plus A1c. Results: Of 3,185 participants (mean age 53.7; 64.0% female), 9.8% (n=311) developed diabetes over 5 years of follow-up. Each diabetes prediction model suffered a drop in predictive power when applied to JHS using ADA 2010 criteria (Table 1). The performance of all 3 models improved significantly with the addition of A1c, as evidenced by the increase in C-statistic and improvement in reclassification. Conclusion: Despite evidence that A1c may not reflect glycemic control as accurately in AAs as in whites, adding A1c to existing diabetes risk prediction models developed in primarily white populations significantly improved the 5-year predictive power of all 3 models among AAs in the JHS.
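The model comparison lends itself to a short sketch. Below, a hypothetical base model is refitted with and without A1c on synthetic data (illustrative covariate names, not the published models' variables), comparing only the apparent in-sample C-statistic; the study itself also assessed reclassification via NRI and IDI.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic cohort in which A1c carries signal beyond the base covariates.
rng = np.random.default_rng(3)
n = 3000
age, bmi, glucose, a1c = (rng.normal(size=n) for _ in range(4))
risk = -2.2 + 0.4 * age + 0.3 * bmi + 0.5 * glucose + 0.6 * a1c
y = (rng.random(n) < 1 / (1 + np.exp(-risk))).astype(int)

X_base = np.c_[age, bmi, glucose]
X_plus = np.c_[X_base, a1c]
for name, X in (("published covariates", X_base), ("covariates + A1c", X_plus)):
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    print(f"{name}: apparent C-statistic = {roc_auc_score(y, p):.3f}")
```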


2019, Vol 40 (Supplement_1)
Author(s):  
B Arshi
J C Van Den Berge
B Van Dijk
J W Deckers
M A Ikram
...  

Abstract Background In 2013, the American College of Cardiology (ACC) and the American Heart Association (AHA) developed a score for the assessment of cardiovascular risk. Due to between-study variability in the ascertainment and adjudication of heart failure (HF), incident HF was not included as an endpoint in the ACC/AHA risk score. Purpose To assess the performance of the ACC/AHA risk score for HF risk prediction in a large population-based cohort and to compare its performance with existing HF risk prediction models, including the Atherosclerosis Risk in Communities (ARIC) model and the Health Aging and Body Composition (Health ABC) model. Methods The study included 2743 men and 3646 women from a prospective population-based cohort study. Cox proportional hazards models were fitted using the risk factors applied by the ACC/AHA model for cardiovascular risk, the ARIC model and the Health ABC model. The independent relationship of each predictor with 10-year HF incidence was estimated in men and women. Next, N-terminal pro-B-type natriuretic peptide (NT-proBNP) was added to the ACC/AHA model. The performance of all fitted models was evaluated and compared in terms of discrimination, calibration and the Akaike Information Criterion (AIC). In addition, the area under the receiver operating characteristic curve (AUC), sensitivity and specificity of each model in predicting 10-year incidence of HF were assessed. The incremental value of NT-proBNP over the ACC/AHA model was assessed using the continuous net reclassification improvement index (NRI). Results During a median follow-up of 13 years (63,127 person-years), 387 HF events in women and 259 in men were recorded. The optimism-corrected c-statistic for the ACC/AHA model was 0.76 (95% confidence interval (CI): 0.73–0.79) for men and 0.76 (95% CI: 0.74–0.79) for women. The ARIC model provided the largest c-statistic among the three models for both men [0.82 (95% CI: 0.80–0.84)] and women [0.81 (95% CI: 0.79–0.83)]. Calibration of the models was reasonable. Addition of NT-proBNP to the ACC/AHA model considerably improved model fit for men and women: the AIC improved from 3104.62 to 2976.28 among men and from 5161.63 to 4921.51 among women. The c-statistic also improved, to 0.81 (0.78–0.84) in men and 0.79 (0.77–0.81) in women. The continuous NRI for the addition of NT-proBNP to the base model was 5.3% (95% CI: −12.3–28.6%) for men and 15.9% (95% CI: 2.7–24.7%) for women. Conclusions Compared with the HF-specific models, the ACC/AHA model, which contains routinely available clinical risk factors, performed reasonably in predicting HF risk. Inclusion of NT-proBNP in the ACC/AHA model strongly improved model performance. To achieve better performance for 10-year prediction of incident HF, updating the simple ACC/AHA risk score with the addition of NT-proBNP is recommended.
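A minimal sketch of the AIC and c-statistic comparison, using the lifelines library on synthetic survival data (variable names and effect sizes are illustrative; this is neither the study cohort nor its covariate set):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Synthetic cohort in which log NT-proBNP adds hazard information.
rng = np.random.default_rng(4)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(60, 8, n),
    "sbp": rng.normal(135, 18, n),
    "nt_probnp": rng.lognormal(4.5, 1.0, n),
})
hazard = np.exp(0.04 * (df["age"] - 60) + 0.01 * (df["sbp"] - 135)
                + 0.3 * (np.log(df["nt_probnp"]) - 4.5)).to_numpy()
time = rng.exponential(10.0 / hazard)
df["time"] = np.minimum(time, 13.0)          # administrative censoring at 13 y
df["event"] = (time < 13.0).astype(int)

# Fit a base Cox model, then the same model plus NT-proBNP, and compare
# partial AIC (lower is better) and Harrell's c-statistic (higher is better).
base = CoxPHFitter().fit(df[["age", "sbp", "time", "event"]], "time", "event")
full = CoxPHFitter().fit(df, "time", "event")
for name, m in (("base model", base), ("base + NT-proBNP", full)):
    print(f"{name}: partial AIC = {m.AIC_partial_:.1f}, "
          f"c-statistic = {m.concordance_index_:.3f}")
```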


Author(s):  
Marcus Taylor
Syed F Hashmi
Glen P Martin
Michael Shackcloth
Rajesh Shah
...  

Abstract OBJECTIVES Guidelines advocate that patients being considered for thoracic surgery should undergo a comprehensive preoperative risk assessment. Multiple risk prediction models to estimate the risk of mortality after thoracic surgery have been developed, but their quality and performance have not been reviewed in a systematic way. The objective was to systematically review these models and critically appraise their performance. METHODS The Cochrane Library and the MEDLINE database were searched for articles published between 1990 and 2019. Studies that developed or validated a model predicting perioperative mortality after thoracic surgery were included. Data were extracted based on the checklist for critical appraisal and data extraction for systematic reviews of prediction modelling studies. RESULTS A total of 31 studies describing 22 different risk prediction models were identified. Twenty models were developed specifically for thoracic surgery, and two were developed in other surgical specialties. A total of 57 different predictors were included across the identified models. Age, sex and pneumonectomy were the most frequently included predictors, appearing in 19, 13 and 11 models, respectively. Model performance based on either discrimination or calibration was inadequate for all externally validated models. The most recent data included in validation studies were from 2018. Risk of bias (assessed using the Prediction model Risk Of Bias ASsessment Tool) was high for all except two models. CONCLUSIONS Despite multiple risk prediction models having been developed to predict perioperative mortality after thoracic surgery, none could be described as appropriate for contemporary thoracic surgery. Contemporary validation of available models or new model development is required to ensure that appropriate estimates of operative risk are available for contemporary thoracic surgical practice.

