Comparison of Machine Learning Models in Prediction of Cardiovascular Disease Using Health Record Data

Author(s):  
Jaouja Maiga ◽  
Gilbert Gutabaga Hungilo ◽  
Pranowo
Author(s):  
Emily Kogan ◽  
Kathryn Twyman ◽  
Jesse Heap ◽  
Dejan Milentijevic ◽  
Jennifer H. Lin ◽  
...  

Abstract Background Stroke severity is an important predictor of patient outcomes and is commonly measured with the National Institutes of Health Stroke Scale (NIHSS) scores. Because these scores are often recorded as free text in physician reports, structured real-world evidence databases seldom include the severity. The aim of this study was to use machine learning models to impute NIHSS scores for all patients with newly diagnosed stroke from multi-institution electronic health record (EHR) data. Methods NIHSS scores available in the Optum© de-identified Integrated Claims-Clinical dataset were extracted from physician notes by applying natural language processing (NLP) methods. The cohort analyzed in the study consists of the 7149 patients with an inpatient or emergency room diagnosis of ischemic stroke, hemorrhagic stroke, or transient ischemic attack and a corresponding NLP-extracted NIHSS score. A subset of these patients (n = 1033, 14%) was held out for independent validation of model performance, and the remaining patients (n = 6116, 86%) were used for training the model. Several machine learning models were evaluated, and their parameters were optimized using cross-validation on the training set. The model with optimal performance, a random forest model, was ultimately evaluated on the holdout set. Results Leveraging machine learning, we identified the main factors in electronic health record data for assessing stroke severity, including death within the same month as stroke occurrence, length of hospital stay following stroke occurrence, aphagia/dysphagia diagnosis, hemiplegia diagnosis, and whether a patient was discharged to home or self-care. Comparing the imputed NIHSS scores to the NLP-extracted NIHSS scores on the holdout data set yielded an R² (coefficient of determination) of 0.57, an R (Pearson correlation coefficient) of 0.76, and a root-mean-squared error of 4.5.
Conclusions Machine learning models built on EHR data can be used to determine proxies for stroke severity. This enables severity to be incorporated in studies of stroke patient outcomes using administrative and EHR databases.
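
The three agreement measures reported in the Results (coefficient of determination, Pearson correlation coefficient, and root-mean-squared error) can be computed from paired reference and imputed scores. The following is a minimal sketch in plain Python, not the authors' code; the function name `agreement_metrics` and the example values are hypothetical.

```python
import math


def agreement_metrics(reference, imputed):
    """Return (r_squared, pearson_r, rmse) for two equal-length score lists.

    Hypothetical helper for illustration; `reference` stands in for the
    NLP-extracted scores and `imputed` for the model output.
    """
    if len(reference) != len(imputed) or len(reference) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(reference)
    mean_ref = sum(reference) / n
    mean_imp = sum(imputed) / n

    # Residual and total sums of squares for the coefficient of determination.
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, imputed))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    var_imp = sum((p - mean_imp) ** 2 for p in imputed)
    if ss_tot == 0 or var_imp == 0:
        raise ValueError("both sequences must have non-zero variance")

    r_squared = 1.0 - ss_res / ss_tot

    # Pearson correlation: covariance scaled by both standard deviations.
    cov = sum((r - mean_ref) * (p - mean_imp) for r, p in zip(reference, imputed))
    pearson_r = cov / math.sqrt(ss_tot * var_imp)

    # Root-mean-squared error, in the same units as the scores.
    rmse = math.sqrt(ss_res / n)
    return r_squared, pearson_r, rmse


# Made-up example values, not study data.
r2, r, rmse = agreement_metrics([0, 4, 8, 12], [1, 3, 9, 11])
```

Note that R² computed this way equals the square of the Pearson correlation only when the predictions are unbiased and well calibrated, so the two figures need not match exactly.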


Circulation ◽  
2020 ◽  
Vol 142 (Suppl_3) ◽  
Author(s):  
Ashish Sarraju ◽  
Andrew Ward ◽  
Sukyung Chung ◽  
Jiang Li ◽  
David Scheinker ◽  
...  

Introduction: Patients with atherosclerotic cardiovascular disease (ASCVD) have a high risk for recurrent ASCVD events despite statin use. Pooled cohort equations (PCE) are used for ASCVD risk prediction in primary prevention, but there are no validated models for recurrent risk prediction in secondary prevention. Machine learning (ML) demonstrates promise in developing novel risk prediction models using electronic health record (EHR) data. Methods: We included adults with prior ASCVD from EHR data from an outpatient Northern California system between January 1, 2009 and December 31, 2018 with at least 2 visits at least 1 year apart and 5 years of follow-up. The outcome was a recurrent ASCVD event defined as the first myocardial infarction, stroke, or fatal coronary artery disease in the 5-year follow-up period. We trained ML models to predict recurrent ASCVD risk: random forests (RF), gradient boosted machines (GBM), extreme gradient boosted models (XGBoost), and logistic regression with a standard L2 penalty (LR) and an L1 penalty (Lasso). We evaluated performance of ML models and the PCE on a 20% held-out test cohort using the areas under the receiver operating characteristic curves (AUCs). Results: Our cohort consisted of 32,192 patients with ASCVD (mean age 70 years, 46% women, 12% Asian and 6% Hispanic). Fewer than half (49%) were on guideline-directed statins. XGBoost and GBM were the best performing models for recurrent ASCVD risk prediction, while the PCE performed poorly (Figure). The top 20 predictive variables for recurrent ASCVD risk included prior events (ischemic stroke, myocardial infarction), traditional risk factors (age, blood pressure, lipid levels) and socioeconomic factors (income, education). Conclusions: EHR-trained machine learning models facilitated recurrent ASCVD risk prediction in real-world secondary prevention patients. Machine learning models developed from large datasets may help bridge contemporary gaps in ASCVD risk prediction.
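
The AUC used to compare models on the held-out cohort can be read as the probability that a randomly chosen patient with the outcome receives a higher risk score than a randomly chosen patient without it. The sketch below computes it directly from that definition in plain Python. It is an illustration only, the function name `roc_auc` and the sample values are hypothetical, and the pairwise loop is quadratic, so a library routine would be preferable for a cohort of this size.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (0/1) and risk scores.

    Counts, over all positive/negative pairs, how often the positive case
    has the higher score; tied scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```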


2020 ◽  
Author(s):  
Christine Giang ◽  
Jacob Calvert ◽  
Gina Barnes ◽  
Anna Siefkas ◽  
Abigail Green-Saxena ◽  
...  

Abstract Objective Ventilator-associated pneumonia (VAP) is the most common and fatal nosocomial infection in intensive care units (ICUs). Existing methods for identifying VAP display low accuracy, and their use may delay antimicrobial therapy. VAP diagnostics derived from machine learning methods that utilize electronic health record data have not yet been explored. The objective of this study is to compare the performance of a variety of machine learning models trained to predict whether VAP will be diagnosed during the patient stay. Methods A retrospective study examined data from 6,129 adult ICU encounters lasting at least 48 hours following the initiation of mechanical ventilation. The gold standard was the presence of a diagnostic code for VAP. Five different machine learning models were trained to predict VAP 48 hours after initiation of mechanical ventilation. Model performance was evaluated with regard to area under the receiver operating characteristic curve (AUROC) on a 10% hold-out test set. Feature importance was measured in terms of Shapley values. Results The highest performing model achieved an AUROC value of 0.827. The most important features for the best-performing model were the length of time on mechanical ventilation, presence of antibiotics, sputum test frequency, and most recent Glasgow Coma Scale assessment. Discussion Supervised machine learning using patient electronic health record data is promising for VAP diagnosis and warrants further validation. Conclusion This tool has the potential to aid the timely diagnosis of VAP.
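
A Shapley value assigns each feature its average marginal contribution over all subsets of the other features. The abstract does not say how the values were computed, so the following is only a sketch of the exact definition in plain Python, with a toy value function and placeholder feature names (`a`, `b`, `c`) that are assumptions for illustration. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small set of players (features).

    `value` maps a frozenset of players to a number, e.g. a model score
    obtained when only those features are available.
    """
    n = len(players)
    result = {}
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of player i to coalition s.
                phi += weight * (value(s | {i}) - value(s))
        result[i] = phi
    return result


def toy_value(subset):
    """Hypothetical score: 'a' adds 2, 'b' adds 1, and 'a' with 'b' adds 4 more."""
    total = 0.0
    if "a" in subset:
        total += 2.0
    if "b" in subset:
        total += 1.0
    if "a" in subset and "b" in subset:
        total += 4.0
    return total


phi = shapley_values(["a", "b", "c"], toy_value)
```

In this toy case the interaction bonus is split equally between `a` and `b`, the unused feature `c` receives zero, and the values sum to the score of the full feature set.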


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Moojung Kim ◽  
Young Jae Kim ◽  
Sung Jin Park ◽  
Kwang Gi Kim ◽  
Pyung Chun Oh ◽  
...  

Abstract Background Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.
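
Accuracy figures reported separately for the < 65 and ≥ 65 groups amount to scoring predictions within each age stratum. The sketch below shows that bookkeeping in plain Python. The function name `accuracy_by_age_group`, the record layout, and the sample records are assumptions for illustration, and it does not reproduce the study's separate per-group models.

```python
def accuracy_by_age_group(records, cutoff=65):
    """Accuracy per age stratum from (age, true_label, predicted_label) tuples.

    Returns a dict with keys "<65" and ">=65"; a stratum with no records
    maps to None instead of raising a division error.
    """
    counts = {"<65": [0, 0], ">=65": [0, 0]}  # [correct, total] per stratum
    for age, truth, predicted in records:
        key = ">=65" if age >= cutoff else "<65"
        counts[key][0] += int(truth == predicted)
        counts[key][1] += 1
    return {
        key: (correct / total if total else None)
        for key, (correct, total) in counts.items()
    }


# Made-up records, not survey data.
sample = [(70, 1, 1), (80, 0, 1), (40, 1, 1), (50, 0, 0), (64, 1, 0)]
by_group = accuracy_by_age_group(sample)
```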

