A Machine Learning Approach to Understanding HIV and Comorbidities in Electronic Health Record Data

Abstract. Introduction: The Consortium for Clinical Characterization of COVID-19 by EHR (4CE) is an international collaboration addressing COVID-19 with federated analyses of electronic health record (EHR) data. Objective: We sought to develop and validate a computable phenotype for COVID-19 severity. Methods: Twelve 4CE sites participated. First, we developed an EHR-based severity phenotype consisting of six code classes, and we validated it on patient hospitalization data from the 12 4CE clinical sites against the outcomes of ICU admission and/or death. We also piloted an alternative machine-learning approach and compared selected predictors of severity to the 4CE phenotype at one site. Results: The full 4CE severity phenotype had a pooled sensitivity of 0.73 and a specificity of 0.83 for the combined outcome of ICU admission and/or death. The sensitivity of individual code categories for acuity varied widely, by up to 0.65 across sites. At one pilot site, the expert-derived phenotype had a mean AUC of 0.903 (95% CI: 0.886, 0.921), compared to an AUC of 0.956 (95% CI: 0.952, 0.959) for the machine-learning approach. Billing codes were poor proxies of ICU admission, with precision and recall as low as 49% compared to chart review. Discussion: We developed a severity phenotype using six code classes that proved resilient to coding variability across international institutions. In contrast, machine-learning approaches may overfit hospital-specific orders. Manual chart review revealed discrepancies even in the gold-standard outcomes, possibly due to heterogeneous pandemic conditions. Conclusion: We developed an EHR-based severity phenotype for COVID-19 in hospitalized patients and validated it at 12 international sites.
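For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how sensitivity and specificity could be computed per site and then pooled. The `Confusion` class, the per-site counts, and the choice to pool by summing counts are all illustrative assumptions; the abstract does not state how the 4CE pooled estimates were obtained.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from comparing a phenotype flag against a reference outcome."""
    tp: int  # flagged severe, outcome occurred (e.g. ICU admission and/or death)
    fp: int  # flagged severe, outcome did not occur
    fn: int  # not flagged, outcome occurred
    tn: int  # not flagged, outcome did not occur


def sensitivity(c: Confusion) -> float:
    """Fraction of patients with the outcome that the phenotype flagged."""
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    """Fraction of patients without the outcome that the phenotype did not flag."""
    return c.tn / (c.tn + c.fp)


def pool(sites: list[Confusion]) -> Confusion:
    """Pool sites by summing counts (an assumed, simple pooling scheme)."""
    return Confusion(
        tp=sum(s.tp for s in sites),
        fp=sum(s.fp for s in sites),
        fn=sum(s.fn for s in sites),
        tn=sum(s.tn for s in sites),
    )


# Hypothetical counts for two sites; these are NOT figures from the study.
site_a = Confusion(tp=40, fp=10, fn=10, tn=40)
site_b = Confusion(tp=30, fp=20, fn=20, tn=80)
pooled = pool([site_a, site_b])

print(f"pooled sensitivity = {sensitivity(pooled):.2f}")
print(f"pooled specificity = {specificity(pooled):.2f}")
```

Summing counts weights each site by its number of patients; a meta-analytic pooling would weight sites differently and could give different values.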


Utilizing electronic health record data to understand comorbidity burden among people living with HIV: a machine learning approach

AIDS, 2021, Vol 35 (1), pp. S39-S51. DOI: 10.1097/qad.0000000000002736

Author(s): Xueying Yang, Jiajia Zhang, Shujie Chen, Sharon Weissman, Bankole Olatosi, ...

Keyword(s): Machine Learning, Electronic Health Record, People Living With HIV, Learning Approach, Health Record, Electronic Health Record Data, Comorbidity Burden, Machine Learning Approach, Record Data, Living With HIV


Assessing stroke severity using electronic health record data: a machine learning approach

BMC Medical Informatics and Decision Making, 2020, Vol 20 (1). DOI: 10.1186/s12911-019-1010-x. Cited by ~3.

Author(s): Emily Kogan, Kathryn Twyman, Jesse Heap, Dejan Milentijevic, Jennifer H. Lin, ...

Keyword(s): Machine Learning, Electronic Health Record, Patient Outcomes, Stroke Severity, Health Record, Learning Models, Electronic Health Record Data, Record Data, Electronic Health, Machine Learning Models

Abstract. Background: Stroke severity is an important predictor of patient outcomes and is commonly measured with the National Institutes of Health Stroke Scale (NIHSS) scores. Because these scores are often recorded as free text in physician reports, structured real-world evidence databases seldom include the severity. The aim of this study was to use machine learning models to impute NIHSS scores for all patients with newly diagnosed stroke from multi-institution electronic health record (EHR) data. Methods: NIHSS scores available in the Optum© de-identified Integrated Claims-Clinical dataset were extracted from physician notes by applying natural language processing (NLP) methods. The cohort analyzed in the study consists of the 7149 patients with an inpatient or emergency room diagnosis of ischemic stroke, hemorrhagic stroke, or transient ischemic attack and a corresponding NLP-extracted NIHSS score. A subset of these patients (n = 1033, 14%) was held out for independent validation of model performance, and the remaining patients (n = 6116, 86%) were used for training the model. Several machine learning models were evaluated, and their parameters were optimized using cross-validation on the training set. The model with optimal performance, a random forest model, was ultimately evaluated on the holdout set. Results: Using machine learning, we identified the main factors in electronic health record data for assessing stroke severity, including death within the same month as stroke occurrence, length of hospital stay following stroke occurrence, aphagia/dysphagia diagnosis, hemiplegia diagnosis, and whether a patient was discharged to home or self-care. Comparing the imputed NIHSS scores to the NLP-extracted NIHSS scores on the holdout data set yielded an R² (coefficient of determination) of 0.57, an R (Pearson correlation coefficient) of 0.76, and a root-mean-squared error of 4.5. Conclusions: Machine learning models built on EHR data can be used to determine proxies for stroke severity. This enables severity to be incorporated in studies of stroke patient outcomes using administrative and EHR databases.
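The three holdout metrics named in the abstract can be illustrated with a short sketch. The function names and the toy score lists below are illustrative assumptions, not data or code from the study; the formulas are the standard definitions of the coefficient of determination, the Pearson correlation coefficient, and root-mean-squared error.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between reference and imputed scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy values standing in for reference (NLP-extracted) and imputed scores;
# these are made up for illustration only.
y_true = [2, 4, 6, 8]
y_pred = [3, 3, 7, 7]

print(f"R^2  = {r_squared(y_true, y_pred):.2f}")
print(f"R    = {pearson_r(y_true, y_pred):.2f}")
print(f"RMSE = {rmse(y_true, y_pred):.2f}")
```

Note that the coefficient of determination is computed from prediction errors, so it need not equal the square of the Pearson correlation in general.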
