Assessing stroke severity using electronic health record data: a machine learning approach

Author(s):  
Emily Kogan ◽  
Kathryn Twyman ◽  
Jesse Heap ◽  
Dejan Milentijevic ◽  
Jennifer H. Lin ◽  
...  

Abstract

Background: Stroke severity is an important predictor of patient outcomes and is commonly measured with the National Institutes of Health Stroke Scale (NIHSS). Because these scores are often recorded as free text in physician reports, structured real-world evidence databases seldom include severity. The aim of this study was to use machine learning models to impute NIHSS scores for all patients with newly diagnosed stroke from multi-institution electronic health record (EHR) data.

Methods: NIHSS scores available in the Optum® de-identified Integrated Claims-Clinical dataset were extracted from physician notes by applying natural language processing (NLP) methods. The cohort analyzed in the study consists of the 7149 patients with an inpatient or emergency room diagnosis of ischemic stroke, hemorrhagic stroke, or transient ischemic attack and a corresponding NLP-extracted NIHSS score. A subset of these patients (n = 1033, 14%) was held out for independent validation of model performance, and the remaining patients (n = 6116, 86%) were used for training. Several machine learning models were evaluated, with parameters optimized using cross-validation on the training set. The model with optimal performance, a random forest, was ultimately evaluated on the holdout set.

Results: Leveraging machine learning, we identified the main factors in EHR data for assessing stroke severity, including death within the same month as stroke occurrence, length of hospital stay following stroke occurrence, aphagia/dysphagia diagnosis, hemiplegia diagnosis, and whether a patient was discharged to home or self-care. Comparing the imputed NIHSS scores to the NLP-extracted NIHSS scores on the holdout data set yielded an R² (coefficient of determination) of 0.57, an R (Pearson correlation coefficient) of 0.76, and a root-mean-squared error of 4.5.

Conclusions: Machine learning models built on EHR data can be used to derive proxies for stroke severity. This enables severity to be incorporated in studies of stroke patient outcomes that use administrative and EHR databases.
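The abstract above evaluates imputation quality with three related metrics: R² (coefficient of determination), Pearson's R, and root-mean-squared error, all computed between imputed and NLP-extracted scores. As a minimal sketch of how those three numbers are obtained from paired scores (not the authors' code; the function name and the toy NIHSS-style values are invented for illustration):

```python
import math

def evaluate_imputation(actual, predicted):
    """R^2, Pearson r, and RMSE between reference and imputed scores."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                     # coefficient of determination
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    sd_a = math.sqrt(ss_tot)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    r = cov / (sd_a * sd_p)                      # Pearson correlation coefficient
    rmse = math.sqrt(ss_res / n)                 # root-mean-squared error
    return r2, r, rmse

# Toy NIHSS-style scores (0-42 scale); values are invented, not from the study.
nlp_scores = [2, 5, 10, 15, 20, 8]   # NLP-extracted reference
imputed    = [3, 4, 12, 14, 18, 9]   # model-imputed
r2, r, rmse = evaluate_imputation(nlp_scores, imputed)
```

Note that R² and R measure different things: R² penalizes any systematic offset between imputed and reference scores, while Pearson's R only measures how well they co-vary, which is why both are reported.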

2020 ◽  
Author(s):  
Christine Giang ◽  
Jacob Calvert ◽  
Gina Barnes ◽  
Anna Siefkas ◽  
Abigail Green-Saxena ◽  
...  

Abstract

Objective: Ventilator-associated pneumonia (VAP) is the most common and most fatal nosocomial infection in intensive care units (ICUs). Existing methods for identifying VAP display low accuracy, and their use may delay antimicrobial therapy. VAP diagnostics derived from machine learning methods that use electronic health record data have not yet been explored. The objective of this study was to compare the performance of a variety of machine learning models trained to predict whether VAP will be diagnosed during the patient stay.

Methods: A retrospective study examined data from 6,129 adult ICU encounters lasting at least 48 hours following the initiation of mechanical ventilation. The gold standard was the presence of a diagnostic code for VAP. Five different machine learning models were trained to predict VAP 48 hours after initiation of mechanical ventilation. Model performance was evaluated by area under the receiver operating characteristic curve (AUROC) on a 10% hold-out test set. Feature importance was measured in terms of Shapley values.

Results: The highest-performing model achieved an AUROC of 0.827. The most important features for the best-performing model were the length of time on mechanical ventilation, presence of antibiotics, sputum test frequency, and the most recent Glasgow Coma Scale assessment.

Discussion: Supervised machine learning using patient electronic health record data is promising for VAP diagnosis and warrants further validation.

Conclusion: This tool has the potential to aid the timely diagnosis of VAP.
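The AUROC used above to compare the five models has a simple probabilistic reading: it is the probability that a randomly chosen VAP-positive encounter receives a higher risk score than a randomly chosen negative one (the Mann-Whitney formulation). A minimal sketch of that computation, with invented labels and scores (not the study's data):

```python
def auroc(labels, scores):
    """AUROC via the rank (Mann-Whitney) formulation: the fraction of
    positive/negative pairs in which the positive outranks the negative;
    ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = VAP diagnostic code present; scores = a model's predicted risk.
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
auc = auroc(y_true, y_score)
```

This pairwise form is quadratic in the number of encounters; production evaluation code typically sorts scores once instead, but the two give identical values.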


2019 ◽  
Vol 6 (10) ◽  
pp. e688-e695 ◽  
Author(s):  
Julia L Marcus ◽  
Leo B Hurley ◽  
Douglas S Krakower ◽  
Stacey Alexeeff ◽  
Michael J Silverberg ◽  
...  

2018 ◽  
Vol 5 ◽  
pp. 205435811877632 ◽  
Author(s):  
Hamid Mohamadlou ◽  
Anna Lynn-Palevsky ◽  
Christopher Barton ◽  
Uli Chettipally ◽  
Lisa Shieh ◽  
...  

2021 ◽  
Vol 4 (3) ◽  
pp. e213460
Author(s):  
Rebekah W. Moehring ◽  
Matthew Phelan ◽  
Eric Lofgren ◽  
Alicia Nelson ◽  
Elizabeth Dodds Ashley ◽  
...  

2020 ◽  
Author(s):  
Tjardo D Maarseveen ◽  
Timo Meinderink ◽  
Marcel J T Reinders ◽  
Johannes Knitza ◽  
Tom W J Huizinga ◽  
...  

BACKGROUND: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center- and language-specific. A tantalizing alternative is the automatic identification of patients by applying machine learning to format-free text entries.

OBJECTIVE: The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records.

METHODS: Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), and the F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best-performing methods were applied to the remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation.

RESULTS: For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data yielded similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94); with this method, we could identify 2873 patients with rheumatoid arthritis, in less than 7 seconds, out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set and, applied to the test data, once again yielded good results (F1 score 0.67; PPV 0.97).

CONCLUSIONS: We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations at limited cost. Our approach is language- and center-independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size, covering different countries, at low cost from data already available in electronic health record systems.
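The workflow described above selects a classification threshold on the training output so that a target positive predictive value is reached, then reports F1 and PPV on held-out data. A hedged sketch of that thresholding step (the function names, the target value, and the toy labels/scores are all invented for illustration, not taken from the study):

```python
def ppv_recall_f1(labels, preds):
    """Positive predictive value (precision), recall, and F1 for binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
    return ppv, recall, f1

def pick_threshold(labels, scores, target_ppv):
    """Lowest score threshold whose binarized predictions reach target_ppv
    (lowest, to keep recall as high as possible); None if unreachable."""
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        ppv, _, _ = ppv_recall_f1(labels, preds)
        if ppv >= target_ppv:
            return t
    return None

# Invented training-style output: 1 = rheumatoid arthritis record.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
t = pick_threshold(labels, scores, target_ppv=0.7)
preds = [1 if s >= t else 0 for s in scores]
ppv, recall, f1 = ppv_recall_f1(labels, preds)
```

Tuning the threshold for PPV rather than overall accuracy reflects the stated goal: when building a research cohort, false positives (non-cases labeled as rheumatoid arthritis) are costlier than missed cases.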

