Abstract P261: Machine Learning-Enabled Prediction of Long-Term Stroke Recurrence Using Data From Electronic Health Records

Stroke ◽  
2021 ◽  
Vol 52 (Suppl_1) ◽  
Author(s):  
Vida Abedi ◽  
Venkatesh Avula ◽  
Durgesh Chaudhary ◽  
Shima Shahjouei ◽  
Ayesha Khan ◽  
...  

Objective: The long-term risk of recurrent ischemic stroke, estimated to be between 17% and 30%, cannot be reliably assessed. Our goal was to study whether machine-learning models could be trained to predict stroke recurrence, to identify key clinical variables, and to assess whether performance metrics could be optimized. Methods: We used patient-level data from electronic health records, 6 algorithms (Logistic Regression, Extreme Gradient Boosting, Gradient Boosting Machine, Random Forest, Support Vector Machine, Decision Tree), 4 feature selection strategies, 5 prediction windows, and 2 sampling strategies to develop 288 models for up to 5-year stroke recurrence prediction. We further identified important clinical features and different optimization strategies. Results: We included 2,091 ischemic stroke patients in this study. Model AUROC was stable for prediction windows of 1, 2, 3, 4, and 5 years, with the highest score for the 1-year window (0.79) and the lowest for the 5-year prediction window (0.69). A total of 21 (7%) models reached an AUROC above 0.73, while 110 (38%) models reached an AUROC greater than 0.7. Among the 53 features analyzed, age, body mass index, and laboratory-based features (such as high-density lipoprotein, hemoglobin A1c, and creatinine) had the highest overall importance scores. The balance between specificity and sensitivity improved through sampling strategies. Conclusion: All six selected modeling algorithms could be trained to predict long-term stroke recurrence, and laboratory-based variables were highly associated with stroke recurrence; the latter could be targeted for personalized interventions. Model performance metrics could be optimized, and models can be implemented in the same healthcare system as intelligent decision support to improve outcomes.
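A minimal sketch of the model-grid idea described in this abstract: combinations of algorithm and feature-selection strategy, each scored by AUROC for one prediction window. The dataset is synthetic, XGBoost is omitted to keep the example dependency-free, and only two of the study's four grid dimensions are shown; nothing here reproduces the authors' code.

```python
# Illustrative sketch: train a grid of (algorithm, feature selector) models
# and rank them by AUROC, mirroring the many-models comparison in the abstract.
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for patient-level EHR features and a recurrence label for one window.
X, y = make_classification(n_samples=2000, n_features=53, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

algorithms = {
    "LR": LogisticRegression(max_iter=1000),
    "GBM": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(probability=True),
    "DT": DecisionTreeClassifier(),
}
feature_selectors = {
    "top20": SelectKBest(f_classif, k=20),
    "all": SelectKBest(f_classif, k="all"),
}

results = {}
for (algo_name, algo), (fs_name, fs) in product(algorithms.items(), feature_selectors.items()):
    model = make_pipeline(StandardScaler(), fs, algo)
    model.fit(X_tr, y_tr)
    auroc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    results[(algo_name, fs_name)] = auroc

# Rank the grid by AUROC, highest first.
for key, auroc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(key, round(auroc, 3))
```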

2021 ◽  
Vol 10 (6) ◽  
pp. 1286
Author(s):  
Vida Abedi ◽  
Venkatesh Avula ◽  
Durgesh Chaudhary ◽  
Shima Shahjouei ◽  
Ayesha Khan ◽  
...  

Background: The long-term risk of recurrent ischemic stroke, estimated to be between 17% and 30%, cannot be reliably assessed at an individual level. Our goal was to study whether machine-learning models could be trained to predict stroke recurrence, to identify key clinical variables, and to assess whether performance metrics could be optimized. Methods: We used patient-level data from electronic health records, six interpretable algorithms (Logistic Regression, Extreme Gradient Boosting, Gradient Boosting Machine, Random Forest, Support Vector Machine, Decision Tree), four feature selection strategies, five prediction windows, and two sampling strategies to develop 288 models for up to 5-year stroke recurrence prediction. We further identified important clinical features and different optimization strategies. Results: We included 2091 ischemic stroke patients. The model area under the receiver operating characteristic curve (AUROC) was stable for prediction windows of 1, 2, 3, 4, and 5 years, with the highest score for the 1-year window (0.79) and the lowest for the 5-year prediction window (0.69). A total of 21 (7%) models reached an AUROC above 0.73, while 110 (38%) models reached an AUROC greater than 0.7. Among the 53 features analyzed, age, body mass index, and laboratory-based features (such as high-density lipoprotein, hemoglobin A1c, and creatinine) had the highest overall importance scores. The balance between specificity and sensitivity improved through sampling strategies. Conclusion: All six selected algorithms could be trained to predict long-term stroke recurrence, and laboratory-based variables were highly associated with stroke recurrence; the latter could be targeted for personalized interventions. Model performance metrics could be optimized, and models can be implemented in the same healthcare system as intelligent decision support for targeted intervention.
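A minimal sketch of how a sampling strategy can shift the balance between sensitivity and specificity, as reported in the Results. The data, the simple minority-class oversampling, and the choice of logistic regression are illustrative assumptions, not the study's implementation.

```python
# Illustrative sketch: compare sensitivity/specificity with and without
# oversampling the minority (recurrence) class before training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

def sens_spec(model, X_test, y_test):
    # Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Baseline: train on the imbalanced data as-is.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Oversample the minority class up to the majority class size.
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=1)
balanced = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up]))

print("imbalanced  sens/spec:", sens_spec(base, X_te, y_te))
print("oversampled sens/spec:", sens_spec(balanced, X_te, y_te))
```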


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Jiaxin Fan ◽  
Mengying Chen ◽  
Jian Luo ◽  
Shusen Yang ◽  
Jinming Shi ◽  
...  

Abstract Background Screening carotid B-mode ultrasonography is a frequently used method to detect subjects with carotid atherosclerosis (CAS). Because most CAS progresses asymptomatically, early identification is challenging for clinicians, yet undetected CAS may trigger ischemic stroke. Recently, machine learning has shown a strong ability to classify data and a potential for prediction in the medical field. The combined use of machine learning and patients' electronic health records could provide clinicians with a more convenient and precise method to identify asymptomatic CAS. Methods Retrospective cohort study using routine clinical data of medical check-up subjects from April 19, 2010 to November 15, 2019. Six machine learning models (logistic regression [LR], random forest [RF], decision tree [DT], eXtreme Gradient Boosting [XGB], Gaussian Naïve Bayes [GNB], and K-Nearest Neighbour [KNN]) were used to predict asymptomatic CAS, and their predictive performance was compared in terms of the area under the receiver operating characteristic curve (AUCROC), accuracy (ACC), and F1 score (F1). Results Of the 18,441 subjects, 6553 were diagnosed with asymptomatic CAS. Compared to DT (AUCROC 0.628, ACC 65.4%, and F1 52.5%), the other five models improved prediction: KNN +7.6% (0.704, 68.8%, and 50.9%, respectively), GNB +12.5% (0.753, 67.0%, and 46.8%, respectively), XGB +16.0% (0.788, 73.4%, and 55.7%, respectively), RF +16.6% (0.794, 74.5%, and 56.8%, respectively), and LR +18.1% (0.809, 74.7%, and 59.9%, respectively). The highest-achieving model, LR, predicted 1045/1966 cases (sensitivity 53.2%) and 3088/3566 non-cases (specificity 86.6%). A tenfold cross-validation scheme further verified the predictive ability of the LR model. Conclusions Among the machine learning models, LR showed optimal performance in predicting asymptomatic CAS. Our findings set the stage for an early automatic alarming system, allowing a more precise allocation of CAS prevention measures to the individuals most likely to benefit.
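A minimal sketch of the kind of model comparison this abstract describes: several classifiers scored by AUROC, accuracy, and F1 under tenfold cross-validation. The data are synthetic and XGBoost is stood in for by scikit-learn's gradient boosting to keep the example dependency-free; it does not reproduce the study's pipeline.

```python
# Illustrative sketch: tenfold cross-validated comparison of six classifiers
# on AUROC, accuracy, and F1, as in the abstract's model comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the check-up cohort (roughly one third positive).
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.65], random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "GB (XGB stand-in)": GradientBoostingClassifier(),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_validate(pipe, X, y, cv=10, scoring=["roc_auc", "accuracy", "f1"])
    print(f"{name:18s} AUROC={scores['test_roc_auc'].mean():.3f} "
          f"ACC={scores['test_accuracy'].mean():.3f} F1={scores['test_f1'].mean():.3f}")
```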


AI Magazine ◽  
2012 ◽  
Vol 33 (4) ◽  
pp. 33 ◽  
Author(s):  
Jeremy C. Weiss ◽  
Sriraam Natarajan ◽  
Peggy L. Peissig ◽  
Catherine A. McCarty ◽  
David Page

Electronic health records (EHRs) are an emerging relational domain with large potential to improve clinical outcomes. We apply two statistical relational learning (SRL) algorithms to the task of predicting primary myocardial infarction. We show that one SRL algorithm, relational functional gradient boosting, outperforms propositional learners, particularly in the medically relevant high-recall region. We observe that both SRL algorithms predict outcomes better than their propositional analogs and suggest how our methods can augment current epidemiological practices.
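The abstract emphasizes performance in the high-recall region, which is the clinically important regime for screening. A minimal sketch of that evaluation idea follows, using precision-recall curves on synthetic data; the relational (SRL) models themselves are not reproduced, and two ordinary classifiers simply stand in to illustrate the comparison.

```python
# Illustrative sketch: compare classifiers by the best precision they achieve
# while keeping recall above a clinically motivated floor.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=25, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def precision_at_recall(clf, min_recall=0.8):
    """Best precision achievable on the test set while recall stays >= min_recall."""
    probs = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    precision, recall, _ = precision_recall_curve(y_te, probs)
    return precision[recall >= min_recall].max()

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("gradient boosting", GradientBoostingClassifier())]:
    print(name, "precision at recall >= 0.8:", round(precision_at_recall(clf), 3))
```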


BMJ Open ◽  
2020 ◽  
Vol 10 (11) ◽  
pp. e043487
Author(s):  
Hao Luo ◽  
Kui Kai Lau ◽  
Gloria H Y Wong ◽  
Wai-Chi Chan ◽  
Henry K F Mak ◽  
...  

Introduction: Dementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history. Methods and analysis: We will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched, at a 1:1 ratio, with those who did receive such a diagnosis by age, gender and index date. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared. Ethics and dissemination: This study was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients' records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Code for the resulting dementia risk prediction algorithm will be made publicly available at the website of the Tools to Inform Policy: Chinese Communities' Action in Response to Dementia project (https://www.tip-card.hku.hk/).
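A minimal sketch of the planned 1:1 case-control matching on age and gender (index-date matching omitted for brevity). The DataFrame columns, the toy diagnosis flag, and the exact-age matching rule are illustrative assumptions, not the study protocol.

```python
# Illustrative sketch: for each dementia case, draw one unused control with the
# same age and gender, yielding a 1:1 matched case-control sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
people = pd.DataFrame({
    "id": range(10000),
    "age": rng.integers(65, 95, 10000),
    "gender": rng.choice(["F", "M"], 10000),
    "dementia": rng.random(10000) < 0.05,  # stand-in diagnosis flag
})

cases = people[people["dementia"]]
controls = people[~people["dementia"]]

matched_controls = []
used = set()
for _, case in cases.iterrows():
    pool = controls[(controls["age"] == case["age"]) &
                    (controls["gender"] == case["gender"]) &
                    (~controls["id"].isin(used))]
    if not pool.empty:
        pick = pool.sample(1, random_state=0).iloc[0]
        used.add(pick["id"])
        matched_controls.append(pick)

matched = pd.DataFrame(matched_controls)
print(f"{len(cases)} cases, {len(matched)} matched controls (1:1 where possible)")
```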


2006 ◽  
Vol 45 (03) ◽  
pp. 240-245 ◽  
Author(s):  
A. Shabo

Summary Objectives: This paper pursues the challenge of sustaining lifetime electronic health records (EHRs) based on a comprehensive socio-economic-medico-legal model. The notion of a lifetime EHR extends the emerging concept of a longitudinal, cross-institutional EHR and would provide invaluable information for increasing patient safety and quality of care. Methods: The challenge is how to compile and sustain a coherent EHR across the lifetime of an individual. Several existing and hypothetical models are described, analyzed and compared in an attempt to suggest a preferred approach. Results: The vision is that lifetime EHRs should be sustained by new players in the healthcare arena, who would function as independent health record banks (IHRBs). Multiple competing IHRBs would be established and regulated following preemptive legislation. They should be owned neither by healthcare providers nor by health insurers/payers or government agencies. The new legislation should also stipulate that the records located in these banks be considered the medico-legal copies of an individual's records, and that healthcare providers no longer serve as the legal record keepers. Conclusions: The proposed model is not centered on any of the current players in the field; instead, it is focused on the objective service of sustaining individual EHRs, much as financial banks maintain and manage financial assets. This revolutionary structure provides two main benefits: 1) healthcare organizations would be able to cut the costs of long-term record keeping, and 2) healthcare providers would be able to provide better care based on the availability of a lifelong EHR for their new patients.


2021 ◽  
Author(s):  
Nawar Shara ◽  
Kelley M. Anderson ◽  
Noor Falah ◽  
Maryam F. Ahmad ◽  
Darya Tavazoei ◽  
...  

BACKGROUND Healthcare data are fragmenting as patients seek care from diverse sources, and patient care is negatively impacted by these disparate health records. Machine learning (ML) offers a disruptive force in its ability to inform and improve patient care and outcomes [6]. However, the differences across individuals' health records, the lack of health-data standards, and systemic issues that render the data unreliable and prevent a single view of each patient all create challenges for ML. While these problems exist throughout healthcare, they are especially prevalent within maternal health and exacerbate the maternal morbidity and mortality (MMM) crisis in the United States. OBJECTIVE Maternal patient records were extracted from the electronic health records (EHRs) of a large tertiary healthcare system and made into patient-specific, complete datasets through a systematic method so that a machine-learning-based (ML-based) risk-assessment algorithm could effectively identify maternal cardiovascular risk prior to evidence of diagnosis or intervention within the patient's record. METHODS We outline the effort required to define the specifications of the computational systems, the dataset, and access to relevant systems, while ensuring that data security and privacy laws and policies were met. Data acquisition included the concatenation, anonymization, and normalization of health data across multiple EHRs in preparation for use by a proprietary risk-stratification algorithm designed to establish patient-specific baselines and to identify cardiovascular risk from deviations from those baselines, informing early interventions. RESULTS Patient records can be made actionable for the goal of effectively employing machine learning (ML), specifically to identify cardiovascular risk in pregnant patients. CONCLUSIONS Upon acquiring the data, including its concatenation, anonymization, and normalization across multiple EHRs, an ML-based tool can provide early identification of cardiovascular risk in pregnant patients. CLINICALTRIAL N/A
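A minimal sketch of the data-preparation steps named in the Methods (concatenation, anonymization, normalization) plus a toy deviation check. The source tables, column names, salted-hash scheme, and z-score rule are illustrative assumptions; the study's proprietary risk-stratification algorithm is not reproduced here.

```python
# Illustrative sketch: merge extracts from two EHR systems, anonymize the
# patient identifier, normalize a vital sign, and flag large deviations.
import hashlib

import pandas as pd

# Stand-ins for extracts from two EHR systems with overlapping patients.
ehr_a = pd.DataFrame({"mrn": ["A1", "A2"], "visit_date": ["2021-01-05", "2021-02-10"],
                      "sbp": [118, 152]})
ehr_b = pd.DataFrame({"mrn": ["A1", "A3"], "visit_date": ["2021-03-01", "2021-03-15"],
                      "sbp": [145, 121]})

# Concatenate across systems into one longitudinal table.
records = pd.concat([ehr_a, ehr_b], ignore_index=True)

# Anonymize: replace the medical record number with a salted one-way hash.
SALT = "project-specific-secret"  # hypothetical salt, kept out of shared data
records["patient_key"] = records["mrn"].apply(
    lambda m: hashlib.sha256((SALT + m).encode()).hexdigest()[:16])
records = records.drop(columns=["mrn"])

# Normalize: parse dates and z-score the vital sign across the cohort.
records["visit_date"] = pd.to_datetime(records["visit_date"])
records["sbp_z"] = (records["sbp"] - records["sbp"].mean()) / records["sbp"].std()

# Toy deviation rule: flag visits far from the cohort mean; a true per-patient
# baseline would need more history per patient than this small example has.
records["flag"] = records["sbp_z"].abs() > 1.0
print(records[["patient_key", "visit_date", "sbp", "flag"]])
```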


2020 ◽  
Author(s):  
Nansu Zong ◽  
Victoria Ngo ◽  
Daniel J. Stone ◽  
Andrew Wen ◽  
Yiqing Zhao ◽  
...  

BACKGROUND Precision oncology has the potential to leverage clinical and genomic data to advance disease prevention, diagnosis, and treatment. A key research area focuses on the early detection of primary cancers and the prediction of cancers of unknown primary in order to facilitate optimal treatment decisions. OBJECTIVE This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict unknown primaries. METHODS We extracted genetic data elements from a collection of oncology genetic reports of 1,011 cancer patients, and corresponding phenotypic data from the Mayo Clinic electronic health records (EHRs). We modeled both the genetic and the EHR data with HL7 Fast Healthcare Interoperability Resources (FHIR). The semantic web Resource Description Framework (RDF) was employed to generate a network-based data representation (i.e., a patient-phenotypic-genetic network). Based on the RDF data graph, the graph-embedding algorithm Node2vec was applied to generate features, and multiple machine learning and deep learning backbone models were then adopted for cancer prediction. RESULTS Across the six machine-learning tasks designed in the experiment, we demonstrated that the proposed method achieved favorable results in classifying primary cancer types and predicting unknown primaries. To demonstrate interpretability, the phenotypic and genetic features that contributed most to the prediction of each cancer were identified and validated against a literature review. CONCLUSIONS Accurate prediction of cancer types can be achieved with existing EHR data with satisfactory precision. The integration of genetic reports improves prediction, illustrating the translational value of incorporating genetic tests early at the diagnosis stage for cancer patients.
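A minimal sketch of the embed-then-classify idea this abstract describes: build a small patient-phenotype-genetic graph, learn node embeddings from random walks (a DeepWalk-style stand-in for Node2vec, i.e. with return and in-out parameters p = q = 1), and train a classifier on the patient vectors. The toy graph, labels, and hyperparameters are illustrative assumptions; the FHIR/RDF modeling is not reproduced here.

```python
# Illustrative sketch: random-walk node embeddings over a toy patient graph,
# then a classifier on the patient embeddings to predict cancer type.
import random

import networkx as nx
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

random.seed(0)

# Toy graph: patients linked to phenotype and gene-variant nodes.
G = nx.Graph()
for p in range(40):
    patient = f"patient_{p}"
    cancer_type = p % 2  # stand-in label: two primary cancer types
    G.add_node(patient, label=cancer_type)
    G.add_edge(patient, f"phenotype_{cancer_type}")
    G.add_edge(patient, f"gene_{random.randint(0, 5)}")

def random_walks(graph, num_walks=20, walk_length=10):
    """Uniform random walks from every node; each walk is a 'sentence' of node names."""
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes():
            walk = [node]
            for _ in range(walk_length - 1):
                walk.append(random.choice(list(graph.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

# Learn node embeddings from the walk "sentences" with skip-gram Word2Vec.
w2v = Word2Vec(random_walks(G), vector_size=32, window=5, min_count=1, sg=1, seed=0)

patients = [n for n, d in G.nodes(data=True) if "label" in d]
X = np.array([w2v.wv[p] for p in patients])
y = np.array([G.nodes[p]["label"] for p in patients])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy on toy graph:", clf.score(X, y))
```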

