Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease

Abstract: Prognostic modelling is important in clinical practice and epidemiology for patient management and research. Electronic health records (EHR) provide large quantities of data for such models, but conventional epidemiological approaches require significant researcher time to implement. Expert selection of variables, fine-tuning of variable transformations and interactions, and imputing missing values in datasets are time-consuming and could bias subsequent analysis, particularly given that missingness in EHR is both high and may carry meaning.

Using a cohort of over 80,000 patients from the CALIBER programme, we performed a systematic comparison of several machine-learning approaches in EHR. We used Cox models and random survival forests with and without imputation on 27 expert-selected variables to predict all-cause mortality. We also used Cox models, random forests and elastic net regression on an extended dataset with 586 variables to build prognostic models and identify novel prognostic factors without prior expert input.

We observed that data-driven models used on an extended dataset can outperform conventional models for prognosis, without data preprocessing or imputing missing values, and with no need to scale or transform continuous data. An elastic net Cox regression based on 586 unimputed variables with continuous values discretised achieved a C-index of 0.801 (bootstrapped 95% CI 0.799 to 0.802), compared to 0.793 (0.791 to 0.794) for a traditional Cox model comprising 27 expert-selected variables with imputation for missing values.

We also found that data-driven models allow identification of novel prognostic variables; that the absence of values for particular variables carries meaning, and can have significant implications for prognosis; and that variables often have a nonlinear association with mortality, which discretised Cox models and random forests can elucidate.

This demonstrates that machine-learning approaches applied to raw EHR data can be used to build reliable models for use in research and clinical practice, and identify novel predictive variables and their effects to inform future research.
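The C-index reported above measures how well a model ranks patients by risk. As a minimal sketch (not the authors' code), the following computes Harrell's C-index for right-censored data in plain Python; the function name `concordance_index` and its argument layout are assumptions made for illustration.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[bool],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and a
    strictly shorter follow-up time than patient j. The pair is concordant
    when patient i also has the higher predicted risk; tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the difference between 0.793 and 0.801 is a difference in the share of comparable patient pairs that are ordered correctly. The quadratic loop is kept for readability and would be slow on a cohort of 80,000 patients.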


Early prediction of clinical deterioration using data-driven machine learning modeling of electronic health records

Journal of Thoracic and Cardiovascular Surgery ◽

10.1016/j.jtcvs.2021.10.060 ◽

2021 ◽

Author(s):

Victor M. Ruiz ◽

Michael P. Goldsmith ◽

Lingyun Shi ◽

Allan F. Simpao ◽

Jorge A. Gálvez ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Clinical Deterioration ◽

Data Driven ◽

Early Prediction ◽

Health Records ◽

Electronic Health ◽

Using Data


Development and validation of a pancreatic cancer prediction model from electronic health records using machine learning.

Journal of Clinical Oncology ◽

10.1200/jco.2020.38.4_suppl.679 ◽

2020 ◽

Vol 38 (4_suppl) ◽

pp. 679-679

Author(s):

Limor Appelbaum ◽

Jose Pablo Cambronero ◽

Karla Pollick ◽

George Silva ◽

Jennifer P. Stevens ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Electronic Health Records ◽

Medical Center ◽

Characteristic Curve ◽

External Validation ◽

Data Driven ◽

Health Records ◽

Clinical Correlates ◽

Electronic Health

679 Background: Pancreatic Adenocarcinoma (PDAC) is often diagnosed at an advanced stage. We sought to develop a model for early PDAC prediction in the general population, using electronic health records (EHRs) and machine learning. Methods: We used three EHR datasets from Beth Israel Deaconess Medical Center (BIDMC) and Partners Healthcare (PHC): 1. “BIDMC-Development-Data” (BIDMC-DD) for model development, using a feed-forward neural network (NN) and L2-regularized logistic regression, randomly split (80:20) into training and test groups. We tuned hyperparameters using cross-validation in training, and report performance on the test split. 2. “BIDMC-Large-Data” (BIDMC-LD) to re-fit and calibrate models. 3. “PHC-Data” for external validation. We evaluate using Area Under the Receiver Operating Characteristic Curve (AUC) and compute 95% CI using empirical bootstrap over test data. PDAC patients were selected using ICD-9/-10 codes and validated with tumor registries. In contrast to prior work, we did not predefine feature sets based on known clinical correlates and instead employed data-driven feature selection, specifically importance-based feature pruning, regularization, and manual validation, to identify diagnostic-based features. Results: BIDMC-DD included demographics, diagnoses, labs and medications for 1018 patients (cases = 509; age-sex paired controls). BIDMC-LD included diagnoses for 547,917 patients (cases = 509), and PHC included diagnoses for 160,593 patients (cases = 408). We compared our approach to adapted and re-fitted published baselines. With a 365-day lead-time, NN obtained a BIDMC-DD test AUC of 0.84 (CI 0.79 - 0.90) versus the previous best baseline AUC of 0.70 (CI 0.62 - 0.78). We also validated using BIDMC-DD’s test cancer patients and BIDMC-LD controls. The AUC was 0.71 (CI 0.67 - 0.76) at the 365-day cutoff. NN’s external validation AUC on PHC-Data was 0.71 (CI 0.63 - 0.79), outperforming an existing model’s AUC of 0.61 (CI 0.52 - 0.70) (Baecker et al, 2019). Conclusions: Models based on data-driven feature selection outperform models that use predefined sets of known clinical correlates and can help in early prediction of PDAC development.
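The abstract reports each AUC with a bootstrapped 95% CI over the test data. A minimal sketch of that kind of evaluation follows. It uses a percentile bootstrap, which may differ from the exact "empirical bootstrap" procedure the authors applied, and the function names `auc` and `bootstrap_auc_ci` are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the AUC, resampling test rows with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap the AUC")
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that drew only one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

With a test split of only about 200 patients (20% of 1018), intervals as wide as 0.79 to 0.90 are to be expected, which is why the interval matters as much as the point estimate when comparing against the baselines.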


Acronym Disambiguation in Clinical Notes from Electronic Health Records

10.1101/2020.11.25.20221648 ◽

2020 ◽

Author(s):

Nicholas B. Link ◽

Selena Huang ◽

Tianrun Cai ◽

Zeling He ◽

Jiehuan Sun ◽

...

Keyword(s):

Rheumatoid Arthritis ◽

Machine Learning ◽

Electronic Health Records ◽

Language Processing ◽

Learning Approaches ◽

Health Records ◽

Hospital System ◽

Clinical Notes ◽

Unsupervised Method ◽

Electronic Health

Objective: The use of electronic health records (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings) and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study we introduce an unsupervised method for acronym disambiguation, the task of classifying the correct sense of acronyms in the clinical EHR notes. Methods: We developed an unsupervised ensemble machine learning (CASEml) algorithm to automatically classify acronyms by leveraging semantic embeddings, visit-level text and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard unsupervised method and a baseline metric selecting the most frequent acronym sense. We additionally evaluated the effects of RA disambiguation on NLP-driven phenotyping of rheumatoid arthritis. Results: CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art unsupervised method. As well, we demonstrated that applying CASEml to medical notes improves the AUC of a phenotype algorithm for rheumatoid arthritis. Conclusion: CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and unsupervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.
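The "most frequent acronym sense" baseline mentioned in the abstract is simple enough to sketch. The code below is an illustration only, not CASEml and not the authors' implementation; the function names and the toy labelled examples (including senses such as "right atrium") are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# (acronym, sense) pairs; a hypothetical layout for labelled mentions.
Mention = Tuple[str, str]


def fit_most_frequent_sense(labelled: Iterable[Mention]) -> Dict[str, str]:
    """Map each acronym to the sense seen most often in the labelled data."""
    counts: Dict[str, Counter] = {}
    for acronym, sense in labelled:
        counts.setdefault(acronym, Counter())[sense] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}


def accuracy(model: Dict[str, str], labelled: Iterable[Mention]) -> float:
    """Share of mentions whose true sense equals the predicted majority sense."""
    items: List[Mention] = list(labelled)
    if not items:
        raise ValueError("no mentions to evaluate")
    correct = sum(1 for acronym, sense in items if model.get(acronym) == sense)
    return correct / len(items)


# Toy data, invented for illustration only.
toy_mentions: List[Mention] = [
    ("RA", "rheumatoid arthritis"),
    ("RA", "rheumatoid arthritis"),
    ("RA", "rheumatoid arthritis"),
    ("RA", "right atrium"),
    ("MS", "multiple sclerosis"),
    ("MS", "multiple sclerosis"),
    ("MS", "mitral stenosis"),
]
baseline = fit_most_frequent_sense(toy_mentions)
```

This baseline ignores context entirely, so its accuracy equals the prevalence of the dominant sense. Any context-aware method has to beat that figure to justify its complexity.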


Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data

Pharmacoepidemiology and Drug Safety ◽

10.1002/pds.5199 ◽

2021 ◽

Author(s):

Lin Li ◽

Chuang‐Chung Lee ◽

Fang Liz Zhou ◽

Cliona Molony ◽

Zoran Doder ◽

...

Keyword(s):

Machine Learning ◽

Type 1 Diabetes ◽

Electronic Health Records ◽

Performance Assessment ◽

Diabetic Ketoacidosis ◽

Learning Approaches ◽

Health Records ◽

Electronic Health


Early Prediction of Alzheimer’s Disease and Related Dementias Using Electronic Health Records

10.1101/2020.06.13.20130401 ◽

2020 ◽

Author(s):

Xi Yang ◽

Qian Li ◽

Yonghui Wu ◽

Jiang Bian ◽

Tianchen Lyu ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Data Driven ◽

Early Prediction ◽

Learning Models ◽

Health Records ◽

Domain Experts ◽

Data Driven Approach ◽

Electronic Health ◽

Machine Learning Models

Abstract: Alzheimer’s disease (AD) and AD-related dementias (ADRD) are a class of neurodegenerative diseases affecting about 5.7 million Americans. There is no cure for AD/ADRD. Current interventions have modest effects and focus on attenuating cognitive impairment. Detection of patients at high risk of AD/ADRD is crucial for timely interventions to modify risk factors and primarily prevent cognitive decline and dementia, and thus to enhance the quality of life and reduce health care costs. This study seeks to investigate both knowledge-driven (where domain experts identify useful features) and data-driven (where machine learning models select useful features among all available data elements) approaches for AD/ADRD early prediction using real-world electronic health records (EHR) data from the University of Florida (UF) Health system. We identified a cohort of 59,799 patients and examined four widely used machine learning algorithms following a standard case-control study. We also examined the early prediction of AD/ADRD using patient information 0-years, 1-year, 3-years, and 5-years before the disease onset date. The experimental results showed that models based on the Gradient Boosting Trees (GBT) achieved the best performance for the data-driven approach and the Random Forests (RF) achieved the best performance for the knowledge-driven approach. Among all models, GBT using a data-driven approach achieved the best area under the curve (AUC) score of 0.7976, 0.7192, 0.6985, and 0.6798 for 0, 1, 3, 5-years prediction, respectively. We also examined the top features identified by the machine learning models and compared them with the knowledge-driven features identified by domain experts. Our study demonstrated the feasibility of using electronic health records for the early prediction of AD/ADRD and discovered potential challenges for future investigations.
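The 0-, 1-, 3- and 5-year settings above amount to hiding every record that falls inside a window before the onset date. A minimal sketch of that filtering step follows; the record layout, the function name `records_before_window`, the 365.25-day year, and the strict "before the cutoff" rule are all assumptions rather than details taken from the study.

```python
from datetime import date, timedelta
from typing import List, Tuple

# (record date, code); a hypothetical layout for one patient's EHR entries.
Record = Tuple[date, str]


def records_before_window(
    records: List[Record], onset: date, years_before: float
) -> List[Record]:
    """Keep only records dated strictly before (onset - years_before)."""
    cutoff = onset - timedelta(days=round(365.25 * years_before))
    return [r for r in records if r[0] < cutoff]


# Toy patient, invented for illustration only.
toy_onset = date(2020, 1, 1)
toy_records: List[Record] = [
    (date(2014, 6, 1), "code_a"),
    (date(2018, 6, 1), "code_b"),
    (date(2019, 12, 1), "code_c"),
]
visible_counts = {
    y: len(records_before_window(toy_records, toy_onset, y)) for y in (0, 1, 3, 5)
}
```

The toy patient shows why the reported AUC falls from 0.7976 to 0.6798 as the window grows: fewer records remain visible to the model, and those that do are further from the diagnosis.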


Predictors of remission from body dysmorphic disorder after internet-delivered cognitive behavior therapy: a machine learning approach

10.31234/osf.io/eqcdx ◽

2019 ◽

Author(s):

Oskar Flygare ◽

Jesper Enander ◽

Erik Andersson ◽

Brjánn Ljótsson ◽

Volen Z Ivanov ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forests ◽

Clinical Utility ◽

Body Dysmorphic Disorder ◽

Prediction Models ◽

Behavioral Therapy ◽

Learning Approach ◽

Learning Approaches ◽

Machine Learning Approach

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test if it is possible to reliably predict remission from BDD in a sample of 88 individuals that had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower in subsequent follow-ups (68%, 66% and 61% correctly classified at 3-, 12- and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.


Predicting dementia diagnosis from cognitive footprints in electronic health records: a case–control study protocol

BMJ Open ◽

10.1136/bmjopen-2020-043487 ◽

2020 ◽

Vol 10 (11) ◽

pp. e043487

Author(s):

Hao Luo ◽

Kui Kai Lau ◽

Gloria H Y Wong ◽

Wai-Chi Chan ◽

Henry K F Mak ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Hong Kong ◽

Electronic Health Records ◽

Case Control Study ◽

Case Control ◽

Dementia Diagnosis ◽

Health Records ◽

Electronic Health ◽

Control Study

Introduction: Dementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history. Methods and analysis: We will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched with those who did receive such a diagnosis by age, gender and index date in a 1:1 ratio. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared. Ethics and dissemination: This study was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients’ records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Code for the resulting dementia risk prediction algorithm will be made publicly available on the website of the Tools to Inform Policy: Chinese Communities’ Action in Response to Dementia project (https://www.tip-card.hku.hk/).
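The 1:1 matching of controls to cases can be sketched as exact matching within strata, drawing controls at random without replacement. This is an illustration under stated assumptions, not the protocol's procedure: the `Person` fields and the function `match_controls` are hypothetical, age is approximated by birth year, and index-date matching is left out for brevity.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    pid: str
    birth_year: int
    gender: str


def match_controls(
    cases: List[Person], controls: List[Person], seed: int = 0
) -> Dict[str, Optional[str]]:
    """Pair each case with one unused control of the same birth year and gender.

    Returns a mapping from case id to control id, or None when the stratum
    has run out of controls.
    """
    rng = random.Random(seed)
    pool: Dict[Tuple[int, str], List[Person]] = {}
    for person in controls:
        pool.setdefault((person.birth_year, person.gender), []).append(person)
    for group in pool.values():
        rng.shuffle(group)  # random draw order within each stratum
    matches: Dict[str, Optional[str]] = {}
    for case in cases:
        group = pool.get((case.birth_year, case.gender), [])
        # pop() removes the control so it cannot be matched twice
        matches[case.pid] = group.pop().pid if group else None
    return matches
```

Cases left with `None` would need either a relaxed criterion (for example, a birth-year tolerance) or exclusion; which of the two the study uses is not stated in the abstract.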


Comparative analysis of machine learning methods for analyzing security practice in electronic health records’ logs.

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378353 ◽

2020 ◽

Author(s):

Prosper K Yeng ◽

Muhammad Ali Fauzi ◽

Bian Yang

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Electronic Health Records ◽

Learning Methods ◽

Health Records ◽

Machine Learning Methods ◽

Electronic Health


Workflow-based anomaly detection using machine learning on electronic health records’ logs: A Comparative Study

2020 International Conference on Computational Science and Computational Intelligence (CSCI) ◽

10.1109/csci51800.2020.00143 ◽

2020 ◽

Author(s):

Prosper K Yeng ◽

Muhammad Ali Fauzi ◽

Bian Yang

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Anomaly Detection ◽

Comparative Study ◽

Health Records ◽

Electronic Health


The process of sourcing and preparing electronic health records data to implement a machine-learning algorithm for early identification of maternal cardiovascular risk (Preprint)

10.2196/preprints.34932 ◽

2021 ◽

Author(s):

Nawar Shara ◽

Kelley M. Anderson ◽

Noor Falah ◽

Maryam F. Ahmad ◽

Darya Tavazoei ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Risk ◽

Electronic Health Records ◽

Patient Care ◽

Early Identification ◽

Health Data ◽

Patient Specific ◽

Patient Records ◽

Health Records ◽

Electronic Health

BACKGROUND Healthcare data are fragmenting as patients seek care from diverse sources. Consequently, patient care is negatively impacted by disparate health records. Machine learning (ML) offers a disruptive force in its ability to inform and improve patient care and outcomes [6]. However, the differences that exist in each individual’s health records, combined with the lack of health-data standards, in addition to systemic issues that render the data unreliable and that fail to create a single view of each patient, create challenges for ML. While these problems exist throughout healthcare, they are especially prevalent within maternal health, and exacerbate the maternal morbidity and mortality (MMM) crisis in the United States. OBJECTIVE Maternal patient records were extracted from the electronic health records (EHRs) of a large tertiary healthcare system and made into patient-specific, complete datasets through a systematic method so that a machine-learning-based (ML-based) risk-assessment algorithm could effectively identify maternal cardiovascular risk prior to evidence of diagnosis or intervention within the patient’s record. METHODS We outline the effort that was required to define the specifications of the computational systems, the dataset, and access to relevant systems, while ensuring data security, privacy laws, and policies were met. Data acquisition included the concatenation, anonymization, and normalization of health data across multiple EHRs in preparation for its use by a proprietary risk-stratification algorithm designed to establish patient-specific baselines to identify and establish cardiovascular risk based on deviations from the patient’s baselines to inform early interventions. RESULTS Patient records can be made actionable for the goal of effectively employing machine learning (ML), specifically to identify cardiovascular risk in pregnant patients. 
CONCLUSIONS Upon acquiring data, including the concatenation, anonymization, and normalization of said data across multiple EHRs, the use of a machine-learning-based (ML-based) tool can provide early identification of cardiovascular risk in pregnant patients. CLINICALTRIAL N/A
