Early prediction of clinical deterioration using data-driven machine learning modeling of electronic health records

Abstract: Alzheimer’s disease (AD) and AD-related dementias (ADRD) are a class of neurodegenerative diseases affecting about 5.7 million Americans. There is no cure for AD/ADRD. Current interventions have modest effects and focus on attenuating cognitive impairment. Detecting patients at high risk of AD/ADRD is crucial for timely interventions that modify risk factors and prevent cognitive decline and dementia, thereby enhancing quality of life and reducing health care costs. This study investigates both knowledge-driven (where domain experts identify useful features) and data-driven (where machine learning models select useful features among all available data elements) approaches for AD/ADRD early prediction using real-world electronic health records (EHR) data from the University of Florida (UF) Health system. We identified a cohort of 59,799 patients and examined four widely used machine learning algorithms following a standard case-control design. We also examined the early prediction of AD/ADRD using patient information from 0, 1, 3, and 5 years before the disease onset date. The experimental results showed that models based on Gradient Boosting Trees (GBT) achieved the best performance for the data-driven approach and Random Forests (RF) achieved the best performance for the knowledge-driven approach. Among all models, GBT using a data-driven approach achieved the best area under the curve (AUC) scores of 0.7976, 0.7192, 0.6985, and 0.6798 for 0-, 1-, 3-, and 5-year prediction, respectively. We also examined the top features identified by the machine learning models and compared them with the knowledge-driven features identified by domain experts. Our study demonstrated the feasibility of using electronic health records for the early prediction of AD/ADRD and identified potential challenges for future investigations.
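The prediction windows described above (0, 1, 3, and 5 years before onset) can be illustrated with a minimal sketch. It assumes, hypothetically, that each patient's record is a list of `(date, code)` pairs and that a window keeps only events dated on or before the cutoff; the names `cutoff_date` and `features_for_window` are illustrative and do not come from the study.

```python
from datetime import date

def cutoff_date(onset, years_before):
    """Return the date `years_before` years ahead of `onset` (assumed window rule)."""
    try:
        return onset.replace(year=onset.year - years_before)
    except ValueError:
        # onset falls on Feb 29 and the target year is not a leap year
        return onset.replace(year=onset.year - years_before, day=28)

def features_for_window(records, onset, years_before):
    """Keep only the (date, code) events recorded on or before the cutoff."""
    cutoff = cutoff_date(onset, years_before)
    return [r for r in records if r[0] <= cutoff]

# Hypothetical patient history; the codes are placeholders.
records = [
    (date(2010, 1, 1), "I10"),
    (date(2014, 6, 1), "E11"),
    (date(2015, 5, 1), "R41"),
]
onset = date(2015, 6, 1)
counts = {k: len(features_for_window(records, onset, k)) for k in (0, 1, 3, 5)}
```

Longer windows leave fewer events available to the model, which is consistent with the AUC falling as the prediction horizon grows.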

Development and validation of a pancreatic cancer prediction model from electronic health records using machine learning.

Journal of Clinical Oncology ◽

10.1200/jco.2020.38.4_suppl.679 ◽

2020 ◽

Vol 38 (4_suppl) ◽

pp. 679-679

Author(s):

Limor Appelbaum ◽

Jose Pablo Cambronero ◽

Karla Pollick ◽

George Silva ◽

Jennifer P. Stevens ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Electronic Health Records ◽

Medical Center ◽

Characteristic Curve ◽

External Validation ◽

Data Driven ◽

Health Records ◽

Clinical Correlates ◽

Electronic Health

679 Background: Pancreatic adenocarcinoma (PDAC) is often diagnosed at an advanced stage. We sought to develop a model for early PDAC prediction in the general population, using electronic health records (EHRs) and machine learning. Methods: We used three EHR datasets from Beth-Israel Deaconess Medical Center (BIDMC) and Partners Healthcare (PHC): 1. “BIDMC-Development-Data” (BIDMC-DD) for model development, using a feed-forward neural network (NN) and L2-regularized logistic regression, randomly split (80:20) into training and test groups. We tuned hyperparameters using cross-validation on the training split and report performance on the test split. 2. “BIDMC-Large-Data” (BIDMC-LD) to re-fit and calibrate models. 3. “PHC-Data” for external validation. We evaluated models using the area under the receiver operating characteristic curve (AUC) and computed 95% CIs using the empirical bootstrap over test data. PDAC patients were selected using ICD-9/-10 codes and validated with tumor registries. In contrast to prior work, we did not predefine feature sets based on known clinical correlates and instead employed data-driven feature selection, specifically importance-based feature pruning, regularization, and manual validation, to identify diagnostic-based features. Results: BIDMC-DD included demographics, diagnoses, labs and medications for 1018 patients (cases = 509; age-sex paired controls). BIDMC-LD included diagnoses for 547,917 patients (cases = 509), and PHC included diagnoses for 160,593 patients (cases = 408). We compared our approach to adapted and re-fitted published baselines. With a 365-day lead time, NN obtained a BIDMC-DD test AUC of 0.84 (CI 0.79 - 0.90) versus the previous best baseline AUC of 0.70 (CI 0.62 - 0.78). We also validated using BIDMC-DD’s test cancer patients and BIDMC-LD controls. The AUC was 0.71 (CI 0.67 - 0.76) at the 365-day cutoff. NN’s external validation AUC on PHC-Data was 0.71 (CI 0.63 - 0.79), outperforming an existing model’s AUC of 0.61 (CI 0.52 - 0.70) (Baecker et al, 2019). Conclusions: Models based on data-driven feature selection outperform models that use predefined sets of known clinical correlates and can help in early prediction of PDAC development.
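As an illustration of how an AUC and a bootstrap 95% CI of the kind reported above can be computed, here is a minimal standard-library sketch. It uses a percentile bootstrap over the test examples, which may differ from the authors' exact "empirical bootstrap" procedure; the function names, the number of resamples, and the toy data are assumptions made for the example.

```python
import random

def auc(labels, scores):
    """AUC as the chance that a random case scores above a random control (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point AUC plus a percentile-bootstrap confidence interval (assumed variant)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lo, hi

# Toy test set: 1 = case, 0 = control; scores are hypothetical model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
point, lo, hi = bootstrap_auc_ci(labels, scores)
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for a sketch but would be replaced by a rank-based formula on datasets the size of BIDMC-LD.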

Identifying clinically important COPD sub-types using data-driven approaches in primary care population based electronic health records

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-019-0805-0 ◽

2019 ◽

Vol 19 (1) ◽

Cited By ~ 15

Author(s):

Maria Pikoula ◽

Jennifer Kathleen Quint ◽

Francis Nissen ◽

Harry Hemingway ◽

Liam Smeeth ◽

...

Keyword(s):

Primary Care ◽

Electronic Health Records ◽

Population Based ◽

Data Driven ◽

Health Records ◽

Primary Care Population ◽

Electronic Health ◽

Using Data

Predicting dementia diagnosis from cognitive footprints in electronic health records: a case–control study protocol

BMJ Open ◽

10.1136/bmjopen-2020-043487 ◽

2020 ◽

Vol 10 (11) ◽

pp. e043487

Author(s):

Hao Luo ◽

Kui Kai Lau ◽

Gloria H Y Wong ◽

Wai-Chi Chan ◽

Henry K F Mak ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Hong Kong ◽

Electronic Health Records ◽

Case Control Study ◽

Case Control ◽

Dementia Diagnosis ◽

Health Records ◽

Electronic Health ◽

Control Study

Introduction: Dementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history. Methods and analysis: We will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched at a 1:1 ratio, by age, gender and index date, with those who did receive such a diagnosis. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared. Ethics and dissemination: This study was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients’ records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Code for the resulting dementia risk prediction algorithm will be made publicly available on the website of the Tools to Inform Policy: Chinese Communities’ Action in Response to Dementia project (https://www.tip-card.hku.hk/).
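The 1:1 matching step described in the protocol can be sketched as follows. This is only an illustration under stated assumptions: it matches exactly on birth year and gender, samples controls at random without replacement, and assigns each matched control the index date of its case. The protocol does not specify these details, and the field names (`id`, `birth_year`, `gender`, `index_date`) are hypothetical.

```python
import random
from collections import defaultdict

def match_controls(cases, controls, seed=0):
    """Pair each case with one unused control sharing birth year and gender."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for c in controls:
        pool[(c["birth_year"], c["gender"])].append(c)
    for group in pool.values():
        rng.shuffle(group)  # random sampling within each stratum
    pairs = []
    for case in cases:
        group = pool[(case["birth_year"], case["gender"])]
        if group:  # cases without an eligible control stay unmatched
            ctrl = group.pop()  # without replacement: each control is used once
            pairs.append((case["id"], ctrl["id"], case["index_date"]))
    return pairs

# Hypothetical records for illustration only.
cases = [
    {"id": "c1", "birth_year": 1940, "gender": "F", "index_date": "2015-03-01"},
    {"id": "c2", "birth_year": 1950, "gender": "M", "index_date": "2016-01-01"},
]
controls = [
    {"id": "k1", "birth_year": 1940, "gender": "F"},
    {"id": "k2", "birth_year": 1940, "gender": "F"},
    {"id": "k3", "birth_year": 1945, "gender": "M"},
]
pairs = match_controls(cases, controls)
```

In this toy input, case `c2` has no control with the same birth year and gender, so it is left unmatched rather than paired with a poorer match.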

Comparative analysis of machine learning methods for analyzing security practice in electronic health records’ logs.

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378353 ◽

2020 ◽

Author(s):

Prosper K Yeng ◽

Muhammad Ali Fauzi ◽

Bian Yang

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Electronic Health Records ◽

Learning Methods ◽

Health Records ◽

Machine Learning Methods ◽

Electronic Health

Workflow-based anomaly detection using machine learning on electronic health records’ logs: A Comparative Study

2020 International Conference on Computational Science and Computational Intelligence (CSCI) ◽

10.1109/csci51800.2020.00143 ◽

2020 ◽

Author(s):

Prosper K Yeng ◽

Muhammad Ali Fauzi ◽

Bian Yang

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Anomaly Detection ◽

Comparative Study ◽

Health Records ◽

Electronic Health

The process of sourcing and preparing electronic health records data to implement a machine-learning algorithm for early identification of maternal cardiovascular risk (Preprint)

10.2196/preprints.34932 ◽

2021 ◽

Author(s):

Nawar Shara ◽

Kelley M. Anderson ◽

Noor Falah ◽

Maryam F. Ahmad ◽

Darya Tavazoei ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Risk ◽

Electronic Health Records ◽

Patient Care ◽

Early Identification ◽

Health Data ◽

Patient Specific ◽

Patient Records ◽

Health Records ◽

Electronic Health

BACKGROUND Healthcare data are fragmenting as patients seek care from diverse sources. Consequently, patient care is negatively impacted by disparate health records. Machine learning (ML) offers a disruptive force in its ability to inform and improve patient care and outcomes [6]. However, the differences that exist in each individual’s health records, combined with the lack of health-data standards, in addition to systemic issues that render the data unreliable and that fail to create a single view of each patient, create challenges for ML. While these problems exist throughout healthcare, they are especially prevalent within maternal health, and exacerbate the maternal morbidity and mortality (MMM) crisis in the United States. OBJECTIVE Maternal patient records were extracted from the electronic health records (EHRs) of a large tertiary healthcare system and made into patient-specific, complete datasets through a systematic method so that a machine-learning-based (ML-based) risk-assessment algorithm could effectively identify maternal cardiovascular risk prior to evidence of diagnosis or intervention within the patient’s record. METHODS We outline the effort that was required to define the specifications of the computational systems, the dataset, and access to relevant systems, while ensuring data security, privacy laws, and policies were met. Data acquisition included the concatenation, anonymization, and normalization of health data across multiple EHRs in preparation for its use by a proprietary risk-stratification algorithm designed to establish patient-specific baselines to identify and establish cardiovascular risk based on deviations from the patient’s baselines to inform early interventions. RESULTS Patient records can be made actionable for the goal of effectively employing machine learning (ML), specifically to identify cardiovascular risk in pregnant patients. 
CONCLUSIONS Upon acquiring data, including the concatenation, anonymization, and normalization of said data across multiple EHRs, the use of a machine-learning-based (ML-based) tool can provide early identification of cardiovascular risk in pregnant patients. CLINICALTRIAL N/A

The Postencounter Form System: Viewpoint on Efficient Data Collection Within Electronic Health Records (Preprint)

10.2196/preprints.17429 ◽

2019 ◽

Author(s):

Philip Held ◽

Randy A Boley ◽

Walter G Faig ◽

John A O'Toole ◽

Imran Desai ◽

...

Keyword(s):

Electronic Health Records ◽

Medical Center ◽

Health Records ◽

Clinical Notes ◽

The Road ◽

Clinical Encounters ◽

Efficient Data ◽

Data Points ◽

Electronic Health ◽

Using Data

UNSTRUCTURED Electronic health records (EHRs) offer opportunities for research and improvements in patient care. However, challenges exist in using data from EHRs due to the volume of information existing within clinical notes, which can be labor intensive and costly to transform into usable data with existing strategies. This case report details the collaborative development and implementation of the postencounter form (PEF) system into the EHR at the Road Home Program at Rush University Medical Center in Chicago, IL to address these concerns with limited burden to clinical workflows. The PEF system proved to be an effective tool with over 98% of all clinical encounters including a completed PEF within 5 months of implementation. In addition, the system has generated over 325,188 unique, readily-accessible data points in under 4 years of use. The PEF system has since been deployed to other settings demonstrating that the system may have broader clinical utility.

Leveraging Genetic Reports and Electronic Health Records for Predicting Primary Cancers Based on FHIR and RDF (Preprint)

10.2196/preprints.23586 ◽

2020 ◽

Author(s):

Nansu Zong ◽

Victoria Ngo ◽

Daniel J. Stone ◽

Andrew Wen ◽

Yiqing Zhao ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Cancer Patients ◽

Genetic Data ◽

Precision Oncology ◽

Primary Cancer ◽

Health Records ◽

Web Resource ◽

Cancer Types ◽

Electronic Health

BACKGROUND Precision oncology has the potential to leverage clinical and genomic data in advancing disease prevention, diagnosis, and treatment. A key research area focuses on early detection of primary cancers and the potential prediction of cancers of unknown primary in order to facilitate optimal treatment decisions. OBJECTIVE This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict unknown primaries. METHODS We extracted the genetic data elements from a collection of oncology genetic reports of 1,011 cancer patients, and corresponding phenotypic data from the Mayo Clinic electronic health records (EHRs). We modeled both genetic and EHR data with HL7 Fast Healthcare Interoperability Resources (FHIR). The semantic web Resource Description Framework (RDF) was employed to generate the network-based data representation (i.e., patient-phenotypic-genetic network). Based on the RDF data graph, the graph embedding algorithm Node2vec was applied to generate features, and then multiple machine learning and deep learning backbone models were adopted for cancer prediction. RESULTS With six machine-learning tasks designed in the experiment, we demonstrated that the proposed method achieved favorable results in classifying primary cancer types and predicting unknown primaries. To demonstrate the interpretability, phenotypic and genetic features that contributed the most to the prediction of each cancer were identified and validated based on a literature review. CONCLUSIONS Accurate prediction of cancer types can be achieved with existing EHR data with satisfactory precision. The integration of genetic reports improves prediction, illustrating the translational value of incorporating genetic tests early at the diagnosis stage for cancer patients.
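To make the graph-based feature step more concrete, the sketch below turns RDF-style `(subject, predicate, object)` triples into an undirected graph and samples uniform random walks over it. Uniform walks correspond to Node2vec with return and in-out parameters p = q = 1; the skip-gram training that turns walks into embeddings is omitted. The triples, predicate names, and function names are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict

def build_graph(triples):
    """Treat each (subject, predicate, object) triple as an undirected edge."""
    adj = defaultdict(set)
    for s, _predicate, o in triples:
        adj[s].add(o)
        adj[o].add(s)
    return adj

def random_walks(adj, walk_length=5, walks_per_node=2, seed=0):
    """Sample uniform random walks (Node2vec with p = q = 1) from every node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                # sorted() keeps the choice reproducible for a fixed seed
                walk.append(rng.choice(sorted(adj[walk[-1]])))
            walks.append(walk)
    return walks

# Hypothetical patient-phenotype-gene triples for illustration only.
triples = [
    ("patient:1", "hasPhenotype", "phenotype:anemia"),
    ("patient:1", "hasVariantIn", "gene:TP53"),
    ("patient:2", "hasVariantIn", "gene:TP53"),
    ("patient:2", "hasPhenotype", "phenotype:cough"),
]
adj = build_graph(triples)
walks = random_walks(adj)
```

In a full pipeline, walks like these would be fed to a skip-gram model so that patients who share phenotypes or mutated genes end up with nearby embedding vectors.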

Synchronization of Machine Learning into Electronic Health Records

International Journal of Computer Applications ◽

10.5120/ijca2019919751 ◽

2019 ◽

Vol 177 (26) ◽

pp. 40-47

Author(s):

Meet N. ◽

Eshan Vatsa ◽

Nitin S.

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Health Records ◽

Electronic Health
