FIBER: enabling flexible retrieval of electronic health records data for clinical predictive modeling

Suparno Datta; Jan Philipp Sachs; Harry Freitas Da Cruz; Tom Martensen; Philipp Bode; Ariane Morassi Sasso; Benjamin S Glicksberg; Erwin Böttinger

doi:10.1093/jamiaopen/ooab048

JAMIA Open ◽

10.1093/jamiaopen/ooab048 ◽

2021 ◽

Vol 4 (3) ◽

Author(s):

Suparno Datta ◽

Jan Philipp Sachs ◽

Harry Freitas Da Cruz ◽

Tom Martensen ◽

Philipp Bode ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Clinical Data ◽

Heart Surgery ◽

Data Warehouses ◽

Learning Models ◽

Health Records ◽

Star Schema ◽

Electronic Health ◽

Machine Learning Models

Abstract

Objectives: The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames.

Materials and Methods: FIBER was developed on top of a large-scale star schema EHR database which contains data from 8 million patients and over 120 million encounters. To illustrate FIBER’s capabilities, we present its application by building a heart surgery patient cohort with subsequent prediction of acute kidney injury (AKI) with various machine learning models.

Results: Using FIBER, we were able to build the heart surgery cohort (n = 12 061), identify the patients that developed AKI (n = 1005), and automatically extract relevant features (n = 774). Finally, we trained machine learning models that achieved area under the curve values of up to 0.77 for this exemplary use case.

Conclusion: FIBER is an open-source Python library developed for extracting information from star schema clinical data warehouses and reduces time-to-modeling, helping to streamline the clinical modeling process.
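To make the star schema idea concrete, here is a minimal sketch of the kind of query such a library has to generate. It does not use FIBER's actual API, which the abstract does not describe. The table layout is a simplified toy loosely modeled on i2b2 (one fact table plus a patient dimension), and the concept codes `PROC:HEART_SURGERY` and `DX:AKI` as well as the 7-day outcome window are assumptions made purely for illustration.

```python
import sqlite3

# Toy star schema: one fact table plus one dimension table (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT, start_date TEXT);
""")
conn.executemany(
    "INSERT INTO patient_dimension VALUES (?, ?)",
    [(1, 1950), (2, 1962), (3, 1971)],
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?, ?)",
    [
        (1, "PROC:HEART_SURGERY", "2015-03-01"),
        (1, "DX:AKI", "2015-03-04"),
        (2, "PROC:HEART_SURGERY", "2016-07-10"),
        (3, "DX:AKI", "2014-01-20"),  # AKI without surgery: not in the cohort
    ],
)

# Cohort = patients with a surgery fact; label = AKI fact within 7 days after it.
COHORT_SQL = """
SELECT s.patient_num,
       p.birth_year,
       MIN(s.start_date) AS surgery_date,
       MAX(CASE WHEN a.patient_num IS NOT NULL THEN 1 ELSE 0 END) AS aki
FROM observation_fact AS s
JOIN patient_dimension AS p ON p.patient_num = s.patient_num
LEFT JOIN observation_fact AS a
       ON a.patient_num = s.patient_num
      AND a.concept_cd = 'DX:AKI'
      AND a.start_date BETWEEN s.start_date AND date(s.start_date, '+7 days')
WHERE s.concept_cd = 'PROC:HEART_SURGERY'
GROUP BY s.patient_num, p.birth_year
ORDER BY s.patient_num
"""
cohort = conn.execute(COHORT_SQL).fetchall()
for row in cohort:
    print(row)
```

Each output row holds the patient number, a demographic attribute from the dimension table, the index date, and a binary outcome label, which is roughly the shape a modeling-ready data frame would take.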

Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease

PLoS ONE ◽

10.1371/journal.pone.0202344 ◽

2018 ◽

Vol 13 (8) ◽

pp. e0202344 ◽

Cited By ~ 42

Author(s):

Andrew J. Steele ◽

Spiros C. Denaxas ◽

Anoop D. Shah ◽

Harry Hemingway ◽

Nicholas M. Luscombe

Keyword(s):

Machine Learning ◽

Coronary Artery Disease ◽

Coronary Artery ◽

Electronic Health Records ◽

Survival Models ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Artery Disease ◽

Machine Learning Models

Developing Machine Learning Models to Identify Acute Respiratory Distress Syndrome Criteria in Electronic Health Records

10.1164/ajrccm-conference.2020.201.1_meetingabstracts.a1010 ◽

2020 ◽

Author(s):

A. Modi ◽

H. Lee ◽

M. Bechel ◽

O. Keaveny ◽

L.A.N. Amaral ◽

...

Keyword(s):

Machine Learning ◽

Acute Respiratory Distress Syndrome ◽

Electronic Health Records ◽

Respiratory Distress Syndrome ◽

Respiratory Distress ◽

Distress Syndrome ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

The prediction of asymptomatic carotid atherosclerosis with electronic health records: a comparative study of six machine learning models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01480-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jiaxin Fan ◽

Mengying Chen ◽

Jian Luo ◽

Shusen Yang ◽

Jinming Shi ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Carotid Atherosclerosis ◽

Characteristic Curve ◽

Gradient Boosting ◽

Learning Models ◽

Health Records ◽

Extreme Gradient Boosting ◽

Electronic Health ◽

Machine Learning Models

Abstract

Background: Screening carotid B-mode ultrasonography is a frequently used method to detect subjects with carotid atherosclerosis (CAS). Because CAS progresses asymptomatically in most patients, early identification is challenging for clinicians, and the disease may trigger ischemic stroke. Recently, machine learning has shown a strong ability to classify data and a potential for prediction in the medical field. The combined use of machine learning and the electronic health records of patients could provide clinicians with a more convenient and precise method to identify asymptomatic CAS.

Methods: This was a retrospective cohort study using routine clinical data of medical check-up subjects from April 19, 2010 to November 15, 2019. Six machine learning models (logistic regression [LR], random forest [RF], decision tree [DT], eXtreme Gradient Boosting [XGB], Gaussian Naïve Bayes [GNB], and K-Nearest Neighbour [KNN]) were used to predict asymptomatic CAS, and their predictive performance was compared in terms of the area under the receiver operating characteristic curve (AUCROC), accuracy (ACC), and F1 score (F1).

Results: Of the 18,441 subjects, 6553 were diagnosed with asymptomatic CAS. Compared to DT (AUCROC 0.628, ACC 65.4%, and F1 52.5%), the other five models improved prediction: KNN +7.6% (0.704, 68.8%, and 50.9%, respectively), GNB +12.5% (0.753, 67.0%, and 46.8%, respectively), XGB +16.0% (0.788, 73.4%, and 55.7%, respectively), RF +16.6% (0.794, 74.5%, and 56.8%, respectively), and LR +18.1% (0.809, 74.7%, and 59.9%, respectively). The best-performing model, LR, correctly predicted 1045/1966 cases (sensitivity 53.2%) and 3088/3566 non-cases (specificity 86.6%). A tenfold cross-validation scheme further verified the predictive ability of the LR.

Conclusions: Among the machine learning models compared, LR showed the best performance in predicting asymptomatic CAS. Our findings set the stage for an early automatic alert system, allowing a more precise allocation of CAS prevention measures to the individuals most likely to benefit.
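The reported figures can be cross-checked with a few lines of arithmetic. The sketch below rebuilds the LR confusion matrix from the counts given in the abstract (1045/1966 cases and 3088/3566 non-cases) and recomputes sensitivity, specificity, accuracy, and F1. It also shows that the "+x%" improvements over DT match the absolute AUCROC differences expressed in percentage points. Variable names are illustrative only.

```python
# Confusion-matrix counts for LR, taken from the abstract.
tp, fn = 1045, 1966 - 1045   # cases predicted / missed
tn, fp = 3088, 3566 - 3088   # non-cases predicted / misclassified

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + tn + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")
print(f"accuracy {accuracy:.1%}, F1 {f1:.1%}")

# AUCROC per model as reported; gain over DT in percentage points.
auroc = {"DT": 0.628, "KNN": 0.704, "GNB": 0.753, "XGB": 0.788, "RF": 0.794, "LR": 0.809}
gain = {m: round((v - auroc["DT"]) * 100, 1) for m, v in auroc.items() if m != "DT"}
print(gain)
```

The recomputed accuracy and F1 agree with the 74.7% and 59.9% reported for LR, which suggests those metrics were computed on the same 5532 evaluated subjects.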

On Missingness Features in Machine Learning Models for Critical Care: Observational Study (Preprint)

10.2196/preprints.25022 ◽

2020 ◽

Author(s):

Janmajay Singh ◽

Masahiro Sato ◽

Tomoko Ohkuma

Keyword(s):

Machine Learning ◽

Length Of Stay ◽

Electronic Health Records ◽

Prediction Models ◽

External Validation ◽

Model Performance ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

Background: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. But their effectiveness has not been comprehensively evaluated.

Objective: The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explore robustness of these features across patient subgroups and task settings.

Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups evaluating discriminative ability and calibration.

Results: Model performance in retrospective tasks generally improved when missingness features were included. The extent of improvement depended on the outcome of interest (improvements in area under the curve of the receiver operating characteristic [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features led to a concomitant rise in false positive detections.

Conclusions: This study comprehensively evaluated effectiveness of missingness features on machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
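As an illustration of what a missingness feature can look like, the following sketch turns one irregularly sampled measurement series into two parallel inputs: an imputed value series and a binary indicator that marks where a value was missing. The abstract does not say how the study encoded missingness or imputed values, so the forward-fill strategy, the default fill value, and the function name `add_missingness` are all assumptions.

```python
from typing import List, Optional, Tuple

def add_missingness(
    series: List[Optional[float]], fill: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Forward-fill a series and return (values, mask), mask=1 where missing."""
    values: List[float] = []
    mask: List[int] = []
    last = fill  # used until the first real observation arrives
    for x in series:
        if x is None:
            values.append(last)
            mask.append(1)
        else:
            values.append(x)
            mask.append(0)
            last = x
    return values, mask

# Hourly heart-rate readings with gaps (None = not measured that hour).
heart_rate = [None, 82.0, None, None, 90.0]
values, mask = add_missingness(heart_rate)
print(values)
print(mask)
```

A sequence model such as a gated recurrent unit could then receive both `values` and `mask` at each time step, so that the pattern of measurement itself becomes a feature.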

On Missingness Features in Machine Learning Models for Critical Care: Observational Study

JMIR Medical Informatics ◽

10.2196/25022 ◽

2021 ◽

Vol 9 (12) ◽

pp. e25022

Author(s):

Janmajay Singh ◽

Masahiro Sato ◽

Tomoko Ohkuma

Keyword(s):

Machine Learning ◽

Length Of Stay ◽

Electronic Health Records ◽

Prediction Models ◽

External Validation ◽

Model Performance ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

Background: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. But their effectiveness has not been comprehensively evaluated.

Objective: The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explore robustness of these features across patient subgroups and task settings.

Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups evaluating discriminative ability and calibration.

Results: Model performance in retrospective tasks generally improved when missingness features were included. The extent of improvement depended on the outcome of interest (improvements in area under the curve of the receiver operating characteristic [AUROC] ranged from 1.2% to 7.7%) and even on the patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features led to a concomitant rise in false positive detections.

Conclusions: This study comprehensively evaluated effectiveness of missingness features on machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.

Delirium Prediction using Machine Learning Models on Predictive Electronic Health Records Data

2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE) ◽

10.1109/bibe.2017.00014 ◽

2017 ◽

Cited By ~ 9

Author(s):

Anis Davoudi ◽

Tezcan Ozrazgat-Baslanti ◽

Ashkan Ebadi ◽

Alberto C. Bursian ◽

Azra Bihorac ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

Comparing Machine Learning Models for Identifying Chronic Cough Using Diagnosis and Medication in the Electronic Health Records

Journal of Allergy and Clinical Immunology ◽

10.1016/j.jaci.2020.12.241 ◽

2021 ◽

Vol 147 (2) ◽

pp. AB61

Author(s):

Vishal Bali ◽

Xiao Luo ◽

Priyanka Gandhi ◽

Zuoyi Zhang ◽

Wei Shao ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Chronic Cough ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

Early Prediction of Alzheimer’s Disease and Related Dementias Using Electronic Health Records

10.1101/2020.06.13.20130401 ◽

2020 ◽

Author(s):

Xi Yang ◽

Qian Li ◽

Yonghui Wu ◽

Jiang Bian ◽

Tianchen Lyu ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Data Driven ◽

Early Prediction ◽

Learning Models ◽

Health Records ◽

Domain Experts ◽

Data Driven Approach ◽

Electronic Health ◽

Machine Learning Models

Abstract

Alzheimer’s disease (AD) and AD-related dementias (ADRD) are a class of neurodegenerative diseases affecting about 5.7 million Americans. There is no cure for AD/ADRD. Current interventions have modest effects and focus on attenuating cognitive impairment. Detection of patients at high risk of AD/ADRD is crucial for timely interventions to modify risk factors and primarily prevent cognitive decline and dementia, and thus to enhance the quality of life and reduce health care costs. This study seeks to investigate both knowledge-driven (where domain experts identify useful features) and data-driven (where machine learning models select useful features among all available data elements) approaches for AD/ADRD early prediction using real-world electronic health records (EHR) data from the University of Florida (UF) Health system. We identified a cohort of 59,799 patients and examined four widely used machine learning algorithms following a standard case-control study. We also examined the early prediction of AD/ADRD using patient information from 0, 1, 3, and 5 years before the disease onset date. The experimental results showed that models based on Gradient Boosting Trees (GBT) achieved the best performance for the data-driven approach and Random Forests (RF) achieved the best performance for the knowledge-driven approach. Among all models, GBT using a data-driven approach achieved the best area under the curve (AUC) scores of 0.7976, 0.7192, 0.6985, and 0.6798 for 0-, 1-, 3-, and 5-year prediction, respectively. We also examined the top features identified by the machine learning models and compared them with the knowledge-driven features identified by domain experts. Our study demonstrated the feasibility of using electronic health records for the early prediction of AD/ADRD and discovered potential challenges for future investigations.
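The prediction-window setup can be sketched as a simple filter: for a k-year prediction task, only records dated at least k years before the onset date are passed to the model. The record layout, the 365.25-day year, the inclusive cutoff, and the function name `records_before_window` are assumptions for illustration. The abstract does not specify how the study handled these details.

```python
from datetime import date, timedelta
from typing import List, Tuple

Record = Tuple[date, str]  # (record date, hypothetical feature code)

def records_before_window(
    records: List[Record], onset: date, years_before: int
) -> List[Record]:
    """Keep only records dated on or before onset minus `years_before` years."""
    cutoff = onset - timedelta(days=round(365.25 * years_before))
    return [r for r in records if r[0] <= cutoff]

onset = date(2018, 6, 1)
history = [
    (date(2012, 5, 1), "dx:hypertension"),
    (date(2015, 5, 1), "dx:diabetes"),
    (date(2017, 9, 1), "rx:statin"),
    (date(2018, 5, 20), "dx:memory_loss"),
]

# Number of usable records shrinks as the prediction horizon grows.
counts = {k: len(records_before_window(history, onset, k)) for k in (0, 1, 3, 5)}
print(counts)
```

Less history is available at longer horizons, which is consistent with the AUC decreasing from the 0-year to the 5-year task in the abstract.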

Towards Validating the Effectiveness of Obstructive Sleep Apnea Classification from Electronic Health Records Using Machine Learning

Healthcare ◽

10.3390/healthcare9111450 ◽

2021 ◽

Vol 9 (11) ◽

pp. 1450

Author(s):

Jayroop Ramesh ◽

Niha Keeran ◽

Assim Sagahyroon ◽

Fadi Aloul

Keyword(s):

Machine Learning ◽

Obstructive Sleep Apnea ◽

Sleep Apnea ◽

Electronic Health Records ◽

Clinical Data ◽

Bayesian Optimization ◽

Support Vector ◽

Health Records ◽

Obstructive Sleep ◽

Electronic Health

Obstructive sleep apnea (OSA) is a common, chronic, sleep-related breathing disorder characterized by partial or complete airway obstruction during sleep. The gold-standard diagnostic method is polysomnography, which estimates disease severity through the Apnea-Hypopnea Index (AHI). However, polysomnography is expensive and not widely accessible to the public. For effective screening, this work implements machine learning algorithms for classification of OSA. The model is trained with routinely acquired clinical data of 1479 records from the Wisconsin Sleep Cohort dataset. Features extracted from the electronic health records include patient demographics, laboratory blood reports, physical measurements, habitual sleep history, comorbidities, and general health questionnaire scores. For distinguishing between OSA and non-OSA patients, feature selection methods identify the most important predictors as waist-to-height ratio, waist circumference, neck circumference, body-mass index, lipid accumulation product, excessive daytime sleepiness, daily snoring frequency, and snoring volume. Optimal hyperparameters were selected using a hybrid tuning method consisting of Bayesian Optimization and Genetic Algorithms through a five-fold cross-validation strategy. Support vector machines achieved the highest evaluation scores, with accuracy 68.06%, sensitivity 88.76%, specificity 40.74%, F1-score 75.96%, PPV 66.36%, and NPV 73.33%. We conclude that routine clinical data can be useful in prioritization of patient referral for further sleep studies.
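The five-fold cross-validation step can be sketched without any machine learning library. The generator below shuffles record indices and yields five train/test splits. It does not reproduce the paper's Bayesian Optimization and Genetic Algorithm tuning, and the function name, seed, and round-robin fold assignment are assumptions. In a tuning loop, each candidate hyperparameter setting would be scored on the five held-out folds and the scores averaged.

```python
import random
from typing import Iterator, List, Tuple

def k_fold_indices(
    n: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 1479 records, as in the abstract; each record is held out exactly once.
splits = list(k_fold_indices(1479, k=5))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```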

Predicting dementia diagnosis from cognitive footprints in electronic health records: a case–control study protocol

BMJ Open ◽

10.1136/bmjopen-2020-043487 ◽

2020 ◽

Vol 10 (11) ◽

pp. e043487

Author(s):

Hao Luo ◽

Kui Kai Lau ◽

Gloria H Y Wong ◽

Wai-Chi Chan ◽

Henry K F Mak ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Hong Kong ◽

Electronic Health Records ◽

Case Control Study ◽

Case Control ◽

Dementia Diagnosis ◽

Health Records ◽

Electronic Health ◽

Control Study

Introduction: Dementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history.

Methods and analysis: We will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched at a 1:1 ratio, by age, gender and index date, with those who did receive such a diagnosis. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared.

Ethics and dissemination: This study was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients’ records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Code for the resulting dementia risk prediction algorithm will be made publicly available at the website of the Tools to Inform Policy: Chinese Communities’ Action in Response to Dementia project (https://www.tip-card.hku.hk/).
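The 1:1 matching described in the protocol can be sketched as exact matching on a key, drawing one random control per case without replacement. The protocol does not give the matching algorithm or tolerances, so exact matching, the use of index year as a stand-in for index date, the field names, and the function name `match_controls` are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def match_controls(
    cases: List[Dict], controls: List[Dict], seed: int = 0
) -> List[Tuple[str, str]]:
    """Pair each case with one unused control sharing age, gender, index year."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for c in controls:
        pool[(c["age"], c["gender"], c["index_year"])].append(c)
    pairs = []
    for case in cases:
        candidates = pool[(case["age"], case["gender"], case["index_year"])]
        if not candidates:
            continue  # no eligible control left; case stays unmatched
        control = candidates.pop(rng.randrange(len(candidates)))
        pairs.append((case["id"], control["id"]))
    return pairs

cases = [
    {"id": "A", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "B", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "C", "age": 80, "gender": "M", "index_year": 2012},
]
controls = [
    {"id": "x1", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "x2", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "x3", "age": 66, "gender": "M", "index_year": 2012},
]
pairs = match_controls(cases, controls)
print(pairs)
```

Sampling without replacement keeps the ratio at exactly 1:1 for matched cases. A real study would also need a rule for cases that find no eligible control, such as case C here.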
