Dense phenotyping from electronic health records enables machine-learning-based prediction of preterm birth

Abstract: Identifying pregnancies at risk for preterm birth, one of the leading causes of worldwide infant mortality, has the potential to improve prenatal care. However, we lack broadly applicable methods to accurately predict preterm birth risk. The dense longitudinal information present in electronic health records (EHRs) is enabling scalable and cost-efficient risk modeling of many diseases, but EHR resources have been largely untapped in the study of pregnancy. Here, we apply machine learning to diverse data from EHRs to predict singleton preterm birth. Leveraging a large cohort of 35,282 deliveries, we find that a prediction model based on billing codes alone can predict preterm birth at 28 weeks of gestation (ROC-AUC=0.75, PR-AUC=0.40) and outperforms a comparable model trained using known risk factors (ROC-AUC=0.59, PR-AUC=0.21). Our machine learning approach is also able to accurately predict preterm birth sub-types (spontaneous vs. indicated), mode of delivery, and recurrent preterm birth. We demonstrate the portability of our approach by showing that the prediction models maintain their accuracy on a large, independent cohort (5,978 deliveries) with only a modest decrease in performance. Interpreting the features identified by the model as most informative for risk stratification demonstrates that they capture non-linear combinations of known risk factors and patterns of care. The strong performance of our approach across multiple clinical contexts and an independent cohort highlights the potential of machine learning algorithms to improve medical care during pregnancy.
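The abstract reports both ROC-AUC and PR-AUC. As a rough illustration of how the two metrics are computed, here is a minimal pure-Python sketch on invented toy data; the labels and scores are assumptions for illustration only and are unrelated to the study's cohort or models.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / total_pos


# Hypothetical toy data: 2 positives among 10 examples (an imbalanced outcome).
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(roc_auc(labels, scores))                      # 0.9375
print(round(average_precision(labels, scores), 3))  # 0.833
```

With a rare outcome, a single false positive ranked above a true positive lowers the precision-recall area more than the ROC area, which is one reason both numbers are usually reported together.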


Predicting dementia diagnosis from cognitive footprints in electronic health records: a case–control study protocol

BMJ Open ◽

10.1136/bmjopen-2020-043487 ◽

2020 ◽

Vol 10 (11) ◽

pp. e043487

Author(s):

Hao Luo ◽

Kui Kai Lau ◽

Gloria H Y Wong ◽

Wai-Chi Chan ◽

Henry K F Mak ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Hong Kong ◽

Electronic Health Records ◽

Case Control Study ◽

Case Control ◽

Dementia Diagnosis ◽

Health Records ◽

Electronic Health ◽

Control Study

Introduction: Dementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history. Methods and analysis: We will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched 1:1 by age, gender and index date with those who did receive such a diagnosis. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared. Ethics and dissemination: This study was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients’ records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Code for the resulting dementia risk prediction algorithm will be made publicly available on the website of the Tools to Inform Policy: Chinese Communities’ Action in Response to Dementia project (https://www.tip-card.hku.hk/).
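The 1:1 matching of controls to cases described in the protocol could be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields (`id`, `age`, `gender`, `index_year`) and the function name `match_controls` are hypothetical, exact matching on index year stands in for the protocol's index date, and the protocol itself does not specify the matching procedure at this level of detail.

```python
import random
from collections import defaultdict


def match_controls(cases, controls, seed=0):
    """Pair each case with one unused control sharing (age, gender, index_year)."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for person in controls:
        pool[(person["age"], person["gender"], person["index_year"])].append(person)
    for candidates in pool.values():
        rng.shuffle(candidates)  # random sampling within each matching stratum

    pairs = []
    for case in cases:
        key = (case["age"], case["gender"], case["index_year"])
        if pool[key]:  # cases with no eligible control stay unmatched
            pairs.append((case["id"], pool[key].pop()["id"]))
    return pairs


# Hypothetical records, invented for illustration.
cases = [
    {"id": "c1", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "c2", "age": 80, "gender": "M", "index_year": 2015},
]
controls = [
    {"id": "k1", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "k2", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "k3", "age": 66, "gender": "M", "index_year": 2012},
]
pairs = match_controls(cases, controls)
print(pairs)
```

Each control is removed from the pool once used, so no control is matched to more than one case.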


Accurate COVID-19 Health Outcome Prediction and Risk Factors Identification through an Innovative Machine Learning Framework Using Longitudinal Electronic Health Records

10.1109/ichi52183.2021.00099 ◽

2021 ◽

Author(s):

Alice Feng

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Health Outcome ◽

Electronic Health Records ◽

Outcome Prediction ◽

Health Records ◽

Learning Framework ◽

Electronic Health


Application of Machine Learning in Chronic Kidney Disease Risk Prediction Using Electronic Health Records (EHR)

Applications of Big Data in Large- and Small-Scale Systems - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-6673-2.ch014 ◽

2021 ◽

pp. 213-233

Author(s):

Laxmi Kumari Pathak ◽

Pooja Jha

Keyword(s):

Machine Learning ◽

Chronic Kidney Disease ◽

Kidney Disease ◽

Electronic Health Records ◽

Bone Diseases ◽

Machine Learning Algorithms ◽

Health Records ◽

Chronic Kidney Disease Risk ◽

Electronic Health ◽

Heart Disorders

Chronic kidney disease (CKD) is a disorder in which the kidneys are weakened and become unable to filter blood, which lowers a person's ability to remain healthy. The field of biosciences has progressed and produced vast volumes of knowledge from electronic health records. Heart disorders, anemia, bone diseases, elevated potassium, and calcium are among the most prevalent complications that arise from kidney failure. Early identification of CKD can greatly improve quality of life. To achieve this, various machine learning techniques have been introduced that use the data in electronic health records (EHRs) to predict CKD. This chapter studies machine learning algorithms such as support vector machine, random forest, probabilistic neural network, Apriori, ZeroR, OneR, naive Bayes, J48, IBk (k-nearest neighbor), and ensemble methods, and compares their accuracy. The study aims to find the machine learning technique best suited to the early detection of CKD, one whose predictions medical professionals can interpret easily.


Predicting Acute Kidney Injury: A Machine Learning Approach Using Electronic Health Records

Information ◽

10.3390/info11080386 ◽

2020 ◽

Vol 11 (8) ◽

pp. 386

Author(s):

Sheikh S. Abdullah ◽

Neda Rostamzadeh ◽

Kamran Sedig ◽

Amit X. Garg ◽

Eric McArthur

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Acute Kidney Injury ◽

Electronic Health Records ◽

Prediction Models ◽

Kidney Injury ◽

Machine Learning Techniques ◽

Health Records ◽

Mortality And Morbidity ◽

Electronic Health

Acute kidney injury (AKI) is a common complication in hospitalized patients and can result in increased hospital stay, health-related costs, mortality and morbidity. A number of recent studies have shown that AKI is predictable and avoidable if early risk factors can be identified by analyzing electronic health records (EHRs). In this study, we employ machine learning techniques to identify older patients at risk of readmission with AKI to the hospital or emergency department within 90 days after discharge. The study includes the records of one million patients who visited a hospital or emergency department in Ontario between 2014 and 2016. The predictor variables include patient demographics, comorbid conditions, medications and diagnosis codes. We developed 31 prediction models based on different combinations of two sampling techniques, three ensemble methods, and eight classifiers. These models were evaluated through 10-fold cross-validation and compared based on the AUROC metric. The performances of these models were consistent, with AUROC ranging between 0.61 and 0.88 across the 31 prediction models. In general, ensemble-based methods performed better than cost-sensitive logistic regression. We also validated the features most relevant to predicting AKI with a healthcare expert to improve the performance and reliability of the models. This study predicts a patient's risk of AKI after discharge, which gives healthcare providers enough time to intervene before the onset of AKI.
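The 10-fold cross-validation mentioned in the abstract partitions the data so that every record is held out exactly once. A minimal sketch of the index bookkeeping is shown below; the function name `k_fold_indices` is hypothetical, and shuffling and class stratification are omitted for brevity, so this is not a description of the authors' actual pipeline.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds covering every sample once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for train, test in folds[:2]:
    print(len(train), test)
```

A model would be fitted on each `train` split and scored (for example by AUROC) on the matching `test` split, with the ten scores then averaged.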


On Missingness Features in Machine Learning Models for Critical Care: Observational Study (Preprint)

10.2196/preprints.25022 ◽

2020 ◽

Author(s):

Janmajay Singh ◽

Masahiro Sato ◽

Tomoko Ohkuma

Keyword(s):

Machine Learning ◽

Length Of Stay ◽

Electronic Health Records ◽

Prediction Models ◽

External Validation ◽

Model Performance ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

Background: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. But their effectiveness has not been comprehensively evaluated. Objective: The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explore robustness of these features across patient subgroups and task settings. Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups evaluating discriminative ability and calibration. Results: Generally improved model performance in retrospective tasks was observed on including missingness features. Extent of improvement depended on the outcome of interest (area under the curve of the receiver operating characteristic [AUROC] improved from 1.2% to 7.7%) and even patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features led to a concomitant rise in false positive detections. Conclusions: This study comprehensively evaluated effectiveness of missingness features on machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings especially for administrative tasks like length of stay prediction where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
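To make the idea of missingness features concrete, the sketch below derives two of them from a single measurement series: a binary missing indicator and the number of steps since the last observation, alongside a forward-filled value. This particular encoding is an assumption chosen for illustration (the abstract does not state how the features were constructed), and the function name and sample values are hypothetical.

```python
def add_missingness_features(series, default=0.0):
    """For each time step return (filled_value, is_missing, steps_since_last_observed)."""
    rows = []
    last_value, gap = default, 0
    for value in series:
        if value is None:
            gap += 1
            rows.append((last_value, 1, gap))  # carry the last observation forward
        else:
            last_value, gap = value, 0
            rows.append((value, 0, 0))
    return rows


# Hypothetical hourly lab values; None marks an hour with no measurement.
lactate = [None, 1.2, None, None, 0.9]
features = add_missingness_features(lactate)
for row in features:
    print(row)
```

A sequence model such as a gated recurrent unit could then take all three columns as input, so that the pattern of when a test was ordered is available to the model as well as its result.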


Novel electronic health records applied for prediction of pre-eclampsia: machine-learning algorithms

Pregnancy Hypertension ◽

10.1016/j.preghy.2021.10.006 ◽

2021 ◽

Author(s):

Yi-xin Li ◽

Xiao-ping Shen ◽

Chao Yang ◽

Zuo-zeng Cao ◽

Rui Du ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Health Records ◽

Electronic Health


Identifying lupus patients in electronic health records: Development and validation of machine learning algorithms and application of rule-based algorithms

Seminars in Arthritis and Rheumatism ◽

10.1016/j.semarthrit.2019.01.002 ◽

2019 ◽

Vol 49 (1) ◽

pp. 84-90 ◽

Cited By ~ 7

Author(s):

April Jorge ◽

Victor M. Castro ◽

April Barnado ◽

Vivian Gainer ◽

Chuan Hong ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Health Records ◽

Rule Based ◽

Electronic Health ◽

Development And Validation


Predicting risk of stroke from lab tests using machine learning algorithms (Preprint)

10.2196/preprints.23440 ◽

2020 ◽

Author(s):

Eman Alanazi ◽

Alaa Abdou ◽

Jake Luo

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Prediction Models ◽

Machine Learning Algorithms ◽

Data Sets ◽

Health Records ◽

Time Prediction ◽

Health And Nutrition ◽

Different Types ◽

Potential Risk Factors

Stroke, a cerebrovascular disease, is one of the major causes of death. It also places a health burden on both patients and healthcare systems. One of the important risk factors for stroke is health behavior, which is an increasing focus of prevention. In addition, chronic diseases such as hypertension, diabetes, cardiac diseases, and asthma are potential risk factors for stroke. Many machine learning models have been built using predictors such as lifestyle or radiology imaging. However, there are no models built using lab tests. The aim of the study is to fill this gap by building prediction models to predict stroke from lab tests. We utilized the National Health and Nutrition Examination Survey (NHANES) data sets to develop models that would predict stroke from patient lab tests. We found that accurate and sensitive machine learning models can be created to predict stroke from lab tests. The results showed that the best tested algorithm, random forest, reached the highest accuracy (ACC = 0.96) when all the attributes were used. The proposed model can be integrated with electronic health records to provide real-time prediction of stroke from lab tests. Owing to limitations of the data, we could not predict the type of stroke, whether hemorrhagic or ischemic. In future studies, we aim to use data that distinguish different types of stroke and explore the data to build a prediction model for each type.


On Missingness Features in Machine Learning Models for Critical Care: Observational Study

JMIR Medical Informatics ◽

10.2196/25022 ◽

2021 ◽

Vol 9 (12) ◽

pp. e25022

Author(s):

Janmajay Singh ◽

Masahiro Sato ◽

Tomoko Ohkuma

Keyword(s):

Machine Learning ◽

Length Of Stay ◽

Electronic Health Records ◽

Prediction Models ◽

External Validation ◽

Model Performance ◽

Learning Models ◽

Health Records ◽

Electronic Health ◽

Machine Learning Models

Background: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. But their effectiveness has not been comprehensively evaluated. Objective: The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explore robustness of these features across patient subgroups and task settings. Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups evaluating discriminative ability and calibration. Results: Generally improved model performance in retrospective tasks was observed on including missingness features. Extent of improvement depended on the outcome of interest (area under the curve of the receiver operating characteristic [AUROC] improved from 1.2% to 7.7%) and even patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features led to a concomitant rise in false positive detections. Conclusions: This study comprehensively evaluated effectiveness of missingness features on machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings especially for administrative tasks like length of stay prediction where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.


Development of Phenotyping Algorithms for the Identification of Organ Transplant Recipients: Cohort Study

JMIR Medical Informatics ◽

10.2196/18001 ◽

2020 ◽

Vol 8 (12) ◽

pp. e18001

Author(s):

Lee Wheless ◽

Laura Baker ◽

LaVar Edwards ◽

Nimay Anand ◽

Kelly Birdwell ◽

...

Keyword(s):

Machine Learning ◽

Electronic Health Records ◽

Organ Transplantation ◽

Learning Algorithms ◽

Organ Transplant ◽

Machine Learning Algorithms ◽

Transplant Recipients ◽

Health Records ◽

Organ Transplant Recipients ◽

Electronic Health

Background: Studies involving organ transplant recipients (OTRs) are often limited to the variables collected in the national Scientific Registry of Transplant Recipients database. Electronic health records contain additional variables that can augment this data source if OTRs can be identified accurately. Objective: The aim of this study was to develop phenotyping algorithms to identify OTRs from electronic health records. Methods: We used Vanderbilt’s deidentified version of its electronic health record database, which contains nearly 3 million subjects, to develop algorithms to identify OTRs. We identified all 19,817 individuals with at least one International Classification of Diseases (ICD) or Current Procedural Terminology (CPT) code for organ transplantation. We performed a chart review on 1350 randomly selected individuals to determine the transplant status. We constructed machine learning models to calculate positive predictive values and sensitivity for combinations of codes by using classification and regression trees, random forest, and extreme gradient boosting algorithms. Results: Of the 1350 reviewed patient charts, 827 were organ transplant recipients while 511 had no record of a transplant, and 12 were equivocal. Most patients with only 1 or 2 transplant codes did not have a transplant. The most common reasons for being labeled a nontransplant patient were the lack of data (229/511, 44.8%) or the patient being evaluated for an organ transplant (174/511, 34.1%). All 3 machine learning algorithms identified OTRs with overall >90% positive predictive value and >88% sensitivity. Conclusions: Electronic health records linked to biobanks are increasingly used to conduct large-scale studies but have not been well-utilized in organ transplantation research. We present rigorously evaluated methods for phenotyping OTRs from electronic health records that will enable the use of the full spectrum of clinical data in transplant research. Using several different machine learning algorithms, we were able to identify transplant cases with high accuracy by using only ICD and CPT codes.
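The chart-review counts and percentages quoted in the abstract can be checked with a few lines of arithmetic, and the two reported metrics follow directly from confusion counts. In the sketch below, the chart-review figures come from the abstract, while the `tp`, `fp` and `fn` values are hypothetical numbers invented to show the formulas; they are not results from the study.

```python
def ppv(tp, fp):
    """Positive predictive value: share of flagged patients who are true recipients."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Sensitivity: share of true recipients that the algorithm flags."""
    return tp / (tp + fn)


# Counts reported in the abstract.
reviewed, recipients, non_recipients, equivocal = 1350, 827, 511, 12
assert recipients + non_recipients + equivocal == reviewed
print(round(100 * 229 / 511, 1))  # 44.8 (lack of data)
print(round(100 * 174 / 511, 1))  # 34.1 (evaluated for transplant)

# Hypothetical confusion counts for one code-based rule (illustration only).
tp, fp, fn = 760, 60, 67
print(round(ppv(tp, fp), 3))          # 0.927
print(round(sensitivity(tp, fn), 3))  # 0.919
```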
