Accurate COVID-19 Health Outcome Prediction and Risk Factors Identification through an Innovative Machine Learning Framework Using Longitudinal Electronic Health Records

Author(s):  
Alice Feng
BMJ Open ◽  
2020 ◽  
Vol 10 (11) ◽  
pp. e043487
Author(s):  
Hao Luo ◽  
Kui Kai Lau ◽  
Gloria H Y Wong ◽  
Wai-Chi Chan ◽  
Henry K F Mak ◽  
...  

Introduction: Dementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history. Methods and analysis: We will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched in a 1:1 ratio, by age, gender and index date, with those who did receive such a diagnosis. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared. Ethics and dissemination: This study was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients’ records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Code for the resulting dementia risk prediction algorithm will be made publicly available at the website of the Tools to Inform Policy: Chinese Communities’ Action in Response to Dementia project (https://www.tip-card.hku.hk/).
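The 1:1 matching step described in the protocol can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`age`, `gender`, `index_year`), the use of index year as a stand-in for index date, and the function name `match_controls` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def match_controls(cases, controls, seed=0):
    """Pair each case with one randomly chosen, not-yet-used control that
    shares the same (age, gender, index_year) key. Hypothetical field names.
    Cases with no available control are left unmatched."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for control in controls:
        key = (control["age"], control["gender"], control["index_year"])
        pool[key].append(control)
    for bucket in pool.values():
        rng.shuffle(bucket)  # random sampling within each matching stratum
    pairs = []
    for case in cases:
        key = (case["age"], case["gender"], case["index_year"])
        bucket = pool.get(key)
        if bucket:
            pairs.append((case, bucket.pop()))  # pop: each control used once
    return pairs

cases = [
    {"id": "c1", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "c2", "age": 82, "gender": "M", "index_year": 2015},
]
controls = [
    {"id": "k1", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "k2", "age": 70, "gender": "F", "index_year": 2010},
    {"id": "k3", "age": 82, "gender": "M", "index_year": 2015},
]
pairs = match_controls(cases, controls)
```

Matching without replacement, as here, keeps each control in at most one pair; a real study would also need a rule for cases whose stratum has no remaining controls.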


2020 ◽  
Author(s):  
Abin Abraham ◽  
Brian L Le ◽  
Idit Kosti ◽  
Peter Straub ◽  
Digna R Velez Edwards ◽  
...  

Abstract: Identifying pregnancies at risk for preterm birth, one of the leading causes of worldwide infant mortality, has the potential to improve prenatal care. However, we lack broadly applicable methods to accurately predict preterm birth risk. The dense longitudinal information present in electronic health records (EHRs) is enabling scalable and cost-efficient risk modeling of many diseases, but EHR resources have been largely untapped in the study of pregnancy. Here, we apply machine learning to diverse data from EHRs to predict singleton preterm birth. Leveraging a large cohort of 35,282 deliveries, we find that a prediction model based on billing codes alone can predict preterm birth at 28 weeks of gestation (ROC-AUC=0.75, PR-AUC=0.40) and outperforms a comparable model trained using known risk factors (ROC-AUC=0.59, PR-AUC=0.21). Our machine learning approach is also able to accurately predict preterm birth sub-types (spontaneous vs. indicated), mode of delivery, and recurrent preterm birth. We demonstrate the portability of our approach by showing that the prediction models maintain their accuracy on a large, independent cohort (5,978 deliveries) with only a modest decrease in performance. Interpreting the features identified by the model as most informative for risk stratification demonstrates that they capture non-linear combinations of known risk factors and patterns of care. The strong performance of our approach across multiple clinical contexts and an independent cohort highlights the potential of machine learning algorithms to improve medical care during pregnancy.
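For readers unfamiliar with the two metrics quoted above, the following sketch computes both from labels and scores in plain Python. It is an illustration only: ROC-AUC is computed as the fraction of positive/negative pairs ranked correctly, and PR-AUC is approximated by average precision, which is one common estimator and not necessarily the one the authors used. The function names and toy data are assumptions.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example.
    Tied scores are taken in sort order; no special tie handling."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)           # 0.75
ap = average_precision(y_true, scores)  # about 0.83
```

Because preterm birth is a minority outcome, the precision-recall figure is sensitive to class balance in a way ROC-AUC is not, which is presumably why the abstract reports both.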


Author(s):  
Dimitris Bertsimas ◽  
Galit Lukin ◽  
Luca Mingardi ◽  
Omid Nohadani ◽  
Agni Orfanoudaki ◽  
...  

Abstract. Background: Timely identification of COVID-19 patients at high risk of mortality can significantly improve patient management and resource allocation within hospitals. This study seeks to develop and validate a data-driven personalized mortality risk calculator for hospitalized COVID-19 patients. Methods: De-identified data was obtained for 3,927 COVID-19 positive patients from six independent centers, comprising 33 different hospitals. Demographic, clinical, and laboratory variables were collected at hospital admission. The COVID-19 Mortality Risk (CMR) tool was developed using the XGBoost algorithm to predict mortality. Its discrimination performance was subsequently evaluated on three validation cohorts. Findings: The derivation cohort of 3,062 patients has an observed mortality rate of 26.84%. Increased age, decreased oxygen saturation (≤ 93%), elevated levels of C-reactive protein (≥ 130 mg/L), blood urea nitrogen (≥ 18 mg/dL), and blood creatinine (≥ 1.2 mg/dL) were identified as primary risk factors, validating clinical findings. The model obtains an out-of-sample AUC of 0.90 (95% CI, 0.87-0.94) on the derivation cohort. In the validation cohorts, the model obtains AUCs of 0.92 (95% CI, 0.88-0.95) on Seville patients, 0.87 (95% CI, 0.84-0.91) on Hellenic COVID-19 Study Group patients, and 0.81 (95% CI, 0.76-0.85) on Hartford Hospital patients. The CMR tool is available as an online application at covidanalytics.io/mortality_calculator and is currently in clinical use. Interpretation: The CMR model leverages machine learning to generate accurate mortality predictions using commonly available clinical features. This is the first risk score trained and validated on a cohort of COVID-19 patients from Europe and the United States.
Research in context. Evidence before this study: We searched PubMed, BioRxiv, MedRxiv, arXiv, and SSRN for peer-reviewed articles, preprints, and research reports in English from inception to March 25th, 2020 focusing on disease severity and mortality risk scores for patients that had been infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Earlier investigations showed promise at predicting COVID-19 disease severity using data at admission. However, existing work was limited by its data scope, relying either on a single center with rich clinical information or on a broader cohort with sparse clinical information. No analysis has leveraged Electronic Health Records data from an international multi-center cohort from both Europe and the United States.
Added value of this study: We present the first multi-center COVID-19 mortality risk study that uses Electronic Health Records data from 3,062 patients across four different countries, including Greece, Italy, Spain, and the United States, encompassing 33 hospitals. We employed state-of-the-art machine learning techniques to develop a personalized COVID-19 mortality risk (CMR) score for hospitalized patients upon admission based on clinical features including vitals, lab results, and comorbidities. The model validates clinical findings of mortality risk factors and exhibits strong performance, with AUCs ranging from 0.81 to 0.92 across external validation cohorts. The model identifies increased age as a primary mortality predictor, consistent with observed disease trends and subsequent public health guidelines. Additionally, among the vital and lab values collected at admission, decreased oxygen saturation (≤ 93%) and elevated levels of C-reactive protein (≥ 130 mg/L), blood urea nitrogen (≥ 18 mg/dL), blood creatinine (≥ 1.2 mg/dL), and blood glucose (≥ 180 mg/dL) are highlighted as key biomarkers of mortality risk. These findings corroborate previous studies that link COVID-19 severity to hypoxemia, impaired kidney function, and diabetes. These features are also consistent with risk factors used in severity risk scores for related respiratory conditions such as community-acquired pneumonia.
Implications of all the available evidence: Our work presents the development and validation of a personalized mortality risk score. We take a data-driven approach to derive insights from Electronic Health Records data spanning Europe and the United States. While many existing papers on COVID-19 clinical characteristics and risk factors are based on Chinese hospital data, the similarities in our findings suggest consistency in the disease characteristics across international cohorts. Additionally, our machine learning model offers a novel approach to understanding the disease and its risk factors. By creating a single comprehensive risk score that integrates various admission data components, the calculator offers a streamlined way of evaluating COVID-19 patients upon admission to augment clinical expertise. The CMR model provides a valuable clinical decision support tool for patient triage and care management that improves risk estimation early in admission and can significantly affect the daily practice of physicians.
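The admission thresholds quoted in this abstract can be written down as a small lookup, shown below as a sketch. It only flags which measured values cross the quoted cut-offs; it is not the CMR model (which the abstract says is an XGBoost model), it produces no risk score, and the field names and the function `flag_risk_markers` are hypothetical.

```python
# Cut-offs as quoted in the abstract; the dictionary keys are hypothetical names.
THRESHOLDS = {
    "oxygen_saturation_pct": ("<=", 93.0),
    "c_reactive_protein_mg_l": (">=", 130.0),
    "blood_urea_nitrogen_mg_dl": (">=", 18.0),
    "creatinine_mg_dl": (">=", 1.2),
    "glucose_mg_dl": (">=", 180.0),
}

def flag_risk_markers(admission):
    """Return the names of measured admission values that cross a quoted
    threshold. Missing measurements are skipped, not treated as normal."""
    flagged = []
    for name, (op, cutoff) in THRESHOLDS.items():
        value = admission.get(name)
        if value is None:
            continue
        if (op == "<=" and value <= cutoff) or (op == ">=" and value >= cutoff):
            flagged.append(name)
    return flagged

example = {"oxygen_saturation_pct": 91, "creatinine_mg_dl": 0.9, "glucose_mg_dl": 180}
flags = flag_risk_markers(example)
```

A flat rule list like this ignores age and interactions between variables, which is precisely what a tree-based model is meant to capture.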


2020 ◽  
Vol 7 (Supplement_1) ◽  
pp. S819-S820
Author(s):  
Jonathan Todd ◽  
Jon Puro ◽  
Matthew Jones ◽  
Jee Oakley ◽  
Laura A Vonnahme ◽  
...  

Abstract. Background: Over 80% of tuberculosis (TB) cases in the United States are attributed to reactivation of latent TB infection (LTBI). Eliminating TB in the United States requires expanding identification and treatment of LTBI. Centralized electronic health records (EHRs) are an unexplored data source to identify persons with LTBI. We explored EHR data to evaluate TB and LTBI screening and diagnoses within OCHIN, Inc., a U.S. practice-based research network with a high proportion of Federally Qualified Health Centers. Methods: From the EHRs of patients who had an encounter at an OCHIN member clinic between January 1, 2012 and December 31, 2016, we extracted demographic variables, TB risk factors, TB screening tests, International Classification of Diseases (ICD) 9 and 10 codes, and treatment regimens. Based on test results, ICD codes, and treatment regimens, we developed a novel algorithm to classify patient records into LTBI categories: definite, probable or possible. We used multivariable logistic regression, with a referent group of all cohort patients not classified as having LTBI or TB, to identify associations between TB risk factors and LTBI. Results: Among 2,190,686 patients, 6.9% (n=151,195) had a TB screening test; among those, 8% tested positive. Non-U.S.–born or non-English–speaking persons comprised 24% of our cohort; 11% were tested for TB infection, and 14% had a positive test. Risk factors in the multivariable model significantly associated with being classified as having LTBI included preferring non-English language (adjusted odds ratio [aOR] 4.20, 95% confidence interval [CI] 4.09–4.32); non-Hispanic Asian (aOR 5.17, 95% CI 4.94–5.40), non-Hispanic black (aOR 3.02, 95% CI 2.91–3.13), or Native Hawaiian/other Pacific Islander (aOR 3.35, 95% CI 2.92–3.84) race; and HIV infection (aOR 3.09, 95% CI 2.84–3.35).
Conclusion: This study demonstrates the utility of EHR data for understanding TB screening practices and as an important data source that can be used to enhance public health surveillance of LTBI prevalence. Increasing screening among high-risk populations remains an important step toward eliminating TB in the United States. These results underscore the importance of offering TB screening in non-U.S.–born populations. Disclosures: All Authors: No reported disclosures
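To show how an odds ratio and its 95% confidence interval are read, here is a minimal sketch for a single 2x2 table using the standard log-odds (Woolf) interval. It is a deliberate simplification: the abstract reports adjusted odds ratios from a multivariable logistic regression, whereas this computes an unadjusted ratio. The counts and the function name `odds_ratio_ci` are invented for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald interval on the log scale.
    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four counts must be positive."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Invented counts, not taken from the study.
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=80, c=10, d=90)
```

An interval that excludes 1, as all of the intervals quoted in the abstract do, indicates an association that is statistically significant at the corresponding level.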

