Applying machine learning algorithms to electronic health records predicted pneumonia after respiratory tract infection

Author(s):  
Xiaohui Sun ◽  
Abdel Douiri ◽  
Martin Gulliford
10.2196/18001 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e18001
Author(s):  
Lee Wheless ◽  
Laura Baker ◽  
LaVar Edwards ◽  
Nimay Anand ◽  
Kelly Birdwell ◽  
...  

Background Studies involving organ transplant recipients (OTRs) are often limited to the variables collected in the national Scientific Registry of Transplant Recipients database. Electronic health records contain additional variables that can augment this data source if OTRs can be identified accurately. Objective The aim of this study was to develop phenotyping algorithms to identify OTRs from electronic health records. Methods We used Vanderbilt’s deidentified version of its electronic health record database, which contains nearly 3 million subjects, to develop algorithms to identify OTRs. We identified all 19,817 individuals with at least one International Classification of Diseases (ICD) or Current Procedural Terminology (CPT) code for organ transplantation. We performed a chart review on 1350 randomly selected individuals to determine the transplant status. We constructed machine learning models to calculate positive predictive values and sensitivity for combinations of codes by using classification and regression trees, random forest, and extreme gradient boosting algorithms. Results Of the 1350 reviewed patient charts, 827 were organ transplant recipients while 511 had no record of a transplant, and 12 were equivocal. Most patients with only 1 or 2 transplant codes did not have a transplant. The most common reasons for being labeled a nontransplant patient were the lack of data (229/511, 44.8%) or the patient being evaluated for an organ transplant (174/511, 34.1%). All 3 machine learning algorithms identified OTRs with overall >90% positive predictive value and >88% sensitivity. Conclusions Electronic health records linked to biobanks are increasingly used to conduct large-scale studies but have not been well-utilized in organ transplantation research. We present rigorously evaluated methods for phenotyping OTRs from electronic health records that will enable the use of the full spectrum of clinical data in transplant research. Using several different machine learning algorithms, we were able to identify transplant cases with high accuracy by using only ICD and CPT codes.


2020 ◽  
Author(s):  
Lee Wheless ◽  
Laura Baker ◽  
LaVar Edwards ◽  
Nimay Anand ◽  
Kelly Birdwell ◽  
...  

BACKGROUND Studies involving organ transplant recipients (OTRs) are often limited to the variables collected in the national Scientific Registry of Transplant Recipients database. Electronic health records contain additional variables that can augment this data source if OTRs can be identified accurately. OBJECTIVE The aim of this study was to develop phenotyping algorithms to identify OTRs from electronic health records. METHODS We used Vanderbilt’s deidentified version of its electronic health record database, which contains nearly 3 million subjects, to develop algorithms to identify OTRs. We identified all 19,817 individuals with at least one International Classification of Diseases (ICD) or Current Procedural Terminology (CPT) code for organ transplantation. We performed a chart review on 1350 randomly selected individuals to determine the transplant status. We constructed machine learning models to calculate positive predictive values and sensitivity for combinations of codes by using classification and regression trees, random forest, and extreme gradient boosting algorithms. RESULTS Of the 1350 reviewed patient charts, 827 were organ transplant recipients while 511 had no record of a transplant, and 12 were equivocal. Most patients with only 1 or 2 transplant codes did not have a transplant. The most common reasons for being labeled a nontransplant patient were the lack of data (229/511, 44.8%) or the patient being evaluated for an organ transplant (174/511, 34.1%). All 3 machine learning algorithms identified OTRs with overall >90% positive predictive value and >88% sensitivity. CONCLUSIONS Electronic health records linked to biobanks are increasingly used to conduct large-scale studies but have not been well-utilized in organ transplantation research. We present rigorously evaluated methods for phenotyping OTRs from electronic health records that will enable the use of the full spectrum of clinical data in transplant research. Using several different machine learning algorithms, we were able to identify transplant cases with high accuracy by using only ICD and CPT codes.


Author(s):  
Laxmi Kumari Pathak ◽  
Pooja Jha

Chronic kidney disease (CKD) is a disorder in which the kidneys are weakened and become unable to filter blood. It lowers the human ability to remain healthy. The field of biosciences has progressed and produced vast volumes of knowledge from electronic health records. Heart disorders, anemia, bone diseases, elevated potassium, and calcium are the very prevalent complications that arise from kidney failure. Early identification of CKD can improve the quality of life greatly. To achieve this, various machine learning techniques have been introduced so far that use the data in electronic health record (EHR) to predict CKD. This chapter studies various machine learning algorithms like support vector machine, random forest, probabilistic neural network, Apriori, ZeroR, OneR, naive Bayes, J48, IBk (k-nearest neighbor), ensemble method, etc. and compares their accuracy. The study aims in finding the best-suited technique from different methods of machine learning for the early detection of CKD by which medical professionals can interpret model predictions easily.


2020 ◽  
Author(s):  
Abin Abraham ◽  
Brian L Le ◽  
Idit Kosti ◽  
Peter Straub ◽  
Digna R Velez Edwards ◽  
...  

Abstract: Identifying pregnancies at risk for preterm birth, one of the leading causes of worldwide infant mortality, has the potential to improve prenatal care. However, we lack broadly applicable methods to accurately predict preterm birth risk. The dense longitudinal information present in electronic health records (EHRs) is enabling scalable and cost-efficient risk modeling of many diseases, but EHR resources have been largely untapped in the study of pregnancy. Here, we apply machine learning to diverse data from EHRs to predict singleton preterm birth. Leveraging a large cohort of 35,282 deliveries, we find that a prediction model based on billing codes alone can predict preterm birth at 28 weeks of gestation (ROC-AUC=0.75, PR-AUC=0.40) and outperforms a comparable model trained using known risk factors (ROC-AUC=0.59, PR-AUC=0.21). Our machine learning approach is also able to accurately predict preterm birth sub-types (spontaneous vs. indicated), mode of delivery, and recurrent preterm birth. We demonstrate the portability of our approach by showing that the prediction models maintain their accuracy on a large, independent cohort (5,978 deliveries) with only a modest decrease in performance. Interpreting the features identified by the model as most informative for risk stratification demonstrates that they capture non-linear combinations of known risk factors and patterns of care. The strong performance of our approach across multiple clinical contexts and an independent cohort highlights the potential of machine learning algorithms to improve medical care during pregnancy.


2018 ◽  
Author(s):  
Gondy Leroy ◽  
Yang Gu ◽  
Sydney Pettygrove ◽  
Maureen K Galindo ◽  
Ananyaa Arora ◽  
...  

BACKGROUND Electronic health records (EHRs) bring many opportunities for information utilization. One such use is the surveillance conducted by the Centers for Disease Control and Prevention to track cases of autism spectrum disorder (ASD). This process currently comprises manual collection and review of EHRs of 4- and 8-year old children in 11 US states for the presence of ASD criteria. The work is time-consuming and expensive. OBJECTIVE Our objective was to automatically extract from EHRs the description of behaviors noted by the clinicians in evidence of the diagnostic criteria in the Diagnostic and Statistical Manual of Mental Disorders (DSM). Previously, we reported on the classification of entire EHRs as ASD or not. In this work, we focus on the extraction of individual expressions of the different ASD criteria in the text. We intend to facilitate large-scale surveillance efforts for ASD and support analysis of changes over time as well as enable integration with other relevant data. METHODS We developed a natural language processing (NLP) parser to extract expressions of 12 DSM criteria using 104 patterns and 92 lexicons (1787 terms). The parser is rule-based to enable precise extraction of the entities from the text. The entities themselves are encompassed in the EHRs as very diverse expressions of the diagnostic criteria written by different people at different times (clinicians, speech pathologists, among others). Due to the sparsity of the data, a rule-based approach is best suited until larger datasets can be generated for machine learning algorithms. RESULTS We evaluated our rule-based parser and compared it with a machine learning baseline (decision tree). Using a test set of 6636 sentences (50 EHRs), we found that our parser achieved 76% precision, 43% recall (ie, sensitivity), and >99% specificity for criterion extraction. The performance was better for the rule-based approach than for the machine learning baseline (60% precision and 30% recall). For some individual criteria, precision was as high as 97% and recall 57%. Since precision was very high, we were assured that criteria were rarely assigned incorrectly, and our numbers presented a lower bound of their presence in EHRs. We then conducted a case study and parsed 4480 new EHRs covering 10 years of surveillance records from the Arizona Developmental Disabilities Surveillance Program. The social criteria (A1 criteria) showed the biggest change over the years. The communication criteria (A2 criteria) did not distinguish the ASD from the non-ASD records. Among behaviors and interests criteria (A3 criteria), 1 (A3b) was present with much greater frequency in the ASD than in the non-ASD EHRs. CONCLUSIONS Our results demonstrate that NLP can support large-scale analysis useful for ASD surveillance and research. In the future, we intend to facilitate detailed analysis and integration of national datasets.


BMJ Open ◽  
2020 ◽  
Vol 10 (11) ◽  
pp. e043487
Author(s):  
Hao Luo ◽  
Kui Kai Lau ◽  
Gloria H Y Wong ◽  
Wai-Chi Chan ◽  
Henry K F Mak ◽  
...  

IntroductionDementia is a group of disabling disorders that can be devastating for persons living with it and for their families. Data-informed decision-making strategies to identify individuals at high risk of dementia are essential to facilitate large-scale prevention and early intervention. This population-based case–control study aims to develop and validate a clinical algorithm for predicting dementia diagnosis, based on the cognitive footprint in personal and medical history.Methods and analysisWe will use territory-wide electronic health records from the Clinical Data Analysis and Reporting System (CDARS) in Hong Kong between 1 January 2001 and 31 December 2018. All individuals who were at least 65 years old by the end of 2018 will be identified from CDARS. A random sample of control individuals who did not receive any diagnosis of dementia will be matched with those who did receive such a diagnosis by age, gender and index date with 1:1 ratio. Exposure to potential protective/risk factors will be included in both conventional logistic regression and machine-learning models. Established risk factors of interest will include diabetes mellitus, midlife hypertension, midlife obesity, depression, head injuries and low education. Exploratory risk factors will include vascular disease, infectious disease and medication. The prediction accuracy of several state-of-the-art machine-learning algorithms will be compared.Ethics and disseminationThis study was approved by Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (UW 18-225). Patients’ records are anonymised to protect privacy. Study results will be disseminated through peer-reviewed publications. Codes of the resulted dementia risk prediction algorithm will be made publicly available at the website of the Tools to Inform Policy: Chinese Communities’ Action in Response to Dementia project (https://www.tip-card.hku.hk/).


Sign in / Sign up

Export Citation Format

Share Document