Phe2vec: Automated Disease Phenotyping based on Unsupervised Embeddings from Electronic Health Records

Author(s):  
Jessica K. De Freitas ◽  
Kipp W. Johnson ◽  
Eddye Golden ◽  
Girish N. Nadkarni ◽  
Joel T. Dudley ◽  
...  

Abstract
Objective: We introduce Phe2vec, an automated framework for disease phenotyping from electronic health records (EHRs) based on unsupervised learning. We assess its effectiveness against standard rule-based algorithms from the Phenotype KnowledgeBase (PheKB).
Materials and Methods: Phe2vec is based on pre-computing embeddings of medical concepts and of patients’ longitudinal clinical history. Disease phenotypes are then derived from a seed concept and its neighbors in the embedding space. Patients are similarly linked to a disease if their embedded representation is close to the phenotype. We evaluated Phe2vec using 49,234 medical concepts from structured EHRs and clinical notes from 1,908,741 patients in the Mount Sinai Health System. We assessed performance on ten diverse diseases that have a PheKB algorithm, and on one disease that does not, namely Lyme disease.
Results: Phe2vec phenotypes derived using Word2vec, GloVe, and Fasttext embeddings led to promising performance in disease definition and patient cohort identification as compared with standard PheKB definitions. In a head-to-head comparison of Phe2vec and PheKB patient cohorts using chart review, Phe2vec performed on par or better in nine out of ten diseases in terms of positive predictive value. Additionally, Phe2vec effectively identified a phenotype definition and a patient cohort for Lyme disease, a condition not covered in PheKB.
Discussion: Phe2vec offers a solution to improve time-consuming phenotyping pipelines. Unlike other automated approaches in the literature, it is fully unsupervised, can easily scale to any disease, and was validated against widely adopted expert-based standards.
Conclusion: Phe2vec aims to optimize clinical informatics research by augmenting current frameworks to characterize patients by condition and derive reliable disease cohorts.
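The two steps described in the abstract (building a phenotype from a seed concept's neighbors, then linking patients whose representation is close to it) can be illustrated with a minimal sketch. This is not the authors' implementation: the concept names, the 3-dimensional vectors, the use of cosine similarity, and the choice to represent a patient or a phenotype as the mean of concept vectors are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy embeddings; real embeddings would be learned from EHR data.
concepts = ["lyme_disease", "doxycycline", "tick_bite", "type_2_diabetes", "metformin"]
emb = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.8, 0.0, 0.2],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
])


def cosine_sim(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    vector_unit = vector / np.linalg.norm(vector)
    return matrix_unit @ vector_unit


def build_phenotype(seed, k=3):
    """Return the seed concept and its nearest neighbors (k concepts in total)."""
    sims = cosine_sim(emb, emb[concepts.index(seed)])
    top = np.argsort(-sims)[:k]
    return [concepts[i] for i in top]


def mean_vector(names):
    """Assumed aggregation: average the vectors of the listed concepts."""
    return emb[[concepts.index(n) for n in names]].mean(axis=0)


def patient_score(history, phenotype):
    """Similarity between a patient's history and a phenotype definition."""
    patient_vec = mean_vector(history)
    return float(cosine_sim(patient_vec[None, :], mean_vector(phenotype))[0])


phenotype = build_phenotype("lyme_disease", k=3)
score_a = patient_score(["tick_bite", "doxycycline"], phenotype)
score_b = patient_score(["type_2_diabetes", "metformin"], phenotype)
print(phenotype, round(score_a, 3), round(score_b, 3))
```

In this toy setup, a cohort would be the set of patients whose score exceeds some threshold; how that threshold is chosen is not stated in the abstract.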

2019 ◽  
Vol 10 (2) ◽  
pp. 241-250 ◽  
Author(s):  
Katherine A. Moon ◽  
Jonathan Pollak ◽  
Annemarie G. Hirsch ◽  
John N. Aucott ◽  
Cara Nordberg ◽  
...  

2019 ◽  
Vol 35 (21) ◽  
pp. 4515-4518 ◽  
Author(s):  
Benjamin S Glicksberg ◽  
Boris Oskotsky ◽  
Phyllis M Thangaraj ◽  
Nicholas Giangreco ◽  
Marcus A Badgeley ◽  
...  

Abstract
Motivation: Electronic health records (EHRs) are quickly becoming omnipresent in healthcare, but interoperability issues and technical demands limit their use for biomedical and clinical research. Interactive and flexible software that interfaces directly with EHR data structured around a common data model (CDM) could accelerate EHR-based research by making the data more accessible to researchers who lack computational expertise and/or domain knowledge.
Results: We present PatientExploreR, an extensible application built on the R/Shiny framework that interfaces with a relational database of EHR data in the Observational Medical Outcomes Partnership CDM format. PatientExploreR produces patient-level interactive and dynamic reports and facilitates visualization of clinical data without any programming required. It allows researchers to easily construct and export patient cohorts from the EHR for analysis with other software. This application could enable easier exploration of patient-level data for physicians and researchers. PatientExploreR can incorporate EHR data from any institution that employs the CDM for users with approved access. The software code is free and open source under the MIT license, enabling institutions to install and users to expand and modify the application for their own purposes.
Availability and implementation: PatientExploreR can be freely obtained from GitHub: https://github.com/BenGlicksberg/PatientExploreR. We provide instructions for how researchers with approved access to their institutional EHR can use this package. We also release an open sandbox server of synthesized patient data for users without EHR access to explore: http://patientexplorer.ucsf.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
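To make the idea of constructing a cohort from a CDM-formatted relational database concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not PatientExploreR code (the application itself is written in R/Shiny). The table and column names follow the OMOP `condition_occurrence` table, but the rows and the concept IDs 1001 and 1002 are made-up placeholders, not real OMOP concept identifiers.

```python
import sqlite3

# In-memory stand-in for an OMOP-formatted database (synthetic rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE condition_occurrence ("
    "person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT)"
)
rows = [
    (1, 1001, "2018-03-02"),  # 1001: placeholder concept ID (hypothetical)
    (1, 1002, "2018-05-10"),  # 1002: placeholder concept ID (hypothetical)
    (2, 1002, "2017-11-21"),
    (3, 1001, "2019-01-15"),
    (3, 1001, "2019-06-30"),  # repeated diagnosis for the same person
]
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?)", rows)


def build_cohort(connection, concept_id):
    """Return the distinct person_ids with at least one matching condition row."""
    cursor = connection.execute(
        "SELECT DISTINCT person_id FROM condition_occurrence "
        "WHERE condition_concept_id = ? ORDER BY person_id",
        (concept_id,),
    )
    return [row[0] for row in cursor.fetchall()]


cohort = build_cohort(conn, 1001)
print(cohort)
```

Because every institution using the CDM exposes the same table layout, a query of this shape can in principle be reused across sites, which is the portability argument the abstract makes.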


2020 ◽  
Author(s):  
Ignacio Hernández-Medrano ◽  
Marisa Serrano ◽  
Sergio Collazo ◽  
Ana López-Ballesteros ◽  
Blai Coll ◽  
...  

BACKGROUND Research to develop strategies that effectively identify patients and reduce the burden of cardiovascular disease is essential for the future of the health system. Most research studies have used only the coded parts of electronic health records (EHRs) for case detection, which misses cases and reduces study quality. Incorporating information from free text into case detection through Natural Language Processing (NLP) techniques improves research quality. SAVANA was created as an innovative data-driven system based on NLP and big data techniques, designed to retrieve prominent biomedical information from narrative clinical notes and to make the most of the large amount of information contained in Spanish EHRs. OBJECTIVE The aim of this work is to assess the performance of SAVANA when identifying concepts within the cardiovascular domain in Spanish EHRs. METHODS SAVANA is a platform for accelerating clinical research, based on real-time dynamic exploitation of all the information contained in EHR corpora. It uses its own technology (EHRead) to allow unstructured information contained in EHRs to be analysed and expressed by means of medical concepts that contain the most significant information in the text. RESULTS The evaluation corpus consisted of a stratified random sample of patients from 3 Spanish sites. For site 01, the corpus contained a total of 280 mentions of cardiovascular clinical entities, of which 249 were correctly identified, obtaining P=0.93. In site 02, SAVANA correctly detected 53 mentions of cardiovascular entities among 57 annotations, achieving P=0.98; and in site 03, among 165 manual annotations, 75 were correctly identified, yielding P=0.99. CONCLUSIONS This research demonstrates the ability of SAVANA to identify mentions of the atherosclerotic/cardiovascular clinical phenotype in Spanish EHRs, as well as to retrieve patients and records related to this pathology.
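The abstract does not define P; assuming it denotes precision, the short sketch below shows the standard definition alongside the ratios that can be computed directly from the reported counts. Precision needs the number of false positives, which the abstract does not give, so the reported P values cannot be reproduced from these counts alone. The counts passed to `precision` at the end are hypothetical.

```python
def precision(true_positives, false_positives):
    """Standard precision: correct detections over all detections made."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


# (correctly identified, annotated mentions) as reported in the abstract.
reported = {"site 01": (249, 280), "site 02": (53, 57), "site 03": (75, 165)}

# Share of annotated mentions that were correctly identified, per site.
identified_ratio = {
    site: round(correct / annotated, 3)
    for site, (correct, annotated) in reported.items()
}
print(identified_ratio)

# Hypothetical counts, only to show how a precision value arises.
example_precision = precision(true_positives=90, false_positives=10)
print(example_precision)
```

The per-site ratios differ from the reported P values because they divide by the number of annotated mentions rather than by the number of mentions the system returned.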


2021 ◽  
Author(s):  
Kyunghoon Hur ◽  
Jiyoung Lee ◽  
Jungwoo Oh ◽  
Wesley Price ◽  
Young-Hak Kim ◽  
...  

BACKGROUND The substantial increase in the use of Electronic Health Records (EHRs) has opened new frontiers for predictive healthcare. However, while EHR systems are nearly ubiquitous, they lack a unified code system for representing medical concepts. The heterogeneous formats of EHRs present a substantial barrier to the training and deployment of state-of-the-art deep learning models at scale. OBJECTIVE The aim of this study is to propose a novel text embedding approach to overcome the heterogeneity of EHR structure among different EHR systems. METHODS We introduce Description-based Embedding, DescEmb, a code-agnostic description-based representation learning framework for predictive modeling on EHRs. DescEmb takes advantage of the flexibility of neural language understanding models while maintaining a neutral approach that can be combined with prior frameworks for task-specific representation learning or predictive modeling. RESULTS Based on five prediction tasks with two heterogeneous EHR datasets, DescEmb achieves comparable or superior performance to the traditional code-based embedding approach, especially under the zero-shot and few-shot transfer learning scenarios. We also demonstrate that DescEmb enables us to train a single model on a pooled dataset from heterogeneous EHR systems and achieve the same, if not better, performance compared to training separate models for each EHR system. CONCLUSIONS Based on the promising results, we believe the description-based embedding approach on EHRs will open a new direction for large-scale predictive modeling in healthcare.
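The contrast between code-based and description-based representations can be illustrated with a toy sketch. This is not DescEmb: a normalized bag-of-words vector stands in for the neural text encoder, and the codes and descriptions for the two "systems" are invented for illustration.

```python
import numpy as np

# Hypothetical code tables for two EHR systems with incompatible code sets.
system_a = {"A123": "type 2 diabetes mellitus"}
system_b = {
    "X-77": "diabetes mellitus type 2 without complication",
    "X-90": "acute myocardial infarction",
}


def build_vocab(descriptions):
    """Map every word seen in the descriptions to a vector index."""
    words = sorted({w for d in descriptions for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}


def embed_description(text, vocab):
    """Stand-in text encoder: unit-length bag-of-words vector."""
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


vocab = build_vocab(list(system_a.values()) + list(system_b.values()))

# A code-based lookup has nothing in common across the two systems.
shared_codes = set(system_a) & set(system_b)

# Description-based vectors live in one space regardless of the code system.
vec_a = embed_description(system_a["A123"], vocab)
sim_same = float(vec_a @ embed_description(system_b["X-77"], vocab))
sim_diff = float(vec_a @ embed_description(system_b["X-90"], vocab))
print(shared_codes, round(sim_same, 3), round(sim_diff, 3))
```

Because the vectors depend only on the description text, a model trained on one system's records can be applied to another system's codes without a mapping table, which is the property the abstract relies on for pooled and transfer-learning settings.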

