Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding: Study of Predictive Modeling (Preprint)
BACKGROUND: The substantial increase in the use of electronic health records (EHRs) has opened new frontiers for predictive healthcare. However, although EHR systems are nearly ubiquitous, they lack a unified code system for representing medical concepts. The heterogeneous formats of EHRs present a substantial barrier to training and deploying state-of-the-art deep learning models at scale.

OBJECTIVE: This study aims to propose a novel text embedding approach that overcomes the structural heterogeneity among different EHR systems.

METHODS: We introduce Description-based Embedding (DescEmb), a code-agnostic, description-based representation learning framework for predictive modeling on EHRs. DescEmb takes advantage of the flexibility of neural language understanding models while maintaining a neutral design that can be combined with prior frameworks for task-specific representation learning or predictive modeling.

RESULTS: On five prediction tasks across two heterogeneous EHR datasets, DescEmb achieves performance comparable or superior to that of the traditional code-based embedding approach, especially in zero-shot and few-shot transfer learning scenarios. We also demonstrate that DescEmb allows a single model to be trained on a pooled dataset from heterogeneous EHR systems and to achieve performance equal to, if not better than, that of separate models trained for each EHR system.

CONCLUSIONS: Based on these promising results, we believe the description-based embedding approach for EHRs will open a new direction for large-scale predictive modeling in healthcare.
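
The contrast between code-based and description-based embedding can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states that DescEmb relies on neural language understanding models, whereas the sketch below substitutes a toy hashed bag-of-words encoder so that it runs without any dependencies. The code tables, code identifiers, descriptions, and function names (`embed_description`, `embed_code`, `cosine`) are all hypothetical. The point it shows is only that two systems with disjoint code vocabularies can still share one embedding space once each code is replaced by its text description.

```python
import hashlib
import math

DIM = 64  # hypothetical embedding width chosen for this sketch

# Hypothetical code-to-description tables for two EHR systems.
# Codes and descriptions are invented for illustration only.
SYSTEM_A = {"A_LAB_001": "glucose blood test", "A_MED_017": "insulin injection"}
SYSTEM_B = {"B-7731": "glucose blood test", "B-0042": "heparin infusion"}


def _bucket(token):
    """Deterministically map a token to one of DIM buckets."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIM


def embed_description(text):
    """Toy stand-in for a neural text encoder: L2-normalized hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[_bucket(token)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_code(system, code):
    """Description-based embedding: look up the code's text, then encode the text."""
    return embed_description(system[code])


def cosine(u, v):
    """Cosine similarity for vectors that are already L2-normalized."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    # A code-based embedding table would need one row per code, and the two
    # systems share no codes, so nothing learned on A transfers directly to B.
    print("shared codes:", set(SYSTEM_A) & set(SYSTEM_B))

    # With description-based embedding, different codes that carry the same
    # description are encoded through the same text, hence the same vector.
    a = embed_code(SYSTEM_A, "A_LAB_001")
    b = embed_code(SYSTEM_B, "B-7731")
    print("similarity of matching concepts:", cosine(a, b))
```

In this sketch the encoder is the only component shared across systems, which is what makes a pooled or transferred model conceivable; a learned text encoder would additionally place similar but non-identical descriptions near each other, which the hashing stand-in does not attempt.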