Representation on feature and patient levels from structured Electronic Medical Records based on Skip-gram algorithm (Preprint)
BACKGROUND The secondary use of structured electronic medical record (sEMR) data is challenging because of the diversity, sparsity, and high dimensionality of the data representation. OBJECTIVE We aimed to explore the feasibility of embedding-based feature and patient representations for sEMR data and to demonstrate the efficiency and superiority of the embedding-based patient representation. METHODS The training corpus consisted of the records of 104,752 hospitalized patients with 21 variables, including demographic characteristics, disease diagnoses, procedures, medications, laboratory tests, and other hospitalization indicators. Discrete values of the original categorical variables and of the binned continuous variables were treated as words (concepts), so that each record became a sentence in a text. To eliminate the influence of concept order on the embedding algorithm, we randomly shuffled the concepts within each sentence 20 times. For each patient record, every feature concept was embedded into a 200-dimensional real-valued vector using the Skip-gram algorithm, and the average of all the concept embedding vectors in the record represented the patient. To assess the effectiveness of the embedding-based feature representations, we used the cosine distances among the features’ embedding vectors to capture latent relationships among the concepts of different features. We further conducted cluster analysis on stroke patients to evaluate and compare the efficiency and superiority of the embedding-based patient representation, with the embedding vectors trained on either all patients or only the stroke patients, each with and without concept shuffling. Multi-hot codes and one-hot codes plus the original continuous values served as the benchmark representations.
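The preprocessing and pooling steps described in the methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record fields, bin edges, `feature=value` token format, and the tiny 3-dimensional embedding table are hypothetical stand-ins, and "shuffled 20 times" is read here as producing 20 shuffled copies of each sentence. Skip-gram training itself is omitted; a library such as gensim (for example `Word2Vec` with `sg=1` and `vector_size=200`) would be one way to train on the shuffled corpus.

```python
import math
import random


def record_to_concepts(record, bin_edges):
    """Turn one record into concept tokens (hypothetical 'feature=value' format).

    Continuous variables listed in bin_edges are replaced by a bin index;
    all other variables are treated as categorical.
    """
    concepts = []
    for feature, value in record.items():
        if feature in bin_edges:
            # Bin index = number of edges the value meets or exceeds.
            k = sum(value >= edge for edge in bin_edges[feature])
            concepts.append(f"{feature}=bin{k}")
        else:
            concepts.append(f"{feature}={value}")
    return concepts


def shuffled_sentences(concepts, n_shuffles=20, seed=0):
    """Return n_shuffles randomly reordered copies of one sentence."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_shuffles):
        sentence = list(concepts)
        rng.shuffle(sentence)
        sentences.append(sentence)
    return sentences


def patient_vector(concepts, embeddings):
    """Average the embedding vectors of all concepts found in the table."""
    vectors = [embeddings[c] for c in concepts if c in embeddings]
    if not vectors:
        raise ValueError("no concept of this record has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Hypothetical record and bin edges, for illustration only.
record = {"sex": "F", "diagnosis": "I63", "age": 67}
bin_edges = {"age": [40, 60, 80]}
concepts = record_to_concepts(record, bin_edges)
sentences = shuffled_sentences(concepts)

# Toy embedding table standing in for trained 200-dimensional vectors.
toy_embeddings = {
    "sex=F": [1.0, 0.0, 0.0],
    "diagnosis=I63": [0.0, 1.0, 0.0],
    "age=bin2": [0.0, 0.0, 1.0],
}
vec = patient_vector(concepts, toy_embeddings)
dist = cosine_distance(toy_embeddings["sex=F"], toy_embeddings["age=bin2"])
```

Because the patient vector is a plain average, it does not depend on concept order; the shuffling only affects which concepts fall inside the same context window during Skip-gram training.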
RESULTS According to the Silhouette index, stroke patients were clustered into two groups, characterized by patients with a primary diagnosis of hemorrhagic stroke (HS) and ischemic stroke (IS), respectively. Cluster analyses based on the embedding representations showed higher applicability (Hopkins statistic, 0.925), higher aggregation (Silhouette index, 0.862), and lower dispersion (Davies-Bouldin index, 0.551) than those based on the benchmark representations. The two clusters obtained with the embedding-based representation learned from all records after concept shuffling achieved the highest F1-scores, 0.944 for IS and 0.717 for HS. CONCLUSIONS The feature-level embeddings can reflect the potential associations among medical concepts to some degree. The patient-level embeddings can readily serve as continuous input to standard machine learning algorithms and improve their performance. We expect the embedding-based representation to be helpful in a wide range of secondary uses of sEMR data.
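Two of the evaluation measures reported in the results, the Silhouette index and the per-class F1-score, can be illustrated with a small sketch. The abstract does not state the distance metric or how clusters were matched to diagnoses, so the Euclidean distance and the explicit cluster-to-diagnosis pairing used here are assumptions, and the toy points and labels are invented for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # By convention, a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def f1_for_class(diagnoses, assignments, diagnosis, cluster):
    """F1-score when `cluster` is taken as the prediction of `diagnosis`."""
    pairs = list(zip(diagnoses, assignments))
    tp = sum(d == diagnosis and c == cluster for d, c in pairs)
    fp = sum(d != diagnosis and c == cluster for d, c in pairs)
    fn = sum(d == diagnosis and c != cluster for d, c in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy 2-D patient vectors forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
sil = silhouette_index(points, labels)

# Toy diagnoses against cluster assignments; cluster 0 is paired with IS
# and cluster 1 with HS (an assumed pairing).
diagnoses = ["IS", "IS", "HS", "HS"]
assignments = [0, 0, 1, 0]
f1_is = f1_for_class(diagnoses, assignments, "IS", 0)
f1_hs = f1_for_class(diagnoses, assignments, "HS", 1)
```

A Silhouette index close to 1 indicates compact, well-separated clusters, which is the sense in which the results describe higher aggregation for the embedding-based representation.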