Disease Concept-Embedding Based on the Self-Supervised Method for Medical Information Extraction from Electronic Health Records and Disease Retrieval: Algorithm Development and Validation Study

10.2196/25113 ◽  
2021 ◽  
Vol 23 (1) ◽  
pp. e25113
Author(s):  
Yen-Pin Chen ◽  
Yuan-Hsun Lo ◽  
Feipei Lai ◽  
Chien-Hua Huang

Background The electronic health record (EHR) contains a wealth of medical information. An organized EHR can greatly help doctors treat patients. In some cases, only limited patient information is collected to help doctors make treatment decisions. Because EHRs can serve as a reference for this limited information, doctors’ treatment capabilities can be enhanced. Natural language processing and deep learning methods can help organize and translate EHR information into medical knowledge and experience. Objective In this study, we aimed to create a model to extract concept embeddings from EHRs for disease pattern retrieval and further classification tasks. Methods We collected 1,040,989 emergency department visits from the National Taiwan University Hospital Integrated Medical Database and 305,897 samples from the National Hospital and Ambulatory Medical Care Survey Emergency Department data. After data cleansing and preprocessing, the data sets were divided into training, validation, and test sets. We proposed a Transformer-based model to embed EHRs and used Bidirectional Encoder Representations from Transformers (BERT) to extract features from free text, concatenating these features with structured data as input to our proposed model. Then, Deep InfoMax (DIM) and Simple Contrastive Learning of Visual Representations (SimCLR) were used for the unsupervised embedding of the disease concept. The pretrained disease concept-embedding model, named EDisease, was further finetuned to adapt to the critical care outcome prediction task. We evaluated the embeddings by visualizing them after dimension reduction with t-distributed stochastic neighbor embedding (t-SNE). The performance of the finetuned predictive model was evaluated against published models using the area under the receiver operating characteristic curve (AUROC). Results Our model achieved the highest AUROC on outcome prediction, 0.876. In the ablation study, using a smaller data set or fewer unsupervised methods for pretraining degraded prediction performance. The AUROCs were 0.857, 0.870, and 0.868 for the model without pretraining, the model pretrained by only SimCLR, and the model pretrained by only DIM, respectively. On the smaller finetuning set, the AUROC was 0.815 for the proposed model. Conclusions Through contrastive learning methods, disease concepts can be embedded meaningfully. Moreover, these methods can be used for disease retrieval tasks to enhance clinical practice capabilities. The disease concept model is also suitable as a pretrained model for subsequent prediction tasks.


2022 ◽  
Vol 112 (1) ◽  
pp. 98-106
Author(s):  
Lara Schwarz ◽  
Edward M. Castillo ◽  
Theodore C. Chan ◽  
Jesse J. Brennan ◽  
Emily S. Sbiroli ◽  
...  

Objectives. To determine the effect of heat waves on emergency department (ED) visits for individuals experiencing homelessness and explore vulnerability factors. Methods. We used a unique highly detailed data set on sociodemographics of ED visits in San Diego, California, 2012 to 2019. We applied a time-stratified case–crossover design to study the association between various heat wave definitions and ED visits. We compared associations with a similar population not experiencing homelessness using coarsened exact matching. Results. Of the 24 688 individuals identified as experiencing homelessness who visited an ED, most were younger than 65 years (94%) and of non-Hispanic ethnicity (84%), and 14% indicated the need for a psychiatric consultation. Results indicated a positive association, with the strongest risk of ED visits during daytime (e.g., 99th percentile, 2 days) heat waves (odds ratio = 1.29; 95% confidence interval = 1.02, 1.64). Patients experiencing homelessness who were younger or elderly and who required a psychiatric consultation were particularly vulnerable to heat waves. Odds of ED visits were higher for individuals experiencing homelessness after matching to nonhomeless individuals based on age, gender, and race/ethnicity. Conclusions. It is important to prioritize individuals experiencing homelessness in heat action plans and consider vulnerability factors to reduce their burden. (Am J Public Health. 2022;112(1):98–106. https://doi.org/10.2105/AJPH.2021.306557 )
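
The reported odds ratio and confidence interval can be sanity-checked with a short worked example. It assumes the interval is a Wald-type interval, symmetric on the log scale, which the abstract does not state; the helper name `log_scale_summary` is hypothetical.

```python
import math

def log_scale_summary(lower, upper, z=1.96):
    """Return (implied point estimate, implied SE of log OR) from CI bounds.

    Assumes a Wald-type interval: log(OR) +/- z * SE.
    """
    point = math.sqrt(lower * upper)  # geometric mean of the bounds
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return point, se

# Bounds reported in the abstract: 95% CI = 1.02, 1.64.
point, se = log_scale_summary(1.02, 1.64)
print(f"implied OR = {point:.2f}, implied SE(log OR) = {se:.3f}")
```

Under that assumption the geometric mean of the bounds recovers the reported point estimate of 1.29.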


2021 ◽  
pp. 1106-1126
Author(s):  
Dylan J. Peterson ◽  
Nicolai P. Ostberg ◽  
Douglas W. Blayney ◽  
James D. Brooks ◽  
Tina Hernandez-Boussard

PURPOSE Acute care use (ACU) is a major driver of oncologic costs and is penalized by a Centers for Medicare & Medicaid Services quality measure, OP-35. Targeted interventions reduce preventable ACU; however, identifying which patients might benefit remains challenging. Prior predictive models have made use of a limited subset of the data in the electronic health record (EHR). We aimed to predict risk of preventable ACU after starting chemotherapy using machine learning (ML) algorithms trained on comprehensive EHR data. METHODS Chemotherapy patients treated at an academic institution and affiliated community care sites between January 2013 and July 2019 who met inclusion criteria for OP-35 were identified. Preventable ACU was defined using OP-35 criteria. Structured EHR data generated before chemotherapy treatment were obtained. ML models were trained to predict risk for ACU after starting chemotherapy using 80% of the cohort. The remaining 20% were used to test model performance by the area under the receiver operator curve. RESULTS Eight thousand four hundred thirty-nine patients were included, of whom 35% had preventable ACU within 180 days of starting chemotherapy. Our primary model classified patients at risk for preventable ACU with an area under the receiver operator curve of 0.783 (95% CI, 0.761 to 0.806). Performance was better for identifying admissions than emergency department visits. Key variables included prior hospitalizations, cancer stage, race, laboratory values, and a diagnosis of depression. Analyses showed limited benefit from including patient-reported outcome data and indicated inequities in outcomes and risk modeling for Black and Medicaid patients. CONCLUSION Dense EHR data can identify patients at risk for ACU using ML with promising accuracy. These models have potential to improve cancer care outcomes, patient experience, and costs by allowing for targeted, preventative interventions.
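
The evaluation design described above (train on 80% of the cohort, score the held-out 20% by area under the receiver operating characteristic curve) can be sketched as follows. This is a generic illustration, not the study's pipeline: the function names, the seed, and the toy labels and scores are assumptions.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) with an 80/20 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in cohort size, which is fine
    for a sketch but not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out labels (1 = preventable ACU) and model risk scores.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auroc(labels, scores))  # 0.75: three of the four positive/negative pairs are ranked correctly
```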


2019 ◽  
Author(s):  
Timothy Bergquist ◽  
Vikas Pejaver ◽  
Noah Hammarlund ◽  
Sean D. Mooney ◽  
Stephen J. Mooney

Abstract Background The increasing adoption of electronic health record (EHR) systems enables automated, large-scale, and meaningful analysis of regional population health. We explored how EHR systems could inform surveillance of trauma-related emergency department visits arising from seasonal, holiday-related, and rare environmental events. Methods We analyzed temporal variation in diagnosis codes over 24 years of trauma visit data at the three hospitals in the University of Washington Medicine system in Seattle, Washington, USA. We identified seasons and days in which specific codes and categories of codes were statistically enriched, meaning that a significantly greater than average proportion of trauma visits included a given diagnosis code during that time period. Results We confirmed known seasonal patterns in emergency department visits for trauma. As expected, cold weather-related incidents (e.g., frostbite, snowboarding injury) were enriched in the winter, whereas fair weather-related incidents (e.g., bug bites, boating accidents, bicycle accidents) were enriched in the spring and summer. Our analysis of specific days of the year found that holidays were enriched for alcohol poisoning, assaults, and firework accidents. We also detected one-time regional events such as the 2001 Nisqually earthquake and the 2006 Hanukkah Eve Windstorm. Conclusions Though EHR systems were developed to serve operational rather than analytic priorities and have consequent limitations for surveillance, our EHR enrichment analysis nonetheless re-identified expected temporal population health patterns. EHRs are potentially a valuable source of information to inform public health policy, both in retrospective analysis and in a surveillance capacity.
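
One way to make "statistically enriched" concrete is a one-sided exact binomial test of a day's proportion against the overall proportion. The abstract does not name the test the authors used, so this is only a sketch under that assumption; the function name and all counts below are hypothetical.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers: a diagnosis code appears in 1% of all trauma visits,
# but in 8 of the 200 visits recorded on one holiday.
p_value = binom_upper_tail(8, 200, 0.01)
print(f"P(X >= 8) = {p_value:.4f}")
```

Because such a test would be repeated across many codes and many days, a real analysis would also need a multiple-comparison correction.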


2021 ◽  
Author(s):  
Kamel Alachraf ◽  
Caroline Currie ◽  
William Wooten ◽  
Dmitry Tumin

Abstract Social determinants of health (SDH) influence emergency department (ED) use among children with asthma. We aimed to examine whether SDH were more strongly associated with ED use among children with moderate/severe asthma than among those with mild asthma. This study utilized the 2016-2019 data from the National Survey of Children’s Health. Children with asthma aged 0-17 years (N=9,937) were included in the analysis. Asthma severity and all-cause ED use in the past year were reported by caregivers. The association between patient factors and ED visits was evaluated using ordinal logistic regression. Based on the study sample, 29% of children with asthma had moderate/severe asthma. In the mild group, 30% visited the ED at least once in the past 12 months, compared to 49% in the moderate/severe group. SDH associated with ED visits included race/ethnicity, insurance coverage, and parental educational attainment, but the strength of these associations did not vary according to asthma severity. In a nationally representative data set, SDH were equally predictive of ED use regardless of children’s asthma severity. Interventions to reduce ED use among children with asthma should be considered for children with any severity of asthma, especially children in socially disadvantaged groups at higher risk of ED utilization.


2017 ◽  
Vol 132 (4) ◽  
pp. 471-479 ◽  
Author(s):  
Kathryn DeYoung ◽  
Yushiuan Chen ◽  
Robert Beum ◽  
Michele Askenazi ◽  
Cali Zimmerman ◽  
...  

Objectives: Reliable methods are needed to monitor the public health impact of changing laws and perceptions about marijuana. Structured and free-text emergency department (ED) visit data offer an opportunity to monitor the impact of these changes in near-real time. Our objectives were to (1) generate and validate a syndromic case definition for ED visits potentially related to marijuana and (2) describe a method for doing so that was less resource intensive than traditional methods. Methods: We developed a syndromic case definition for ED visits potentially related to marijuana, applied it to BioSense 2.0 data from 15 hospitals in the Denver, Colorado, metropolitan area for the period September through October 2015, and manually reviewed each case to determine true positives and false positives. We used the number of visits identified by and the positive predictive value (PPV) for each search term and field to refine the definition for the second round of validation on data from February through March 2016. Results: Of 126 646 ED visits during the first period, terms in 524 ED visit records matched ≥1 search term in the initial case definition (PPV, 92.7%). Of 140 932 ED visits during the second period, terms in 698 ED visit records matched ≥1 search term in the revised case definition (PPV, 95.7%). After another revision, the final case definition contained 6 keywords for marijuana or derivatives and 5 diagnosis codes for cannabis use, abuse, dependence, poisoning, and lung disease. Conclusions: Our syndromic case definition and validation method for ED visits potentially related to marijuana could be used by other public health jurisdictions to monitor local trends and for other emerging concerns.
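
The positive predictive values above are the share of matched records confirmed as true positives on manual review. The abstract reports only the percentages, so the true-positive counts in this sketch are back-calculated, assuming the denominators are the 524 and 698 matched records; the function name is hypothetical.

```python
def counts_matching_ppv(matched, reported_pct):
    """List true-positive counts whose PPV rounds to the reported percentage.

    matched is the number of records flagged by the case definition;
    reported_pct is the published PPV as a one-decimal string.
    """
    return [tp for tp in range(matched + 1)
            if f"{100 * tp / matched:.1f}" == reported_pct]

print(counts_matching_ppv(524, "92.7"))  # first validation round
print(counts_matching_ppv(698, "95.7"))  # second validation round
```

Under that assumption each reported PPV is consistent with exactly one integer count: 486 true positives in the first round and 668 in the second.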


10.2196/23930 ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. e23930
Author(s):  
Tjardo D Maarseveen ◽  
Timo Meinderink ◽  
Marcel J T Reinders ◽  
Johannes Knitza ◽  
Tom W J Huizinga ◽  
...  

Background Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries. Objective The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records. Methods Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation. Results For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set, and applied to the test data, resulted once again in good results (F1 score 0.67; PPV 0.97). Conclusions We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems.
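
The test-set results report F1 and PPV (precision) but not recall. Because F1 is the harmonic mean of precision and recall, the recall implied by each pair can be recovered, as in this worked sketch. The inputs are the rounded figures from the abstract, so the outputs are approximate, and the function name is hypothetical.

```python
def implied_recall(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this F1/precision pair")
    return f1 * precision / denom

# Rounded test-set values taken from the abstract.
print(f"Leiden:   recall ~ {implied_recall(0.81, 0.94):.2f}")
print(f"Erlangen: recall ~ {implied_recall(0.67, 0.97):.2f}")
```

The implied recall is roughly 0.71 for Leiden and 0.51 for Erlangen, which shows that the chosen thresholds favor precision over sensitivity.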


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S688-S688
Author(s):  
Ian Breunig ◽  
Qing Zheng ◽  
Alan White ◽  
Christianna Williams ◽  
Allison Muma

Abstract CMS strives to reduce costs and improve care for nursing home (NH) residents by reducing acute care transfers. We used a national database of Medicare claims and the Minimum Data Set to build NH stays from July 2017 through June 2018 and identify dates of hospital admissions and emergency department visits without hospitalization (ED) among all residents. We calculated rates of 30-day re-hospitalization and ED among short-stay (rehabilitation) residents, and the number of hospitalizations or ED per long-stay resident day (LSRD), then examined associations with NH Five-Star ratings (data.medicare.gov) and other provider characteristics available from Medicare administrative data. We identified 1.79 million short-stays and 898,290 long-stays at 15,576 NHs. Nationally, the 30-day re-hospitalization rate was 22.6%, the short-stay ED rate was 12.0%, there was one hospitalization every 561 LSRD (1.8 per 1000 LSRD), and there was one ED every 617 LSRD (1.6 per 1000 LSRD). Median facility rates were 22.3% (IQR=17.8%, 27.1%) for 30-day re-hospitalizations, 12.0% (IQR=8.7%, 16.1%) for short-stay EDs, 1.6 hospitalizations per 1000 LSRD (IQR=1.1, 2.3), and 1.4 ED per 1000 LSRD (IQR=0.9, 2.2). Higher rates were strongly associated with lower Five-Star ratings, particularly staffing ratings, and with larger, for-profit, non-hospital facilities, even after risk adjustment. NH variation and associations with provider characteristics suggest it is possible to further reduce acute care transfers. CMS incorporated these measures into the Five-Star rating system, providing greater transparency for residents and possibly incentivizing NHs to improve through competition. Future research should monitor success or identify the need for other avenues to improve.
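
The two ways the long-stay rates are quoted ("one event every N LSRD" and "events per 1000 LSRD") are reciprocals of each other, which a short worked example confirms. The function name is hypothetical; the inputs are the figures from the abstract.

```python
def per_1000(days_per_event):
    """Convert 'one event every N resident days' into events per 1000 resident days."""
    return 1000 / days_per_event

print(f"hospitalizations: {per_1000(561):.1f} per 1000 LSRD")
print(f"ED visits:        {per_1000(617):.1f} per 1000 LSRD")
```

Both conversions reproduce the published values of 1.8 and 1.6 per 1000 LSRD.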

