scholarly journals Med7: a transferable clinical natural language processing model for electronic health records

Author(s):  
Andrey Kormilitzin ◽  
Nemanja Vaci ◽  
Qiang Liu ◽  
Alejo Nevado-Holgado
2021 ◽  
Author(s):  
Ye Seul Bae ◽  
Kyung Hwan Kim ◽  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Taehoon Ko ◽  
...  

BACKGROUND Smoking is a major risk factor and important variable for clinical research, but there are few studies regarding automatic obtainment of smoking classification from unstructured bilingual electronic health records (EHR). OBJECTIVE We aim to develop an algorithm to classify smoking status based on unstructured EHRs using natural language processing (NLP). METHODS With acronym replacement and Python package Soynlp, we normalize 4,711 bilingual clinical notes. Each EHR notes was classified into 4 categories: current smokers, past smokers, never smokers, and unknown. Subsequently, SPPMI (Shifted Positive Point Mutual Information) is used to vectorize words in the notes. By calculating cosine similarity between these word vectors, keywords denoting the same smoking status are identified. RESULTS Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are used in classifying 4 smoking statuses from our bilingual clinical notes. Given an identical SVM classifier, the extracted keywords improve the F1 score by as much as 1.8% compared to those of the unigram and bigram Bag of Words. CONCLUSIONS Our study shows the potential of SPPMI in classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired and used for clinical practice and research.


2021 ◽  
Vol 11 (19) ◽  
pp. 8812
Author(s):  
Ye Seul Bae ◽  
Kyung Hwan Kim ◽  
Han Kyul Kim ◽  
Sae Won Choi ◽  
Taehoon Ko ◽  
...  

Smoking is an important variable for clinical research, but there are few studies regarding automatic obtainment of smoking classification from unstructured bilingual electronic health records (EHR). We aim to develop an algorithm to classify smoking status based on unstructured EHRs using natural language processing (NLP). With acronym replacement and Python package Soynlp, we normalize 4711 bilingual clinical notes. Each EHR notes was classified into 4 categories: current smokers, past smokers, never smokers, and unknown. Subsequently, SPPMI (Shifted Positive Point Mutual Information) is used to vectorize words in the notes. By calculating cosine similarity between these word vectors, keywords denoting the same smoking status are identified. Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. These extracted keywords are used in classifying 4 smoking statuses from our bilingual EHRs. Given an identical SVM classifier, the F1 score is improved by as much as 1.8% compared to those of the unigram and bigram Bag of Words. Our study shows the potential of SPPMI in classifying smoking status from bilingual, unstructured EHRs. Our current findings show how smoking information can be easily acquired for clinical practice and research.


2021 ◽  
Author(s):  
Marika Cusick ◽  
Sumithra Velupillai ◽  
Johnny Downs ◽  
Thomas Campion ◽  
Rina Dutta ◽  
...  

Abstract In the global effort to prevent death by suicide, many academic medical institutions are implementing natural language processing (NLP) approaches to detect suicidality from unstructured clinical text in electronic health records (EHRs), with the hope of targeting timely, preventative interventions to individuals most at risk of suicide. Despite the international need, the development of these NLP approaches in EHRs has been largely local and not shared across healthcare systems. In this study, we developed a process to share NLP approaches that were individually developed at King’s College London (KCL), UK and Weill Cornell Medicine (WCM), US - two academic medical centers based in different countries with vastly different healthcare systems. After a successful technical porting of the NLP approaches, our quantitative evaluation determined that independently developed NLP approaches can detect suicidality at another healthcare organization with a different EHR system, clinical documentation processes, and culture, yet do not achieve the same level of success as at the institution where the NLP algorithm was developed (KCL approach: F1-score 0.85 vs. 0.68, WCM approach: F1-score 0.87 vs. 0.72). Shared use of these NLP approaches is a critical step forward towards improving data-driven algorithms for early suicide risk identification and timely prevention.


10.2196/25157 ◽  
2022 ◽  
Vol 10 (1) ◽  
pp. e25157
Author(s):  
Zhen Yang ◽  
Chloé Pou-Prom ◽  
Ashley Jones ◽  
Michaelia Banning ◽  
David Dai ◽  
...  

Background The Expanded Disability Status Scale (EDSS) score is a widely used measure to monitor disability progression in people with multiple sclerosis (MS). However, extracting and deriving the EDSS score from unstructured electronic health records can be time-consuming. Objective We aimed to compare rule-based and deep learning natural language processing algorithms for detecting and predicting the total EDSS score and EDSS functional system subscores from the electronic health records of patients with MS. Methods We studied 17,452 electronic health records of 4906 MS patients followed at one of Canada’s largest MS clinics between June 2015 and July 2019. We randomly divided the records into training (80%) and test (20%) data sets, and compared the performance characteristics of 3 natural language processing models. First, we applied a rule-based approach, extracting the EDSS score from sentences containing the keyword “EDSS.” Next, we trained a convolutional neural network (CNN) model to predict the 19 half-step increments of the EDSS score. Finally, we used a combined rule-based–CNN model. For each approach, we determined the accuracy, precision, recall, and F-score compared with the reference standard, which was manually labeled EDSS scores in the clinic database. Results Overall, the combined keyword-CNN model demonstrated the best performance, with accuracy, precision, recall, and an F-score of 0.90, 0.83, 0.83, and 0.83 respectively. Respective figures for the rule-based and CNN models individually were 0.57, 0.91, 0.65, and 0.70, and 0.86, 0.70, 0.70, and 0.70. Because of missing data, the model performance for EDSS subscores was lower than that for the total EDSS score. Performance improved when considering notes with known values of the EDSS subscores. Conclusions A combined keyword-CNN natural language processing model can extract and accurately predict EDSS scores from patient records. This approach can be automated for efficient information extraction in clinical and research settings.


Sign in / Sign up

Export Citation Format

Share Document