An Unsupervised Approach to Structuring and Analyzing Repetitive Semantic Structures in Free Text of Electronic Medical Records

2022 ◽  
Vol 12 (1) ◽  
pp. 25
Author(s):  
Varvara Koshman ◽  
Anastasia Funkner ◽  
Sergey Kovalchuk

Electronic medical records (EMRs) contain a great deal of valuable patient data, which is, however, unstructured. There is also a lack of both labeled medical text data in Russian and tools for automatic annotation. As a result, it is currently hardly feasible for researchers to use EMR text data to train machine learning models in the biomedical domain. We present an unsupervised approach to medical data annotation. Syntactic trees are produced from the initial sentences using morphological and syntactic analyses. In the retrieved trees, similar subtrees are grouped using Node2Vec and Word2Vec and labeled using domain vocabularies and Wikidata categories. Using Wikidata categories increased the fraction of labeled sentences 5.5 times compared with labeling with domain vocabularies only. We show on a validation dataset that the proposed labeling method generates meaningful labels correctly for 92.7% of groups. Annotation with domain vocabularies and Wikidata categories covered more than 82% of the sentences in the corpus; when extended with timestamp and event labels, coverage reached 97% of sentences. The resulting method can be used to label EMRs in Russian automatically. Additionally, the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.
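The grouping step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how similar subtrees are grouped once embedded, so the greedy threshold rule, the function names, the threshold value, and the three-dimensional toy vectors below are all assumptions made for illustration. In practice the vectors would come from trained Node2Vec or Word2Vec models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors, threshold=0.9):
    """Greedy grouping (an assumed rule, not the paper's): each item joins
    the first group whose first member is at least `threshold` similar,
    otherwise it starts a new group."""
    groups = []
    for key, vec in vectors.items():
        for group in groups:
            if cosine(vectors[group[0]], vec) >= threshold:
                group.append(key)
                break
        else:
            groups.append([key])
    return groups


# Made-up embeddings standing in for subtree vectors.
toy_vectors = {
    "blood pressure 120/80": [0.90, 0.10, 0.00],
    "blood pressure 140/90": [0.88, 0.12, 0.01],
    "complains of headache": [0.10, 0.90, 0.20],
}
groups = group_by_similarity(toy_vectors)
```

With these toy vectors the two blood pressure phrases fall into one group and the headache phrase into another; each group could then be given a single label from a vocabulary lookup.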

2021 ◽  
Author(s):  
Varvara Koshman ◽  
Anastasia Funkner ◽  
Sergey Kovalchuk

Electronic medical records (EMRs) contain a lot of valuable data about patients, which is, however, unstructured. There is a lack of labeled medical text data in Russian, and there are no tools for automatic annotation. We present an unsupervised approach to medical data annotation. Morphological and syntactic analyses of the initial sentences produce syntactic trees, from which similar subtrees are then grouped by Word2Vec and labeled using dictionaries and Wikidata categories. This method can be used to automatically label EMRs in Russian, and the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.


Author(s):  
Karen Tu ◽  
Julie Klein-Geltink ◽  
Tezeta F Mitiku ◽  
Chiriac Mihai ◽  
Joel Martin

2020 ◽  
Author(s):  
Xiaolin Diao ◽  
Yanni Huo ◽  
Zhanzheng Yan ◽  
Haibin Wang ◽  
Jing Yuan ◽  
...  

BACKGROUND Secondary hypertension is a kind of hypertension with a definite etiology and may be cured. Patients with suspected secondary hypertension can benefit from timely detection and treatment and otherwise face a higher risk of morbidity and mortality than patients with primary hypertension. OBJECTIVE The aim of this study was to develop and validate machine learning (ML) prediction models of common etiologies in patients with suspected secondary hypertension. METHODS The analyzed dataset was retrospectively extracted from electronic medical records (EMRs) of patients discharged from Fuwai hospital between January 1, 2016 and June 30, 2019. A total of 7532 unique patients were included and divided into two datasets by time: 6302 patients in 2016-2018 as the training dataset for model building and 1230 patients in 2019 as the validation dataset for further evaluation. Extreme Gradient Boosting (XGBoost) was adopted to develop five prediction models: one for each of four etiologies of secondary hypertension, namely renovascular hypertension (RVH), primary aldosteronism (PA), thyroid dysfunction, and aortic stenosis, and one for the occurrence of any of them. Both univariate logistic analysis and Gini impurity were used for feature selection, while grid search and 10-fold cross-validation were used to select the optimal hyperparameters for each model. RESULTS The composite outcome prediction model showed good performance, with an area under the receiver-operating characteristic curve (AUC) of 0.924 in the validation dataset, while the four prediction models of RVH, PA, thyroid dysfunction, and aortic stenosis achieved AUCs of 0.938, 0.965, 0.959, and 0.946, respectively, in the validation dataset. In total, 79 clinical indicators were identified and used in our prediction models. Subgroup analysis of the composite outcome prediction model demonstrated high discrimination, with AUCs higher than 0.890 in all adult age groups. CONCLUSIONS The ML prediction models in this study showed good performance in detecting four etiologies in patients with suspected secondary hypertension; thus, they may facilitate clinical diagnostic decision making for secondary hypertension in an intelligent way.
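For readers unfamiliar with the AUC figures reported in this abstract, the metric can be sketched in a few lines. The sketch uses the standard pairwise definition: AUC is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The function name and the toy labels and scores are assumptions for illustration and have no connection to the study's data or code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative cases. Quadratic in the number of cases, which is fine for
    a small illustration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = etiology present) and model scores.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives.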


2019 ◽  
Author(s):  
Hsien-Liang Huang ◽  
Yun-Cheng Tsai ◽  
Shi-Hao Hong ◽  
Ya-Mei Hsueh

BACKGROUND Smoking is a complex behavior associated with multiple factors such as personality, environment, genetics, and emotions. Text data is a rich source of information. However, pure text data requires substantial human resources and time to extract and apply the information, resulting in many details not being discovered and used. OBJECTIVE This study proposes a novel approach that explores a text mining flow to capture the behavior of smokers quitting tobacco from their free-text medical records. More importantly, the paper explores the impact of these changes on smokers. The goal is to help smokers quit smoking. Therefore, the paper develops an algorithm for analyzing smoking cessation treatment plans documented in free-text medical records. METHODS The approach involves the development of an information extraction flow that uses a combination of data mining techniques, including text mining. It can be used not only to help others quit smoking but also for other medical records with similar data elements. RESULTS In the paper, the most visible areas for the medical application of text mining are the integration and transfer of advances made in basic sciences, as well as a better understanding of the processes involved in smoking cessation. CONCLUSIONS Text mining may also be useful for supporting decision-making processes associated with smoking cessation.


2019 ◽  
Vol 10 (S1) ◽  
Author(s):  
Hegler Tissot ◽  
Richard Dobson

Abstract Background There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, coupled with a supporting dictionary. However, they are not rich enough to encode both typing and phonetic misspellings. Results Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. Conclusion We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact match search or string-based similarity search methods.
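The idea of combining string and phonetic similarity can be sketched as follows. This is an illustration only, not the authors' algorithm: the abstract gives neither their phonetic encoding nor how the two scores are combined, so the letter mapping in `TOY_PHONETIC_MAP`, the equal weighting, and all function names are assumptions, and the mapping does not implement real Portuguese phonetic rules. Only the Levenshtein distance is a standard, well-defined component.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a, b):
    """Edit distance normalised to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical toy mapping: merges a few letters that often sound alike
# and drops "h". It is NOT a real language-specific phonetic encoding.
TOY_PHONETIC_MAP = str.maketrans({"y": "i", "k": "c", "z": "s", "h": None})


def toy_phonetic_key(word):
    """Lowercase, apply the toy mapping, and collapse repeated letters."""
    key = word.lower().translate(TOY_PHONETIC_MAP)
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def hybrid_similarity(a, b, weight=0.5):
    """Weighted mix of plain and phonetic-key similarity (assumed weighting)."""
    plain = string_similarity(a.lower(), b.lower())
    phonetic = string_similarity(toy_phonetic_key(a), toy_phonetic_key(b))
    return weight * plain + (1 - weight) * phonetic


def best_match(word, vocabulary):
    """Return the vocabulary entry most similar to `word`."""
    return max(vocabulary, key=lambda cand: hybrid_similarity(word, cand))


drug_names = ["paracetamol", "omeprazol", "dipirona"]
match = best_match("dypirona", drug_names)
```

Here the misspelling "dypirona" differs from "dipirona" by one character but has the same toy phonetic key, so the hybrid score is higher than the plain string score alone.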


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247404
Author(s):  
Akshaya V. Annapragada ◽  
Marcella M. Donaruma-Kwoh ◽  
Ananth V. Annapragada ◽  
Zbigniew A. Starosolski

Child physical abuse is a leading cause of traumatic injury and death in children. In 2017, child abuse was responsible for 1688 fatalities in the United States, of 3.5 million children referred to Child Protection Services and 674,000 substantiated victims. While large referral hospitals maintain teams trained in Child Abuse Pediatrics, smaller community hospitals often do not have such dedicated resources to evaluate patients for potential abuse. Moreover, identification of abuse leaves little margin for error, as false positive identifications lead to unwarranted separations, while false negatives allow dangerous situations to continue. This context makes the consistent detection of and response to abuse difficult, particularly given subtle signs in young, non-verbal patients. Here, we describe the development of artificial intelligence algorithms that use unstructured free text in the electronic medical record (including notes from physicians, nurses, and social workers) to identify children who are suspected victims of physical abuse. Importantly, only the notes from the time of first encounter (e.g., birth, routine visit, sickness) to the last record before child protection team involvement were used. This allowed us to develop an algorithm using only information available prior to referral to the specialized child protection team. The study was performed in a multi-center referral pediatric hospital on patients screened for abuse within five different locations between 2015 and 2019. Of 1123 patients, 867 records were available after data cleaning and processing, and 55% were abuse-positive as determined by a multi-disciplinary team of clinical professionals. These electronic medical records were encoded with three natural language processing (NLP) algorithms, Bag of Words (BOW), Word Embeddings (WE), and Rules-Based (RB), and used to train multiple neural network architectures. The BOW and WE encodings utilize the full free text, while RB selects crucial phrases as identified by physicians. The best architecture was selected by average classification accuracy for the best performing model from each train-test split of a cross-validation experiment. Natural language processing coupled with neural networks detected cases of likely child abuse using only information available to clinicians prior to child protection team referral, with an average accuracy of 0.90±0.02 and an average area under the receiver operating characteristic curve (ROC-AUC) of 0.93±0.02 for the best performing Bag of Words models. The best performing rules-based models achieved an average accuracy of 0.77±0.04 and an average ROC-AUC of 0.81±0.05, while the Word Embeddings strategy was severely limited by a lack of representative embeddings. Importantly, the best performing model had a false positive rate of 8%, as compared to rates of 20% or higher in previously reported studies. This artificial intelligence approach can help screen patients for whom an abuse concern exists and streamline the identification of patients who may benefit from referral to a child protection team. Furthermore, this approach could be applied to develop computer-aided-diagnosis platforms for the challenging and often intractable problem of reliably identifying pediatric patients suffering from physical abuse.
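The Bag of Words encoding mentioned in this abstract can be illustrated with a generic sketch: each note becomes a vector of token counts over a fixed vocabulary, which can then be fed to a classifier. The tokenizer, the function names, and the two example notes are assumptions made for illustration; the abstract does not describe the study's preprocessing or vocabulary construction.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every token seen in the documents to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Count vector for one document; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = count
    return vector


# Invented notes, used only to show the shape of the encoding.
toy_notes = [
    "Fever and cough, cough worse at night",
    "Routine visit, no fever",
]
vocabulary = build_vocabulary(toy_notes)
vectors = [bag_of_words(note, vocabulary) for note in toy_notes]
```

Word order is discarded in this representation; only how often each vocabulary token occurs in a note is kept.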

