An Unsupervised Approach to Structuring and Analyzing Repetitive Semantic Structures in Free Text of Electronic Medical Records

Electronic medical records (EMRs) contain much valuable patient data, which is, however, unstructured. Labeled medical text data in Russian is scarce, and there are no tools for automatic annotation. We present an unsupervised approach to medical data annotation. Morphological and syntactic analyses of the initial sentences produce syntactic trees, from which similar subtrees are grouped using Word2Vec and labeled using dictionaries and Wikidata categories. This method can be used to automatically label EMRs in Russian, and the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.

An Unsupervised Approach to Structuring and Analyzing Repetitive Semantic Structures in Free Text of Electronic Medical Records

Journal of Personalized Medicine ◽

10.3390/jpm12010025 ◽

2022 ◽

Vol 12 (1) ◽

pp. 25

Author(s):

Varvara Koshman ◽

Anastasia Funkner ◽

Sergey Kovalchuk

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Medical Data ◽

Validation Dataset ◽

Free Text ◽

Automatic Annotation ◽

Text Data ◽

Data Annotation ◽

Unsupervised Approach ◽

Labeling Method

Electronic medical records (EMRs) contain much valuable data about patients, which is, however, unstructured. Therefore, there is a lack of both labeled medical text data in Russian and tools for automatic annotation. As a result, it is currently hardly feasible for researchers to use EMR text data to train machine learning models in the biomedical domain. We present an unsupervised approach to medical data annotation. Syntactic trees are produced from the initial sentences using morphological and syntactic analyses. In the retrieved trees, similar subtrees are grouped using Node2Vec and Word2Vec and labeled using domain vocabularies and Wikidata categories. Using Wikidata categories increased the fraction of labeled sentences 5.5 times compared with labeling with domain vocabularies only. We show on a validation dataset that the proposed labeling method generates meaningful labels correctly for 92.7% of groups. Annotation with domain vocabularies and Wikidata categories covered more than 82% of the sentences in the corpus; when extended with timestamp and event labels, coverage reached 97% of sentences. The resulting method can be used to label EMRs in Russian automatically. Additionally, the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.
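
The grouping step described in this abstract can be illustrated with a minimal sketch. It assumes each subtree has already been mapped to an embedding vector; the toy two-dimensional vectors, the subtree names, the greedy threshold strategy, and the function names below are all hypothetical stand-ins and are not taken from the paper, which uses Node2Vec and Word2Vec embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_by_similarity(
    vectors: Dict[str, Sequence[float]], threshold: float = 0.9
) -> List[List[str]]:
    """Greedy grouping (an assumed strategy): each item joins the first group
    whose representative vector is similar enough, else it starts a new group."""
    groups = []  # list of (representative vector, member names)
    for name, vec in vectors.items():
        for representative, members in groups:
            if cosine(representative, vec) >= threshold:
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]


# Hypothetical embeddings for three subtrees.
toy_vectors = {
    "subtree_1": [1.0, 0.0],
    "subtree_2": [0.9, 0.1],
    "subtree_3": [0.0, 1.0],
}
print(group_by_similarity(toy_vectors))
```

With these toy vectors, the first two subtrees fall into one group and the third forms its own. Each resulting group would then be a candidate for a shared label.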

Response to commentaries on ‘Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK’

Journal of Medical Ethics ◽

10.1136/medethics-2020-106430 ◽

2020 ◽

Vol 46 (6) ◽

pp. 384-385

Author(s):

Elizabeth Ford ◽

Malcolm Oswald

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Free Text ◽

Text Data ◽

The UK

De-identification of primary care electronic medical records free-text data in Ontario, Canada

BMC Medical Informatics and Decision Making ◽

10.1186/1472-6947-10-35 ◽

2010 ◽

Vol 10 (1) ◽

Cited By ~ 12

Author(s):

Karen Tu ◽

Julie Klein-Geltink ◽

Tezeta F Mitiku ◽

Chiriac Mihai ◽

Joel Martin

Keyword(s):

Primary Care ◽

Electronic Medical Records ◽

Medical Records ◽

Free Text ◽

Text Data

Approaches to Text Mining for Analyzing Treatment Plan of Quit Smoking with Free-text Medical Records (Preprint)

10.2196/preprints.15844 ◽

2019 ◽

Author(s):

Hsien-Liang Huang ◽

Yun-Cheng Tsai ◽

Shi-Hao Hong ◽

Ya-Mei Hsueh

Keyword(s):

Smoking Cessation ◽

Text Mining ◽

Medical Records ◽

Treatment Plan ◽

Free Text ◽

Similar Data ◽

Text Data ◽

Smoking Cessation Treatment ◽

Quit Smoking ◽

The Impact

BACKGROUND: Smoking is a complex behavior associated with multiple factors such as personality, environment, genetics, and emotions. Text data is a rich source of information. However, pure text data requires substantial human resources and time to extract and apply the information, resulting in many details not being discovered and used.

OBJECTIVE: This study proposes a novel approach that explores a text mining flow to capture the behavior of smokers quitting tobacco from their free-text medical records. More importantly, the paper explores the impact of these changes on smokers. The goal is to help smokers quit smoking. Therefore, the paper develops an algorithm for analyzing smoking cessation treatment plans documented in free-text medical records.

METHODS: The approach involves the development of an information extraction flow that uses a combination of data mining techniques, including text mining. It can be used not only to help others quit smoking but also for other medical records with similar data elements.

RESULTS: In the paper, the most visible areas for the medical application of text mining are the integration and transfer of advances made in basic sciences, as well as a better understanding of the processes involved in smoking cessation.

CONCLUSIONS: Text mining may also be useful for supporting decision-making processes associated with smoking cessation.

MedTS: A BERT-based generation model to transform medical texts to SQL queries for electronic medical records (Preprint)

10.2196/preprints.32698 ◽

2021 ◽

Author(s):

Youcheng Pan ◽

Chenghao Wang ◽

Baotian Hu ◽

Yang Xiang ◽

Xiaolong Wang ◽

...

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Relational Databases ◽

Query Language ◽

Search Space ◽

Intermediate Representation ◽

Generation Model ◽

Medical Texts ◽

Medical Text ◽

SQL Query

BACKGROUND: Electronic medical records (EMRs) are usually stored in relational databases that require structured query language (SQL) queries to retrieve information of interest. Writing such queries is usually challenging for medical experts, who often lack the necessary technical expertise. However, existing text-to-SQL generation studies have not been fully embraced in the medical domain.

OBJECTIVE: The objective of this study was to propose a neural generation model that jointly considers the characteristics of medical text and the SQL structure to automatically transform medical texts into SQL queries for EMRs.

METHODS: Rather than treating the SQL query as an ordinary word sequence, we introduce the syntax tree as an intermediate representation, which is more in line with the tree-structured nature of SQL and can also effectively reduce the search space during generation. We proposed a medical text-to-SQL model (MedTS), which employed a pre-trained BERT as the encoder and a grammar-based LSTM as the decoder to predict the tree-structured intermediate representation, which can be easily transformed into the final SQL query. Experiments were conducted on the MIMICSQL dataset, and five competitor methods were compared.

RESULTS: Experimental results demonstrated that MedTS achieved accuracies of 0.770 and 0.888 on the test set in terms of logic form and execution, respectively, significantly outperforming the existing state-of-the-art methods. Further analyses showed that performance on each component of the generated SQL was relatively balanced and substantially improved.

CONCLUSIONS: The proposed MedTS was effective and robust in improving the performance of medical text-to-SQL generation, indicating strong potential for application in real medical scenarios.
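
To make the idea of a tree-structured intermediate representation more concrete, here is a minimal sketch of how such a tree might be rendered into a SQL string. The node classes, the `to_sql` function, and the table and column names are hypothetical illustrations; they are not the MedTS grammar or the MIMICSQL schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Condition:
    """One WHERE predicate, e.g. gender = 'F' (hypothetical node type)."""
    column: str
    op: str
    value: str


@dataclass
class SelectNode:
    """Root of a toy query tree covering only simple SELECT statements."""
    table: str
    columns: List[str]
    aggregate: Optional[str] = None
    conditions: List[Condition] = field(default_factory=list)


def to_sql(node: SelectNode) -> str:
    """Render the tree as a SQL string.

    Values are quoted naively, so this is for illustration only and is
    not safe for untrusted input.
    """
    cols = ", ".join(node.columns)
    if node.aggregate:
        cols = f"{node.aggregate}({cols})"
    sql = f"SELECT {cols} FROM {node.table}"
    if node.conditions:
        where = " AND ".join(
            f"{c.column} {c.op} '{c.value}'" for c in node.conditions
        )
        sql += f" WHERE {where}"
    return sql


# Hypothetical tree for "how many female patients are there?"
tree = SelectNode(
    table="patients",
    columns=["subject_id"],
    aggregate="COUNT",
    conditions=[Condition("gender", "=", "F")],
)
print(to_sql(tree))
```

Because a decoder only has to choose among valid node types and fields at each step, predicting such a tree constrains the output space more than emitting SQL token by token, which is the motivation the abstract gives for the intermediate representation.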

Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records

BMC Medical Informatics and Decision Making ◽

10.1186/1472-6947-13-30 ◽

2013 ◽

Vol 13 (1) ◽

Cited By ~ 20

Author(s):

Zubair Afzal ◽

Martijn J Schuemie ◽

Jan C van Blijderveen ◽

Elif F Sen ◽

Miriam CJM Sturkenboom ◽

...

Keyword(s):

Machine Learning ◽

Electronic Medical Records ◽

Medical Records ◽

Free Text ◽

Learning Methods ◽

Case Identification ◽

Machine Learning Methods

Research on text data mining of hospital patient records within Electronic Medical Records

2014 Joint 7th International Conference on Soft Computing and Intelligent Systems (SCIS) and 15th International Symposium on Advanced Intelligent Systems (ISIS) ◽

10.1109/scis-isis.2014.7044651 ◽

2014 ◽

Author(s):

Muneo Kushima ◽

Kenji Araki ◽

Muneou Suzuki ◽

Tomoyoshi Yamazaki ◽

Noboru Sonehara

Keyword(s):

Data Mining ◽

Electronic Medical Records ◽

Medical Records ◽

Patient Records ◽

Hospital Patient ◽

Text Data ◽

Text Data Mining

Medical Data Breaches: What the Reported Data Illustrates, and Implications for Transitioning to Electronic Medical Records

Journal of Applied Security Research ◽

10.1080/19361610.2013.738397 ◽

2013 ◽

Vol 8 (1) ◽

pp. 61-79

Author(s):

Akshat Kapoor ◽

Derek L. Nazareth

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Medical Data ◽

Data Breaches ◽

Reported Data

Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

Journal of Biomedical Semantics ◽

10.1186/s13326-019-0216-2 ◽

2019 ◽

Vol 10 (S1) ◽

Cited By ~ 2

Author(s):

Hegler Tissot ◽

Richard Dobson

Keyword(s):

Medical Records ◽

Similarity Search ◽

Hybrid Approach ◽

Free Text ◽

Distance Metrics ◽

Exact Match ◽

Text Data ◽

String Similarity ◽

Phonetic Similarity ◽

String Distance

Background: There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, coupled with a supporting dictionary. However, they are not rich enough to encode both typing and phonetic misspellings.

Results: Experimental results showed that a joint string and language-dependent phonetic similarity is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese.

Conclusion: We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact match search or string-based similarity search methods.
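
The general idea of combining string and phonetic similarity can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it substitutes a simplified, English-oriented Soundex key for the language-dependent Portuguese phonetic encoding the abstract refers to, uses `difflib.SequenceMatcher` from the Python standard library as the string metric, and the equal weighting, the threshold, the function names, and the drug vocabulary are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Consonant classes of classic Soundex (a stand-in phonetic encoder).
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """Simplified four-character Soundex key; empty string for no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def hybrid_similarity(a: str, b: str, string_weight: float = 0.5) -> float:
    """Weighted mix of string similarity and phonetic-key similarity."""
    string_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phonetic_sim = SequenceMatcher(None, soundex(a), soundex(b)).ratio()
    return string_weight * string_sim + (1.0 - string_weight) * phonetic_sim


def best_match(
    token: str, vocabulary: List[str], threshold: float = 0.75
) -> Optional[str]:
    """Return the most similar vocabulary entry, or None below the threshold."""
    best = max(vocabulary, key=lambda entry: hybrid_similarity(token, entry))
    return best if hybrid_similarity(token, best) >= threshold else None


# Hypothetical drug vocabulary and a misspelt token.
drug_names = ["paracetamol", "ibuprofeno", "amoxicilina"]
print(best_match("paracetamal", drug_names))
```

A real system for Portuguese would replace `soundex` with a phonetic encoder designed for that language and tune the weight and threshold on labeled misspellings.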

A natural language processing and deep learning approach to identify child abuse from pediatric electronic medical records

PLoS ONE ◽

10.1371/journal.pone.0247404 ◽

2021 ◽

Vol 16 (2) ◽

pp. e0247404

Author(s):

Akshaya V. Annapragada ◽

Marcella M. Donaruma-Kwoh ◽

Ananth V. Annapragada ◽

Zbigniew A. Starosolski

Keyword(s):

Child Abuse ◽

Natural Language Processing ◽

Natural Language ◽

Physical Abuse ◽

Electronic Medical Records ◽

Language Processing ◽

Child Protection ◽

Medical Records ◽

Free Text ◽

Average Accuracy

Child physical abuse is a leading cause of traumatic injury and death in children. In 2017, child abuse was responsible for 1688 fatalities in the United States, of 3.5 million children referred to Child Protection Services and 674,000 substantiated victims. While large referral hospitals maintain teams trained in Child Abuse Pediatrics, smaller community hospitals often do not have such dedicated resources to evaluate patients for potential abuse. Moreover, identification of abuse has a low margin of error, as false positive identifications lead to unwarranted separations, while false negatives allow dangerous situations to continue. This context makes the consistent detection of and response to abuse difficult, particularly given subtle signs in young, non-verbal patients. Here, we describe the development of artificial intelligence algorithms that use unstructured free text in the electronic medical record (including notes from physicians, nurses, and social workers) to identify children who are suspected victims of physical abuse. Importantly, only the notes from the time of first encounter (e.g., birth, routine visit, sickness) to the last record before child protection team involvement were used. This allowed us to develop an algorithm using only information available prior to referral to the specialized child protection team. The study was performed in a multi-center referral pediatric hospital on patients screened for abuse within five different locations between 2015 and 2019. Of 1123 patients, 867 records were available after data cleaning and processing, and 55% were abuse-positive as determined by a multi-disciplinary team of clinical professionals. These electronic medical records were encoded with three natural language processing (NLP) algorithms, namely Bag of Words (BOW), Word Embeddings (WE), and Rules-Based (RB), and used to train multiple neural network architectures. The BOW and WE encodings utilize the full free text, while RB selects crucial phrases as identified by physicians. The best architecture was selected by average classification accuracy for the best performing model from each train-test split of a cross-validation experiment. Natural language processing coupled with neural networks detected cases of likely child abuse using only information available to clinicians prior to child protection team referral with average accuracy of 0.90±0.02 and average area under the receiver operating characteristic curve (ROC-AUC) of 0.93±0.02 for the best performing Bag of Words models. The best performing rules-based models achieved average accuracy of 0.77±0.04 and average ROC-AUC of 0.81±0.05, while a Word Embeddings strategy was severely limited by lack of representative embeddings. Importantly, the best performing model had a false positive rate of 8%, as compared to rates of 20% or higher in previously reported studies. This artificial intelligence approach can help screen patients for whom an abuse concern exists and streamline the identification of patients who may benefit from referral to a child protection team. Furthermore, this approach could be applied to develop computer-aided-diagnosis platforms for the challenging and often intractable problem of reliably identifying pediatric patients suffering from physical abuse.
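
As a generic illustration of the Bag of Words encoding mentioned in this abstract, the sketch below turns free-text notes into fixed-length count vectors that a classifier could consume. The tokenizer, the function names, and the example notes are hypothetical and are not taken from the study.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents: List[str]) -> List[str]:
    """Sorted list of every distinct token seen in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def encode(document: str, vocabulary: List[str]) -> List[int]:
    """Count vector: one entry per vocabulary token, in vocabulary order."""
    counts = Counter(tokenize(document))
    return [counts[token] for token in vocabulary]


# Hypothetical notes standing in for free-text record entries.
notes = ["Fever and cough", "Cough, cough, no fever"]
vocabulary = build_vocabulary(notes)
vectors = [encode(note, vocabulary) for note in notes]
print(vocabulary)
print(vectors)
```

Word order is discarded in this representation, so only token frequencies reach the downstream model.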