Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese

2019 ◽  
Vol 10 (S1) ◽  
Author(s):  
Hegler Tissot ◽  
Richard Dobson

Abstract Background There is an increasing amount of unstructured medical data that can be analysed for different purposes. However, information extraction from free-text data may be particularly inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, coupled with a supporting dictionary. However, they are not rich enough to encode both typing and phonetic misspellings. Results Experimental results showed that a joint string and language-dependent phonetic similarity measure is more accurate than traditional string distance metrics when identifying misspelt names of drugs in a set of medical records written in Portuguese. Conclusion We present a hybrid approach to efficiently perform similarity matching that overcomes the loss of information inherent in using either exact-match search or string-based similarity search methods.
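The hybrid idea can be sketched as follows, with a simplified Soundex-style encoding standing in for the paper's language-dependent phonetic model for Portuguese; the function names and the 0.5 weighting are illustrative assumptions, not the authors' implementation:

```python
# Sketch: joint string + phonetic similarity for matching misspelt drug names.
# Assumes a simplified (English) Soundex encoding; the paper's actual method
# uses a phonetic model tailored to Portuguese.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_sim(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1] (1 = identical)."""
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - levenshtein(a, b) / m

# Soundex consonant classes: letters in the same class sound alike.
SOUNDEX = {c: d for d, cs in
           {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
            "4": "l", "5": "mn", "6": "r"}.items() for c in cs}

def phonetic_code(word: str) -> str:
    """Simplified Soundex: first letter + consonant digits, no repeats, 4 chars."""
    word = word.lower()
    code = word[0].upper()
    last = SOUNDEX.get(word[0], "")
    for c in word[1:]:
        d = SOUNDEX.get(c, "")
        if d and d != last:
            code += d
        last = d
    return (code + "000")[:4]

def joint_sim(a: str, b: str, w: float = 0.5) -> float:
    """Weighted combination of string and phonetic similarity."""
    return w * string_sim(a, b) + (1 - w) * string_sim(phonetic_code(a),
                                                       phonetic_code(b))
```

Under this combination a phonetically faithful misspelling (e.g. "dipyrona" for "dipirona") scores higher than an unrelated drug name at the same edit distance, which is the loss of information a purely string-based search cannot recover.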

2019 ◽  
Author(s):  
Hsien-Liang Huang ◽  
Yun-Cheng Tsai ◽  
Shi-Hao Hong ◽  
Ya-Mei Hsueh

BACKGROUND Smoking is a complex behavior associated with multiple factors such as personality, environment, genetics, and emotions. Text data is a rich source of information; however, extracting and applying information from raw text requires substantial human effort and time, so many details go undiscovered and unused. OBJECTIVE This study proposes a novel approach that uses a text mining flow to capture the behavior of smokers quitting tobacco from their free-text medical records and, more importantly, explores the impact of these behavioral changes on smokers, with the goal of helping smokers quit. To this end, the paper develops an algorithm for analyzing smoking cessation treatment plans documented in free-text medical records. METHODS The approach involves the development of an information extraction flow that combines data mining techniques, including text mining. It can be applied not only to smoking cessation but also to other medical records with similar data elements. RESULTS The most visible areas for the medical application of text mining are the integration and transfer of advances made in the basic sciences, as well as a better understanding of the processes involved in smoking cessation. CONCLUSIONS Text mining may also be useful for supporting decision-making processes associated with smoking cessation.


2022 ◽  
Vol 12 (1) ◽  
pp. 25
Author(s):  
Varvara Koshman ◽  
Anastasia Funkner ◽  
Sergey Kovalchuk

Electronic medical records (EMRs) contain much valuable data about patients, most of which is, however, unstructured. There is also a lack of both labeled medical text data in Russian and tools for automatic annotation. As a result, it is hardly feasible today for researchers to utilize the text data of EMRs to train machine learning models in the biomedical domain. We present an unsupervised approach to medical data annotation. Syntactic trees are produced from the initial sentences using morphological and syntactical analyses. In the retrieved trees, similar subtrees are grouped using Node2Vec and Word2Vec and labeled using domain vocabularies and Wikidata categories. The use of Wikidata categories increased the fraction of labeled sentences 5.5 times compared to labeling with domain vocabularies only. We show on a validation dataset that the proposed labeling method generates correct, meaningful labels for 92.7% of groups. Annotation with domain vocabularies and Wikidata categories covered more than 82% of the sentences of the corpus; when extended with timestamp and event labels, coverage rose to 97% of sentences. The resulting method can be used to label EMRs in Russian automatically. Additionally, the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.
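The grouping-and-labelling step can be sketched as follows, with toy 2-D vectors standing in for the Node2Vec/Word2Vec subtree embeddings and tiny stand-in vocabularies; the greedy clustering, threshold, and all names here are illustrative assumptions, not the authors' code:

```python
# Sketch: group subtree phrases by embedding similarity, then label each
# group via domain-vocabulary matching (a stand-in for the Wikidata step).
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def group_subtrees(embeddings, threshold=0.9):
    """Greedy single-pass clustering: a phrase joins the first group whose
    representative vector is similar enough, else it starts a new group."""
    groups = []  # each entry: (representative vector, member phrases)
    for phrase, vec in embeddings.items():
        for rep, members in groups:
            if cosine(rep, vec) >= threshold:
                members.append(phrase)
                break
        else:
            groups.append((vec, [phrase]))
    return [members for _, members in groups]

def label_group(members, vocabularies):
    """Label a group with the first vocabulary any member phrase matches."""
    for label, terms in vocabularies.items():
        if any(t in phrase for phrase in members for t in terms):
            return label
    return "unlabelled"

# Toy data: invented 2-D "embeddings" of subtree phrases.
embeddings = {
    "chest pain":       (0.90, 0.10),
    "acute chest pain": (0.88, 0.15),
    "aspirin 100 mg":   (0.10, 0.95),
}
vocabularies = {"symptom": ["pain"], "drug": ["aspirin"]}

groups = group_subtrees(embeddings)
labels = [label_group(g, vocabularies) for g in groups]
```

The unlabelled fallback is where a broader resource such as Wikidata categories pays off in the paper: groups that no domain vocabulary covers can still receive a label, which is what lifted sentence coverage from vocabulary-only labelling.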


Author(s):  
Karen Tu ◽  
Julie Klein-Geltink ◽  
Tezeta F Mitiku ◽  
Chiriac Mihai ◽  
Joel Martin

2021 ◽  
Author(s):  
Varvara Koshman ◽  
Anastasia Funkner ◽  
Sergey Kovalchuk

Electronic Medical Records (EMR) contain much valuable data about patients, but it is largely unstructured. There is a lack of labeled medical text data in Russian, and there are no tools for automatic annotation. We present an unsupervised approach to medical data annotation. Morphological and syntactical analyses of the initial sentences produce syntactic trees, from which similar subtrees are then grouped by Word2Vec and labeled using dictionaries and Wikidata categories. This method can be used to automatically label EMRs in Russian, and the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.


1972 ◽  
Vol 11 (03) ◽  
pp. 152-162 ◽  
Author(s):  
P. GAYNON ◽  
R. L. WONG

With the objective of providing easier access to pathology specimens, slides, and Kodachromes, with linkage to X-rays and the remainder of the patient's medical record, an automated natural-language parsing routine, based on dictionary look-up, was written for Surgical Pathology document pairs, each consisting of a Request for Examination (authored by clinicians) and its corresponding report (authored by pathologists). These documents were input to the system in free-text English without manual editing or coding. Two types of indices were prepared. The first was an "inverted" file, available for on-line retrieval, for display of the content of the document pairs, frequency counts of cases, or listing of cases in table format. Retrievable items are the patient's and specimen's identification data, date of operation, names of the clinician and pathologist, etc. The English content of the operative procedure, clinical findings, and pathologic diagnoses can be retrieved through logical combinations of keywords. The second type of index was a catalog. Three catalog files, "operation", "clinical", and "pathology", were prepared by alphabetization of lines formed by the rotation of phrases headed by keywords. These keywords were automatically selected and standardized by the parsing routine, and the phrases were extracted from each sentence of each input document. Over 2,500 document pairs have been entered and are currently being used for purposes of medical education.


1976 ◽  
Vol 15 (01) ◽  
pp. 21-28 ◽  
Author(s):  
Carmen A. Scudiero ◽  
Ruth L. Wong

A free-text data collection system has been developed at the University of Illinois utilizing single-word, syntax-free dictionary lookup to process data for retrieval. The source document for the system is the Surgical Pathology Request and Report form. To date, 12,653 documents have been entered into the system. The free-text data were used to create an IRS (Information Retrieval System) database. A program to interrogate this database has been developed to numerically code operative procedures. A total of 16,519 procedure records were generated. Of these, 1.9% of the procedures could not be fitted into any procedure category; 6.1% could not be specifically coded, while 92% were coded into specific categories. A system of PL/1 programs has been developed to facilitate manual editing of these records, which can be performed in a reasonable length of time (1 week). This manual check reveals that these 92% were coded with precision = 0.931 and recall = 0.924. Correction of the readily correctable errors could improve these figures to precision = 0.977 and recall = 0.987. Syntax errors were relatively unimportant in the overall coding process, but did introduce significant error in some categories, such as when a right-left-bilateral distinction was attempted. The coded file that has been constructed will be used as an input file to a gynecological disease/Pap smear correlation system. The outputs of this system will include retrospective information on the natural history of selected diseases and a patient log providing information to the clinician on patient follow-up. Thus a free-text data collection system can be utilized to produce numerically coded files of reasonable accuracy. Further, these files can be used as a source of useful information both for the clinician and for the medical researcher.


2020 ◽  
Author(s):  
Emma Chavez ◽  
Vanessa Perez ◽  
Angélica Urrutia

BACKGROUND Currently, hypertension is one of the diseases with the greatest risk of mortality in the world. In Chile in particular, 90% of the population with this disease has idiopathic or essential hypertension. Essential hypertension is characterized by high blood pressure, and its cause is unknown, which means that each patient may require a different treatment depending on their history and symptoms. Different data, such as history, symptoms, and exam results, are generated for each patient suffering from the disease. These data appear in the patient's medical record in no particular order, making it difficult to search for relevant information. Therefore, there is a need for a common, unified vocabulary of terms that adequately represents the disease, making searches within the domain more effective. OBJECTIVE The objective of this study is to develop a domain ontology for essential hypertension, thereby providing a tool that arranges the most significant data within the domain for medical training or to support physicians' decision making. METHODS The terms used for the ontology were extracted from the medical histories of de-identified medical records of patients with essential hypertension. The SNOMED CT collection of medical terms and clinical guidelines for controlling the disease were also used. Methontology was used for the design, the definition of classes and their hierarchy, and the relationships between concepts and instances. Three criteria were used to validate the ontology, which also helped to measure its quality. Tests were run with a dataset to verify that the tool was created according to the requirements. RESULTS An ontology of 310 instances classified into 37 classes was developed. From these, 4 superclasses and 30 relationships were obtained. In the dataset tests, 100% correct and coherent answers were obtained in the three quality tests.
CONCLUSIONS The development of this ontology provides a tool for physicians, specialists, and students, among others, that can be incorporated into clinical systems to support decision making regarding essential hypertension. Nevertheless, more instances should be incorporated into the ontology by carrying out further searches in the medical history or free-text sections of the medical records of patients with this disease.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Eyal Klang ◽  
Benjamin R. Kummer ◽  
Neha S. Dangayach ◽  
Amy Zhong ◽  
M. Arash Kia ◽  
...  

Abstract Early admission to the neurosciences intensive care unit (NSICU) is associated with improved patient outcomes. Natural language processing offers new possibilities for mining free text in electronic health record data. We sought to develop a machine learning model using both tabular and free-text data to identify patients requiring NSICU admission shortly after arrival to the emergency department (ED). We conducted a single-center, retrospective cohort study of adult patients at the Mount Sinai Hospital, an academic medical center in New York City. All patients presenting to our institutional ED between January 2014 and December 2018 were included. Structured (tabular) demographic, clinical, and bed movement record data, together with free-text data from triage notes, were extracted from our institutional data warehouse. A machine learning model was trained to predict the likelihood of NSICU admission at 30 min from arrival to the ED. We identified 412,858 patients presenting to the ED over the study period, of whom 1900 (0.5%) were admitted to the NSICU. The daily median number of ED presentations was 231 (IQR 200–256) and the median time from ED presentation to the decision for NSICU admission was 169 min (IQR 80–324). A model trained only with text data had an area under the receiver operating characteristic curve (AUC) of 0.90 (95% confidence interval (CI) 0.87–0.91). A structured data-only model had an AUC of 0.92 (95% CI 0.91–0.94). A combined model trained on structured and text data had an AUC of 0.93 (95% CI 0.92–0.95). At a false positive rate of 1:100 (99% specificity), the combined model was 58% sensitive for identifying NSICU admission. A machine learning model using structured and free-text data can predict NSICU admission soon after ED arrival. This may potentially improve ED and NSICU resource allocation. Further studies should validate our findings.
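The idea of combining structured and free-text features in one classifier can be sketched as follows, using a toy bag-of-words, invented triage snippets, and a hand-rolled logistic regression; the study itself used far richer EHR data and modelling tooling, so everything below is illustrative:

```python
# Sketch: concatenate structured features with bag-of-words counts from a
# triage note, then fit a simple logistic regression by per-sample
# (stochastic) gradient descent on toy, invented data.
import math

def featurize(note: str, vocab: list, structured: list) -> list:
    """Concatenate structured values with bag-of-words counts for the note."""
    tokens = note.lower().split()
    return structured + [tokens.count(w) for w in vocab]

def train_logreg(X, y, lr=0.5, epochs=500):
    """Per-sample gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            g = p - yi                               # gradient of the loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Predicted probability of the positive class."""
    return 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# Toy triage notes, each with one structured feature (e.g. a severity flag),
# labelled 1 for NSICU-bound and 0 otherwise. All data is invented.
vocab = ["headache", "weakness", "cough"]
data = [("sudden severe headache and weakness", [1], 1),
        ("mild cough",                          [0], 0),
        ("worst headache of life",              [1], 1),
        ("cough and cold",                      [0], 0)]
X = [featurize(note, vocab, s) for note, s, _ in data]
y = [label for _, _, label in data]
w, b = train_logreg(X, y)
```

The point the abstract makes is visible even in this sketch: the text tokens and the structured flag enter the same weight vector, so the model can trade evidence between the two sources rather than using either alone.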

