Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets

Author(s):  
Denis Newman-Griffis ◽  
Guy Divita ◽  
Bart Desmet ◽  
Ayah Zirikly ◽  
Carolyn P Rosé ◽  
...  

Abstract Objectives Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity, in which words or phrases may refer to different concepts, has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research. Materials and Methods We identified ambiguous strings in datasets derived from the 2 available clinical corpora for concept normalization and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language System (UMLS) to assess how representative available datasets are of ambiguity in clinical language. Results We found that <15% of strings were ambiguous within the datasets, while over 50% were ambiguous in the UMLS, indicating only partial coverage of clinical ambiguity. The percentage of strings in common between any pair of datasets ranged from 2% to only 36%; of these, 40% were annotated with different sets of concepts, severely limiting generalization. Finally, we observed 12 distinct types of ambiguity, distributed unequally across the available datasets, reflecting diverse linguistic and medical phenomena. Discussion Existing datasets are not sufficient to cover the diversity of clinical concept ambiguity, limiting both training and evaluation of normalization methods for clinical text. Additionally, the UMLS offers important semantic information for building and evaluating normalization methods. Conclusions Our findings identify 3 opportunities for concept normalization research, including a need for ambiguity-specific clinical datasets and leveraging the rich semantics of the UMLS in new methods and evaluation measures for normalization.
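
The two measurements reported in this abstract (the share of strings that are ambiguous within a dataset, and the share of strings common to two datasets that carry different concept sets) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the lowercasing step, and the toy concept identifiers are all assumptions made for illustration.

```python
from collections import defaultdict


def concepts_by_string(annotations):
    """Map each lowercased mention string to the set of concept IDs annotated for it."""
    mapping = defaultdict(set)
    for mention, concept_id in annotations:
        mapping[mention.lower()].add(concept_id)
    return dict(mapping)


def ambiguous_fraction(mapping):
    """Fraction of distinct strings annotated with more than one concept."""
    if not mapping:
        return 0.0
    ambiguous = sum(1 for concepts in mapping.values() if len(concepts) > 1)
    return ambiguous / len(mapping)


def shared_and_differing(map_a, map_b):
    """Strings present in both datasets, and the subset whose concept sets differ."""
    shared = set(map_a) & set(map_b)
    differing = {s for s in shared if map_a[s] != map_b[s]}
    return shared, differing


# Toy annotations; the concept IDs are hypothetical placeholders, not real UMLS CUIs.
dataset_a = concepts_by_string([
    ("cold", "C_COMMON_COLD"),
    ("cold", "C_COLD_TEMPERATURE"),
    ("discharge", "C_BODY_FLUID_DISCHARGE"),
    ("fever", "C_FEVER"),
])
dataset_b = concepts_by_string([
    ("cold", "C_COMMON_COLD"),
    ("fever", "C_FEVER"),
    ("ms", "C_MULTIPLE_SCLEROSIS"),
])

shared, differing = shared_and_differing(dataset_a, dataset_b)
print(ambiguous_fraction(dataset_a), sorted(shared), sorted(differing))
```

In the toy data, "cold" is ambiguous in the first dataset but not in the second, so it counts both as an ambiguous string and as a shared string with differing annotations.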

2017 ◽  
Vol 24 (4) ◽  
pp. 841-844 ◽  
Author(s):  
Dina Demner-Fushman ◽  
Willie J Rogers ◽  
Alan R Aronson

Abstract MetaMap is a widely used named entity recognition tool that identifies concepts from the Unified Medical Language System Metathesaurus in text. This study presents MetaMap Lite, an implementation of some of the basic MetaMap functions in Java. On several collections of biomedical literature and clinical text, MetaMap Lite demonstrated real-time speed and precision, recall, and F1 scores comparable to or exceeding those of MetaMap and other popular biomedical text processing tools, clinical Text Analysis and Knowledge Extraction System (cTAKES) and DNorm.


2017 ◽  
Vol 2017 ◽  
pp. 1-10 ◽  
Author(s):  
Jung-wei Fan ◽  
Jianrong Li ◽  
Yves A. Lussier

Exposome is a critical dimension in the precision medicine paradigm. Effective representation of exposomics knowledge is instrumental to melding nongenetic factors into data analytics for clinical research. There is still limited work in (1) modeling exposome entities and relations with proper integration to mainstream ontologies and (2) systematically studying their presence in clinical context. Through selected ontological relations, we developed a template-driven approach to identifying exposome concepts from the Unified Medical Language System (UMLS). The derived concepts were evaluated in terms of literature coverage and the ability to assist in annotating clinical text. The generated semantic model represents rich domain knowledge about exposure events (454 pairs of relations between exposure and outcome). Additionally, a list of 5667 disorder concepts with microbial etiology was created for inferred pathogen exposures. The model consistently covered about 90% of PubMed literature on exposure-induced iatrogenic diseases over 10 years (2001–2010). The model contributed to the efficiency of exposome annotation in clinical text by filtering out 78% of irrelevant machine annotations. Analysis into 50 annotated discharge summaries helped advance our understanding of the exposome information in clinical text. This pilot study demonstrated feasibility of semiautomatically developing a useful semantic resource for exposomics.


Author(s):  
Jun Xu ◽  
Zhiheng Li ◽  
Qiang Wei ◽  
Yonghui Wu ◽  
Yang Xiang ◽  
...  

Abstract Background To detect attributes of medical concepts in clinical text, a traditional method often consists of two steps: named entity recognition of attributes and then relation classification between medical concepts and attributes. Here we present a novel solution, in which attribute detection of given concepts is converted into a sequence labeling problem, so that attribute entity recognition and relation classification are done simultaneously in one step. Methods A neural architecture combining bidirectional Long Short-Term Memory networks and Conditional Random Fields (Bi-LSTM-CRF) was adopted to detect various medical concept-attribute pairs in an efficient way. We then compared our deep learning-based sequence labeling approach with traditional two-step systems for three different attribute detection tasks: disease-modifier, medication-signature, and lab test-value. Results Our results show that the proposed method achieved higher accuracy than the traditional methods for all three medical concept-attribute detection tasks. Conclusions This study demonstrates the efficacy of our sequence labeling approach using Bi-LSTM-CRF on the attribute detection task, indicating its potential to speed up practical clinical NLP applications.
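
One way to picture the conversion to sequence labeling is to build a separate label sequence for each target concept, tagging only the attributes that belong to that concept. The Python sketch below does this with BIO tags. The tag names, the way the target concept is marked, and the example sentence are assumptions for illustration; the abstract does not specify the encoding the authors used, and the Bi-LSTM-CRF model itself is not shown.

```python
def encode_for_concept(tokens, concept_span, attribute_spans):
    """Build one BIO label sequence for a single target concept.

    concept_span: (start, end) token indices of the target concept, end exclusive.
    attribute_spans: (start, end, attr_type) triples for attributes linked to
        this concept only; attributes of other concepts stay labeled "O".
    """
    labels = ["O"] * len(tokens)
    c_start, c_end = concept_span
    for i in range(c_start, c_end):
        labels[i] = "B-CONCEPT" if i == c_start else "I-CONCEPT"
    for start, end, attr_type in attribute_spans:
        for i in range(start, end):
            labels[i] = f"B-{attr_type}" if i == start else f"I-{attr_type}"
    return labels


tokens = "aspirin 81 mg daily and metoprolol 25 mg twice daily".split()

# One training sequence per target medication (hypothetical annotation).
for_aspirin = encode_for_concept(
    tokens, (0, 1), [(1, 3, "DOSAGE"), (3, 4, "FREQUENCY")]
)
for_metoprolol = encode_for_concept(
    tokens, (5, 6), [(6, 8, "DOSAGE"), (8, 10, "FREQUENCY")]
)

for token, a, m in zip(tokens, for_aspirin, for_metoprolol):
    print(f"{token:12} {a:12} {m}")
```

Because the labels depend on which concept is the target, a tagger trained on such sequences decides in a single pass both where an attribute is and which concept it belongs to, which is the one-step behavior the abstract describes.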


2019 ◽  
Author(s):  
Christian Holz ◽  
Torsten Kessler ◽  
Martin Dugas ◽  
Julian Varghese

BACKGROUND For cancer domains such as acute myeloid leukemia (AML), a large set of data elements is obtained from different institutions with heterogeneous data definitions within one patient course. The lack of clinical data harmonization impedes cross-institutional electronic data exchange and future meta-analyses. OBJECTIVE This study aimed to identify and harmonize a semantic core of common data elements (CDEs) in clinical routine and research documentation, based on a systematic metadata analysis of existing documentation models. METHODS Lists of relevant data items were collected and reviewed by hematologists from two university hospitals regarding routine documentation and several case report forms of clinical trials for AML. In addition, existing registries and international recommendations were included. Data items were coded to medical concepts via the Unified Medical Language System (UMLS) by a physician and reviewed by another physician. On the basis of the coded concepts, the data sources were analyzed for concept overlaps and identification of the most frequent concepts. The most frequent concepts were then implemented as data elements in the standardized format of the Operational Data Model by the Clinical Data Interchange Standards Consortium. RESULTS A total of 3265 medical concepts were identified, of which 1414 were unique. Among the 1414 unique medical concepts, the 50 most frequent ones cover 26.98% of all concept occurrences within the collected AML documentation. The top 100 concepts represent 39.48% of all concepts’ occurrences. Implementation of CDEs is available on a European research infrastructure and can be downloaded in different formats for reuse in different electronic data capture systems. CONCLUSIONS Information management is a complex process for research-intense disease entities such as AML, which is associated with a large set of lab-based diagnostics and different treatment options. Our systematic UMLS-based analysis revealed the existence of a core data set, and an exemplary reusable implementation for harmonized data capture is available on an established metadata repository.
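
The coverage figures in the RESULTS section (the share of all concept occurrences accounted for by the k most frequent concepts) follow from a simple frequency count. The Python sketch below shows that calculation on toy data; the function name and the concept codes are hypothetical, and the study's own tooling is not described in the abstract.

```python
from collections import Counter


def top_k_coverage(concept_occurrences, k):
    """Share of all occurrences accounted for by the k most frequent concepts."""
    counts = Counter(concept_occurrences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy list of coded data items; the codes are placeholders, not real UMLS CUIs.
occurrences = ["C1"] * 4 + ["C2"] * 3 + ["C3"] * 2 + ["C4"]

print(len(set(occurrences)))           # number of unique concepts
print(top_k_coverage(occurrences, 1))  # coverage of the single most frequent concept
print(top_k_coverage(occurrences, 2))  # coverage of the two most frequent concepts
```

Applied to the study's data, the same ratio with k = 50 and k = 100 would correspond to the reported 26.98% and 39.48%.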


2019 ◽  
Vol 26 (2) ◽  
pp. 1443-1454 ◽  
Author(s):  
Hamid Naderi ◽  
Sina Madani ◽  
Behzad Kiani ◽  
Kobra Etminani

The ability to automatically categorize submitted questions by topic and suggest similar questions and answers to users reduces the number of redundant questions. Our objective was to compare intra-topic and inter-topic similarity between questions and answers by using concept-based similarity computing analysis. We gathered existing questions and answers from several popular online health communities. Then, Unified Medical Language System concepts related to selected questions and experts in different topics were extracted and weighted by term frequency-inverse document frequency values. Finally, the similarity between weighted vectors of Unified Medical Language System concepts was computed. Our results showed a considerable gap between intra-topic and inter-topic similarities: the average intra-topic similarity (0.095, 0.192, and 0.110, respectively) was higher than the average inter-topic similarity (0.012, 0.025, and 0.018, respectively) for questions of the top 3 popular online communities, NetWellness, WebMD, and Yahoo Answers. Similarity scores between the content of questions answered by experts in the same and different topics were calculated as 0.51 and 0.11, respectively. Concept-based similarity computing methods can be used in developing intelligent question and answer retrieval systems that contain auto-recommendation functionality for similar questions and experts.
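
The pipeline described here (concept extraction, TF-IDF weighting, vector similarity) can be sketched in plain Python as follows. Several details are assumptions: the abstract does not state which TF-IDF variant or similarity measure was used, so this sketch uses relative term frequency, a logarithmic inverse document frequency, and cosine similarity. The concept identifiers are placeholders, and concept extraction itself is taken as already done.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn lists of concept IDs into sparse TF-IDF vectors (dicts)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for concepts in documents:
        doc_freq.update(set(concepts))
    vectors = []
    for concepts in documents:
        tf = Counter(concepts)
        vectors.append({
            c: (count / len(concepts)) * math.log(n_docs / doc_freq[c])
            for c, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Concepts extracted from three questions (hypothetical IDs).
questions = [
    ["C_DIABETES", "C_INSULIN", "C_GLUCOSE"],  # diabetes topic
    ["C_DIABETES", "C_INSULIN", "C_DIET"],     # diabetes topic
    ["C_ASTHMA", "C_INHALER", "C_DIET"],       # asthma topic
]
q1, q2, q3 = tfidf_vectors(questions)

print(cosine(q1, q2))  # same topic: shares two concepts
print(cosine(q1, q3))  # different topic: shares none
```

In this toy setup, the two diabetes questions score higher than the diabetes and asthma pair, mirroring the intra-topic versus inter-topic gap the abstract reports.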


Author(s):  
Elmer V. Bernstam ◽  
Jorge R. Herskovic ◽  
William R. Hersh

Clinicians, researchers and members of the general public are increasingly using information technology to cope with the explosion in biomedical knowledge. This chapter describes the purpose of query log analysis in the biomedical domain as well as features of the biomedical domain such as controlled vocabularies (ontologies) and existing infrastructure useful for query log analysis. We focus specifically on MEDLINE, which is the most comprehensive bibliographic database of the world’s biomedical literature, the PubMed interface to MEDLINE, the Medical Subject Headings vocabulary and the Unified Medical Language System. However, the approaches discussed here can also be applied to other query logs. We conclude with a look toward the future of biomedical query log analysis.


2020 ◽  
Vol 27 (10) ◽  
pp. 1547-1555 ◽  
Author(s):  
Jake Vasilakes ◽  
Anusha Bompelli ◽  
Jeffrey R Bishop ◽  
Terrence J Adam ◽  
Olivier Bodenreider ◽  
...  

Abstract Objective We sought to assess the need for additional coverage of dietary supplements (DS) in the Unified Medical Language System (UMLS) by investigating (1) the overlap between the integrated DIetary Supplements Knowledge base (iDISK) DS ingredient terminology and the UMLS and (2) the coverage of iDISK and the UMLS over DS mentions in the biomedical literature. Materials and Methods We estimated the overlap between iDISK and the UMLS by mapping iDISK to the UMLS using exact and normalized strings. The coverage of iDISK and the UMLS over DS mentions in the biomedical literature was evaluated via a DS named-entity recognition (NER) task within PubMed abstracts. Results The coverage analysis revealed that only 30% of iDISK terms can be matched to the UMLS, although these cover over 99% of iDISK concepts. A manual review revealed that a majority of the unmatched terms represented new synonyms, rather than lexical variants. For NER, iDISK nearly doubles the precision and achieves a higher F1 score than the UMLS, while maintaining a competitive recall. Discussion While iDISK has significant concept overlap with the UMLS, it contains many novel synonyms. Furthermore, almost 3000 of these overlapping UMLS concepts are missing a DS designation, which could be provided by iDISK. The NER experiments show that the specialization of iDISK is useful for identifying DS mentions. Conclusions Our results show that the DS representation in the UMLS could be enriched by adding DS designations to many concepts and by adding new synonyms.
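
The mapping step in Materials and Methods, matching terms by exact and then normalized strings, can be illustrated with the Python sketch below. The normalization here (lowercasing, dropping punctuation, sorting tokens) is a simplified stand-in chosen for illustration; it is not the UMLS lexical normalization and not necessarily what the authors applied. The term lists are invented examples.

```python
import re


def normalize(term):
    """Simplified normalization: lowercase, drop punctuation, sort tokens."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(sorted(tokens))


def match_terms(source_terms, target_terms):
    """Classify each source term as an exact, normalized, or missing match."""
    exact_index = set(target_terms)
    normalized_index = {normalize(t) for t in target_terms}
    result = {}
    for term in source_terms:
        if term in exact_index:
            result[term] = "exact"
        elif normalize(term) in normalized_index:
            result[term] = "normalized"
        else:
            result[term] = "unmatched"
    return result


# Hypothetical supplement terms and a hypothetical target vocabulary.
source = ["Vitamin C", "oil, fish", "Ginkgo biloba extract"]
target = ["Vitamin C", "Fish Oil", "Ginkgo"]

for term, status in match_terms(source, target).items():
    print(f"{term!r}: {status}")
```

Terms left unmatched after both passes are the ones a manual review would need to classify, for example as new synonyms or as lexical variants that the normalization missed.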


2001 ◽  
Vol 40 (04) ◽  
pp. 298-306 ◽  
Author(s):  
J. J. Cimino

Summary Objectives: As controlled medical terminologies evolve from simple code-name-hierarchy arrangements into rich, knowledge-based ontologies of medical concepts, increased demands are placed on both the developers and users of the terminologies. In response, researchers have begun developing tools to address their needs. The aims of this article are to review previous work done to develop these tools and then to describe work done at Columbia University and New York Presbyterian Hospital (NYPH). Methods: Researchers working with the Systematized Nomenclature of Medicine (SNOMED), the Unified Medical Language System (UMLS), and NYPH’s Medical Entities Dictionary (MED) have created a wide variety of terminology browsers, editors and servers to facilitate creation, maintenance and use of these terminologies. Results: Although much work has been done, no generally available tools have yet emerged. Consensus on requirements for tool functions, especially terminology servers, is emerging. Tools at NYPH have been used successfully to support the integration of clinical applications and the merger of health care institutions. Conclusions: Significant advancement has occurred over the past fifteen years in the development of sophisticated controlled terminologies and the tools to support them. The tool set at NYPH provides a case study to demonstrate one feasible architecture.


1991 ◽  
Vol 11 (4_suppl) ◽  
pp. S120-S124 ◽  
Author(s):  
William R. Hersh

SAPHIRE is a concept-based approach to information retrieval in the biomedical domain. Indexing and retrieval are based on a concept-matching algorithm that processes free text to identify concepts and map them to their canonical form. This process requires a large vocabulary containing a breadth of medical concepts and a diversity of synonym forms, which is provided by the Meta-1 vocabulary from the Unified Medical Language System Project of the National Library of Medicine. This paper describes the use of Meta-1 in SAPHIRE and an evaluation of both entities in the context of an information retrieval study.

