concept recognition
Recently Published Documents


TOTAL DOCUMENTS

67
(FIVE YEARS 14)

H-INDEX

12
(FIVE YEARS 1)

2021 ◽  
Vol 22 (S1) ◽  
Author(s):  
Mayla R. Boguslav ◽  
Negacy D. Hailu ◽  
Michael Bada ◽  
William A. Baumgartner ◽  
Lawrence E. Hunter

Abstract Background Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data has impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models have the potential to outperform multi-class classification approaches. Methods We systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning through extensive studies of alternative methods and hyperparameter selections. We not only identify the best-performing systems and parameters across a wide variety of ontologies but also provide insights into the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggest promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. Results Bidirectional encoder representations from transformers for biomedical text mining (BioBERT) for span detection along with the open-source toolkit for neural machine translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies annotated in the CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time than several alternative approaches. Conclusions Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT shared task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.


2021 ◽  
Vol 22 (S11) ◽  
Author(s):  
Kwangmin Kim ◽  
Doheon Lee

Abstract Background Concept recognition is a term that corresponds to the two sequential steps of named entity recognition and named entity normalization, and plays an essential role in the field of bioinformatics. However, the conventional dictionary-based methods did not sufficiently addressed the variation of the concepts in actual use in literature, resulting in the particularly degraded performances in recognition of multi-token concepts. Results In this paper, we propose a concept recognition method of multi-token biological entities using neural models combined with literature contexts. The key aspect of our method is utilizing the contextual information from the biological knowledge-bases for concept normalization, which is followed by named entity recognition procedure. The model showed improved performances over conventional methods, particularly for multi-token concepts with higher variations. Conclusions We expect that our model can be utilized for effective concept recognition and variety of natural language processing tasks on bioinformatics.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Andreea Grigoriu ◽  
Amrapali Zaveri ◽  
Gerhard Weiss ◽  
Michel Dumontier

Abstract Background The amount of available data, which can facilitate answering scientific research questions, is growing. However, the different formats of published data are expanding as well, creating a serious challenge when multiple datasets need to be integrated for answering a question. Results This paper presents a semi-automated framework that provides semantic enhancement of biomedical data, specifically gene datasets. The framework involved a concept recognition task using machine learning, in combination with the BioPortal annotator. Compared to using methods which require only the BioPortal annotator for semantic enhancement, the proposed framework achieves the highest results. Conclusions Using concept recognition combined with machine learning techniques and annotation with a biomedical ontology, the proposed framework can provide datasets to reach their full potential of providing meaningful information, which can answer scientific research questions.


Author(s):  
Ling Luo ◽  
Shankai Yan ◽  
Po-Ting Lai ◽  
Daniel Veltri ◽  
Andrew Oler ◽  
...  

Abstract Motivation Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation. Results In this article, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger without requiring manually annotated training data achieves competitive performance as compared with state-of-the-art supervised methods. Availabilityand implementation The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Yang Liu ◽  
Yuanhe Tian ◽  
Tsung-Hui Chang ◽  
Song Wu ◽  
Xiang Wan ◽  
...  

2020 ◽  
Author(s):  
Mayla R Boguslav ◽  
Negacy D Hailu ◽  
Michael Bada ◽  
William A Baumgartner ◽  
Lawrence E Hunter

AbstractBackgroundAutomated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data has impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models had the potential to outperform multi-class classification approaches. Here we systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning.ResultsWe report on our extensive studies of alternative methods and hyperparameter selections. The results not only identify the best-performing systems and parameters across a wide variety of ontologies but also illuminate about the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggest promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) for span detection (as previously found) along with the Open-source Toolkit for Neural Machine Translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies in CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time than several alternative approaches.ConclusionsMachine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT Shared Task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.


2020 ◽  
Vol 10 (3) ◽  
pp. 1127
Author(s):  
Yun Li ◽  
Yongyao Jiang ◽  
Justin C. Goldstein ◽  
Lewis J. Mcgibbney ◽  
Chaowei Yang

One longstanding complication with Earth data discovery involves understanding a user’s search intent from the input query. Most of the geospatial data portals use keyword-based match to search data. Little attention has focused on the spatial and temporal information from a query or understanding the query with ontology. No research in the geospatial domain has investigated user queries in a systematic way. Here, we propose a query understanding framework and apply it to fill the gap by better interpreting a user’s search intent for Earth data search engines and adopting knowledge that was mined from metadata and user query logs. The proposed query understanding tool contains four components: spatial and temporal parsing; concept recognition; Named Entity Recognition (NER); and, semantic query expansion. Spatial and temporal parsing detects the spatial bounding box and temporal range from a query. Concept recognition isolates clauses from free text and provides the search engine phrases instead of a list of words. Name entity recognition detects entities from the query, which inform the search engine to query the entities detected. The semantic query expansion module expands the original query by adding synonyms and acronyms to phrases in the query that was discovered from Web usage data and metadata. The four modules interact to parse a user’s query from multiple perspectives, with the goal of understanding the consumer’s quest intent for data. As a proof-of-concept, the framework is applied to oceanographic data discovery. It is demonstrated that the proposed framework accurately captures a user’s intent.


Sign in / Sign up

Export Citation Format

Share Document