Comparing general and specialized word embeddings for biomedical named entity recognition

PeerJ Computer Science ◽

10.7717/peerj-cs.384 ◽

2021 ◽

Vol 7 ◽

pp. e384

Author(s):

Rigo E. Ramos-Vargas ◽

Israel Román-Godínez ◽

Sulema Torres-Ramos

Keyword(s):

Named Entity Recognition ◽

Biomedical Literature ◽

Word Embedding ◽

The Other ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Word Representation ◽

The One ◽

Biomedical Named Entity Recognition

Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as unique features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards; while the fourth evaluates the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better in the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC; while in the contextualized word embeddings the best results are presented in the specific ones. We conclude, therefore, when using classic word embeddings as features on the BioNER task, the general ones could be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.

Download Full-text

Combining word embeddings to extract chemical and drug entities in biomedical literature

BMC Bioinformatics ◽

10.1186/s12859-021-04188-3 ◽

2021 ◽

Vol 22 (S1) ◽

Author(s):

Pilar López-Úbeda ◽

Manuel Carlos Díaz-Galiano ◽

L. Alfonso Ureña-López ◽

M. Teresa Martín-Valdivia

Keyword(s):

Language Processing ◽

Named Entity Recognition ◽

Biomedical Literature ◽

Entity Recognition ◽

Levenshtein Distance ◽

Word Embeddings ◽

Snomed Ct ◽

Named Entity ◽

Indexing System ◽

The One

Abstract Background Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. Methods In this paper we evaluate two important tasks in NLP: the named entity recognition (NER) and Entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve the results obtained in the PharmaCoNER challenge. Results For the NER task we present a neural network composed of BiLSTM with a CRF sequential layer where different word embeddings are combined as an input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to find previously trained concepts, and the unsupervised model is based on a 6-step architecture. This architecture uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. Conclusion On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature. We achieved results of 91.41% for precision, 90.14% for recall, and 90.77% for F1-score using micro-averaging. On the other hand, our indexing system achieves a 92.67% F1-score, 92.44% for recall, and 92.91% for precision. With these results in a final ranking, we would be in the first position.

Download Full-text

Biomedical Named-Entity Recognition by Hierarchically Fusing BioBERT Representations and Deep Contextual-Level Word-Embedding

2020 International Joint Conference on Neural Networks (IJCNN) ◽

10.1109/ijcnn48605.2020.9206808 ◽

2020 ◽

Cited By ~ 1

Author(s):

Usman Naseem ◽

Katarzyna Musial ◽

Peter Eklund ◽

Mukesh Prasad

Keyword(s):

Named Entity Recognition ◽

Word Embedding ◽

Entity Recognition ◽

Named Entity ◽

Biomedical Named Entity Recognition

Download Full-text

Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks

BioMed Research International ◽

10.1155/2014/240403 ◽

2014 ◽

Vol 2014 ◽

pp. 1-6 ◽

Cited By ~ 49

Author(s):

Buzhou Tang ◽

Hongxin Cao ◽

Xiaolong Wang ◽

Qingcai Chen ◽

Hua Xu

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biomedical Domain ◽

Crucial Step ◽

Named Entity ◽

Different Types ◽

Word Representation ◽

Biomedical Named Entity Recognition

Biomedical Named Entity Recognition (BNER), which extracts important entities such as genes and proteins, is a crucial step of natural language processing in the biomedical domain. Various machine learning-based approaches have been applied to BNER tasks and showed good performance. In this paper, we systematically investigated three different types of word representation (WR) features for BNER, including clustering-based representation, distributional representation, and word embeddings. We selected one algorithm from each of the three types of WR features and applied them to the JNLPBA and BioCreAtIvE II BNER tasks. Our results showed that all the three WR algorithms were beneficial to machine learning-based BNER systems. Moreover, combining these different types of WR features further improved BNER performance, indicating that they are complementary to each other. By combining all the three types of WR features, the improvements inF-measure on the BioCreAtIvE II GM and JNLPBA corpora were 3.75% and 1.39%, respectively, when compared with the systems using baseline features. To the best of our knowledge, this is the first study to systematically evaluate the effect of three different types of WR features for BNER tasks.

Download Full-text

LM-Based Word Embeddings Improve Biomedical Named Entity Recognition: A Detailed Analysis

Bioinformatics and Biomedical Engineering - Lecture Notes in Computer Science ◽

10.1007/978-3-030-45385-5_56 ◽

2020 ◽

pp. 624-635 ◽

Cited By ~ 2

Author(s):

Liliya Akhtyamova ◽

John Cardiff

Keyword(s):

Detailed Analysis ◽

Named Entity Recognition ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Biomedical Named Entity Recognition

Download Full-text

Deep learning with word embeddings improves biomedical named entity recognition

Bioinformatics ◽

10.1093/bioinformatics/btx228 ◽

2017 ◽

Vol 33 (14) ◽

pp. i37-i48 ◽

Cited By ~ 155

Author(s):

Maryam Habibi ◽

Leon Weber ◽

Mariana Neves ◽

David Luis Wiegandt ◽

Ulf Leser

Keyword(s):

Deep Learning ◽

Named Entity Recognition ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Biomedical Named Entity Recognition

Download Full-text

A Multichannel Biomedical Named Entity Recognition Model Based on Multitask Learning and Contextualized Word Representations

Wireless Communications and Mobile Computing ◽

10.1155/2020/8894760 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13 ◽

Cited By ~ 1

Author(s):

Hao Wei ◽

Mingyuan Gao ◽

Ai Zhou ◽

Fei Chen ◽

Wen Qu ◽

...

Keyword(s):

Conditional Random Field ◽

Named Entity Recognition ◽

Multitask Learning ◽

Biomedical Literature ◽

Training Data ◽

Entity Recognition ◽

Language Models ◽

Named Entity ◽

Word Level ◽

Biomedical Named Entity Recognition

As the biomedical literature increases exponentially, biomedical named entity recognition (BNER) has become an important task in biomedical information extraction. In the previous studies based on deep learning, pretrained word embedding becomes an indispensable part of the neural network models, effectively improving their performance. However, the biomedical literature typically contains numerous polysemous and ambiguous words. Using fixed pretrained word representations is not appropriate. Therefore, this paper adopts the pretrained embeddings from language models (ELMo) to generate dynamic word embeddings according to context. In addition, in order to avoid the problem of insufficient training data in specific fields and introduce richer input representations, we propose a multitask learning multichannel bidirectional gated recurrent unit (BiGRU) model. Multiple feature representations (e.g., word-level, contextualized word-level, character-level) are, respectively, or collectively fed into the different channels. Manual participation and feature engineering can be avoided through automatic capturing features in BiGRU. In merge layer, multiple methods are designed to integrate the outputs of multichannel BiGRU. We combine BiGRU with the conditional random field (CRF) to address labels’ dependence in sequence labeling. Moreover, we introduce the auxiliary corpora with same entity types for the main corpora to be evaluated in multitask learning framework, then train our model on these separate corpora and share parameters with each other. Our model obtains promising results on the JNLPBA and NCBI-disease corpora, with F1-scores of 76.0% and 88.7%, respectively. The latter achieves the best performance among reported existing feature-based models.

Download Full-text

Biomedical named entity recognition and linking datasets: survey and our recent development

Briefings in Bioinformatics ◽

10.1093/bib/bbaa054 ◽

2020 ◽

Vol 21 (6) ◽

pp. 2219-2238 ◽

Cited By ~ 2

Author(s):

Ming-Siang Huang ◽

Po-Ting Lai ◽

Pei-Yen Lin ◽

Yu-Ting You ◽

Richard Tzong-Han Tsai ◽

...

Keyword(s):

Language Processing ◽

Protein Interaction ◽

Named Entity Recognition ◽

Paper Analysis ◽

Biomedical Literature ◽

Entity Recognition ◽

Supplementary Information ◽

Protein Protein Interaction ◽

Named Entity ◽

Biomedical Named Entity Recognition

Abstract Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein–protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonlyused BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein–protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Re vised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP _EBED_dataset.rar. Contact: Email: [email protected], Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681 Email: [email protected], Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.

Download Full-text

Faculty Opinions recommendation of Deep learning with word embeddings improves biomedical named entity recognition.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.730927099.793537967 ◽

2017 ◽

Author(s):

Nigel Collier

Keyword(s):

Deep Learning ◽

Named Entity Recognition ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Biomedical Named Entity Recognition

Download Full-text

Biomedical Named Entity Recognition with Tri-Training Learning

2009 2nd International Conference on Biomedical Engineering and Informatics ◽

10.1109/bmei.2009.5304799 ◽

2009 ◽

Cited By ~ 3

Author(s):

YueHong Cai ◽

XianYi Cheng

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Biomedical Named Entity Recognition

Download Full-text

Recurrent Neural Network-Based Model for Named Entity Recognition with Improved Word Embeddings

IETE Journal of Research ◽

10.1080/03772063.2021.2006805 ◽

2021 ◽

pp. 1-7

Author(s):

Archana Goyal ◽

Vishal Gupta ◽

Manish Kumar

Keyword(s):

Neural Network ◽

Recurrent Neural Network ◽

Named Entity Recognition ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity

Download Full-text