Transfer Learning for Named Entity Recognition in Financial and Biomedical Documents

Information ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 248 ◽  
Author(s):  
Sumam Francis ◽  
Jordy Van Landeghem ◽  
Marie-Francine Moens

Recent deep learning approaches have shown promising results for named entity recognition (NER). A reasonable assumption for training robust deep learning models is that a sufficient amount of high-quality annotated training data is available. However, in many real-world scenarios, labeled training data is scarce. In this paper we consider two use cases: generic entity extraction from financial and from biomedical documents. First, we have developed a character-based model for NER in financial documents and a word- and character-based model with attention for NER in biomedical documents. Further, we have analyzed how transfer learning addresses the problem of limited training data in a target domain. We demonstrate through experiments that NER models trained on labeled data from a source domain can be used as base models and then be fine-tuned with little labeled data to recognize different named entity classes in a target domain. We also witness an interest in language models to improve NER as a way of coping with limited labeled data. The current most successful language model is BERT. Because of its success in state-of-the-art models, we integrate representations based on BERT in our biomedical NER model along with word and character information. The results are compared with a state-of-the-art model applied to a benchmark biomedical corpus.

2021 ◽  
pp. 1-12
Author(s):  
Yingwen Fu ◽  
Nankai Lin ◽  
Xiaotian Lin ◽  
Shengyi Jiang

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. Moreover, affixes are an important part of word formation in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by the current top-performing models are insufficient. Targeting the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.


2021 ◽  
Vol 54 (1) ◽  
pp. 1-39
Author(s):  
Zara Nasar ◽  
Syed Waqar Jaffry ◽  
Muhammad Kamran Malik

With the advent of Web 2.0, many online platforms produce massive amounts of textual data. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from this data. One approach to harnessing this unstructured textual data effectively is to transform it into structured text. Hence, this study presents an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, the review mainly addresses Named Entity Recognition and Relation Extraction. The former deals with the identification of named entities, and the latter deals with the problem of extracting relations between a set of entities. This study covers early approaches as well as the developments made up to now using machine learning models. The survey finds that deep-learning-based hybrid and joint models currently dominate the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators such as Twitter and other social forums are not available. This scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of the state-of-the-art techniques are offline and computationally expensive. Last, with increasing focus on deep-learning frameworks, there is a need to understand and explain the processes taking place in deep architectures.


2020 ◽  
Author(s):  
Usman Naseem ◽  
Matloob Khushi ◽  
Vinay Reddy ◽  
Sakthivel Rajendran ◽  
Imran Razzak ◽  
...  

Background: In recent years, with the growing amount of biomedical documents, coupled with advancement in natural language processing algorithms, the research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging because NER in the biomedical domain (i) is often restricted by the limited amount of training data, (ii) involves entities that can refer to multiple types and concepts depending on context, and (iii) relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), BioALBERT, an effective domain-specific pre-trained language model trained on a huge biomedical corpus and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter reduction strategies to minimise memory usage and improve training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. Performance increased for (i) disease-type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem-type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene-protein-type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species-type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), leading to state-of-the-art results. Conclusions: The performance of the proposed model on four different biomedical entity types shows that our model is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models, which are available to the research community for use in future research.


2021 ◽  
Vol 9 ◽  
pp. 1116-1131
Author(s):  
David Ifeoluwa Adelani ◽  
Jade Abbott ◽  
Graham Neubig ◽  
Daniel D’souza ◽  
Julia Kreutzer ◽  
...  

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246310
Author(s):  
Shang Gao ◽  
Olivera Kotevska ◽  
Alexandre Sorokine ◽  
J. Blair Christian

Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the effectiveness of NER models in applications where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training to improve the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, and then we fine-tune the model on a more specific target NER task that has very limited training data; finally, we apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables an NER model such as a BiLSTM-CRF or BERT to obtain performance similar to that of the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entity types are rarer and not specifically covered in UMLS.
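
The self-training step described in this abstract can be illustrated with a small loop: train on the labeled data, pseudo-label the unlabeled pool, keep only confident predictions, and retrain. The following is a minimal sketch, not the authors' implementation. It assumes a toy tagger that votes over 3-character prefix and suffix features in place of a BiLSTM-CRF or BERT model; the names `features`, `train`, `predict`, and `self_train`, the 0.9 threshold, and the example tokens are all hypothetical.

```python
from collections import Counter, defaultdict


def features(token):
    """Toy features (an assumption for this sketch): 3-char prefix and suffix."""
    t = token.lower()
    return ["pre:" + t[:3], "suf:" + t[-3:]]


def train(examples):
    """Toy 'model': label counts per feature, standing in for a real NER tagger."""
    counts = defaultdict(Counter)
    for token, label in examples:
        for feat in features(token):
            counts[feat][label] += 1
    return counts


def predict(model, token):
    """Return (label, confidence); tokens with no known feature get ('O', 0.0)."""
    votes = Counter()
    for feat in features(token):
        votes.update(model.get(feat, {}))
    if not votes:
        return "O", 0.0
    label, n = votes.most_common(1)[0]
    return label, n / sum(votes.values())


def self_train(labeled, unlabeled, threshold=0.9, rounds=3):
    """Repeatedly add confident pseudo-labels to the training set and retrain."""
    train_set = list(labeled)
    pool = list(unlabeled)
    model = train(train_set)
    for _ in range(rounds):
        confident, remaining = [], []
        for token in pool:
            label, conf = predict(model, token)
            if conf >= threshold:
                confident.append((token, label))
            else:
                remaining.append(token)
        if not confident:
            break  # nothing new to learn from
        train_set.extend(confident)
        pool = remaining
        model = train(train_set)
    return model, train_set


labeled = [("aspirin", "CHEM"), ("the", "O")]
unlabeled = ["heparin", "heparinoid"]
model, train_set = self_train(labeled, unlabeled)
```

In this toy run, "heparin" is pseudo-labeled in the first round through its shared suffix, and only then does "heparinoid" become reachable through the newly learned prefix. The same mechanism also propagates wrong pseudo-labels, which is why a confidence threshold is used.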


Author(s):  
Ismail El Bazi ◽  
Nabil Laachfoubi

Most Arabic Named Entity Recognition (NER) systems depend heavily on external resources and handcrafted feature engineering to achieve state-of-the-art results. To overcome such limitations, we propose in this paper a deep learning approach to the Arabic NER task. We introduce a neural network architecture based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF) and experiment with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model takes two sources of information about words as input, pre-trained word embeddings and character-based representations, and eliminates the need for any task-specific knowledge or feature engineering. We obtain a state-of-the-art result on the standard ANERcorp corpus with an F1 score of 90.6%.
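
In a BiLSTM-CRF tagger, the CRF layer selects the label sequence with the highest combined emission and transition score, usually via Viterbi decoding. The sketch below shows that decoding step only. It is an illustration under stated assumptions, not the authors' code: the function name `viterbi`, the score values, and the label set are made up, and in a real system the emission scores would come from the BiLSTM and the transition scores would be learned.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions: one dict per token mapping label -> score.
    transitions: dict mapping (prev_label, cur_label) -> score (default 0.0).
    """
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at position t.
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-PER", "I-PER"]
# Hypothetical scores: a strong penalty discourages I-PER directly after O.
transitions = {("O", "I-PER"): -10.0}
emissions = [
    {"O": 1.0, "B-PER": 0.5, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 1.0},
]
path = viterbi(emissions, transitions, labels)
```

Choosing the best label per token independently would give the invalid sequence O, I-PER here; the transition penalty makes the decoder prefer B-PER, I-PER instead, which is the kind of label dependency a CRF layer is meant to capture.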


2021 ◽  
Author(s):  
Christoph Brandl ◽  
Jens Albrecht ◽  
Renato Budinich

The task of relation extraction aims at classifying the semantic relations between entities in a text. When coupled with named-entity recognition, these can be used as the building blocks for an information extraction procedure that results in the construction of a Knowledge Graph. While many NLP libraries support named-entity recognition, there is no off-the-shelf solution for relation extraction. In this paper, we evaluate and compare several state-of-the-art approaches on a subset of the FewRel data set as well as a manually annotated corpus. The custom corpus contains six relations from the area of market research and is available for public use. Our approach provides guidance for the selection of models and training data for relation extraction in real-world projects.


2021 ◽  
Vol 22 (S1) ◽  
Author(s):  
Cong Sun ◽  
Zhihao Yang ◽  
Lei Wang ◽  
Yin Zhang ◽  
Hongfei Lin ◽  
...  

Background: The recognition of pharmacological substances, compounds and proteins is essential for biomedical relation extraction, knowledge graph construction, drug discovery, as well as medical question answering. Although considerable efforts have been made to recognize biomedical entities in English texts, to date only a few limited attempts have been made to recognize them in biomedical texts in other languages. PharmaCoNER is a named entity recognition challenge to recognize pharmacological entities from Spanish texts. Because there are currently abundant resources in the field of natural language processing, how to leverage these resources for the PharmaCoNER challenge is a meaningful question to study. Methods: Inspired by the success of deep learning with language models, we compare and explore various representative BERT models to promote the development of the PharmaCoNER task. Results: The experimental results show that deep learning with language models can effectively improve model performance on the PharmaCoNER dataset. Our method achieves state-of-the-art performance on the PharmaCoNER dataset, with a maximum F1-score of 92.01%. Conclusion: For the BERT models on the PharmaCoNER dataset, biomedical domain knowledge has a greater impact on model performance than the native language (i.e., Spanish). The BERT models can obtain competitive performance by using WordPiece to alleviate the out-of-vocabulary limitation. The performance of the BERT model can be further improved by constructing a specific vocabulary based on domain knowledge. Moreover, the character case also has a certain impact on model performance.
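
The way WordPiece alleviates the out-of-vocabulary limitation can be shown with the greedy longest-match-first procedure used by BERT-style tokenizers: a word missing from the vocabulary is split into known sub-word pieces, with continuation pieces marked by "##". The following is a minimal sketch. The function name `wordpiece` and the tiny vocabulary are hypothetical and chosen only for illustration; real vocabularies contain tens of thousands of pieces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split `word` into sub-word pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for this sketch).
vocab = {"para", "##ce", "##ta", "##mol"}
pieces = wordpiece("paracetamol", vocab)
```

With this toy vocabulary, "paracetamol" is not a vocabulary entry but is still represented as four known pieces instead of a single unknown token. A domain-specific vocabulary, as the abstract suggests, would tend to keep such terms in fewer, more meaningful pieces.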


2020 ◽  
Vol 2020 ◽  
pp. 1-13 ◽  
Author(s):  
Hao Wei ◽  
Mingyuan Gao ◽  
Ai Zhou ◽  
Fei Chen ◽  
Wen Qu ◽  
...  

As the biomedical literature increases exponentially, biomedical named entity recognition (BNER) has become an important task in biomedical information extraction. In previous studies based on deep learning, pretrained word embeddings have become an indispensable part of neural network models, effectively improving their performance. However, the biomedical literature typically contains numerous polysemous and ambiguous words, so using fixed pretrained word representations is not appropriate. Therefore, this paper adopts the pretrained embeddings from language models (ELMo) to generate dynamic word embeddings according to context. In addition, in order to avoid the problem of insufficient training data in specific fields and to introduce richer input representations, we propose a multitask learning multichannel bidirectional gated recurrent unit (BiGRU) model. Multiple feature representations (e.g., word-level, contextualized word-level, character-level) are fed, separately or collectively, into the different channels. Manual participation and feature engineering can be avoided because the BiGRU captures features automatically. In the merge layer, multiple methods are designed to integrate the outputs of the multichannel BiGRU. We combine BiGRU with a conditional random field (CRF) to address label dependence in sequence labeling. Moreover, in the multitask learning framework, we introduce auxiliary corpora with the same entity types as the main corpora to be evaluated, then train our model on these separate corpora and share parameters among them. Our model obtains promising results on the JNLPBA and NCBI-disease corpora, with F1-scores of 76.0% and 88.7%, respectively. The latter is the best performance among reported existing feature-based models.
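
F1-scores such as those reported above are commonly computed at the entity level: a predicted entity counts as correct only if its boundaries and type both match the gold annotation. The sketch below shows one way to compute this from BIO tags. It is an assumption-laden illustration rather than the evaluation script used for these corpora: the function names and example tags are hypothetical, and an I- tag without a matching preceding B- tag is simply ignored.

```python
def spans(tags):
    """Extract (start, end, type) entity spans from BIO tags; end is exclusive."""
    found, start, kind = set(), None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                found.add((start, i, kind))
            if tag.startswith("B-"):
                start, kind = i, tag[2:]
            else:
                start, kind = None, None  # "O" or a stray I- tag
    return found


def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: exact match on boundaries and type."""
    gold, pred = spans(gold_tags), spans(pred_tags)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["B-DIS", "I-DIS", "O", "B-DIS"]
pred = ["B-DIS", "I-DIS", "O", "O"]
score = entity_f1(gold, pred)
```

Here the prediction finds one of the two gold entities and makes no false predictions, so precision is 1, recall is 0.5, and the F1-score is 2/3.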


PLoS ONE ◽  
2019 ◽  
Vol 14 (5) ◽  
pp. e0216046 ◽  
Author(s):  
Xishuang Dong ◽  
Shanta Chowdhury ◽  
Lijun Qian ◽  
Xiangfang Li ◽  
Yi Guan ◽  
...  
