Low-Resource Named Entity Recognition via the Pre-Training Model

Symmetry ◽  
2021 ◽  
Vol 13 (5) ◽  
pp. 786
Author(s):  
Siqi Chen ◽  
Yijie Pei ◽  
Zunwang Ke ◽  
Wushour Silamu

Named entity recognition (NER) is an important task in natural language processing, which needs to determine entity boundaries and classify entities into pre-defined categories. For low-resource languages, most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance. However, there is minimal annotated data available for the Uyghur and Hungarian (UH languages) NER tasks. Each task also has its own specificities: differences in words and word order across languages make it a challenging problem. In this paper, we present an effective way to provide a meaningful and easy-to-use feature extractor for named entity recognition tasks: fine-tuning the pre-trained language model. Specifically, we propose a fine-tuning method for a low-resource language model that first constructs a fine-tuning dataset through data augmentation, then adds the dataset of a high-resource language, and finally fine-tunes the cross-lingual pre-trained model on this dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and applies these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, where it showed strong performance compared to several competitive baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.
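To make the attention-based fine-tuning idea concrete, the sketch below mixes the hidden layers of a cross-lingual pre-trained encoder with learned attention weights and feeds the result to a token-classification head. The model name (xlm-roberta-base), label count, and layer-mixing formulation are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: attention-weighted layer mixing over a cross-lingual encoder,
# followed by a token-classification head for NER. Names and sizes are assumed.
import torch
import torch.nn as nn
from transformers import AutoModel

class LayerAttentionNER(nn.Module):
    def __init__(self, model_name="xlm-roberta-base", num_labels=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        num_layers = self.encoder.config.num_hidden_layers + 1  # + embedding layer
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned attention over layers
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Stack all hidden layers: (num_layers, batch, seq_len, hidden)
        stacked = torch.stack(out.hidden_states, dim=0)
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        mixed = (weights * stacked).sum(dim=0)   # attention-weighted layer mixture
        return self.classifier(mixed)            # per-token label logits
```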

2021 ◽  
Vol 11 (13) ◽  
pp. 6007
Author(s):  
Muzamil Hussain Syed ◽  
Sun-Tae Chung

Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer learning using contextualized word embeddings from pre-trained language models has shown remarkable results for many NLP tasks, including Named-entity recognition (NER). BERT (Bidirectional Encoder Representations from Transformers) is gaining prominent attention among contextualized word embedding models as a state-of-the-art pre-trained language model. It is quite expensive to train a BERT model from scratch for a new application domain since it needs a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online restaurant user reviews and propose a simple but effective approach for NER in a new domain where a large dataset is rarely available or difficult to prepare, such as the food menu domain. The approach is based on domain adaptation of word embeddings and fine-tuning of the popular 'Bi-LSTM+CRF' NER network with extended feature vectors. The proposed NER approach (named 'MenuNER') consists of two steps: (1) domain adaptation for the target domain: further pre-training of the off-the-shelf BERT language model (BERT-base) in a semi-supervised fashion on a domain-specific dataset; and (2) supervised fine-tuning of the Bi-LSTM+CRF network for the downstream task with extended feature vectors obtained by concatenating word embeddings from the domain-adapted BERT model of the first step, character embeddings, and POS tag features. Experimental results on a hand-crafted food menu corpus built from a customer review dataset show that the proposed approach to the domain-specific NER task, i.e., food menu named-entity recognition, performs significantly better than one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves a 92.5% F1 score on the YELP dataset for the MenuNER task.
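The extended feature vector in step (2) can be illustrated as follows: per-token embeddings from the domain-adapted BERT model are concatenated with character-level and POS-tag embeddings and fed to a Bi-LSTM whose emissions are decoded by a CRF. Dimensions, vocabulary sizes, and the use of the third-party pytorch-crf package are assumptions for the sketch, not details from the paper.

```python
# Hedged sketch of the extended feature vectors feeding a Bi-LSTM+CRF tagger.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, bert_dim=768, char_dim=64, pos_vocab=50, pos_dim=32,
                 hidden=256, num_labels=5):
        super().__init__()
        self.pos_emb = nn.Embedding(pos_vocab, pos_dim)
        self.lstm = nn.LSTM(bert_dim + char_dim + pos_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, bert_vecs, char_vecs, pos_ids, labels=None, mask=None):
        # bert_vecs: (B, T, 768) from the domain-adapted BERT; char_vecs: (B, T, 64)
        feats = torch.cat([bert_vecs, char_vecs, self.pos_emb(pos_ids)], dim=-1)
        scores = self.emissions(self.lstm(feats)[0])
        if labels is not None:                      # training: negative log-likelihood
            return -self.crf(scores, labels, mask=mask)
        return self.crf.decode(scores, mask=mask)   # inference: best tag sequence
```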


2021 ◽  
pp. 1-12
Author(s):  
Yingwen Fu ◽  
Nankai Lin ◽  
Xiaotian Lin ◽  
Shengyi Jiang

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented toward high-resource languages such as English, while for Indonesian, related resources (both datasets and technology) are not yet well developed. Moreover, affixation is an important part of word composition in Indonesian, indicating the importance of character and token features for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. Targeting the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.
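The abstract does not spell out the HSA layers, so the following is only a loose sketch of the general idea of combining a character-level convolutional feature extractor with token-level self-attention; all layer types, sizes, and the pooling scheme are assumptions rather than the paper's architecture.

```python
# Hedged sketch: character CNN features concatenated with token vectors,
# refined by token-level self-attention, then classified per token.
import torch
import torch.nn as nn

class CharCNNTokenAttention(nn.Module):
    def __init__(self, char_vocab=200, char_dim=32, token_dim=128, num_labels=7):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_conv = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(token_dim + char_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(token_dim + char_dim, num_labels)

    def forward(self, token_vecs, char_ids):
        # char_ids: (B, T, C) character ids per token; token_vecs: (B, T, token_dim)
        B, T, C = char_ids.shape
        ch = self.char_emb(char_ids).view(B * T, C, -1).transpose(1, 2)
        ch = torch.relu(self.char_conv(ch)).max(dim=-1).values.view(B, T, -1)
        feats = torch.cat([token_vecs, ch], dim=-1)
        attn_out, _ = self.attn(feats, feats, feats)   # token-level self-attention
        return self.classifier(attn_out)               # per-token label logits
```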


Author(s):  
Xiaocheng Feng ◽  
Xiachong Feng ◽  
Bing Qin ◽  
Zhangyin Feng ◽  
Ting Liu

Neural networks have been widely used for high-resource language (e.g., English) named entity recognition (NER) and have shown state-of-the-art results. However, for low-resource languages such as Dutch and Spanish, taggers tend to have lower performance due to limited resources and a lack of annotated data. To narrow this gap, we propose three novel strategies to enrich the semantic representations of low-resource languages. First, we develop neural networks to improve low-resource word representations by transferring knowledge from a high-resource language using bilingual lexicons. Second, a lexicon extension strategy is designed to address the out-of-lexicon problem by automatically learning semantic projections. Third, we regard word-level entity-type distribution features as external language-independent knowledge and incorporate them into our neural architecture. Experiments on two low-resource languages (Dutch and Spanish) demonstrate the effectiveness of these additional semantic representations (an average 4.8% improvement). Moreover, on the Chinese OntoNotes 4.0 dataset, our approach achieves an F-score of 83.07%, a 2.91% absolute gain over state-of-the-art results.
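One standard way to learn the kind of semantic projection described above is a least-squares mapping from the low-resource embedding space into the high-resource space, anchored on bilingual lexicon pairs; out-of-lexicon words are then projected with the learned matrix. The sketch below uses random stand-in vectors and is not necessarily the authors' exact training objective.

```python
# Hedged sketch of learning a linear semantic projection from bilingual lexicon pairs.
import numpy as np

def learn_projection(src_vecs, tgt_vecs):
    """src_vecs, tgt_vecs: (n_pairs, dim) arrays of aligned lexicon entries."""
    # Solve for W minimizing ||src_vecs @ W - tgt_vecs||^2 (ordinary least squares).
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def project(word_vec, W):
    """Map an out-of-lexicon low-resource word into the shared semantic space."""
    return word_vec @ W

# Toy usage with random stand-in vectors (real inputs would be trained embeddings).
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(500, 300)), rng.normal(size=(500, 300))
W = learn_projection(src, tgt)
print(project(src[0], W).shape)  # (300,)
```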


Author(s):  
Aldo Hernandez-Suarez ◽  
Gabriel Sanchez-Perez ◽  
Karina Toscano-Medina ◽  
Hector Perez-Meana ◽  
Jose Portillo-Portillo ◽  
...  

In recent years, online social networks have received considerable attention in spatial modelling given the critical information that can be extracted from them about events in real time; one of the most pressing issues concerns natural disasters such as earthquakes. Although it is sometimes possible to retrieve data from these social networks with embedded geographic information provided by GPS, in many cases it is not. An alternative is to reconstruct specific locations using probabilistic language models, more specifically Named Entity Recognition (NER), which extracts place names from a user's description of an event occurring at a specific location (e.g., a collapsed building on a specific avenue). In this work, we present a methodology for using Twitter as a social sensor system for disasters. The methodology scores NER-extracted locations with a kernel density estimation function for different subtopics arising from a natural disaster and maps them into geographic space. The proposed methodology is evaluated with tweets related to the 2017 earthquake in Mexico.
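The density-scoring step can be illustrated with a small sketch: given coordinates geocoded from NER-extracted place names for one subtopic, fit a Gaussian kernel density estimate and score each reported location. The coordinates are toy values, and the kernel and any weighting scheme are assumptions; the paper's exact estimator is not reproduced here.

```python
# Hedged sketch: kernel density estimation over geocoded NER locations.
import numpy as np
from scipy.stats import gaussian_kde

# (lat, lon) pairs, e.g. geocoded mentions of collapsed buildings (toy values).
coords = np.array([
    [19.43, -99.13], [19.42, -99.14], [19.44, -99.12],
    [19.35, -99.18], [19.43, -99.13],
])

kde = gaussian_kde(coords.T)          # scipy expects shape (dims, n_points)
density = kde(coords.T)               # density score at each reported location
hotspot = coords[np.argmax(density)]  # highest-density candidate location
print(density.round(3), hotspot)
```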


2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Yu Wang ◽  
Yining Sun ◽  
Zuchang Ma ◽  
Lisheng Gao ◽  
Yang Xu

The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatments of a particular disease. Named Entity Recognition (NER) is the initial step in extracting this knowledge from unstructured text and presenting it as a Knowledge Graph (KG). However, previous approaches to NER have often suffered from small-scale human-labelled training data. Furthermore, extracting knowledge from Chinese medical literature is a more complex task because there is no segmentation between Chinese characters. Recently, pretraining models, which obtain representations with prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results on a wide variety of Natural Language Processing (NLP) tasks. However, the capabilities of pretraining models have not been fully exploited, and applications of pretraining models other than BERT in specific domains, such as NER in Chinese medical literature, are also of interest. In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. First, we propose a data augmentation method that replaces words in the training set with synonyms through the Masked Language Model (MLM), which is a pretraining task. Then, we treat NER as the downstream task of the pretraining model and transfer the prior semantic knowledge obtained during pretraining to it. Finally, we conduct experiments to compare the performance of six pretraining models (BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa) in recognizing named entities from Chinese medical literature. The effects of feature extraction and fine-tuning, as well as different downstream model structures, are also explored. Experimental results demonstrate that the proposed data augmentation method yields meaningful improvements in recognition performance. In addition, RoBERTa-CRF achieves the highest F1 score compared with previous methods and the other pretraining models.
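The MLM-based augmentation step can be sketched as follows: mask one character of a training sentence and let a pre-trained masked language model propose in-context replacements. The model name, the example sentence, and the absence of entity-span filtering are simplifications of what the paper describes.

```python
# Hedged sketch: data augmentation by masking a character and letting an MLM
# propose replacements; the paper additionally constrains replacements.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")

def augment(sentence, mask_index, top_k=3):
    """Return augmented sentences with the character at mask_index replaced."""
    chars = list(sentence)
    masked = chars[:mask_index] + [fill.tokenizer.mask_token] + chars[mask_index + 1:]
    candidates = fill("".join(masked), top_k=top_k)
    return [sentence[:mask_index] + c["token_str"] + sentence[mask_index + 1:]
            for c in candidates]

# Toy example: mask the adjective in a symptom description and collect variants.
print(augment("患者出现严重的头痛症状", mask_index=4))
```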


2021 ◽  
Vol 189 ◽  
pp. 292-299
Author(s):  
Caroline Sabty ◽  
Islam Omar ◽  
Fady Wasfalla ◽  
Mohamed Islam ◽  
Slim Abdennadher
