scholarly journals Leveraging Document-Level Label Consistency for Named Entity Recognition

Author(s):  
Tao Gui ◽  
Jiacheng Ye ◽  
Qi Zhang ◽  
Yaqian Zhou ◽  
Yeyun Gong ◽  
...  

Document-level label consistency is an effective indicator that different occurrences of a particular token sequence are very likely to have the same entity types. Previous work focused on better context representations and used the CRF for label decoding. However, CRF-based methods are inadequate for modeling document-level label consistency. This work introduces a novel two-stage label refinement approach to handle document-level label consistency, where a key-value memory network is first used to record draft labels predicted by the base model, and then a multi-channel Transformer makes refinements on these draft predictions based on the explicit co-occurrence relationship derived from the memory network. In addition, in order to mitigate the side effects of incorrect draft labels, Bayesian neural networks are used to indicate the labels with a high probability of being wrong, which can greatly assist in preventing the incorrect refinement of correct draft labels. The experimental results on three named entity recognition benchmarks demonstrated that the proposed method significantly outperformed the state-of-the-art methods.

2020 ◽  
Vol 34 (05) ◽  
pp. 7961-7968
Author(s):  
Anwen Hu ◽  
Zhicheng Dou ◽  
Jian-Yun Nie ◽  
Ji-Rong Wen

Most state-of-the-art named entity recognition systems are designed to process each sentence within a document independently. These systems are easy to confuse entity types when the context information in a sentence is not sufficient enough. To utilize the context information within the whole document, most document-level work let neural networks on their own to learn the relation across sentences, which is not intuitive enough for us humans. In this paper, we divide entities to multi-token entities that contain multiple tokens and single-token entities that are composed of a single token. We propose that the context information of multi-token entities should be more reliable in document-level NER for news articles. We design a fusion attention mechanism which not only learns the semantic relevance between occurrences of the same token, but also focuses more on occurrences belonging to multi-tokens entities. To identify multi-token entities, we design an auxiliary task namely ‘Multi-token Entity Classification’ and perform this task simultaneously with document-level NER. This auxiliary task is simplified from NER and doesn't require extra annotation. Experimental results on the CoNLL-2003 dataset and OntoNotesnbm dataset show that our model outperforms state-of-the-art sentence-level and document-level NER methods.


2020 ◽  
Vol 34 (05) ◽  
pp. 8441-8448
Author(s):  
Ying Luo ◽  
Fengshun Xiao ◽  
Hai Zhao

Named entity recognition (NER) models are typically based on the architecture of Bi-directional LSTM (BiLSTM). The constraints of sequential nature and the modeling of single input prevent the full utilization of global information from larger scope, not only in the entire sentence, but also in the entire document (dataset). In this paper, we address these two deficiencies and propose a model augmented with hierarchical contextualized representation: sentence-level representation and document-level representation. In sentence-level, we take different contributions of words in a single sentence into consideration to enhance the sentence representation learned from an independent BiLSTM via label embedding attention mechanism. In document-level, the key-value memory network is adopted to record the document-aware information for each unique word which is sensitive to similarity of context information. Our two-level hierarchical contextualized representations are fused with each input token embedding and corresponding hidden state of BiLSTM, respectively. The experimental results on three benchmark NER datasets (CoNLL-2003 and Ontonotes 5.0 English datasets, CoNLL-2002 Spanish dataset) show that we establish new state-of-the-art results.


2020 ◽  
Vol 36 (15) ◽  
pp. 4331-4338
Author(s):  
Mei Zuo ◽  
Yang Zhang

Abstract Motivation Named entity recognition is a critical and fundamental task for biomedical text mining. Recently, researchers have focused on exploiting deep neural networks for biomedical named entity recognition (Bio-NER). The performance of deep neural networks on a single dataset mostly depends on data quality and quantity while high-quality data tends to be limited in size. To alleviate task-specific data limitation, some studies explored the multi-task learning (MTL) for Bio-NER and achieved state-of-the-art performance. However, these MTL methods did not make full use of information from various datasets of Bio-NER. The performance of state-of-the-art MTL method was significantly limited by the number of training datasets. Results We propose two dataset-aware MTL approaches for Bio-NER which jointly train all models for numerous Bio-NER datasets, thus each of these models could discriminatively exploit information from all of related training datasets. Both of our two approaches achieve substantially better performance compared with the state-of-the-art MTL method on 14 out of 15 Bio-NER datasets. Furthermore, we implemented our approaches by incorporating Bio-NER and biomedical part-of-speech (POS) tagging datasets. The results verify Bio-NER and POS can significantly enhance one another. Availability and implementation Our source code is available at https://github.com/zmmzGitHub/MTL-BC-LBC-BioNER and all datasets are publicly available at https://github.com/cambridgeltl/MTL-Bioinformatics-2016. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Xiaocheng Feng ◽  
Xiachong Feng ◽  
Bing Qin ◽  
Zhangyin Feng ◽  
Ting Liu

Neural networks have been widely used for high resource language (e.g. English) named entity recognition (NER) and have shown state-of-the-art results.However, for low resource languages, such as Dutch, Spanish, due to the limitation of resources and lack of annotated data, taggers tend to have lower performances.To narrow this gap, we propose three novel strategies to enrich the semantic representations of low resource languages: we first develop neural networks to improve low resource word representations by knowledge transfer from high resource language using bilingual lexicons. Further, a lexicon extension strategy is designed to address out-of lexicon problem by automatically learning semantic projections.Thirdly, we regard word-level entity type distribution features as an external language-independent knowledge and incorporate them into our neural architecture. Experiments on two low resource languages (including Dutch and Spanish) demonstrate the effectiveness of these additional semantic representations (average 4.8\% improvement). Moreover, on Chinese OntoNotes 4.0 dataset, our approach achieved an F-score of 83.07\% with 2.91\% absolute gain compared to the state-of-the-art results.


2021 ◽  
pp. 1-12
Author(s):  
Yingwen Fu ◽  
Nankai Lin ◽  
Xiaotian Lin ◽  
Shengyi Jiang

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art researches on NER are based on pre-trained language models (PLMs) or classic neural models. However, these researches are mainly oriented to high-resource languages such as English. While for Indonesian, related resources (both in dataset and technology) are not yet well-developed. Besides, affix is an important word composition for Indonesian language, indicating the essentiality of character and token features for token-wise Indonesian NLP tasks. However, features extracted by currently top-performance models are insufficient. Aiming at Indonesian NER task, in this paper, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA establishes competitive performance on IDNER and three benchmark datasets.


2021 ◽  
Vol 54 (1) ◽  
pp. 1-39
Author(s):  
Zara Nasar ◽  
Syed Waqar Jaffry ◽  
Muhammad Kamran Malik

With the advent of Web 2.0, there exist many online platforms that result in massive textual-data production. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from this data. One approach towards effective harnessing of this unstructured textual data could be its transformation into structured text. Hence, this study aims to present an overview of approaches that can be applied to extract key insights from textual data in a structured way. For this, Named Entity Recognition and Relation Extraction are being majorly addressed in this review study. The former deals with identification of named entities, and the latter deals with problem of extracting relation between set of entities. This study covers early approaches as well as the developments made up till now using machine learning models. Survey findings conclude that deep-learning-based hybrid and joint models are currently governing the state-of-the-art. It is also observed that annotated benchmark datasets for various textual-data generators such as Twitter and other social forums are not available. This scarcity of dataset has resulted into relatively less progress in these domains. Additionally, the majority of the state-of-the-art techniques are offline and computationally expensive. Last, with increasing focus on deep-learning frameworks, there is need to understand and explain the under-going processes in deep architectures.


Author(s):  
Rodrigo Agerri ◽  
German Rigau

We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empiricalexperimentation how to effectively combine various types of clustering features allows us to seamlessly export our system to other datasets and languages. The result is a simple but highly competitive system which obtains state of the art results across five languages and twelve datasets. The results are reported on standard shared task evaluation data such as CoNLL for English, Spanish and Dutch. Furthermore, and despite the lack of linguistically motivated features, we also report best results for languages such as Basque and German. In addition, we demonstrate that our method also obtains very competitive results even when the amount of supervised data is cut by half, alleviating the dependency on manually annotated data. Finally, the results show that our emphasis on clustering features is crucial to develop robust out-of-domain models. The system and models are freely available to facilitate its use and guarantee the reproducibility of results.


2020 ◽  
Author(s):  
Usman Naseem ◽  
Matloob Khushi ◽  
Vinay Reddy ◽  
Sakthivel Rajendran ◽  
Imran Razzak ◽  
...  

Abstract Background: In recent years, with the growing amount of biomedical documents, coupled with advancement in natural language processing algorithms, the research on biomedical named entity recognition (BioNER) has increased exponentially. However, BioNER research is challenging as NER in the biomedical domain are: (i) often restricted due to limited amount of training data, (ii) an entity can refer to multiple types and concepts depending on its context and, (iii) heavy reliance on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt the state-of-the-art (SOTA) models trained in general corpora which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) - bioALBERT - an effective domain-specific pre-trained language model trained on huge biomedical corpus designed to capture biomedical context-dependent NER. We adopted self-supervised loss function used in ALBERT that targets on modelling inter-sentence coherence to better learn context-dependent representations and incorporated parameter reduction strategies to minimise memory usage and enhance the training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. The performance is increased for; (i) disease type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem type corpora by 4.61% (BC5CDR-Chem) and 3.89 (BC4CHEMD); (iii) gene-protein type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) Species type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800) is observed which leads to a state-of-the-art results. Conclusions: The performance of proposed model on four different biomedical entity types shows that our model is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models which are available for the research community to be used in future research.


Sign in / Sign up

Export Citation Format

Share Document