MAF-CNER: A Chinese Named Entity Recognition Model Based on Multifeature Adaptive Fusion

Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xuming Han ◽  
Feng Zhou ◽  
Zhiyuan Hao ◽  
Qiaoming Liu ◽  
Yong Li ◽  
...  

Named entity recognition (NER) is a subtask of natural language processing, and its accuracy greatly affects the effectiveness of downstream tasks. To address the insufficient expression of potential Chinese features in NER tasks, this paper proposes a multifeature adaptive fusion Chinese named entity recognition (MAF-CNER) model. The model uses a bidirectional long short-term memory (BiLSTM) neural network to extract stroke and radical features and adopts a weighted concatenation method to fuse the two sets of features adaptively. This method integrates the two sets of features more effectively, thereby improving the model’s entity recognition ability. To test the entity recognition performance of this model, we compared it with the basic model and other mainstream models on the Microsoft Research Asia (MSRA) dataset and the “China People’s Daily” dataset from January to June 1998. Experimental results show that this model outperforms the other models, with F1 values of 97.01% and 96.78%, respectively.
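The abstract does not give the exact fusion formula, so the following is only a minimal sketch of one way a weighted concatenation could work. It assumes two scalar gate scores (hypothetical, standing in for learned parameters) that are normalized with a softmax and used to scale the stroke and radical feature vectors before they are concatenated. The names `weighted_concat` and `gate_scores` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_concat(stroke_vec, radical_vec, gate_scores):
    """Scale each feature vector by its weight, then concatenate.

    gate_scores holds two raw scores (assumed to be learned elsewhere);
    the softmax turns them into the fusion weights.
    """
    w_stroke, w_radical = softmax(gate_scores)
    scaled_stroke = [w_stroke * x for x in stroke_vec]
    scaled_radical = [w_radical * x for x in radical_vec]
    return scaled_stroke + scaled_radical


# Equal gate scores give each feature set a weight of 0.5.
fused = weighted_concat([1.0, 2.0], [3.0, 4.0], gate_scores=[0.0, 0.0])
print(fused)
```

In a trained model the gate scores would be updated by backpropagation, which is what would make the fusion adaptive; here they are fixed by hand purely for illustration.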

Electronics ◽  
2020 ◽  
Vol 9 (6) ◽  
pp. 1001 ◽  
Author(s):  
Jingang Liu ◽  
Chunhe Xia ◽  
Haihua Yan ◽  
Wenjing Xu

Named entity recognition (NER) is a basic but crucial task in the field of natural language processing (NLP) and big data analysis. Recognizing named entities in Chinese is more complicated and difficult than in English, which makes the task of NER in Chinese more challenging. In particular, fine-grained named entity recognition is more challenging than traditional named entity recognition tasks, mainly because fine-grained tasks place higher demands on the automatic feature extraction and information representation abilities of deep neural models. In this paper, we propose an innovative neural network model named En2BiLSTM-CRF to improve the effect of fine-grained Chinese entity recognition tasks. The proposed model, which includes the initial encoding layer, the enhanced encoding layer, and the decoding layer, combines the advantages of pre-training model encoding, dual bidirectional long short-term memory (BiLSTM) networks, and a residual connection mechanism. Hence, it can encode information multiple times and extract contextual features hierarchically. We conducted sufficient experiments on two representative datasets using multiple important metrics and compared the results with those of other advanced baselines. We present promising results showing that our proposed En2BiLSTM-CRF has better performance as well as better generalization ability in both fine-grained and coarse-grained Chinese entity recognition tasks.


Information ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 45 ◽  
Author(s):  
Shardrom Johnson ◽  
Sherlock Shen ◽  
Yuanchen Liu

Usually taken as linguistic features by Part-Of-Speech (POS) tagging, Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding, considering three aspects, namely character-embedding, word-embedding, and pos-embedding stitched in the order we give, and thus get their dependencies, based on which we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) for the Chinese NER task. Passing the comprehensive-embedding through the Bidirectional Long Short-Term Memory (BiLSTM) layer captures the connection between the historical and future information, and we then employ the attention mechanism to capture the connection between the content of the sentence at the current position and that at any location. Finally, we utilize Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that the CWPC_BiAtt model we proposed is well qualified for the NER task on the Microsoft Research Asia (MSRA) dataset and the Weibo NER corpus. High precision and recall were obtained, which verified the stability of the model. Position-embedding in comprehensive-embedding can compensate for the attention mechanism by providing position information for the disordered sequence, which shows that comprehensive-embedding has completeness. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. Our proposed CWPC_BiAtt model achieved the highest F-score, achieving state-of-the-art performance on the MSRA dataset and the Weibo NER corpus.
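Decoding a whole tag sequence with a CRF is commonly done with the Viterbi algorithm. The sketch below is not the authors' implementation; it assumes per-token emission scores (such as an upstream BiLSTM-attention layer might produce) and pairwise transition scores, all of which are made-up numbers here. The function name `viterbi` and the three-tag scheme `O`, `B`, `I` are illustrative choices.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions: list with one dict per token, mapping tag -> score.
    transitions: dict mapping (previous_tag, current_tag) -> score;
                 missing pairs are treated as 0.0.
    """
    tags = list(emissions[0])
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            # Pick the previous tag that maximizes the path score into cur.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for a two-token sentence.
emissions = [
    {"O": 2.0, "B": 1.5, "I": 0.0},
    {"O": 0.0, "B": 0.0, "I": 2.0},
]
# A strong penalty that forbids an inside tag directly after an outside tag.
transitions = {("O", "I"): -10.0}
decoded = viterbi(emissions, transitions)
print(decoded)
```

Without the transition penalty, picking the best tag per token independently would give the invalid sequence `O`, `I`; scoring the sequence as a whole is what lets the decoder prefer `B`, `I` instead.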


Information ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 82
Author(s):  
SaiKiranmai Gorla ◽  
Lalita Bhanu Murthy Neti ◽  
Aruna Malapati

Named entity recognition (NER) is a fundamental step for many natural language processing tasks and hence enhancing the performance of NER models is always appreciated. With limited resources being available, NER for South-East Asian languages like Telugu is quite a challenging problem. This paper attempts to improve the NER performance for Telugu using gazetteer-related features, which are automatically generated using Wikipedia pages. We make use of these gazetteer features along with other well-known features like contextual, word-level, and corpus features to build NER models. NER models are developed using three well-known classifiers: conditional random field (CRF), support vector machine (SVM), and margin infused relaxed algorithms (MIRA). The gazetteer features are shown to improve the performance, and the MIRA-based NER model fared better than its counterparts SVM and CRF.
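As a rough illustration of what a gazetteer feature can look like, the sketch below marks each token with one boolean per gazetteer list. The abstract does not describe the feature encoding or how the lists are built from Wikipedia, so the function name `gazetteer_features`, the list names, and the tiny hand-written entries are all assumptions made for the example.

```python
def gazetteer_features(tokens, gazetteers):
    """Build one feature dict per token.

    gazetteers maps a list name (e.g. "person") to a set of lowercase
    entries; each token gets a boolean feature per list.
    """
    features = []
    for token in tokens:
        key = token.lower()
        features.append(
            {f"in_{name}": key in entries for name, entries in gazetteers.items()}
        )
    return features


# Hypothetical, hand-written gazetteers; real ones would be much larger.
gazetteers = {
    "person": {"ravi"},
    "location": {"hyderabad"},
}
token_features = gazetteer_features(["Ravi", "visited", "Hyderabad"], gazetteers)
for token_feature in token_features:
    print(token_feature)
```

Feature dicts of this shape can be merged with contextual and word-level features before being passed to a classifier such as a CRF or SVM.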


2018 ◽  
Vol 10 (12) ◽  
pp. 123 ◽  
Author(s):  
Mohammed Ali ◽  
Guanzheng Tan ◽  
Aamir Hussain

Recurrent neural networks (RNNs) have achieved remarkable success in sequence labeling tasks with memory requirements. An RNN can remember previous information of a sequence and can thus be used to solve natural language processing (NLP) tasks. Named entity recognition (NER) is a common task of NLP and can be considered a classification problem. We propose a bidirectional long short-term memory (LSTM) model for this entity recognition task on Arabic text. The LSTM network can process sequences and relate to each part of them, which makes it useful for the NER task. Moreover, we use pre-trained word embedding to train the inputs that are fed into the LSTM network. The proposed model is evaluated on a popular dataset called “ANERcorp.” Experimental results show that the model with word embedding achieves a high F-score of approximately 88.01%.


2021 ◽  
Vol 11 (19) ◽  
pp. 9038
Author(s):  
Wazir Ali ◽  
Jay Kumar ◽  
Zenglin Xu ◽  
Rajesh Kumar ◽  
Yazhou Ren

Named entity recognition (NER) is a fundamental task in many natural language processing (NLP) applications, such as text summarization and semantic information retrieval. Recently, deep neural networks (NNs) with the attention mechanism yield excellent performance in NER by taking advantage of character-level and word-level representation learning. In this paper, we propose a deep context-aware bidirectional long short-term memory (CaBiLSTM) model for the Sindhi NER task. The model relies upon contextual representation learning (CRL), bidirectional encoder, self-attention, and sequential conditional random field (CRF). The CaBiLSTM model incorporates task-oriented CRL based on joint character-level and word-level representations. It takes character-level input to learn the character representations. Afterwards, the character representations are transformed into word features, and the bidirectional encoder learns the word representations. The output of the final encoder is fed into the self-attention through a hidden layer before decoding. Finally, we employ the CRF for the prediction of label sequences. The baselines and the proposed CaBiLSTM model are compared by exploiting pretrained Sindhi GloVe (SdGloVe), Sindhi fastText (SdfastText), task-oriented, and CRL-based word representations on the recently proposed SiNER dataset. Our proposed CaBiLSTM model achieved a high F1-score of 91.25% on the SiNER dataset with CRL without relying on additional handmade features, such as hand-crafted rules, gazetteers, or dictionaries.


Author(s):  
Yashvardhan Sharma ◽  
Rupal Bhargava ◽  
Bapiraju Vamsi Tadikonda

With the increase of internet applications and social media platforms there has been an increase in the informal way of text communication. People belonging to different regions tend to mix their regional language with English on social media text. This has been the trend with many multilingual nations now and is commonly known as code mixing. In code mixing, multiple languages are used within a statement. The problem of named entity recognition (NER) is a well-researched topic in natural language processing (NLP), but the present NER systems tend to perform inefficiently on code-mixed text. This paper proposes three approaches to improve named entity recognizers for handling code-mixing. The first approach is based on machine learning techniques such as support vector machines and other tree-based classifiers. The second approach is based on neural networks and the third approach uses long short-term memory (LSTM) architecture to solve the problem.


2021 ◽  
Author(s):  
Donghyeong Seong ◽  
Yoonho Choi ◽  
Sungwon Jung ◽  
Sungchul Bae ◽  
Soo-Yong Shin ◽  
...  

BACKGROUND Colorectal cancer is a leading cause of cancer deaths. Several screening tests such as colonoscopy can be used to find polyps or colorectal cancer. Colonoscopy reports are often written in unstructured narrative text. The information embedded in the reports can be used for various purposes, including colorectal cancer risk prediction, follow-up recommendation, and quality measurement. However, the availability and accessibility of the unstructured text data are still very low despite the large amounts of accumulated data. OBJECTIVE We aimed to develop a deep learning-based natural language processing (NLP) method for named entity recognition (NER) in colonoscopy reports. To the best of our knowledge, no previous studies on clinical NLP for colonoscopy reports have applied deep learning techniques. METHODS This study proposed a method to apply pre-trained word embedding to a deep learning-based NER model using large unlabeled colonoscopy reports. Approximately 280,668 colonoscopy reports were extracted from the clinical data warehouse of the Samsung Medical Center. For 5,000 reports, procedural information and colonoscopic findings were manually annotated with 17 labels. We compared variants of the long short-term memory (LSTM) model to select the one with the best performance for colonoscopy reports, which was the bidirectional LSTM with conditional random fields. Then, we applied pre-trained word embedding using a large unlabeled dataset (280,668 reports) to the selected model. RESULTS The NER model with pre-trained word embedding performed better for most labels than the model with one-hot encoding. The F1 scores for colonoscopic findings were: 0.9564 for lesions, 0.9722 for locations, 0.9809 for shapes, 0.9720 for colors, 0.9862 for sizes, and 0.9717 for numbers. CONCLUSIONS In this study, clinical NER was applied to extract meaningful information from colonoscopy reports. We proposed a deep learning-based NER model with pre-trained word embedding. The proposed method in this study achieved promising results that demonstrate it can be applied to various practical purposes.


Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 71
Author(s):  
Gonçalo Carnaz ◽  
Mário Antunes ◽  
Vitor Beires Nogueira

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named-entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
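The reported mean F1-score is lower than the harmonic mean of the reported mean precision and recall. One possible explanation, which the abstract does not confirm, is that the scores were averaged per entity class: the mean of per-class F1 values generally differs from the F1 computed from mean precision and mean recall. The sketch below illustrates that effect with made-up counts; the class names and the `prf` helper are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical (tp, fp, fn) counts for two entity classes.
counts = {"PERSON": (9, 1, 1), "PLACE": (2, 2, 6)}
per_class = [prf(*c) for c in counts.values()]
n = len(per_class)

macro_p = sum(p for p, _, _ in per_class) / n
macro_r = sum(r for _, r, _ in per_class) / n
macro_f1 = sum(f for _, _, f in per_class) / n          # mean of per-class F1
f1_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)  # F1 of the means

print(round(macro_p, 3), round(macro_r, 3))
print(round(macro_f1, 3), round(f1_of_means, 3))
```

With these counts the mean of the per-class F1 values comes out below the F1 computed from the mean precision and recall, so the two numbers should not be expected to match when a paper reports class-averaged metrics.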


2021 ◽  
pp. 1-12
Author(s):  
Yingwen Fu ◽  
Nankai Lin ◽  
Xiaotian Lin ◽  
Shengyi Jiang

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. Besides, affixes are an important part of word composition in the Indonesian language, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by the current top-performing models are insufficient. Targeting the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.


2019 ◽  
pp. 1-8 ◽  
Author(s):  
Tomasz Oliwa ◽  
Steven B. Maron ◽  
Leah M. Chase ◽  
Samantha Lomnicki ◽  
Daniel V.T. Catenacci ◽  
...  

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.

