Data Augmentation based on Sequence Generative Adversarial Network for Chinese Clinical Named Entity Recognition (Preprint)

2020 ◽  
Author(s):  
蓬辉 王

BACKGROUND Chinese clinical named entity recognition, a fundamental task in Chinese medical information extraction, identifies the medical entities contained in Chinese electronic medical records. Because large annotated datasets are scarce, existing methods concentrate on external resources to improve recognition performance, but incorporating those resources takes considerable time and requires well-designed rules.
OBJECTIVE To address the shortage of annotated data, we apply data augmentation without external resources: we automatically generate additional medical data from the entities and non-entities in the training set and use it to enlarge the training set and improve named entity recognition performance.
METHODS We propose a data augmentation method based on a sequence generative adversarial network to enlarge the training set. Unlike other sequence generative adversarial networks, whose basic element is a character or a word, the basic element of our generated sequences is an entity or a non-entity. The generator produces new sentences composed of entities and non-entities, based on the hidden relationships it learns between entities and non-entities in the training set; the discriminator judges whether a generated sentence is positive and returns rewards that help train the generator. The generated data is then used to enlarge the training set and improve named entity recognition in medical records.
RESULTS Without external resources, we apply our data augmentation method to three datasets, covering both general and medical domains. When the generated data is used to expand the training set, the named entity recognition system achieves performance competitive with existing methods, which demonstrates the effectiveness of our method. In the general domain, our method achieves an overall F1-score of 59.42% on the Weibo NER dataset and an F1-score of 95.28% on the Resume dataset. In the medical domain, our method achieves 83.40%.
CONCLUSIONS Our data augmentation method expands the training set based on the hidden relationships between entities and non-entities in the dataset, which alleviates the shortage of labeled data while avoiding external resources. It also improves named entity recognition performance in both the general and medical domains.
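
The abstract above treats whole entities and non-entities, not characters or words, as the basic elements of a generated sequence. The following is a minimal sketch of how a tagged sentence could be split into such units before being fed to a generator. It assumes a BIO tagging scheme and a hypothetical `DISEASE` label; the abstract does not state the tag scheme or the preprocessing actually used.

```python
from typing import List, Tuple

# (text, label): label is an entity type, or "O" for a non-entity span.
Segment = Tuple[str, str]


def to_segments(chars: List[str], tags: List[str]) -> List[Segment]:
    """Group a BIO-tagged character sequence into entity / non-entity units."""
    segments: List[Segment] = []
    buf: List[str] = []
    label = "O"
    for ch, tag in zip(chars, tags):
        # Label of the unit this character belongs to ("B-X"/"I-X" -> "X").
        cur = "O" if tag == "O" else tag[2:]
        # A unit ends when a new entity begins or the label changes.
        starts_new = tag.startswith("B-") or cur != label
        if buf and starts_new:
            segments.append(("".join(buf), label))
            buf = []
        buf.append(ch)
        label = cur
    if buf:
        segments.append(("".join(buf), label))
    return segments


# Illustrative sentence (assumed, not taken from any dataset in the abstract).
example = to_segments(
    list("患者有高血压病史"),
    ["O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O"],
)
```

Here `example` holds three units: a non-entity span, one `DISEASE` entity, and a trailing non-entity span. A generator working at this granularity would emit one such unit per step.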

2019 ◽  
Vol 2019 ◽  
pp. 1-9
Author(s):  
Han Zhang ◽  
Yuanbo Guo ◽  
Tao Li

To obtain high-quality, large-scale labelled data for information security research, we propose a new approach that combines a generative adversarial network with the BiLSTM-Attention-CRF model to obtain labelled data from crowd annotations. We use the generative adversarial network to find common features in crowd annotations, then combine them with a domain dictionary feature and a sentence dependency feature as additional inputs to the BiLSTM-Attention-CRF model, which performs named entity recognition on the crowdsourced data. Finally, we create a dataset from information security data to evaluate our models. The experimental results show that our model performs better than the baseline models.


2021 ◽  
Vol 189 ◽  
pp. 292-299
Author(s):  
Caroline Sabty ◽  
Islam Omar ◽  
Fady Wasfalla ◽  
Mohamed Islam ◽  
Slim Abdennadher

2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Han Huang ◽  
Hongyu Wang ◽  
Dawei Jin

Named entity recognition (NER) is an indispensable part of many natural language processing technologies, such as information extraction, information retrieval, and intelligent Q&A. This paper describes the development of the AL-CRF model, an NER approach based on active learning (AL). The AL-CRF model proceeds as follows. First, the samples are clustered using the k-means approach. Then, stratified sampling is performed on the resulting clusters to obtain the initial samples, which are used to train the basic conditional random field (CRF) classifier. Next, the selection process begins, using entropy as its criterion: the samples with the highest entropy values are added to the training set. The CRF classifier is then retrained on the enlarged training set. The learning and selection steps are repeated until the harmonic mean F stabilizes and the final NER model is obtained. Several NER experiments are performed on legislative and medical cases to validate the performance of AL-CRF. The testing data include Chinese judicial documents and Chinese electronic medical records (EMRs). Testing indicates that the proposed algorithm achieves better recognition accuracy and recall than the conventional CRF model. Moreover, the main advantage of the approach is that it is more effective while requiring fewer manually labelled training samples, which can make the process more cost-effective and more reliable.
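
The entropy-based selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the classifier exposes per-token label distributions (marginals) for each unlabelled sentence and scores a sentence by the mean token entropy. The abstract does not say how entropy is aggregated over a sequence, and the names `sequence_entropy` and `select_most_uncertain` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sequence_entropy(marginals: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) over per-token label distributions."""
    if not marginals:
        return 0.0
    total = 0.0
    for dist in marginals:
        # Terms with p == 0 contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(marginals)


def select_most_uncertain(
    pool: Dict[str, Sequence[Sequence[float]]], k: int
) -> List[str]:
    """Return the ids of the k pool sentences with the highest entropy."""
    ranked = sorted(pool, key=lambda sid: sequence_entropy(pool[sid]), reverse=True)
    return ranked[:k]


# Toy pool: sentence id -> per-token label distributions (made-up numbers).
pool = {
    "s1": [[0.5, 0.5], [0.5, 0.5]],
    "s2": [[0.99, 0.01], [1.0, 0.0]],
    "s3": [[0.7, 0.3], [0.9, 0.1]],
}
chosen = select_most_uncertain(pool, k=2)
```

In a full loop, the chosen sentences would be labelled manually, added to the training set, and the classifier retrained, repeating until the F score stops changing.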


2021 ◽  
Vol 9 ◽  
pp. 586-604
Author(s):  
Abbas Ghaddar ◽  
Philippe Langlais ◽  
Ahmad Rashid ◽  
Mehdi Rezagholizadeh

Abstract In this work, we examine the ability of NER models to use contextual information when predicting the type of an ambiguous entity. We introduce NRB, a new testbed carefully designed to diagnose the Name Regularity Bias of NER models. Our results indicate that all state-of-the-art models we tested show such a bias; BERT fine-tuned models significantly outperform feature-based (LSTM-CRF) ones on NRB, despite having comparable (sometimes lower) performance on standard benchmarks. To mitigate this bias, we propose a novel model-agnostic training method that adds learnable adversarial noise to some entity mentions, forcing models to focus more strongly on the contextual signal and leading to significant gains on NRB. Combining it with two other training strategies, data augmentation and parameter freezing, leads to further gains.


2021 ◽  
Author(s):  
Arslan Erdengasileng ◽  
Keqiao Li ◽  
Qing Han ◽  
Shubo Tian ◽  
Jian Wang ◽  
...  

Identification and indexing of chemical compounds in full-text articles are essential steps in biomedical article categorization, information extraction, and biological text mining. The BioCreative Challenge was established to evaluate methods for biological text mining and information extraction. Track 2 of BioCreative VII (summer 2021) consists of two subtasks: chemical identification and chemical indexing in full-text PubMed articles. The chemical identification subtask has two parts: chemical named entity recognition (NER) and chemical normalization. In this paper, we present our work on developing a hybrid pipeline for chemical named entity recognition, chemical normalization, and chemical indexing in full-text PubMed articles. Specifically, we applied BERT-based methods for chemical NER and chemical indexing, and a sieve-based dictionary matching method for chemical normalization. For subtask 1, we used PubMedBERT with data augmentation on the chemical NER task. Several chemical-MeSH dictionaries, including MeSH.XML, SUPP.XML, MRCONSO.RRF, and PubTator chemical annotations, are applied in a specific order to get the best performance on chemical normalization. We achieved F1 scores of 0.86 and 0.7668 on chemical NER and chemical normalization, respectively. For subtask 2, we formulated the task as a binary prediction problem for each individual chemical compound name. We then used a BERT-based model with engineered features and achieved a strict F1 score of 0.4825 on the test set, which is substantially higher than the median F1 score (0.3971) of all the submissions.
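
The idea of sieve-based dictionary matching, where several dictionaries are consulted in a fixed priority order and the first hit wins, can be sketched as below. This is only an illustration: the dictionary names, entries, and identifiers are toy placeholders, and the normalization of the mention string (lowercasing, whitespace collapsing) is an assumption rather than the pipeline described in the abstract.

```python
from typing import Dict, List, Optional, Tuple

# Each sieve is (name, lookup table); earlier sieves take priority.
Sieve = Tuple[str, Dict[str, str]]


def normalize(mention: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic normalizer)."""
    return " ".join(mention.lower().split())


def sieve_lookup(mention: str, sieves: List[Sieve]) -> Optional[Tuple[str, str]]:
    """Return (sieve name, identifier) from the first sieve that matches."""
    key = normalize(mention)
    for name, table in sieves:
        if key in table:
            return name, table[key]
    return None


# Toy dictionaries with placeholder identifiers (not real MeSH ids).
sieves: List[Sieve] = [
    ("dict_a", {"aspirin": "ID:A1"}),
    ("dict_b", {"aspirin": "ID:B1", "acetylsalicylic acid": "ID:B2"}),
]

first_hit = sieve_lookup("  Aspirin ", sieves)
fallback_hit = sieve_lookup("Acetylsalicylic  Acid", sieves)
no_hit = sieve_lookup("unknown compound", sieves)
```

Because the loop stops at the first match, the order of the list encodes which dictionary is trusted most, which is why the ordering itself becomes something to tune.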

