Data Augmentation based on Sequence Generative Adversarial Network for Chinese Clinical Named Entity Recognition (Preprint)
BACKGROUND Chinese clinical named entity recognition is a fundamental task in Chinese medical information extraction and plays an important role in recognizing the medical entities contained in Chinese electronic medical records. Constrained by the lack of large annotated datasets, existing methods concentrate on employing external resources to improve the performance of clinical named entity recognition, which requires considerable time and carefully designed rules to incorporate those resources. OBJECTIVE To address the lack of large annotated datasets, we employ data augmentation without external resources to automatically generate additional medical data from the entities and non-entities in the training set, and use it to enlarge the training set and improve the performance of named entity recognition. METHODS We propose a data augmentation method, based on a sequence generative adversarial network, to enlarge the training set. Unlike other sequence generative adversarial networks, in which the basic element is a character or a word, the basic element of our generated sequence is an entity or a non-entity. In our model, the generator produces new sentences composed of entities and non-entities, based on the hidden relationships it learns between the entities and non-entities in the training set, and the discriminator judges whether the generated sentences are positive samples and returns rewards that help train the generator. The data generated by the sequence generative adversarial network is used to enlarge the training set and improve the performance of named entity recognition in medical records. RESULTS Without external resources, we apply our data augmentation method to three datasets covering both general domains and the medical domain. Experiments show that when the generated data is used to expand the training set, the named entity recognition system achieves competitive performance compared with existing methods, which demonstrates the effectiveness of our data augmentation method.
In general domains, our method achieves an overall F1-score of 59.42% on the Weibo NER dataset and an F1-score of 95.28% on the Resume dataset. In the medical domain, our method achieves 83.40%. CONCLUSIONS Our data augmentation method can expand the training set based on the hidden relationships between entities and non-entities in the dataset, which alleviates the lack of labeled data while avoiding the use of external resources. At the same time, our method improves the performance of named entity recognition not only in general domains but also in the medical domain.
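
The idea in METHODS of treating an entity or a non-entity, rather than a character or a word, as the basic element of a sequence can be illustrated with a short sketch. It is not the authors' implementation: the BIO tagging scheme, the function name `segment_units`, the example sentence, and the `DISEASE` label are all assumptions made for illustration. The sketch only shows how a character-level tagged sentence might be grouped into entity and non-entity units before such units are fed to a sequence generator.

```python
from typing import List, Tuple


def segment_units(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group characters into (text, label) units.

    Assumes BIO tags (hypothetical choice): "B-X" opens an entity of type X,
    "I-X" continues it, and "O" marks non-entity characters. Consecutive "O"
    characters are merged into one non-entity unit labeled "O".
    """
    units: List[Tuple[str, str]] = []
    buf: List[str] = []   # characters of the unit being built
    label = None          # label of the unit being built

    for ch, tag in zip(chars, tags):
        if tag == "O":
            cur = "O"
            starts_new = label != "O"
        elif tag.startswith("B-"):
            cur = tag[2:]
            starts_new = True          # "B-" always opens a new entity
        else:                          # "I-X": continue X, or open it if stray
            cur = tag[2:]
            starts_new = label != cur

        if starts_new and buf:
            units.append(("".join(buf), label))
            buf = []
        buf.append(ch)
        label = cur

    if buf:                            # flush the final unit
        units.append(("".join(buf), label))
    return units


# Hypothetical example sentence: "the patient has a history of hypertension".
chars = list("患者有高血压病史")
tags = ["O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O"]
units = segment_units(chars, tags)
```

Under these assumptions, the eight characters collapse into three units (a non-entity span, one entity, and another non-entity span), so a generator operating on units would emit a sequence of length three rather than eight.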