scholarly journals Multifeature Named Entity Recognition in Information Security Based on Adversarial Learning

2019 ◽  
Vol 2019 ◽  
pp. 1-9
Author(s):  
Han Zhang ◽  
Yuanbo Guo ◽  
Tao Li

In order to obtain high quality and large-scale labelled data for information security research, we propose a new approach that combines a generative adversarial network with the BiLSTM-Attention-CRF model to obtain labelled data from crowd annotations. We use the generative adversarial network to find common features in crowd annotations and then consider them in conjunction with the domain dictionary feature and sentence dependency feature as additional features to be introduced into the BiLSTM-Attention-CRF model, which is then used to carry out named entity recognition in crowdsourcing. Finally, we create a dataset to evaluate our models using information security data. The experimental results show that our model has better performance than the other baseline models.

2020 ◽  
Author(s):  
蓬辉 王

BACKGROUND Chinese clinical named entity recognition, as a fundamental task of Chinese medical information extraction, plays an important role in recognizing medical entities contained in Chinese electronic medical records. Limited to lack of large annotated data, existing methods concentrate on employing external resources to improve the performance of clinical named entity recognition, which require lots of time and efficient rules to add external resources. OBJECTIVE To solve the problem of lack of large annotated data, we employ data augmentation without external resource to automatically generate more medical data depending on entities and non-entities in the training set, and enlarge training dataset to improve the performance of named entity recognition. METHODS In this paper, we propose a method of data augmentation, based on sequence generative adversarial network, to enlarge the training set. Different from other sequence generative adversarial networks, where the basic element is character or word, the basic element of our generated sequence is entity or non-entity. In our model, the generator can generate new sentences composed of entities and non-entities based on the learned hidden relationship between the entities and non-entities in the training set and the discriminator can judge if the generated sentences are positive and give rewards to help train the generator. The generated data from sequence adversarial network is used to enlarge the training set and improve the performance of named entity recognition in medical records. RESULTS Without external resource, we employ our data augmentation method in three datasets, both in general domains and medical domain. Experiments show that when we use generated data from data augmentation to expand training set, named entity recognition system has achieved competitive performance compared with existing methods, which shows the effectiveness of our data augmentation method. In general domains, our method achieves an overall F1-score of 59.42% in Weibo NER dataset and a F1-score of 95.28% in Resume. In medical domain, our method achieves 83.40%. CONCLUSIONS Our data augmentation method can expand training set based on the hidden relationship between entities and non-entities in the dataset, which can alleviate the problem of lack of labeled data while avoid using external resource. At the same time, our method can improve the performance of named entity recognition not only in general domains but also medical domain.


Information ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 79 ◽  
Author(s):  
Xiaoyu Han ◽  
Yue Zhang ◽  
Wenkai Zhang ◽  
Tinglei Huang

Relation extraction is a vital task in natural language processing. It aims to identify the relationship between two specified entities in a sentence. Besides information contained in the sentence, additional information about the entities is verified to be helpful in relation extraction. Additional information such as entity type getting by NER (Named Entity Recognition) and description provided by knowledge base both have their limitations. Nevertheless, there exists another way to provide additional information which can overcome these limitations in Chinese relation extraction. As Chinese characters usually have explicit meanings and can carry more information than English letters. We suggest that characters that constitute the entities can provide additional information which is helpful for the relation extraction task, especially in large scale datasets. This assumption has never been verified before. The main obstacle is the lack of large-scale Chinese relation datasets. In this paper, first, we generate a large scale Chinese relation extraction dataset based on a Chinese encyclopedia. Second, we propose an attention-based model using the characters that compose the entities. The result on the generated dataset shows that these characters can provide useful information for the Chinese relation extraction task. By using this information, the attention mechanism we used can recognize the crucial part of the sentence that can express the relation. The proposed model outperforms other baseline models on our Chinese relation extraction dataset.


2019 ◽  
Vol 9 (1) ◽  
pp. 15 ◽  
Author(s):  
Runyu Fan ◽  
Lizhe Wang ◽  
Jining Yan ◽  
Weijing Song ◽  
Yingqian Zhu ◽  
...  

Constructing a knowledge graph of geological hazards literature can facilitate the reuse of geological hazards literature and provide a reference for geological hazard governance. Named entity recognition (NER), as a core technology for constructing a geological hazard knowledge graph, has to face the challenges that named entities in geological hazard literature are diverse in form, ambiguous in semantics, and uncertain in context. This can introduce difficulties in designing practical features during the NER classification. To address the above problem, this paper proposes a deep learning-based NER model; namely, the deep, multi-branch BiGRU-CRF model, which combines a multi-branch bidirectional gated recurrent unit (BiGRU) layer and a conditional random field (CRF) model. In an end-to-end and supervised process, the proposed model automatically learns and transforms features by a multi-branch bidirectional GRU layer and enhances the output with a CRF layer. Besides the deep, multi-branch BiGRU-CRF model, we also proposed a pattern-based corpus construction method to construct the corpus needed for the deep, multi-branch BiGRU-CRF model. Experimental results indicated the proposed deep, multi-branch BiGRU-CRF model outperformed state-of-the-art models. The proposed deep, multi-branch BiGRU-CRF model constructed a large-scale geological hazard literature knowledge graph containing 34,457 entities nodes and 84,561 relations.


2019 ◽  
Vol 9 (18) ◽  
pp. 3658 ◽  
Author(s):  
Jianliang Yang ◽  
Yuenan Liu ◽  
Minghui Qian ◽  
Chenghua Guan ◽  
Xiangfei Yuan

Clinical named entity recognition is an essential task for humans to analyze large-scale electronic medical records efficiently. Traditional rule-based solutions need considerable human effort to build rules and dictionaries; machine learning-based solutions need laborious feature engineering. For the moment, deep learning solutions like Long Short-term Memory with Conditional Random Field (LSTM–CRF) achieved considerable performance in many datasets. In this paper, we developed a multitask attention-based bidirectional LSTM–CRF (Att-biLSTM–CRF) model with pretrained Embeddings from Language Models (ELMo) in order to achieve better performance. In the multitask system, an additional task named entity discovery was designed to enhance the model’s perception of unknown entities. Experiments were conducted on the 2010 Informatics for Integrating Biology & the Bedside/Veterans Affairs (I2B2/VA) dataset. Experimental results show that our model outperforms the state-of-the-art solution both on the single model and ensemble model. Our work proposes an approach to improve the recall in the clinical named entity recognition task based on the multitask mechanism.


2020 ◽  
Vol 7 (2) ◽  
pp. 205395172096886
Author(s):  
Mark Altaweel ◽  
Tasoula Georgiou Hadjitofi

The marketisation of heritage has been a major topic of interest among heritage specialists studying how the online marketplace shapes sales. Missing from that debate is a large-scale analysis seeking to understand market trends on popular selling platforms such as eBay. Sites such as eBay can inform what heritage items are of interest to the wider public, and thus what is potentially of greater cultural value, while also demonstrating monetary value trends. To better understand the sale of heritage on eBay’s international site, this work applies named entity recognition using conditional random fields, a method within natural language processing, and word dictionaries that inform on market trends. The methods demonstrate how Western markets, particularly the US and UK, have dominated sales for different cultures. Roman, Egyptian, Viking (Norse/Dane) and Near East objects are sold the most. Surprisingly, Cyprus and Egypt, two countries with relatively strict prohibition against the sale of heritage items, make the top 10 selling countries on eBay. Objects such as jewellery, statues and figurines, and religious items sell in relatively greater numbers, while masks and vessels (e.g. vases) sell at generally higher prices. Metal, stone and terracotta are commonly sold materials. More rare materials, such as those made of ivory, papyrus or wood, have relatively higher prices. Few sellers dominate the market, where in some months 40% of sales are controlled by the top 10 sellers. The tool used for the study is freely provided, demonstrating benefits in an automated approach to understanding sale trends.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Ming Cheng ◽  
Shufeng Xiong ◽  
Fei Li ◽  
Pan Liang ◽  
Jianbo Gao

Abstract Background Named entity recognition (NER) on Chinese electronic medical/healthcare records has attracted significantly attentions as it can be applied to building applications to understand these records. Most previous methods have been purely data-driven, requiring high-quality and large-scale labeled medical data. However, labeled data is expensive to obtain, and these data-driven methods are difficult to handle rare and unseen entities. Methods To tackle these problems, this study presents a novel multi-task deep neural network model for Chinese NER in the medical domain. We incorporate dictionary features into neural networks, and a general secondary named entity segmentation is used as auxiliary task to improve the performance of the primary task of named entity recognition. Results In order to evaluate the proposed method, we compare it with other currently popular methods, on three benchmark datasets. Two of the datasets are publicly available, and the other one is constructed by us. Experimental results show that the proposed model achieves 91.07% average f-measure on the two public datasets and 87.05% f-measure on private dataset. Conclusions The comparison results of different models demonstrated the effectiveness of our model. The proposed model outperformed traditional statistical models.


Author(s):  
Dat Ba Nguyen ◽  
Martin Theobald ◽  
Gerhard Weikum

Methods for Named Entity Recognition and Disambiguation (NERD) perform NER and NED in two separate stages. Therefore, NED may be penalized with respect to precision by NER false positives, and suffers in recall from NER false negatives. Conversely, NED does not fully exploit information computed by NER such as types of mentions. This paper presents J-NERD, a new approach to perform NER and NED jointly, by means of a probabilistic graphical model that captures mention spans, mention types, and the mapping of mentions to entities in a knowledge base. We present experiments with different kinds of texts from the CoNLL’03, ACE’05, and ClueWeb’09-FACC1 corpora. J-NERD consistently outperforms state-of-the-art competitors in end-to-end NERD precision, recall, and F1.


Sign in / Sign up

Export Citation Format

Share Document