Entity Extraction
Recently Published Documents


TOTAL DOCUMENTS: 251 (five years: 98)
H-INDEX: 19 (five years: 3)

2022, Vol 2022, pp. 1-15
Author(s): Yinghai Zhou, Yi Tang, Ming Yi, Chuanyu Xi, Hai Lu

With the development of advanced persistent threats (APTs) and the increasingly severe network security situation, a strategic defense posture built around "active defense, traceability, and countermeasures" has emerged, and cyberspace threat intelligence (CTI) has become increasingly valuable in enhancing the ability to resist cyber threats. Based on the practical demands of defending against APTs, we apply natural language processing to CTI and design CTI View, a new automated system for text extraction and analysis of the massive volume of unstructured CTI released by various security vendors. The main contributions of CTI View are as follows: (1) to handle heterogeneous CTI, a text extraction framework for threat intelligence is designed based on an automated testing framework, text recognition, and text denoising; it effectively addresses the poor adaptability of crawlers when collecting heterogeneous CTI; (2) regular expressions combined with a blacklist and whitelist mechanism are used to extract the IOC and TTP information described in CTI; (3) driven by practical requirements, an entity extraction model for heterogeneous threat intelligence based on bidirectional encoder representations from transformers (BERT) is designed. In this paper, a GRU layer is added to the existing BERT-BiLSTM-CRF model; we evaluate the proposed model on an annotated dataset and obtain better performance than current mainstream entity extraction models.
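
The regex-plus-whitelist step described in point (2) can be illustrated with a minimal sketch. The patterns and whitelist entries below are assumptions for illustration, not the ones used by CTI View.

```python
# Minimal sketch of regex-based IOC extraction with a whitelist filter.
# Patterns and whitelist values are illustrative placeholders.
import re

IOC_PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
}

# Whitelisted values are benign indicators that should not be reported.
WHITELIST = {"8.8.8.8", "microsoft.com", "example.com"}

def extract_iocs(text: str) -> dict:
    """Return candidate IOCs grouped by type, with whitelisted values removed."""
    results = {}
    for ioc_type, pattern in IOC_PATTERNS.items():
        hits = {m.group(0).lower() for m in pattern.finditer(text)}
        results[ioc_type] = sorted(hits - WHITELIST)
    return results

if __name__ == "__main__":
    report = "The sample beaconed to 45.77.1.23 and update.evil-cdn.net, not 8.8.8.8."
    print(extract_iocs(report))
```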


2021, Vol 12 (5-2021), pp. 35-49
Author(s): Alexander V. Vicentiy, Maxim G. Shishaev

This paper considers the problem of extracting geoattributed entities from natural language texts in order to visualize the spatial relations of geographical objects. For visualization, we use the automated generation of schematic maps as subject-oriented components of geographic information systems. The paper describes an information technology that extracts geoattributed entities from natural language texts by combining several approaches: a neural network approach, a rule-based approach, and an approach based on lexico-syntactic patterns for the analysis of natural language texts. For data visualization, we propose to use automated geocoding tools in conjunction with the capabilities of modern geographic information systems. The result is a cartogram that displays the spatial relations of the objects mentioned in the text.
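
The "extract, then geocode" idea can be sketched as follows, assuming spaCy and geopy are installed and an English model is available. The paper combines a neural model, rules, and lexico-syntactic patterns; here plain spaCy NER stands in for that hybrid extractor.

```python
# Minimal sketch: find place names with generic NER, then geocode them.
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_sm")             # generic NER model (assumption)
geolocator = Nominatim(user_agent="geo-demo")  # any user_agent string works

def geo_entities(text: str):
    """Yield (name, latitude, longitude) for place names found in the text."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC"}:       # countries, cities, regions
            place = geolocator.geocode(ent.text)
            if place is not None:
                yield ent.text, place.latitude, place.longitude

for name, lat, lon in geo_entities("The expedition sailed from Murmansk to Arkhangelsk."):
    print(name, lat, lon)
```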


Author(s): Divyansh Shankar Mishra, Abhinav Agarwal, B. P. Swathi, K. C. Akshay

Abstract: The idea of semantically linked data, and the use of this linked data in modern computer applications, has been one of the most important aspects of Web 3.0. However, realizing this vision has been challenging because of the difficulties associated with building knowledge bases and using formal languages to query them. In this regard, SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language, is the most popular formal language for querying Linked Open Data and Resource Description Framework databases. Nonetheless, writing SPARQL queries is known to be difficult, even for experts. Natural language query formalization, which involves semantically parsing natural language queries into their formal language equivalents, has been an essential step in overcoming this steep learning curve. Recent work in the field has applied artificial intelligence (AI) techniques for language modelling with adequate accuracy. This paper discusses a design for creating a closed-domain ontology, which is then used by an AI-powered chatbot that performs natural language query formalization, using Rasa for entity extraction after intent recognition, to query the linked data. A precision-recall analysis is performed using built-in Rasa tools in conjunction with our own testing parameters, and the system achieves a precision of 0.78, recall of 0.79 and F1-score of 0.79, which are better than the current state of the art.
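
The query-formalization step can be sketched as below: once an intent has been recognized and an entity extracted (for example by Rasa), a SPARQL template is filled in and executed. The endpoint, template, and entity value are stand-ins; the paper queries its own closed-domain ontology rather than DBpedia.

```python
# Minimal sketch of intent + entity -> SPARQL template -> query execution.
from SPARQLWrapper import SPARQLWrapper, JSON

TEMPLATES = {
    # intent name -> SPARQL template with an {entity} slot (illustrative)
    "ask_abstract": """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {{
            ?s rdfs:label "{entity}"@en ;
               dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }} LIMIT 1
    """,
}

def formalize_and_run(intent: str, entity: str,
                      endpoint: str = "https://dbpedia.org/sparql"):
    """Fill the template for the detected intent and run it against the endpoint."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(TEMPLATES[intent].format(entity=entity))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

print(formalize_and_run("ask_abstract", "SPARQL"))
```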


2021, Vol 11 (22), pp. 10995
Author(s): Samir Rustamov, Aygul Bayramova, Emin Alasgarov

The rapid increase in conversational AI and user chat data has led to intensive development of dialogue management systems (DMS) for various industries. Yet for low-resource languages, such as Azerbaijani, very little research has been conducted. The main purpose of this work is to experiment with various DMS pipeline set-ups to decide on the most appropriate natural language understanding and dialogue manager settings. In our project, we designed and evaluated different DMS pipelines on conversational text data obtained from one of the leading retail banks in Azerbaijan. The two main components of a DMS, natural language understanding (NLU) and the dialogue manager, have been investigated. In the first step of NLU, we utilized a language identification (LI) component for language detection, investigating both built-in LI methods such as fastText and custom machine learning (ML) models trained on a domain-specific dataset. The second step was a comparison of classic ML classifiers (logistic regression, neural networks, and SVM) and the Dual Intent and Entity Transformer (DIET) architecture for user intention detection. In these experiments we used different combinations of feature extractors, such as CountVectorizer, Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer, and word embeddings, over both word and character n-gram tokens. To extract important information from the text messages, a named entity recognition (NER) component was added to the pipeline. The best NER model was chosen among a conditional random fields (CRF) tagger, deep neural network (DNN) models, and the built-in entity extraction component of the DIET architecture. The obtained entity tags were fed to the dialogue management module as features. All NLU set-ups were followed by a dialogue management module containing a rule-based policy to handle FAQs and chitchat, as well as a Transformer Embedding Dialogue (TED) policy to handle more complex and unexpected dialogue inputs. As a result, we suggest a DMS pipeline for a financial assistant that is capable of identifying the intention, named entities, and language of a text, followed by policies that generate a proper response (based on the designed dialogues) and suggest the best next action.
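
One of the classic-ML intent classifiers compared in the paper can be sketched as a character n-gram TF-IDF representation fed to logistic regression. The tiny training set below is purely illustrative; the study used real banking chat data in Azerbaijani.

```python
# Minimal sketch: char n-gram TF-IDF + logistic regression for intent detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["what is my card balance", "block my card please",
          "card balance please", "please block the lost card"]
labels = ["check_balance", "block_card", "check_balance", "block_card"]

# Character n-grams are robust to rich morphology and typos,
# which is useful for low-resource-language chat messages.
intent_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
intent_clf.fit(texts, labels)
print(intent_clf.predict(["could you block my card"]))
```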


2021, Vol 11 (1)
Author(s): Dandan Tao, Dongyu Zhang, Ruofan Hu, Elke Rundensteiner, Hao Feng

Abstract: Foodborne outbreaks are a serious but preventable threat to public health that often lead to illness, loss of life, significant economic loss, and the erosion of consumer confidence. Understanding how consumers respond when interacting with foods, and extracting information from posts on social media, may provide new means of reducing the risks and curtailing the outbreaks. In recent years, Twitter has been employed as a new tool for identifying unreported foodborne illnesses. However, there is a huge gap between the identification of sporadic illnesses and the early detection of a potential outbreak. In this work, a dual-task BERTweet model was developed to identify unreported foodborne illnesses and extract foodborne-illness-related entities from Twitter. Unlike previous methods, our model leveraged the mutually beneficial relationship between the two tasks. The results showed that the F1-score of relevance prediction was 0.87, and the F1-score of entity extraction was 0.61. Key elements such as time, location, and food detected in sentences indicating foodborne illness were used to analyze potential foodborne outbreaks in massive historical tweets. A case study on tweets indicating foodborne illness showed that the discovered trend is consistent with the true outbreaks that occurred during the same period.
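
A dual-task model of this kind can be sketched with two heads over a shared BERTweet encoder: one classifies whether a tweet reports a foodborne illness, the other tags tokens with entity labels. The checkpoint is the public vinai/bertweet-base model; the label set and head sizes here are placeholders, not the authors' exact configuration.

```python
# Minimal sketch of a dual-task (sequence + token classification) model over BERTweet.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualTaskBERTweet(nn.Module):
    def __init__(self, num_entity_labels: int = 9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/bertweet-base")
        hidden = self.encoder.config.hidden_size
        self.relevance_head = nn.Linear(hidden, 2)               # tweet-level relevance
        self.entity_head = nn.Linear(hidden, num_entity_labels)  # token-level tags

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state                     # (batch, tokens, hidden)
        relevance_logits = self.relevance_head(token_states[:, 0])  # first-token summary
        entity_logits = self.entity_head(token_states)
        return relevance_logits, entity_logits

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
batch = tokenizer(["got food poisoning after the tacos downtown last night"],
                  return_tensors="pt")
model = DualTaskBERTweet()
rel, ent = model(batch["input_ids"], batch["attention_mask"])
print(rel.shape, ent.shape)
```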


2021
Author(s): Fatemeh Shah-Mohammadi, Wanting Cui, Joseph Finkelstein

2021
Author(s): Qianying Wang, Jing Liao, Mirella Lapata, Malcolm Macleod

Abstract
Background: Natural language processing could assist with multiple tasks in systematic reviews and reduce workload, including the extraction of PICO elements such as study populations, interventions and outcomes. The PICO framework provides a basis for retrieving and selecting published evidence relevant to a specific systematic review question, and automated approaches to PICO extraction have been developed, particularly for reviews of clinical trial findings. Given the differences between preclinical animal studies and clinical trials, developing separate approaches is necessary. Facilitating preclinical systematic reviews will inform the translation from preclinical to clinical research.
Methods: We randomly selected 400 abstracts from the PubMed Central Open Access database that described in vivo animal research and manually annotated them with PICO phrases for Species, Strain, model Induction, Intervention, Comparator and Outcome. We developed a two-stage workflow for preclinical PICO extraction. First, we fine-tuned BERT with different pre-trained modules for PICO sentence classification. Then, after removing text irrelevant to PICO features, we explored LSTM-, CRF- and BERT-based models for PICO entity recognition. We also explored a self-training approach because of the small training corpus.
Results: For PICO sentence classification, BERT models using all pre-trained modules achieved an F1 score over 80%, and models pre-trained on PubMed abstracts achieved the highest F1 of 85%. For PICO entity recognition, fine-tuning BERT pre-trained on PubMed abstracts achieved an overall F1 of 71%, with satisfactory F1 for Species (98%), Strain (70%), Intervention (70%) and Outcome (67%). The scores for Induction and Comparator were less satisfactory, but the F1 for Comparator could be improved to 50% by applying self-training.
Conclusions: Our study indicates that, of the approaches tested, BERT pre-trained on PubMed abstracts is the best for both PICO sentence classification and PICO entity recognition in preclinical abstracts. Self-training yields better performance for identifying comparators and strains.
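
Stage one of such a workflow, PICO sentence classification with a PubMed-pretrained BERT, can be sketched as follows. The checkpoint name and label set are assumptions for illustration; the paper compared several pre-trained modules.

```python
# Minimal sketch of PICO sentence classification with a PubMed-pretrained BERT.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed stand-in
LABELS = ["not_pico", "population", "intervention", "comparator", "outcome"]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS))

sentences = ["Male C57BL/6 mice were subjected to middle cerebral artery occlusion."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Before fine-tuning, the classification head is randomly initialised and the
# predictions are meaningless; in practice the model is trained on the annotated
# abstracts with a standard cross-entropy objective.
with torch.no_grad():
    logits = model(**batch).logits
print(LABELS[int(logits.argmax(dim=-1)[0])])
```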


2021, pp. 763-771
Author(s): Bajeela Aejas, Abdelaziz Bouras, Abdelhak Belhi, Houssem Gasmi

Author(s): Yang Jiao, Jingru Han, Bohao Xu, Ming Xiao, Binnan Shen, ...
