Entity Extraction
Recently Published Documents


TOTAL DOCUMENTS: 251 (five years: 98)
H-INDEX: 19 (five years: 3)

2022, Vol 2022, pp. 1-15
Author(s): Yinghai Zhou, Yi Tang, Ming Yi, Chuanyu Xi, Hai Lu

With the development of advanced persistent threats (APTs) and the increasingly severe network security situation, a strategic defense posture built around "active defense, traceability, and countermeasures" has emerged, and cyberspace threat intelligence (CTI) has become increasingly valuable in enhancing the ability to resist cyber threats. Based on the practical demands of defending against APTs, we apply natural language processing to CTI and design CTI View, a new automated system for text extraction and analysis of the massive volume of unstructured CTI released by various security vendors. The main contributions of CTI View are as follows: (1) to handle heterogeneous CTI, a text extraction framework for threat intelligence is designed based on an automated testing framework, text recognition, and text denoising; it effectively addresses the poor adaptability of crawlers when collecting heterogeneous CTI; (2) regular expressions combined with a blacklist and whitelist mechanism are used to extract the IOC and TTP information described in CTI; (3) driven by practical requirements, an entity extraction model for heterogeneous threat intelligence based on bidirectional encoder representations from transformers (BERT) is designed. In this paper, a GRU layer is added to the existing BERT-BiLSTM-CRF model; we evaluate the proposed model on an annotated dataset and obtain better performance than current mainstream entity extraction models.
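
The regex-plus-whitelist step described in point (2) can be illustrated with a minimal sketch. The patterns and whitelist entries below are assumptions for illustration, not the ones used by CTI View.

```python
# Minimal sketch of regex-based IOC extraction with a whitelist filter.
# Patterns and whitelist values are illustrative placeholders.
import re

IOC_PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
}

# Whitelisted values are benign indicators that should not be reported.
WHITELIST = {"8.8.8.8", "microsoft.com", "example.com"}

def extract_iocs(text: str) -> dict:
    """Return candidate IOCs grouped by type, with whitelisted values removed."""
    results = {}
    for ioc_type, pattern in IOC_PATTERNS.items():
        hits = {m.group(0).lower() for m in pattern.finditer(text)}
        results[ioc_type] = sorted(hits - WHITELIST)
    return results

if __name__ == "__main__":
    report = "The sample beaconed to 45.77.1.23 and update.evil-cdn.net, not 8.8.8.8."
    print(extract_iocs(report))
```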


2021, Vol 12 (5-2021), pp. 35-49
Author(s): Alexander V. Vicentiy, Maxim G. Shishaev

This paper considers the problem of extracting geoattributed entities from natural language texts in order to visualize the spatial relations of geographical objects. For visualization, we use the automated generation of schematic maps as subject-oriented components of geographic information systems. The paper describes an information technology that extracts geoattributed entities from natural language texts by combining several approaches: a neural network approach, a rule-based approach, and an approach based on lexico-syntactic patterns for the analysis of natural language texts. For data visualization, we propose to use automated geocoding tools in conjunction with the capabilities of modern geographic information systems. The result is a cartogram that displays the spatial relations of the objects mentioned in the text.
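
The "extract, then geocode" idea can be sketched as follows, assuming spaCy and geopy are installed and an English model is available. The paper combines a neural model, rules, and lexico-syntactic patterns; here plain spaCy NER stands in for that hybrid extractor.

```python
# Minimal sketch: find place names with generic NER, then geocode them.
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_sm")             # generic NER model (assumption)
geolocator = Nominatim(user_agent="geo-demo")  # any user_agent string works

def geo_entities(text: str):
    """Yield (name, latitude, longitude) for place names found in the text."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC"}:       # countries, cities, regions
            place = geolocator.geocode(ent.text)
            if place is not None:
                yield ent.text, place.latitude, place.longitude

for name, lat, lon in geo_entities("The expedition sailed from Murmansk to Arkhangelsk."):
    print(name, lat, lon)
```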


Author(s): Divyansh Shankar Mishra, Abhinav Agarwal, B. P. Swathi, K. C. Akshay

Abstract: The idea of semantically linked data, and the use of this linked data in modern computer applications, has been one of the most important aspects of Web 3.0. However, realizing this vision has been challenging because of the difficulties associated with building knowledge bases and using formal languages to query them. In this regard, SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language, is the most popular formal language for querying Linked Open Data and Resource Description Framework databases. Nonetheless, writing SPARQL queries is known to be difficult, even for experts. Natural language query formalization, which involves semantically parsing natural language queries into their formal language equivalents, has been an essential step in overcoming this steep learning curve. Recent work in the field has applied artificial intelligence (AI) techniques for language modelling with adequate accuracy. This paper discusses a design for creating a closed-domain ontology, which is then used by an AI-powered chatbot that performs natural language query formalization, using Rasa for entity extraction after intent recognition, to query the linked data. A precision-recall analysis is performed using built-in Rasa tools in conjunction with our own testing parameters, and the system achieves a precision of 0.78, recall of 0.79 and F1-score of 0.79, which are better than the current state of the art.
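
The query-formalization step can be sketched as below: once an intent has been recognized and an entity extracted (for example by Rasa), a SPARQL template is filled in and executed. The endpoint, template, and entity value are stand-ins; the paper queries its own closed-domain ontology rather than DBpedia.

```python
# Minimal sketch of intent + entity -> SPARQL template -> query execution.
from SPARQLWrapper import SPARQLWrapper, JSON

TEMPLATES = {
    # intent name -> SPARQL template with an {entity} slot (illustrative)
    "ask_abstract": """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {{
            ?s rdfs:label "{entity}"@en ;
               dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }} LIMIT 1
    """,
}

def formalize_and_run(intent: str, entity: str,
                      endpoint: str = "https://dbpedia.org/sparql"):
    """Fill the template for the detected intent and run it against the endpoint."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(TEMPLATES[intent].format(entity=entity))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

print(formalize_and_run("ask_abstract", "SPARQL"))
```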


2021, Vol 11 (22), pp. 10995
Author(s): Samir Rustamov, Aygul Bayramova, Emin Alasgarov

The rapid increase in conversational AI and user chat data has led to intensive development of dialogue management systems (DMS) for various industries. Yet for low-resource languages, such as Azerbaijani, very little research has been conducted. The main purpose of this work is to experiment with various DMS pipeline set-ups to decide on the most appropriate natural language understanding and dialogue manager settings. In our project, we designed and evaluated different DMS pipelines on conversational text data obtained from one of the leading retail banks in Azerbaijan. The two main components of a DMS, natural language understanding (NLU) and the dialogue manager, have been investigated. In the first step of NLU, we utilized a language identification (LI) component for language detection, investigating both built-in LI methods such as fastText and custom machine learning (ML) models trained on a domain-specific dataset. The second step was a comparison of classic ML classifiers (logistic regression, neural networks, and SVM) and the Dual Intent and Entity Transformer (DIET) architecture for user intention detection. In these experiments we used different combinations of feature extractors, such as CountVectorizer, Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer, and word embeddings, over both word and character n-gram tokens. To extract important information from the text messages, a named entity recognition (NER) component was added to the pipeline. The best NER model was chosen among a conditional random fields (CRF) tagger, deep neural network (DNN) models, and the built-in entity extraction component of the DIET architecture. The obtained entity tags were fed to the dialogue management module as features. All NLU set-ups were followed by a dialogue management module containing a rule-based policy to handle FAQs and chitchat, as well as a Transformer Embedding Dialogue (TED) policy to handle more complex and unexpected dialogue inputs. As a result, we suggest a DMS pipeline for a financial assistant that is capable of identifying the intention, named entities, and language of a text, followed by policies that generate a proper response (based on the designed dialogues) and suggest the best next action.
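
One of the classic-ML intent classifiers compared in the paper can be sketched as a character n-gram TF-IDF representation fed to logistic regression. The tiny training set below is purely illustrative; the study used real banking chat data in Azerbaijani.

```python
# Minimal sketch: char n-gram TF-IDF + logistic regression for intent detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["what is my card balance", "block my card please",
          "card balance please", "please block the lost card"]
labels = ["check_balance", "block_card", "check_balance", "block_card"]

# Character n-grams are robust to rich morphology and typos,
# which is useful for low-resource-language chat messages.
intent_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
intent_clf.fit(texts, labels)
print(intent_clf.predict(["could you block my card"]))
```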


2021, Vol 11 (1)
Author(s): Dandan Tao, Dongyu Zhang, Ruofan Hu, Elke Rundensteiner, Hao Feng

Abstract: Foodborne outbreaks are a serious but preventable threat to public health that often lead to illness, loss of life, significant economic loss, and the erosion of consumer confidence. Understanding how consumers respond when interacting with foods, and extracting information from posts on social media, may provide new means of reducing the risks and curtailing the outbreaks. In recent years, Twitter has been employed as a new tool for identifying unreported foodborne illnesses. However, there is a huge gap between the identification of sporadic illnesses and the early detection of a potential outbreak. In this work, a dual-task BERTweet model was developed to identify unreported foodborne illnesses and extract foodborne-illness-related entities from Twitter. Unlike previous methods, our model leveraged the mutually beneficial relationship between the two tasks. The results showed that the F1-score of relevance prediction was 0.87, and the F1-score of entity extraction was 0.61. Key elements such as time, location, and food detected in sentences indicating foodborne illness were used to analyze potential foodborne outbreaks in massive historical tweets. A case study on tweets indicating foodborne illness showed that the discovered trend is consistent with the true outbreaks that occurred during the same period.
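
A dual-task model of this kind can be sketched with two heads over a shared BERTweet encoder: one classifies whether a tweet reports a foodborne illness, the other tags tokens with entity labels. The checkpoint is the public vinai/bertweet-base model; the label set and head sizes here are placeholders, not the authors' exact configuration.

```python
# Minimal sketch of a dual-task (sequence + token classification) model over BERTweet.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualTaskBERTweet(nn.Module):
    def __init__(self, num_entity_labels: int = 9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/bertweet-base")
        hidden = self.encoder.config.hidden_size
        self.relevance_head = nn.Linear(hidden, 2)               # tweet-level relevance
        self.entity_head = nn.Linear(hidden, num_entity_labels)  # token-level tags

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state                     # (batch, tokens, hidden)
        relevance_logits = self.relevance_head(token_states[:, 0])  # first-token summary
        entity_logits = self.entity_head(token_states)
        return relevance_logits, entity_logits

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
batch = tokenizer(["got food poisoning after the tacos downtown last night"],
                  return_tensors="pt")
model = DualTaskBERTweet()
rel, ent = model(batch["input_ids"], batch["attention_mask"])
print(rel.shape, ent.shape)
```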


2021
Author(s): Fatemeh Shah-Mohammadi, Wanting Cui, Joseph Finkelstein

2021
Author(s): Qianying Wang, Jing Liao, Mirella Lapata, Malcolm Macleod

Abstract
Background: Natural language processing could assist with multiple tasks in systematic reviews and reduce workload, including the extraction of PICO elements such as study populations, interventions and outcomes. The PICO framework provides a basis for retrieving and selecting published evidence relevant to a specific systematic review question, and automated approaches to PICO extraction have been developed, particularly for reviews of clinical trial findings. Given the differences between preclinical animal studies and clinical trials, developing separate approaches is necessary. Facilitating preclinical systematic reviews will inform the translation from preclinical to clinical research.
Methods: We randomly selected 400 abstracts from the PubMed Central Open Access database that described in vivo animal research and manually annotated them with PICO phrases for Species, Strain, model Induction, Intervention, Comparator and Outcome. We developed a two-stage workflow for preclinical PICO extraction. First, we fine-tuned BERT with different pre-trained modules for PICO sentence classification. Then, after removing text irrelevant to PICO features, we explored LSTM-, CRF- and BERT-based models for PICO entity recognition. We also explored a self-training approach because of the small training corpus.
Results: For PICO sentence classification, BERT models using all pre-trained modules achieved an F1 score over 80%, and models pre-trained on PubMed abstracts achieved the highest F1 of 85%. For PICO entity recognition, fine-tuning BERT pre-trained on PubMed abstracts achieved an overall F1 of 71%, with satisfactory F1 for Species (98%), Strain (70%), Intervention (70%) and Outcome (67%). The scores for Induction and Comparator were less satisfactory, but the F1 for Comparator could be improved to 50% by applying self-training.
Conclusions: Our study indicates that, of the approaches tested, BERT pre-trained on PubMed abstracts is the best for both PICO sentence classification and PICO entity recognition in preclinical abstracts. Self-training yields better performance for identifying comparators and strains.
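
Stage one of such a workflow, PICO sentence classification with a PubMed-pretrained BERT, can be sketched as follows. The checkpoint name and label set are assumptions for illustration; the paper compared several pre-trained modules.

```python
# Minimal sketch of PICO sentence classification with a PubMed-pretrained BERT.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed stand-in
LABELS = ["not_pico", "population", "intervention", "comparator", "outcome"]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS))

sentences = ["Male C57BL/6 mice were subjected to middle cerebral artery occlusion."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Before fine-tuning, the classification head is randomly initialised and the
# predictions are meaningless; in practice the model is trained on the annotated
# abstracts with a standard cross-entropy objective.
with torch.no_grad():
    logits = model(**batch).logits
print(LABELS[int(logits.argmax(dim=-1)[0])])
```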


2021, pp. 763-771
Author(s): Bajeela Aejas, Abdelaziz Bouras, Abdelhak Belhi, Houssem Gasmi

Author(s): Yang Jiao, Jingru Han, Bohao Xu, Ming Xiao, Binnan Shen, ...
