A Distributed Event Extraction Framework for Large-Scale Unstructured Text

Abstract: Data mining techniques for extracting knowledge from text have been applied extensively to applications including question answering, document summarisation, event extraction and trend monitoring. However, current methods have mainly been tested on small-scale customised data sets for specific purposes. The availability of large volumes of data and high-velocity data streams (such as social media feeds) motivates the need to automatically extract knowledge from such data sources and to generalise existing approaches to more practical applications. Recently, several architectures have been proposed for what we call knowledge mining: integrating data mining for knowledge extraction from unstructured text (possibly making use of a knowledge base), and at the same time, consistently incorporating this new information into the knowledge base. After describing a number of existing knowledge mining systems, we review the state-of-the-art literature on both current text mining methods (emphasising stream mining) and techniques for the construction and maintenance of knowledge bases. In particular, we focus on mining entities and relations from unstructured text data sources, entity disambiguation, entity linking and question answering. We conclude by highlighting general trends in knowledge mining research and identifying problems that require further research to enable more extensive use of knowledge bases.


Detecting Covert Networks in Multilingual Groups: Evidence within a Virtual World

Journal of Virtual Worlds Research ◽

10.4101/jvwr.v9i2.7213 ◽

2016 ◽

Vol 9 (2) ◽

Author(s):

Janea Triplet ◽

Andrew Harrison ◽

Brian Mennecke ◽

Akmal Mirsadikov

Keyword(s):

Social Network ◽

Virtual World ◽

Large Scale ◽

A Priori ◽

Information Networks ◽

A Priori Knowledge ◽

Unstructured Text ◽

Network Analytics ◽

Covert Networks ◽

Information Providers

This paper introduces an approach for examining and organizing unstructured text to identify relationships between networks of individuals. The approach uses discourse analysis to identify information providers and recipients, and determines the structure of covert organizations irrespective of the language that facilitates conversations between members. It then applies social network analytics to determine the arrangement of a covert organization without any a priori knowledge of the network structure. The approach is tested and validated using communication data collected in a virtual world setting. Our analysis indicates that the proposed framework successfully detected the covert structure of three information networks, and their cliques, within an online gaming community during a simulation of a large-scale event.
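As a rough illustration of the network-analytic step described in this abstract, the sketch below builds an undirected communication graph from provider/recipient pairs and lists its maximal cliques with the basic Bron-Kerbosch algorithm. It assumes the pairs have already been produced by some upstream discourse analysis; the sample `messages`, the function names, and the choice of Bron-Kerbosch are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def build_graph(messages):
    """Build an undirected adjacency map from (provider, recipient) pairs."""
    graph = defaultdict(set)
    for provider, recipient in messages:
        if provider != recipient:  # ignore self-addressed messages
            graph[provider].add(recipient)
            graph[recipient].add(provider)
    return dict(graph)


def maximal_cliques(graph):
    """Return all maximal cliques using the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no node can extend it.
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & graph[node],
                   excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical provider/recipient pairs (not data from the paper).
messages = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
graph = build_graph(messages)
cliques = maximal_cliques(graph)
```

The basic recursion is exponential in the worst case, so a larger network would likely call for a pivoting variant or a graph library.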


Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization

PLoS ONE ◽

10.1371/journal.pone.0055814 ◽

2013 ◽

Vol 8 (4) ◽

pp. e55814 ◽

Cited By ~ 59

Author(s):

Sofie Van Landeghem ◽

Jari Björne ◽

Chih-Hsuan Wei ◽

Kai Hakala ◽

Sampo Pyysalo ◽

...

Keyword(s):

Large Scale ◽

Event Extraction ◽

Gene Normalization ◽

Multi Level


Event Geoparser with Pseudo-Location Entity Identification and Numerical Argument Extraction Implementation and Evaluation in Indonesian News Domain

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9120712 ◽

2020 ◽

Vol 9 (12) ◽

pp. 712

Author(s):

Agung Dewandaru ◽

Dwi Hendratmo Widyantoro ◽

Saiful Akbar

Keyword(s):

Topic Model ◽

Event Extraction ◽

Geographic Information Retrieval ◽

Unstructured Text ◽

Three Stages ◽

Entity Identification ◽

Choropleth Map ◽

Extraction Model ◽

Document Level ◽

Large Corpus

A geoparser is a fundamental component of a Geographic Information Retrieval (GIR) system; it performs toponym recognition, disambiguation, and geographic coordinate resolution on unstructured text. However, news articles that report several events across many place mentions in one document are not yet adequately handled by regular geoparsers, whose scope of resolution is either toponym-level or document-level. The capacity to detect multiple events and geolocate their true coordinates along with their numerical arguments is still missing from modern geoparsers, and even more so for Indonesian news corpora. We propose an event geoparser model with three stages of processing, which tightly integrates an event extraction model into geoparsing and provides a precise event-level resolution scope. The model casts geotagging and event extraction as sequence labeling and uses an LSTM-CRF inferencer equipped with features derived using an Aggregated Topic Model from a large corpus to increase generalizability. Through the proposed workflow and features, the geoparser significantly improves the identification of pseudo-location entities, resulting in a 23.43% increase in weighted F1 score compared to baseline gazetteer and POS tag features. As a side effect of event extraction, various numerical arguments are also extracted, and the output is easily projected onto a rich choropleth map from a single news document.


Prior Knowledge-Based Event Network for Chinese Text

International Journal of Digital Multimedia Broadcasting ◽

10.1155/2017/8594863 ◽

2017 ◽

Vol 2017 ◽

pp. 1-5

Author(s):

Yunyu Shi ◽

Jianfang Shan ◽

Xiang Liu ◽

Yongxiang Xia

Keyword(s):

Prior Knowledge ◽

Chinese Text ◽

Large Scale ◽

Text Processing ◽

Event Extraction ◽

Knowledge Based ◽

Text Understanding ◽

Text Information ◽

Data Source ◽

Lexical Relations

Text representation is a basic issue in text information processing, and events play an important role in text understanding; both have attracted the attention of scholars. The event network conceals lexical relations in events, and its edges express logical relations between events in a document. However, the events and relations are extracted from event-annotated text, which makes the approach hard to apply to automatic processing of large-scale text. In this paper, with the expanded CEC (Chinese Event Corpus) as the data source and prior knowledge of the manifestation rules of events and relations as the guide, we propose an event extraction method based on knowledge-based rules of event manifestation, in order to build the event network automatically and improve its text processing performance.


GlyFN: A Glyph-Aware Fusion Network for Distributed Chinese Event Detection

10.5121/csit.2021.110114 ◽

2021 ◽

Author(s):

Qi Zhai ◽

Zhigang Kan ◽

Linhui Feng ◽

Linbo Qiao ◽

Feng Liu

Keyword(s):

Event Detection ◽

Large Scale ◽

State Of The Art ◽

Language Model ◽

Special Kind ◽

Detection Task ◽

Experimental Results ◽

Large Scale Data ◽

Unstructured Text ◽

Scale Data

Recently, Chinese event detection has attracted more and more attention. As a special kind of hieroglyphics, Chinese glyphs are semantically useful but still unexplored in this task. In this paper, we propose a novel Glyph-Aware Fusion Network, named GlyFN. It introduces glyph information into the pre-trained language model representation. To obtain a better representation, we design a Vector Linear Fusion mechanism to fuse the two. Specifically, it first applies max-pooling to capture salient information, then uses a linear operation on the vectors to retain unique information. Moreover, for large-scale unstructured text, we distribute the data across different clusters in parallel. Finally, we conduct extensive experiments on ACE2005 and large-scale data. Experimental results show that GlyFN obtains increases of 7.48 (10.18%) and 6.17 (8.7%) in F1-score for trigger identification and classification, respectively, over the state-of-the-art methods. Furthermore, the event detection task for large-scale unstructured text can be accomplished efficiently through distribution.
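The paired figures in this abstract can be read as an absolute F1 gain followed by a relative gain. Under that reading, which is an assumption because the abstract does not define the percentages, the prior state-of-the-art scores can be back-calculated as roughly 73.5 and 70.9 F1. The helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(absolute_gain: float, relative_gain_pct: float) -> float:
    """Baseline b such that absolute_gain / b == relative_gain_pct / 100.

    Assumes the percentage is a relative improvement over the baseline.
    """
    return absolute_gain / (relative_gain_pct / 100.0)


# Gains as reported in the abstract (F1 points, relative percent).
ident_baseline = implied_baseline(7.48, 10.18)  # trigger identification
class_baseline = implied_baseline(6.17, 8.7)    # trigger classification

# Implied GlyFN scores under the same assumption.
ident_glyfn = ident_baseline + 7.48
class_glyfn = class_baseline + 6.17

print(f"identification: {ident_baseline:.2f} -> {ident_glyfn:.2f}")
print(f"classification: {class_baseline:.2f} -> {class_glyfn:.2f}")
```

These are derived estimates for orientation only; the paper's own result tables remain the authoritative source for the actual scores.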


DuEE: A Large-Scale Dataset for Chinese Event Extraction in Real-World Scenarios

Natural Language Processing and Chinese Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-030-60457-8_44 ◽

2020 ◽

pp. 534-545

Author(s):

Xinyu Li ◽

Fayuan Li ◽

Lu Pan ◽

Yuguang Chen ◽

Weihua Peng ◽

...

Keyword(s):

Real World ◽

Large Scale ◽

Event Extraction ◽

Large Scale Dataset


Event Extraction from Unstructured Text Data

Lecture Notes in Computer Science - Database and Expert Systems Applications ◽

10.1007/978-3-319-22849-5_38 ◽

2015 ◽

pp. 543-557 ◽

Cited By ~ 2

Author(s):

Chao Shang ◽

Anand Panangadan ◽

Viktor K. Prasanna

Keyword(s):

Event Extraction ◽

Text Data ◽

Unstructured Text


Enriching contextualized language model from knowledge graph for biomedical information extraction

Briefings in Bioinformatics ◽

10.1093/bib/bbaa110 ◽

2020 ◽

Author(s):

Hao Fei ◽

Yafeng Ren ◽

Yue Zhang ◽

Donghong Ji ◽

Xiaohui Liang

Keyword(s):

Information Extraction ◽

Large Scale ◽

Language Model ◽

Relation Extraction ◽

Event Extraction ◽

Entity Recognition ◽

Language Models ◽

Training Procedure ◽

Biomedical Knowledge ◽

Biomedical Texts

Abstract: Biomedical information extraction (BioIE) is an important task. The aim is to analyze biomedical texts and extract structured information such as named entities and semantic relations between them. In recent years, pre-trained language models have largely improved the performance of BioIE. However, they neglect to incorporate external structural knowledge, which can provide rich factual information to support the underlying understanding and reasoning for biomedical information extraction. In this paper, we first evaluate current extraction methods, including vanilla neural networks, general language models and pre-trained contextualized language models, on biomedical information extraction tasks, including named entity recognition, relation extraction and event extraction. We then propose to enrich a contextualized language model by integrating large-scale biomedical knowledge graphs (the resulting model is named BioKGLM). To encode knowledge effectively, we explore a three-stage training procedure and introduce different fusion strategies to facilitate knowledge injection. Experimental results on multiple tasks show that BioKGLM consistently outperforms state-of-the-art extraction models. A further analysis shows that BioKGLM can capture the underlying relations between biomedical knowledge concepts, which are crucial for BioIE.


Optimization of hierarchical reinforcement learning relationship extraction model

Information Discovery and Delivery ◽

10.1108/idd-01-2020-0005 ◽

2020 ◽

Vol 48 (3) ◽

pp. 129-136

Author(s):

Qihang Wu ◽

Daifeng Li ◽

Lu Huang ◽

Biyun Ye

Keyword(s):

Reinforcement Learning ◽

Large Scale ◽

Relation Extraction ◽

Unified Framework ◽

Data Set ◽

Joint Learning ◽

Content Type ◽

Hierarchical Reinforcement Learning ◽

Learning Framework ◽

Unstructured Text

Purpose: Entity relation extraction is an important research direction for obtaining structured information. However, most current methods determine the relations between entities in a given sentence in a stepwise manner, seldom treating entities and relations within a unified framework. The joint learning method is an optimal solution that combines relations and entities. This paper aims to optimize the hierarchical reinforcement learning framework and provide an efficient model for extracting entity relations.

Design/methodology/approach: This paper is based on the hierarchical reinforcement learning framework of joint learning and combines the model with BERT, the best language representation model, to optimize the word embedding and encoding process. In addition, this paper adjusts some punctuation marks to make the data set more standardized, and introduces positional information to improve the performance of the model.

Findings: Experiments show that the model proposed in this paper outperforms the baseline model with a 13% improvement and achieves an F1 score of 0.742 on the NYT10 data set. The model can effectively extract entities and relations from large-scale unstructured text and can be applied to multi-domain information retrieval, intelligent understanding and intelligent interaction.

Originality/value: The research provides an efficient solution for researchers in different domains to use artificial intelligence (AI) technologies to process their unstructured text more accurately.
