Extracting relational facts based on hybrid Syntax-Guided transformer and pointer network

Joint extraction of entities and relations from unstructured text is an essential step in constructing a knowledge base. However, the relational facts in such text are often complex, and most involve overlapping triplets, so joint extraction remains challenging. This paper proposes a novel Sequence-to-Sequence (Seq2Seq) framework that handles the overlapping issue by modeling triplet extraction as a sequence generation task. Specifically, a cascade structure connects a transformer and a pointer network to extract entities and relations jointly. Sequences can thus be generated at the triplet level, which speeds up decoding. In addition, a syntax-guided encoder explicitly integrates the sentence's syntactic structure into the transformer encoder, helping the encoder attend more accurately to syntax-related words. Extensive experiments on three public datasets (NYT24, NYT29, and WebNLG) show the validity of the model in comparison with various baselines. When a pre-trained BERT model is used as the encoder, the F1 scores on the three datasets surpass the strongest baseline by 5.7%, 5.6%, and 4.4%, respectively.
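
The triplet-level generation described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the example sentence, the relation labels, and the target layout (head start, head end, tail start, tail end, relation) are assumptions made for illustration. It shows how two triplets that share the same entity pair become separate decoding steps, each pointing back into the source sentence.

```python
from typing import List, Tuple

# (head entity, relation, tail entity); all names here are hypothetical.
Triplet = Tuple[str, str, str]
# One decoding step: pointer indices for head and tail spans plus a relation label.
PointerTarget = Tuple[int, int, int, int, str]


def find_span(tokens: List[str], entity: str) -> Tuple[int, int]:
    """Return inclusive (start, end) token indices of the first match of entity."""
    words = entity.split()
    n = len(words)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == words:
            return i, i + n - 1
    raise ValueError(f"entity not found in sentence: {entity!r}")


def to_pointer_targets(tokens: List[str], triplets: List[Triplet]) -> List[PointerTarget]:
    """Linearize triplets so that each target step encodes one whole triplet."""
    targets = []
    for head, relation, tail in triplets:
        head_start, head_end = find_span(tokens, head)
        tail_start, tail_end = find_span(tokens, tail)
        targets.append((head_start, head_end, tail_start, tail_end, relation))
    return targets


# Illustrative sentence with two triplets that overlap on the same entity pair.
tokens = "Paris is the capital of France".split()
triplets = [("Paris", "capital_of", "France"), ("Paris", "located_in", "France")]
targets = to_pointer_targets(tokens, triplets)
print(targets)
```

Because each step carries a complete triplet, the target sequence has one element per triplet rather than one per token, which is the property the abstract credits for faster decoding.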

Data mining for building knowledge bases: techniques, architectures and applications

The Knowledge Engineering Review ◽

10.1017/s0269888916000047 ◽

2016 ◽

Vol 31 (2) ◽

pp. 97-123 ◽

Cited By ~ 4

Author(s):

Alfred Krzywicki ◽

Wayne Wobcke ◽

Michael Bain ◽

John Calvo Martinez ◽

Paul Compton

Keyword(s):

Data Mining ◽

Knowledge Base ◽

Question Answering ◽

Knowledge Bases ◽

Event Extraction ◽

Data Sources ◽

Small Scale ◽

Knowledge Mining ◽

Practical Applications ◽

Unstructured Text

Data mining techniques for extracting knowledge from text have been applied extensively to applications including question answering, document summarisation, event extraction and trend monitoring. However, current methods have mainly been tested on small-scale customised data sets for specific purposes. The availability of large volumes of data and high-velocity data streams (such as social media feeds) motivates the need to automatically extract knowledge from such data sources and to generalise existing approaches to more practical applications. Recently, several architectures have been proposed for what we call knowledge mining: integrating data mining for knowledge extraction from unstructured text (possibly making use of a knowledge base), and at the same time, consistently incorporating this new information into the knowledge base. After describing a number of existing knowledge mining systems, we review the state-of-the-art literature on both current text mining methods (emphasising stream mining) and techniques for the construction and maintenance of knowledge bases. In particular, we focus on mining entities and relations from unstructured text data sources, entity disambiguation, entity linking and question answering. We conclude by highlighting general trends in knowledge mining research and identifying problems that require further research to enable more extensive use of knowledge bases.

Twenty-five years of information extraction

Natural Language Engineering ◽

10.1017/s1351324919000512 ◽

2019 ◽

Vol 25 (06) ◽

pp. 677-692

Author(s):

Ralph Grishman

Keyword(s):

Neural Networks ◽

Information Extraction ◽

Data Base ◽

Information Content ◽

Structured Data ◽

Essential Step ◽

The Past ◽

Unstructured Text ◽

The Us ◽

Us Government

Information extraction is the process of converting unstructured text into a structured data base containing selected information from the text. It is an essential step in making the information content of the text usable for further processing. In this paper, we describe how information extraction has changed over the past 25 years, moving from hand-coded rules to neural networks, with a few stops on the way. We connect these changes to research advances in NLP and to the evaluations organized by the US Government.

On Completing Sparse Knowledge Base with Transitive Relation Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013125 ◽

2019 ◽

Vol 33 ◽

pp. 3125-3132

Author(s):

Zili Zhou ◽

Shaowu Liu ◽

Guandong Xu ◽

Wu Zhang

Keyword(s):

Computational Complexity ◽

Knowledge Base ◽

Transitive Relation ◽

Popular Approach ◽

New Model ◽

Sparsity Problem ◽

Public Datasets

Multi-relation embedding is a popular approach to knowledge base completion that learns embedding representations of entities and relations to compute the plausibility of missing triplets. The effectiveness of the embedding approach depends on the sparsity of the KB and falls for infrequent entities that appear only a few times. This paper addresses this issue by proposing a new model that exploits entity-independent transitive relation patterns, namely Transitive Relation Embedding (TRE). The TRE model alleviates the sparsity problem when predicting on infrequent entities while retaining the generalisation power of embedding. Experiments on three public datasets against seven baselines showed the merits of TRE in terms of both knowledge base completion accuracy and computational complexity.
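
To make "computing the plausibility of a missing triplet" concrete, here is a minimal sketch of a translation-based (TransE-style) scoring function. This is a well-known baseline embedding model, not the TRE model proposed in the paper. The two-dimensional embedding values below are hand-picked for illustration rather than learned.

```python
import math
from typing import Sequence


def transe_score(head: Sequence[float], relation: Sequence[float], tail: Sequence[float]) -> float:
    """TransE-style plausibility: negative L2 distance between (head + relation) and tail.

    Scores closer to zero mean the triplet is considered more plausible.
    """
    squared = sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    return -math.sqrt(squared)


# Hand-picked toy embeddings (assumptions, not trained values).
head = [1.0, 0.0]
relation = [0.0, 1.0]
plausible_tail = [1.0, 1.0]    # equals head + relation exactly
implausible_tail = [0.0, 0.0]  # far from head + relation

print(transe_score(head, relation, plausible_tail))
print(transe_score(head, relation, implausible_tail))
```

A model of this kind needs enough training triplets per entity to place its embedding well, which is why, as the abstract notes, effectiveness drops for infrequent entities.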

Recursive sequence generation in monkeys, children, U.S. adults, and native Amazonians

Science Advances ◽

10.1126/sciadv.aaz1002 ◽

2020 ◽

Vol 6 (26) ◽

pp. eaaz1002 ◽

Cited By ~ 3

Author(s):

Stephen Ferrigno ◽

Samuel J. Cheyette ◽

Steven T. Piantadosi ◽

Jessica F. Cantlon

Keyword(s):

Training Data ◽

Recursive Structure ◽

Generation Task ◽

Sequence Generation ◽

Human Thought ◽

Bayesian Mixture Model ◽

Indigenous Group ◽

Recursive Sequence ◽

Bayesian Mixture ◽

Recursive Structures

The question of what computational capacities, if any, differ between humans and nonhuman animals has been at the core of foundational debates in cognitive psychology, anthropology, linguistics, and animal behavior. The capacity to form nested hierarchical representations is hypothesized to be essential to uniquely human thought, but its origins in evolution, development, and culture are controversial. We used a nonlinguistic sequence generation task to test whether subjects generalize sequential groupings of items to a center-embedded, recursive structure. Children (3 to 5 years old), U.S. adults, and adults from a Bolivian indigenous group spontaneously induced recursive structures from ambiguous training data. In contrast, monkeys did so only with additional exposure. We quantify these patterns using a Bayesian mixture model over logically possible strategies. Our results show that recursive hierarchical strategies are robust in human thought, both early in development and across cultures, but the capacity itself is not unique to humans.

Effect of Cognitive Load on a Random Sequence Generation Task

2019 IEEE Region 10 Symposium (TENSYMP) ◽

10.1109/tensymp46218.2019.8971081 ◽

2019 ◽

Author(s):

Rahul Gavas ◽

Debatri Chatterjee ◽

Aniruddha Sinha ◽

Sanjoy Kumar Saha

Keyword(s):

Cognitive Load ◽

Random Sequence ◽

Generation Task ◽

Sequence Generation ◽

Random Sequence Generation

Analyzing the Mechanism and Effect of Acid Protease in Wet blue Bating Process for Leather Production

Journal of the American Leather Chemists Association ◽

10.34314/jalca.v115i1.1463 ◽

2020 ◽

Vol 115 (1) ◽

pp. 10-15

Author(s):

Hao Li ◽

Deyi Zhu ◽

Yanchun Li ◽

Shan Cao ◽

Jing Xiao

Keyword(s):

Molecular Weight ◽

Raw Materials ◽

Collagen Fibers ◽

Acid Protease ◽

Excellent Performance ◽

Functional Protein ◽

Essential Step ◽

Production Safety ◽

Chromium Tolerance ◽

Chromium Resistance

In recent years, to reduce the pollution produced in the beam-house and tanning sections, more and more tanneries have purchased wet blue from factories in other regions and used it directly as the raw material for finished leather production, thereby eliminating those polluting preliminary steps. The wet blue bating process is therefore an essential step for minimizing the differences among wet blue purchased from different regions. In this study, the properties of different acid proteases are analyzed to select a protease suitable for wet blue bating. Analysis of the chromium tolerance of different acid proteases reveals that L1 and L4, produced from Aspergillus, have higher chromium resistance than those produced from Bacillus. The effects of L1 and L4 on wet blue and collagen show that L1 performs better; the molecular weight of its functional protein is 48 kDa. SEM and MCT analysis shows that L1 can successfully disperse the collagen fibers of wet blue. Furthermore, the biodegradation rates of collagen and elastin were 0.006‰ and 0.5‰, respectively, indicating that the acid protease mainly degrades elastin rather than collagen in the bating process, thereby ensuring production safety. This paper provides an important reference for the application of acid protease in the bating process and a basis for further study of its mechanism.

A Statistical Approach for Knowledge Discovery: Bootstrapped Analysis of Language Models for Knowledge base Population from Unstructured Text

Scientia Iranica ◽

10.24200/sci.2018.20198 ◽

2018 ◽

Vol 0 (0) ◽

pp. 0-0

Author(s):

Saeedeh Momtazi ◽

Omid Moradiannasab

Keyword(s):

Knowledge Base ◽

Knowledge Discovery ◽

Statistical Approach ◽

Language Models ◽

Base Population ◽

Unstructured Text ◽

Knowledge Base Population

AEMF: An Attention-Based Efficient and Multifeature Fast Text Detector

Complexity ◽

10.1155/2021/9958333 ◽

2021 ◽

Vol 2021 ◽

pp. 1-8

Author(s):

Wanqi Ma ◽

Chaoyu Yang ◽

Jie Yang ◽

Jian Wu

Keyword(s):

Semantic Segmentation ◽

Detection Methods ◽

Excellent Performance ◽

Important Data ◽

Original Algorithm ◽

The Past ◽

Related Information ◽

Different Shapes ◽

Public Datasets ◽

F Measure

Labels on industrial commodity packaging usually contain important data, such as production date, manufacturer, and other commodity-related information. Such labels are essential for consumers purchasing goods, support commodity supervision, and reveal potential product safety problems. Consequently, packaging label detection, as the prerequisite for product label identification, has become a very useful application and has achieved promising results in the past decades. Yet, in complex industrial scenarios, traditional detection methods often fail to meet the requirements, suffering from low accuracy and efficiency. In this paper, we propose a multifeature fast and attention-based algorithm that combines area suggestion and semantic segmentation: an attention-based efficient and multifeature fast text detector (termed AEMF). The proposed approach fuses segmentation branches and detection branches with each other. Building on an original algorithm that can only detect text in arbitrary directions, it can detect text of different shapes with better accuracy. The algorithm also works better on long-text detection. It was evaluated on the ICDAR2015, CTW1500, and MSRA-TD500 public datasets. The experimental results show that the proposed multifeature fusion with a self-attention module makes the algorithm more accurate and efficient than existing algorithms. On the MSRA-TD500 dataset, AEMF achieves an F-measure of 72.3% at 8 frames per second (FPS). On the CTW1500 dataset, it achieves an F-measure of 62.3% at 23 FPS. On the ICDAR2015 dataset, it achieves an F-measure of 79.3% at 16 FPS, demonstrating excellent performance in detecting label text on industrial packaging.
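
For readers unfamiliar with the metric reported above, the following sketch computes the balanced F-measure (F1) as the harmonic mean of precision and recall. The abstract does not state which variant was used, so the balanced form is an assumption, and the precision and recall values in the usage line are illustrative inputs, not figures from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when the detector finds nothing correct.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: a detector with 90% precision and 60% recall.
print(round(f_measure(0.9, 0.6), 3))
```

The harmonic mean penalizes an imbalance between precision and recall, so a detector cannot score well by optimizing only one of them.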

Generative Language Modeling for Antibody Design

10.1101/2021.12.13.472419 ◽

2021 ◽

Author(s):

Richard W. Shuai ◽

Jeffrey A. Ruffolo ◽

Jeffrey J. Gray

Keyword(s):

Language Model ◽

Successful Development ◽

Generation Task ◽

Sequence Generation ◽

Generative Language ◽

Antibody Libraries ◽

Antibody Design ◽

Low Solubility ◽

High Immunogenicity ◽

Chain Type

Successful development of monoclonal antibodies (mAbs) for therapeutic applications is hindered by developability issues such as low solubility, low thermal stability, high aggregation, and high immunogenicity. The discovery of more developable mAb candidates relies on high-quality antibody libraries for isolating candidates with desirable properties. We present Immunoglobulin Language Model (IgLM), a deep generative language model for generating synthetic libraries by re-designing variable-length spans of antibody sequences. IgLM formulates antibody design as an autoregressive sequence generation task based on text-infilling in natural language. We trained IgLM on approximately 558M antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species-of-origin. We demonstrate that IgLM can be applied to generate synthetic libraries that may accelerate the discovery of therapeutic antibody candidates.
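
The infilling formulation can be sketched as follows. This is a hypothetical illustration rather than IgLM's actual preprocessing: the special-token names ([MASK], [SEP], [ANS], [HEAVY], [HUMAN]), their ordering, and the short residue string are assumptions. The idea shown is that a span is cut out, replaced by a mask token, and appended after a separator, so that a left-to-right model can generate the span while seeing both its left and right context.

```python
from typing import List


def make_infilling_example(sequence: str, start: int, end: int,
                           chain_tag: str, species_tag: str) -> List[str]:
    """Build one infilling training sequence from a residue string.

    The span sequence[start:end] is replaced by [MASK] and moved after [SEP],
    followed by [ANS] to mark where generation of the span should stop.
    """
    if not 0 <= start < end <= len(sequence):
        raise ValueError("span must be non-empty and lie inside the sequence")
    span = list(sequence[start:end])
    masked = list(sequence[:start]) + ["[MASK]"] + list(sequence[end:])
    # Conditioning tags come first so every later token can depend on them.
    return [chain_tag, species_tag] + masked + ["[SEP]"] + span + ["[ANS]"]


# Illustrative residues and tags only; not taken from the training data.
example = make_infilling_example("EVQLVES", 2, 4, "[HEAVY]", "[HUMAN]")
print(" ".join(example))
```

At generation time, a model trained on such sequences would be given everything up to [SEP] and sampled until it emits [ANS], with the sampled tokens substituted back at the mask position.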

MEnet: A Metric Expression Network for Salient Object Segmentation

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/83 ◽

2018 ◽

Cited By ~ 7

Author(s):

Shulian Cai ◽

Jiabin Huang ◽

Delu Zeng ◽

Xinghao Ding ◽

John Paisley

Keyword(s):

Metric Space ◽

Object Segmentation ◽

Experimental Results ◽

Salient Object ◽

Excellent Performance ◽

Deep Network ◽

Salient Region ◽

Latent Space ◽

Salient Object Segmentation ◽

Public Datasets

Recent CNN-based saliency models have achieved excellent performance on public datasets, but most are sensitive to distortions from noise or compression. In this paper, we propose an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) to overcome this drawback. We construct a topological metric space where the implicit metric is determined by a deep network. In this latent space, we can group the pixels of an observed image semantically into two regions, based on whether they belong to a salient or a non-salient region of the image. We carry out all feature extraction at the pixel level, which makes the output boundaries of the salient object fine-grained. Experimental results show that the proposed metric can generate robust saliency maps that allow for object segmentation. By testing the method on several public benchmarks, we show that MEnet achieves excellent results. We also demonstrate that the proposed method outperforms previous CNN-based methods on distorted images.
