Data mining for building knowledge bases: techniques, architectures and applications

Alfred Krzywicki; Wayne Wobcke; Michael Bain; John Calvo Martinez; Paul Compton

doi:10.1017/s0269888916000047

Data mining for building knowledge bases: techniques, architectures and applications

The Knowledge Engineering Review ◽

10.1017/s0269888916000047 ◽

2016 ◽

Vol 31 (2) ◽

pp. 97-123 ◽

Cited By ~ 4

Author(s):

Alfred Krzywicki ◽

Wayne Wobcke ◽

Michael Bain ◽

John Calvo Martinez ◽

Paul Compton

Keyword(s):

Data Mining ◽

Knowledge Base ◽

Question Answering ◽

Knowledge Bases ◽

Event Extraction ◽

Data Sources ◽

Small Scale ◽

Knowledge Mining ◽

Practical Applications ◽

Unstructured Text

Abstract: Data mining techniques for extracting knowledge from text have been applied extensively to applications including question answering, document summarisation, event extraction and trend monitoring. However, current methods have mainly been tested on small-scale customised data sets for specific purposes. The availability of large volumes of data and high-velocity data streams (such as social media feeds) motivates the need to automatically extract knowledge from such data sources and to generalise existing approaches to more practical applications. Recently, several architectures have been proposed for what we call knowledge mining: integrating data mining for knowledge extraction from unstructured text (possibly making use of a knowledge base), and at the same time, consistently incorporating this new information into the knowledge base. After describing a number of existing knowledge mining systems, we review the state-of-the-art literature on both current text mining methods (emphasising stream mining) and techniques for the construction and maintenance of knowledge bases. In particular, we focus on mining entities and relations from unstructured text data sources, entity disambiguation, entity linking and question answering. We conclude by highlighting general trends in knowledge mining research and identifying problems that require further research to enable more extensive use of knowledge bases.

Improving the Quality of Linked Data Using Statistical Distributions

Information Retrieval and Management ◽

10.4018/978-1-5225-5191-1.ch074 ◽

2018 ◽

pp. 1638-1664 ◽

Cited By ~ 1

Author(s):

Heiko Paulheim ◽

Christian Bizer

Keyword(s):

Knowledge Base ◽

Linked Data ◽

Relational Databases ◽

Knowledge Bases ◽

Structured Data ◽

Data Sources ◽

Data Sets ◽

Statistical Distributions ◽

The Web

Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate as well as scalable. Both algorithms have been used for building the DBpedia 3.9 release: With SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
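The idea of inferring missing types from the statistical distribution of properties can be illustrated with a small sketch. This is not the published SDType implementation: it assumes uniform weights for all properties, looks only at properties where the resource is the subject, and uses a hypothetical acceptance threshold of 0.5. The function names and the toy triples are invented for illustration.

```python
from collections import Counter, defaultdict


def learn_type_distributions(triples, types):
    """For each property, estimate P(type | resource is the subject of that property)."""
    counts = defaultdict(Counter)  # property -> Counter of subject types
    totals = Counter()             # property -> number of typed subjects seen
    for subject, prop, _obj in triples:
        subject_types = types.get(subject)
        if not subject_types:
            continue  # untyped subjects contribute no evidence
        totals[prop] += 1
        for t in subject_types:
            counts[prop][t] += 1
    return {p: {t: c / totals[p] for t, c in counts[p].items()} for p in counts}


def predict_types(resource, triples, distributions, threshold=0.5):
    """Average the per-property type distributions over the resource's properties.

    Uniform weighting and the threshold value are assumptions of this sketch.
    """
    props = [p for s, p, _o in triples if s == resource and p in distributions]
    if not props:
        return {}
    scores = Counter()
    for p in props:
        for t, prob in distributions[p].items():
            scores[t] += prob / len(props)
    return {t: s for t, s in scores.items() if s >= threshold}


# Hypothetical toy data: "Hamburg" has no type statement.
triples = [
    ("Berlin", "population", "3.6M"), ("Berlin", "mayor", "X"),
    ("Paris", "population", "2.1M"), ("Paris", "mayor", "Y"),
    ("Hamburg", "population", "1.8M"), ("Hamburg", "mayor", "Z"),
    ("Einstein", "birthPlace", "Ulm"),
]
types = {"Berlin": {"City"}, "Paris": {"City"}, "Einstein": {"Person"}}

distributions = learn_type_distributions(triples, types)
predicted = predict_types("Hamburg", triples, distributions)
```

In this toy data every typed subject of `population` and `mayor` is a `City`, so the untyped resource receives that type. No external knowledge is consulted, which matches the abstract's description of operating only on the data itself.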

Formal Query Building with Query Structure Prediction for Complex Question Answering over Knowledge Base

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/519 ◽

2020 ◽

Author(s):

Yongrui Chen ◽

Huiying Li ◽

Yuncheng Hua ◽

Guilin Qi

Keyword(s):

Knowledge Base ◽

Structure Prediction ◽

Question Answering ◽

State Transition ◽

Knowledge Bases ◽

Second Stage ◽

Transition Strategy ◽

Query Structure ◽

Two Stages ◽

Complex Question

Formal query building is an important part of complex question answering over knowledge bases. It aims to build correct executable queries for questions. Recent methods try to rank candidate queries generated by a state-transition strategy. However, this candidate generation strategy ignores the structure of queries, resulting in a considerable number of noisy queries. In this paper, we propose a new formal query building approach that consists of two stages. In the first stage, we predict the query structure of the question and leverage the structure to constrain the generation of the candidate queries. We propose a novel graph generation framework to handle the structure prediction task and design an encoder-decoder model to predict the argument of the predetermined operation in each generative step. In the second stage, we follow the previous methods to rank the candidate queries. The experimental results show that our formal query building approach outperforms existing methods on complex questions while staying competitive on simple questions.

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5962 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5182-5190

Author(s):

Pasquale Minervini ◽

Matko Bošnjak ◽

Tim Rocktäschel ◽

Sebastian Riedel ◽

Edward Grefenstette

Keyword(s):

Natural Language ◽

Link Prediction ◽

Question Answering ◽

Knowledge Bases ◽

Small Scale ◽

Reasoning Systems ◽

Novel Approach ◽

Real World Datasets ◽

Interpretable Models ◽

Machine Reading

Reasoning with knowledge expressed in natural language and Knowledge Bases (KBs) is a major challenge for Artificial Intelligence, with applications in machine reading, dialogue, and question answering. General neural architectures that jointly learn representations and transformations of text are very data-inefficient, and it is hard to analyse their reasoning process. These issues are addressed by end-to-end differentiable reasoning systems such as Neural Theorem Provers (NTPs), although they can only be used with small-scale symbolic KBs. In this paper we first propose Greedy NTPs (GNTPs), an extension to NTPs addressing their complexity and scalability limitations, thus making them applicable to real-world datasets. This result is achieved by dynamically constructing the computation graph of NTPs and including only the most promising proof paths during inference, thus obtaining orders of magnitude more efficient models. Then, we propose a novel approach for jointly reasoning over KBs and textual mentions, by embedding logic facts and natural language sentences in a shared embedding space. We show that GNTPs perform on par with NTPs at a fraction of their cost while achieving competitive link prediction results on large datasets, providing explanations for predictions, and inducing interpretable models.
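The greedy step of keeping only the most promising proof paths can be sketched as a top-k selection over fact embeddings. This is a minimal illustration, not the authors' code: the Gaussian similarity, the linear scan, the function names and the toy embeddings are all assumptions made for this example.

```python
import heapq
import math


def similarity(a, b):
    """Gaussian kernel on squared Euclidean distance (an assumed similarity measure)."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / 2)


def top_k_facts(goal, facts, k):
    """Return the names of the k facts whose embeddings are most similar to the goal.

    Only these facts would be expanded further, instead of every fact in the KB.
    """
    return heapq.nlargest(k, facts, key=lambda name: similarity(goal, facts[name]))


# Hypothetical two-dimensional embeddings for three facts.
facts = {
    "locatedIn(london, uk)": [1.0, 0.0],
    "locatedIn(paris, france)": [0.9, 0.1],
    "bornIn(ada, london)": [0.0, 1.0],
}
goal = [1.0, 0.05]

selected = top_k_facts(goal, facts, k=2)
```

With k fixed, the number of proof branches explored per goal no longer grows with the size of the knowledge base, which is the source of the efficiency gain the abstract describes.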

Introducing External Knowledge to Answer Questions with Implicit Temporal Constraints over Knowledge Base

Future Internet ◽

10.3390/fi12030045 ◽

2020 ◽

Vol 12 (3) ◽

pp. 45

Author(s):

Wenqing Wu ◽

Zhenfang Zhu ◽

Qiang Lu ◽

Dianyuan Zhang ◽

Qiangqiang Guo

Keyword(s):

Natural Language ◽

Knowledge Base ◽

Question Answering ◽

Knowledge Bases ◽

Temporal Information ◽

Temporal Constraints ◽

External Knowledge ◽

Question Answering Systems ◽

Natural Language Question ◽

Applied Knowledge

Knowledge base question answering (KBQA) aims to analyze the semantics of natural language questions and return accurate answers from the knowledge base (KB). More and more studies have applied knowledge bases to question answering systems, and when using a KB to answer a natural language question, there are some words that imply the tense (e.g., original and previous) and play a limiting role in questions. However, most existing methods for KBQA cannot model a question with implicit temporal constraints. In this work, we propose a model based on a bidirectional attentive memory network, which obtains the temporal information in the question through attention mechanisms and external knowledge. Specifically, we encode the external knowledge as vectors, and use additive attention between the question and external knowledge to obtain the temporal information, then further enhance the question vector to increase the accuracy. On the WebQuestions benchmark, our method not only performs better with the overall data, but also has excellent performance regarding questions with implicit temporal constraints, which are separate from the overall data. As we use attention mechanisms, our method also offers better interpretability.
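Additive attention between a question vector and a set of external knowledge vectors can be sketched as follows. This is a generic formulation, score_i = v · tanh(W_q q + W_k k_i) followed by a softmax, and is not taken from the paper's implementation; the parameter names `w_q`, `w_k`, `v` and the toy values are assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, w_q, w_k, v):
    """Return (attention weights, context vector) for one query over several keys."""
    q_proj = matvec(w_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(w_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of the keys.
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(keys[0]))]
    return weights, context


# Hypothetical toy inputs: a 2-d question vector and two knowledge vectors.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    w_q=identity,
    w_k=identity,
    v=[1.0, 1.0],
)
```

The resulting context vector could then be combined with the question vector to enhance it, as the abstract describes; how the combination is done is not specified here.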

SWFQA Semantic Web Based Framework for Question Answering

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2019010106 ◽

2019 ◽

Vol 9 (1) ◽

pp. 88-106

Author(s):

Irphan Ali ◽

Divakar Yadav ◽

Ashok Kumar Sharma

Keyword(s):

Semantic Web ◽

Natural Language ◽

Knowledge Base ◽

Language Processing ◽

Question Answering ◽

Knowledge Bases ◽

Digital Information ◽

Web Based ◽

Question Answering System ◽

User Query

A question answering system aims to provide correct and quick answers to users' queries from a knowledge base. Due to the growth of digital information on the web, information retrieval systems are the need of the day. Most recent question answering systems consult knowledge bases to answer a question, after parsing and transforming natural language queries to knowledge base-executable forms. In this article, the authors propose a semantic web-based approach for a question answering system that uses natural language processing to analyse and understand the user query. It employs a “Total Answer Relevance Score” to find the relevance of each answer returned by the system. The results obtained are quite promising. The real-time performance of the system has been evaluated on the answers extracted from the knowledge base.

CN-DBpedia2: An Extraction and Verification Framework for Enriching Chinese Encyclopedia Knowledge Base

Data Intelligence ◽

10.1162/dint_a_00017 ◽

2019 ◽

Vol 1 (3) ◽

pp. 271-288 ◽

Cited By ~ 2

Author(s):

Bo Xu ◽

Jiaqing Liang ◽

Chenhao Xie ◽

Bin Liang ◽

Lihan Chen ◽

...

Keyword(s):

Knowledge Base ◽

Search Engine ◽

Recommendation System ◽

Question Answering ◽

Knowledge Bases ◽

Crowd Sourcing ◽

High Confidence

Knowledge bases play an important role in machine understanding and have been widely used in various applications, such as search engines, recommendation systems and question answering. However, most knowledge bases are incomplete, which can cause many downstream applications to perform poorly because they cannot find the corresponding facts in the knowledge bases. In this paper, we propose an extraction and verification framework to enrich the knowledge bases. Specifically, based on the existing knowledge base, we first extract new facts from the description texts of entities. However, not all newly formed facts can be added directly to the knowledge base, because the extraction may introduce errors. We therefore propose a novel crowd-sourcing-based verification step to verify the candidate facts. Finally, we apply this framework to the existing knowledge base CN-DBpedia and construct a new version of the knowledge base, CN-DBpedia2, which additionally contains the high-confidence facts extracted from the description texts of entities.

Retrieve, Program, Repeat: Complex Knowledge Base Question Answering via Alternate Meta-learning

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/509 ◽

2020 ◽

Author(s):

Yuncheng Hua ◽

Yuan-Fang Li ◽

Gholamreza Haffari ◽

Guilin Qi ◽

Wei Wu

Keyword(s):

Knowledge Base ◽

Large Scale ◽

Question Answering ◽

Knowledge Bases ◽

Retrieval Model ◽

Test Question ◽

Weak Supervision ◽

Meta Learning ◽

Complex Knowledge ◽

Complex Question

A compelling approach to complex question answering is to convert the question to a sequence of actions, which can then be executed on the knowledge base to yield the answer, also known as the programmer-interpreter approach. Using training questions similar to the test question, meta-learning enables the programmer to adapt quickly to unseen questions and so tackle potential distributional biases. However, this comes at the cost of manually labeling similar questions to learn a retrieval model, which is tedious and expensive. In this paper, we present a novel method that automatically learns a retrieval model alternately with the programmer from weak supervision, i.e., the system’s performance with respect to the produced answers. To the best of our knowledge, this is the first attempt to train the retrieval model jointly with the programmer. Our system leads to state-of-the-art performance on a large-scale task for complex question answering over knowledge bases. We have released our code at https://github.com/DevinJake/MARL.
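The programmer-interpreter idea, in which a question is converted into a sequence of actions that is then executed on the knowledge base, can be sketched with a toy interpreter. The action names (`start`, `follow`, `count`), the KB layout and the example program are hypothetical and are not taken from the released code.

```python
# Hypothetical toy KB: (entity, relation) -> set of values.
KB = {
    ("France", "capital"): {"Paris"},
    ("Germany", "capital"): {"Berlin"},
    ("Paris", "population"): {"2100000"},
}


def execute(program, kb):
    """Run a list of (action, argument) steps against the KB and return the final set."""
    current = set()
    for action, arg in program:
        if action == "start":
            current = {arg}  # begin from a single topic entity
        elif action == "follow":
            # Follow a relation from every entity in the current set.
            current = set().union(*(kb.get((e, arg), set()) for e in current))
        elif action == "count":
            current = {len(current)}
        else:
            raise ValueError(f"unknown action: {action}")
    return current


# "What is the population of the capital of France?" as a program
# a programmer model might emit (assumed for illustration).
program = [("start", "France"), ("follow", "capital"), ("follow", "population")]
answer = execute(program, KB)
```

In this setting the programmer is the learned model that produces `program`, while `execute` plays the role of the interpreter. The weak supervision mentioned in the abstract corresponds to comparing `answer` with the gold answer rather than with a gold program.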

Improving the Quality of Linked Data Using Statistical Distributions

International Journal on Semantic Web and Information Systems ◽

10.4018/ijswis.2014040104 ◽

2014 ◽

Vol 10 (2) ◽

pp. 63-86 ◽

Cited By ~ 74

Author(s):

Heiko Paulheim ◽

Christian Bizer

Keyword(s):

Knowledge Base ◽

Linked Data ◽

Relational Databases ◽

Knowledge Bases ◽

Structured Data ◽

Data Sources ◽

Data Sets ◽

Statistical Distributions ◽

The Web

The role of semantics in mining frequent patterns from knowledge bases in description logics with rules

Theory and Practice of Logic Programming ◽

10.1017/s1471068410000098 ◽

2010 ◽

Vol 10 (3) ◽

pp. 251-289 ◽

Cited By ~ 21

Author(s):

Joanna Józefowska ◽

Agnieszka Ławrynowicz ◽

Tomasz Łukaszewski

Keyword(s):

Data Mining ◽

Semantic Web ◽

Knowledge Base ◽

Description Logics ◽

Pattern Discovery ◽

Knowledge Bases ◽

Frequent Pattern ◽

Frequent Patterns ◽

Representation Formalism

Abstract: We propose a new method for mining frequent patterns in a language that combines both Semantic Web ontologies and rules. In particular, we consider the setting of using a language that combines description logics (DLs) with DL-safe rules. This setting is important for the practical application of data mining to the Semantic Web. We focus on the relation of the semantics of the representation formalism to the task of frequent pattern discovery, and for the core of our method, we propose an algorithm that exploits the semantics of the combined knowledge base. We have developed a proof-of-concept data mining implementation of this. Using this we have empirically shown that using the combined knowledge base to perform semantic tests can make data mining faster by pruning useless candidate patterns before their evaluation. We have also shown that the quality of the set of patterns produced may be improved: the patterns are more compact, and there are fewer patterns. We conclude that exploiting the semantics of a chosen representation formalism is key to the design and application of (onto-)relational frequent pattern discovery methods.

A Hybrid Question Answering System

Current Journal of Applied Science and Technology ◽

10.9734/cjast/2019/v34i330129 ◽

2019 ◽

pp. 1-7

Author(s):

Waheeb Ahmed ◽

P. Babu Anto

Keyword(s):

Knowledge Base ◽

Web Search ◽

Question Answering ◽

Arabic Language ◽

Knowledge Bases ◽

External Resources ◽

Online Module ◽

Types Of Information ◽

Available Information ◽

F Measure

In this study, we propose a hybrid Question Answering (QA) system for the Arabic language. The system combines textual and structured knowledge base (KB) data for question answering. It makes use of other relevant text data, outside the KB, which could enrich the available information. The system consists of four modules: 1) a KB, 2) an online module, and 3) a Text-to-KB transformer to construct our own knowledge base from web texts. Using these modules, we can query two types of information sources: knowledge bases and web text. Text-to-KB uses web search results to identify question topic entities, map question words to KB predicates, and enhance the features of the candidates obtained from the KB. The system scored an F-measure of 0.495 when using the KB. The system performed better, with an F-measure of 0.573, when using both the KB and the Text-to-KB module. The system demonstrates higher performance by combining the knowledge base with text from external resources.
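For readers unfamiliar with the metric, the F-measure is the weighted harmonic mean of precision and recall. The sketch below shows the standard formula; the precision and recall values in the example are hypothetical, since the abstract reports only the final F-measure scores.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives the usual F1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values, not taken from the paper.
balanced = f_measure(0.5, 0.5)
skewed = f_measure(0.6, 0.4)
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot reach a high F-measure by maximising precision or recall alone.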
