An assertion and alignment correction framework for large scale knowledge bases

Various knowledge bases (KBs) have been constructed via information extraction from encyclopedias, text and tables, as well as alignment of multiple sources. Their usefulness and usability are often limited by quality issues. One common issue is the presence of erroneous assertions and alignments, often caused by lexical or semantic confusion. We study the problem of correcting such assertions and alignments, and present a general correction framework which combines lexical matching, context-aware sub-KB extraction, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated with one set of literal assertions from DBpedia, one set of entity assertions from an enterprise medical KB, and one set of mapping assertions from a music KB constructed by integrating Wikidata, Discogs and MusicBrainz. It achieves promising results, with a correction rate (i.e., the ratio of the target assertions/alignments that are corrected with right substitutes) of 70.1%, 60.9% and 71.8%, respectively.
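
The correction rate defined in the abstract above can be written as a simple ratio. The symbol T for the set of target assertions/alignments is notation assumed here for illustration; it does not come from the abstract.

```latex
\text{correction rate} = \frac{\left|\{\, a \in T : a \text{ is corrected with a right substitute} \,\}\right|}{|T|}
```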

Progress and Challenges on Entity Alignment of Geographic Knowledge Bases

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi8020077 ◽

2019 ◽

Vol 8 (2) ◽

pp. 77 ◽

Cited By ~ 9

Author(s):

Kai Sun ◽

Yunqiang Zhu ◽

Jia Song

Keyword(s):

Large Scale ◽

Knowledge Bases ◽

Evaluation Procedure ◽

Multiple Sources ◽

Alignment Algorithms ◽

Geographic Knowledge ◽

Heterogeneous Features ◽

Benchmark Datasets ◽

Alignment Process ◽

Made In

Geographic knowledge bases (GKBs) drawn from multiple sources and in multiple forms are clearly heterogeneous, which hinders the integration of geographic knowledge. Entity alignment provides an effective way to find correspondences between entities by measuring the multidimensional similarity between entities from different GKBs, thereby overcoming the semantic gap. Thus, many efforts have been made in this field. This paper first proposes basic definitions and a general framework for the entity alignment of GKBs. Specifically, state-of-the-art algorithms for the entity alignment of GKBs are reviewed from three aspects: similarity metrics, similarity combination, and alignment judgement. The evaluation procedure for alignment results is also summarized. On this basis, eight challenges for future studies are identified. There is a lack of methods to assess the quality of GKBs. The alignment process should be improved by determining the best composition of heterogeneous features, optimizing alignment algorithms, and incorporating background knowledge. Furthermore, a unified infrastructure, techniques for aligning large-scale GKBs, and deep learning-based alignment techniques should be developed. Meanwhile, the generation of benchmark datasets for the entity alignment of GKBs and the applications of this field need to be investigated. The progress of this field will be accelerated by addressing these challenges.
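
The three aspects named in the abstract above (similarity metrics, similarity combination, alignment judgement) can be illustrated with a minimal sketch. This is not an algorithm from the reviewed paper: the two metrics (string similarity on names, distance decay on coordinates), the weights, the 5 km scale, the 0.8 threshold and the entity dictionaries are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def name_similarity(a, b):
    """Similarity metric 1: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def spatial_similarity(e1, e2, scale_km=5.0):
    """Similarity metric 2: decays from 1 towards 0 as distance grows."""
    d = haversine_km(e1["lat"], e1["lon"], e2["lat"], e2["lon"])
    return 1.0 / (1.0 + d / scale_km)


def combined_similarity(e1, e2, w_name=0.6, w_space=0.4):
    """Similarity combination: a weighted sum (weights are assumed)."""
    return (w_name * name_similarity(e1["name"], e2["name"])
            + w_space * spatial_similarity(e1, e2))


def is_aligned(e1, e2, threshold=0.8):
    """Alignment judgement: a fixed threshold on the combined score."""
    return combined_similarity(e1, e2) >= threshold


# Hypothetical entities from two different GKBs.
a = {"name": "Mount Tai", "lat": 36.25, "lon": 117.10}
b = {"name": "mount tai", "lat": 36.26, "lon": 117.11}
c = {"name": "Yellow River", "lat": 37.00, "lon": 118.00}
print(is_aligned(a, b), is_aligned(a, c))
```

A weighted sum with a fixed threshold is only one possible choice for the combination and judgement steps; the abstract itself lists finding the best composition of heterogeneous features as an open challenge.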

Mining user queries with information extraction methods and linked data

Journal of Documentation ◽

10.1108/jd-09-2017-0133 ◽

2018 ◽

Vol 74 (5) ◽

pp. 936-950

Author(s):

Anne Chardonnens ◽

Ettore Rizza ◽

Mathias Coeckelbergs ◽

Seth van Hooland

Keyword(s):

Information Extraction ◽

Large Scale ◽

Extraction Methods ◽

Knowledge Bases ◽

Entity Recognition ◽

Web Analytics ◽

Place Names ◽

Data Set ◽

Content Type ◽

User Queries

Purpose: Advanced usage of web analytics tools makes it possible to capture the content of user queries. Although such queries are highly relevant, the manual analysis of large volumes of them is problematic. The purpose of this paper is to address the problem of named entity recognition in digital library user queries.

Design/methodology/approach: The paper presents a large-scale case study conducted at the Royal Library of Belgium on its online historical newspapers platform, BelgicaPress. The object of the study is a data set of 83,854 queries resulting from 29,812 visits over a 12-month period. By making use of information extraction methods, knowledge bases (KBs) and various authority files, this paper presents the possibilities and limits of identifying what percentage of end users are looking for person and place names.

Findings: Based on a quantitative assessment, the method can successfully identify the majority of person and place names in user queries. Due to the specific character of user queries and the nature of the KBs used, a limited number of queries remained too ambiguous to be treated in an automated manner.

Originality/value: This paper demonstrates empirically how user queries can be extracted from a web analytics tool and how named entities can then be mapped to KBs and authority files in order to facilitate automated analysis of their content. The methods and tools used are generalisable and can be reused by other collection holders.
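
The idea of mapping queries to person and place names and reporting the share of each type can be shown with a toy sketch. It is not the authors' pipeline: the two name sets below are hypothetical stand-ins for labels drawn from KBs and authority files, and exact matching after normalization is an assumed simplification.

```python
import re

# Hypothetical stand-ins for labels taken from KBs and authority files.
PERSON_NAMES = {"victor hugo", "leopold ii"}
PLACE_NAMES = {"bruxelles", "gent", "namur"}


def normalize(query):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", query.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def classify_query(query):
    """Label a query as 'person', 'place' or 'unknown' by exact lookup."""
    n = normalize(query)
    if n in PERSON_NAMES:
        return "person"
    if n in PLACE_NAMES:
        return "place"
    return "unknown"


def share_by_type(queries):
    """Return the fraction of queries that falls into each label."""
    counts = {"person": 0, "place": 0, "unknown": 0}
    for q in queries:
        counts[classify_query(q)] += 1
    total = len(queries) or 1  # avoid division by zero on empty input
    return {label: n / total for label, n in counts.items()}


print(share_by_type(["Victor Hugo", "GENT", "tramway 1905", "Namur."]))
```

Queries that land in the "unknown" bucket correspond loosely to the ambiguous cases the abstract says could not be treated automatically; a real system would need fuzzy matching and disambiguation rather than exact lookup.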

A bootstrapping approach for robust topic analysis

Natural Language Engineering ◽

10.1017/s1351324902002929 ◽

2002 ◽

Vol 8 (2-3) ◽

pp. 209-233 ◽

Cited By ~ 1

Author(s):

OLIVIER FERRET ◽

BRIGITTE GRAU

Keyword(s):

Information Extraction ◽

Large Scale ◽

Text Summarization ◽

Great Precision ◽

Topic Analysis ◽

Structured Knowledge

Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.

Accessible Routes Integrating Data from Multiple Sources

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10010007 ◽

2020 ◽

Vol 10 (1) ◽

pp. 7

Author(s):

Miguel R. Luaces ◽

Jesús A. Fisteus ◽

Luis Sánchez-Fernández ◽

Mario Munoz-Organero ◽

Jesús Balado ◽

...

Keyword(s):

Information System ◽

Data Model ◽

Large Scale ◽

Heterogeneous Data ◽

Multiple Sources ◽

Heterogeneous Data Sources ◽

Different Types ◽

Software Sensors ◽

The City

Providing citizens with the ability to move around in an accessible way is a requirement for all cities today. However, modeling city infrastructures so that accessible routes can be computed is a challenge because it involves collecting information from multiple, large-scale and heterogeneous data sources. In this paper, we propose and validate the architecture of an information system that creates an accessibility data model for cities by ingesting data from different types of sources and provides an application that can be used by people with different abilities to compute accessible routes. The article describes the processes that allow building a network of pedestrian infrastructures from the OpenStreetMap information (i.e., sidewalks and pedestrian crossings), improving the network with information extracted from mobile-sensed LiDAR data (i.e., ramps, steps, and pedestrian crossings), detecting obstacles using volunteered information collected from the hardware sensors of citizens' mobile devices (i.e., ramps and steps), and detecting accessibility problems with software sensors in social networks (i.e., Twitter). The information system is validated through its application in a case study in the city of Vigo (Spain).

PEDRERA. Positive Energy District Renovation Model for Large Scale Actions

Energies ◽

10.3390/en14102833 ◽

2021 ◽

Vol 14 (10) ◽

pp. 2833

Author(s):

Paolo Civiero ◽

Jordi Pascual ◽

Joaquim Arcas Abella ◽

Ander Bilbao Figuero ◽

Jaume Salom

Keyword(s):

Simulation Model ◽

Performance Indicators ◽

Large Scale ◽

Key Performance Indicators ◽

Positive Energy ◽

Design Phase ◽

Multiple Sources ◽

Reliable Prediction ◽

Sensitive Analysis ◽

Web Platform

In this paper, we provide an overview of the ongoing PEDRERA project, whose main aim is to design a district simulation model able to produce and analyze reliable predictions of potential business scenarios for large-scale retrofitting actions, and to evaluate the overall co-benefits resulting from the renovation of a cluster of buildings. To this end, and following a Positive Energy Districts (PEDs) approach, the model combines systemized data, at both building and district scale, from multiple sources and domains. A sensitivity analysis of 200 scenarios provided a quick view of how results change once inputs are defined, and of how the expected results meet stakeholders’ requirements. In order to enable a clever input analysis and to appraise wide-ranging ranks of Key Performance Indicators (KPIs) suited to each stakeholder and design phase target, the model is currently being implemented in the urbanZEB tool’s web platform.

Developing a vocabulary and ontology for modeling insect natural history data: example data, use cases, and competency questions

Biodiversity Data Journal ◽

10.3897/bdj.7.e33303 ◽

2019 ◽

Vol 7 ◽

Author(s):

Brian Stucky ◽

James Balhoff ◽

Narayani Barve ◽

Vijay Barve ◽

Laura Brenskelle ◽

...

Keyword(s):

Natural History ◽

Large Scale ◽

Use Cases ◽

Data Systems ◽

Multiple Sources ◽

History Data ◽

Insect Ecology ◽

Many Sources ◽

Multicellular Organisms ◽

Initial Results

Insects are possibly the most taxonomically and ecologically diverse class of multicellular organisms on Earth. Consequently, they provide nearly unlimited opportunities to develop and test ecological and evolutionary hypotheses. Currently, however, large-scale studies of insect ecology, behavior, and trait evolution are impeded by the difficulty in obtaining and analyzing data derived from natural history observations of insects. These data are typically highly heterogeneous and widely scattered among many sources, which makes developing robust information systems to aggregate and disseminate them a significant challenge. As a step towards this goal, we report initial results of a new effort to develop a standardized vocabulary and ontology for insect natural history data. In particular, we describe a new database of representative insect natural history data derived from multiple sources (but focused on data from specimens in biological collections), an analysis of the abstract conceptual areas required for a comprehensive ontology of insect natural history data, and a database of use cases and competency questions to guide the development of data systems for insect natural history data. We also discuss data modeling and technology-related challenges that must be overcome to implement robust integration of insect natural history data.

Building Context-Aware Customized Stories Based on Uncovering Indirect Associations from Semantic Knowledge Bases

2016 IEEE Tenth International Conference on Semantic Computing (ICSC) ◽

10.1109/icsc.2016.27 ◽

2016 ◽

Author(s):

Omar G. Bravo-Quezada ◽

Yolanda Blanco-Fernandez ◽

Martin Lopez-Nores ◽

Diego A. Pesantez Nauta

Keyword(s):

Knowledge Bases ◽

Semantic Knowledge ◽

Context Aware

Context-aware sequence labeling for condition information extraction from historical bridge inspection reports

Advanced Engineering Informatics ◽

10.1016/j.aei.2021.101333 ◽

2021 ◽

Vol 49 ◽

pp. 101333

Author(s):

Tianshu Li ◽

Mohamad Alipour ◽

Devin K. Harris

Keyword(s):

Information Extraction ◽

Context Aware ◽

Bridge Inspection ◽

Sequence Labeling

Digital Mega-Studies as a New Research Paradigm: Governing the Health Research of the Future

Journal of Empirical Research on Human Research Ethics ◽

10.1177/15562646211041492 ◽

2021 ◽

pp. 155626462110414

Author(s):

Jessica Bell ◽

Megan Prictor ◽

Lauren Davenport ◽

Lynda O’Brien ◽

Melissa Wake

Keyword(s):

Large Scale ◽

Group Processes ◽

Research Paradigm ◽

Multiple Sources ◽

Individual Level ◽

Key Characteristics ◽

Governance Challenges ◽

Technological Developments ◽

Multi Stakeholder ◽

New Research

‘Digital Mega-Studies’ are entirely or extensively digitised, longitudinal, population-scale initiatives, collecting, storing, and making available individual-level research data of different types and from multiple sources, shaped by technological developments and unforeseeable risks over time. The Australian ‘Gen V’ project exemplifies this new research paradigm. In 2019, we undertook a multidisciplinary, multi-stakeholder process to map Digital Mega-Studies’ key characteristics, legal and governance challenges and likely solutions. We conducted large and small group processes within a one-day symposium and directed online synthesis and group prioritisation over subsequent weeks. We present our methods (including elicitation, affinity mapping and prioritisation processes) and findings, proposing six priority governance principles across three areas—data, participation, trust—to support future high-quality, large-scale digital research in health.

Context-aware Adaptive Surgery

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies ◽

10.1145/3478073 ◽

2021 ◽

Vol 5 (3) ◽

pp. 1-22

Author(s):

Hongli Wang ◽

Bin Guo ◽

Jiaqi Liu ◽

Sicong Liu ◽

Yungang Wu ◽

...

Keyword(s):

Real Time ◽

Large Scale ◽

Resource Constraints ◽

Search Algorithm ◽

Search Time ◽

State Graph ◽

Context Aware ◽

Optimal Partition ◽

Research Attention ◽

Neighbor Effect

Deep Neural Networks (DNNs) have made massive progress in many fields, and deploying DNNs on end devices has become an emerging trend that brings intelligence closer to users. However, it is challenging to deploy large-scale and computation-intensive DNNs on resource-constrained end devices due to their small size and light weight. To this end, model partition, which aims to partition DNNs into multiple parts to realize the collaborative computing of multiple devices, has received extensive research attention. To find the optimal partition, most existing approaches need to run from scratch under given resource constraints. However, they ignore that the resources of devices (e.g., storage, battery power) and performance requirements (e.g., inference latency) are often continuously changing, so the optimal partition solution changes constantly during processing. Therefore, it is very important to reduce the tuning latency of model partition to realize real-time adaptation under the changing processing context. To address these problems, we propose the Context-aware Adaptive Surgery (CAS) framework to actively perceive the changing processing context and adaptively find the appropriate partition solution in real time. Specifically, we construct the partition state graph to comprehensively model different partition solutions of DNNs by importing context resources. Then "the neighbor effect" is proposed, which provides the heuristic rule for the search process. When the processing context changes, CAS adopts the runtime search algorithm, Graph-based Adaptive DNN Surgery (GADS), to quickly find the appropriate partition that satisfies resource constraints under the guidance of the neighbor effect. The experimental results show that CAS achieves rapid adaptive tuning of the model partition solutions on a 10 ms scale even for large DNNs (a 2.25x to 221.7x search-time improvement over state-of-the-art approaches), while the total inference latency remains at the same level as the baselines.
