TOWARDS AN EFFICIENT RDF DATASET SLICING

2013 ◽  
Vol 07 (04) ◽  
pp. 455-477 ◽  
Author(s):  
EDGARD MARX ◽  
TOMMASO SORU ◽  
SAEEDEH SHEKARPOUR ◽  
SÖREN AUER ◽  
AXEL-CYRILLE NGONGA NGOMO ◽  
...  

Over the last few years, a considerable amount of structured data has been published on the Web as Linked Open Data (LOD). Despite recent advances, consuming and using Linked Open Data within an organization is still a substantial challenge. Many of the LOD datasets are quite large and, despite progress in Resource Description Framework (RDF) data management, their loading and querying within a triple store is extremely time-consuming and resource-demanding. To overcome this consumption obstacle, we propose a process inspired by the classical Extract-Transform-Load (ETL) paradigm. In this article, we focus particularly on the selection and extraction steps of this process. We devise a fragment of the SPARQL Protocol and RDF Query Language (SPARQL) dubbed SliceSPARQL, which enables the selection of well-defined slices of datasets fulfilling typical information needs. SliceSPARQL supports graph patterns for which each connected subgraph pattern involves a maximum of one variable or Internationalized Resource Identifier (IRI) in its join conditions. This restriction guarantees the efficient processing of the query against a sequential dataset dump stream. Furthermore, we evaluate our slicing approach using three different optimization strategies. Results show that dataset slices can be generated an order of magnitude faster than by using the conventional approach of loading the whole dataset into a triple store.
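As a sketch of the restriction described above (the exact SliceSPARQL surface syntax is defined in the article; the query below is ordinary SPARQL, with DBpedia terms chosen purely for illustration), a slice query whose connected subgraph pattern joins on a single variable can be evaluated in one pass over a sequential dump stream:

```sparql
# Illustrative slice query: the connected subgraph pattern joins on
# only one variable (?city), so every triple in a streamed dump can
# be tested independently against the star pattern.
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?city a dbo:City ;
        rdfs:label ?label ;
        dbo:populationTotal ?population .
}
WHERE {
  ?city a dbo:City ;
        rdfs:label ?label ;
        dbo:populationTotal ?population .
}
```

A pattern joining two or more variables (e.g. `?city dbo:country ?country . ?country dbo:capital ?capital`) would fall outside the supported fragment, since it cannot be resolved without random access to the dataset.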

Author(s):  
Olga A. Lavrenova ◽  
Andrey A. Vinberg

The goal of any library is to ensure high quality and general availability of information retrieval tools. The paper describes the project implemented by the Russian State Library (RSL) to present the Library Bibliographic Classification as a networked knowledge organization system. The project goal is to support content and provide tools for ensuring the system's interoperability with other resources of the same nature (i.e. with Linked Data vocabularies) in the global network environment. The project was partially supported by the Russian Foundation for Basic Research (RFBR). The RSL General Classified Catalogue (GCC) was selected as the main data source for the classification system of knowledge organization. The meaning of each classification number is expressed by the complete string of wordings (captions), rather than by the last-level caption alone. Data converted to Resource Description Framework (RDF) files, based on the standard set of properties defined in the Simple Knowledge Organization System (SKOS) model, were loaded into the semantic storage for subsequent data processing using the SPARQL query language. In order to enrich user queries for the search of resources, the RSL has published its classification system in the form of Linked Open Data (https://lod.rsl.ru) for searching in the RSL electronic catalogue. Currently, work is underway to enable its smooth integration with other LOD vocabularies. The SKOS mapping tags are used to differentiate the types of connections between SKOS elements (concepts) existing in different concept schemes, for example, UDC, MeSH, and authority data. The conceptual schemes of the leading classifications are fundamentally different from each other. Establishing correspondence between concepts is possible only on the basis of lexical and structural analysis to compute concept similarity as a combination of attributes. The authors look forward to working with libraries in Russia and other countries to create a common space of Linked Open Data vocabularies.
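A sketch of the kind of SKOS query involved (the endpoint layout and any returned values are assumptions, not taken from the RSL data): retrieving classification concepts together with their caption strings and any cross-scheme mappings declared with the SKOS mapping properties.

```sparql
# Hypothetical query against a classification published as SKOS:
# list concepts with their notation, caption, and any exact matches
# declared to concepts in other schemes (e.g. UDC, MeSH).
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ?notation ?caption ?match
WHERE {
  ?concept a skos:Concept ;
           skos:notation ?notation ;
           skos:prefLabel ?caption .
  OPTIONAL { ?concept skos:exactMatch ?match . }
}
```

Weaker correspondences between fundamentally different schemes would use `skos:closeMatch`, `skos:broadMatch`, or `skos:narrowMatch` instead of `skos:exactMatch`, which is what makes the mapping-tag distinction mentioned above useful.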


2021 ◽  
Vol 11 (5) ◽  
pp. 2405
Author(s):  
Yuxiang Sun ◽  
Tianyi Zhao ◽  
Seulgi Yoon ◽  
Yongju Lee

Semantic Web has recently gained traction with the use of Linked Open Data (LOD) on the Web. Although numerous state-of-the-art methodologies, standards, and technologies are applicable to the LOD cloud, many issues persist. Because the LOD cloud is based on graph-based Resource Description Framework (RDF) triples and the SPARQL query language, we cannot directly adopt traditional techniques employed for database management systems or distributed computing systems. This paper addresses how the LOD cloud can be efficiently organized, retrieved, and evaluated. We propose a novel hybrid approach that combines the index and live exploration approaches for improved LOD join query performance. Using a two-step index structure combining a disk-based 3D R*-tree with the extended multidimensional histogram and flash memory-based k-d trees, we can efficiently discover interlinked data distributed across multiple resources. Because this method rapidly prunes numerous false hits, the performance of join query processing is remarkably improved. We also propose a hot-cold segment identification algorithm to identify regions of high interest. The proposed method is compared with existing popular methods on real RDF datasets. Results indicate that our method outperforms the existing methods because it can quickly obtain target results by reducing unnecessary data scanning and reduce the amount of main memory required to load filtering results.
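The join queries in question have roughly the following shape (a plain SPARQL sketch; the vocabularies are chosen for illustration and are not from the paper): bindings produced by patterns over one source must join, through interlinks, with patterns over another, which is where early pruning of false candidates pays off.

```sparql
# Sketch of an interlinked join across LOD sources: candidate
# bindings for ?person must join, via owl:sameAs links, with
# descriptions held elsewhere, so pruning false hits early reduces
# both data scanning and intermediate-result memory.
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT ?person ?name ?birthPlace
WHERE {
  ?person a dbo:Person ;
          foaf:name ?name ;
          owl:sameAs ?same .         # interlink to another dataset
  ?same dbo:birthPlace ?birthPlace . # pattern over the linked source
}
```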


2017 ◽  
Vol 35 (1) ◽  
pp. 159-178
Author(s):  
Timothy W. Cole ◽  
Myung-Ja K. Han ◽  
Maria Janina Sarol ◽  
Monika Biel ◽  
David Maus

Purpose: Early Modern emblem books are primary sources for scholars studying the European Renaissance. Linked Open Data (LOD) is an approach for organizing and modeling information in a data-centric manner compatible with the emerging Semantic Web. The purpose of this paper is to examine ways in which LOD methods can be applied to facilitate emblem resource discovery, better reveal the structure and connectedness of digitized emblem resources, and enhance scholar interactions with digitized emblem resources.

Design/methodology/approach: This research encompasses an analysis of the existing XML-based Spine (emblem-specific) metadata schema; the design of a new, domain-specific, Resource Description Framework compatible ontology; the mapping and transformation of metadata from Spine to both the new ontology and (separately) to the pre-existing Schema.org ontology; and the (experimental) modification of the Emblematica Online portal as a proof of concept to illustrate enhancements supported by LOD.

Findings: LOD is viable as an approach for facilitating discovery and enhancing the value to scholars of digitized emblem books; however, metadata must first be enriched with additional uniform resource identifiers and the workflow upgrades required to normalize and transform existing emblem metadata are substantial and still to be fully worked out.

Practical implications: The research described demonstrates the feasibility of transforming existing, special collections metadata to LOD. Although considerable work and further study will be required, preliminary findings suggest potential benefits of LOD for both users and libraries.

Originality/value: This research is unique in the context of emblem studies and adds to the emerging body of work examining the application of LOD best practices to library special collections.
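To illustrate the Schema.org side of such a mapping (the property choices below are our assumptions, not the paper's actual Spine-to-Schema.org mapping), emblem-level discovery could be supported by queries of this form:

```sparql
# Hypothetical query over Schema.org-mapped emblem metadata:
# find digitized emblem books and the emblems they contain.
PREFIX schema: <http://schema.org/>

SELECT ?book ?title ?emblem ?motto
WHERE {
  ?book a schema:Book ;
        schema:name ?title ;
        schema:hasPart ?emblem .          # book-to-emblem structure
  ?emblem schema:alternativeHeadline ?motto .  # assumed motto property
}
```

Queries like this depend on each emblem having its own URI, which is exactly the enrichment requirement the findings above point to.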


2020 ◽  
pp. 016555152093095
Author(s):  
Gustavo Candela ◽  
Pilar Escobar ◽  
Rafael C Carrasco ◽  
Manuel Marco-Such

Cultural heritage institutions have recently started to share their metadata as Linked Open Data (LOD) in order to disseminate and enrich them. The publication of large bibliographic data sets as LOD is a challenge that requires the design and implementation of custom methods for the transformation, management, querying and enrichment of the data. In this report, the methodology defined by previous research for the evaluation of the quality of LOD is analysed and adapted to the specific case of Resource Description Framework (RDF) triples containing standard bibliographic information. The specified quality measures are reported in the case of four highly relevant libraries.


Author(s):  
Mariana Baptista Brandt ◽  
Silvana Aparecida Borsetti Gregorio Vidotti ◽  
José Eduardo Santarem Segundo

This study proposes a linked open data (LOD) model for a legislative open dataset of the Brazilian Chamber of Deputies (Câmara dos Deputados). To this end, we review the literature on the concepts of open data, open government data, linked data, and linked open data, followed by applied research modeling legislative data according to the LOD model. The dataset selected for this research was "Deputados", which contains information about members of parliament such as political party, federative unit, e-mail, and legislature. We find that structuring the dataset in RDF (Resource Description Framework) is feasible by reusing vocabularies and standards already established on the Semantic Web, such as Dublin Core, Friend of a Friend (FOAF), RDF, and RDF Schema, as well as vocabularies from related domains, such as the ontologies of the Italian Chamber of Deputies and of the French National Assembly. Following the Linked Data recommendations, the resources were also linked to other LOD datasets, such as Geonames and DBpedia, for semantic enrichment. The study concludes that government data, and legislative data in particular, can be published following the recommendations of the W3C (World Wide Web Consortium), thereby integrating legislative data into the Web of Data and expanding the possibilities for data reuse in transparency and oversight initiatives, bringing citizens closer to Congress and to their representatives.
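A sketch of how such a dataset could be queried once modelled with the vocabularies named above (the endpoint and the exact property IRIs are assumptions made for illustration):

```sparql
# Hypothetical query over the "Deputados" dataset modelled with FOAF
# and Dublin Core terms: list deputies with their names and e-mail.
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?deputy ?name ?email ?legislature
WHERE {
  ?deputy a foaf:Person ;
          foaf:name ?name ;
          foaf:mbox ?email ;
          dcterms:isPartOf ?legislature .  # assumed legislature link
}
```

Reusing FOAF and Dublin Core in this way is what lets external applications consume the data without learning a bespoke schema, and the Geonames and DBpedia links mentioned above supply the federative-unit and party context.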


Author(s):  
Reto Gmür ◽  
Donat Agosti

Taxonomic treatments are sections of publications that document the features or distribution of a related group of organisms (called a "taxon", plural "taxa") in ways adhering to highly formalized conventions; published in scientific journals, they shape our understanding of global biodiversity (Catapano 2019). Treatments are the building blocks of the evolving scientific consensus on taxonomic entities. The semantics of these treatments and their relationships are highly structured: taxa are introduced, merged, made obsolete, split, renamed, associated with specimens, and so on. Plazi makes this content available in machine-readable form using the Resource Description Framework (RDF). RDF is the standard model for Linked Data and the Semantic Web. RDF can be exchanged in different formats (i.e. concrete syntaxes) such as RDF/XML or Turtle. The data model describes graph structures and relies on Internationalized Resource Identifiers (IRIs); ontologies such as the Darwin Core basic vocabulary are used to assign meaning to the identifiers. For Synospecies, we unite all treatments into one large knowledge graph, modelling taxonomic knowledge and its evolution with complete references to quotable treatments. However, this knowledge graph expresses much more than any individual treatment could convey, because every referenced entity is linked to every other relevant treatment. On synospecies.plazi.org, we provide a user-friendly interface to find the names and treatments related to a taxon. An advanced mode allows execution of queries using the SPARQL query language.
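A sketch of a query over such a treatment graph (the Darwin Core term choices and the treatment-relation property are our assumptions; the actual Plazi modelling may differ):

```sparql
# Hypothetical query: find treatments about a given genus and follow
# links to later treatments that deprecate them, tracing how the
# scientific consensus on a taxon has evolved.
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX trt: <http://plazi.org/vocab/treatment#>

SELECT ?treatment ?species ?later
WHERE {
  ?treatment trt:definesTaxonConcept ?taxon .
  ?taxon dwc:genus "Atta" ;
         dwc:species ?species .
  OPTIONAL { ?later trt:deprecates ?treatment . }  # assumed relation
}
```

Because every referenced entity is shared across the graph, a query like this returns evidence from all relevant treatments at once, which is exactly what no single treatment could convey on its own.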


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 1677
Author(s):  
Toshiaki Katayama ◽  
Shuichi Kawashima ◽  
Gos Micklem ◽  
Shin Kawano ◽  
Jin-Dong Kim ◽  
...  

Publishing databases in the Resource Description Framework (RDF) model is becoming widely accepted to maximize the syntactic and semantic interoperability of open data in life sciences. Here we report advancements made in the 6th and 7th annual BioHackathons, which were held in Tokyo and Miyagi, respectively. This review consists of two major sections covering: 1) improvement and utilization of RDF data in various domains of the life sciences and 2) meta-data about these RDF data, the resources that store them, and the service quality of SPARQL Protocol and RDF Query Language (SPARQL) endpoints. The first section describes how we developed RDF data, ontologies and tools in genomics, proteomics, metabolomics, glycomics and by literature text mining. The second section describes how we defined descriptions of datasets, the provenance of data, and quality assessment of services and service discovery. By enhancing the harmonization of these two layers of machine-readable data and knowledge, we improve the way community-wide resources are developed and published. Moreover, we outline best practices for the future, and prepare ourselves for an exciting and as yet unanticipated variety of real-world applications in the coming years.


2015 ◽  
Author(s):  
Matthew Lincoln

This lesson explains why many cultural institutions are adopting graph databases, and how researchers can access these data through the query language called SPARQL.


2017 ◽  
Author(s):  
Jonathan Blaney

Introduces core concepts of Linked Open Data, including URIs, ontologies, RDF formats, and a gentle intro to the graph query language SPARQL.


2019 ◽  
Vol 8 (8) ◽  
pp. 353 ◽  
Author(s):  
Alejandro Vaisman ◽  
Kevin Chentout

This paper describes how a platform for publishing and querying linked open data for the Brussels Capital region in Belgium is built. Data are provided as relational tables or XML documents and are mapped into the RDF data model using R2RML, a standard language that allows defining customized mappings from relational databases to RDF datasets. In this work, data are spatiotemporal in nature; therefore, R2RML must be adapted to allow producing spatiotemporal Linked Open Data. Data generated in this way are used to populate a SPARQL endpoint, where queries are submitted and the result can be displayed on a map. This endpoint is implemented using Strabon, a spatiotemporal RDF triple store built by extending the RDF store Sesame. The first part of the paper describes how R2RML is adapted to allow producing spatial RDF data and to support XML data sources. These techniques are then used to map data about cultural events and public transport in Brussels into RDF. Spatial data are stored in the form of stRDF triples, the format required by Strabon. In addition, the endpoint is enriched with external data obtained from the Linked Open Data Cloud, from sites like DBpedia, Geonames, and LinkedGeoData, to provide context for analysis. The second part of the paper shows, through a comprehensive set of spatial extension to SPARQL (stSPARQL) queries, how the endpoint can be exploited.
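A sketch of an stSPARQL query of the kind such an endpoint could answer (the graph structure, class names, and property names are assumptions made for illustration; only the `strdf` spatial-function style comes from Strabon):

```sparql
# Hypothetical stSPARQL query: cultural events within 1 km of a
# public transport stop, using Strabon's spatial distance function.
PREFIX strdf: <http://strdf.di.uoa.gr/ontology#>
PREFIX uom:   <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX ex:    <http://example.org/brussels#>

SELECT ?event ?stop
WHERE {
  ?event a ex:CulturalEvent ;
         ex:hasGeometry ?eGeom .        # stRDF geometry literal
  ?stop  a ex:TransportStop ;
         ex:hasGeometry ?sGeom .
  FILTER (strdf:distance(?eGeom, ?sGeom, uom:metre) < 1000)
}
```

The spatial filter is evaluated by the triple store's geometry engine rather than by string comparison, which is what the R2RML adaptation producing stRDF literals makes possible.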

