How do you Develop a Data Standard? Wikibase might be the Solution…

Author(s):  
Maarten Trekels ◽  
Matt Woodburn ◽  
Deborah L Paul ◽  
Sharon Grant ◽  
Kate Webbink ◽  
...  

Data standards allow us to aggregate, compare, compute and communicate data from a wide variety of origins. However, for historical reasons, data are most likely to be stored in many different formats and conform to different models. Every data set might contain a huge amount of information, but it becomes tremendously difficult to compare data sets without a common way to represent the data. That is where standards development comes in. Developing a standard is a formidable process, often involving many stakeholders. Typically, the initial blueprint of a standard is created by a limited number of people who have a clear view of their use cases. However, as development continues, additional stakeholders participate in the process. As a result, conflicting opinions and interests influence the development of the standard. Compromises need to be made, and the standard might end up looking very different from the initial concept. In order to address the needs of the community, a high level of engagement in the development process is encouraged. However, this does not necessarily increase the usability of the standard. To mitigate this, the standard needs to be tested during the early stages of development.

To facilitate such testing, we explored the use of Wikibase to create an initial implementation of the standard. Wikibase is the underlying technology that drives Wikidata. The software is open source and can be customized for creating collaborative knowledge bases. In addition to containing an RDF (Resource Description Framework) triple store under the hood, it provides users with an easy-to-use graphical user interface (see Fig. 1), which makes an implementation of a standard accessible to non-technical users. Wikibase remains fully flexible in the way data are represented and enforces no data model, allowing users to map their data onto the standard without any restrictions. Retrieving information from RDF data can be done through the SPARQL query language (W3C 2020). The software package also has a built-in SPARQL endpoint, allowing users to extract the relevant information: Does the standard cover all use cases envisioned? Are parts of the standard underdeveloped? Are the controlled vocabularies sufficient to describe the data?

This strategy was applied during the development of the TDWG Collection Description standard. After completing a rough version of the standard, the terms defined in that first version were transferred to a Wikibase instance running on WBStack (Addshore 2020). Initially, collection data were entered manually, which revealed several issues. The Wikibase allowed us to easily define controlled vocabularies and expand them as needed. The feedback reported by users then flowed back into the further development of the standard. Currently we envisage creating automated scripts that will import data en masse from collections. Using the SPARQL query interface, it will then be straightforward to ensure that data can be extracted from the Wikibase to support the envisaged use cases.
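
As a rough illustration of the kind of check such an endpoint enables, the Python sketch below counts how many collection records populate a single draft term. It assumes the SPARQLWrapper library; the endpoint URL and the property ID P42 are invented placeholders, not part of the actual TDWG Wikibase.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint of a Wikibase instance holding draft Collection
# Description data; the real URL and property IDs will differ.
ENDPOINT = "https://collections-demo.wikibase.cloud/query/sparql"

sparql = SPARQLWrapper(ENDPOINT)
sparql.setReturnFormat(JSON)

# Count how many collections carry a value for one draft term, here the
# assumed property P42 ("preservation method"). Sparse results point to
# parts of the standard or vocabularies that are underdeveloped.
sparql.setQuery("""
PREFIX wdt: <https://collections-demo.wikibase.cloud/prop/direct/>
SELECT (COUNT(DISTINCT ?collection) AS ?withValue) WHERE {
  ?collection wdt:P42 ?preservationMethod .
}
""")

result = sparql.queryAndConvert()
print(result["results"]["bindings"][0]["withValue"]["value"])
```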

2013 ◽  
Vol 441 ◽  
pp. 970-973
Author(s):  
Yan Qin Zhang ◽  
Jing Bin Wang

With the development of the Semantic Web, RDF data sets have grown rapidly, raising the problem of querying massive amounts of RDF data. Using distributed techniques to execute SPARQL (SPARQL Protocol and RDF Query Language) queries is one way of tackling this large-scale RDF query problem. At present, most RDF query strategies based on Hadoop have to use multiple MapReduce jobs to complete the task, resulting in wasted time. In order to overcome this drawback, the MRQJ (using MapReduce to query and join) algorithm is proposed in this paper: it first uses a greedy strategy to generate a join plan, so that only one MapReduce job needs to be created to obtain the results of a SPARQL query. Finally, a comparative experiment on the LUBM (Lehigh University Benchmark) test data set is conducted; the results show that the MRQJ method has a clear advantage when queries are more complicated.
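
A minimal sketch of the greedy join-ordering idea follows (my illustration, not the paper's actual MRQJ algorithm): triple patterns are ordered so that each added pattern shares a variable with those already chosen, letting the whole join be planned up front before a single job is launched. Selectivity estimates are assumed to be available from dataset statistics.

```python
# Sketch of a greedy join-plan builder for SPARQL triple patterns, in the
# spirit of MRQJ (not the paper's actual algorithm). Each pattern is a
# (subject, predicate, object) tuple; variables start with '?'.

def variables(pattern):
    return {term for term in pattern if term.startswith("?")}

def greedy_join_plan(patterns, selectivity):
    # Start from the most selective (lowest-cardinality) pattern ...
    remaining = sorted(patterns, key=selectivity)
    plan = [remaining.pop(0)]
    bound = variables(plan[0])
    # ... then repeatedly add the most selective pattern that shares a
    # variable with the patterns already in the plan, so the whole join
    # can be evaluated in a single pass (one MapReduce job).
    while remaining:
        joinable = [p for p in remaining if variables(p) & bound] or remaining
        nxt = min(joinable, key=selectivity)
        remaining.remove(nxt)
        plan.append(nxt)
        bound |= variables(nxt)
    return plan

patterns = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?x", "ub:takesCourse", "?c"),
    ("?c", "ub:name", '"Course1"'),
]
estimate = {patterns[0]: 2000, patterns[1]: 8000, patterns[2]: 1}.get
print(greedy_join_plan(patterns, estimate))
```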


2021 ◽  
Vol 11 (5) ◽  
pp. 2232
Author(s):  
Francesca Noardo ◽  
Ken Arroyo Ohori ◽  
Thomas Krijnen ◽  
Jantien Stoter

Industry Foundation Classes (IFC) is a complete, wide and complex open standard data model for representing Building Information Models. The standardization organization buildingSMART makes significant efforts to develop and maintain this standard in collaboration with researchers, companies and institutions. However, when trying to use IFC models from practice for automatic analysis, issues emerge as a consequence of a misalignment between what is prescribed by, or available in, the standard and the data sets that are produced in practice. In this study, a sample of models produced by practitioners for purposes other than their explicit use within automatic processing tools is inspected and analyzed. The aim is to find common patterns in data sets from practice and their possible discrepancies with the standard, in order to find ways to address such discrepancies in a subsequent step. In particular, it is noticeable that the overall quality of the models requires specific additional care by the modellers before relying on them for automatic analysis, and that a high level of variability is present in how some relevant information (such as georeferencing) is stored.
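
For instance, a quick way to see where (if anywhere) a model stores georeferencing is sketched below using the open-source ifcopenshell library; the file name is hypothetical, and attribute availability depends on the schema version (IfcMapConversion only exists from IFC4 onwards).

```python
# Rough check of how (and whether) an IFC model stores georeferencing,
# using the open-source ifcopenshell library. Sketch only; the file is
# hypothetical and None values indicate missing georeferencing.
import ifcopenshell

model = ifcopenshell.open("sample_model.ifc")  # hypothetical file

# Option 1: geographic coordinates on IfcSite (common in practice).
for site in model.by_type("IfcSite"):
    print("IfcSite:", site.Name,
          "RefLatitude:", site.RefLatitude,
          "RefLongitude:", site.RefLongitude,
          "RefElevation:", site.RefElevation)

# Option 2 (IFC4 only): an explicit map conversion to a projected CRS.
if model.schema != "IFC2X3":
    for conv in model.by_type("IfcMapConversion"):
        print("IfcMapConversion:", conv.Eastings, conv.Northings, conv.OrthogonalHeight)
```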


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 34 ◽  
Author(s):  
Maria-Evangelia Papadaki ◽  
Nicolas Spyratos ◽  
Yannis Tzitzikas

The continuous accumulation of multi-dimensional data and the development of Semantic Web and Linked Data published in the Resource Description Framework (RDF) bring new requirements for data analytics tools. Such tools should take into account the special features of RDF graphs, exploit the semantics of RDF and support flexible aggregate queries. In this paper, we present an approach for applying analytics to RDF data based on a high-level functional query language, called HIFUN. In that language, each analytical query is considered to be a well-formed expression of a functional algebra and its definition is independent of the nature and structure of the data. We investigate how HIFUN can be used to ease the formulation of analytic queries over RDF data. We detail the applicability of HIFUN over RDF, as well as the data transformations that may be required, introduce the translation rules from HIFUN queries to SPARQL, and describe a first implementation of the proposed model.
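
To give a flavour of the translation, here is a simplified sketch under assumed property IRIs (not the paper's full translation rules): a HIFUN-style analytic query, given as a grouping attribute, a measured attribute and an aggregate operation, maps onto a SPARQL aggregate query.

```python
# Sketch of the idea behind translating a HIFUN-style query (grouping
# function g, measuring function m, aggregate operation op) to SPARQL.
# The property IRIs below are hypothetical examples.

def hifun_to_sparql(grouping_prop, measure_prop, op="SUM"):
    return f"""
SELECT ?group ({op}(?value) AS ?result) WHERE {{
  ?item <{grouping_prop}> ?group .
  ?item <{measure_prop}> ?value .
}}
GROUP BY ?group
"""

# Example: total quantity delivered per branch (hypothetical properties).
print(hifun_to_sparql("http://example.org/deliveredTo",
                      "http://example.org/quantity"))
```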


AWARI ◽  
2020 ◽  
Vol 1 (1) ◽  
Author(s):  
Higor Alexandre Duarte Mascarenhas ◽  
Thiago Magela Rodrigues Dias ◽  
Patrícia Mascarenhas Dias

The migration of Brazilians has become more and more frequent nowadays, with the main purpose of obtaining better living conditions. Studies indicate that one of the main reasons for migration is the search for high-level training. In this scenario, the main objective of this research is to analyze the exodus of Brazilian students during their academic formation, based on data extracted from their curricula registered in the Lattes Platform and on network analysis techniques. The Lattes Platform was chosen because it is one of the main Brazilian academic repositories and holds information relevant to this research; the LattesDataXplorer framework was used for the extraction and treatment of the data. Subsequently, the set of individuals with a completed doctorate was selected, because they have the highest level of education and maintain constantly updated curricula. The data were then enriched with geolocation information about the institutions where they trained, in order to obtain the distances covered by doctorate holders. Network analysis was used to visualize the data, and network metrics were used to obtain an overview of how the Brazilian scientific exodus occurs. A high concentration of doctorate holders is observed in cities with a higher concentration of universities offering postgraduate programs at the master's and doctoral level, cities that are also characterized by higher per capita incomes.
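
As a hedged illustration of the network-analysis step (made-up data, not the study's actual pipeline), the sketch below builds a weighted, directed migration network between cities and ranks destinations by weighted in-degree.

```python
# Illustrative sketch (made-up data): build a directed migration network
# between Brazilian cities from (origin, destination) pairs derived from
# curricula, then rank destination cities by weighted in-degree.
import networkx as nx

moves = [
    ("Belo Horizonte", "São Paulo"),
    ("Manaus", "São Paulo"),
    ("Belo Horizonte", "Rio de Janeiro"),
    ("São Paulo", "Campinas"),
]

G = nx.DiGraph()
for origin, destination in moves:
    if G.has_edge(origin, destination):
        G[origin][destination]["weight"] += 1
    else:
        G.add_edge(origin, destination, weight=1)

# Cities that attract the most doctoral students (weighted in-degree).
print(sorted(G.in_degree(weight="weight"), key=lambda item: item[1], reverse=True))
```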


Author(s):  
Reto Gmür ◽  
Donat Agosti

Taxonomic treatments, sections of publications documenting the features or distribution of a related group of organisms (called a “taxon”, plural “taxa”) in ways adhering to highly formalized conventions, and published in scientific journals, shape our understanding of global biodiversity (Catapano 2019). Treatments are the building blocks of the evolving scientific consensus on taxonomic entities. The semantics of these treatments and their relationships are highly structured: taxa are introduced, merged, made obsolete, split, renamed, associated with specimens and so on. Plazi makes this content available in machine-readable form using the Resource Description Framework (RDF). RDF is the standard model for Linked Data and the Semantic Web. RDF can be exchanged in different formats (aka concrete syntaxes) such as RDF/XML or Turtle. The data model describes graph structures and relies on Internationalized Resource Identifiers (IRIs); ontologies such as the Darwin Core basic vocabulary are used to assign meaning to the identifiers. For Synospecies, we unite all treatments into one large knowledge graph, modelling taxonomic knowledge and its evolution with complete references to quotable treatments. However, this knowledge graph expresses much more than any individual treatment could convey, because every referenced entity is linked to every other relevant treatment. On synospecies.plazi.org, we provide a user-friendly interface to find the names and treatments related to a taxon. An advanced mode allows execution of queries using the SPARQL query language.
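
A hedged sketch of the kind of query the advanced mode supports is given below; the endpoint URL is a placeholder and the treatment-ontology (trt:) predicates are assumptions, while dwc:genus and dwc:specificEpithet are standard Darwin Core terms.

```python
# Sketch only: the endpoint URL is a placeholder and the trt: predicates
# are assumptions about the Plazi treatment ontology; dwc: terms are
# standard Darwin Core.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/plazi/sparql")  # placeholder endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX dwc: <http://rdf.tdwg.org/dwc/terms/>
PREFIX trt: <http://plazi.org/vocab/treatment#>
SELECT ?treatment ?taxon WHERE {
  ?taxon dwc:genus "Atta" ;
         dwc:specificEpithet "sexdens" .
  ?treatment (trt:definesTaxonConcept|trt:augmentsTaxonConcept) ?taxon .
}
""")

for binding in sparql.queryAndConvert()["results"]["bindings"]:
    print(binding["treatment"]["value"], binding["taxon"]["value"])
```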


2018 ◽  
Vol 8 (1) ◽  
pp. 18-37 ◽  
Author(s):  
Median Hilal ◽  
Christoph G. Schuetz ◽  
Michael Schrefl

The foundations for traditional data analysis are Online Analytical Processing (OLAP) systems that operate on multidimensional (MD) data. The Resource Description Framework (RDF) serves as the foundation for the publication of a growing amount of semantic web data still largely untapped by companies for data analysis. Most RDF data sources, however, do not correspond to the MD modeling paradigm and, as a consequence, elude traditional OLAP. The complexity of RDF data in terms of structure, semantics, and query languages renders RDF data analysis challenging for a typical analyst not familiar with the underlying data model or the SPARQL query language. Hence, conducting RDF data analysis is not a straightforward task. We propose an approach for the definition of superimposed MD schemas over arbitrary RDF datasets and show how to represent the superimposed MD schemas using well-known semantic web technologies. On top of that, we introduce OLAP patterns for RDF data analysis, which are recurring, domain-independent elements of data analysis. Analysts may compose queries by instantiating a pattern using only the MD concepts and business terms. Upon pattern instantiation, the corresponding SPARQL query over the source data can be automatically generated, sparing analysts from technical details and fostering self-service capabilities.
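
To make the idea of a superimposed MD schema concrete, the sketch below declares a dimension and a measure over hypothetical properties of an existing RDF source, using rdflib and the W3C RDF Data Cube vocabulary as one well-known option; the paper's actual representation may differ.

```python
# Sketch (not the paper's exact vocabulary): describe a superimposed
# multidimensional schema over an existing RDF dataset using rdflib and
# the W3C RDF Data Cube vocabulary. Source properties are hypothetical.
from rdflib import Graph, Namespace, RDF

QB  = Namespace("http://purl.org/linked-data/cube#")
EX  = Namespace("http://example.org/mdschema#")
SRC = Namespace("http://example.org/data#")

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

# Superimpose MD semantics on existing source properties:
g.add((EX.region, RDF.type, QB.DimensionProperty))
g.add((EX.region, EX.mapsTo, SRC.locatedInRegion))  # mapping property is assumed
g.add((EX.sales,  RDF.type, QB.MeasureProperty))
g.add((EX.sales,  EX.mapsTo, SRC.hasRevenue))

print(g.serialize(format="turtle"))
```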


Author(s):  
G. Hiebel ◽  
K. Hanke

The ancient mining landscape of Schwaz/Brixlegg in the Tyrol, Austria witnessed mining from prehistoric to modern times, creating a first-order cultural landscape with respect to one of the most important inventions in human history: the production of metal. In 1991 part of this landscape was lost due to an enormous landslide that reshaped part of the mountain. With our work we propose a digital workflow to create a 3D semantic representation of this ancient mining landscape with its mining structures, in order to preserve it for posterity. First, we define a conceptual model to integrate the data. It is based on the CIDOC CRM ontology and CRMgeo for geometric data. To transform our information sources into a formal representation of the classes and properties of the ontology, we applied semantic web technologies and created a knowledge graph in RDF (Resource Description Framework). Through the CRMgeo extension, coordinate information about mining features can be integrated into the RDF graph and thus related to the detailed digital elevation model, which may be visualized together with the mining structures using geoinformation systems or 3D visualization tools. The RDF network of the triple store can be queried using the SPARQL query language. We created a snapshot of mining, settlement and burial sites in the Bronze Age. The results of the query were loaded into a geoinformation system, and a visualization of known Bronze Age sites related to mining, settlement and burial activities was created.
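
A hedged sketch of such a query is shown below with rdflib; the file name is hypothetical, the CIDOC CRM class and property (E27_Site, P53_has_former_or_current_location) are standard, and the CRMgeo namespace and geometry property are simplifying assumptions rather than the project's actual modelling.

```python
# Hedged sketch: select sites and their geometry from a CIDOC CRM / CRMgeo
# graph with rdflib. The CRMgeo namespace and property are assumptions.
from rdflib import Graph

g = Graph()
g.parse("mining_landscape.ttl", format="turtle")  # hypothetical export

query = """
PREFIX crm:    <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX crmgeo: <http://www.ics.forth.gr/isl/CRMgeo/>
SELECT ?site ?geometry WHERE {
  ?site a crm:E27_Site ;
        crm:P53_has_former_or_current_location ?place .
  ?place crmgeo:Q10_defined_by ?geometry .   # assumed CRMgeo property
}
"""
for site, geometry in g.query(query):
    print(site, geometry)
```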


2013 ◽  
Vol 07 (04) ◽  
pp. 455-477 ◽  
Author(s):  
EDGARD MARX ◽  
TOMMASO SORU ◽  
SAEEDEH SHEKARPOUR ◽  
SÖREN AUER ◽  
AXEL-CYRILLE NGONGA NGOMO ◽  
...  

Over the last few years, a considerable amount of structured data has been published on the Web as Linked Open Data (LOD). Despite recent advances, consuming and using Linked Open Data within an organization is still a substantial challenge. Many of the LOD datasets are quite large and, despite progress in Resource Description Framework (RDF) data management, loading and querying them within a triple store is extremely time-consuming and resource-demanding. To overcome this consumption obstacle, we propose a process inspired by the classical Extract-Transform-Load (ETL) paradigm. In this article, we focus particularly on the selection and extraction steps of this process. We devise a fragment of the SPARQL Protocol and RDF Query Language (SPARQL) dubbed SliceSPARQL, which enables the selection of well-defined slices of datasets fulfilling typical information needs. SliceSPARQL supports graph patterns for which each connected subgraph pattern involves a maximum of one variable or Internationalized Resource Identifier (IRI) in its join conditions. This restriction guarantees the efficient processing of the query against a sequential dataset dump stream. Furthermore, we evaluate our slicing approach using three different optimization strategies. Results show that dataset slices can be generated an order of magnitude faster than by using the conventional approach of loading the whole dataset into a triple store.
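
The streaming idea can be illustrated with a much simpler sketch (my illustration, not SliceSPARQL itself): a single sequential pass over an N-Triples dump that keeps only triples matching one fixed predicate, so no triple store is needed to produce the slice.

```python
# Illustration of the streaming idea behind slicing (not the actual
# SliceSPARQL implementation): extract all triples whose predicate matches
# a fixed IRI from an N-Triples dump in one sequential pass.
import re

PREDICATE = "<http://xmlns.com/foaf/0.1/name>"  # example predicate to slice on
# Simple pattern for well-formed, single-line N-Triples statements.
triple_re = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")

def slice_dump(in_path, out_path, predicate=PREDICATE):
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            match = triple_re.match(line)
            if match and match.group(2) == predicate:
                dst.write(line)

# slice_dump("dataset_dump.nt", "names_slice.nt")  # hypothetical files
```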


Author(s):  
S. Ronzhin ◽  
G. Bosch ◽  
E. Folmer ◽  
R. Lemmens

Modern software tools for managing Linked Data are often designed for skilled users. Therefore, they cannot easily be used for educational purposes, because they require substantial a priori knowledge of the Resource Description Framework and the SPARQL query language. LinkDaLe is a single-page application designed to teach students the concept of Linked Data while working with Linked Data at the same time. In the paper we showcase the interface and functionality of LinkDaLe by triplifying data on Geo4All member organizations. The application was built and evaluated within The Business Process Integration Lab, a master's programme course, in 2016 and 2017. Positive feedback from both students and teachers confirmed the relevance of the proposed design considerations. LinkDaLe proved usable for working with domain-specific data, e.g. geospatial and logistics data.
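
A minimal sketch of the kind of triplification LinkDaLe performs follows (my illustration, not its actual code): tabular rows describing member organizations are mapped to RDF triples with rdflib. Column names and vocabulary choices are hypothetical.

```python
# Minimal illustration (not LinkDaLe's actual code) of triplifying tabular
# data about member organizations into RDF with rdflib. Column names and
# vocabulary choices are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
GEO  = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
EX   = Namespace("http://example.org/org/")

rows = [
    {"id": "itc-utwente", "name": "ITC, University of Twente", "lat": "52.22", "long": "6.89"},
]

g = Graph()
for row in rows:
    org = EX[row["id"]]
    g.add((org, RDF.type, FOAF.Organization))
    g.add((org, FOAF.name, Literal(row["name"])))
    g.add((org, GEO.lat, Literal(row["lat"])))
    g.add((org, GEO.long, Literal(row["long"])))

print(g.serialize(format="turtle"))
```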


Author(s):  
Kamalendu Pal

Manufacturing communities around the globe are eagerly witnessing the recent developments in semantic web technology (SWT). This technology combines a set of new mechanisms with grounded knowledge representation techniques to address the needs of formal information modelling and reasoning for web-based services. This chapter provides a high-level summary of SWT to help better understand the impact that this technology will have on wider enterprise information architectures. In many cases it also reuses familiar concepts with a new twist, for example “ontologies” for “data dictionaries” and “semantic models” for “data models.” This chapter demonstrates the usefulness of a proposed architecture by applying it to the integration of data from multiple heterogeneous sources, which entails semantic mapping between the source schemas and a Resource Description Framework (RDF) ontology described declaratively and queried using a specific query language (i.e., SPARQL). Finally, the semantics of query rewriting are discussed further and a query rewriting algorithm is presented.
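
A hedged sketch of the query-rewriting idea follows (not the chapter's algorithm): given declarative mappings from ontology properties to source-schema columns, a simple SPARQL basic graph pattern can be rewritten into a SQL query over one source table. All table, column and property names are hypothetical.

```python
# Sketch of the query-rewriting idea (not the chapter's algorithm): use
# declarative mappings from RDF ontology properties to source-schema
# columns to rewrite a simple SPARQL basic graph pattern into SQL.
# All table/column/property names are hypothetical.

MAPPINGS = {  # ontology property -> (source table, column)
    "ex:partNumber": ("parts", "part_no"),
    "ex:supplier":   ("parts", "supplier_name"),
}

def rewrite(triple_patterns):
    """Rewrite [(subject_var, property, object_var), ...] into a SQL query,
    assuming all properties map onto columns of the same source table."""
    tables = {MAPPINGS[p][0] for _, p, _ in triple_patterns}
    assert len(tables) == 1, "sketch handles a single source table only"
    columns = [MAPPINGS[p][1] for _, p, _ in triple_patterns]
    return f"SELECT {', '.join(columns)} FROM {tables.pop()};"

print(rewrite([("?p", "ex:partNumber", "?n"), ("?p", "ex:supplier", "?s")]))
```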

