Data Quality Issues in Linked Open Data

Author(s):  
Anisa Rula ◽  
Andrea Maurino ◽  
Carlo Batini


Author(s):  
John Waller

I will cover how the Global Biodiversity Information Facility (GBIF) handles data quality issues, with specific focus on coordinate location issues such as gridded datasets (Fig. 1) and country centroids. I will highlight the challenges GBIF faces in identifying potential data quality problems and what we and others (Zizka et al. 2019) are doing to discover and address them. GBIF is the largest open-data portal of biodiversity data, comprising a large network of individual datasets (> 40k) from various sources and publishers. Because these datasets vary both internally and from dataset to dataset, they pose a challenge for users wanting to combine data collected from museums, smartphones, atlases, satellite tracking, DNA sequencing, and various other sources for research or analysis. Data quality at GBIF will always be a moving target (Chapman 2005), and GBIF already handles many obvious errors such as zero/impossible coordinates, empty or invalid data fields, and fuzzy taxon matching. Since GBIF primarily (but not exclusively) serves lat-lon location information, there is an expectation that occurrences fall somewhat close to where the species actually occurs. This is not always the case. Occurrence records can be hundreds of kilometers away from where the species naturally occurs, and there are multiple reasons for this, which might not be obvious to users. One reason is that many GBIF datasets are gridded. Gridded datasets have low resolution due to equally-spaced sampling, which can be a data quality issue because a user might assume an occurrence record was recorded exactly at its coordinates. Country centroids are another reason why a species occurrence record might be far from where it occurs naturally. GBIF does not yet flag country centroids, which are records where the dataset publisher has entered the lat-lon center of a country instead of leaving the coordinate fields blank. I will discuss the challenges surrounding locating these issues and the current solutions (such as the CoordinateCleaner R package). I will touch on how existing Darwin Core terms like coordinateUncertaintyInMeters and footprintWKT are being used to flag low coordinate resolution. Finally, I will highlight some other emerging data quality issues and how GBIF is beginning to experiment with dataset-level flagging. Currently we have flagged around 500 datasets as gridded and around 400 datasets as citizen science, but there are many more potential dataset flags.
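As a rough illustration of the kind of checks involved, here is a minimal Python sketch of two heuristics: detecting a regular coordinate lattice (a gridded dataset) and flagging records that sit on a country centroid. The function names, thresholds, and toy centroid table are illustrative assumptions, not GBIF's or CoordinateCleaner's actual implementation.

```python
from collections import Counter
from math import radians, sin, cos, asin, sqrt

def likely_gridded(values, min_fraction=0.5):
    """Heuristic: a coordinate column looks gridded if one spacing dominates
    the differences between sorted unique values."""
    unique = sorted(set(round(v, 6) for v in values))
    if len(unique) < 3:
        return False
    diffs = Counter(round(b - a, 6) for a, b in zip(unique, unique[1:]))
    spacing, count = diffs.most_common(1)[0]
    return spacing > 0 and count / (len(unique) - 1) >= min_fraction

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Toy centroid table; a real check would use a full gazetteer.
COUNTRY_CENTROIDS = {"DE": (51.16, 10.45), "FR": (46.23, 2.21)}

def near_centroid(lat, lon, country_code, tolerance_km=5.0):
    """Flag a record whose coordinates sit suspiciously close to the
    geographic centre of its reported country."""
    centroid = COUNTRY_CENTROIDS.get(country_code)
    return centroid is not None and haversine_km(lat, lon, *centroid) <= tolerance_km

# Example: a 0.5-degree latitude lattice is detected as gridded, and a point
# at Germany's approximate centre is flagged as a possible country centroid.
lats = [48.0 + 0.5 * i for i in range(20)]
print(likely_gridded(lats))                  # True
print(near_centroid(51.16, 10.45, "DE"))     # True
```

In practice such flags are advisory rather than conclusive: a record on a lattice point may still be accurate, so dataset-level evidence (many records sharing the same spacing) is what makes the gridded flag credible.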


2015 ◽  
Vol 26 (1) ◽  
pp. 60-82 ◽  
Author(s):  
Carlo Batini ◽  
Anisa Rula ◽  
Monica Scannapieco ◽  
Gianluigi Viscusi

This article investigates the evolution of data quality issues from traditional structured data managed in relational databases to Big Data. In particular, the article examines the nature of the relationship between data quality and several research coordinates that are relevant in Big Data, such as the variety of data types, data sources, and application domains, focusing on maps, semi-structured texts, linked open data, sensors & sensor networks, and official statistics. Consequently, a set of structural characteristics is identified and a systematization of the a posteriori correlation between them and quality dimensions is provided. Finally, Big Data quality issues are considered in a conceptual framework suitable for mapping the evolution of the quality paradigm according to three core coordinates that are significant in the context of the Big Data phenomenon: the data type considered, the source of data, and the application domain. The framework thus allows ascertaining the relevant changes in data quality emerging with the Big Data phenomenon, through an integrative and theoretical literature review.
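To make the framework's three coordinates concrete, the sketch below encodes (data type, source, domain) tuples and an assumed mapping to quality dimensions in Python; the tuples and dimension lists are toy examples, not the correlations actually systematized in the article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityCoordinate:
    data_type: str  # e.g. "maps", "semi-structured text", "linked open data"
    source: str     # e.g. "sensor network", "official statistics"
    domain: str     # e.g. "geography", "e-government"

# Hypothetical a-posteriori correlations between coordinates and dimensions.
RELEVANT_DIMENSIONS = {
    QualityCoordinate("linked open data", "web", "open government"):
        ["accuracy", "completeness", "consistency", "trustworthiness"],
    QualityCoordinate("sensor data", "sensor network", "environmental monitoring"):
        ["timeliness", "accuracy", "precision"],
}

for coord, dims in RELEVANT_DIMENSIONS.items():
    print(f"{coord.data_type} / {coord.source} / {coord.domain}: {', '.join(dims)}")
```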


2016 ◽  
Vol 50 (2) ◽  
pp. 184-194
Author(s):  
Mahdi Zahedi Nooghabi ◽  
Akram Fathian Dastgerdi

Purpose – One of the most important categories in linked open data (LOD) quality models is "data accessibility." The purpose of this paper is to propose metrics and indicators for assessing data accessibility in the LOD and semantic web context.
Design/methodology/approach – The authors first review several data quality and LOD quality models to survey the subcategories proposed for the data accessibility dimension in the related literature. Then, following the goal question metric (GQM) approach, they specify the project goals, main issues, and a set of questions. Finally, they propose metrics for assessing data accessibility in the context of the semantic web.
Findings – Based on the GQM approach, the authors identified three main issues for data accessibility: data availability, data performance, and data security policy. They then formulated four main questions related to these issues and, in conclusion, proposed 27 metrics for measuring them.
Originality/value – One of the main challenges regarding data quality today is the lack of agreement on widespread quality metrics and practical instruments for evaluating quality. Accessibility is an important aspect of data quality, yet little research has provided metrics and indicators for assessing it in the context of the semantic web. This research therefore focuses on the data accessibility dimension and proposes a comparatively comprehensive set of metrics.
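As a concrete illustration of one such metric family, the following Python sketch measures URI dereferenceability and latency for a sample of LOD resources; the sample URI, Accept header choice, and scoring function are illustrative assumptions, not the paper's actual 27 metrics.

```python
import time
import urllib.request

def dereference(uri, timeout=10):
    """Try to dereference a URI as RDF; return (ok, latency_seconds)."""
    request = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, timeouts
        ok = False
    return ok, time.monotonic() - start

def availability_score(uris):
    """Availability metric: fraction of sampled URIs that dereference OK."""
    results = [dereference(u) for u in uris]
    return sum(ok for ok, _ in results) / len(results)

if __name__ == "__main__":
    sample = ["http://dbpedia.org/resource/Berlin"]  # illustrative resource
    print(f"availability: {availability_score(sample):.2f}")
```

A performance metric in the same spirit could aggregate the returned latencies against a threshold, which matches the GQM pattern of deriving measurable quantities from each question.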

