Journal of Data and Information Quality
Latest Publications


TOTAL DOCUMENTS

231
(FIVE YEARS 119)

H-INDEX

18
(FIVE YEARS 6)

Published By Association For Computing Machinery

1936-1955

2022 ◽  
Vol 14 (1) ◽  
pp. 1-27
Author(s):  
Khalid Belhajjame

Workflows have been adopted in several scientific fields as a tool for the specification and execution of scientific experiments. In addition to automating the execution of experiments, workflow systems often include capabilities to record provenance information, which contains, among other things, data records used and generated by the workflow as a whole but also by its component modules. It is widely recognized that provenance information can be useful for the interpretation, verification, and re-use of workflow results, justifying its sharing and publication among scientists. However, workflow execution in some branches of science can manipulate sensitive datasets that contain information about individuals. To address this problem, we investigate, in this article, the anonymization of workflow provenance. In doing so, we consider a popular class of workflows in which component modules use and generate collections of data records as a result of their invocation, as opposed to a single data record. The solution we propose offers guarantees of confidentiality without compromising lineage information, which provides transparency as to the relationships between the data records used and generated by the workflow modules. We provide algorithmic solutions that show how the provenance of a single module and an entire workflow can be anonymized and present the results of experiments that we conducted for their evaluation.


2022 ◽  
Vol 14 (1) ◽  
pp. 1-12
Author(s):  
Sandra Geisler ◽  
Maria-Esther Vidal ◽  
Cinzia Cappiello ◽  
Bernadette Farias Lóscio ◽  
Avigdor Gal ◽  
...  

A data ecosystem (DE) offers a keystone-player or alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. However, despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. In this work, we focus on requirements and challenges that DEs face when ensuring data transparency. Requirements are derived from the data and organizational management, as well as from broader legal and ethical considerations. We propose a novel knowledge-driven DE architecture, providing the pillars for satisfying the analyzed requirements. We illustrate the potential of our proposal in a real-world scenario. Last, we discuss and rate the potential of the proposed architecture in the fulfillment of these requirements.


2022 ◽  
Vol 14 (2) ◽  
pp. 1-24
Author(s):  
Bin Wang ◽  
Pengfei Guo ◽  
Xing Wang ◽  
Yongzhong He ◽  
Wei Wang

Aspect-level sentiment analysis identifies fine-grained emotion for target words. There are three major issues in current models of aspect-level sentiment analysis. First, few models consider the natural language semantic characteristics of the texts. Second, many models consider the location characteristics of the target words, but ignore the relationships among the target words and among the overall sentences. Third, many models lack transparency in data collection, data processing, and result generation in sentiment analysis. To resolve these issues, we propose an aspect-level sentiment analysis model that combines a bidirectional Long Short-Term Memory (LSTM) network and a Graph Convolutional Network (GCN) based on Dependency syntax analysis (Bi-LSTM-DGCN). Our model integrates the dependency syntax analysis of the texts, and explicitly considers the natural language semantic characteristics of the texts. It further fuses the target words and overall sentences. Extensive experiments are conducted on four benchmark datasets, i.e., Restaurant14, Laptop, Restaurant16, and Twitter. The experimental results demonstrate that our model outperforms other models like Target-Dependent LSTM (TD-LSTM), Attention-based LSTM with Aspect Embedding (ATAE-LSTM), LSTM+SynATT+TarRep and Convolution over a Dependency Tree (CDT). Our model is further applied to aspect-level sentiment analysis on “government” and “lockdown” of 1,658,250 tweets about “#COVID-19” that we collected from March 1, 2020 to July 1, 2020. The experimental results show that Twitter users’ positive and negative sentiments fluctuated over time. Through the transparency analysis in data collection, data processing, and result generation, we discuss the reasons for the evolution of users’ emotions over time based on the tweets and on our models.


2022 ◽  
Vol 14 (2) ◽  
pp. 1-15
Author(s):  
Lara Mauri ◽  
Ernesto Damiani

Large-scale adoption of Artificial Intelligence and Machine Learning (AI-ML) models fed by heterogeneous, possibly untrustworthy data sources has spurred interest in estimating degradation of such models due to spurious, adversarial, or low-quality data assets. We propose a quantitative estimate of the severity of classifiers’ training set degradation: an index expressing the deformation of the convex hulls of the classes computed on a held-out dataset generated via an unsupervised technique. We show that our index is computationally light, can be calculated incrementally, and complements existing quality measures for ML data assets well. As an experiment, we present the computation of our index on a benchmark convolutional image classifier.


2022 ◽  
Vol 14 (1) ◽  
pp. 1-10
Author(s):  
Tooska Dargahi ◽  
Hossein Ahmadvand ◽  
Mansour Naser Alraja ◽  
Chia-Mu Yu

Connected and Autonomous Vehicles (CAVs) are introduced to improve individuals’ quality of life by offering a wide range of services. They collect a huge amount of data and exchange them with each other and the infrastructure. The collected data usually includes sensitive information about the users and the surrounding environment. Therefore, data security and privacy are among the main challenges in this industry. Blockchain, an emerging distributed ledger, has been considered by the research community as a potential solution for enhancing data security, integrity, and transparency in Intelligent Transportation Systems (ITS). However, despite the emphasis of governments on the transparency of personal data protection practices, CAV stakeholders have not been successful in communicating appropriate information with the end users regarding the procedure of collecting, storing, and processing their personal data, as well as the data ownership. This article provides a vision of the opportunities and challenges of adopting blockchain in ITS from the “data transparency” and “privacy” perspective. The main aim is to answer the following questions: (1) Considering the amount of personal data collected by the CAVs, such as location, how would the integration of blockchain technology affect transparency, fairness, and lawfulness of personal data processing concerning the data subjects (as this is one of the main principles in the existing data protection regulations)? (2) How can the trade-off between transparency and privacy be addressed in blockchain-based ITS use cases?


2022 ◽  
Vol 14 (1) ◽  
pp. 1-9
Author(s):  
Saravanan Thirumuruganathan ◽  
Mayuresh Kunjir ◽  
Mourad Ouzzani ◽  
Sanjay Chawla

The data and Artificial Intelligence revolution has had a massive impact on enterprises, governments, and society alike. It is fueled by two key factors. First, data have become increasingly abundant and are often available openly. Enterprises have more data than they can process. Governments are spearheading open data initiatives by setting up data portals such as data.gov and releasing large amounts of data to the public. Second, AI engineering development is becoming increasingly democratized. Open source frameworks have enabled even an individual developer to engineer sophisticated AI systems. But with such ease of use comes the potential for irresponsible use of data. Ensuring that AI systems adhere to a set of ethical principles is one of the major problems of our age. We believe that data and model transparency has a key role to play in mitigating the deleterious effects of AI systems. In this article, we describe a framework to synthesize ideas from various domains such as data transparency, data quality, data governance among others to tackle this problem. Specifically, we advocate an approach based on automated annotations (of both data and the AI model), which has a number of appealing properties. The annotations could be used by enterprises to get visibility of potential issues, prepare data transparency reports, create and ensure policy compliance, and evaluate the readiness of data for diverse downstream AI applications. We propose a model architecture and enumerate its key components that could achieve these requirements. Finally, we describe a number of interesting challenges and opportunities.


2021 ◽  
Vol 13 (4) ◽  
pp. 1-7
Author(s):  
Jakub Kubiczek ◽  
Bartłomiej Hadasik

2021 ◽  
Vol 13 (4) ◽  
pp. 1-35
Author(s):  
Gabriel Amaral ◽  
Alessandro Piscopo ◽  
Lucie-Aimée Kaffee ◽  
Odinaldo Rodrigues ◽  
Elena Simperl

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important, as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on previous work of ours, we run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices, which could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.


2021 ◽  
Vol 13 (4) ◽  
pp. 1-24
Author(s):  
Jessica Chen ◽  
Henry Milner ◽  
Ion Stoica ◽  
Jibin Zhan

The HTTP adaptive streaming technique made it possible to cope with fluctuating network conditions during the streaming process by dynamically adjusting the volume of the future chunks to be downloaded. The bitrate selection in this adjustment inevitably involves the task of predicting the future throughput of a video session, owing to which various heuristic solutions have been explored. The ultimate goal of the present work is to explore the theoretical upper bounds of the QoE that any ABR algorithm can possibly reach, therefore providing an essential step to benchmarking the performance evaluation of ABR algorithms. In our setting, the QoE is defined in terms of a linear combination of the average perceptual quality and the buffering ratio. The optimization problem is proven to be NP-hard when the perceptual quality is defined by chunk size, and conditions are given under which the problem becomes polynomially solvable. Enriched by a global lower bound, a pseudo-polynomial time algorithm along the dynamic programming approach is presented. When minimum buffering is given priority over higher perceptual quality, the problem is shown to be also NP-hard, and the above algorithm is simplified and enhanced by a sequence of lower bounds on the completion time of chunk downloading, which, according to our experiment, brings a 36.0% performance improvement in terms of computation time. To handle large amounts of data more efficiently, a polynomial-time algorithm is also introduced to approximate the optimal values when minimum buffering is prioritized. Besides its performance guarantee, this algorithm is shown to reach 99.938% close to the optimal results, while taking only 0.024% of the computation time compared to the exact algorithm in dynamic programming.

