Practical use of aggregator data quality metrics in a collection scenario

2018 ◽  
Vol 2 ◽  
pp. e25970
Author(s):  
Andrew Bentley

The recent incorporation of standardized data quality metrics into the GBIF, iDigBio, and ALA portal infrastructures provides data providers with useful information they can use to clean or augment Darwin Core data at the source based on these recommendations. Numerous taxonomy- and geography-based metrics offer useful information on the quality of various Darwin Core fields in this realm, while also providing input on Darwin Core compliance for others. As a provider/data manager for the Biodiversity Institute, University of Kansas, I have spent some time evaluating their efficacy and reliability; this presentation will highlight some of the positive and negative aspects of my experience with specific examples, while raising concerns regarding the user experience and the standardization of these metrics across the aggregator landscape. These metrics have surfaced both data and publishing issues, and addressing them has increased the utility and cleanliness of our data, while also highlighting batch-processing challenges and issues with the process of inferring "bad" data. The integration of these metrics into source database infrastructure will also be proposed, with Specify Software as an example.
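For providers who want to act on these flags programmatically rather than through the portals, a minimal sketch along the following lines can tally the quality flags GBIF attaches to each published occurrence record via its public REST API; the dataset key below is a placeholder, and the flag vocabulary is GBIF's own:

```python
# Sketch: tally GBIF-assigned data quality flags for one dataset's occurrence
# records, so issues can be reviewed and fixed at the source.
import collections
import requests

GBIF_API = "https://api.gbif.org/v1/occurrence/search"
DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder: your dataset's UUID

def summarize_issues(dataset_key, pages=5, page_size=300):
    """Count GBIF-reported issue flags across a sample of the dataset's records."""
    counts = collections.Counter()
    for page in range(pages):
        resp = requests.get(GBIF_API, params={
            "datasetKey": dataset_key,
            "limit": page_size,
            "offset": page * page_size,
        })
        resp.raise_for_status()
        for record in resp.json().get("results", []):
            # Each occurrence carries a list of flags such as
            # COORDINATE_ROUNDED or TAXON_MATCH_FUZZY.
            counts.update(record.get("issues", []))
    return counts

if __name__ == "__main__":
    for issue, n in summarize_issues(DATASET_KEY).most_common():
        print(f"{issue}: {n}")
```

A provider could run such a tally after each publishing cycle to see whether, say, coordinate or taxon-match flags are trending downward as the source data are cleaned.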

2008 ◽  
pp. 3067-3084
Author(s):  
John Talburt ◽  
Richard Wang ◽  
Kimberly Hess ◽  
Emily Kuo

This chapter introduces abstract algebra as a means of understanding and creating data quality metrics for entity resolution, the process in which records determined to represent the same real-world entity are successively located and merged. Entity resolution is a particular form of data mining that is foundational to a number of applications in both industry and government. Examples include commercial customer recognition systems and information sharing on “persons of interest” across federal intelligence agencies. Despite the importance of these applications, most of the data quality literature focuses on measuring the intrinsic quality of individual records rather than the quality of record grouping or integration. In this chapter, the authors describe current research into the creation and validation of quality metrics for entity resolution, primarily in the context of customer recognition systems. The approach is based on an algebraic view of the system as creating a partition of a set of entity records based on the indicative information for the entities in question. In this view, the relative quality of entity identification between two systems can be measured in terms of the similarity between the partitions they produce. The authors discuss the difficulty of applying statistical cluster analysis to this problem when the datasets are large and propose an alternative index suitable for these situations. They also report some preliminary experimental results and outline areas and approaches for further research.
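The chapter's own index is not reproduced here, but a minimal sketch of an overlap-based partition comparison in the spirit the authors describe (assuming an index of the form sqrt(|A|·|B|)/|V|, where |A| and |B| are the class counts of the two partitions and |V| is the number of non-empty class intersections) might look like:

```python
# Illustrative sketch (not necessarily the chapter's exact index): compare two
# entity resolution outputs viewed as partitions of the same record set. The
# index equals 1.0 when the partitions agree exactly and decreases as their
# classes fragment each other, without the pairwise record counting that makes
# classical cluster-comparison statistics costly on large datasets.
def partition_overlap_index(partition_a, partition_b):
    """partition_a, partition_b: lists of sets of record IDs covering the same records."""
    overlaps = 0
    for class_a in partition_a:
        for class_b in partition_b:
            if class_a & class_b:
                overlaps += 1
    return (len(partition_a) * len(partition_b)) ** 0.5 / overlaps

# Example: system B splits one of system A's customers into two identities.
A = [{1, 2, 3}, {4, 5}]
B = [{1, 2}, {3}, {4, 5}]
print(partition_overlap_index(A, A))  # 1.0 (identical groupings)
print(partition_overlap_index(A, B))  # ~0.82 (disagreement penalized)
```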


2021 ◽  
Author(s):  
Thomas Naake ◽  
Wolfgang Huber

Motivation: First-line data quality assessment and exploratory data analysis are integral parts of any data analysis workflow. In high-throughput quantitative omics experiments (e.g. transcriptomics, proteomics, metabolomics), after initial processing, the data are typically presented as a matrix of numbers (feature IDs × samples). Efficient and standardized calculation and visualization of data quality metrics are key to tracking the within-experiment quality of these rectangular data types and to guaranteeing high-quality data sets and sound, biological-question-driven downstream inference. Results: We present MatrixQCvis, which provides interactive visualization of data quality metrics at the per-sample and per-feature level using R's shiny framework. It provides efficient and standardized ways to analyze the data quality of quantitative omics data types that come in a matrix-like format (feature IDs × samples). MatrixQCvis builds upon the Bioconductor SummarizedExperiment S4 class and thus facilitates integration into existing workflows. Availability: MatrixQCvis is implemented in R. It is available via Bioconductor and released under the GPL v3.0 license.
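MatrixQCvis itself is an R/Bioconductor package; purely as a language-agnostic illustration of the kind of per-sample metric such a tool computes on a features × samples matrix, consider this toy sketch (the data and metrics are invented for illustration, not the package's API):

```python
# Generic illustration only: per-sample quality metrics on a
# features x samples intensity matrix, of the kind such tools visualize.
import numpy as np

rng = np.random.default_rng(0)
# Toy omics matrix: 500 features x 6 samples, with ~10% missing values.
X = rng.lognormal(mean=10, sigma=1, size=(500, 6))
X[rng.random(X.shape) < 0.1] = np.nan

missing_frac = np.mean(np.isnan(X), axis=0)    # fraction of missing features per sample
sample_median = np.nanmedian(X, axis=0)        # median intensity per sample
rel_median = sample_median / np.median(sample_median)  # drift vs. experiment-wide median

for i, (m, r) in enumerate(zip(missing_frac, rel_median)):
    print(f"sample {i}: missing={m:.1%}, relative median intensity={r:.2f}")
```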


2018 ◽  
Vol 10 (1) ◽  
pp. 1-26 ◽  
Author(s):  
Christian Bors ◽  
Theresia Gschwandtner ◽  
Simone Kriglstein ◽  
Silvia Miksch ◽  
Margit Pohl

2011 ◽  
Vol 11 (2) ◽  
pp. 1412-1419 ◽  
Author(s):  
Christopher R. Kinsinger ◽  
James Apffel ◽  
Mark Baker ◽  
Xiaopeng Bian ◽  
Christoph H. Borchers ◽  
...  

2017 ◽  
Vol 20 (2) ◽  
Author(s):  
Flavia Serra ◽  
Adriana Marotta

The fact that Data Quality (DQ) depends on the context in which data are produced, stored, and used is widely recognized in the research community. Data Warehouse Systems (DWS), whose main goal is to support decision making based on data, have grown enormously in recent years, in both research and industry. DQ in this kind of system is therefore essential. This work presents a proposal for identifying DQ problems in the domain of DWS, considering the different contexts that exist in each system component. This proposal may act as a first conceptual framework that guides those responsible for DQ in managing it in DWS. The main contributions of this work are a thorough literature review of how contexts are used for evaluating DQ in DWS, and a proposal for assessing DQ in DWS through context-based DQ metrics.
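As a rough illustration of what a context-based DQ metric might look like in code, the following sketch evaluates one completeness metric against component-specific thresholds; all names and thresholds are hypothetical, not the paper's actual framework:

```python
# Hypothetical sketch: the same completeness measure is judged against
# different thresholds depending on the DWS component (e.g. staging vs.
# warehouse) in which the data currently live.
from dataclasses import dataclass

@dataclass
class Context:
    component: str                 # e.g. "staging", "warehouse"
    required_completeness: float   # threshold this component demands

def completeness(rows, field):
    """Fraction of rows with a non-null value for `field`."""
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows) if rows else 1.0

def assess(rows, field, context):
    score = completeness(rows, field)
    return score, score >= context.required_completeness

rows = [{"customer_id": 1, "region": "EU"},
        {"customer_id": 2, "region": None}]

# The same data pass in a permissive staging context but fail in the
# stricter warehouse context that decision-making queries rely on.
for ctx in (Context("staging", 0.4), Context("warehouse", 0.95)):
    score, ok = assess(rows, "region", ctx)
    print(f"{ctx.component}: completeness={score:.2f}, acceptable={ok}")
```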


2011 ◽  
Vol 10 (12) ◽  
pp. O111.015446 ◽  
Author(s):  
Christopher R. Kinsinger ◽  
James Apffel ◽  
Mark Baker ◽  
Xiaopeng Bian ◽  
Christoph H. Borchers ◽  
...  
