Basic Data Quality Tools

Author(s):  
Thomas N. Herzog ◽  
Fritz J. Scheuren ◽  
William E. Winkler
2019 ◽  
Vol 214 ◽  
pp. 01030
Author(s):  
Juraj Smiesko

Tile-in-One is an integrated system for data quality and conditions assessment of the ATLAS Tile Calorimeter. It is a platform that combines all of the ATLAS Tile Calorimeter offline data quality tools in one unified web interface. It achieves this with a simple main web server that serves as a central hub, plus a group of small web applications, called plugins, that provide the data quality assessment tools. Every plugin runs in its own virtual machine to prevent interference between plugins and to increase the stability of the platform.
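The hub-and-plugin layout described in this abstract can be illustrated with a minimal Python sketch. It is not the Tile-in-One code: the `Plugin` and `Hub` classes, the plugin names, and the backend addresses are all hypothetical, and the sketch only shows how a central hub might map a request path to a plugin that runs on its own machine.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Plugin:
    """Hypothetical plugin record: a name and the address of its own VM."""
    name: str
    backend: str  # base URL of the plugin's virtual machine (made up here)


class Hub:
    """Central hub that maps URL prefixes to independently hosted plugins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}

    def register(self, plugin: Plugin) -> None:
        """Make a plugin reachable under /<plugin.name>/ on the hub."""
        self._plugins[plugin.name] = plugin

    def resolve(self, path: str) -> Optional[str]:
        """Translate a hub path /<plugin>/<rest> into a backend URL.

        Returns None when no plugin with that name is registered.
        """
        parts = path.strip("/").split("/", 1)
        plugin = self._plugins.get(parts[0])
        if plugin is None:
            return None
        rest = parts[1] if len(parts) > 1 else ""
        return f"{plugin.backend.rstrip('/')}/{rest}"


# Assumed plugin names and addresses, for illustration only.
hub = Hub()
hub.register(Plugin("timing", "http://10.0.0.11:8080"))
hub.register(Plugin("noise", "http://10.0.0.12:8080"))
```

Because each plugin is reached only through its own backend address, a crash or misconfiguration in one plugin leaves the hub's routing table and the other plugins untouched, which is the isolation property the abstract attributes to running each plugin in a separate virtual machine.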


Author(s):  
Thomas Foken ◽  
Mathias Göckede ◽  
Johannes Lüers ◽  
Lukas Siebicke ◽  
Corinna Rebmann ◽  
...  

2017 ◽  
Vol 56 (S 01) ◽  
pp. e67-e73 ◽  
Author(s):  
Martin Bialke ◽  
Henriette Rau ◽  
Thea Schwaneberg ◽  
Rene Walk ◽  
Thomas Bahls ◽  
...  

Summary

Background: Epidemiological studies are based on a considerable amount of personal, medical and socio-economic data. To answer research questions with reliable results, epidemiological research projects face the challenge of providing high quality data. Consequently, gathered data has to be reviewed continuously during the data collection period.

Objectives: This article describes the development of the mosaicQA-library for non-statistical experts consisting of a set of reusable R functions to provide support for a basic data quality assurance for a wide range of application scenarios in epidemiological research.

Methods: To generate valid quality reports for various scenarios and data sets, a general and flexible development approach was needed. As a first step, a set of quality-related questions, targeting quality aspects on a more general level, was identified. The next step included the design of specific R-scripts to produce proper reports for metric and categorical data. For more flexibility, the third development step focussed on the generalization of the developed R-scripts, e.g. extracting characteristics and parameters. As a last step, the generic characteristics of the developed R functionalities and generated reports have been evaluated using different metric and categorical datasets.

Results: The developed mosaicQA-library generates basic data quality reports for multivariate input data. If needed, more detailed results for single-variable data, including definition of units, variables, descriptions, code lists and categories of qualified missings, can easily be produced.

Conclusions: The mosaicQA-library enables researchers to generate reports for various kinds of metric and categorical data without the need for computational or scripting knowledge. At the moment, the library focusses on the data structure quality and supports the assessment of several quality indicators, including frequency, distribution and plausibility of research variables as well as the occurrence of missing and extreme values. To simplify the installation process, mosaicQA has been released as an official R-package.
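The quality indicators named in this abstract (frequency, distribution, plausibility, missing and extreme values) can be sketched in a few lines. The example below is written in Python rather than R and does not reproduce the mosaicQA API: `metric_report` and `categorical_report` are hypothetical names, missing values are assumed to be encoded as `None`, and the 1.5 × IQR fence used to flag extreme values is an assumption made here, not a rule taken from the article.

```python
from collections import Counter
from statistics import mean, quantiles


def metric_report(values, fence=1.5):
    """Summarise a metric variable: missing count, centre, extreme values."""
    present = sorted(v for v in values if v is not None)
    q1, median, q3 = quantiles(present, n=4)  # needs at least two values
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "median": median,
        # Values outside the assumed IQR fences are flagged as extreme.
        "extreme": [v for v in present if v < low or v > high],
    }


def categorical_report(values, code_list):
    """Summarise a categorical variable against its list of allowed codes."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "frequency": dict(Counter(present)),
        # Plausibility check: codes that do not appear in the code list.
        "implausible": sorted(set(present) - set(code_list)),
    }


# Made-up example data: body height in cm and a coded sex variable.
height_report = metric_report([170, 165, 180, 175, None, 172, 168, 999])
sex_report = categorical_report(["m", "f", "f", None, "x"], code_list={"m", "f"})
```

Separating the metric and categorical cases mirrors the two report types the abstract mentions; a real report would also need the variable descriptions, units and categories of qualified missings that the article lists.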


2011 ◽  
Vol 4 (0) ◽  
Author(s):  
Ian Painter ◽  
Julie Eaton ◽  
Don Olson ◽  
William Lober ◽  
Debra Revere

2021 ◽  
Author(s):  
Aaron J Moss ◽  
Cheskie Rosenzweig ◽  
Shalom Noach Jaffe ◽  
Richa Gautam ◽  
Jonathan Robinson ◽  
...  

Online data collection has become indispensable to the social sciences, polling, marketing, and corporate research. However, in recent years, online data collection has been inundated with low-quality data. Low-quality data threatens the validity of online research and, at times, invalidates entire studies. It is often assumed that random, inconsistent, and fraudulent data in online surveys comes from ‘bots.’ But little is known about whether bad data is caused by bots or by ill-intentioned or inattentive humans. We examined this issue on Mechanical Turk (MTurk), a popular online data collection platform. In the summer of 2018, researchers noticed a sharp increase in the number of data quality problems on MTurk, problems that were commonly attributed to bots. Despite this assumption, few studies have directly examined whether problematic data on MTurk come from bots or inattentive humans, even though identifying the source of bad data has important implications for creating the right solutions. Using CloudResearch’s data quality tools to identify problematic participants in 2018 and 2020, we provide evidence that many of the data quality problems on MTurk can be tied to fraudulent users from outside of the U.S. who pose as American workers. Hence, our evidence strongly suggests that the source of low-quality data is real humans, not bots. We additionally present evidence that these fraudulent users are behind data quality problems on other platforms.

