Data Quality Automation: A Generic Approach for Large Linked Research Datasets

Author(s):  
Muhammad A Elmessary ◽  
Daniel Thayer ◽  
Sarah Rees ◽  
Leticia ReesKemp ◽  
Arfon Rees

Introduction
When datasets are collected mainly for administrative rather than research purposes, data quality checks are necessary to ensure robust findings and to avoid results biased by incomplete or inaccurate data. Done manually, such checks are time-consuming, so we introduced automation to speed up the process and save effort.

Objectives and Approach
We have devised a set of automated generic quality checks and reports that can be run on any dataset in a relational database without dataset-specific knowledge or configuration. The code is written in Python. Checks include linkage quality, agreement with a population data source, comparison with the previous data version, duplication, null counts, and value distribution and range, among others. Where dataset metadata are available, checks for validity against lookup tables are included, and the output report documents the data contents. An HTML report with dynamic data tables and interactive graphs, allowing easy exploration of the results, is produced using RMarkdown.

Results
Automating these generic data quality checks provides a quick and easy way to report on data issues with minimal effort. The tool compares data against reference tables, lookups, and previous versions of the same table to highlight differences, and it can also be provided to researchers as a means of gaining a more detailed understanding of their data. While other research data quality tools exist, this tool is distinguished by features specific to linked-data research and by its implementation in a relational database environment. It has been successfully tested on datasets of over two billion rows. The tool was designed for use within the SAIL Databank but could easily be adapted to other settings.

Conclusion/Implications
The effort spent on automating generic testing and reporting on the data quality of research datasets is more than compensated by its outputs. Benefits include quick detection and scrutiny of many sources of invalid and incomplete data, and the process can easily be expanded to accommodate more standard tests.
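To illustrate the kind of dataset-agnostic checks described above, here is a minimal Python sketch that profiles any relational table for row counts, duplicate rows, null counts, and value ranges. It is not the SAIL Databank tool itself; the SQLite connection, example table, and report layout are illustrative assumptions.

```python
# Minimal sketch of generic, dataset-agnostic quality checks (row count,
# duplicate rows, null counts, value ranges). Illustrative only: not the
# SAIL Databank tool; table and column names are invented.
import sqlite3

def profile_table(conn, table):
    """Run generic quality checks on any table, with no dataset-specific config."""
    cur = conn.cursor()
    cols = [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]
    total = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = cur.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {table})"
    ).fetchone()[0]
    report = {
        "table": table,
        "row_count": total,
        "duplicate_rows": total - distinct,   # rows identical across all columns
        "columns": {},
    }
    for col in cols:
        nulls = cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
        ).fetchone()[0]
        lo, hi = cur.execute(f"SELECT MIN({col}), MAX({col}) FROM {table}").fetchone()
        report["columns"][col] = {"null_count": nulls, "min": lo, "max": hi}
    return report

# Usage with a small in-memory example table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE admissions (patient_id INTEGER, admit_year INTEGER, sex TEXT);
    INSERT INTO admissions VALUES (1, 2015, 'F'), (1, 2015, 'F'), (2, 1899, NULL);
""")
print(profile_table(conn, "admissions"))
```

The same profile can be generated for an old and a new version of a table and the two reports diffed, which is one simple way to realise the "comparison with previous data version" check mentioned above.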

2017 ◽  
Vol 46 (1) ◽  
pp. 187-209 ◽  
Author(s):  
Piter De Jong ◽  
Mark J. Greeven ◽  
Haico Ebbers

This study assesses the quality of Chinese outbound FDI data. In our case study of the Netherlands, we checked the data quality of the often-used Orbis/Amadeus database and its data source, the Dutch Chamber of Commerce (Kamer van Koophandel, KVK), which has one of the oldest and, arguably, one of the better databases within Europe. We analysed Chinese investments in the Netherlands and show that six adjustments are necessary to clean up the data. We also show that not making these adjustments can significantly impact the outcome of research. The cleaned-up data show that sampled Chinese firms are young, small, and private.


Author(s):  
Septian Bagus Wibisono ◽  
Achmad Nizar Hidayanto ◽  
Widijanto Satyo Nugroho

Government-Owned Property (GOP) management, including the bookkeeping of GOP transactions, is part of the GOP Officer's responsibility to ensure the quality of transaction data. This responsibility also applies to GOP Officers in the Indonesian Agency for Meteorology, Climatology and Geophysics (Badan Meteorologi, Klimatologi, dan Geofisika, BMKG). GOP data, as the source for the Central Government Financial Report, are expected to be well maintained and presented as accurately as possible, yet inaccurate data still appear in the latest BMKG GOP Report. This qualitative research, based on document study and interview sessions, aims to measure the maturity of Data Quality Management (DQM) for GOP transactions in BMKG using Loshin's Data Quality Maturity model. The results of the maturity assessment are then analyzed to recommend and implement DQM activities from the Data Management Body of Knowledge (DMBOK) in order to improve GOP DQM. The research shows that DQM maturity lies between the repeatable and defined levels, and that 52 maturity characteristics need to be followed up with DQM activities.


2021 ◽  
Author(s):  
Clair Blacketer ◽  
Frank J Defalco ◽  
Patrick B Ryan ◽  
Peter R Rijnbeek

Advances in standardization of observational healthcare data have enabled methodological breakthroughs, rapid global collaboration, and generation of real-world evidence to improve patient outcomes. Standardizations in data structure, such as use of Common Data Models (CDM), need to be coupled with standardized approaches for data quality assessment. To ensure confidence in real-world evidence generated from the analysis of real-world data, one must first have confidence in the data itself. The Data Quality Dashboard is an open-source R package that reports potential quality issues in an OMOP CDM instance through the systematic execution and summarization of over 3,300 configurable data quality checks. We describe the implementation of check types across a data quality framework of conformance, completeness, and plausibility, with both verification and validation. We illustrate how data quality checks, paired with decision thresholds, can be configured to customize data quality reporting across a range of observational health data sources. We discuss how data quality reporting can become part of the overall real-world evidence generation and dissemination process to promote transparency and build confidence in the resulting output. Transparently communicating how well CDM standardized databases adhere to a set of quality measures adds a crucial piece that is currently missing from observational research. Assessing and improving the quality of our data will inherently improve the quality of the evidence we generate.
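As a rough illustration of how checks, framework categories, and decision thresholds fit together, the sketch below is written in Python rather than the R package's own API, and uses invented check names, counts, and thresholds; it only shows the general pattern of classifying each check result as pass or fail against a per-check threshold.

```python
# Conceptual sketch (not the Data Quality Dashboard R package): each check
# reports the percentage of violating rows, and a configurable per-check
# threshold decides pass/fail. Names, counts, and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str             # description of the check
    category: str         # conformance | completeness | plausibility
    num_violated: int     # rows failing the check
    num_denominator: int  # rows the check applied to
    threshold_pct: float  # maximum tolerated percentage of violating rows

    @property
    def pct_violated(self) -> float:
        return 100.0 * self.num_violated / max(self.num_denominator, 1)

    @property
    def failed(self) -> bool:
        return self.pct_violated > self.threshold_pct

results = [
    CheckResult("measurement value within plausible range", "plausibility", 120, 50_000, 5.0),
    CheckResult("person identifier present in person table", "conformance", 0, 50_000, 0.0),
]
for r in results:
    status = "FAIL" if r.failed else "PASS"
    print(f"{r.category:13s} {r.name}: {r.pct_violated:.2f}% violated -> {status}")
```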


2019 ◽  
Vol 1 ◽  
pp. ed1
Author(s):  
Shaun Yon-Seng Khoo

Almost every open access neuroscience journal is pay-to-publish. This leaves neuroscientists with a choice between submitting to journals that not all of our colleagues can legitimately access and paying large sums of money to publish open access. Neuroanatomy and Behaviour is a new platinum open access journal published by a non-profit association of scientists. Since we do not charge fees, we will focus entirely on the quality of submitted articles and encourage the adoption of reproducibility-enhancing practices, like open data, preregistration, and data quality checks. We hope that our colleagues will join us in this endeavour so that we can support good neuroscience no matter where it comes from.


Author(s):  
Christian Klinkhardt ◽  
Tim Woerle ◽  
Lars Briem ◽  
Michael Heilig ◽  
Martin Kagerbauer ◽  
...  

We present a methodology to extract points of interest (POIs) from OpenStreetMap (OSM) for use in travel demand models. We use custom tag lists to identify POI elements and assign them to the typical activities used in travel demand models. We then compare the extracted OSM data with official sources and show that OSM data quality depends on the type of POI but generally matches the quality of official sources; it can therefore be used in travel demand models. However, we recommend performing plausibility checks to ensure a certain level of quality. Further, we present a methodology for calculating attractiveness measures for typical activities from single POIs and national trip generation guidelines. We show that the quality of these calculated measures is good enough for them to be used in travel demand models. Our approach therefore allows the quick, automated, and flexible generation of attractiveness measures for travel demand models.
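A minimal sketch of the two steps described above, with invented tag lists and trip generation rates rather than the authors' values: OSM key/value tags are mapped to model activities via custom tag lists, and per-zone POI counts are then converted into attractiveness measures.

```python
# Illustrative sketch: (1) assign OSM elements to travel demand model activities
# via custom tag lists, (2) derive per-zone attractiveness from POI counts and
# assumed trip generation rates. All tag lists and rates below are placeholders.
from collections import Counter

# Step 1: custom tag lists assigning OSM key/value pairs to model activities
TAG_LISTS = {
    "shopping": {("shop", "supermarket"), ("shop", "bakery")},
    "education": {("amenity", "school"), ("amenity", "university")},
    "leisure": {("leisure", "sports_centre"), ("amenity", "restaurant")},
}

def activity_for_poi(osm_tags):
    """Return the model activity an OSM element belongs to, if any."""
    for activity, tag_list in TAG_LISTS.items():
        if any((k, v) in tag_list for k, v in osm_tags.items()):
            return activity
    return None

# Step 2: attractiveness per zone = POI count x assumed trips generated per POI
TRIPS_PER_POI = {"shopping": 80.0, "education": 300.0, "leisure": 40.0}

pois = [
    {"zone": "A", "tags": {"shop": "supermarket"}},
    {"zone": "A", "tags": {"amenity": "school"}},
    {"zone": "B", "tags": {"amenity": "restaurant"}},
]
counts = Counter((p["zone"], activity_for_poi(p["tags"])) for p in pois)
attractiveness = {
    (zone, act): n * TRIPS_PER_POI[act]
    for (zone, act), n in counts.items()
    if act is not None
}
print(attractiveness)
```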


2019 ◽  
Vol 36 ◽  
pp. 1-20
Author(s):  
Andrea Fernand Jubithana ◽  
Bernardo Lanza Queiroz

The Suriname statistical office assumes that mortality data in the country are of good quality and does not perform any tests before producing life table estimates. However, lack of data quality is a concern in the less developed areas of the world. The primary objective of this article is to evaluate the quality of death count registration in the country and its main regions from 2004 to 2012 and to produce estimates of adult mortality by sex. We use population data by age and sex from the last censuses and death counts from the statistical office, and we apply traditional demographic methods to perform the analysis. We find that the quality of death count registration in Suriname and its central regions is reasonably good, and that the population data can also be considered good. The results reveal a small difference in completeness between males and females, and show that for the sub-national populations the choice of method has implications for the results. To sum up, data quality in Suriname is better than in most countries of the region, but there are considerable regional differences, as observed in other locations.


2010 ◽  
Vol 10 (3) ◽  
pp. 87-108
Author(s):  
Teresa Bal-Woźniak

Human Capital and Innovativeness as Means to Bridging Development Gaps: Poland and the Czech Republic as Case Studies

The aim of this article is to analyze the innovative achievements of two selected economies, Poland and the Czech Republic. This issue is of fundamental significance for all post-socialist countries, whose post-communist heritage, in the form of homo sovieticus, is far removed from innovative performance. The author treats innovativeness as a component of human capital and as both a development challenge and a criterion of efficiency for contemporary economies, one that creates the opportunity to speed up the narrowing of the development gap; this is reflected in the title of the study. The methodological basis and data sources are the Knowledge Assessment Methodology (KAM 2009) and the European Innovation Scoreboard (EIS 2009). Fulfilling this aim, in the author's opinion, relied on presenting the coordination of innovative actions by managing entities and underlining the growing significance of network structures. The empirical analysis, covering the years 2003-2008, shows a low level of innovativeness and unsatisfactory dynamics, mostly in Poland and to a smaller extent in the Czech Republic, as well as poor use of relatively abundant human capital for attaining these goals. The concluding part of the article presents the problems connected with the need to consistently improve the quality of human capital and the level of innovativeness. To overcome these barriers, the author postulates establishing a pro-innovative institutional order and indicates the need for a systemic approach to the reforms.


2013 ◽  
Vol 303-306 ◽  
pp. 2437-2444
Author(s):  
Hu Yin ◽  
Yun Fei Lv ◽  
Wei Wei Wang

Deep Web technology makes a large amount of useful information hidden behind query interfaces easier for users to find. However, as the number of data sources grows, quickly finding a suitable result from among many sources becomes more and more important. In this paper, we start from data quality, defining six quality criteria for data sources and giving methods for calculating them. We derive the weight vector of the quality criteria from users' perceptions and, based on these criteria, compute an overall score for a given data source according to the weight vector. The paper then discusses sampling theory and proposes a reasonable sampling method for the experiment. The experimental results show that evaluating and scoring the data quality of a data source by sampling analysis offers good accuracy and operability.
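The weighted-scoring idea can be sketched as follows; the criterion names, user-derived weights, and sample scores below are illustrative placeholders rather than the paper's six standards or measured values.

```python
# Minimal sketch of weighted quality scoring for a data source: per-criterion
# scores (estimated from a sample of the source's records) are combined by a
# user-derived weight vector into one overall score. Values are illustrative.

CRITERIA = ["completeness", "accuracy", "timeliness",
            "consistency", "coverage", "response_time"]

# Weight vector elicited from users (must sum to 1)
WEIGHTS = [0.25, 0.25, 0.15, 0.15, 0.10, 0.10]

def overall_score(criterion_scores, weights):
    """Weighted sum of per-criterion scores, each in [0, 1]."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(s * w for s, w in zip(criterion_scores, weights))

# Scores estimated for one Deep Web source from a sample of its records
sample_scores = [0.92, 0.85, 0.70, 0.88, 0.60, 0.75]
print(f"overall quality score: {overall_score(sample_scores, WEIGHTS):.3f}")
```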


Author(s):  
Sachin V. Pasricha ◽  
Hae Young Jung ◽  
Vladyslav Kushnir ◽  
Denise Mak ◽  
Radha Koppula ◽  
...  

Abstract

Objective
Large clinical databases are increasingly being used for research and quality improvement, but there remains uncertainty about how computational and manual approaches can be used together to assess and improve the quality of extracted data. The General Medicine Inpatient Initiative (GEMINI) database extracts and standardizes a broad range of data from clinical and administrative hospital data systems, including information about attending physicians, room transfers, laboratory tests, diagnostic imaging reports, and outcomes such as death in-hospital. We describe computational data quality assessment and manual data validation techniques that were used for GEMINI.

Methods
The GEMINI database currently contains 245,559 General Internal Medicine patient admissions at 7 hospital sites in Ontario, Canada from 2010-2017. We performed 7 computational data quality checks followed by manual validation of 23,419 selected data points on a sample of 7,488 patients across participating hospitals. After iteratively re-extracting data as needed based on the computational data quality checks, we manually validated GEMINI data against the data that could be obtained using the hospital's electronic medical record (i.e. the data clinicians would see when providing care), which we considered the gold standard. We calculated accuracy, sensitivity, specificity, and positive and negative predictive values of GEMINI data.

Results
Computational checks identified multiple data quality issues – for example, the inclusion of cancelled radiology tests, a time shift of transfusion data, and mistakenly processing the symbol for sodium, "Na", as a missing value. Manual data validation revealed that GEMINI data were ultimately highly reliable compared to the gold standard across nearly all data tables. One important data quality issue was identified by manual validation that was not detected by computational checks: the dates and times of blood transfusion data at one site were not reliable. This resulted in low sensitivity (66%) and positive predictive value (75%) for blood transfusion data at that site. Apart from this single issue, GEMINI data were highly reliable across all data tables, with high overall accuracy (ranging from 98-100%), sensitivity (95-100%), specificity (99-100%), positive predictive value (93-100%), and negative predictive value (99-100%) compared to the gold standard.

Discussion and Conclusion
Iterative assessment and improvement of data quality based primarily on computational checks permitted highly reliable extraction of multisite clinical and administrative data. Computational checks identified nearly all of the data quality issues in this initiative, but one critical quality issue was only identified during manual validation. Combining computational checks and manual validation may be the optimal method for assessing and improving the quality of large multi-site clinical databases.


Author(s):  
Amol A Verma ◽  
Sachin V Pasricha ◽  
Hae Young Jung ◽  
Vladyslav Kushnir ◽  
Denise Y F Mak ◽  
...  

Abstract

Objective
Large clinical databases are increasingly used for research and quality improvement. We describe an approach to data quality assessment from the General Medicine Inpatient Initiative (GEMINI), which collects and standardizes administrative and clinical data from hospitals.

Methods
The GEMINI database contained 245 559 patient admissions at 7 hospitals in Ontario, Canada from 2010 to 2017. We performed 7 computational data quality checks and iteratively re-extracted data from hospitals to correct problems. Thereafter, GEMINI data were compared to data that were manually abstracted from the hospital's electronic medical record for 23 419 selected data points on a sample of 7488 patients.

Results
Computational checks flagged 103 potential data quality issues, which were either corrected or documented to inform future analysis. For example, we identified the inclusion of canceled radiology tests, a time shift of transfusion data, and mistakenly processing the chemical symbol for sodium ("Na") as a missing value. Manual validation identified 1 important data quality issue that was not detected by computational checks: transfusion dates and times at 1 site were unreliable. Apart from that single issue, across all data tables, GEMINI data had high overall accuracy (ranging from 98%–100%), sensitivity (95%–100%), specificity (99%–100%), positive predictive value (93%–100%), and negative predictive value (99%–100%) compared to the gold standard.

Discussion and Conclusion
Computational data quality checks with iterative re-extraction facilitated reliable data collection from hospitals but missed 1 critical quality issue. Combining computational and manual approaches may be optimal for assessing the quality of large multisite clinical databases.
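For reference, the validation metrics quoted in both GEMINI abstracts follow from a standard 2x2 comparison of database records against the manually abstracted gold standard. The sketch below uses invented counts, not GEMINI data, and simply shows how the five quoted metrics are derived.

```python
# Sketch of the validation metrics used above, treating manually abstracted
# chart data as the gold standard and database records as the test. Each data
# point is classified as a true/false positive/negative. Counts are invented.

def validation_metrics(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # database captures true events
        "specificity": tn / (tn + fp),  # database correctly omits non-events
        "ppv": tp / (tp + fp),          # database record confirmed by chart review
        "npv": tn / (tn + fn),          # database absence confirmed by chart review
    }

print(validation_metrics(tp=950, fp=20, tn=980, fn=50))
```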

