Improving Data Quality: Actors, Incentives, and Capabilities

2007 ◽  
Vol 15 (4) ◽  
pp. 365-386 ◽  
Author(s):  
Yoshiko M. Herrera ◽  
Devesh Kapur

This paper examines the construction and use of data sets in political science. We focus on three interrelated questions: How might we assess data quality? What factors shape data quality? and How can these factors be addressed to improve data quality? We first outline some problems with existing data set quality, including issues of validity, coverage, and accuracy, and we discuss some ways of identifying problems as well as some consequences of data quality problems. The core of the paper addresses the second question by analyzing the incentives and capabilities facing four key actors in a data supply chain: respondents, data collection agencies (including state bureaucracies and private organizations), international organizations, and finally, academic scholars. We conclude by making some suggestions for improving the use and construction of data sets.

“It is a capital mistake, Watson, to theorise before you have all the evidence. It biases the judgment.” —Sherlock Holmes in “A Study in Scarlet”

“Statistics make officials, and officials make statistics.” —Chinese proverb

2013 ◽  
Vol 69 (7) ◽  
pp. 1215-1222 ◽  
Author(s):  
K. Diederichs ◽  
P. A. Karplus

In macromolecular X-ray crystallography, typical data sets have substantial multiplicity. This can be used to calculate the consistency of repeated measurements and thereby assess data quality. Recently, the properties of a correlation coefficient, CC1/2, that can be used for this purpose were characterized and it was shown that CC1/2 has superior properties compared with 'merging' R values. A derived quantity, CC*, links data and model quality. Using experimental data sets, the behaviour of CC1/2 and the more conventional indicators were compared in two situations of practical importance: merging data sets from different crystals and selectively rejecting weak observations or (merged) unique reflections from a data set. In these situations, controlled 'paired-refinement' tests show that even though discarding the weaker data leads to improvements in the merging R values, the refined models based on these data are of lower quality. These results show the folly of such data-filtering practices aimed at improving the merging R values. Interestingly, in all of these tests CC1/2 is the one data-quality indicator for which the behaviour accurately reflects which of the alternative data-handling strategies results in the best-quality refined model. Its properties in the presence of systematic error are documented and discussed.
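For context, CC* is related to CC1/2 by the analytic expression given in the earlier work this abstract refers to (Karplus & Diederichs, 2012); restated here for reference, it is not part of the original abstract:

\[ \mathrm{CC}^{*} = \sqrt{\frac{2\,\mathrm{CC}_{1/2}}{1+\mathrm{CC}_{1/2}}} \]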


1997 ◽  
Vol 1997 ◽  
pp. 143-143
Author(s):  
B.L. Nielsen ◽  
R.F. Veerkamp ◽  
J.E. Pryce ◽  
G. Simm ◽  
J.D. Oldham

High-producing dairy cows have been found to be more susceptible to disease (Jones et al., 1994; Göhn et al., 1995), raising concerns about the welfare of the modern dairy cow. Genotype and number of lactations may affect various health problems differently, and their relative importance may vary. The categorical nature and low incidence of health events necessitate large data-sets, but the use of data collected across herds may introduce unwanted variation. Analysis of a comprehensive data-set from a single herd was carried out to investigate the effects of genetic line and lactation number on the incidence of various health and reproductive problems.


2018 ◽  
Vol 2 ◽  
pp. e26539 ◽  
Author(s):  
Paul J. Morris ◽  
James Hanken ◽  
David Lowery ◽  
Bertram Ludäscher ◽  
James Macklin ◽  
...  

As curators of biodiversity data in natural science collections, we are deeply concerned with data quality, but quality is an elusive concept. An effective way to think about data quality is in terms of fitness for use (Veiga 2016). To use data to manage physical collections, the data must be able to accurately answer questions such as what objects are in the collections, where they are, and where they are from. Some research uses aggregate data across collections, which involves exchange of data using standard vocabularies. Some research uses require accurate georeferences, collecting dates, and current identifications. It is well understood that the costs of data capture and data quality improvement increase with increasing time from the original observation. These factors point towards two engineering principles for software that is intended to maintain or enhance data quality: build small modular data quality tests that can be easily assembled into suites to assess the fitness for use of data for some particular need; and produce tools that can be applied by users with a wide range of technical skill levels at different points in the data life cycle. In the Kurator project, we have produced code (e.g. Wieczorek et al. 2017, Morris 2016) which consists of small modules that can be incorporated into data management processes as small libraries that address particular data quality tests. These modules can be combined into customizable data quality scripts, which can be run on single computers or scalable architecture and can be incorporated into other software, run as command line programs, or run as suites of canned workflows through a web interface. Kurator modules can be integrated into early-stage data capture applications, run to help prepare data for aggregation by matching it to standard vocabularies, run for quality control or quality assurance on data sets, and report on data quality in terms of a fitness-for-use framework (Veiga et al. 2017). One of our goals is simple tests usable by anyone, anywhere.
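To illustrate the modular-test principle described above, here is a minimal, hypothetical Python sketch; it is not Kurator code, and the function names, field names, and result labels are invented for this example:

```python
# A minimal, hypothetical sketch of small, single-purpose data quality tests
# that can be assembled into a suite. Not Kurator code; names are invented.
from datetime import date

def validate_collecting_date(record):
    """Single-purpose test: is eventDate parseable and not in the future?"""
    value = record.get("eventDate")
    if value is None:
        return ("eventDate", "MISSING", "no collecting date provided")
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return ("eventDate", "NOT_COMPLIANT", f"unparseable date: {value!r}")
    if parsed > date.today():
        return ("eventDate", "NOT_COMPLIANT", f"date in the future: {value!r}")
    return ("eventDate", "COMPLIANT", "ok")

def validate_country_code(record, known_codes=("US", "CA", "BR")):
    """Single-purpose test: is countryCode in a controlled vocabulary?"""
    value = record.get("countryCode")
    if value not in known_codes:
        return ("countryCode", "NOT_COMPLIANT", f"unknown code: {value!r}")
    return ("countryCode", "COMPLIANT", "ok")

def run_suite(records, tests):
    """Assemble small tests into a suite and report fitness per record."""
    return [[test(rec) for test in tests] for rec in records]

if __name__ == "__main__":
    sample = [{"eventDate": "2019-05-14", "countryCode": "US"},
              {"eventDate": "3019-01-01", "countryCode": "XX"}]
    for report in run_suite(sample, [validate_collecting_date, validate_country_code]):
        print(report)
```

Because each test is a plain function that returns a structured result, the same module can be called from a data-entry form, a command-line script, or a larger workflow, which is the reuse-across-the-data-life-cycle point the abstract makes.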


2018 ◽  
Vol 2 ◽  
pp. e25317
Author(s):  
Stijn Van Hoey ◽  
Peter Desmet

The ability to communicate and assess the quality and fitness for use of data is crucial to ensure maximum utility and re-use. Data consumers have certain requirements for the data they seek and need to be able to check if a data set conforms to these requirements. Data publishers aim to provide data with the highest possible quality and need to be able to identify potential errors that can be addressed with the information at hand. The development and adoption of data publication guidelines is one approach to define and meet those requirements. However, the use of a guideline, the mapping decisions, and the requirements a dataset is expected to meet are generally not communicated with the provided data. Moreover, these guidelines are typically intended for humans only. In this talk, we will present 'whip': a proposed syntax for data specifications. With whip, one can define column-based constraints for tabular (tidy) data using a number of rules, e.g. how data is structured following Darwin Core, how a term uses controlled vocabulary values, or what the expected minimum and maximum values are. These rules are human- and machine-readable, which communicates the specifications and allows them to be validated automatically in pipelines for data publication and quality assessment, such as Kurator. Whip can be formatted as a (YAML) text file that can be provided with the published data, communicating the specifications a dataset is expected to meet. The scope of these specifications can be specific to a dataset, but can also be used to express the expected data quality and fitness for use of a publisher, consumer or community, allowing bottom-up and top-down adoption. As such, these specifications are complementary to the core set of data quality tests currently under development by the TDWG Biodiversity Data Quality Task Group 2. Whip rules are currently generic, but more specific ones can be defined to address requirements for biodiversity information.
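To make the column-based-constraint idea concrete, here is a small illustrative Python sketch; it does not use the actual whip syntax, and the rule names, columns, and validation logic are invented for this example:

```python
# Illustrative only: a whip-like, column-based specification expressed as data,
# plus a tiny validator. Rule and column names are hypothetical, not whip syntax.
specifications = {
    "decimalLatitude": {"min": -90, "max": 90},
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"]},
    "individualCount": {"min": 1, "max": 100},
}

def validate(rows, spec):
    """Check each row against the column constraints; yield human-readable errors."""
    for i, row in enumerate(rows):
        for column, rules in spec.items():
            value = row.get(column)
            if "allowed" in rules and value not in rules["allowed"]:
                yield f"row {i}: {column} {value!r} not in controlled vocabulary"
            if "min" in rules and value is not None and float(value) < rules["min"]:
                yield f"row {i}: {column} {value!r} below minimum {rules['min']}"
            if "max" in rules and value is not None and float(value) > rules["max"]:
                yield f"row {i}: {column} {value!r} above maximum {rules['max']}"

rows = [
    {"decimalLatitude": "51.2", "basisOfRecord": "HumanObservation", "individualCount": "3"},
    {"decimalLatitude": "123.4", "basisOfRecord": "Specimen", "individualCount": "0"},
]
for error in validate(rows, specifications):
    print(error)
```

Because the specification is plain data rather than code, it can travel with the published dataset and be read by humans and machines alike, which is the communication point the abstract emphasizes.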


2019 ◽  
Vol 2 (2) ◽  
pp. 169-187 ◽  
Author(s):  
Ruben C. Arslan

Data documentation in psychology lags behind not only many other disciplines, but also basic standards of usefulness. Psychological scientists often prefer to invest the time and effort that would be necessary to document existing data well in other duties, such as writing and collecting more data. Codebooks therefore tend to be unstandardized and stored in proprietary formats, and they are rarely properly indexed in search engines. This means that rich data sets are sometimes used only once—by their creators—and left to disappear into oblivion. Even if they can find an existing data set, researchers are unlikely to publish analyses based on it if they cannot be confident that they understand it well enough. My codebook package makes it easier to generate rich metadata in human- and machine-readable codebooks. It uses metadata from existing sources and automates some tedious tasks, such as documenting psychological scales and reliabilities, summarizing descriptive statistics, and identifying patterns of missingness. The codebook R package and Web app make it possible to generate a rich codebook in a few minutes and just three clicks. Over time, its use could lead to psychological data becoming findable, accessible, interoperable, and reusable, thereby reducing research waste and benefiting both its users and the scientific community as a whole.
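The codebook package itself is an R package; purely as an illustration of the kind of automated summary it describes (descriptive statistics and missingness per variable), here is a minimal, hypothetical Python sketch with invented column names, not the package's API:

```python
# Hypothetical illustration of automated codebook-style metadata generation.
# The actual codebook package is written in R; this is not its API.
import pandas as pd

def generate_codebook(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each variable: type, descriptive statistics, and missingness."""
    rows = []
    for column in df.columns:
        series = df[column]
        rows.append({
            "variable": column,
            "dtype": str(series.dtype),
            "n_missing": int(series.isna().sum()),
            "pct_missing": round(100 * series.isna().mean(), 1),
            "mean": series.mean() if pd.api.types.is_numeric_dtype(series) else None,
            "sd": series.std() if pd.api.types.is_numeric_dtype(series) else None,
        })
    return pd.DataFrame(rows)

data = pd.DataFrame({
    "extraversion": [3.2, 4.1, None, 2.8],            # invented example scale scores
    "condition": ["control", "treatment", "treatment", None],
})
print(generate_codebook(data))
```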


2016 ◽  
Vol 9 (1) ◽  
pp. 60-69
Author(s):  
Robert M. Zink

It is sometimes said that scientists are entitled to their own opinions but not their own set of facts. This suggests that application of the scientific method ought to lead to a single conclusion from a given set of data. However, sometimes scientists have conflicting opinions about which analytical methods are most appropriate or which subsets of existing data are most relevant, resulting in different conclusions. Thus, scientists might actually lay claim to different sets of facts. However, if a contrary conclusion is reached by selecting a subset of data, this conclusion should be carefully scrutinized to determine whether consideration of the full data set leads to different conclusions. This is important because conservation agencies are required to consider all of the best available data and make a decision based on them. Therefore, exploring reasons why different conclusions are reached from the same body of data has relevance for management of species. The purpose of this paper was to explore how two groups of researchers can examine the same data and reach opposite conclusions in the case of the taxonomy of the endangered subspecies Southwestern Willow Flycatcher (Empidonax traillii extimus). It was shown that use of subsets of data and characters rather than reliance on entire data sets can explain conflicting conclusions. It was recommended that agencies tasked with making conservation decisions rely on analyses that include all relevant molecular, ecological, behavioral, and morphological data, which in this case show that the subspecies is not valid, and hence its listing is likely not warranted.


2020 ◽  
Vol 45 (4) ◽  
pp. 737-763 ◽  
Author(s):  
Anirban Laha ◽  
Parag Jain ◽  
Abhijit Mishra ◽  
Karthik Sankaranarayanan

We present a framework for generating natural language description from structured data such as tables; the problem comes under the category of data-to-text natural language generation (NLG). Modern data-to-text NLG systems typically use end-to-end statistical and neural architectures that learn from a limited amount of task-specific labeled data, and therefore exhibit limited scalability, domain-adaptability, and interpretability. Unlike these systems, ours is a modular, pipeline-based approach, and does not require task-specific parallel data. Rather, it relies on monolingual corpora and basic off-the-shelf NLP tools. This makes our system more scalable and easily adaptable to newer domains. Our system utilizes a three-stage pipeline that: (i) converts entries in the structured data to canonical form, (ii) generates simple sentences for each atomic entry in the canonicalized representation, and (iii) combines the sentences to produce a coherent, fluent, and adequate paragraph description through sentence compounding and co-reference replacement modules. Experiments on a benchmark mixed-domain data set curated for paragraph description from tables reveal the superiority of our system over existing data-to-text approaches. We also demonstrate the robustness of our system in handling other popular data sets covering diverse data types such as knowledge graphs and key-value maps.
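As a rough illustration of the staged idea only (this is not the authors' system; the canonicalization, sentence template, and compounding rule below are invented toy stand-ins), such a pipeline can be sketched as three small functions:

```python
# A toy, hypothetical illustration of a modular data-to-text pipeline:
# canonicalize -> simple sentences -> compound into a paragraph.
# Not the authors' system; templates and rules are invented.

def canonicalize(record):
    """Stage (i): convert structured entries to (subject, relation, object) triples."""
    name = record["name"]
    return [(name, relation, value) for relation, value in record["facts"].items()]

def to_simple_sentence(triple):
    """Stage (ii): generate one simple sentence per atomic entry."""
    subject, relation, obj = triple
    return f"{subject} {relation.replace('_', ' ')} {obj}."

def compound(sentences, subject):
    """Stage (iii): naive sentence joining plus crude co-reference replacement."""
    if not sentences:
        return ""
    first, rest = sentences[0], sentences[1:]
    rest = [s.replace(subject, "It", 1) for s in rest]
    return " ".join([first] + rest)

record = {"name": "Mount Elbert",
          "facts": {"is_located_in": "Colorado", "has_elevation_of": "4401 m"}}
triples = canonicalize(record)
sentences = [to_simple_sentence(t) for t in triples]
print(compound(sentences, "Mount Elbert"))
```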


Author(s):  
Alan J. Silman ◽  
Gary J. Macfarlane ◽  
Tatiana Macfarlane

Although epidemiological studies are increasingly based on the analysis of existing data sets (including linked data sets), many studies still require primary data collection. Such data may come from patient questionnaires, interviews, abstraction from records, and/or the results of tests and measures such as weight or blood test results. The next stage is to analyse the data gathered from individual subjects to provide the answers required. Before commencing with the statistical analysis of any data set, the data themselves must be prepared in a format so that the detailed statistical analysis can achieve its goals. Items to be considered include the format the data are initially collected in and how they are transferred to an appropriate electronic form. This chapter explores how errors are minimized and the quality of the data set ensured. These tasks are not trivial and need to be planned as part of a detailed study methodology.


2018 ◽  
Vol 11 (2) ◽  
pp. 1207-1231 ◽  
Author(s):  
Taku Umezawa ◽  
Carl A. M. Brenninkmeijer ◽  
Thomas Röckmann ◽  
Carina van der Veen ◽  
Stanley C. Tyler ◽  
...  

We report results from a worldwide interlaboratory comparison of samples among laboratories that measure (or measured) stable carbon and hydrogen isotope ratios of atmospheric CH4 (δ13C-CH4 and δD-CH4). The offsets among the laboratories are larger than the measurement reproducibility of individual laboratories. To disentangle plausible measurement offsets, we evaluated and critically assessed a large number of intercomparison results, some of which have been documented previously in the literature. The results indicate significant offsets of δ13C-CH4 and δD-CH4 measurements among data sets reported from different laboratories; the differences among laboratories at the modern atmospheric CH4 level spread over ranges of 0.5 ‰ for δ13C-CH4 and 13 ‰ for δD-CH4. The intercomparison results summarized in this study may be of help in future attempts to harmonize δ13C-CH4 and δD-CH4 data sets from different laboratories in order to jointly incorporate them into modelling studies. However, establishing a merged data set, which includes δ13C-CH4 and δD-CH4 data from multiple laboratories with desirable compatibility, is still challenging due to differences among laboratories in instrument settings, correction methods, traceability to reference materials and long-term data management. Further efforts are needed to identify causes of the interlaboratory measurement offsets and to reduce them, moving towards the best use of available δ13C-CH4 and δD-CH4 data sets.


2021 ◽  
Author(s):  
Rishabh Deo Pandey ◽  
Itu Snigdh

Data quality became significant with the emergence of data warehouse systems. While accuracy is an intrinsic dimension of data quality, validity presents a wider perspective that is more representational and contextual in nature. Through our article, we present a different perspective on data collection and collation. We focus on faults experienced in data sets and present validity as a function of allied parameters such as completeness, usability, availability and timeliness for determining the data quality. We also analyze the applicability of these metrics and apply modifications to make them conform to IoT applications. Another major focus of this article is to verify these metrics on aggregated data sets instead of separate data values. This work focuses on using the different validation parameters for determining the quality of data generated in a pervasive environment. The analysis approach presented is simple and can be employed to test the validity of collected data, isolate faults in the data set, and measure the suitability of data before applying analysis algorithms.
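As an illustration only (the dimensions shown, their thresholds, and the weighting are invented stand-ins, not the metrics defined in the article), scoring an aggregated set of IoT readings along such parameters might look like this:

```python
# Hypothetical sketch: scoring an aggregated IoT data set along quality
# dimensions such as completeness and timeliness. Thresholds and weights
# are invented, not the metrics defined in the article.
from datetime import datetime, timedelta, timezone

def completeness(readings, expected_fields=("sensor_id", "value", "timestamp")):
    """Fraction of readings carrying every expected field with a non-null value."""
    ok = sum(all(r.get(f) is not None for f in expected_fields) for r in readings)
    return ok / len(readings) if readings else 0.0

def timeliness(readings, max_age=timedelta(minutes=10)):
    """Fraction of readings that arrived within an acceptable age window."""
    now = datetime.now(timezone.utc)
    ok = sum(1 for r in readings
             if r.get("timestamp") and now - r["timestamp"] <= max_age)
    return ok / len(readings) if readings else 0.0

def validity_score(readings, weights=(0.5, 0.5)):
    """Combine dimension scores into a single, illustrative validity figure."""
    w_completeness, w_timeliness = weights
    return w_completeness * completeness(readings) + w_timeliness * timeliness(readings)

readings = [
    {"sensor_id": "t-01", "value": 21.4, "timestamp": datetime.now(timezone.utc)},
    {"sensor_id": "t-02", "value": None,
     "timestamp": datetime.now(timezone.utc) - timedelta(hours=2)},
]
print(round(validity_score(readings), 2))
```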

