Data Sets and Data Quality in Software Engineering

Author(s):  
Gernot Liebchen ◽  
Martin Shepperd

2021 ◽  
pp. 000276422110216
Author(s):  
Kazimierz M. Slomczynski ◽  
Irina Tomescu-Dubrow ◽  
Ilona Wysmulek

This article proposes a new approach to analyzing protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world’s nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets that were not a priori designed to be comparable. However, very few scholars systematically examine the impact of survey data quality on substantive results. We argue that variation in the source data, especially deviations from the standards for survey documentation, data processing, and computer files proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use, is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of survey quality measures on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.
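The kind of analysis described above can be illustrated with a short regression sketch: survey-level protest proportions are modeled with and without a block of data-quality indicators, and the increment in explained intersurvey variance is compared. The data file, variable names, and the simple OLS setup below are illustrative assumptions, not the SDR project’s actual codebook or model.

```python
# Hedged sketch (assumed file and variable names): regress the survey-level
# share reporting demonstration attendance on data-quality indicators,
# controlling for questionnaire-item variability, and compare the explained
# intersurvey variance with and without the quality block.
import pandas as pd
import statsmodels.formula.api as smf

surveys = pd.read_csv("harmonized_surveys.csv")  # one row per national survey

controls = "item_wording + response_scale + C(project)"
quality = "doc_quality + processing_errors + record_completeness"

base = smf.ols(f"pct_demonstrated ~ {controls}", data=surveys).fit()
full = smf.ols(f"pct_demonstrated ~ {controls} + {quality}", data=surveys).fit()

# Increment in explained intersurvey variance attributable to quality measures
print(f"Delta R^2 from quality indicators: {full.rsquared - base.rsquared:.3f}")
```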


2018 ◽  
Vol 2 ◽  
pp. e26539 ◽  
Author(s):  
Paul J. Morris ◽  
James Hanken ◽  
David Lowery ◽  
Bertram Ludäscher ◽  
James Macklin ◽  
...  

As curators of biodiversity data in natural science collections, we are deeply concerned with data quality, but quality is an elusive concept. An effective way to think about data quality is in terms of fitness for use (Veiga 2016). To use data to manage physical collections, the data must be able to accurately answer questions such as what objects are in the collections, where they are, and where they came from. Some research aggregates data across collections, which involves exchanging data using standard vocabularies. Some research uses require accurate georeferences, collecting dates, and current identifications. It is well understood that the costs of data capture and data quality improvement increase with increasing time from the original observation. These factors point towards two engineering principles for software intended to maintain or enhance data quality: build small, modular data quality tests that can be easily assembled into suites to assess the fitness for use of data for some particular need; and produce tools that can be applied by users with a wide range of technical skill levels at different points in the data life cycle. In the Kurator project, we have produced code (e.g. Wieczorek et al. 2017, Morris 2016) consisting of small modules, each addressing a particular data quality test, that can be incorporated into data management processes as libraries. These modules can be combined into customizable data quality scripts, which can be run on single computers or on scalable architecture, and can be incorporated into other software, run as command line programs, or run as suites of canned workflows through a web interface. Kurator modules can be integrated into early-stage data capture applications, run to help prepare data for aggregation by matching it to standard vocabularies, run for quality control or quality assurance on data sets, and report on data quality in terms of a fitness-for-use framework (Veiga et al. 2017). One of our goals is simple tests usable by anyone, anywhere.
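The modular-test pattern can be sketched in a few lines of Python. This is an illustration of the design, not the Kurator API: each check is a small, self-contained function returning a structured result, and a suite is simply the list of checks selected for one intended use. The field names follow Darwin Core; the sample record and the COMPLIANT/NOT_COMPLIANT labels are illustrative assumptions.

```python
# Illustrative sketch (not the Kurator API) of small, composable data quality tests.
from datetime import date

def check_collecting_date(record):
    """Flag collecting dates that are missing, unparseable, or in the future."""
    value = record.get("eventDate")
    if not value:
        return ("eventDate", "NOT_COMPLIANT", "missing collecting date")
    try:
        if date.fromisoformat(value) > date.today():
            return ("eventDate", "NOT_COMPLIANT", "date lies in the future")
    except ValueError:
        return ("eventDate", "NOT_COMPLIANT", "unparseable date")
    return ("eventDate", "COMPLIANT", "")

def check_georeference(record):
    """Flag latitude/longitude values that are non-numeric or out of range."""
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except ValueError:
        return ("georeference", "NOT_COMPLIANT", "non-numeric coordinates")
    ok = -90 <= lat <= 90 and -180 <= lon <= 180
    return ("georeference", "COMPLIANT" if ok else "NOT_COMPLIANT", "")

# A suite is just the list of tests chosen for one particular fitness-for-use need.
georeferenced_research_suite = [check_collecting_date, check_georeference]

record = {"eventDate": "2030-06-01", "decimalLatitude": "42.4", "decimalLongitude": "-71.1"}
for test in georeferenced_research_suite:
    print(test(record))
```

Because each check takes a plain record and returns a plain tuple, the same functions can sit behind a command line tool, a workflow engine, or a web form.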


Author(s):  
Lindsay J. Benstead

Since the first surveys were conducted there in the late 1980s, survey research has expanded rapidly in the Arab world. Almost every country in the region is now included in the Arab Barometer, Afrobarometer, or World Values Survey. Moreover, the Arab Spring marked a watershed, with the inclusion of Tunisia and Libya and the addition of many topics, such as voting behavior, that were previously considered too sensitive. As a result, political scientists have dozens of largely untapped data sets with which to answer theoretical and policy questions. To make progress toward measuring and reducing total survey error, discussion is needed of quality issues such as high rates of missingness and sampling challenges. Ongoing attention to ethics is also critical. This chapter discusses these developments and frames a substantive and methodological research agenda for improving data quality and survey practice in the Arab world.


2015 ◽  
Vol 14 ◽  
pp. CIN.S33076 ◽  
Author(s):  
Kevin K. Mcdade ◽  
Uma Chandran ◽  
Roger S. Day

Data quality is a recognized problem for high-throughput genomics platforms, as evidenced by the proliferation of methods attempting to filter out lower-quality data points. Different filtering methods lead to discordant results, raising the question of which methods are best. Astonishingly, little computational support is offered to analysts to decide which filtering methods are optimal for the research question at hand. To evaluate them, we begin with a pair of expression data sets, transcriptomic and proteomic, on the same samples. Together, the two data sets form a test-bed for the evaluation. Identifier mapping between the data sets creates a collection of feature pairs, with correlations calculated for each pair. To evaluate a filtering strategy, we estimate posterior probabilities for the correctness of probesets accepted by the method. An analyst can set expected utilities that represent the trade-off between the quality and quantity of accepted features. We tested nine published probeset filtering methods and combination strategies. We used two test-beds from cancer studies providing transcriptomic and proteomic data. For reasonable utility settings, the Jetset filtering method was optimal for probeset filtering on both test-beds, even though the two assay platforms differed. Further intersection with a second filtering method was indicated on one test-bed but not the other.
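The evaluation idea can be sketched roughly as follows: map transcriptomic probesets to proteins measured on the same samples, compute a correlation per mapped pair, and score a filter’s accepted set with a simple expected utility that trades off the quality and quantity of accepted features. The file names, the mapping table, the Spearman correlation, and the utility weights below are assumptions for illustration, not the authors’ actual pipeline.

```python
# Minimal sketch of the test-bed evaluation idea, under assumed inputs.
import pandas as pd

mrna = pd.read_csv("transcriptome.csv", index_col=0)   # probesets x samples
prot = pd.read_csv("proteome.csv", index_col=0)        # proteins  x samples
mapping = pd.read_csv("probeset_to_protein.csv")       # columns: probeset, protein

def pair_correlations(pairs):
    """Spearman correlation for each mapped probeset/protein pair on shared samples."""
    corrs = {}
    for ps, pr in pairs.itertuples(index=False):
        if ps in mrna.index and pr in prot.index:
            corrs[ps] = mrna.loc[ps].corr(prot.loc[pr], method="spearman")
    return pd.Series(corrs)

corr = pair_correlations(mapping)

def expected_utility(accepted, u_good=1.0, u_bad=-2.0, threshold=0.3):
    """Reward accepted pairs showing cross-platform agreement, penalize the rest."""
    kept = corr.loc[corr.index.intersection(accepted)]
    good = (kept >= threshold).sum()
    return u_good * good + u_bad * (len(kept) - good)

# Score a filter by the set of probesets it accepts (file name is a placeholder).
jetset_accepted = set(pd.read_csv("jetset_accepted_probesets.csv")["probeset"])
print("Expected utility of filter:", expected_utility(jetset_accepted))
```

Changing u_good, u_bad, and the threshold expresses the analyst’s own quality-versus-quantity trade-off, which is the lever the abstract describes.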


Author(s):  
WASIF AFZAL ◽  
RICHARD TORKAR ◽  
ROBERT FELDT

Given the number of algorithms available for classification and prediction in software engineering, there is a need for a systematic way of assessing their performance. The performance assessment is typically done by some form of partitioning or resampling of the original data to alleviate biased estimation. For predictive and classification studies in software engineering, there is a lack of definitive advice on the most appropriate resampling method to use. This is seen as one of the factors contributing to the inability to draw general conclusions on which modeling technique or set of predictor variables is the most appropriate. Furthermore, the use of a variety of resampling methods makes it impossible to perform any formal meta-analysis of the primary study results. Therefore, it is desirable to examine the influence of various resampling methods and to quantify possible differences. Objective and method: This study empirically compares five common resampling methods (hold-out validation, repeated random sub-sampling, 10-fold cross-validation, leave-one-out cross-validation, and non-parametric bootstrapping) using 8 publicly available data sets, with genetic programming (GP) and multiple linear regression (MLR) as software quality classification approaches. The location of (PF, PD) pairs in the ROC (receiver operating characteristics) space and the area under the ROC curve (AUC) are used as accuracy indicators. Results: In terms of the location of (PF, PD) pairs in the ROC space, bootstrapping results lie in the preferred region for 3 of the 8 data sets for GP and for 4 of the 8 data sets for MLR. Based on the AUC measure, there are no significant differences between the resampling methods for either GP or MLR. Conclusion: Certain data set properties may be responsible for the insignificant differences between the resampling methods based on AUC, including imbalanced data sets, insignificant predictor variables, and high-dimensional data sets. With the current selection of data sets and classification techniques, bootstrapping is the preferred method based on the location of (PF, PD) pairs in the ROC space. Hold-out validation is not a good choice for comparatively smaller data sets, where leave-one-out cross-validation (LOOCV) performs better. For comparatively larger data sets, 10-fold cross-validation performs better than LOOCV.
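The plumbing of such a comparison can be sketched with scikit-learn. This is not the study’s setup: a logistic regression stands in for the GP and MLR models, "defect_data.csv" with a binary "defective" column is a placeholder data set, and only the five resampling schemes and the AUC measure are illustrated.

```python
# Hedged sketch: compare five resampling methods by AUC on an assumed data set.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, train_test_split

data = pd.read_csv("defect_data.csv")
X, y = data.drop(columns="defective").values, data["defective"].values
model = LogisticRegression(max_iter=1000)

def pooled_auc(splitter):
    """Pool out-of-sample predicted probabilities across splits, then compute one AUC."""
    y_true, y_prob = [], []
    for train, test in splitter.split(X, y):
        model.fit(X[train], y[train])
        y_prob.extend(model.predict_proba(X[test])[:, 1])
        y_true.extend(y[test])
    return roc_auc_score(y_true, y_prob)

results = {}

# Hold-out validation (single 2/3 train, 1/3 test split)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
results["hold-out"] = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Repeated random sub-sampling, 10-fold CV, and leave-one-out CV
results["random sub-sampling"] = pooled_auc(ShuffleSplit(n_splits=10, test_size=0.33, random_state=0))
results["10-fold CV"] = pooled_auc(KFold(n_splits=10, shuffle=True, random_state=0))
results["LOOCV"] = pooled_auc(LeaveOneOut())

# Non-parametric bootstrap: train on a resample, evaluate on the out-of-bag rows
rng = np.random.default_rng(0)
boot_aucs = []
for _ in range(100):
    idx = rng.integers(0, len(y), len(y))
    oob = np.setdiff1d(np.arange(len(y)), idx)
    model.fit(X[idx], y[idx])
    boot_aucs.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
results["bootstrap"] = float(np.mean(boot_aucs))

print(results)
```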


Author(s):  
María Carolina Valverde ◽  
Diego Vallespir ◽  
Adriana Marotta ◽  
Jose Ignacio Panach

2021 ◽  
pp. 13-26
Author(s):  
Felix Kruse ◽  
Jan-Philipp Awick ◽  
Jorge Marx Gómez ◽  
Peter Loos

This paper explores record linkage, a step in the data integration process, focusing on company entities. For the integration of company data, the company name is a crucial attribute, and it often includes the legal form. The legal form is not represented concisely and consistently across different data sources, which leads to considerable data quality problems in the subsequent record linkage steps. To solve these problems, we classify and extract the legal form from the company name attribute. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid one combining a rule set and a supervised machine learning model. With our hybrid approach, company data sets from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.
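The hybrid idea can be sketched as a rule set that catches common legal-form spellings in the company name, with a supervised classifier as a fallback when no rule fires. The rules, the tiny training set, and the character n-gram features below are simplified assumptions for illustration, not the authors’ benchmarked implementation.

```python
# Illustrative sketch of a rules-plus-classifier legal-form extractor.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LEGAL_FORM_RULES = {
    "GMBH": re.compile(r"\bgmbh\b|\bgesellschaft mit beschr", re.I),
    "AG":   re.compile(r"\bag\b|\baktiengesellschaft\b", re.I),
    "LTD":  re.compile(r"\bltd\.?\b|\blimited\b", re.I),
    "INC":  re.compile(r"\binc\.?\b|\bincorporated\b", re.I),
}

# Fallback classifier trained on names with known legal forms (toy training set).
train_names = ["Beta Produktions KG", "Delta Societa per Azioni", "Gamma & Partner"]
train_labels = ["KG", "SPA", "NONE"]
fallback = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                         LogisticRegression(max_iter=1000))
fallback.fit(train_names, train_labels)

def extract_legal_form(company_name):
    """Return (legal_form, cleaned_name); rules first, classifier as fallback."""
    for form, pattern in LEGAL_FORM_RULES.items():
        if pattern.search(company_name):
            return form, pattern.sub("", company_name).strip(" ,.-")
    return fallback.predict([company_name])[0], company_name

print(extract_legal_form("Mueller Maschinenbau GmbH"))
```

Stripping the matched legal form from the name also normalizes the attribute itself, which is what makes the downstream record linkage comparisons cleaner.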

