Data repair of density-based data cleaning approach using conditional functional dependencies

PurposeData quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violation of business rules. However, because of the huge amount of data, manual cleaning alone is infeasible. Methods are required to repair and clean the dirty data through automatic detection, which are data quality issues to address. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.Design/methodology/approachA set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm. The algorithm repairs inconsistent data using this set.FindingsThis new approach was evaluated through experiments on real-world as well as synthetic datasets. The repair quality was determined using the F-measure. The results showed that the quality and scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.Originality/valueConditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved in terms of repairing inconsistent data by using conditional functional dependencies.

Download Full-text

Domain knowledge and data quality perceptions in genome curation work

Journal of Documentation ◽

10.1108/jd-08-2013-0104 ◽

2015 ◽

Vol 71 (1) ◽

pp. 116-142 ◽

Cited By ~ 4

Author(s):

Hong Huang

Keyword(s):

Data Quality ◽

Domain Knowledge ◽

Design Methodology ◽

Quality Criteria ◽

Computational Science ◽

Survey Method ◽

New Approach ◽

Content Type ◽

Genome Data ◽

Quality Perceptions

Purpose – The purpose of this paper is to understand genomics scientists’ perceptions in data quality assurances based on their domain knowledge. Design/methodology/approach – The study used a survey method to collect responses from 149 genomics scientists grouped by domain knowledge. They ranked the top-five quality criteria based on hypothetical curation scenarios. The results were compared using χ2 test. Findings – Scientists with domain knowledge of biology, bioinformatics, and computational science did not reach a consensus in ranking data quality criteria. Findings showed that biologists cared more about curated data that can be concise and traceable. They were also concerned about skills dealing with information overloading. Computational scientists on the other hand value making curation understandable. They paid more attention to the specific skills for data wrangling. Originality/value – This study takes a new approach in comparing the data quality perceptions for scientists across different domains of knowledge. Few studies have been able to synthesize models to interpret data quality perception across domains. The findings may help develop data quality assurance policies, training seminars, and maximize the efficiency of genome data management.

Download Full-text

Detecting Fraudulent Interviewers by Improved Clustering Methods – The Case of Falsifications of Answers to Parts of a Questionnaire

Journal of Official Statistics ◽

10.1515/jos-2016-0033 ◽

2016 ◽

Vol 32 (3) ◽

pp. 643-660 ◽

Cited By ~ 3

Author(s):

Samuel De Haas ◽

Peter Winker

Keyword(s):

Cluster Analysis ◽

Data Quality ◽

Survey Data ◽

Empirical Research ◽

A Priori ◽

Cluster Method ◽

Clustering Methods ◽

New Approach ◽

Bootstrap Approach ◽

Synthetic Datasets

Abstract Falsified interviews represent a serious threat to empirical research based on survey data. The identification of such cases is important to ensure data quality. Applying cluster analysis to a set of indicators helps to identify suspicious interviewers when a substantial share of all of their interviews are complete falsifications, as shown by previous research. This analysis is extended to the case when only a share of questions within all interviews provided by an interviewer is fabricated. The assessment is based on synthetic datasets with a priori set properties. These are constructed from a unique experimental dataset containing both real and fabricated data for each respondent. Such a bootstrap approach makes it possible to evaluate the robustness of the method when the share of fabricated answers per interview decreases. The results indicate a substantial loss of discriminatory power in the standard cluster analysis if the share of fabricated answers within an interview becomes small. Using a novel cluster method which allows imposing constraints on cluster sizes, performance can be improved, in particular when only few falsifiers are present. This new approach will help to increase the robustness of survey data by detecting potential falsifiers more reliably.

Download Full-text

“I have just returned from the moon:” online survey fraud

Supply Chain Management An International Journal ◽

10.1108/scm-12-2019-0466 ◽

2020 ◽

Vol 25 (4) ◽

pp. 489-503

Author(s):

Vitaly Brazhkin

Keyword(s):

Supply Chain ◽

Data Quality ◽

Online Survey ◽

Data Cleaning ◽

Population Surveys ◽

Panel Surveys ◽

Comprehensive Review ◽

Content Type ◽

Online Panel ◽

First Time

Purpose The purpose of this paper is to provide a comprehensive review of the respondents’ fraud phenomenon in online panel surveys, delineate data quality issues from surveys of broad and narrow populations, alert fellow researchers about higher incidence of respondents’ fraud in online panel surveys of narrow populations, such as logistics professionals and recommend ways to protect the quality of data received from such surveys. Design/methodology/approach This general review paper has two parts, namely, descriptive and instructional. The current state of online survey and panel data use in supply chain research is examined first through a survey method literature review. Then, a more focused understanding of the phenomenon of fraud in surveys is provided through an analysis of online panel industry literature and psychological academic literature. Common survey design and data cleaning recommendations are critically assessed for their applicability to narrow populations. A survey of warehouse professionals is used to illustrate fraud detection techniques and glean additional, supply chain specific data protection recommendations. Findings Surveys of narrow populations, such as those typically targeted by supply chain researchers, are much more prone to respondents’ fraud. To protect and clean survey data, supply chain researchers need to use many measures that are different from those commonly recommended in methodological survey literature. Research limitations/implications For the first time, the need to distinguish between narrow and broad population surveys has been stated when it comes to data quality issues. The confusion and previously reported “mixed results” from literature reviews on the subject have been explained and a clear direction for future research is suggested: the two categories should be considered separately. Practical implications Specific fraud protection advice is provided to supply chain researchers on the strategic choices and specific aspects for all phases of surveying narrow populations, namely, survey preparation, administration and data cleaning. Originality/value This paper can greatly benefit researchers in several ways. It provides a comprehensive review and analysis of respondents’ fraud in online surveys, an issue poorly understood and rarely addressed in academic research. Drawing from literature from several fields, this paper, for the first time in literature, offers a systematic set of recommendations for narrow population surveys by clearly contrasting them with general population surveys.

Download Full-text

A Survey on Cleaning Dirty Data Using Machine Learning Paradigm for Big Data Analytics

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v10.i3.pp1234-1243 ◽

2018 ◽

Vol 10 (3) ◽

pp. 1234 ◽

Cited By ~ 2

Author(s):

Jesmeen M. Z. H ◽

J. Hossen ◽

S. Sayeed ◽

CK Ho ◽

Tawsif K ◽

...

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Quality ◽

Data Analytics ◽

Large Scale ◽

Data Cleaning ◽

Big Data Analytics ◽

Quality Criteria ◽

Machine Learning Algorithms ◽

Dirty Data

<span>Recently Big Data has become one of the important new factors in the business field. This needs to have strategies to manage large volumes of structured, unstructured and semi-structured data. It’s challenging to analyze such large scale of data to extract data meaning and handling uncertain outcomes. Almost all big data sets are dirty, i.e. the set may contain inaccuracies, missing data, miscoding and other issues that influence the strength of big data analytics. One of the biggest challenges in big data analytics is to discover and repair dirty data; failure to do this can lead to inaccurate analytics and unpredictable conclusions. Data cleaning is an essential part of managing and analyzing data. In this survey paper, data quality troubles which may occur in big data processing to understand clearly why an organization requires data cleaning are examined, followed by data quality criteria (dimensions used to indicate data quality). Then, cleaning tools available in market are summarized. Also challenges faced in cleaning big data due to nature of data are discussed. Machine learning algorithms can be used to analyze data and make predictions and finally clean data automatically.</span>

Download Full-text

Survey Data Quality in Analyzing Harmonized Indicators of Protest Behavior: A Survey Data Recycling Approach

American Behavioral Scientist ◽

10.1177/00027642211021623 ◽

2021 ◽

pp. 000276422110216

Author(s):

Kazimierz M. Slomczynski ◽

Irina Tomescu-Dubrow ◽

Ilona Wysmulek

Keyword(s):

Data Processing ◽

Data Quality ◽

Survey Data ◽

A Priori ◽

Data Sets ◽

New Approach ◽

Survey Quality ◽

Survey Error ◽

Ex Post ◽

The Impact

This article proposes a new approach to analyze protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world’s nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets not a priori designed as comparable. However, very few scholars systematically examine the impact of the survey data quality on substantive results. We argue that the variation in source data, especially deviations from standards of survey documentation, data processing, and computer files—proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use—is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of measures of survey quality on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.

Download Full-text

A new approach to "storage management" restrictions using the "data quality" concept

Proceedings of the 3rd Annual Haifa Experimental Systems Conference on - SYSTOR '10 ◽

10.1145/1815695.1815706 ◽

2010 ◽

Author(s):

Koby Biller

Keyword(s):

Data Quality ◽

Storage Management ◽

New Approach

Download Full-text

The concept of neutrality: a new approach

Journal of Documentation ◽

10.1108/jd-05-2019-0102 ◽

2019 ◽

Vol 76 (1) ◽

pp. 333-353 ◽

Cited By ~ 1

Author(s):

Stephen Macdonald ◽

Briony Birdi

Keyword(s):

Literature Review ◽

Design Methodology ◽

Information Science ◽

Social Institutions ◽

Future Research ◽

Library And Information Science ◽

New Approach ◽

Content Type ◽

Normative Framework ◽

Contextual Awareness

Purpose Neutrality is a much debated value in library and information science (LIS). The “neutrality debate” is characterised by opinionated discussions in contrasting contexts. The purpose of this paper is to fill a gap in the literature by bringing these conceptions together holistically, with potential to deepen understanding of LIS neutrality. Design/methodology/approach First, a literature review identified conceptions of neutrality reported in the LIS literature. Second, seven phenomenographic interviews with LIS professionals were conducted across three professional sectors. To maximise variation, each sector comprised at least one interview with a professional of five or fewer years’ experience and one with ten or more years’ experience. Third, conceptions from the literature and interviews were compared for similarities and disparities. Findings In four conceptions, each were found in the literature and interviews. In the literature, these were labelled: “favourable”, “tacit value”, “social institutions” and “value-laden profession”, whilst in interviews they were labelled: “core value”, “subservient”, “ambivalent”, and “hidden values”. The study’s main finding notes the “ambivalent” conception in interviews is not captured by a largely polarised literature, which oversimplifies neutrality’s complexity. To accommodate this complexity, it is suggested that future research should look to reconcile perceptions from either side of the “neutral non-neutral divide” through an inclusive normative framework. Originality/value This study’s value lies in its descriptive methodology, which brings LIS neutrality together in a holistic framework. This framework brings a contextual awareness to LIS neutrality lacking in previous research. This awareness has the potential to change the tone of the LIS neutrality debate.

Download Full-text

Development of insect repellent finish by a simple coacervation microencapsulation technique

International Journal of Clothing Science and Technology ◽

10.1108/ijcst-02-2017-0022 ◽

2018 ◽

Vol 30 (2) ◽

pp. 152-158 ◽

Cited By ~ 3

Author(s):

Rachna Sharma ◽

Alka Goel

Keyword(s):

Essential Oils ◽

Design Methodology ◽

Insect Repellent ◽

New Approach ◽

Content Type ◽

Eucalyptus Oil ◽

Gum Acacia ◽

Major Benefit ◽

Natural Insecticide ◽

Practical Implications

Purpose The paper focused onto the development of microcapsules by using two essential oils. It proposes the uses of eucalyptus oil and cedarwood oil as a natural insecticide. The purpose of this paper is to demonstrate the application of developed microcapsules to impart insect repellency on textile substrate. Design/methodology/approach The paper opted for an experimental study using two essential oils and gum in formations of microcapsules through a simple coaseravtion encapsulation technique. The developed solution was analyzed, including confirmation of size and structure through. Application of developed finish on substrate was also undertaken to prove better ability as repellent fabric. Findings The paper highlights useful invention of microencapsulated fabric developed with the combination of gum acacia and eucalyptus oil as core and shell material. The developed fabric has better ability to repel silverfish as compared to microencapsulated fabric developed with gum acacia (shell) and cedarwood oil (core). Research limitations/implications Due to the lack of time and less availability of essential oils, only two oils were used to test the insect repellent behavior. Practical implications This paper fulfills an identified need, it includes implications for the development of a very useful natural insecticide to repel silverfish (Lepisma saccharina) insect. This insect is a very common problem found in cloth wardrobes and bookshelves; it mainly attacks the fabric with cellulosic content and starch. Social implications Society will get major benefit of using these microencapsulated finished fabrics, which repel silverfish from their home and keep their clothing and books safe for longer period. The natural fragrance and medicinal benefits of these essential oils can never be ignored. Originality/value This study sets a new approach to repel insects like silverfish from the bookshelves and clothing wardrobes. A layer of insect repellent microencapsulated finished fabric can be added in these shelves and wardrobes. It is an eco-friendly approach of using natural essential oils instead of chemical insecticides.

Download Full-text

Organizational diversity: making the case for contextual interpretivism

Equality Diversity and Inclusion An International Journal ◽

10.1108/edi-02-2014-0010 ◽

2015 ◽

Vol 34 (6) ◽

pp. 496-509 ◽

Cited By ~ 7

Author(s):

W. J. Greeff

Keyword(s):

Diversity Management ◽

Internal Communication ◽

Organizational Diversity ◽

New Approach ◽

Content Type ◽

Organizational Contexts ◽

Organizational Settings ◽

Managing Diversity ◽

Functional Aspects ◽

Excellence Theory

Purpose – The purpose of this paper is to make a case for contextual interpretivism in managing diversity in organizational settings, specifically in its bearing on internal communication, going against the dominating functionalistic stance of venerated and ubiquitous approaches. Design/methodology/approach – Qualitative and quantitative methodologies were employed to explore the potential of contextual interpretivism within the mining and construction industries of South Africa, due to the fecund diversity context of its employee population. Findings – This paper points to the enriched understanding that could result from following a contextual interpretivistic approach to internal communication for diversity management, and in so doing discusses the ways in which this could take hold in organizations through the application of germane theoretical assertions of revered internal organizational communication literature, specifically the excellence theory and communication satisfaction. Research limitations/implications – The main limitation to this research is the restricted generalizability of its empirical research. Further research is required for the exploration of the central premise in other organizational contexts. Practical implications – The paper provides insights into the ways in which organizations could approach its diversity management so as to speak to more than just the functional aspects thereof, and rather to the importance of nurturing an understanding of employees’ interpretation of the organization’s diversity endeavors. Originality/value – The implications of applying a new approach to diversity management in organizational settings is discussed and argued, offering an empirical application thereof, which gives way to practical, data-driven recommendations for use in organizational settings.

Download Full-text

A multi-manned assembly line balancing problem with classified teams: a new approach

Assembly Automation ◽

10.1108/aa-04-2015-035 ◽

2016 ◽

Vol 36 (1) ◽

pp. 51-59 ◽

Cited By ~ 15

Author(s):

Hamid Yilmaz ◽

Mustafa Yilmaz

Keyword(s):

Design Methodology ◽

Assembly Line ◽

Assembly Line Balancing ◽

Line Balancing ◽

Test Problems ◽

New Approach ◽

Content Type ◽

Line Test ◽

Assembly Line Balancing Problem ◽

First Time

Purpose – Within team-oriented approaches, tasks are assigned to teams before being assigned to workstations as a reality of industry. So it becomes clear, which workers assemble which tasks. Design/methodology/approach – Team numbers of the assembly line can increase with the number of tasks, but at the same time, due to physical situations of the stations, there will be limitations of maximum working team numbers in a station. For this purpose, heuristic assembly line balancing (ALB) procedure is used and mathematical model is developed for the problem. Findings – Well-known assembly line test problems widely used in the literature are solved to indicate the effectiveness and applicability of the proposed approach in practice. Originality/value – This paper draws attention to ALB problem in which workers have been assigned to teams in advance due to the need for specialized skills or equipment on the line for the first time.

Download Full-text