dirty data
Recently Published Documents

TOTAL DOCUMENTS: 90 (FIVE YEARS: 30)
H-INDEX: 11 (FIVE YEARS: 1)

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Samir Al-Janabi ◽  
Ryszard Janicki

Purpose
Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules, and because of the huge amount of data, manual cleaning alone is infeasible: methods that automatically detect, repair, and clean dirty data are required. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.

Design/methodology/approach
A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm, which uses this set to repair inconsistent data.

Findings
The new approach was evaluated through experiments on real-world as well as synthetic datasets, with repair quality measured by the F-measure. The results showed that both the quality and the scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.

Originality/value
Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved at repairing inconsistent data by using conditional functional dependencies.
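To make the idea concrete, here is a minimal Python sketch of how a single conditional functional dependency (CFD) can flag and repair inconsistent tuples by majority vote within each left-hand-side group. It is an illustration only, not the authors' density-based algorithm, and all attribute names and sample records are invented. (The F-measure used for evaluation is, conventionally, the harmonic mean of precision and recall, F = 2PR / (P + R).)

```python
from collections import Counter, defaultdict

def repair_cfd(rows, lhs, rhs, condition):
    """Enforce the CFD (lhs -> rhs, when `condition` holds) by replacing
    minority rhs values with the most frequent value in each lhs group.
    Returns the number of repaired cells."""
    groups = defaultdict(list)
    for row in rows:
        # The CFD only constrains tuples matching its pattern condition.
        if all(row.get(attr) == val for attr, val in condition.items()):
            groups[tuple(row[attr] for attr in lhs)].append(row)
    repairs = 0
    for group in groups.values():
        majority, _ = Counter(r[rhs] for r in group).most_common(1)[0]
        for r in group:
            if r[rhs] != majority:
                r[rhs] = majority  # repair the violating cell in place
                repairs += 1
    return repairs

# Hypothetical CFD: for US records, ZIP determines CITY.
customers = [
    {"cc": "US", "zip": "10001", "city": "New York"},
    {"cc": "US", "zip": "10001", "city": "New York"},
    {"cc": "US", "zip": "10001", "city": "Boston"},  # inconsistent tuple
]
print(repair_cfd(customers, lhs=["zip"], rhs="city", condition={"cc": "US"}))  # -> 1
```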


2021 ◽  
Author(s):  
Susan Walsh

Dirty data is a problem that costs businesses thousands, if not millions, every year. In organisations large and small across the globe you will hear talk of data quality issues; what you will rarely hear about is the consequences, or how to fix them. Between the Spreadsheets: Classifying and Fixing Dirty Data draws on classification expert Susan Walsh's decade of experience in data classification to present a fool-proof method for cleaning and classifying your data. The book covers everything from the very basics of data classification to normalisation and taxonomies, and presents the author's proven COAT methodology, helping ensure an organisation's data is Consistent, Organised, Accurate and Trustworthy. A series of data horror stories outlines what can go wrong in managing data and, if it does, how it can be fixed. After reading this book, regardless of your level of experience, not only will you be able to work with your data more efficiently, but you will also understand the impact of that work and how it affects the rest of the organisation. Written in an engaging and highly practical manner, Between the Spreadsheets gives readers of all levels a deep understanding of the dangers of dirty data and the confidence and skills to work with it more efficiently and effectively.


2021 ◽  
pp. 33-58
Author(s):  
Magy Seif El-Nasr ◽  
Truong Huy Nguyen Dinh ◽  
Alessandro Canossa ◽  
Anders Drachen

This chapter focuses on the process of cleaning data and preparing it for further processing. Specifically, it discusses the techniques you will use, including preprocessing, outlier identification, consistency checking, and normalization or standardization of your data. It also discusses the different measurement types and which methods can be used with each, as well as ways to deal with inconsistent or dirty data. The chapter takes a practical approach, integrating several labs that demonstrate how to perform these steps on real game data.
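As a flavour of what such labs involve, the short Python sketch below flags outliers with the interquartile-range (IQR) rule and rescales a feature by min-max normalization and z-score standardization. The sample session lengths are made up for illustration; the chapter's own labs and datasets may differ.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Standardize: center on the mean, scale by the standard deviation."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

session_minutes = [12, 15, 14, 13, 16, 240]  # one suspiciously long session
print(iqr_outliers(session_minutes))  # -> [240]
print(min_max(session_minutes))       # 240 maps to 1.0, the rest near 0
print(z_scores(session_minutes))
```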


Author(s):  
Arif Hanafi ◽  
Sulaiman Harun ◽  
Sofika Enggari ◽  
Larissa Navia Rani

It is certain that email has extraordinary significance in modern business communication. Every day, a bulk of emails is sent from organizations to clients and suppliers, from employees to their managers, and from one colleague to another, so a vast amount of email accumulates in the data warehouse. Data cleaning is an activity performed on the data sets of a data warehouse to upgrade and maintain the quality and consistency of the data. This paper underlines the issues associated with dirty data and the detection of duplicates in the email column. It examines the strategy of data cleaning from a different point of view and provides an algorithm for discovering errors and duplicate entries in the data sets of an existing data warehouse. The paper characterizes alliance rules, based on the concept of mathematical association rules, to determine duplicate entries in the email column of the data sets.
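The paper's alliance-rule algorithm is not reproduced here, but the following Python sketch shows the simplest form the task can take: canonicalizing addresses before comparison so that superficially different entries collapse into the same duplicate group. The normalization rules (lower-casing, stripping a "+tag" from the local part) are assumptions for illustration, not the paper's method.

```python
from collections import defaultdict

def canonical(email):
    """Lower-case an address and strip any '+tag' from its local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def find_duplicates(emails):
    """Group raw addresses that collapse to the same canonical form."""
    groups = defaultdict(list)
    for e in emails:
        groups[canonical(e)].append(e)
    return {k: v for k, v in groups.items() if len(v) > 1}

column = ["Ann.Lee@Example.com", "ann.lee+news@example.com", "bob@example.com"]
print(find_duplicates(column))
# -> {'ann.lee@example.com': ['Ann.Lee@Example.com', 'ann.lee+news@example.com']}
```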


2021 ◽  
Vol 55 ◽  
pp. 100714
Author(s):  
James G. Lawson ◽  
Daniel A. Street

2021 ◽  
pp. 58-73
Author(s):  
Eric D. Perakslis ◽  
Martin Stanley

The rise of big data and the rise of digital health in medicine have been concurrent over the last two decades. The two are often confused: while virtually all digital health solutions, such as sensors, wearable devices, and diagnostic algorithms, involve big data, not all big data in health care originates from digital health tools, genomic sequencing data being one example. In this chapter, the role and importance of big data in the discovery and development of medicines and medical devices are detailed, with the specific focus of providing a thorough understanding of the product discovery, product development, clinical trial, regulatory authorization, and marketing processes. Concepts such as "dirty data," regulatory decision-making, remote and virtualized clinical trials, and other key elements of digital health are discussed.

