dirty data
Recently Published Documents

TOTAL DOCUMENTS: 90 (FIVE YEARS: 30)
H-INDEX: 11 (FIVE YEARS: 1)

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Samir Al-Janabi ◽  
Ryszard Janicki

Purpose
Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules, and because of the huge amount of data, manual cleaning alone is infeasible: methods that automatically detect, repair, and clean dirty data are required. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.

Design/methodology/approach
A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm, which uses this set to repair inconsistent data.

Findings
The new approach was evaluated through experiments on real-world as well as synthetic datasets, with repair quality measured by the F-measure. The results showed that both the quality and the scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.

Originality/value
Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved at repairing inconsistent data by using conditional functional dependencies.
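To make the idea concrete, here is a minimal Python sketch of how a single conditional functional dependency (CFD) can flag and repair inconsistent tuples by majority vote within each left-hand-side group. It is an illustration only, not the authors' density-based algorithm, and all attribute names and sample records are invented. (The F-measure used for evaluation is, conventionally, the harmonic mean of precision and recall, F = 2PR / (P + R).)

```python
from collections import Counter, defaultdict

def repair_cfd(rows, lhs, rhs, condition):
    """Enforce the CFD (lhs -> rhs, when `condition` holds) by replacing
    minority rhs values with the most frequent value in each lhs group.
    Returns the number of repaired cells."""
    groups = defaultdict(list)
    for row in rows:
        # The CFD only constrains tuples matching its pattern condition.
        if all(row.get(attr) == val for attr, val in condition.items()):
            groups[tuple(row[attr] for attr in lhs)].append(row)
    repairs = 0
    for group in groups.values():
        majority, _ = Counter(r[rhs] for r in group).most_common(1)[0]
        for r in group:
            if r[rhs] != majority:
                r[rhs] = majority  # repair the violating cell in place
                repairs += 1
    return repairs

# Hypothetical CFD: for US records, ZIP determines CITY.
customers = [
    {"cc": "US", "zip": "10001", "city": "New York"},
    {"cc": "US", "zip": "10001", "city": "New York"},
    {"cc": "US", "zip": "10001", "city": "Boston"},  # inconsistent tuple
]
print(repair_cfd(customers, lhs=["zip"], rhs="city", condition={"cc": "US"}))  # -> 1
```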


2021 ◽  
Author(s):  
Susan Walsh

Dirty data is a problem that costs businesses thousands, if not millions, every year. In organisations large and small across the globe you will hear talk of data quality issues; what you will rarely hear about is the consequences, or how to fix them. Between the Spreadsheets: Classifying and Fixing Dirty Data draws on classification expert Susan Walsh's decade of experience in data classification to present a fool-proof method for cleaning and classifying your data. The book covers everything from the very basics of data classification to normalisation and taxonomies, and presents the author's proven COAT methodology, helping ensure an organisation's data is Consistent, Organised, Accurate and Trustworthy. A series of data horror stories outlines what can go wrong in managing data and, if it does, how it can be fixed. After reading this book, regardless of your level of experience, not only will you be able to work with your data more efficiently, but you will also understand the impact of that work and how it affects the rest of the organisation. Written in an engaging and highly practical manner, Between the Spreadsheets gives readers of all levels a deep understanding of the dangers of dirty data and the confidence and skills to work with it more efficiently and effectively.


2021 ◽  
pp. 33-58
Author(s):  
Magy Seif El-Nasr ◽  
Truong Huy Nguyen Dinh ◽  
Alessandro Canossa ◽  
Anders Drachen

This chapter focuses on the process of cleaning data and preparing it for further processing. Specifically, it discusses the techniques you will use, including preprocessing, outlier identification, consistency checking, and normalization or standardization of your data. It also discusses the different measurement types and which methods can be used with each, as well as ways to deal with inconsistent or dirty data. The chapter takes a practical approach, integrating several labs that demonstrate how to perform these steps on real game data.
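As a flavour of what such labs involve, the short Python sketch below flags outliers with the interquartile-range (IQR) rule and rescales a feature by min-max normalization and z-score standardization. The sample session lengths are made up for illustration; the chapter's own labs and datasets may differ.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Standardize: center on the mean, scale by the standard deviation."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

session_minutes = [12, 15, 14, 13, 16, 240]  # one suspiciously long session
print(iqr_outliers(session_minutes))  # -> [240]
print(min_max(session_minutes))       # 240 maps to 1.0, the rest near 0
print(z_scores(session_minutes))
```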


Author(s):  
Arif Hanafi ◽  
Sulaiman Harun ◽  
Sofika Enggari ◽  
Larissa Navia Rani

It is certain that email has extraordinary significance in modern business communication. Every day, a bulk of emails is sent from organizations to clients and suppliers, from employees to their managers, and from one colleague to another, so a vast amount of email accumulates in the data warehouse. Data cleaning is an activity performed on the data sets of a data warehouse to upgrade and maintain the quality and consistency of the data. This paper underlines the issues associated with dirty data and the detection of duplicates in the email column. It examines the strategy of data cleaning from a different point of view and provides an algorithm for discovering errors and duplicate entries in the data sets of an existing data warehouse. The paper characterizes alliance rules, based on the concept of mathematical association rules, to determine duplicate entries in the email column of the data sets.
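The paper's alliance-rule algorithm is not reproduced here, but the following Python sketch shows the simplest form the task can take: canonicalizing addresses before comparison so that superficially different entries collapse into the same duplicate group. The normalization rules (lower-casing, stripping a "+tag" from the local part) are assumptions for illustration, not the paper's method.

```python
from collections import defaultdict

def canonical(email):
    """Lower-case an address and strip any '+tag' from its local part."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"

def find_duplicates(emails):
    """Group raw addresses that collapse to the same canonical form."""
    groups = defaultdict(list)
    for e in emails:
        groups[canonical(e)].append(e)
    return {k: v for k, v in groups.items() if len(v) > 1}

column = ["Ann.Lee@Example.com", "ann.lee+news@example.com", "bob@example.com"]
print(find_duplicates(column))
# -> {'ann.lee@example.com': ['Ann.Lee@Example.com', 'ann.lee+news@example.com']}
```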


2021 ◽  
Vol 55 ◽  
pp. 100714
Author(s):  
James G. Lawson ◽  
Daniel A. Street

2021 ◽  
pp. 58-73
Author(s):  
Eric D. Perakslis ◽  
Martin Stanley

The rise of big data and the rise of digital health in medicine have been concurrent over the last two decades. The two are often confused: while virtually all digital health solutions, such as sensors, wearable devices, and diagnostic algorithms, involve big data, not all big data in health care originates from digital health tools, genomic sequencing data being one example. In this chapter, the role and importance of big data in the discovery and development of medicines and medical devices are detailed, with the specific focus of providing a thorough understanding of the product discovery, product development, clinical trial, regulatory authorization, and marketing processes. Concepts such as "dirty data," regulatory decision-making, remote and virtualized clinical trials, and other key elements of digital health are discussed.

