Data Quality in Data Warehouses

Author(s):  
William E. Winkler

Fayyad and Uthursamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest data-cleaning methods are suitable for files with hundreds of millions of records (Winkler, 1999b, 2003b). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 1999a, 2004b).
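The edit/imputation step described in the abstract can be sketched in miniature: fields that fail a business rule, or are missing, are replaced by an imputed value. This is a minimal illustration only; the rule predicates, field names, and imputed defaults below are hypothetical and do not represent Winkler's actual methods.

```python
def edit_and_impute(record, rules, imputations):
    """Apply edit rules (business-rule predicates) to a record.
    Fields that are missing or fail their rule are replaced by an
    imputed value (here a simple default; real systems use donor
    records or model-based estimates)."""
    cleaned = dict(record)
    for field, rule in rules.items():
        value = cleaned.get(field)
        if value is None or not rule(value):
            cleaned[field] = imputations[field]
    return cleaned

# Hypothetical business rules and imputation defaults:
rules = {"age": lambda a: 0 <= a <= 120, "income": lambda x: x >= 0}
imputations = {"age": 35, "income": 40000}

print(edit_and_impute({"age": 999, "income": None}, rules, imputations))
# → {'age': 35, 'income': 40000}
```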

Author(s):  
William E. Winkler

Fayyad and Uthursamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This paper provides an overview of two methods for improving quality. The first is record linkage for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest record linkage methods are suitable for files with hundreds of millions of records (Winkler, 2004a, 2008). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 2004b, 2007a).
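The record-linkage step described above pairs records and scores their field-by-field similarity, with blocking used to keep the comparison space tractable for very large files. The sketch below is a simplified illustration under assumed field names, not Winkler's actual algorithm; it blocks on the first letter of a key field and averages string similarities against a threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(records, key_fields, threshold=0.85):
    """Return index pairs whose averaged field similarity meets the
    threshold (candidate duplicate links)."""
    # Blocking: only compare records whose first key field starts
    # with the same letter, avoiding an all-pairs scan.
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(rec[key_fields[0]][:1].lower(), []).append(i)
    matches = []
    for block in blocks.values():
        for i, j in combinations(block, 2):
            score = sum(similarity(records[i][f], records[j][f])
                        for f in key_fields) / len(key_fields)
            if score >= threshold:
                matches.append((i, j))
    return matches

# Hypothetical example records with a typographical duplicate:
people = [
    {"name": "William Winkler", "city": "Washington"},
    {"name": "Wlliam Winkler",  "city": "Washington"},
    {"name": "Payal Pahwa",     "city": "Delhi"},
]
print(link_records(people, ["name", "city"]))
# → [(0, 1)]
```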


2013 ◽  
Vol 4 (1) ◽  
pp. 190-197
Author(s):  
Payal Pahwa ◽  
Rashmi Chhabra

Data warehousing is an emerging technology and has proved to be very important for organizations. Today every business organization needs accurate information in large volumes to make proper decisions, and the data behind those decisions must be of good quality. Data cleansing, which improves data quality, is fundamental to warehouse data reliability and to data warehousing success. There are various methods for data cleansing. This paper addresses issues related to data cleaning, focusing on the detection of duplicate records. An efficient algorithm for data cleaning is also proposed, and a review of data cleansing methods with a comparison between them is presented.


Author(s):  
Kanupriya Joshi ◽  
Mrs. Mamta

The problem of detecting and eliminating duplicate files is one of the major problems in the broad area of data cleaning and data quality. The same logical real-world entity often has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. The main objective of this research work is to detect exact and inexact duplicates using duplicate detection and elimination rules, thereby improving the quality of the data. The importance of data accuracy and quality has increased with the explosion of data size. In the duplicate elimination step, only one copy of each set of exact duplicate records or files is retained and the remaining duplicates are eliminated. This elimination process is essential to producing clean data. Before elimination, similarity threshold values are calculated for all records in the data set; these thresholds drive the elimination process.
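The threshold-then-eliminate procedure described above can be sketched minimally: each record's similarity to the already-retained copies is computed, and records at or above the threshold are dropped as duplicates. The similarity measure and sample rows below are illustrative assumptions, not the authors' actual rules.

```python
from difflib import SequenceMatcher

def eliminate_duplicates(records, threshold=0.9):
    """Retain one copy of each group of (near-)duplicate records.
    A record is eliminated if its similarity to any already-kept
    record meets the threshold, so exact and inexact duplicates
    (e.g. typographical variants) are both removed."""
    kept = []
    for rec in records:
        if not any(SequenceMatcher(None, rec, k).ratio() >= threshold
                   for k in kept):
            kept.append(rec)
    return kept

# Hypothetical rows: an exact duplicate and a typographical variant.
rows = ["John Smith;NY", "John Smith;NY", "Jon Smith;NY", "Jane Doe;LA"]
print(eliminate_duplicates(rows))
# → ['John Smith;NY', 'Jane Doe;LA']
```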



2008 ◽  
pp. 530-555
Author(s):  
Laura Irina Rusu ◽  
J. Wenny Rahayu ◽  
David Taniar

Developing a data warehouse for XML documents involves two major processes: one of creating it, by processing XML raw documents into a specified data warehouse repository; and the other of querying it, by applying techniques to better answer users’ queries. This paper focuses on the first part, that is, identifying a systematic approach for building a data warehouse of XML documents, specifically for transferring data from an underlying XML database into a defined XML data warehouse. The proposed methodology for building XML data warehouses covers processes including data cleaning and integration, summarization, intermediate XML documents, updating/linking existing documents, and creating fact tables. In this paper, we also present a case study on how to put this methodology into practice. We utilise XQuery technology in all of the above processes.
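The summarization and fact-table steps mentioned above can be sketched in Python (the original work uses XQuery): raw XML documents are parsed, grouped by a dimension element, and a numeric measure is aggregated into fact rows. The element names (`item`, `region`, `amount`) are hypothetical illustrations, not the paper's schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_fact_table(xml_docs, dim_tag, measure_tag):
    """Summarize raw XML documents into fact rows: group by a
    dimension element and sum a numeric measure per group."""
    facts = defaultdict(float)
    for doc in xml_docs:
        root = ET.fromstring(doc)
        for item in root.iter("item"):
            dim = item.findtext(dim_tag)
            measure = float(item.findtext(measure_tag) or 0)
            facts[dim] += measure
    return dict(facts)

# Hypothetical source documents from an underlying XML database:
docs = [
    "<orders><item><region>EU</region><amount>10</amount></item>"
    "<item><region>US</region><amount>5</amount></item></orders>",
    "<orders><item><region>EU</region><amount>7</amount></item></orders>",
]
print(build_fact_table(docs, "region", "amount"))
# → {'EU': 17.0, 'US': 5.0}
```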


2011 ◽  
Vol 1 (4) ◽  
pp. 56-71 ◽  
Author(s):  
Payal Pahwa ◽  
Rajiv Arora ◽  
Garima Thakur

The quality of real-world data fed into a data warehouse is a major concern today. Because the data comes from a variety of sources, it must be checked for errors and anomalies before being loaded into the warehouse. The source data may contain exact or approximate duplicate records, and the presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it. A brief analysis of existing work is presented, pointing out its major limitations, and a new framework is proposed that improves on the existing technique.


2021 ◽  
Author(s):  
M. B. Mohammed ◽  
H. S. Zulkafli ◽  
M. B. Adam ◽  
N. Ali ◽  
I. A. Baba
