Data Quality in Data Warehouses

Author(s):  
William E. Winkler

Fayyad and Uthursamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest data-cleaning methods are suitable for files with hundreds of millions of records (Winkler, 1999b, 2003b). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 1999a, 2004b).
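The edit/imputation step described in the abstract can be sketched in miniature: fields that fail a business rule, or are missing, are replaced by an imputed value. This is a minimal illustration only; the rule predicates, field names, and imputed defaults below are hypothetical and do not represent Winkler's actual methods.

```python
def edit_and_impute(record, rules, imputations):
    """Apply edit rules (business-rule predicates) to a record.
    Fields that are missing or fail their rule are replaced by an
    imputed value (here a simple default; real systems use donor
    records or model-based estimates)."""
    cleaned = dict(record)
    for field, rule in rules.items():
        value = cleaned.get(field)
        if value is None or not rule(value):
            cleaned[field] = imputations[field]
    return cleaned

# Hypothetical business rules and imputation defaults:
rules = {"age": lambda a: 0 <= a <= 120, "income": lambda x: x >= 0}
imputations = {"age": 35, "income": 40000}

print(edit_and_impute({"age": 999, "income": None}, rules, imputations))
# → {'age': 35, 'income': 40000}
```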

Author(s):  
William E. Winkler

Fayyad and Uthursamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This paper provides an overview of two methods for improving quality. The first is record linkage for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest record linkage methods are suitable for files with hundreds of millions of records (Winkler, 2004a, 2008). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 2004b, 2007a).
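The record-linkage step described above pairs records and scores their field-by-field similarity, with blocking used to keep the comparison space tractable for very large files. The sketch below is a simplified illustration under assumed field names, not Winkler's actual algorithm; it blocks on the first letter of a key field and averages string similarities against a threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(records, key_fields, threshold=0.85):
    """Return index pairs whose averaged field similarity meets the
    threshold (candidate duplicate links)."""
    # Blocking: only compare records whose first key field starts
    # with the same letter, avoiding an all-pairs scan.
    blocks = {}
    for i, rec in enumerate(records):
        blocks.setdefault(rec[key_fields[0]][:1].lower(), []).append(i)
    matches = []
    for block in blocks.values():
        for i, j in combinations(block, 2):
            score = sum(similarity(records[i][f], records[j][f])
                        for f in key_fields) / len(key_fields)
            if score >= threshold:
                matches.append((i, j))
    return matches

# Hypothetical example records with a typographical duplicate:
people = [
    {"name": "William Winkler", "city": "Washington"},
    {"name": "Wlliam Winkler",  "city": "Washington"},
    {"name": "Payal Pahwa",     "city": "Delhi"},
]
print(link_records(people, ["name", "city"]))
# → [(0, 1)]
```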


2013 ◽  
Vol 4 (1) ◽  
pp. 190-197
Author(s):  
Payal Pahwa ◽  
Rashmi Chhabra

Data warehousing is an emerging technology and has proved to be very important for organizations. Today every business organization needs accurate information in large volumes to make proper decisions, and the data behind those decisions must be of good quality. Data cleansing, which improves data quality, is fundamental to warehouse data reliability and to data warehousing success. There are various methods for data cleansing. This paper addresses issues related to data cleaning, focusing on the detection of duplicate records. An efficient algorithm for data cleaning is also proposed, and a review of data cleansing methods with a comparison between them is presented.


Author(s):  
Kanupriya Joshi ◽  
Mrs. Mamta

The problem of detecting and eliminating duplicate files is one of the major problems in the broad area of data cleaning and data quality. The same logical real-world entity often has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. The main objective of this research work is to detect exact and inexact duplicates using duplicate detection and elimination rules, thereby improving the quality of the data. The importance of data accuracy and quality has increased with the explosion of data size. In the duplicate elimination step, only one copy of each set of exact duplicate records or files is retained and the remaining duplicates are eliminated. This elimination process is essential to producing clean data. Before elimination, similarity threshold values are calculated for all records in the data set; these thresholds drive the elimination process.
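The threshold-then-eliminate procedure described above can be sketched minimally: each record's similarity to the already-retained copies is computed, and records at or above the threshold are dropped as duplicates. The similarity measure and sample rows below are illustrative assumptions, not the authors' actual rules.

```python
from difflib import SequenceMatcher

def eliminate_duplicates(records, threshold=0.9):
    """Retain one copy of each group of (near-)duplicate records.
    A record is eliminated if its similarity to any already-kept
    record meets the threshold, so exact and inexact duplicates
    (e.g. typographical variants) are both removed."""
    kept = []
    for rec in records:
        if not any(SequenceMatcher(None, rec, k).ratio() >= threshold
                   for k in kept):
            kept.append(rec)
    return kept

# Hypothetical rows: an exact duplicate and a typographical variant.
rows = ["John Smith;NY", "John Smith;NY", "Jon Smith;NY", "Jane Doe;LA"]
print(eliminate_duplicates(rows))
# → ['John Smith;NY', 'Jane Doe;LA']
```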



2008 ◽  
pp. 530-555
Author(s):  
Laura Irina Rusu ◽  
J. Wenny Rahayu ◽  
David Taniar

Developing a data warehouse for XML documents involves two major processes: one of creating it, by processing XML raw documents into a specified data warehouse repository; and the other of querying it, by applying techniques to better answer users’ queries. This paper focuses on the first part, that is, identifying a systematic approach for building a data warehouse of XML documents, specifically for transferring data from an underlying XML database into a defined XML data warehouse. The proposed methodology for building XML data warehouses covers processes including data cleaning and integration, summarization, intermediate XML documents, updating/linking existing documents, and creating fact tables. In this paper, we also present a case study on how to put this methodology into practice. We utilise XQuery technology in all of the above processes.
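The summarization and fact-table steps mentioned above can be sketched in Python (the original work uses XQuery): raw XML documents are parsed, grouped by a dimension element, and a numeric measure is aggregated into fact rows. The element names (`item`, `region`, `amount`) are hypothetical illustrations, not the paper's schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_fact_table(xml_docs, dim_tag, measure_tag):
    """Summarize raw XML documents into fact rows: group by a
    dimension element and sum a numeric measure per group."""
    facts = defaultdict(float)
    for doc in xml_docs:
        root = ET.fromstring(doc)
        for item in root.iter("item"):
            dim = item.findtext(dim_tag)
            measure = float(item.findtext(measure_tag) or 0)
            facts[dim] += measure
    return dict(facts)

# Hypothetical source documents from an underlying XML database:
docs = [
    "<orders><item><region>EU</region><amount>10</amount></item>"
    "<item><region>US</region><amount>5</amount></item></orders>",
    "<orders><item><region>EU</region><amount>7</amount></item></orders>",
]
print(build_fact_table(docs, "region", "amount"))
# → {'EU': 17.0, 'US': 5.0}
```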


2011 ◽  
Vol 1 (4) ◽  
pp. 56-71 ◽  
Author(s):  
Payal Pahwa ◽  
Rajiv Arora ◽  
Garima Thakur

The quality of real-world data fed into a data warehouse is a major concern today. Because the data comes from a variety of sources, it must be checked for errors and anomalies before being loaded into the warehouse. The source data may contain exact or approximate duplicate records, and the presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it. A brief analysis of existing work is presented, pointing out its major limitations, and a new framework is proposed that improves on the existing technique.


2021 ◽  
Author(s):  
M. B. Mohammed ◽  
H. S. Zulkafli ◽  
M. B. Adam ◽  
N. Ali ◽  
I. A. Baba
