BST Algorithm for Duplicate Elimination in Data Warehouse

2013 ◽  
Vol 4 (1) ◽  
pp. 190-197
Author(s):  
Payal Pahwa ◽  
Rashmi Chhabra

Data warehousing is an emerging technology and has proved to be very important for organizations. Today every business organization needs accurate and large amounts of information to make proper decisions, and sound business decisions require data of good quality. Improving data quality requires data cleansing, which is fundamental to the reliability of warehouse data and to data warehousing success. There are various methods for data cleansing. This paper addresses issues related to data cleaning, focusing on the detection of duplicate records, and proposes an efficient algorithm for data cleaning. A review of data cleansing methods and a comparison between them are also presented.
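A BST-based duplicate check of the kind the title suggests can be sketched as follows: each record's key is inserted into a binary search tree, and an insertion that lands on an equal key flags the record as a duplicate. This is an illustrative reconstruction, not the paper's own algorithm; all names are invented.

```python
# Sketch: flag duplicate records by inserting their keys into a binary
# search tree; reaching an equal key during insertion marks a duplicate.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the BST; return (root, was_duplicate)."""
    if root is None:
        return Node(key), False
    if key == root.key:
        return root, True          # key already present: duplicate
    if key < root.key:
        root.left, dup = insert(root.left, key)
    else:
        root.right, dup = insert(root.right, key)
    return root, dup

def find_duplicates(records, key_fn=lambda r: r):
    root, dups = None, []
    for rec in records:
        root, dup = insert(root, key_fn(rec))
        if dup:
            dups.append(rec)
    return dups

print(find_duplicates(["ann", "bob", "ann", "cid", "bob"]))  # ['ann', 'bob']
```

With a `key_fn` that extracts a cleaned comparison key from each record, the same traversal handles structured warehouse rows.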

2014 ◽  
Vol 668-669 ◽  
pp. 1374-1377 ◽  
Author(s):  
Wei Jun Wen

ETL refers to the process of data extraction, transformation and loading, and is deemed a critical step in ensuring the quality, specification and standardization of marine environmental data. Marine data, due to their complexity, field diversity and huge volume, remain decentralized, polyphyletic and heterogeneous, with differing semantics, and hence fall far short of providing effective data sources for decision making. ETL enables the construction of a marine environmental data warehouse through the cleaning, transformation, integration, loading and periodic updating of basic marine data. The paper presents research on rules for the cleaning, transformation and integration of marine data, on the basis of which an original ETL system for a marine environmental data warehouse is designed and developed. The system further guarantees data quality and correctness in future analysis and decision-making based on marine environmental data.
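A minimal extract-transform-load pass of the kind described above might look like the following sketch. The field names (`station`, `temp_c`) and cleaning rules are illustrative assumptions, not taken from the paper.

```python
# Sketch of one ETL pass: extract raw rows, clean and transform them
# (trim whitespace, normalize identifiers, drop incomplete records),
# then load them into a simple in-memory "warehouse" keyed by station.

def extract(raw_rows):
    # Extraction stage: pull rows from the source into plain dicts.
    return [dict(r) for r in raw_rows]

def transform(rows):
    # Cleaning/transformation stage: apply the cleaning rules.
    cleaned = []
    for r in rows:
        station = str(r.get("station", "")).strip().upper()
        temp = r.get("temp_c")
        if not station or temp is None:
            continue  # cleaning rule: discard incomplete records
        cleaned.append({"station": station, "temp_c": float(temp)})
    return cleaned

def load(rows, warehouse):
    # Loading stage: append readings under their normalized station key.
    for r in rows:
        warehouse.setdefault(r["station"], []).append(r["temp_c"])
    return warehouse

warehouse = {}
raw = [{"station": " buoy-7 ", "temp_c": "18.4"},
       {"station": "", "temp_c": "12.0"},          # dropped: no station id
       {"station": "buoy-7", "temp_c": 19.1}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # {'BUOY-7': [18.4, 19.1]}
```

Periodic updating, as the abstract notes, amounts to re-running this pipeline on new source extracts against the same warehouse.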


2008 ◽  
pp. 18-25
Author(s):  
James E. Yao ◽  
Chang Liu ◽  
Qiyang Chen ◽  
June Lu

As internal and external demands on information from managers are increasing rapidly, especially the information that is processed to serve managers’ specific needs, regular databases and decision support systems (DSS) cannot provide the information needed. Data warehouses came into existence to meet these needs, consolidating and integrating information from many internal and external sources and arranging it in a meaningful format for making accurate business decisions (Martin, 1997). In the past five years, there has been a significant growth in data warehousing (Hoffer, Prescott, & McFadden, 2005). Correspondingly, this occurrence has brought up the issue of data warehouse administration and management. Data warehousing has been increasingly recognized as an effective tool for organizations to transform data into useful information for strategic decision-making. To achieve competitive advantages via data warehousing, data warehouse management is crucial (Ma, Chou, & Yen, 2000).


2021 ◽  
Vol Special Issue (2) ◽  
Author(s):  
Bernard Ntsama ◽  
Ado Bwaka ◽  
Reggis Katsande ◽  
Regis Maurin Obiang ◽  
Daniel Rasheed Oyaole ◽  
...  

The Polio Eradication Initiative (PEI) is one of the most important public health interventions in Africa. Quality data are necessary to monitor activities and key performance indicators and to assess year-by-year progress. This has been possible thanks to a solid polio health information system consolidated over the years. This study describes the whole process of producing data for decision making. The main components are the data flow; the roles of the different levels; data capture and tools; standards and codes; the data cleaning process; the integration of data from various sources; the introduction of innovative technologies; feedback and information products; and capacity building. The results show improvements in the timeliness of reporting data to the next level, the availability of quality data for analysis to monitor key surveillance performance indicators, the output of the data cleaning exercise pointing out data quality gaps, the integration of data from various sources to produce meaningful outputs, and feedback for information dissemination. From the review of the process, an improvement in the quality of polio data is observed, resulting from a well-defined information system with standardized tools and Standard Operating Procedures (SOPs) and the introduction of innovative technologies. However, there is room for improvement; for example, data are still entered multiple times from the field to the surveillance unit and the laboratory, and innovative technologies are for the time being implemented only in hard-to-reach areas because of the high cost of the investment. A strong information system has been put in place from the community level to the global level, linking surveillance, laboratory and immunization coverage data. To maintain standards in the polio information system, continuous staff training is needed in surveillance, information systems, data analysis and information sharing. The use of innovative technologies on web-based systems and mobile devices, with validation rules and information checks, will avoid multiple entries.
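The validation rules mentioned at the end could be sketched as a server-side check that rejects malformed or already-reported cases before they reach the surveillance database. The case-id format, field names, and rules below are illustrative assumptions, not the PEI system's actual specification.

```python
# Sketch: entry-time validation rules that catch format errors,
# duplicate case submissions, and implausible dates.

import re

seen_ids = set()  # ids already accepted into the database

def validate(record):
    errors = []
    case_id = record.get("case_id", "")
    if not re.fullmatch(r"[A-Z]{3}-\d{4}-\d{3}", case_id):
        errors.append("case_id does not match expected format")
    if case_id in seen_ids:
        errors.append("duplicate entry: case already reported")
    if record.get("onset_date", "") > record.get("report_date", ""):
        errors.append("onset date after report date")
    if not errors:
        seen_ids.add(case_id)  # accept the record
    return errors

print(validate({"case_id": "NGA-2021-001",
                "onset_date": "2021-03-01",
                "report_date": "2021-03-05"}))  # [] (accepted)
```

Submitting the same case id a second time would return the duplicate-entry error, which is exactly the multiple-entry problem the abstract says such checks should avoid.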


Author(s):  
Kanupriya Joshi ◽  
Mrs. Mamta

The problem of detecting and eliminating duplicate records is one of the major problems in the broad area of data cleaning and data quality. The same logical real-world entity may have multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. The main objective of this research is to detect exact and inexact duplicates by using duplicate detection and elimination rules, an approach that improves the quality of the data. The importance of data accuracy and quality has increased with the explosion of data size. In the duplicate elimination step, only one copy of each set of exact duplicate records or files is retained and the other duplicates are eliminated. This elimination process is essential for producing clean data. Before elimination, similarity threshold values are calculated for all records in the data set; these threshold values drive the elimination process.
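The threshold-based elimination step described above can be sketched as follows: a pairwise similarity score is computed, and a record whose similarity to an already-kept record meets the threshold is eliminated as an inexact duplicate. Here `difflib`'s sequence ratio stands in for the paper's (unspecified) similarity measure, and the threshold value is an illustrative choice.

```python
# Sketch: keep only one copy of records whose pairwise similarity to an
# already-kept record reaches the threshold (inexact duplicates).

from difflib import SequenceMatcher

def similarity(a, b):
    # Case-insensitive character-sequence similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def eliminate_duplicates(records, threshold=0.9):
    kept = []
    for rec in records:
        if any(similarity(rec, k) >= threshold for k in kept):
            continue  # inexact duplicate of a kept record: eliminate
        kept.append(rec)
    return kept

rows = ["John Smith", "john smith", "Jon Smith", "Mary Jones"]
print(eliminate_duplicates(rows))  # ['John Smith', 'Mary Jones']
```

Exact duplicates score 1.0 and are always removed; lowering the threshold makes the match fuzzier at the risk of merging genuinely distinct records.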


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2083
Author(s):  
John Byabazaire ◽  
Gregory O’Hare ◽  
Declan Delaney

Existing research recognizes the critical role of quality data in the current big-data and Internet of Things (IoT) era. Data quality has a direct impact on model results and hence on business decisions. The growth in the number of IoT-connected devices makes it hard to assess data quality using traditional assessment methods. This is exacerbated by the need to share data across different IoT domains, which increases the heterogeneity of the data. Data-shared IoT defines a new perspective on IoT applications that benefit from sharing data among different IoT domains to create new use-case applications. For example, sharing data between smart transport and smart industry can lead to applications such as intelligent logistics management and warehouse management. The benefits of such applications, however, can only be achieved if the shared data is of acceptable quality. Three practices in data quality (DQ) determination approaches restrict their effective use on data-shared platforms: (1) most DQ techniques validate test data against a known quantity considered to be a reference, a gold reference; (2) narrow sets of static metrics are used to describe quality, and each consumer uses these metrics in similar ways; (3) data quality is evaluated in isolated stages throughout the processing pipeline. Data-shared IoT presents unique challenges: (1) each application and use case in shared IoT has its own description of data quality and requires a different set of metrics, leading to an extensive list of DQ dimensions that are difficult to implement in real-world applications; (2) most data in IoT scenarios have no gold reference; (3) factors endangering DQ in shared IoT exist throughout the entire big-data model, from data collection to data visualization and data use. This paper describes data-shared IoT and shared data pools while highlighting the importance of sharing quality data across various domains. The article examines how trust can serve as a measure of quality in data-shared IoT. We conclude that researchers can combine such trust-based techniques with blockchain for secure end-to-end data quality assessment.
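One way to read "trust as a measure of quality" is a score blended from indicators a data consumer can observe without a gold reference, such as completeness, timeliness, and provider reputation. The metric names and weights below are illustrative assumptions, not the article's model.

```python
# Sketch: a trust score for a shared data stream as a weighted blend of
# endpoint-observable quality indicators, avoiding any gold reference.

def trust_score(stream, weights=None):
    weights = weights or {"completeness": 0.4,
                          "timeliness": 0.3,
                          "reputation": 0.3}
    return sum(weights[m] * stream[m] for m in weights)

sensor = {"completeness": 0.95,  # fraction of expected readings received
          "timeliness": 0.80,    # fraction arriving within the deadline
          "reputation": 0.90}    # historical accuracy of this provider
print(round(trust_score(sensor), 2))  # 0.89
```

Because each consumer can supply its own weights, the same stream can earn different trust scores in different domains, which matches the abstract's point that each use case has its own description of quality.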


Author(s):  
Payal Pahwa ◽  
Rajiv Arora ◽  
Garima Thakur

The quality of the real-world data being fed into a data warehouse is a major concern today. Because the data come from a variety of sources, they must be checked for errors and anomalies before being loaded into the data warehouse. The source data may contain exact duplicate records or approximate duplicate records. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it, and briefly reviews existing work, pointing out its major limitations. A new framework is then proposed that improves on the existing technique.
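The exact-versus-approximate distinction drawn above suggests a two-stage check: catch exact duplicates by comparing raw records directly, then catch approximate ones by comparing records after normalization. The normalization rules below (case folding, stripping punctuation and spacing) are illustrative assumptions, not the paper's framework.

```python
# Sketch: classify incoming records as exact duplicates, approximate
# duplicates (equal after normalization), or unique.

import re

def normalize(record):
    # Fold case and strip non-alphanumeric characters from each field.
    return tuple(re.sub(r"[^a-z0-9]", "", str(v).lower()) for v in record)

def classify(records):
    exact_seen, approx_seen = set(), set()
    exact, approx, unique = [], [], []
    for rec in records:
        if rec in exact_seen:
            exact.append(rec)              # byte-for-byte repeat
        elif normalize(rec) in approx_seen:
            approx.append(rec)             # repeat only after cleaning
        else:
            unique.append(rec)
        exact_seen.add(rec)
        approx_seen.add(normalize(rec))
    return exact, approx, unique

rows = [("J. Smith", "NY"), ("J. Smith", "NY"), ("j smith", "ny")]
exact, approx, unique = classify(rows)
print(exact, approx, unique)
```

Here the second row is an exact duplicate and the third an approximate one; only the first survives as unique, which is the retention behavior a cleansing framework would enforce before loading.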


2008 ◽  
pp. 3411-3415
Author(s):  
John M. Artz

Data warehousing is an emerging technology that greatly extends the capabilities of relational databases specifically in the analysis of very large sets of time-oriented data. The emergence of data warehousing has been somewhat eclipsed over the past decade by the simultaneous emergence of Web technologies. However, Web technologies and data warehousing have some natural synergies that are not immediately obvious. First, Web technologies make data warehouse data more easily available to a much wider variety of users. Second, data warehouse technologies can be used to analyze traffic to a Web site in order to gain a much better understanding of the visitors to the Web site. It is this second synergy that is the focus of this article.
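The second synergy, analyzing site traffic with warehouse techniques, can be sketched as loading web-server hits into a small time-oriented fact table and rolling it up by page and hour. The schema and sample clickstream below are invented for illustration.

```python
# Sketch: roll up a clickstream into page views per hour and distinct
# visitors per page, the kind of query a web-traffic warehouse answers.

from collections import Counter

hits = [  # (hour bucket, page, visitor id): an illustrative fact table
    ("2024-05-01T09", "/products", "v1"),
    ("2024-05-01T09", "/products", "v2"),
    ("2024-05-01T10", "/checkout", "v1"),
    ("2024-05-01T10", "/products", "v3"),
]

# Page views per (hour, page): an aggregation over the time dimension.
page_views = Counter((hour, page) for hour, page, _ in hits)

# Distinct visitors per page: a measure of audience reach.
visitors = {}
for _, page, vid in hits:
    visitors.setdefault(page, set()).add(vid)

print(page_views[("2024-05-01T09", "/products")])  # 2
print(len(visitors["/products"]))                  # 3
```

Because the fact table is keyed by time, the same data supports the large-scale time-oriented analysis the abstract credits to data warehousing.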

