Systematic Development of Data Mining-Based Data Quality Tools

Author(s):  
Dominik Luebbers ◽  
Udo Grimmer ◽  
Matthias Jarke

Data quality is a central issue in quality information management. Data quality problems can occur anywhere in an information system. These problems are addressed by Data Cleaning (DC), a process that identifies inaccurate, incomplete, or unreasonable data and then improves quality by correcting the detected errors and omissions. Various DC processes have been discussed in previous studies, but there is no standard or formalized DC process. Domain Driven Data Mining (DDDM) is one of the KDD methodologies often used for this purpose. This paper reviews and emphasizes the importance of DC in data preparation. Future work is also highlighted.
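The detect-then-correct pattern described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the function name `clean_records`, the range check used to decide what counts as "unreasonable", and median imputation as the correction step are all assumptions chosen for illustration.

```python
from statistics import median


def clean_records(records, field, low, high):
    """Flag missing or out-of-range values in `field`, then fill them in.

    Hypothetical helper: the [low, high] range check and the median
    imputation are illustrative assumptions, not part of the abstract.
    """
    # Values that are present and inside the accepted range.
    valid = [
        r[field]
        for r in records
        if r.get(field) is not None and low <= r[field] <= high
    ]
    fill = median(valid) if valid else None

    cleaned, issues = [], []
    for index, record in enumerate(records):
        value = record.get(field)
        fixed = dict(record)  # copy so the input is left untouched
        if value is None:
            issues.append((index, "incomplete"))
            fixed[field] = fill
        elif not (low <= value <= high):
            issues.append((index, "unreasonable"))
            fixed[field] = fill
        cleaned.append(fixed)
    return cleaned, issues


records = [{"age": 30}, {"age": None}, {"age": 250}, {"age": 40}]
cleaned, issues = clean_records(records, "age", 0, 120)
```

Returning the list of issues alongside the cleaned records keeps the detection step auditable, so each correction can be reviewed rather than applied silently.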


2008 ◽  
pp. 2566-2582
Author(s):  
Jeff Zeanah

This chapter discusses impediments to exploratory data mining success. These impediments were identified from anecdotal observations across multiple projects either reviewed or undertaken by the author and are classified into four main areas: data quality; lack of secondary or supporting data; insufficient analysis manpower; and lack of openness to new results. Each is explained, and recommendations are made to prevent the impediment from interfering with the organization’s data mining efforts. The intent of the chapter is to provide an organization with a structure to anticipate these problems and prevent them from occurring.


Author(s):  
Arun Thotapalli Sundararaman

The study of data quality for data mining applications has always been a complex topic; in recent years, it has grown more complex with the advent of big data as the source for data mining and business intelligence (BI) applications. In a big data environment, data is consumed in various states and forms as input for data mining, and this is the main source of added complexity. These new complexities and challenges arise from the underlying dimensions of big data (volume, variety, velocity, and value) together with the ability to consume data at various stages of transition from raw data to standardized datasets. These have created a need to expand the traditional data quality (DQ) factors into big data quality (BDQ) factors, as well as a need for new BDQ assessment and measurement frameworks for data mining and BI applications. However, very limited progress has been made in research and industry on the topic of BDQ and its relevance and criticality for data mining and BI applications. Data quality in data mining refers to the quality of the patterns or results of the models built using mining algorithms. DQ for data mining in business intelligence applications should be aligned with the objectives of the BI application. Three major approaches exist to measure DQ for data mining: objective measures, training/modeling approaches, and subjective measures. However, there is no agreement yet on definitions, measurements, or interpretations of DQ for data mining. Defining the factors of DQ for data mining and their measurement for a BI system has been one of the major challenges for researchers as well as practitioners. This chapter provides an overview of existing research in the area of BDQ definitions and measurement for data mining for BI, analyzes the gaps therein, and provides a direction for future research and practice in this area.


2019 ◽  
Vol 214 ◽  
pp. 01030
Author(s):  
Juraj Smiesko

The integrated system for data quality and conditions assessment for the ATLAS Tile Calorimeter is known within the ATLAS Tile Calorimeter community as Tile-in-One. It is a platform that combines all of the ATLAS Tile Calorimeter offline data quality tools in one unified web interface. It achieves this by using a simple main web server as a central hub and a group of small web applications called plugins, which provide the data quality assessment tools. Every plugin runs in its own virtual machine to prevent interference between plugins and to increase the stability of the platform.


Author(s):  
David J. Yates ◽  
Jennifer Xu

This research is motivated by data mining for wireless sensor network applications. The authors consider applications where data is acquired in real time, and thus data mining is performed on live streams of data rather than on stored databases. One challenge in supporting such applications is that sensor node power is a precious resource that needs to be managed as such. To conserve energy in the sensor field, the authors propose and evaluate several approaches to acquiring and then caching data in a sensor field data server. The authors show that for true real-time applications, for which response time dictates data quality, policies that emulate cache hits by computing and returning approximate values for sensor data yield a simultaneous quality improvement and cost saving. This “win-win” arises because, when data acquisition response time is sufficiently important, the decrease in resource consumption and increase in data quality achieved by using approximate values outweigh the negative impact on data accuracy due to the approximation. In contrast, when data accuracy drives quality, a linear trade-off between resource consumption and data accuracy emerges. The authors then identify caching and lookup policies for which the sensor field query rate is bounded when servicing an arbitrary workload of user queries. This upper bound is achieved by having multiple user queries share the cost of a sensor field query. Finally, the authors discuss the challenges facing sensor network data mining applications in terms of data collection, warehousing, and mining techniques.
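The idea of emulating cache hits with approximate values can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' policies: the class `SensorCache`, the `query_sensor` callable, the `max_age` freshness rule, and the choice of approximation (the stale value if one exists, otherwise the mean of all cached values) are hypothetical. Timestamps are passed in explicitly to keep the example deterministic.

```python
class SensorCache:
    """Toy data-server cache that can answer misses with approximate values."""

    def __init__(self, query_sensor, max_age, approximate):
        self.query_sensor = query_sensor  # hypothetical callable: sensor_id -> value
        self.max_age = max_age            # how long a cached reading counts as fresh
        self.approximate = approximate    # True: emulate hits instead of querying
        self.entries = {}                 # sensor_id -> (value, timestamp)
        self.field_queries = 0            # number of (costly) sensor field queries

    def read(self, sensor_id, now):
        entry = self.entries.get(sensor_id)
        if entry is not None and now - entry[1] <= self.max_age:
            return entry[0]  # genuine cache hit
        if self.approximate and self.entries:
            # Emulated hit: answer from cached data without touching the field.
            if entry is not None:
                return entry[0]  # stale reading used as the approximation
            values = [value for value, _ in self.entries.values()]
            return sum(values) / len(values)
        # Real miss: pay for a sensor field query and cache the result.
        self.field_queries += 1
        value = self.query_sensor(sensor_id)
        self.entries[sensor_id] = (value, now)
        return value


readings = {"a": 20.0, "b": 22.0}

exact = SensorCache(readings.get, max_age=5, approximate=False)
exact_values = [exact.read("a", 0), exact.read("a", 10), exact.read("b", 10)]

approx = SensorCache(readings.get, max_age=5, approximate=True)
approx_values = [approx.read("a", 0), approx.read("a", 10), approx.read("b", 10)]
```

In this toy run the approximate policy issues fewer sensor field queries than the exact one, at the price of returning an inaccurate value for sensor "b". A realistic policy would also need a rule for refreshing stale entries, which this sketch omits.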


2016 ◽  
Vol 14 (7) ◽  
pp. 309-319
Author(s):  
Kyu-Yeon Hwang ◽  
Eun-Sook Lee ◽  
Go-Won Kim ◽  
Sung-Ok Hong ◽  
Jong-Son Park ◽  
...  

2007 ◽  
Vol 6 ◽  
pp. S658-S666 ◽  
Author(s):  
Xingsen Li ◽  
Yong Shi ◽  
Jun Li ◽  
Peng Zhang
