Data Quality for Data Mining in Business Intelligence Applications

Author(s):  
Arun Thotapalli Sundararaman

Data Quality (DQ) in data mining refers to the quality of the patterns or results of the models built using mining algorithms. DQ for data mining in Business Intelligence (BI) applications should be aligned with the objectives of the BI application. Objective measures, training/modeling approaches, and subjective measures are the three major approaches that exist to measure DQ for data mining. However, there is no agreement yet on definitions, measurements, or interpretations of DQ for data mining. Defining the factors of DQ for data mining and their measurement for a BI system has been one of the major challenges for researchers as well as practitioners. This chapter provides an overview of existing research in the area of DQ definition and measurement for data mining for BI, analyzes the gaps therein, reviews proposed solutions, and provides a direction for future research and practice in this area.
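
As an illustration of the objective-measure approach mentioned above, the sketch below computes two common DQ indicators (completeness and rule-based validity) on a small pandas DataFrame. The column names, domain rules, and values are illustrative assumptions, not measures defined in the chapter.

```python
# Minimal sketch of two objective DQ measures on a mining-ready table.
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Fraction of non-null cells in the dataset."""
    return 1.0 - df.isna().sum().sum() / df.size

def validity(df: pd.DataFrame, rules: dict) -> float:
    """Fraction of non-null values that satisfy a per-column domain rule."""
    checked = passed = 0
    for col, rule in rules.items():
        values = df[col].dropna()
        checked += len(values)
        passed += int(rule(values).sum())
    return passed / checked if checked else 1.0

if __name__ == "__main__":
    sample = pd.DataFrame({
        "customer_age": [34, 51, None, 27, 142],      # 142 falls outside the valid range
        "monthly_spend": [120.5, 80.0, 95.2, None, 60.0],
    })
    rules = {
        "customer_age": lambda s: s.between(0, 120),
        "monthly_spend": lambda s: s >= 0,
    }
    print(f"completeness: {completeness(sample):.2f}")
    print(f"validity:     {validity(sample, rules):.2f}")
```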

Author(s):  
Arun Thotapalli Sundararaman

The study of data quality for data mining applications has always been a complex topic; in recent years, this topic has gained further complexity with the advent of big data as the source for data mining and business intelligence (BI) applications. In a big data environment, data is consumed in various states and various forms serving as input for data mining, and this is the main source of added complexity. These new complexities and challenges arise from the underlying dimensions of big data (volume, variety, velocity, and value) together with the ability to consume data at various stages of transition from raw data to standardized datasets. These have created a need to expand the traditional data quality (DQ) factors into big data quality (BDQ) factors, as well as a need for new BDQ assessment and measurement frameworks for data mining and BI applications. However, very limited advancement has been made in research and industry on the topic of BDQ and its relevance and criticality for data mining and BI applications. Data quality in data mining refers to the quality of the patterns or results of the models built using mining algorithms. DQ for data mining in business intelligence applications should be aligned with the objectives of the BI application. Objective measures, training/modeling approaches, and subjective measures are the three major approaches that exist to measure DQ for data mining. However, there is no agreement yet on definitions, measurements, or interpretations of DQ for data mining. Defining the factors of DQ for data mining and their measurement for a BI system has been one of the major challenges for researchers as well as practitioners. This chapter provides an overview of existing research in the area of BDQ definitions and measurement for data mining for BI, analyzes the gaps therein, and provides a direction for future research and practice in this area.
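
To illustrate how a traditional DQ factor can be assessed under the volume and velocity constraints of big data, the following sketch scores completeness over a large file read in chunks rather than in one pass. The file name, chunk size, and the choice of completeness as the example factor are illustrative assumptions, not part of the chapter.

```python
# Minimal sketch: one BDQ factor (completeness) assessed over streamed chunks.
import pandas as pd

def chunked_completeness(path: str, chunksize: int = 100_000) -> float:
    """Overall completeness across all chunks of a large CSV."""
    total_cells = missing_cells = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total_cells += chunk.size
        missing_cells += int(chunk.isna().sum().sum())
    return 1.0 - missing_cells / total_cells if total_cells else 1.0

# Usage (hypothetical raw clickstream file feeding a BI pipeline):
# score = chunked_completeness("clickstream_events.csv")
# print(f"completeness across all chunks: {score:.3f}")
```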


Author(s):  
Arun Thotapalli Sundararaman

DQ/IQ measurement in general, and in the specific context of BI, has always been a topic of high interest for researchers. The topic of Data Quality (DQ) in the field of Information Management has been well researched, published, and studied. Despite such research advances, there has been very little understanding of DQ/IQ measurement for BI, either from a theoretical or from a practical perspective. Assessing the quality of data for a BI system has been one of the major challenges for researchers as well as practitioners, leading to the need for frameworks to measure DQ for BI. The objective of this chapter is to provide an overview of the existing frameworks for measurement of DQ for BI, analyze the gaps therein, review proposed solutions, and provide a direction for future research and practice in this area.
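
One simple way to picture such a measurement framework is a weighted roll-up of per-dimension scores into a single index for the BI system. The dimensions, weights, and scores below are illustrative assumptions, not a framework proposed or endorsed by the chapter.

```python
# Minimal sketch of aggregating DQ dimension scores into one BI-level index.
def dq_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each in [0, 1])."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

if __name__ == "__main__":
    scores = {"completeness": 0.97, "accuracy": 0.92, "timeliness": 0.80, "consistency": 0.88}
    weights = {"completeness": 0.3, "accuracy": 0.3, "timeliness": 0.2, "consistency": 0.2}
    print(f"overall DQ index: {dq_index(scores, weights):.3f}")
```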


2021
Vol 11 (3)
pp. 208-218
Author(s):  
Sadeq Darrab
David Broneske
Gunter Saake

Data mining is the process of extracting useful, previously unknown knowledge from large datasets. Frequent itemset mining is the fundamental data mining task of discovering interesting itemsets that frequently appear together in a dataset. However, mining infrequent (rare) itemsets may be more interesting in many real-life applications such as predicting telecommunication equipment failures, genetics, medical diagnosis, or anomaly detection. In this paper, we survey up-to-date methods of rare itemset mining. The main goal of this survey is to provide a comprehensive overview of the state-of-the-art algorithms for rare itemset mining and its applications. The main contributions of this survey can be summarized as follows. In the first part, we define the task of rare itemset mining by explaining key concepts and terminology, giving motivating examples, and drawing comparisons with underlying concepts. Then, we highlight the state-of-the-art methods for rare itemset mining. Furthermore, we present variations of the task of rare itemset mining to discuss the limitations of traditional rare itemset mining algorithms. After that, we highlight the fundamental applications of rare itemset mining. Finally, we point out research opportunities and challenges for future work on rare itemset mining.
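
As a concrete, if naive, illustration of the task defined above, the brute-force sketch below enumerates candidate itemsets and keeps those whose support falls below a maximum-support threshold. The transactions and thresholds are invented; the algorithms surveyed in the paper prune this search space far more efficiently.

```python
# Minimal brute-force sketch of rare itemset mining.
from itertools import combinations

def support(itemset: frozenset, transactions: list) -> float:
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rare_itemsets(transactions: list, max_sup: float, min_sup: float = 0.0) -> dict:
    """Itemsets with support in [min_sup, max_sup) that occur at least once."""
    items = sorted(set().union(*transactions))
    rare = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo), transactions)
            if min_sup <= s < max_sup and s > 0:
                rare[combo] = s
    return rare

if __name__ == "__main__":
    transactions = [
        {"bread", "milk"}, {"bread", "milk", "butter"},
        {"bread", "milk"}, {"bread", "milk"}, {"bread", "caviar"},
    ]
    for itemset, s in rare_itemsets(transactions, max_sup=0.4).items():
        print(itemset, f"support={s:.2f}")
```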


Author(s):  
Zbigniew W. Ras
Elzbieta Wyrzykowska
Li-Shiang Tsay

There are two aspects of interestingness of rules that have been studied in the data mining literature: objective and subjective measures (Liu et al., 1997), (Adomavicius & Tuzhilin, 1997), (Silberschatz & Tuzhilin, 1995, 1996). Objective measures are data-driven and domain-independent; generally, they evaluate rules based on their quality and the similarity between them. Subjective measures, including unexpectedness, novelty and actionability, are user-driven and domain-dependent. A rule is actionable if a user can take an action to his/her advantage based on that rule (Liu et al., 1997). This definition, in spite of its importance, is too vague and leaves the door open to a number of different interpretations of actionability. In order to narrow it down, a new class of rules (called action rules), constructed from certain pairs of association rules, has been proposed in (Ras & Wieczorkowska, 2000). Interventions, introduced in (Greco et al., 2006), and the concept of information changes, proposed in (Skowron & Synak, 2006), are conceptually very similar to action rules. Action rules have been investigated further in (Wang et al., 2002), (Tsay & Ras, 2005, 2006), (Tzacheva & Ras, 2005), (He et al., 2005), (Ras & Dardzinska, 2006), (Dardzinska & Ras, 2006), (Ras & Wyrzykowska, 2007). To give an example justifying the need for action rules, let us assume that a number of customers have closed their accounts at one of the banks. We construct the simplest possible description of that group of people and then search for a new, similar description with the goal of identifying a group of customers none of whom have left the bank. If these descriptions take the form of rules, they can be seen as actionable rules. By comparing the two descriptions, we may find the cause of the account closures and formulate an action which, if undertaken by the bank, may prevent other customers from closing their accounts. Such actions are stimulated by action rules and can be seen as precise hints for the actionability of rules. For example, an action rule may say that if the bank invites customers from a certain group for a glass of wine, these customers will not close their accounts and move to another bank. Sending invitations by regular mail to all these customers, or inviting them personally with a phone call, are examples of actions associated with that action rule.
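
A rough sketch of the idea, not the authors' construction: pair a rule describing the undesired class with a rule describing the desired class, require agreement on stable (unchangeable) attributes, and read off the changes in flexible attributes as the suggested action. The attribute names, rules, and stable/flexible split below are illustrative.

```python
# Minimal sketch: derive an action suggestion from a pair of classification rules.
def action_rule(rule_undesired: dict, rule_desired: dict, stable: set):
    """Return flexible-attribute changes implied by a pair of rules,
    or None if the rules disagree on a stable attribute."""
    changes = {}
    for attr, target in rule_desired["conditions"].items():
        current = rule_undesired["conditions"].get(attr)
        if attr in stable:
            if current is not None and current != target:
                return None            # stable attributes must not be changed
        elif current != target:
            changes[attr] = (current, target)
    return {
        "change": changes,
        "effect": (rule_undesired["decision"], rule_desired["decision"]),
    }

if __name__ == "__main__":
    r1 = {"conditions": {"segment": "young", "fee_plan": "high", "contact": "none"},
          "decision": "account_closed"}
    r2 = {"conditions": {"segment": "young", "fee_plan": "low", "contact": "quarterly_call"},
          "decision": "account_kept"}
    print(action_rule(r1, r2, stable={"segment"}))
```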


2015
Vol 48 (4)
pp. 431-456
Author(s):  
Jasmine Fledderjohann
David R. Johnson

What is the most appropriate measure of impaired fertility for understanding its social consequences in sub-Saharan Africa? The dearth of subjective measures in surveys in the region has prevented comparisons of subjective and objective measures. Perceived difficulties conceiving may have a greater impact than objective measures for social outcomes such as divorce, stigmatization and distress. This study compares 12- (clinical) and 24- (epidemiological) month measures from biomedicine and 5- and 7-year measures from demography with a subjective measure of impaired fertility using correlations, random effects models and test–retest models to assess relationships between measures, their association with sociodemographic characteristics and the stability of measures across time. Secondary panel data (1998–2004) from 1350 Ghanaian women aged 15–49 of all marital statuses are used. Longer waiting times to identification of impaired fertility required by demographic measures result in more stable measures, but perceived difficulties conceiving are most closely aligned with clinical infertility (r=0.61; p<0.05). Epidemiological infertility is also closely aligned with the subjective measure. A large proportion of those identified as having impaired fertility based purely on waiting times are successful contraceptors. Where subjective measures are not available, epidemiological (24-month) measures may be most appropriate for studies of the social consequences of impaired fertility. Accounting for contraceptive use is important in order to avoid false positives. Future research should consider a variety of measures of perceived difficulties conceiving and self-identified infertility to assess which is most valid; in order to accomplish this, it is imperative that subjective measures of infertility be included in social surveys in sub-Saharan Africa.
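
For readers who want to see the comparison mechanic, the toy sketch below correlates a subjective indicator with clinical and epidemiological indicators on invented binary data (Pearson correlation on binary variables equals the phi coefficient). It is not the study's analysis and uses no real survey data.

```python
# Minimal sketch: correlating subjective and objective impaired-fertility indicators.
import pandas as pd

df = pd.DataFrame({
    "perceived_difficulty": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],  # subjective measure
    "clinical_12_month":    [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # clinical measure
    "epidemiological_24m":  [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # epidemiological measure
})

# Pairwise Pearson correlations between the binary indicators.
print(df.corr().round(2))
```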


2019
Vol 36 (4)
pp. 299-313
Author(s):  
Armelle Brun
Geoffray Bonnin
Sylvain Castagnos
Azim Roussanaly
Anne Boyer

Purpose: The purpose of this paper is to present the METAL project, a French open learning analytics (LA) project for secondary school that aims at improving the quality of teaching. The originality of METAL is that it relies on research through exploratory activities and focuses on all the aspects of a learning analytics environment.
Design/methodology/approach: This work introduces the different concerns of the project: collection and storage of multi-source data owned by a variety of stakeholders, selection and promotion of standards, design of an open-source LRS, conception of dashboards with their final users, trust, usability, and design of explainable multi-source data-mining algorithms.
Findings: All the dimensions of METAL are presented, as well as the way they are approached: data sources; data storage through the implementation of an LRS; design of dashboards for secondary school, based on co-design sessions; and data mining algorithms and experiments, in line with privacy and ethics concerns.
Originality/value: The issue of a global dissemination of LA at an institution level, or at a broader level such as a territory or a study level, is still a hot topic in the literature; it is one of the focuses of this paper and a source of its originality, associated with the large spectrum of different concerns.
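
As a sketch of how a learning event might reach an open LRS, the snippet below posts one xAPI-style statement. The endpoint, credentials, and the assumption that the LRS speaks xAPI are illustrative, not details taken from the METAL paper.

```python
# Minimal sketch of recording a learning event in an LRS via an xAPI statement.
import requests

def send_statement(lrs_url: str, key: str, secret: str) -> int:
    """POST a single xAPI statement to the LRS and return the HTTP status code."""
    statement = {
        "actor": {"objectType": "Agent", "mbox": "mailto:student@example.org"},
        "verb": {"id": "http://adlnet.gov/expapi/verbs/completed",
                 "display": {"en-US": "completed"}},
        "object": {"objectType": "Activity",
                   "id": "http://example.org/exercises/fractions-01"},
    }
    response = requests.post(
        f"{lrs_url}/statements",
        json=statement,
        headers={"X-Experience-API-Version": "1.0.3"},
        auth=(key, secret),
        timeout=10,
    )
    return response.status_code

# Usage against a reachable LRS (hypothetical endpoint and credentials):
# print(send_statement("https://lrs.example.org/xapi", "client_key", "client_secret"))
```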


Author(s):  
Geert Wets
Koen Vanhoof
Theo Arentze
Harry Timmermans

The utility-maximizing framework—in particular, the logit model—is the dominantly used framework in transportation demand modeling. Computational process modeling has been introduced as an alternative approach to deal with the complexity of activity-based models of travel demand. Current rule-based systems, however, lack a methodology to derive rules from data. The relevance and performance of data-mining algorithms that potentially can provide the required methodology are explored. In particular, the C4 algorithm is applied to derive a decision tree for transport mode choice in the context of activity scheduling from a large activity diary data set. The algorithm is compared with both an alternative method of inducing decision trees (CHAID) and a logit model on the basis of goodness-of-fit on the same data set. The ratio of correctly predicted cases of a holdout sample is almost identical for the three methods. This suggests that for data sets of comparable complexity, the accuracy of predictions does not provide grounds for either rejecting or choosing the C4 method. However, the method may have advantages related to robustness. Future research is required to determine the ability of decision tree-based models in predicting behavioral change.
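
A minimal sketch of the kind of comparison described above, using scikit-learn's CART decision tree and a logistic regression ("logit") on a synthetic stand-in for an activity diary. The features and data are invented, and CART stands in for the C4 and CHAID implementations used in the paper.

```python
# Minimal sketch: decision tree vs. logit on holdout accuracy for mode choice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(0.5, 30, n),          # trip distance (km)
    rng.integers(0, 2, n),            # car available (0/1)
    rng.integers(0, 24, n),           # departure hour
])
# Synthetic rule: car chosen for longer trips when a car is available.
y = ((X[:, 0] > 5) & (X[:, 1] == 1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("tree  holdout accuracy:", round(tree.score(X_te, y_te), 3))
print("logit holdout accuracy:", round(logit.score(X_te, y_te), 3))
```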


2009
Vol 11 (2)
Author(s):  
L. Marshall
R. De la Harpe

Making decisions in a business intelligence (BI) environment can become extremely challenging and sometimes even impossible if the data on which the decisions are based are of poor quality. It is only possible to utilise data effectively when it is accurate, up-to-date, complete and available when needed. The BI decision makers and users are in the best position to determine the quality of the data available to them. It is important to ask the right questions of them; therefore, the issues of information quality in the BI environment were established through a literature study. Information-related problems may cause supplier relationships to deteriorate, reduce internal productivity and erode the business's confidence in IT. Ultimately this can have implications for an organisation's ability to perform and remain competitive. This article aims to identify the underlying factors that prevent information from being easily and effectively utilised, and to understand how these factors can influence the decision-making process, particularly within a BI environment. An exploratory investigation was conducted at a large retail organisation in South Africa to collect empirical data from BI users through unstructured interviews. Some of the main findings indicate specific causes that impact the decisions of BI users, including accuracy, inconsistency, understandability and availability of information. Key performance measures that are directly impacted by the quality of data on decision-making include waste, availability, sales and supplier fulfilment. The time spent on investigating and resolving data quality issues has a major impact on productivity. The importance of documentation was highlighted as an issue that requires further investigation. The initial results indicate the value of


2009
Vol 131 (3) ◽  
Author(s):  
Haiyang Zheng
Andrew Kusiak

In this paper, multivariate time series models were built with data-mining algorithms to predict the power ramp rates of a wind farm, with power changes predicted at 10 min intervals. Five different data-mining algorithms were tested using data collected at a wind farm of 100 turbines. The support vector machine regression algorithm performed best of the five algorithms studied in this research, providing predictions of the power ramp rate for time horizons of 10–60 min. The boosting tree algorithm was used to select parameters that enhance the prediction accuracy of the power ramp rate. The test results of the multivariate time series models are presented in the paper, along with suggestions for future research.
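
A minimal sketch of the general approach, not the paper's model: support vector regression on lagged 10 min ramp values from a synthetic power series. The series, lag depth, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: SVR on lagged ramp values to predict the next ramp rate.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
power = 50 + 20 * np.sin(np.linspace(0, 40, 600)) + rng.normal(0, 2, 600)  # MW, 10 min steps
ramp = np.diff(power)                                                       # MW per 10 min

lags = 6                                    # one hour of ramp history as features
X = np.array([ramp[i:i + lags] for i in range(len(ramp) - lags)])
y = ramp[lags:]

split = int(0.8 * len(X))                   # chronological holdout split
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X[:split], y[:split])
print("holdout R^2:", round(model.score(X[split:], y[split:]), 3))
```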

