Data Mining and Knowledge Discovery Technologies
Latest Publications


TOTAL DOCUMENTS

14
(FIVE YEARS 0)

H-INDEX

2
(FIVE YEARS 0)

Published By IGI Global

9781599049601, 9781599049618

Author(s):  
Anthony Scime ◽  
Gregg R. Murray ◽  
Wan Huang ◽  
Carol Brownstein-Evans

Immense public resources are expended to collect large stores of social data, but these data are often under-examined, so opportunities to shed light on some of society’s pressing problems are missed. This chapter proposes and demonstrates data mining in general, and an iterative attribute-elimination process in particular, as analytical tools for exploiting these social science data more fully. We use an iterative process that combines domain expertise with data mining to identify attributes that are useful for addressing two distinct and nontrivial research issues in social science: presidential vote choice, using the American National Election Studies (ANES) from political science, and living arrangement outcomes for maltreated children, using the National Survey on Child and Adolescent Well-Being (NSCAW) from social work. We conclude that data mining is useful for more fully exploiting important but under-evaluated data collections in order to address important questions in the social sciences.


Author(s):  
Longbing Cao ◽  
Chengqi Zhang

Traditional data mining based on quantitative intelligence faces grand challenges from real-world enterprise and cross-organization applications. For instance, the usual demonstration of specific algorithms does not help business users take actions that serve their interests and needs. We attribute this to a data-driven philosophy focused on quantitative intelligence, which either views data mining as an autonomous, data-driven, trial-and-error process or analyzes business issues only in an isolated, case-by-case manner. Based on experience and lessons learned from real-world data mining and complex systems, this article proposes a practical data mining methodology referred to as Domain-Driven Data Mining. On top of quantitative intelligence and hidden knowledge in data, domain-driven data mining aims to meta-synthesize quantitative and qualitative intelligence when mining complex applications in which a human is in the loop. It targets actionable knowledge discovery in a constrained environment in order to satisfy user preferences. The domain-driven methodology consists of key components including understanding the constrained environment, a business-technical questionnaire, representing and involving domain knowledge, human-mining cooperation and interaction, constructing next-generation mining infrastructure, in-depth pattern mining and postprocessing, business interestingness and actionability enhancement, and closed-loop, human-cooperated iterative refinement. Domain-driven data mining complements the data-driven methodology: the metasynthesis of qualitative and quantitative intelligence has the potential to discover knowledge from complex systems and to enhance knowledge actionability for practical use by industry and business.


Author(s):  
Marco A. Alvarez ◽  
SeungJin Lim

Current search engines impose an overhead on motivated students and Internet users who employ the Web as a resource for education. A user searching for good educational materials on a technical subject often spends extra time filtering out irrelevant pages or ends up with commercial advertisements. Ideally, given a technical subject by an educationally motivated user, suitable materials on that subject would be identified automatically by affordable machine processing of the recommendation set that a search engine returns for the subject. In this scenario, the user saves a significant amount of time otherwise spent filtering out less useful Web pages, and can reach the learning goal on the subject more efficiently without clicking through numerous pages. This type of convenient learning is called One-Stop Learning (OSL). In this paper, the contributions made by Lim and Ko (Lim and Ko, 2006) for OSL are redefined and modeled using machine learning algorithms. Four supervised learning algorithms, Support Vector Machine (SVM), AdaBoost, Naive Bayes, and Neural Networks, are evaluated using the same data used in (Lim and Ko, 2006). The results are promising: the highest precision (98.9%) and overall accuracy (96.7%), both obtained with SVM, are superior to the results presented by Lim and Ko. Furthermore, the machine learning approach presented here demonstrates that the small set of features used to represent each Web page yields a good solution for the OSL problem.


Author(s):  
Tushar ◽  
Shibendu Shekhar Roy ◽  
Dilip Kumar Pratihar

Clustering is a useful tool in data mining. A clustering method analyzes the pattern of a data set and groups the data points into several clusters based on their similarity to one another. Clusters may be either crisp or fuzzy in nature. The present chapter deals with clustering of some data sets using the Fuzzy C-Means (FCM) algorithm and the Entropy-based Fuzzy Clustering (EFC) algorithm. In the FCM algorithm, the nature and quality of the clusters depend on the pre-defined number of clusters, the level of cluster fuzziness, and a threshold value used for obtaining the number of outliers (if any). The quality of the clusters obtained by the EFC algorithm, on the other hand, depends on a constant used to establish the relationship between the distance and the similarity of two data points, a threshold value of similarity, and another threshold value used for determining the number of outliers. The clusters should ideally be distinct and at the same time compact. Moreover, the number of outliers should be as small as possible. Thus, the choice of these parameters may be posed as an optimization problem, which will be solved using a Genetic Algorithm (GA). The best set of multi-dimensional clusters will be mapped into 2-D for visualization using a Self-Organizing Map (SOM).
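
As an illustration of the FCM parameters mentioned above, the following is a minimal sketch of one textbook Fuzzy C-Means iteration in plain Python. It is not the chapter's implementation: the function names, the default fuzziness m = 2.0, and the handling of points that coincide with a center are assumptions made here, and the outlier threshold, the EFC algorithm, the GA tuning, and the SOM mapping are not shown.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Return u[i][j], the membership of point i in cluster j (textbook FCM update).

    m > 1 is the fuzziness level; larger m gives softer memberships.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # Assumption: a point sitting exactly on a center belongs fully to it.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        memberships.append(
            [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]
        )
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Recompute each center as the mean of the points weighted by membership**m."""
    dims = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dims)
        ))
    return centers


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers = [(0.0, 0.5), (5.0, 5.5)]
u = fcm_memberships(points, centers)
centers = fcm_centers(points, u)
```

Alternating the two functions until the centers stop moving gives the usual FCM loop; the number of clusters and m are exactly the kind of parameters the chapter proposes to tune.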


Author(s):  
Yue-Shi Lee ◽  
Show-Jane Yen

Web mining applies data mining techniques to large amounts of web data in order to improve web services. Web traversal pattern mining discovers most of the users’ access patterns from web logs. This information can provide navigation suggestions for web users so that appropriate actions can be taken. However, web data grow rapidly in a short time, and some of the web data may become outdated. User behavior may change as new web data are inserted into, and old web data are deleted from, the web logs. Moreover, it is considerably difficult to select a perfect minimum support threshold during the mining process in order to find the interesting rules. Even experienced experts cannot determine the appropriate minimum support. Thus, the minimum support must be adjusted repeatedly until satisfactory mining results are found. The essence of incremental or interactive data mining is that previous mining results can be used to avoid unnecessary processing when the minimum support is changed or the web logs are updated. In this paper, we propose efficient incremental and interactive data mining algorithms to discover web traversal patterns and to make the mining results satisfy the users’ requirements. The experimental results show that our algorithms are more efficient than the other approaches.
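
The idea of reusing earlier results when the log or the minimum support changes can be sketched naively as follows. This is not the algorithm proposed in the chapter: the class `TraversalCounts`, the restriction to contiguous page sequences of bounded length, and the choice to cache every pattern count are all assumptions made for illustration, and caching all counts trades memory for the ability to skip rescans.

```python
from collections import Counter


def session_patterns(session, max_len=3):
    """Distinct contiguous page sequences of length 1..max_len in one session."""
    return {
        tuple(session[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(session) - n + 1)
    }


class TraversalCounts:
    """Cache of pattern counts, updated as sessions are inserted or deleted."""

    def __init__(self, max_len=3):
        self.max_len = max_len
        self.counts = Counter()
        self.n_sessions = 0

    def add(self, session):
        # Incremental step: only the new session is scanned.
        self.counts.update(session_patterns(session, self.max_len))
        self.n_sessions += 1

    def remove(self, session):
        # Deleting an old session subtracts its contribution.
        self.counts.subtract(session_patterns(session, self.max_len))
        self.n_sessions -= 1

    def frequent(self, min_support):
        """Interactive step: re-filter cached counts, no rescan of the log."""
        if self.n_sessions == 0:
            return {}
        return {
            pattern: count / self.n_sessions
            for pattern, count in self.counts.items()
            if count > 0 and count / self.n_sessions >= min_support
        }


# Hypothetical sessions from a web log.
log = TraversalCounts()
for s in (["A", "B", "C"], ["A", "B", "D"], ["B", "C"]):
    log.add(s)
strict = log.frequent(0.6)   # first choice of minimum support
log.remove(["A", "B", "D"])  # an old session is deleted
relaxed = log.frequent(0.5)  # threshold adjusted without rescanning
```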


Author(s):  
Riadh Ben Messaoud ◽  
Sabine Loudcher Rabaséda ◽  
Rokia Missaoui ◽  
Omar Boussaid

Data warehouses and OLAP (online analytical processing) provide tools to explore and navigate through data cubes in order to extract interesting information under different perspectives and levels of granularity. Nevertheless, OLAP techniques do not allow the identification of relationships, groupings, or exceptions that could hold in a data cube. To that end, we propose to enrich OLAP techniques with data mining facilities to benefit from the capabilities they offer. In this chapter, we propose an online environment for mining association rules in data cubes. Our environment, called OLEMAR (online environment for mining association rules), is designed to extract associations from multidimensional data. It allows the extraction of inter-dimensional association rules from data cubes according to a sum-based aggregate measure, a more general indicator than the aggregate values provided by the traditional COUNT measure. In our approach, OLAP users are able to drive a mining process guided by a meta-rule that meets their analysis objectives. In addition, the environment is based on a formalization that exploits aggregate measures to revisit the definition of the support and the confidence of discovered rules. This formalization also helps evaluate the interestingness of association rules according to two additional quality measures: lift and Loevinger. Furthermore, in order to focus on the discovered associations and validate them, we provide a visual representation based on the principles of graphic semiology. Such a representation consists of a graphic encoding of frequent patterns and association rules in the same multidimensional space as the one associated with the mined data cube. We have developed our approach as a component of a general online analysis platform called Miningcubes, using an Apriori-like algorithm that helps extract inter-dimensional association rules directly from materialized multidimensional structures of data. To illustrate the effectiveness and the efficiency of our proposal, we analyze a real-life case study on breast cancer data and conduct performance experiments on the mining process.
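
A minimal sketch of what measure-based rule quality could look like, assuming that support is the share of a summed measure covered by the matching facts rather than a share of the fact count. The exact definitions used in OLEMAR may differ; the fact table, the `sales` measure, and the function names below are hypothetical. Lift and Loevinger follow their usual textbook forms, confidence / supp(head) and (confidence - supp(head)) / (1 - supp(head)).

```python
def measure_sum(facts, conditions, measure="sales"):
    """Sum the measure over the facts that match every (dimension, value) pair."""
    return sum(
        fact[measure]
        for fact in facts
        if all(fact.get(dim) == val for dim, val in conditions.items())
    )


def rule_metrics(facts, body, head, measure="sales"):
    """Quality of the inter-dimensional rule body -> head under a SUM measure.

    Assumes the body matches at least one fact and the head does not cover
    the whole cube; otherwise a division by zero occurs.
    """
    total = measure_sum(facts, {}, measure)
    supp_body = measure_sum(facts, body, measure) / total
    supp_head = measure_sum(facts, head, measure) / total
    supp_rule = measure_sum(facts, {**body, **head}, measure) / total
    confidence = supp_rule / supp_body
    return {
        "support": supp_rule,
        "confidence": confidence,
        "lift": confidence / supp_head,
        "loevinger": (confidence - supp_head) / (1.0 - supp_head),
    }


# Hypothetical cells of a small sales cube.
facts = [
    {"city": "Lyon", "product": "laptop", "sales": 40},
    {"city": "Lyon", "product": "phone", "sales": 10},
    {"city": "Paris", "product": "laptop", "sales": 20},
    {"city": "Paris", "product": "phone", "sales": 30},
]
metrics = rule_metrics(facts, {"city": "Lyon"}, {"product": "laptop"})
```

Setting the measure of every fact to 1 reduces these definitions to the classical COUNT-based support and confidence, which is one way to read the claim that the sum-based measure is the more general indicator.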


Author(s):  
Justin Zhan

To conduct data mining, we often need to collect data from various parties. Privacy concerns may prevent the parties from directly sharing the data or certain types of information about the data. How multiple parties can collaboratively conduct data mining without breaching data privacy presents a challenge. The goal of this paper is to provide solutions for privacy-preserving k-nearest neighbor classification, which is one of the common data mining tasks. We aim to obtain accurate data mining results without disclosing private data. We propose a formal definition of privacy and show that our solutions preserve data privacy.


Author(s):  
Pradeep Kumar ◽  
P. Radha Krishna ◽  
Raju S. Bapi ◽  
T. M. Padmaja

In recent years, advanced information systems have enabled the collection of increasingly large amounts of data that are sequential in nature. The interdisciplinary field of Knowledge Discovery in Databases (KDD) is very useful for analyzing such huge amounts of sequential data. The most important step within the KDD process is data mining, which is concerned with the extraction of valid patterns. Recent research in data mining focuses on stream data mining, sequence data mining, web mining, text mining, visual mining, multimedia mining, and multi-relational data mining. Sequence data may be discrete or continuous in nature. Most of the research on discrete sequence data has concentrated on the discovery of frequently occurring patterns, and comparatively little work has been carried out on the classification of discrete sequence data. In this chapter, a data taxonomy is introduced together with a review of the state of the art in sequence data classification. The chapter demonstrates the usefulness of embedding partial subsequence information, extracted using a sliding window technique, into a traditional classifier such as kNN. kNN has been tested with various vector-based distance/similarity metrics. Further, with the use of the S3M similarity metric, the full subsequence information embedded in the data sequences is extracted. The experimental data are from the DARPA’98 IDS benchmark dataset collected from the UCI ML dataset repository. The chapter closes by pointing out various application areas of sequence data and the open issues in the sequence data classification problem.
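
The sliding-window idea can be sketched as follows: each sequence is turned into a bag of fixed-width subsequences, and kNN then votes using a vector-based similarity over those bags. This is a simplified illustration, not the chapter's experimental setup. Cosine similarity is used here as one example of a vector-based metric; the S3M metric is not implemented, and the window width, the toy system-call sequences, and the labels are hypothetical.

```python
import math
from collections import Counter


def window_counts(sequence, width=3):
    """Count every contiguous subsequence of the given width (sliding window)."""
    return Counter(
        tuple(sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(train, query, k=3, width=3):
    """Majority label among the k training sequences most similar to the query."""
    q = window_counts(query, width)
    scored = sorted(
        ((cosine(window_counts(seq, width), q), label) for seq, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled system-call traces.
train = [
    ("open read write close".split(), "normal"),
    ("open read read close".split(), "normal"),
    ("open exec chmod exec".split(), "attack"),
    ("exec chmod exec close".split(), "attack"),
]
label = knn_classify(train, "read exec chmod exec".split(), k=3)
```

Because only windows of one width are counted, the representation captures partial subsequence information; order beyond the window width is lost, which is the limitation a full-subsequence metric is meant to address.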


Author(s):  
ABM Shawkat Ali

Clustering in data mining has received a significant amount of attention from the machine learning community in the last few years as one of its fundamental research areas. Among the vast range of clustering algorithms, K-means is one of the most popular. In this research we extend the K-means algorithm by adding the well-known radial basis function (rbf) kernel and obtain better performance than the classical K-means algorithm. A critical issue for the rbf kernel is how to select a unique parameter value for an optimum clustering task. The present chapter provides a statistics-based solution to this issue. The best parameter is selected on the basis of prior information about the data, using the Maximum Likelihood (ML) method and the Nelder-Mead (N-M) simplex method. A rule-based meta-learning approach is then proposed for automatic rbf kernel parameter selection. We consider 112 supervised data sets and measure their statistical characteristics using basic statistics, central tendency measures, and an entropy-based approach. We split these data characteristics using the well-known decision tree approach to generate the rules. Finally, we use the generated rules to select the unique parameter value for the rbf kernel and then adopt it in the K-means algorithm. The experiments were carried out on 112 problems with 10-fold cross-validation. Finally, we claim that the proposed algorithm can solve any clustering task very quickly with optimum performance.
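
To make the role of the rbf parameter concrete, here is a minimal sketch of generic kernel K-means in plain Python. It is not the chapter's algorithm: the parameter `gamma` is simply passed in by hand, whereas the chapter selects it through ML, Nelder-Mead, and meta-learning rules, none of which is shown. The sketch relies on the standard identity for the squared feature-space distance between a point x and the mean of a cluster C: K(x,x) - (2/|C|) Σ K(x,y) + (1/|C|²) ΣΣ K(y,z), with sums over members of C.

```python
import math


def rbf(x, y, gamma):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_kmeans(points, init_labels, n_clusters, gamma=1.0, max_iter=20):
    """Reassign points to the nearest cluster mean in rbf feature space."""
    labels = list(init_labels)
    n = len(points)
    gram = [[rbf(points[i], points[j], gamma) for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        members = [
            [i for i in range(n) if labels[i] == c] for c in range(n_clusters)
        ]
        # Third term of the distance identity, shared by all points.
        within = [
            sum(gram[i][j] for i in m for j in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best, best_dist = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # empty clusters are skipped, not re-seeded
                dist = (
                    gram[i][i]
                    - 2.0 * sum(gram[i][j] for j in m) / len(m)
                    + within[c]
                )
                if dist < best_dist:
                    best, best_dist = c, dist
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


# Hypothetical data: two tight groups, deliberately mislabeled at the start.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
labels = kernel_kmeans(points, [0, 1, 0, 1, 0, 1], n_clusters=2, gamma=1.0)
```

The result depends strongly on `gamma`: a very small value makes all kernel entries close to 1 and blurs the clusters, while a very large value makes every point similar only to itself, which is why automatic selection of this parameter matters.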


Author(s):  
Tu Bao Ho ◽  
Thanh Phuong Nguyen ◽  
Tuan Nam Tran

The objective of this paper is twofold. The first is to provide a survey of computational methods for the study of protein-protein interactions (PPI). The second is to introduce our work and results on using inductive logic programming to learn prediction rules for PPI and domain-domain interactions (DDI) from multiple data sources. We show the advantages of exploiting various types of data in these important problems of bioinformatics.

