Data Mining and Knowledge Discovery Technologies

Data Mining in the Social Sciences and Iterative Attribute Elimination

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch012 ◽

2008 ◽

pp. 308-332 ◽

Cited By ~ 5

Author(s):

Anthony Scime ◽

Gregg R. Murray ◽

Wan Huang ◽

Carol Brownstein-Evans

Keyword(s):

Social Sciences ◽

Data Mining ◽

Living Arrangement ◽

Well Being ◽

Research Issues ◽

The Social ◽

Public Resources ◽

Data Collections ◽

Analytical Tools ◽

Election Studies

Immense public resources are expended to collect large stores of social data, but often these data are under-examined, thereby missing potential opportunities to shed light on some of society’s pressing problems. This chapter proposes and demonstrates data mining in general, and an iterative attribute-elimination process in particular, as important analytical tools for exploiting these data from the social sciences more fully. We use an iterative domain-expert and data mining process to identify attributes that are useful for addressing two distinct and nontrivial research issues in social science (presidential vote choice and living arrangement outcomes for maltreated children) using the American National Election Studies (ANES) from political science and the National Survey on Child and Adolescent Well-Being (NSCAW) from social work. We conclude that data mining is useful for more fully exploiting important but under-evaluated data collections for the purpose of addressing some important questions in the social sciences.

Domain Driven Data Mining

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch008 ◽

2008 ◽

pp. 196-223 ◽

Cited By ~ 1

Author(s):

Longbing Cao ◽

Chengqi Zhang

Keyword(s):

Data Mining ◽

Complex Systems ◽

Real World ◽

Domain Knowledge ◽

Pattern Mining ◽

Iterative Refinement ◽

User Preference ◽

Data Driven ◽

Real World Data ◽

Hidden Knowledge

Traditional data mining based on quantitative intelligence faces grand challenges from real-world enterprise and cross-organization applications. For instance, the usual demonstration of specific algorithms does not help business users take actions that serve their advantage and needs. We attribute this to a data-driven philosophy focused on quantitative intelligence, which either views data mining as an autonomous, data-driven, trial-and-error process or analyzes business issues only in an isolated, case-by-case manner. Based on experience and lessons learnt from real-world data mining and complex systems, this article proposes a practical data mining methodology referred to as Domain-Driven Data Mining. On top of quantitative intelligence and hidden knowledge in data, domain-driven data mining aims to meta-synthesize quantitative intelligence and qualitative intelligence in mining complex applications in which a human is in the loop. It targets actionable knowledge discovery in a constrained environment to satisfy user preference. The domain-driven methodology consists of key components including understanding the constrained environment, a business-technical questionnaire, representing and involving domain knowledge, human-mining cooperation and interaction, constructing next-generation mining infrastructure, in-depth pattern mining and postprocessing, business interestingness and actionability enhancement, and loop-closed, human-cooperated iterative refinement. Domain-driven data mining complements the data-driven methodology: the meta-synthesis of qualitative and quantitative intelligence has the potential to discover knowledge from complex systems and to enhance knowledge actionability for practical use by industry and business.

A Machine Learning Approach for One-Stop Learning

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch013 ◽

2008 ◽

pp. 333-357 ◽

Cited By ~ 1

Author(s):

Marco A. Alvarez ◽

SeungJin Lim

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approach ◽

Internet Users ◽

Machine Learning Approach ◽

Supervised Learning Algorithms ◽

One Stop ◽

The Subject

Current search engines impose an overhead on motivated students and Internet users who employ the Web as a valuable resource for education. A user searching for good educational materials on a technical subject often spends extra time filtering out irrelevant pages or ends up with commercial advertisements. It would be ideal if, given a technical subject by a user who is educationally motivated, suitable materials for that subject were automatically identified by affordable machine processing of the recommendation set returned by a search engine. In this scenario, the user can save a significant amount of time in filtering out less useful Web pages, and the user’s learning goal on the subject can subsequently be achieved more efficiently without clicking through numerous pages. This type of convenient learning is called One-Stop Learning (OSL). In this paper, the contributions made by Lim and Ko (Lim and Ko, 2006) for OSL are redefined and modeled using machine learning algorithms. Four selected supervised learning algorithms, Support Vector Machine (SVM), AdaBoost, Naive Bayes and Neural Networks, are evaluated using the same data used in (Lim and Ko, 2006). The results presented in this paper are promising: the highest precision (98.9%) and overall accuracy (96.7%), obtained using SVM, are superior to the results presented by Lim and Ko. Furthermore, the machine learning approach presented here demonstrates that the small set of features used to represent each Web page yields a good solution for the OSL problem.

Determination of Optimal Clusters Using a Genetic Algorithm

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch005 ◽

2008 ◽

pp. 98-117 ◽

Cited By ~ 1

Author(s):

Tushar ◽

Shibendu Shekhar Roy ◽

Dilip Kumar Pratihar

Keyword(s):

Genetic Algorithm ◽

Threshold Value ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Fcm Algorithm ◽

Data Points ◽

The Relationship

Clustering is a useful tool for data mining. A clustering method analyzes the pattern of a data set and groups the data points into several clusters based on their similarity to one another. Clusters may be either crisp or fuzzy in nature. The present chapter deals with clustering of some data sets using the Fuzzy C-Means (FCM) algorithm and the Entropy-based Fuzzy Clustering (EFC) algorithm. In the FCM algorithm, the nature and quality of clusters depend on the pre-defined number of clusters, the level of cluster fuzziness and a threshold value used for obtaining the number of outliers (if any). On the other hand, the quality of clusters obtained by the EFC algorithm depends on a constant used to establish the relationship between the distance and similarity of two data points, a threshold value of similarity and another threshold value used for determining the number of outliers. The clusters should ideally be distinct and at the same time compact. Moreover, the number of outliers should be as small as possible. Thus, the above problem may be posed as an optimization problem, which is solved using a Genetic Algorithm (GA). The best set of multi-dimensional clusters is mapped into 2-D for visualization using a Self-Organizing Map (SOM).
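
As a rough illustration of the FCM parameters named in this abstract, the following Python sketch shows one textbook FCM update step (memberships, then centers). It is not the chapter's implementation: the function names, the default fuzziness m=2.0 and the toy points are assumptions made for illustration, and the outlier threshold and the GA-based tuning are not modeled.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Textbook FCM membership update: u[j][i] is the degree to which
    point j belongs to cluster i. m > 1 is the fuzziness level."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers:
            # split full membership among those centers.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        )
    return memberships

def fcm_centers(points, memberships, m=2.0):
    """Textbook FCM center update: mean of the points weighted by u ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(memberships[0])):
        weights = [u[i] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            tuple(sum(w * p[d] for w, p in zip(weights, points)) / total
                  for d in range(dim))
        )
    return centers

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
u = fcm_memberships(points, [(0.0, 0.5), (10.0, 10.5)])
new_centers = fcm_centers(points, u)
```

Alternating the two functions until the memberships stop changing gives the usual FCM loop; the number of clusters and m are the quantities an optimizer such as a GA could search over.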

A Lattice-Based Framework for Interactively and Incrementally Mining Web Traversal Patterns

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch004 ◽

2008 ◽

pp. 72-96

Author(s):

Yue-Shi Lee ◽

Show-Jane Yen

Keyword(s):

Data Mining ◽

Pattern Mining ◽

Web Data ◽

Minimum Support ◽

User Behaviors ◽

Web Logs ◽

Interactive Data Mining ◽

Support Threshold ◽

Interactive Data ◽

The Web

Web mining applies data mining techniques to large amounts of web data in order to improve web services. Web traversal pattern mining discovers most of the users’ access patterns from web logs. This information can provide navigation suggestions for web users so that appropriate actions can be taken. However, web data grows rapidly in a short time, and some of it may become outdated. User behaviors may change when new web data is inserted into the web logs and old web data is deleted from them. Besides, it is considerably difficult to select a perfect minimum support threshold during the mining process to find the interesting rules. Even experienced experts cannot determine the appropriate minimum support. Thus, the minimum support must be adjusted repeatedly until satisfactory mining results are found. The essence of incremental or interactive data mining is that previous mining results can be used to avoid unnecessary processing when the minimum support is changed or the web logs are updated. In this paper, we propose efficient incremental and interactive data mining algorithms to discover web traversal patterns and make the mining results satisfy the users’ requirements. The experimental results show that our algorithms are more efficient than the other approaches.
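
The idea of reusing earlier results when the minimum support changes or the logs are updated can be sketched as follows. This is a deliberately simplified Python illustration, not the lattice-based algorithm of the chapter: it caches support counts for contiguous traversal patterns up to an assumed maximum length, and the function names and session data are hypothetical.

```python
from collections import Counter

def count_patterns(sessions, max_len=3):
    """Count, for each contiguous page sequence of length <= max_len,
    the number of sessions that contain it at least once."""
    counts = Counter()
    for session in sessions:
        seen = set()
        for i in range(len(session)):
            for j in range(i + 1, min(i + max_len, len(session)) + 1):
                seen.add(tuple(session[i:j]))
        counts.update(seen)  # each session contributes at most 1 per pattern
    return counts

def frequent(counts, n_sessions, min_support):
    """Filter cached counts by a relative minimum support threshold."""
    return {p: c for p, c in counts.items() if c / n_sessions >= min_support}

# Hypothetical web-log sessions (pages visited in order).
sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
counts = count_patterns(sessions)

# Interactive step: try different thresholds without rescanning the log.
loose = frequent(counts, len(sessions), 0.6)
strict = frequent(counts, len(sessions), 0.9)

# Incremental step: only the newly inserted sessions are scanned.
new_sessions = [["A", "B"]]
counts = counts + count_patterns(new_sessions)
updated = frequent(counts, len(sessions) + len(new_sessions), 0.7)
```

Keeping every count in memory is the simplification here; a real system would bound what it stores, which is where a more compact structure becomes necessary.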

OLEMAR

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch001 ◽

2008 ◽

pp. 1-35 ◽

Cited By ~ 5

Author(s):

Riadh Ben Messaoud ◽

Sabine Loudcher Rabaséda ◽

Rokia Missaoui ◽

Omar Boussaid

Keyword(s):

Association Rules ◽

Real Life ◽

Data Cube ◽

Multidimensional Space ◽

Online Environment ◽

Data Cubes ◽

Mining Association Rules ◽

The One ◽

Definition Of ◽

Analysis Platform

Data warehouses and OLAP (online analytical processing) provide tools to explore and navigate through data cubes in order to extract interesting information under different perspectives and levels of granularity. Nevertheless, OLAP techniques do not allow the identification of relationships, groupings, or exceptions that could hold in a data cube. To that end, we propose to enrich OLAP techniques with data mining facilities to benefit from the capabilities they offer. In this chapter, we propose an online environment for mining association rules in data cubes. Our environment, called OLEMAR (online environment for mining association rules), is designed to extract associations from multidimensional data. It allows the extraction of inter-dimensional association rules from data cubes according to a sum-based aggregate measure, a more general indicator than the aggregate values provided by the traditional COUNT measure. In our approach, OLAP users are able to drive a mining process guided by a meta-rule that meets their analysis objectives. In addition, the environment is based on a formalization that exploits aggregate measures to revisit the definition of the support and the confidence of discovered rules. This formalization also helps evaluate the interestingness of association rules according to two additional quality measures: lift and Loevinger. Furthermore, in order to focus on the discovered associations and validate them, we provide a visual representation based on the principles of graphic semiology. Such a representation consists of a graphic encoding of frequent patterns and association rules in the same multidimensional space as the one associated with the mined data cube. We have developed our approach as a component in a general online analysis platform called Miningcubes, according to an Apriori-like algorithm that helps extract inter-dimensional association rules directly from materialized multidimensional structures of data. In order to illustrate the effectiveness and the efficiency of our proposal, we analyze a real-life case study about breast cancer data and conduct performance experimentation of the mining process.
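
To make the idea of sum-based support and confidence concrete, here is a minimal Python sketch under stated assumptions. It treats a cube as a flat list of fact rows, replaces counts with sums of a numeric measure, and applies the textbook definitions of lift and the Loevinger index. The chapter's exact formalization may differ, and the dimension names, the measure name `amount` and the toy facts are hypothetical.

```python
def measure_sum(facts, conditions, measure="amount"):
    """Sum the measure over all facts whose dimensions match `conditions`."""
    return sum(
        f[measure] for f in facts
        if all(f.get(dim) == value for dim, value in conditions.items())
    )

def rule_quality(facts, body, head, measure="amount"):
    """Sum-based quality measures for the rule body -> head.
    Assumes the totals involved are non-zero and supp(head) < 1."""
    total = measure_sum(facts, {}, measure)
    supp_body = measure_sum(facts, body, measure) / total
    supp_head = measure_sum(facts, head, measure) / total
    supp_rule = measure_sum(facts, {**body, **head}, measure) / total
    confidence = supp_rule / supp_body
    return {
        "support": supp_rule,
        "confidence": confidence,
        "lift": supp_rule / (supp_body * supp_head),
        "loevinger": (confidence - supp_head) / (1.0 - supp_head),
    }

# Hypothetical two-dimensional cube flattened into fact rows.
facts = [
    {"region": "north", "product": "A", "amount": 40},
    {"region": "north", "product": "B", "amount": 10},
    {"region": "south", "product": "A", "amount": 20},
    {"region": "south", "product": "B", "amount": 30},
]
quality = rule_quality(facts, {"region": "north"}, {"product": "A"})
```

With a measure that is 1 for every fact, the same functions reduce to the usual COUNT-based support and confidence, which is the sense in which the sum-based form is more general.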

Using Cryptography For Privacy-Preserving Data Mining

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-60566-218-3.ch014 ◽

2008 ◽

pp. 175-194

Author(s):

Justin Zhan

Keyword(s):

Data Mining ◽

Data Privacy ◽

Nearest Neighbor ◽

Privacy Preserving ◽

K Nearest Neighbor ◽

Privacy Concerns ◽

Private Data ◽

Definition Of ◽

Types Of Information ◽

Neighbor Classification

To conduct data mining, we often need to collect data from various parties. Privacy concerns may prevent the parties from directly sharing the data and some types of information about the data. How multiple parties can collaboratively conduct data mining without breaching data privacy presents a challenge. The goal of this paper is to provide solutions for privacy-preserving k-nearest neighbor classification, a common data mining task, that yield accurate data mining results without disclosing private data. We propose a formal definition of privacy and show that our solutions preserve data privacy.

Advances in Classification of Sequence Data

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch007 ◽

2008 ◽

pp. 143-174 ◽

Cited By ~ 1

Author(s):

Pradeep Kumar ◽

P. Radha Krishna ◽

Raju S. Bapi ◽

T. M. Padmaja

Keyword(s):

Data Mining ◽

Web Mining ◽

Sequence Data ◽

Data Classification ◽

Knowledge Discovery In Databases ◽

Sequential Data ◽

Stream Data ◽

Stream Data Mining ◽

Discrete Sequence ◽

Mining Sequence

In recent years, advanced information systems have enabled the collection of increasingly large amounts of data that are sequential in nature. The interdisciplinary field of Knowledge Discovery in Databases (KDD) is very useful for analyzing huge amounts of sequential data. The most important step within the KDD process is data mining, which is concerned with the extraction of valid patterns. Recent research in data mining includes stream data mining, sequence data mining, web mining, text mining, visual mining, multimedia mining and multi-relational data mining. Sequence data may be discrete or continuous in nature. Most of the research on discrete sequence data has concentrated on the discovery of frequently occurring patterns. However, comparatively little work has been carried out in the area of discrete sequence data classification. In this chapter, a data taxonomy is introduced together with a review of the state of the art in sequence data classification. The usefulness of embedding partial subsequence information, extracted using a sliding window technique, into a traditional classifier such as kNN is demonstrated. kNN is tested with various vector-based distance/similarity metrics. Further, with the use of the S3M similarity metric, the full subsequence information embedded in the data sequences is extracted. The experimental data is the DARPA’98 IDS benchmark dataset collected from the UCIML dataset repository. The chapter closes by pointing out various application areas of sequence data and the open issues in the sequence data classification problem.
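
The combination of a sliding window with a vector-based kNN classifier can be sketched as below. This is an illustrative Python example under assumptions, not the chapter's experimental setup: it uses cosine similarity over window counts (one of many possible vector-based metrics, and not S3M), and the window size, k and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def window_counts(seq, w=2):
    """Slide a window of size w over a discrete sequence and count
    each subsequence; the Counter acts as a sparse feature vector."""
    return Counter(tuple(seq[i:i + w]) for i in range(len(seq) - w + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(train, query, k=3, w=2):
    """Label `query` by majority vote among its k most similar training sequences."""
    q = window_counts(query, w)
    scored = sorted(
        ((cosine(window_counts(seq, w), q), label) for seq, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled sequences of discrete events.
train = [
    ("open read read write close".split(), "normal"),
    ("open read write write close".split(), "normal"),
    ("open exec exec chmod close".split(), "attack"),
    ("exec chmod exec chmod close".split(), "attack"),
]
label = knn_classify(train, "open read read write write".split())
```

The window only captures subsequences of a fixed length, which is why the abstract describes this as partial subsequence information.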

K-means Clustering Adopting rbf-Kernel

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch006 ◽

2008 ◽

pp. 118-142

Author(s):

ABM Shawkat Ali

Keyword(s):

Learning Community ◽

Clustering Algorithm ◽

Critical Issue ◽

Research Area ◽

Central Tendency Measure ◽

Rbf Kernel ◽

Meta Learning ◽

Fundamental Research ◽

Unique Parameter ◽

Best Parameter

Clustering in data mining has received a significant amount of attention from the machine learning community in the last few years as one of the fundamental research areas. Among the vast range of clustering algorithms, K-means is one of the most popular. In this research we extend the K-means algorithm by adding the well-known radial basis function (rbf) kernel and find better performance than with the classical K-means algorithm. A critical issue for the rbf kernel is how to select a unique parameter for the optimum clustering task. The present chapter provides a statistics-based solution to this issue. The best parameter selection is considered on the basis of prior information about the data by the Maximum Likelihood (ML) method and the Nelder-Mead (N-M) simplex method. A rule-based meta-learning approach is then proposed for automatic rbf kernel parameter selection. We consider 112 supervised data sets and measure the statistical data characteristics using basic statistics, central tendency measures and an entropy-based approach. We split these data characteristics using the well-known decision tree approach to generate the rules. Finally, we use the generated rules to select the unique parameter value for the rbf kernel and then adopt it in the K-means algorithm. The experiments use 112 problems and 10-fold cross-validation. Finally, the proposed algorithm can solve any clustering task very quickly with optimum performance.
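
For readers unfamiliar with how a kernel enters K-means, the following Python sketch shows a generic kernel K-means assignment loop with an rbf kernel. It is a standard textbook formulation rather than the chapter's algorithm: the kernel is assumed to have the form exp(-gamma * ||x - y||^2), gamma is fixed by hand instead of being selected by the ML, Nelder-Mead or rule-based methods described above, and the data and initial labels are hypothetical.

```python
import math

def rbf(x, y, gamma):
    """rbf kernel, assumed here to be exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_kmeans(points, labels, gamma=0.1, n_iter=20):
    """Kernel K-means: reassign each point to the cluster whose mean is
    nearest in feature space, using only kernel evaluations."""
    n = len(points)
    K = [[rbf(p, q, gamma) for q in points] for p in points]
    k = max(labels) + 1
    labels = list(labels)
    for _ in range(n_iter):
        clusters = [[j for j in range(n) if labels[j] == c] for c in range(k)]
        new_labels = []
        for i in range(n):
            best, best_dist = labels[i], float("inf")
            for c, members in enumerate(clusters):
                if not members:
                    continue  # skip empty clusters
                size = len(members)
                cross = sum(K[i][j] for j in members) / size
                within = sum(K[j][l] for j in members for l in members) / size ** 2
                # Squared feature-space distance to the cluster mean.
                dist = K[i][i] - 2.0 * cross + within
                if dist < best_dist:
                    best, best_dist = c, dist
            new_labels.append(best)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels

# Hypothetical data: two groups, started from a deliberately mixed labeling.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = kernel_kmeans(points, [0, 1, 0, 1, 0, 1], gamma=0.1)
```

The outcome depends strongly on gamma: a very large value makes every point nearly orthogonal to every other in feature space, and a very small one makes all points look alike, which is why automatic parameter selection matters.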

Study of Protein-Protein Interactions from Multiple Data Sources

Data Mining and Knowledge Discovery Technologies ◽

10.4018/978-1-59904-960-1.ch011 ◽

2008 ◽

pp. 280-307

Author(s):

Tu Bao Ho ◽

Thanh Phuong Nguyen ◽

Tuan Nam Tran

Keyword(s):

Protein Interactions ◽

Inductive Logic Programming ◽

Inductive Logic ◽

Data Sources ◽

Protein Protein Interactions ◽

Prediction Rules ◽

Multiple Data Sources ◽

Protein Protein Interaction ◽

Multiple Data ◽

Domain Interactions

The objective of this paper is twofold. The first is to provide a survey of computational methods for the study of protein-protein interactions (PPI). The second is to introduce our work and results in using inductive logic programming to learn prediction rules for PPI and DDI (domain-domain interactions) from multiple data sources. We show the advantages of exploiting various types of data in these important problems of bioinformatics.

Published By IGI Global
