Exploring Advances in Interdisciplinary Data Mining and Analytics
Latest Publications


TOTAL DOCUMENTS: 16 (FIVE YEARS: 0)

H-INDEX: 1 (FIVE YEARS: 0)

Published by IGI Global

ISBN: 9781613504741, 9781613504758

Author(s): Zhengzheng Xing, Jian Pei

Finding associations among different diseases is an important task in medical data mining, and the NHANES data are a valuable source for exploring such associations. However, existing studies of the NHANES data focus on using statistical techniques to test a small number of hypotheses; the data have not been systematically explored for mining disease association patterns. This paper therefore proposes a direct disease pattern mining method and an interactive disease pattern mining method for exploring the NHANES data. Results on the latest NHANES data demonstrate that these methods can mine meaningful disease associations consistent with existing knowledge and the literature. Furthermore, the study summarizes the data set via a disease influence graph and a disease hierarchical tree.
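The abstract does not spell out the mining procedure; as a rough, hypothetical sketch of what direct disease pattern mining over survey records could look like, the following counts disease sets that co-occur in at least a given fraction of participants (a brute-force frequent itemset scan, not the authors' actual method; all names and thresholds are illustrative):

```python
from itertools import combinations
from collections import Counter

def mine_disease_patterns(records, min_support=0.05, max_size=3):
    """Mine disease sets that co-occur in at least min_support of records.

    records: list of sets of disease codes, one set per participant.
    Returns {frozenset(diseases): support} for all frequent patterns.
    """
    n = len(records)
    patterns = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for rec in records:
            for combo in combinations(sorted(rec), size):
                counts[frozenset(combo)] += 1
        frequent = {p: c / n for p, c in counts.items() if c / n >= min_support}
        if not frequent:
            break  # larger sets cannot be frequent either
        patterns.update(frequent)
    return patterns

# Toy example: hypertension and diabetes co-occur in 2 of 4 participants.
recs = [{"hypertension", "diabetes"}, {"hypertension"},
        {"diabetes", "arthritis", "hypertension"}, {"arthritis"}]
print(mine_disease_patterns(recs, min_support=0.5))
```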


Author(s): Yongsong Qin, Shichao Zhang, Chengqi Zhang

The k-nearest neighbor (kNN) imputation, one of the most important research topics in incomplete-data discovery, has been applied with great success to industrial data. However, it is difficult to obtain a mathematically valid and simple procedure for constructing confidence intervals to evaluate the imputed data. This chapter studies a new estimator for missing (or incomplete) data that combines kNN imputation with bootstrap-calibrated empirical likelihood (EL). The combination not only relieves the burden of seeking a mathematically valid asymptotic theory for kNN imputation, but also inherits the advantages of the EL method over the normal approximation method. Simulation results demonstrate that the bootstrap-calibrated EL method performs quite well in estimating confidence intervals for data imputed with the kNN method.
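The authors' intervals rest on bootstrap-calibrated empirical likelihood; the sketch below substitutes a plain percentile bootstrap, which is simpler but captures the resampling idea, around a toy one-dimensional kNN imputation (all function names and parameters are illustrative assumptions):

```python
import numpy as np

def knn_impute(x, y, k=5):
    """Fill missing y values with the mean y of the k nearest complete
    cases in x (a single covariate, for simplicity)."""
    y = y.astype(float).copy()
    obs = ~np.isnan(y)
    for i in np.where(~obs)[0]:
        dist = np.abs(x[obs] - x[i])
        nn = np.argsort(dist)[:k]
        y[i] = y[obs][nn].mean()
    return y

def bootstrap_ci(x, y, k=5, B=1000, alpha=0.05, rng=None):
    """Percentile-bootstrap CI for the mean of the kNN-imputed sample
    (a stand-in for the paper's EL calibration)."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    means = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)            # resample with replacement
        means[b] = knn_impute(x[idx], y[idx], k).mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

x = np.linspace(0, 1, 50)
y = 2 * x + np.random.default_rng(1).normal(0, 0.1, 50)
y[::7] = np.nan                                # knock out some responses
print(bootstrap_ci(x, y))
```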


Author(s): Renxia Wan, Lixin Wang, Xiaoke Su

Specialized clustering algorithms are attractive for grouping an arbitrarily shaped database into proper classes. A wide variety of clustering algorithms has been proposed for this task, the majority of them density-based. In this chapter, the authors extend the dissimilarity measure to a compatibility measure and, on that basis, propose a new algorithm, ASCCN. ASCCN is an unambiguous partition method that groups objects into compatible nucleoids and merges these nucleoids into different clusters. The application of cluster grids significantly reduces the computational cost of ASCCN, and experiments show that it can efficiently and effectively group arbitrarily shaped data points into meaningful clusters.
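ASCCN itself is not specified in the abstract; the toy sketch below illustrates only the cluster-grid idea it mentions, bucketing points into cells, keeping dense cells, and merging adjacent dense cells into clusters (cell size, density threshold, and names are made up for illustration):

```python
import numpy as np
from collections import deque, defaultdict

def grid_cluster(points, cell=0.5, min_pts=3):
    """Toy grid clustering: bucket 2-D points into cells, keep cells with
    at least min_pts points, and merge 8-connected dense cells.
    Returns a cluster label per point (-1 = noise)."""
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple((p // cell).astype(int))].append(i)
    dense = {c for c, idx in cells.items() if len(idx) >= min_pts}
    labels = np.full(len(points), -1)
    cluster, seen = 0, set()
    for c in dense:
        if c in seen:
            continue
        queue = deque([c]); seen.add(c)
        while queue:                   # BFS over adjacent dense cells
            cx, cy = queue.popleft()
            labels[cells[(cx, cy)]] = cluster
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb); queue.append(nb)
        cluster += 1
    return labels

pts = np.random.default_rng(0).normal(size=(60, 2))
print(grid_cluster(pts))
```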


Author(s): Bijan Raahemi, Ali Mumtaz

This paper presents a new approach to classifying Peer-to-Peer (P2P) traffic in IP networks using data mining techniques, in particular a two-stage architecture. In the first stage, the traffic is filtered using standard port numbers and layer-4 port matching to label well-known P2P and non-P2P traffic. The labeled traffic produced in the first stage is used to train a Fast Decision Tree (FDT) classifier with high accuracy. Unknown traffic is then applied to the FDT model, which classifies it into P2P and non-P2P with high accuracy. The two-stage architecture classifies not only well-known P2P applications but also applications that use random or non-standard port numbers and could not be classified otherwise. The authors captured internet traffic at a gateway router, performed pre-processing on the data, selected the most significant attributes, and prepared a training data set to which the new algorithm was applied. Finally, the authors built several models using combinations of attribute sets for different ratios of P2P to non-P2P traffic in the training data.
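As a hedged sketch of the two-stage idea (not the authors' FDT implementation), the code below labels flows by port matching in stage one and trains an off-the-shelf scikit-learn decision tree on those labels to classify the remaining flows; the port lists and flow features are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

P2P_PORTS = {4662, 6881, 6346}       # e.g. eDonkey, BitTorrent, Gnutella (assumed)
KNOWN_PORTS = {80, 443, 25, 53, 22}  # well-known non-P2P services

def port_label(src_port, dst_port):
    """Stage 1: label a flow by layer-4 port matching; None = unknown."""
    ports = {src_port, dst_port}
    if ports & P2P_PORTS:
        return 1                     # P2P
    if ports & KNOWN_PORTS:
        return 0                     # non-P2P
    return None                      # left for stage 2

def two_stage_classify(flows):
    """flows: numpy array, rows [src_port, dst_port, pkts, mean_len, duration]."""
    labels = [port_label(int(f[0]), int(f[1])) for f in flows]
    known = [i for i, l in enumerate(labels) if l is not None]
    unknown = [i for i, l in enumerate(labels) if l is None]
    # Stage 2: train a decision tree (stand-in for the paper's FDT) on the
    # port-labeled flows, then classify the unknown ones by flow statistics.
    tree = DecisionTreeClassifier(max_depth=8)
    tree.fit(flows[known, 2:], [labels[i] for i in known])
    if unknown:
        for i, pred in zip(unknown, tree.predict(flows[unknown, 2:])):
            labels[i] = int(pred)
    return labels
```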


Author(s): Yanwu Yang, Christophe Claramunt, Marie-Aude Aufaure, Wensheng Zhang

Spatial personalization can be defined as a novel way to fulfill user information needs when accessing spatial information services, either on the web or in mobile environments. The research presented in this paper introduces a conceptual approach that models the spatial information offered to a given user as a user-centered conceptual map, together with spatial proximity and similarity measures that consider the user's location, interests, and preferences. The approach is based on the concepts of similarity in the semantic domain and proximity in the spatial domain, while taking the user's personal information into account. Accordingly, these measures can directly support the derivation of personalization services and refine the way spatial information is presented to the user in spatially related applications. The modeling approach is illustrated by experimental case studies.
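The paper's measures are conceptual; a minimal illustrative blend of spatial proximity (distance decay) and semantic similarity (Jaccard overlap between user interests and item tags) might look like the following, with all weights and field names assumed:

```python
import math

def spatial_proximity(user_xy, poi_xy, decay=1.0):
    """Proximity in [0, 1] that decays exponentially with distance."""
    return math.exp(-decay * math.dist(user_xy, poi_xy))

def semantic_similarity(user_interests, poi_tags):
    """Jaccard overlap between user interests and the item's tags."""
    a, b = set(user_interests), set(poi_tags)
    return len(a & b) / len(a | b) if a | b else 0.0

def personalized_score(user, poi, w_space=0.5):
    """Weighted blend of spatial proximity and semantic similarity."""
    return (w_space * spatial_proximity(user["xy"], poi["xy"])
            + (1 - w_space) * semantic_similarity(user["interests"], poi["tags"]))

user = {"xy": (0.0, 0.0), "interests": {"museum", "art"}}
poi = {"xy": (0.3, 0.4), "tags": {"art", "gallery"}}
print(personalized_score(user, poi))   # ranks candidate places for this user
```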


Author(s): Symeon Papadopoulos, Athena Vakali, Ioannis Kompatsiaris

Social Bookmarking Systems (SBS) have been widely adopted in recent years and have had a significant impact on the way online content is accessed, read, and rated. Until recently, the decision on what content to display on a publisher's web pages was made by one or at most a few authorities. In contrast, modern SBS-based applications permit their users to submit their preferred content, to comment on and rate the content of other users, and to establish social relations with each other. In this way the vision of social media is realized: the online users collectively decide upon the interestingness of the available bookmarked content. This paper attempts to provide insights into the dynamics emerging from the process of content rating by the user community. To this end, it proposes a framework for studying the statistical properties of an SBS, the evolution of bookmarked content popularity and user activity over time, and the impact of online social networks on the content consumption behavior of individuals. The proposed analysis framework is applied to a large dataset collected from digg, a popular social media application.
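The framework itself is statistical rather than algorithmic, but a study of popularity evolution has to start by turning raw vote events into per-item time series; a minimal aggregation of an assumed (item, timestamp) event shape could look like this:

```python
from collections import defaultdict

def popularity_over_time(votes, bucket=3600):
    """Aggregate (item_id, unix_ts) vote events into per-item time series.

    Returns {item_id: {bucket_index: vote_count}}, which can be plotted to
    study how quickly bookmarked stories gain and lose attention.
    """
    series = defaultdict(lambda: defaultdict(int))
    for item, ts in votes:
        series[item][ts // bucket] += 1
    return series

votes = [("story1", 100), ("story1", 150), ("story1", 4000), ("story2", 200)]
print({k: dict(v) for k, v in popularity_over_time(votes).items()})
```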


Author(s): Alain Casali, Sébastien Nedjar, Rosine Cicchetti, Lotfi Lakhal

In multidimensional database mining, constrained multidimensional patterns differ from the well-known frequent patterns from both conceptual and logical points of view, because of their common structure and their ability to support various types of constraints. Classical data mining techniques are based on the power set lattice of binary attribute values and, even when adapted, are not suitable for discovering constrained multidimensional patterns. In this chapter, the authors propose a foundation for various multidimensional database mining problems by introducing a new algebraic structure, the cube lattice, which characterizes the search space to be explored. The chapter takes into consideration monotone and/or anti-monotone constraints enforced when mining multidimensional patterns. The authors propose condensed representations of the constrained cube lattice, which is a convex space, and present a generalized levelwise algorithm for computing them. Additionally, the authors consider the formalization of existing data cubes and the discovery of frequent multidimensional patterns, introducing a perfect concise representation from which any solution can be derived together with its conjunction, disjunction, and negation frequencies. Finally, emphasis is placed on the advantages of the cube lattice over the power set lattice of binary attributes in multidimensional database mining.
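The cube lattice and its condensed representations are beyond a short example, but a generalized levelwise search with an anti-monotone constraint can be sketched generically: candidates at each level are joined from frequent patterns of the previous level and pruned by minimum support (a plain Apriori-style stand-in, not the authors' algorithm; the data shapes are assumed):

```python
from itertools import combinations
from collections import Counter

def levelwise_cube_mining(rows, dims, min_sup):
    """Levelwise search over multidimensional patterns, i.e. sets of
    (dimension, value) pairs with uninstantiated dimensions acting as
    wildcards. Minimum support is anti-monotone: a pattern can only be
    frequent if its generalizations are, which justifies level-by-level
    candidate generation."""
    frequent = []
    counts = Counter((d, row[d]) for row in rows for d in dims)
    level = [frozenset([p]) for p, c in counts.items() if c >= min_sup]
    while level:
        frequent.extend(level)
        # Join level-k patterns sharing all but one pair, on distinct dims.
        cands = {a | b for a, b in combinations(level, 2)
                 if len(a | b) == len(a) + 1
                 and len({d for d, _ in a | b}) == len(a | b)}
        level = [c for c in cands
                 if sum(all(row[d] == v for d, v in c) for row in rows) >= min_sup]
    return frequent

rows = [{"city": "Paris", "product": "book"},
        {"city": "Paris", "product": "book"},
        {"city": "Lyon",  "product": "pen"}]
print(levelwise_cube_mining(rows, ["city", "product"], min_sup=2))
```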


Author(s): Panagiotis Giannikopoulos, Iraklis Varlamis, Magdalini Eirinaki

The Web is a continuously evolving environment, since its content is updated on a regular basis. As a result, the traditional usage-based approach to generating recommendations, which takes as input the navigation paths recorded at the Web page level, is no longer as effective. Moreover, most of the content available online is either explicitly or implicitly characterized by a set of categories organized in a taxonomy, allowing page-level navigation patterns to be generalized to a higher, aggregate level. In this direction, the authors present the Frequent Generalized Pattern (FGP) algorithm. FGP takes as input transaction data and a hierarchy of categories, and produces generalized association rules that contain transaction items and/or item categories. The results can be used to generate association rules and, subsequently, recommendations for users. The algorithm can be applied to the log files of a typical Web site; however, it is more helpful in a Web 2.0 application, such as a feed aggregator or a digital library mediator, where content is semantically annotated and the taxonomy is more complex, requiring an extended version of FGP called FGP+. The authors experimentally evaluate both algorithms using Web log data collected from a newspaper Web site.
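FGP's details are not given in the abstract; its core idea, extending each transaction with the taxonomy ancestors of its items before counting frequent patterns, can be sketched as follows (the taxonomy shape and the restriction to pairs are illustrative simplifications):

```python
from itertools import combinations
from collections import Counter

# Toy taxonomy mapping each item or category to its parent (assumed shape).
TAXONOMY = {"politics/uk": "politics", "politics/us": "politics",
            "sports/football": "sports", "sports/tennis": "sports"}

def generalize(transaction):
    """Extend a transaction with the categories of its items so that
    patterns can mix concrete items and higher-level categories."""
    out = set(transaction)
    for item in transaction:
        cat = TAXONOMY.get(item)
        while cat:                    # walk up the hierarchy
            out.add(cat)
            cat = TAXONOMY.get(cat)
    return out

def frequent_generalized_pairs(transactions, min_sup=2):
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(generalize(t)), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_sup}

logs = [{"politics/uk", "sports/tennis"}, {"politics/us", "sports/football"}]
print(frequent_generalized_pairs(logs))   # ('politics', 'sports') is frequent
```

No single page pair is shared here, but the generalized pair (politics, sports) is, which is exactly the kind of aggregate-level pattern the page level misses.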


Author(s): Kelley M. Engle, Aryya Gangopadhyay

In this paper the authors present a novel method for finding optimal split points for the discretization of continuous attributes. Such a method can be used in many data mining techniques for large databases. The method consists of two major steps. In the first step, the search space is pruned using a bisecting-region method that partitions the search space and returns the point with the highest information gain found in its search. The second step is a hill-climbing algorithm that starts with the point returned by the first step and greedily searches for an optimal point. The method was tested using fifteen attributes from two data sets. The results show that it drastically reduces the number of searches while identifying the optimal or near-optimal split points: on average, there was a 98% reduction in the number of information gain calculations with only a 4% reduction in information gain.
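The bisecting-region pruning is not detailed in the abstract, but the two ingredients it feeds, the information gain of a candidate split and the greedy hill-climbing refinement, can be sketched directly (step size and starting point are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(pairs, split):
    """Information gain of splitting (value, label) pairs at `split`."""
    left = [l for v, l in pairs if v <= split]
    right = [l for v, l in pairs if v > split]
    if not left or not right:
        return 0.0
    n = len(pairs)
    return (entropy([l for _, l in pairs])
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def hill_climb_split(pairs, start, step):
    """Greedily move the split point while information gain improves."""
    best, gain = start, info_gain(pairs, start)
    improved = True
    while improved:
        improved = False
        for cand in (best - step, best + step):
            g = info_gain(pairs, cand)
            if g > gain:
                best, gain, improved = cand, g, True
    return best, gain

data = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "b")]
print(hill_climb_split(data, start=1.5, step=0.5))   # climbs to split 2.0
```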


Author(s): Pradeep Kumar, Bapi S. Raju, P. Radha Krishna

In many data mining applications, both classification and clustering algorithms require a distance/similarity measure, and the central problem in similarity-based clustering and classification of sequential data is choosing an appropriate one. Existing metrics such as Euclidean, Jaccard, and Cosine do not explicitly exploit the sequential nature of the data. In this chapter, the authors propose a similarity-preserving function called the Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on benchmark datasets, namely DARPA'98 and msnbc, for a classification task in intrusion detection and a clustering task in web mining. The results show the usefulness of the proposed measure.
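A measure combining order information with set overlap can be sketched as a weighted blend of a normalized longest-common-subsequence score and the Jaccard coefficient; whether this matches the exact S3M formula is an assumption, and the weight p is illustrative:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence (the order-aware part)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def s3m(a, b, p=0.5):
    """Blend of order similarity (LCS over the longer length) and set
    similarity (Jaccard), weighted by p."""
    seq_sim = lcs_len(a, b) / max(len(a), len(b))
    sa, sb = set(a), set(b)
    set_sim = len(sa & sb) / len(sa | sb)
    return p * seq_sim + (1 - p) * set_sim

# Same pages visited, different order: set similarity 1.0, order penalized.
print(s3m(["home", "news", "sports"], ["news", "home", "sports"]))
```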

