Data Mining
Latest Publications


TOTAL DOCUMENTS: 13 (five years: 0)

H-INDEX: 3 (five years: 0)

Published By IGI Global

ISBN: 9781930708259, 9781591400110

Data Mining ◽  
2011 ◽  
pp. 261-289 ◽  
Author(s):  
David Taniar ◽  
J. Wenny Rahayu

Data mining refers to the process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases. With the availability of inexpensive storage and the progress in data capture technology, many organizations have created ultra-large databases of business and scientific data, and this trend is expected to grow. Since the databases to be mined are likely to be very large (measured in terabytes and even petabytes), there is a critical need to investigate parallel data mining techniques. Without parallelism, it is generally difficult for a single-processor system to provide reasonable response times. In this chapter, we present a comprehensive survey of parallelism techniques for data mining. Parallel data mining introduces new complexity, as it incorporates techniques from parallel databases and parallel programming. Challenges that remain open for future research will also be presented.
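
The data-parallel style the chapter surveys can be illustrated with a minimal sketch (the transactions, candidate itemsets and two-way partitioning below are hypothetical, not taken from the chapter): each worker counts candidate itemsets over its own database partition, and only the small partial counts cross worker boundaries before being merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical transaction database, to be partitioned across workers.
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"},
]
candidates = [frozenset({"bread", "milk"}), frozenset({"milk", "butter"})]

def count_partition(partition):
    """Count candidate itemsets within one database partition."""
    local = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:  # candidate itemset contained in transaction
                local[c] += 1
    return local

# Data parallelism: each worker scans only its own partition,
# then the partial counts are merged into global supports.
partitions = [transactions[:3], transactions[3:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = sum(pool.map(count_partition, partitions), Counter())

print(totals[frozenset({"bread", "milk"})])   # global support of {bread, milk}
```

The same count-then-merge pattern underlies most data-parallel association mining; real systems replace the thread pool with processes or machines and exchange counts over a network.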


Data Mining ◽  
2011 ◽  
pp. 231-260 ◽  
Author(s):  
Leandro Nunes de Castro ◽  
Fernando J. Von Zuben

This chapter shows that some of the basic aspects of the natural immune system discussed in the previous chapter can be used to propose a novel artificial immune network model whose main goals are clustering and filtering crude data sets described by high-dimensional samples. Our aim is not to reproduce any immune phenomenon with fidelity, but to demonstrate that immune concepts can serve as inspiration for novel computational tools for data analysis. As important results of our model, the evolved network is capable of reducing redundancy and describing the structure of the data, including its spatial distribution and cluster interrelations. Clustering is useful in several exploratory pattern analysis, grouping, decision-making and machine-learning tasks, including data mining, knowledge discovery, document retrieval, image segmentation and automatic pattern classification. The data clustering approach was implemented in association with hierarchical clustering and graph-theoretical techniques, and the network performance is illustrated using several benchmark problems. The computational complexity of the algorithm and a detailed sensitivity analysis of the user-defined parameters are presented. The trade-offs among the proposed model for data analysis, connectionist models (artificial neural networks) and evolutionary algorithms are also discussed.


Data Mining ◽  
2011 ◽  
pp. 97-116 ◽  
Author(s):  
Iñaki Inza ◽  
Pedro Larrañaga ◽  
Basilio Sierra

Feature Subset Selection (FSS) is a well-known task in the Machine Learning, Data Mining, Pattern Recognition and Text Learning paradigms. Genetic Algorithms (GAs) are possibly the most commonly used algorithms for FSS tasks. Although the FSS literature contains many papers, few of them tackle the task in domains with more than 50 features. In this chapter we present a novel search heuristic paradigm, called Estimation of Distribution Algorithms (EDAs), as an alternative to GAs, to perform a population-based and randomized search in datasets of large dimensionality. The EDA paradigm avoids the use of genetic crossover and mutation operators to evolve the populations. In the absence of these operators, the evolution is guaranteed by the factorization of the probability distribution of the best solutions found in a generation of the search and the subsequent simulation of this distribution to obtain a new pool of solutions. In this chapter we present four different probabilistic models to perform this factorization. In a comparison with two types of GAs on natural and artificial datasets of large dimensionality, EDA-based approaches obtain encouraging results with regard to accuracy, and need fewer evaluations than the genetic approaches.
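
The factorize-then-simulate loop described above can be sketched with the simplest EDA instance, the Univariate Marginal Distribution Algorithm (one independent probability per feature). The toy fitness function, feature count and parameter settings below are illustrative assumptions; a real FSS run would score each subset by training a classifier on it.

```python
import random

random.seed(0)

N_FEATURES = 8
RELEVANT = {0, 2, 5}  # hypothetical "true" informative features

def fitness(mask):
    """Toy surrogate for classifier accuracy: reward relevant features,
    lightly penalize selected irrelevant ones."""
    gain = sum(1 for i in RELEVANT if mask[i])
    cost = 0.2 * sum(1 for i in range(N_FEATURES)
                     if mask[i] and i not in RELEVANT)
    return gain - cost

def umda(pop_size=50, n_best=20, generations=30):
    # Univariate model: one marginal selection probability per feature.
    p = [0.5] * N_FEATURES
    for _ in range(generations):
        # Simulate the current distribution to obtain a new pool.
        pop = [[int(random.random() < p[i]) for i in range(N_FEATURES)]
               for _ in range(pop_size)]
        best = sorted(pop, key=fitness, reverse=True)[:n_best]
        # Factorize the distribution of the best solutions: the new
        # marginals are simply the feature frequencies among them.
        p = [sum(ind[i] for ind in best) / n_best for i in range(N_FEATURES)]
    return [i for i in range(N_FEATURES) if p[i] > 0.5]

selected = umda()
print(selected)  # expected to concentrate on the relevant features
```

No crossover or mutation appears anywhere: evolution happens entirely through re-estimating and re-sampling the marginals, which is the defining feature of the EDA paradigm.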


Data Mining ◽  
2011 ◽  
pp. 209-230 ◽  
Author(s):  
Jonathan Timmis ◽  
Thomas Knight

The immune system is highly distributed, highly adaptive and self-organising in nature; it maintains a memory of past encounters and has the ability to continually learn about new ones. From a computational viewpoint, the immune system has much to offer by way of inspiration. Recently there has been growing interest in the use of the natural immune system as inspiration for the creation of novel approaches to computational problems; this field of research is referred to as Immunological Computation (IC) or Artificial Immune Systems (AIS). This chapter describes the physiology of the immune system and provides a general introduction to Artificial Immune Systems. Significant applications that are relevant to data mining, in particular in the areas of machine learning and data analysis, are discussed in detail. Attention is paid both to the salient characteristics of each application and to the details of the algorithms. The chapter concludes with an evaluation of the current and future contributions of Artificial Immune Systems to data mining.


Data Mining ◽  
2011 ◽  
pp. 1-21
Author(s):  
Vladimir Estivill-Castro ◽  
Michael Houle

Distance-based clustering results in optimization problems that typically are NP-hard or NP-complete and for which only approximate solutions are obtained. For the large instances emerging in data mining applications, the search for high-quality approximate solutions in the presence of noise and outliers is even more challenging. We exhibit fast and robust clustering methods that rely on the careful collection of proximity information for use by hill-climbing search strategies. The proximity information gathered approximates the nearest-neighbor information produced by traditional, exact, but expensive methods. The proximity information is then used to produce fast approximations of robust objective optimization functions, and/or rapid comparison of two feasible solutions. These methods have been successfully applied to spatial and categorical data to surpass well-established methods such as k-MEANS in terms of the trade-off between quality and complexity.
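
The hill-climbing-over-a-robust-objective idea can be sketched with a tiny k-medoids-style search (the one-dimensional data, the swap neighbourhood and the iteration budget below are illustrative assumptions, not the chapter's algorithm). The L1 objective over medoids is far less sensitive to the outlier than k-MEANS' squared-error criterion.

```python
import random

random.seed(1)

# Hypothetical 1-D data: two obvious clusters plus one outlier.
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.8, 9.1, 25.0]

def cost(medoids):
    """Total distance of each point to its nearest medoid (robust L1
    objective; squared error would let the outlier dominate)."""
    return sum(min(abs(x - m) for m in medoids) for x in data)

def hill_climb(k=2, iters=200):
    current = random.sample(data, k)
    for _ in range(iters):
        # Neighbour move: swap one medoid for a random data point.
        cand = current[:]
        cand[random.randrange(k)] = random.choice(data)
        if cost(cand) < cost(current):  # accept only improvements
            current = cand
    return sorted(current)

medoids = hill_climb()
print(medoids)  # one medoid per cluster; the outlier is never chosen
```

The chapter's methods replace the exact `cost` evaluation above with fast approximations built from sampled proximity information, which is what makes the approach scale to large instances.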


Data Mining ◽  
2011 ◽  
pp. 191-208 ◽  
Author(s):  
Rafael S. Parpinelli ◽  
Heitor S. Lopes ◽  
Alex A. Freitas

This work proposes an algorithm for rule discovery called Ant-Miner (Ant Colony-Based Data Miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is based on recent research on the behavior of real ant colonies, as well as on some data mining concepts. We compare the performance of Ant-Miner with that of the well-known C4.5 algorithm on six public-domain data sets. The results provide evidence that: (a) Ant-Miner is competitive with C4.5 with respect to predictive accuracy; and (b) the rule sets discovered by Ant-Miner are simpler (smaller) than the rule sets discovered by C4.5.
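
Two core Ant-Miner ingredients, pheromone-biased term selection and quality-proportional pheromone reinforcement, can be sketched on a toy dataset (the data, the restriction to single-term rules and the update rule below are simplifying assumptions; the full algorithm builds multi-term rules, uses a problem-dependent heuristic alongside pheromone, and prunes the discovered rules).

```python
import random

random.seed(2)

# Hypothetical labelled dataset: (outlook, windy) -> play.
data = [
    (("sunny", "false"), "yes"), (("sunny", "true"), "no"),
    (("rainy", "false"), "yes"), (("rainy", "true"), "no"),
    (("sunny", "false"), "yes"), (("rainy", "true"), "no"),
]
terms = [(0, "sunny"), (0, "rainy"), (1, "false"), (1, "true")]
pheromone = {t: 1.0 for t in terms}

def quality(term, label):
    """Quality of rule 'IF term THEN label': sensitivity * specificity."""
    tp = sum(1 for x, y in data if x[term[0]] == term[1] and y == label)
    fp = sum(1 for x, y in data if x[term[0]] == term[1] and y != label)
    fn = sum(1 for x, y in data if x[term[0]] != term[1] and y == label)
    tn = sum(1 for x, y in data if x[term[0]] != term[1] and y != label)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens * spec

def ant_step(label="yes"):
    # Roulette-wheel selection: pick a term with probability
    # proportional to its pheromone level.
    total = sum(pheromone.values())
    r, acc = random.random() * total, 0.0
    for t in terms:
        acc += pheromone[t]
        if r <= acc:
            chosen = t
            break
    # Reinforce the chosen term in proportion to its rule quality.
    pheromone[chosen] *= 1.0 + quality(chosen, label)
    return chosen

for _ in range(30):
    ant_step()
print(max(terms, key=pheromone.get))  # pheromone concentrates on good terms
```

Here the rule IF windy = false THEN play = yes has quality 1.0 on this toy data, so its pheromone doubles on every visit and quickly dominates the wheel.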


Data Mining ◽  
2011 ◽  
pp. 174-190 ◽  
Author(s):  
Andries P. Engelbrecht ◽  
L. Schoeman ◽  
Sonja Rouwhorst

Genetic programming has recently been used successfully to extract knowledge in the form of IF-THEN rules. In these genetic programming approaches to knowledge extraction from data, individuals represent decision trees. The main objective of the evolutionary process is therefore to evolve the best decision tree, or classifier, to describe the data. Rules are then extracted, after convergence, from the best individual. Current genetic programming approaches to evolving decision trees are computationally complex, since individuals are initialized to complete decision trees. This chapter discusses a new approach to genetic programming for rule extraction, namely the building-block approach. This approach starts with individuals consisting of only one building block, and adds new building blocks during the evolutionary process when the simplicity of the individuals cannot account for the complexity in the underlying data. Experimental results are presented and compared with those of C4.5 and CN2. The chapter shows that the building-block approach achieves very good accuracies compared to those of C4.5 and CN2, and that it extracts substantially fewer rules.


Data Mining ◽  
2011 ◽  
pp. 117-142
Author(s):  
Jorge Muruzabal

Evolutionary algorithms are by now well known and appreciated in a number of disciplines, including the emerging field of data mining. In the last couple of decades, Bayesian learning has also experienced enormous growth in the statistical literature. An interesting question concerns the possible synergistic effects between Bayesian and evolutionary ideas, particularly with an eye to large-sample applications. This chapter presents a new approach to classification based on the integration of a simple local Bayesian engine within the learning classifier system rule-based architecture. The new algorithm maintains and evolves a population of classification rules which individually learn to make better predictions on the basis of the data they get to observe. A certain reinforcement policy ensures that adequate teams of these learning rules are available in the population for every single input of interest. Links with related algorithms are established, and experimental results suggesting the parsimony, stability and usefulness of the approach are discussed.


Data Mining ◽  
2011 ◽  
pp. 48-71 ◽  
Author(s):  
Beatriz de la Iglesia ◽  
Victor J. Rayward-Smith

Knowledge Discovery in Databases (KDD) is an iterative and interactive process involving many steps (Debuse, de la Iglesia, Howard & Rayward-Smith, 2000). Data mining (DM) is defined as one of the steps in the KDD process. According to Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy (1996), there are various data mining tasks including: classification, clustering, regression, summarisation, dependency modeling, and change and deviation detection. However, there is a very important data mining problem identified previously by Riddle, Segal and Etzioni (1994) and very relevant in the context of commercial databases, which is not properly addressed by any of those tasks: nugget discovery. This task has also been identified as partial classification (Ali, Manganaris & Srikant, 1997). Nugget discovery can be defined as the search for relatively rare, but potentially important, patterns or anomalies relating to some pre-determined class or classes. Patterns of this type are called nuggets. This chapter will present and justify the use of heuristic algorithms, namely Genetic Algorithms (GAs), Simulated Annealing (SA) and Tabu Search (TS), for the data mining task of nugget discovery. First, the concept of nugget discovery will be introduced. Then the concept of the interest of a nugget will be discussed. The necessary properties of an interest measure for nugget discovery will be presented. This will include a partial ordering of nuggets based on those properties. Some of the existing measures for nugget discovery will be reviewed in light of the properties established, and it will be shown that they do not display the required properties. A suitable evaluation function for nugget discovery, the fitness measure, will then be discussed and justified according to the required properties.
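
Heuristic search for nuggets can be sketched with Simulated Annealing over simple conjunctive threshold rules (the records, the neighbourhood moves, the cooling schedule and the confidence-times-coverage score below are illustrative stand-ins; the chapter develops its own fitness measure with specific required properties).

```python
import math
import random

random.seed(3)

# Hypothetical records: (age, income) with a binary target class.
records = [((a, i), int(a > 40 and i > 30))
           for a in range(20, 61, 5) for i in (20, 40)]

def score(rule):
    """Generic nugget interest: confidence of the rule times its
    coverage of the target class (a stand-in interest measure)."""
    lo_age, lo_inc = rule
    covered = [y for x, y in records if x[0] >= lo_age and x[1] >= lo_inc]
    if not covered:
        return 0.0
    conf = sum(covered) / len(covered)
    cov = sum(covered) / sum(y for _, y in records)
    return conf * cov

def anneal(steps=500, t0=1.0):
    rule = (20, 20)  # start with the rule covering everything
    for k in range(steps):
        t = t0 * (1 - k / steps)  # linear cooling
        # Neighbour move: nudge one threshold up or down by one grid step.
        j = random.randrange(2)
        cand = list(rule)
        cand[j] = min(60, max(20, cand[j] + random.choice((-5, 5))))
        cand = tuple(cand)
        delta = score(cand) - score(rule)
        # Accept improvements always; worsenings with Boltzmann probability.
        if delta >= 0 or random.random() < math.exp(delta / t):
            rule = cand
    return rule

best = anneal()
print(best, round(score(best), 2))
```

Early in the run the high temperature lets the search escape poor rules; as the temperature falls it settles onto a high-interest nugget, here a rule isolating the older, higher-income records.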


Data Mining ◽  
2011 ◽  
pp. 143-156
Author(s):  
Neil Dunstan ◽  
Michael de Raadt

Sensing devices are commonly used for the detection and classification of subsurface objects, particularly for the purpose of eradicating Unexploded Ordnance (UXO) from military sites. UXO detection and classification is inherently different to pattern recognition in image processing in that signal responses for the same object will differ greatly when the object is at different depths and orientations. That is, subsurface objects span a multidimensional space with dimensions including depth, azimuth and declination. Thus the search space for identifying an instance of an object is extremely large. Our approach is to use templates of actual responses from scans of known objects to model object categories. We intend to justify a method whereby Genetic Algorithms are used to improve the template libraries with respect to their classification characteristics. This chapter describes the application, key features of the Genetic Algorithms tested and the results achieved.

