A Hybrid PSO/ACO Algorithm for Discovering Classification Rules in Data Mining

We have previously proposed a hybrid particle swarm optimisation/ant colony optimisation (PSO/ACO) algorithm for the discovery of classification rules. Unlike a conventional PSO algorithm, this hybrid algorithm can directly cope with nominal attributes, without converting nominal values into binary numbers in a preprocessing phase. PSO/ACO2 also directly deals with both continuous and nominal attribute values, a feature that current PSO and ACO rule induction algorithms lack. We evaluate the new version of the PSO/ACO algorithm (PSO/ACO2) on 27 public-domain, real-world data sets often used to benchmark the performance of classification algorithms. We compare the PSO/ACO2 algorithm to the industry-standard algorithm PART, and we compare a reduced version of our PSO/ACO2 algorithm, coping only with continuous data, to our new classification algorithm for continuous data based on differential evolution. The results show that PSO/ACO2 is very competitive with PART in terms of accuracy and that PSO/ACO2 produces significantly simpler (smaller) rule sets, a desirable result in data mining, where the goal is to discover knowledge that is not only accurate but also comprehensible to the user. The results also show that the reduced PSO version for continuous attributes provides a slight increase in accuracy when compared to the differential evolution variant.
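
To make the notion of a rule over mixed attribute types concrete, the following is a minimal sketch of an IF-THEN classification rule whose conditions test nominal values directly (no binary encoding) and continuous values against an interval. It is not the PSO/ACO2 implementation: the class names, the interval form of the continuous condition, the first-match ordering of rules, and the size measure (total number of conditions) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# A record maps attribute names to nominal (str) or continuous (float) values.
Record = Dict[str, Union[str, float]]


@dataclass
class NominalTerm:
    """Condition 'attribute == value' on a nominal attribute (hypothetical name)."""
    attribute: str
    value: str

    def covers(self, record: Record) -> bool:
        return record.get(self.attribute) == self.value


@dataclass
class ContinuousTerm:
    """Condition 'lower <= attribute <= upper' on a continuous attribute (assumed form)."""
    attribute: str
    lower: float
    upper: float

    def covers(self, record: Record) -> bool:
        v = record.get(self.attribute)
        return isinstance(v, (int, float)) and self.lower <= v <= self.upper


@dataclass
class Rule:
    """IF all terms hold THEN predict predicted_class."""
    terms: List[Union[NominalTerm, ContinuousTerm]]
    predicted_class: str

    def covers(self, record: Record) -> bool:
        return all(term.covers(record) for term in self.terms)


def classify(rules: List[Rule], record: Record, default: str) -> str:
    """Return the class of the first covering rule (assumed ordering), else default."""
    for rule in rules:
        if rule.covers(record):
            return rule.predicted_class
    return default


def rule_set_size(rules: List[Rule]) -> int:
    """One possible simplicity measure: the total number of conditions."""
    return sum(len(rule.terms) for rule in rules)
```

Under this sketch, a smaller value of `rule_set_size` corresponds to the "simpler (smaller) rule sets" the abstract describes, although the paper may measure size differently.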

An Ant Colony Algorithm for Classification Rule Discovery

Data Mining ◽

10.4018/978-1-930708-25-9.ch010 ◽

2011 ◽

pp. 191-208 ◽

Cited By ~ 48

Author(s):

Rafael S. Parpinelli ◽

Heitor S. Lopes ◽

Alex A. Freitas

Keyword(s):

Data Mining ◽

Ant Colony Algorithm ◽

Predictive Accuracy ◽

Ant Colony ◽

Classification Rule ◽

Data Sets ◽

Rule Discovery ◽

Classification Rules ◽

Ant Colonies ◽

Rule Sets

This work proposes an algorithm for rule discovery called Ant-Miner (Ant Colony-Based Data Miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is based on recent research on the behavior of real ant colonies as well as on some data mining concepts. We compare the performance of Ant-Miner with the performance of the well-known C4.5 algorithm on six public domain data sets. The results provide evidence that: (a) Ant-Miner is competitive with C4.5 with respect to predictive accuracy; and (b) the rule sets discovered by Ant-Miner are simpler (smaller) than the rule sets discovered by C4.5.
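
The abstract does not describe how an ant builds a rule, so the following is only a generic ant colony optimisation sketch, not Ant-Miner itself. It assumes that each candidate rule term (an attribute-value pair) carries a pheromone level and a problem-specific heuristic value, and that a term is chosen with probability proportional to their product. The function names and the uniform fallback are hypothetical.

```python
import random
from typing import Dict, Tuple

# A candidate rule term is an (attribute, value) pair.
Term = Tuple[str, str]


def selection_probabilities(pheromone: Dict[Term, float],
                            heuristic: Dict[Term, float]) -> Dict[Term, float]:
    """Normalise pheromone * heuristic into a probability per candidate term."""
    weights = {term: pheromone[term] * heuristic[term] for term in pheromone}
    total = sum(weights.values())
    if total == 0:
        # Assumed fallback: choose uniformly when every weight is zero.
        return {term: 1.0 / len(weights) for term in weights}
    return {term: weight / total for term, weight in weights.items()}


def choose_term(probabilities: Dict[Term, float], rng: random.Random) -> Term:
    """Roulette-wheel selection of one term according to its probability."""
    terms = list(probabilities)
    weights = [probabilities[term] for term in terms]
    return rng.choices(terms, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance keeps runs reproducible, which is useful when comparing rule sets across experiments.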

Finding Persistent Strong Rules

Knowledge Discovery Practices and Emerging Applications of Data Mining - Advances in Data Mining and Database Management ◽

10.4018/978-1-60960-067-9.ch005 ◽

2010 ◽

pp. 85-107

Author(s):

Anthony Scime ◽

Karthik Rajasethupathy ◽

Kulathur S. Rajasethupathy ◽

Gregg R. Murray

Keyword(s):

Data Mining ◽

Association Rules ◽

Strong Association ◽

National Election ◽

Data Sets ◽

Rule Discovery ◽

Discovery Process ◽

Data Set ◽

Rule Sets ◽

Election Studies

Data mining is a collection of algorithms for finding interesting and unknown patterns or rules in data. However, different algorithms can result in different rules from the same data. The process presented here exploits these differences to find particularly robust, consistent, and noteworthy rules among much larger potential rule sets. More specifically, this research focuses on using association rules and classification mining to select the persistently strong association rules. Persistently strong association rules are association rules that are verifiable by classification mining the same data set. The process for finding persistent strong rules was executed against two data sets obtained from the American National Election Studies. Analysis of the first data set resulted in one persistent strong rule and one persistent rule, while analysis of the second data set resulted in 11 persistent strong rules and 10 persistent rules. The persistent strong rule discovery process suggests these rules are the most robust, consistent, and noteworthy among the much larger potential rule sets.
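
As a rough illustration of the idea that an association rule is kept only when classification mining of the same data also yields it, the sketch below computes the support and confidence of a rule and then intersects two rule sets. The matching criterion (identical antecedent and consequent), the data layout, and the function names are assumptions; the abstract does not say how the chapter verifies rules or how it separates "persistent strong" from "persistent" rules.

```python
from typing import FrozenSet, List, Set, Tuple

Item = Tuple[str, str]                        # (attribute, value)
AssocRule = Tuple[FrozenSet[Item], Item]      # (antecedent items, consequent item)


def support_confidence(records: List[FrozenSet[Item]],
                       antecedent: FrozenSet[Item],
                       consequent: Item) -> Tuple[float, float]:
    """Return (support, confidence) of 'antecedent -> consequent' over records."""
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if antecedent <= r and consequent in r)
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def persistent_rules(association_rules: Set[AssocRule],
                     classification_rules: Set[AssocRule]) -> Set[AssocRule]:
    """Keep the association rules that also appear among the classification rules."""
    return association_rules & classification_rules
```

Representing each rule as a hashable tuple lets the verification step reduce to a set intersection; a looser matching rule would need a custom comparison instead.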

Bio-Inspired Algorithms for Medical Data Analysis

Handbook of Research on Biomimicry in Information Retrieval and Knowledge Management - Advances in Web Technologies and Engineering ◽

10.4018/978-1-5225-3004-6.ch014 ◽

2018 ◽

pp. 251-275 ◽

Cited By ~ 1

Author(s):

Hanane Menad ◽

Abdelmalek Amine

Keyword(s):

Data Mining ◽

Data Analysis ◽

Social Behavior ◽

Medical Data ◽

The Other ◽

Data Sets ◽

Classification Rules ◽

Medical Data Mining ◽

Good Efficiency

Medical data mining has great potential for exploring the hidden patterns in the data sets of the medical domain. These patterns can be utilized for clinical diagnosis. Bio-inspired algorithms are a new field of research. Their main advantage is knitting together subfields related to the topics of connectionism, social behavior, and emergence. Briefly put, the field is the use of computers to model living phenomena and, simultaneously, the study of life to improve the usage of computers. In this chapter, the authors present an application of four bio-inspired algorithms and metaheuristics for the classification of seven different real medical data sets. Two of these algorithms are based on similarity calculation between training and test data, while the other two are based on random generation of a population to construct classification rules. The results showed very good efficiency of bio-inspired algorithms for supervised classification of medical data.

Finding Persistent Strong Rules

Data Mining ◽

10.4018/978-1-4666-2455-9.ch002 ◽

2013 ◽

pp. 28-49

Author(s):

Anthony Scime ◽

Karthik Rajasethupathy ◽

Kulathur S. Rajasethupathy ◽

Gregg R. Murray

Keyword(s):

Data Mining ◽

Association Rules ◽

Strong Association ◽

National Election ◽

Data Sets ◽

Rule Discovery ◽

Discovery Process ◽

Data Set ◽

Rule Sets ◽

Election Studies

Heuristic extraction of fuzzy classification rules using data mining techniques: an empirical study on benchmark data sets

2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542) ◽

10.1109/fuzzy.2004.1375709 ◽

2005 ◽

Cited By ~ 7

Author(s):

H. Ishibuchi ◽

T. Yamamoto

Keyword(s):

Data Mining ◽

Empirical Study ◽

Fuzzy Classification ◽

Data Sets ◽

Classification Rules ◽

Data Mining Techniques ◽

Benchmark Data ◽

Using Data

Performance Assessment of Learning Algorithms on Multi-Domain Data Sets

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/ijkdb.2018010103 ◽

2018 ◽

Vol 8 (1) ◽

pp. 27-41

Author(s):

Amit Kumar ◽

Bikash Kanti Sarkar

Keyword(s):

Data Mining ◽

Performance Assessment ◽

Learning Algorithm ◽

Wide Spectrum ◽

Learning Algorithms ◽

Complex Nature ◽

Data Sets ◽

Real World Data ◽

Comparative Performance ◽

Assessment Of Learning

This article describes how, over the last few decades, data mining research has made significant progress in a wide spectrum of applications. Prediction on multi-domain data sets is a challenging task due to the imbalanced, voluminous, conflicting, and complex nature of the data sets. A learning algorithm is the most important technique for solving these problems, and learning algorithms are widely used for classification purposes. However, choosing the learners that perform best for data sets of particular domains is a challenging task in data mining. This article provides a comparative performance assessment of various state-of-the-art learning algorithms over multi-domain data sets to find the effective classifier(s) for a particular domain, e.g., artificial, natural, semi-natural, etc. In the present article, a total of 14 real-world data sets are selected from the University of California, Irvine (UCI) machine learning repository for conducting experiments using three competent individual learners and their hybrid combinations.

An Uncertainty-Based Model for Optimized Multi-Label Classification

Advances in Computational Intelligence and Robotics - Handbook of Research on Swarm Intelligence in Engineering ◽

10.4018/978-1-4666-8291-7.ch002 ◽

2015 ◽

pp. 40-73

Author(s):

J. Anuradha ◽

B. K. Tripathy

Keyword(s):

Data Mining ◽

Particle Swarm Optimization ◽

Rough Set ◽

Fuzzy Model ◽

Pso Algorithm ◽

Data Sets ◽

Swarm Optimization ◽

Weak Points ◽

The Individual

The data used in real-world applications are uncertain and vague. Several models to handle such data efficiently have been put forth so far. It has been found that the individual models have some strong points and certain weak points. Efforts have been made to combine these models so that the hybrid models capitalize on the strong points of the constituent models. Dubois and Prade in 1990 combined rough sets and fuzzy sets to develop two models, of which the rough fuzzy model is a popular one and is used in many fields to handle uncertainty-based data sets very well. Particle Swarm Optimization (PSO) further combined with the rough fuzzy model is expected to produce optimized solutions. Similarly, multi-label classification in the context of data mining deals with situations where an object or a set of objects can be assigned to multiple classes. In this chapter, the authors present a rough fuzzy PSO algorithm that performs classification of multi-label data sets, and through experimental analysis, its efficiency and superiority have been established.

Quantization of Continuous Data for Pattern Based Rule Extraction

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch251 ◽

2011 ◽

pp. 1646-1652

Author(s):

Andrew Hamilton-Wright ◽

Daniel W. Stashuk

Keyword(s):

Data Mining ◽

Rough Sets ◽

Extreme Points ◽

Fixed Number ◽

Continuous Data ◽

Continuous Variables ◽

Negative Effects ◽

Real World Data ◽

Event Based ◽

Continuous Domain

A great deal of interesting real-world data is encountered through the analysis of continuous variables; however, many of the robust tools for rule discovery and data characterization depend upon the underlying data existing in an ordinal, enumerable or discrete data domain. Tools that fall into this category include much of the current work in fuzzy logic and rough sets, as well as all forms of event-based pattern discovery tools based on probabilistic inference. Through the application of discretization techniques, continuous data is made accessible to the analysis provided by the strong tools of discrete-valued data mining. The most common approach for discretization is quantization, in which the range of observed continuous-valued data is assigned to a fixed number of quanta, each of which covers a particular portion of the range within the bounds provided by the most extreme points observed within the continuous domain. This chapter explores the effects such quantization may have, and the techniques that are available to ameliorate the negative effects of these efforts, notably fuzzy systems and rough sets.
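
The quantization described above can be sketched as equal-width binning between the smallest and largest observed values. This is a minimal illustration under the assumption that all quanta have the same width; the chapter may use other partitioning schemes, and the function name is hypothetical.

```python
from typing import List


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Map each value to a bin index in [0, n_bins - 1].

    The bins split the interval [min(values), max(values)] into n_bins
    equal-width quanta; the maximum value falls in the last bin.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All observations are identical, so they share a single quantum.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value does not land in a bin numbered n_bins.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

With two bins over the range 0 to 10, the values 4.99 and 5.01 receive different bin indices even though they are nearly equal, which illustrates one way a hard bin boundary can distort the data.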

A regression-based algorithm for frequent itemsets mining

Data Technologies and Applications ◽

10.1108/dta-03-2019-0037 ◽

2019 ◽

Vol 54 (3) ◽

pp. 259-273

Author(s):

Zirui Jia ◽

Zengli Wang

Keyword(s):

Data Mining ◽

Regression Model ◽

Multiple Linear Regression Model ◽

Mining Area ◽

Frequent Itemset ◽

Continuous Data ◽

Data Sets ◽

Content Type ◽

Existing Problems ◽

Frequent Itemsets Mining

Purpose: Frequent itemset mining (FIM) is a basic topic in data mining. Most FIM methods build an itemset database containing all possible itemsets and use predefined thresholds to determine whether an itemset is frequent. However, this approach has some deficiencies. It is better suited to discrete data than to ordinal/continuous data, which may result in computational redundancy, and some of the results are difficult to interpret. The purpose of this paper is to address this gap by proposing a new data mining method. Design/methodology/approach: A regression pattern (RP) model is introduced, in which the regression model and the FIM method are combined to solve the existing problems. Using survey data from a computer technology and software professional qualification examination, the multiple linear regression model is selected to mine associations between items. Findings: The proposed algorithm mined some interesting associations, and the results show that the proposed method can be applied in the ordinal/continuous data mining area. The experiment with the RP model shows that, compared to FIM, the computational redundancy decreased and the results contain more information. Research limitations/implications: The proposed algorithm is designed for ordinal/continuous data and is expected to provide inspiration for data stream mining and unstructured data mining. Practical implications: Compared to FIM, which mines associations between discrete items, the RP model can mine associations between ordinal/continuous data sets. Importantly, the RP model performs well in saving computational resources and mining meaningful associations. Originality/value: The proposed algorithm provides a novel view for defining and mining associations.
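
For reference, the threshold-based baseline that the abstract contrasts with the RP model can be sketched as a brute-force enumeration of candidate itemsets with a minimum-support test. This is not the RP model and not an optimised FIM algorithm; the function name, the `max_size` cap, and the use of relative support are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[str]],
                      min_support: float,
                      max_size: int = 2) -> Dict[FrozenSet[str], float]:
    """Return every itemset of up to max_size items whose support meets the threshold.

    Support is the fraction of transactions that contain the whole itemset.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions)) if transactions else []
    result: Dict[FrozenSet[str], float] = {}
    for size in range(1, max_size + 1):
        # Enumerate all candidates of this size; the count grows combinatorially.
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            support = count / n
            if support >= min_support:
                result[itemset] = support
    return result
```

Enumerating every candidate is what makes this approach expensive as the number of items grows, and it treats each item as a discrete symbol, so ordinal or continuous values would first have to be discretised.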

Unsupervised Large‐Scale Search for Similar Earthquake Signals

Bulletin of the Seismological Society of America ◽

10.1785/0120190006 ◽

2019 ◽

Vol 109 (4) ◽

pp. 1451-1468 ◽

Cited By ~ 1

Author(s):

Clara E. Yoon ◽

Karianne J. Bergen ◽

Kexin Rong ◽

Hashem Elezabi ◽

William L. Ellsworth ◽

...

Keyword(s):

Data Mining ◽

Nuclear Power ◽

Large Scale ◽

Seismic Network ◽

Continuous Data ◽

Data Sets ◽

Data Set ◽

Data Mining Algorithms ◽

Seismicity Rates ◽

Unsupervised Data Mining

Seismology has continuously recorded ground‐motion spanning up to decades. Blind, uninformed search for similar‐signal waveforms within this continuous data can detect small earthquakes missing from earthquake catalogs, yet doing so with naive approaches is computationally infeasible. We present results from an improved version of the Fingerprint And Similarity Thresholding (FAST) algorithm, an unsupervised data‐mining approach to earthquake detection, now available as open‐source software. We use FAST to search for small earthquakes in 6–11 yr of continuous data from 27 channels over an 11‐station local seismic network near the Diablo Canyon nuclear power plant in central California. FAST detected 4554 earthquakes in this data set, with a 7.5% false detection rate: 4134 of the detected events were previously cataloged earthquakes located across California, and 420 were new local earthquake detections with magnitudes −0.3 ≤ ML ≤ 2.4, of which 224 events were located near the seismic network. Although seismicity rates are low, this study confirms that nearby faults are active. This example shows how seismology can leverage recent advances in data‐mining algorithms, along with improved computing power, to extract useful additional earthquake information from long‐duration continuous data sets.
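
The general idea of fingerprinting followed by similarity thresholding can be illustrated with a naive all-pairs comparison of binary fingerprints. This sketch does not reproduce FAST: it assumes each waveform window has already been reduced to a set of active fingerprint bits, uses Jaccard similarity as the measure, and compares every pair directly, which is the kind of quadratic search the abstract calls infeasible on long continuous records. All names are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two fingerprints given as sets of active bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(fingerprints: List[Set[int]],
                  threshold: float) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The comparison is O(n^2) in the number of fingerprints.
    """
    pairs = []
    for i, j in combinations(range(len(fingerprints)), 2):
        if jaccard(fingerprints[i], fingerprints[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```

Raising the threshold reports fewer pairs and so trades missed detections against false ones; the abstract reports the outcome of that trade-off for FAST but not the settings used.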
