A Fast Boosting Based Incremental Genetic Algorithm for Mining Classification Rules in Large Datasets

2011 ◽  
Vol 2 (1) ◽  
pp. 49-58
Author(s):  
Periasamy Vivekanandan ◽  
Raju Nedunchezhian

The genetic algorithm is a search technique based on the process of natural evolution. It is widely used by the data mining community for classification rule discovery in complex domains. During the learning process, it makes several passes over the data set to determine the accuracy of the potential rules, which makes it an extremely I/O-intensive, slow process. It is particularly difficult to apply a GA when the training data set is too large or not fully available. This paper proposes an incremental genetic algorithm based on boosting, which constructs a weak ensemble of classifiers in a fast, incremental manner and thereby reduces the learning cost considerably.
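The boosting idea the abstract builds on — incrementally growing an ensemble of weak classifiers while reweighting the examples the ensemble gets wrong — can be sketched generically. This is a plain AdaBoost-style loop with a hypothetical one-feature threshold stump as the weak learner, not the authors' GA-based algorithm:

```python
import math

def stump_learn(X, y, w):
    """Weak learner: best single-feature threshold rule under weights w."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                pred = [sign if x[f] >= t else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def boost(X, y, rounds=5):
    """AdaBoost-style loop: reweight misclassified examples each round."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = stump_learn(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, sign))
        # Increase weight on misclassified examples, then renormalize.
        for i, x in enumerate(X):
            p = sign if x[f] >= t else -sign
            w[i] *= math.exp(-alpha * y[i] * p)
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the weak rules."""
    score = sum(a * (s if x[f] >= t else -s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy labelled data, separable by a threshold on the single feature.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
ens = boost(X, y)
```

The incremental flavour described in the abstract would feed new data chunks into such a loop instead of re-scanning the full training set each round.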


Author(s):  
Bijaya Kumar Nanda ◽  
Satchidananda Dehuri

In data mining, extracting classification rules from large data sets is an important task that is gaining considerable attention. This article presents a novel ant miner for classification rule mining, inspired by research on the behaviour of real ant colonies, simulated annealing, and several data mining concepts and principles. The paper takes a Pittsburgh-style approach to single-objective classification rule mining. The algorithm is tested on a few benchmark datasets drawn from the UCI repository. The experimental outcomes confirm that ant miner-HPB (Hybrid Pittsburgh Style Classification) is significantly better than ant miner-PB (Pittsburgh Style Classification).


2010 ◽  
Vol 34-35 ◽  
pp. 1961-1965
Author(s):  
You Qu Chang ◽  
Guo Ping Hou ◽  
Huai Yong Deng

Distributed data mining is widely used in industrial and commercial applications to analyze large datasets maintained over geographically distributed sites. This paper discusses the disadvantages of existing distributed data mining systems and puts forward a distributed data mining platform based on grid computing. Experiments on a data set show that the proposed approach produces meaningful results and has reasonable efficiency and effectiveness, providing a trade-off between runtime and rule interestingness.


2019 ◽  
Vol 8 (3) ◽  
pp. 4373-4378

Data belonging to different domains is being stored rapidly in various repositories across the globe. Extracting useful information from these huge volumes of data is difficult due to the dynamic nature of the data being stored. Data mining is a knowledge discovery process used to extract, in the form of patterns, the hidden information from data stored in various repositories, termed warehouses. One of the popular tasks of data mining is classification, which deals with assigning every instance of a data set to one of the predefined class labels. Banking is one real-world domain that collects huge amounts of client data on a daily basis. In this work, we have collected two variants of the bank marketing data set pertaining to a Portuguese financial institution, consisting of 41188 and 45211 instances, and performed classification on them using two data reduction techniques. Attribute subset selection has been performed on the first data set, and the training data with the selected features is used in classification. Principal Component Analysis has been performed on the second data set, and the training data with the extracted features is used in classification. A deep neural network classification algorithm based on backpropagation has been developed to perform classification on both data sets. Finally, the performance of each deep neural network classifier is compared with four standard classifiers, namely decision trees, Naïve Bayes, support vector machines, and k-nearest neighbors. The deep neural network classifier has been found to outperform the existing classifiers in terms of accuracy.
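The PCA-based reduction step described above can be sketched with NumPy. This is a minimal illustration of projecting data onto its leading principal components; the bank-marketing features and the deep network itself are not reproduced here, and the toy data below is synthetic:

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top n_components principal components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)           # eigh: covariance is symmetric
    order = np.argsort(vals)[::-1]             # sort eigenvalues descending
    components = vecs[:, order[:n_components]]
    return Xc @ components                     # reduced representation

# Toy example: 6 samples, 4 features, reduced to 2 dimensions
# before being fed to a classifier.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Z = pca_project(X, 2)
```

In the workflow the abstract describes, `Z` (computed from the training data) would become the classifier's input in place of the original features.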


Author(s):  
Giovanni Felici ◽  
Klaus Truemper

The method described in this chapter is designed for data mining and learning on logic data. This type of data is composed of records that can be described by the presence or absence of a finite number of properties. Formally, such records can be described by variables that may assume only the values true or false, usually referred to as logic (or Boolean) variables. In real applications, it may also happen that the presence or absence of some property cannot be verified for a record; in such a case we consider that variable to be unknown (the capability to formally treat data with missing values is a feature of logic-based methods). For example, to describe patient records in medical diagnosis applications, one may use the logic variables healthy, old, and has_high_temperature, among many others. A very common data mining task is to find, based on training data, the rules that separate two subsets of the available records, or that explain why the data belong to one subset or the other. For example, one may wish to find a rule that, based on the many variables observed in patient records, is able to distinguish healthy patients from sick ones. Such a rule, if sufficiently precise, may then be used to classify new data and/or to gain information from the available data. This task is often referred to as machine learning or pattern recognition and accounts for a significant portion of the research conducted in the data mining community. When the data considered is in logic form, or can be transformed into it by some reasonable process, it is of great interest to determine explanatory rules in the form of combinations of logic variables, or logic formulas. In the example above, a rule derived from the data could be: if (has_high_temperature is true) and (running_nose is true) then (the patient is not healthy).
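The example rule at the end of the passage can be made concrete: a record is a mapping of Boolean variables (with None standing for an unknown value), and the rule is a conjunction of literals. The variable names are the chapter's own illustrative ones; the three-valued handling of unknowns is a minimal sketch, not the chapter's formal treatment:

```python
def rule_not_healthy(record):
    """if (has_high_temperature is true) and (running_nose is true)
    then (the patient is not healthy)."""
    a = record.get("has_high_temperature")  # True, False, or None (unknown)
    b = record.get("running_nose")
    if a is None or b is None:
        return None            # rule cannot be evaluated on missing values
    return a and b             # True means: classified as NOT healthy

patients = [
    {"has_high_temperature": True,  "running_nose": True},
    {"has_high_temperature": True,  "running_nose": False},
    {"has_high_temperature": None,  "running_nose": True},   # unknown value
]
fired = [rule_not_healthy(p) for p in patients]  # [True, False, None]
```

A learned logic formula is just such a conjunction (or a disjunction of conjunctions) discovered from the training records rather than written by hand.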


Author(s):  
M. Thangamani ◽  
P. Thangaraj

The increase in the number of documents has aggravated the difficulty of classifying them according to specific needs. Clustering analysis in a distributed environment is a thrust area in artificial intelligence and data mining. Its fundamental task is to use features to compute the degree of relationship between objects and to accomplish automatic classification without prior knowledge. Document clustering uses clustering techniques to group highly similar documents together by computing document similarity. Recent studies have shown that ontologies are useful in improving the performance of document clustering. Ontology is concerned with the conceptualization of a domain into an individually identifiable, machine-readable format containing entities, attributes, relationships, and axioms. By analyzing types of techniques for document clustering, a better clustering technique based on the Genetic Algorithm (GA) is determined. The Non-Dominated Ranked Genetic Algorithm (NRGA) is used in this paper for clustering, as it has the capability of providing a better classification result. The experiment is conducted on the 20 Newsgroups data set to evaluate the proposed technique. The results show that the proposed approach is very effective in clustering documents in a distributed environment.


2016 ◽  
Vol 25 (2) ◽  
pp. 263-282 ◽  
Author(s):  
Renu Bala ◽  
Saroj Ratnoo

Fuzzy rule-based systems (FRBSs) are proficient at dealing with cognitive uncertainties like vagueness and ambiguity, which are imperative in real-world decision-making situations. Fuzzy classification rules (FCRs) based on fuzzy logic provide a framework for flexible, human-like reasoning involving linguistic variables. Appropriate membership functions (MFs) and a suitable number of linguistic terms – chosen according to the actual distribution of the data – are useful to strengthen the knowledge base (rule base [RB] + data base [DB]) of FRBSs. An RB is expected to be accurate and interpretable, and a DB must contain appropriate fuzzy constructs (type of MFs, number of linguistic terms, and positioning of the parameters of MFs) for the success of any FRBS. Moreover, it would be fascinating to know how a system behaves in rare or exceptional circumstances, and what action ought to be taken in situations where generalized rules cease to work. In this article, we propose a three-phase approach for the discovery of FCRs augmented with intra- and inter-class exceptions. In the first phase, a pre-processing algorithm is suggested to tune the DB in terms of the MFs and the number of linguistic terms for each attribute of a data set. The second phase discovers FCRs employing a genetic algorithm approach. Subsequently, intra- and inter-class exceptions are incorporated into the rules in the third phase. The proposed approach is illustrated on an example data set and further validated on six UCI machine learning repository data sets. The results show that the approach is able to discover more accurate, interpretable, and interesting rules. Rules with intra-class exceptions tell us about the unique objects of a category, and rules with inter-class exceptions enable us to take the right decision in exceptional circumstances.
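The fuzzy constructs the DB holds can be illustrated minimally: triangular membership functions partitioning a normalized attribute into linguistic terms. The three-term partition below is a generic example, not the tuned one the paper's pre-processing phase would produce:

```python
def triangular(a, b, c):
    """Triangular membership function: peaks at b, zero outside (a, c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Three linguistic terms over a normalized attribute in [0, 1].
terms = {
    "low":    triangular(-0.5, 0.0, 0.5),
    "medium": triangular(0.0, 0.5, 1.0),
    "high":   triangular(0.5, 1.0, 1.5),
}
# An attribute value belongs to several terms with different degrees.
degrees = {name: mu(0.4) for name, mu in terms.items()}
```

Tuning the DB, as in the paper's first phase, amounts to choosing the number of such terms and the positions of the parameters a, b, c per attribute.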


Author(s):  
Lai Lai Yee ◽  
Myo Ma Ma

Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It can be viewed as a result of the natural evolution of information technology. The key point is that data mining applies these and other AI and statistical techniques to common business problems in a fashion that makes them available to the skilled knowledge worker as well as the trained statistics professional. This paper presents a classification system for toxicology using C4.5. First, the input data are randomly partitioned into two independent sets, a training set and a test set: two thirds of the data are allocated to the training set and the remaining one third to the test set. The training set is then used to build the C4.5 decision tree, and the test set is used to estimate the accuracy of the resulting classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data.
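The holdout protocol the abstract describes (two thirds training, one third test, drawn at random) can be sketched as follows; only the partitioning step is shown, with a plain integer list standing in for labelled toxicology records, and the C4.5 induction itself is not reproduced:

```python
import random

def holdout_split(data, train_fraction=2/3, seed=42):
    """Randomly partition data into independent training and test sets."""
    shuffled = data[:]                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(30))              # stand-in for labelled records
train, test = holdout_split(records)   # 20 training records, 10 test records
```

Training on `train` and scoring on `test` keeps the accuracy estimate independent of the data the tree was grown on, which is the point of the split.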


2016 ◽  
Vol 25 (01) ◽  
pp. 1550028 ◽  
Author(s):  
Mete Celik ◽  
Fehim Koylu ◽  
Dervis Karaboga

In data mining, classification rule learning extracts knowledge in the form of IF-THEN rules, which are comprehensible and readable. It is a challenging problem due to the complexity of data sets. Various meta-heuristic machine learning algorithms have been proposed for rule learning. Cooperative rule learning is the discovery of all classification rules concurrently in a single run. In this paper, a novel cooperative rule learning algorithm based on the Artificial Bee Colony, called CoABCMiner, is introduced. The proposed algorithm handles the training data set and discovers the classification model containing the rule list. Token competition, a new updating strategy used in the onlooker and employed bee phases, and a new scout bee mechanism are proposed in CoABCMiner to achieve cooperative learning of different rules belonging to different classes. We compared the results of CoABCMiner with several state-of-the-art algorithms using 14 benchmark data sets. Nonparametric statistical tests, such as the Friedman test, post hoc tests, and contrast estimation based on medians, are performed; these tests determine the similarity of the control algorithm to the other algorithms on multiple problems. A sensitivity analysis of CoABCMiner is also conducted. It is concluded that CoABCMiner can efficiently discover classification rules for the data sets used in the experiments.
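The "classification model containing the rule list" can be illustrated generically: an ordered list of IF-THEN rules applied first-match-wins, with a default class when no rule fires. The rules and attribute name below are hypothetical, not ones discovered by CoABCMiner:

```python
def classify(rule_list, default_class, record):
    """Apply an ordered IF-THEN rule list; the first matching rule wins."""
    for condition, label in rule_list:
        if condition(record):
            return label
    return default_class

# Hypothetical discovered rules for a three-class problem.
rules = [
    (lambda r: r["petal_len"] < 2.5, "setosa"),
    (lambda r: r["petal_len"] < 5.0, "versicolor"),
]
label = classify(rules, "virginica", {"petal_len": 6.1})  # no rule fires
```

Rule-order matters in such a list, which is why cooperative learning of the whole list in one run differs from mining each rule in isolation.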

