Genetic Programming for Automatically Constructing Data Mining Algorithms

Author(s):  
Alex A. Freitas ◽  
Gisele L. Pappa

At present there is a wide range of data mining algorithms available to researchers and practitioners (Witten & Frank, 2005; Tan et al., 2006). Despite the great diversity of these algorithms, virtually all of them share one feature: they have been manually designed. As a result, current data mining algorithms in general incorporate human biases and preconceptions in their designs. This article proposes an alternative approach to the design of data mining algorithms, namely the automatic creation of data mining algorithms by means of Genetic Programming (GP) (Pappa & Freitas, 2006). In essence, GP is a type of Evolutionary Algorithm – i.e., a search algorithm inspired by the Darwinian process of natural selection – that evolves computer programs or executable structures. This approach opens new avenues for research, providing the means to design novel data mining algorithms that are less limited by human biases and preconceptions, and so offer the potential to discover new kinds of patterns (or knowledge) to the user. It also offers an interesting opportunity for the automatic creation of data mining algorithms tailored to the data being mined.

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Thiago Cesar de Oliveira ◽  
Lúcio de Medeiros ◽  
Daniel Henrique Marco Detzel

Purpose Real estate appraisals are becoming an increasingly important means of backing up financial operations based on the values of these kinds of assets. However, in very large databases, there is a reduction in the predictive capacity when traditional methods, such as multiple linear regression (MLR), are used. This paper aims to determine whether in these cases the application of data mining algorithms can achieve superior statistical results. First, real estate appraisal databases from five towns and cities in the State of Paraná, Brazil, were obtained from Caixa Econômica Federal bank. Design/methodology/approach After initial validations, additional databases were generated with both real, transformed and nominal values, in clean and raw data. Each was assisted by the application of a wide range of data mining algorithms (multilayer perceptron, support vector regression, K-star, M5Rules and random forest), either isolated or combined (regression by discretization – logistic, bagging and stacking), with the use of 10-fold cross-validation in Weka software. Findings The results showed more varied incremental statistical results with the use of algorithms than those obtained by MLR, especially when combined algorithms were used. The largest increments were obtained in databases with a large amount of data and in those where minor initial data cleaning was carried out. The paper also conducts a further analysis, including an algorithmic ranking based on the number of significant results obtained. Originality/value The authors did not find similar studies or research studies conducted in Brazil.


Author(s):  
Shyue-Liang Wang ◽  
Ju-Wen Shen ◽  
Tuzng-Pei Hong

Mining functional dependencies (FDs) from databases has been identified as an important database analysis technique. It has received considerable research interest in recent years. However, most current data mining techniques for determining functional dependencies deal only with crisp databases. Although various forms of fuzzy functional dependencies (FFDs) have been proposed for fuzzy databases, they emphasized conceptual viewpoints and only a few mining algorithms are given. In this research, we propose methods to validate and incrementally search for FFDs from similarity-based fuzzy relational databases. For a given pair of attributes, the validation of FFDs is based on fuzzy projection and fuzzy selection operations. In addition, the property that FFDs are monotonic in the sense that r1 ? r2 implies FDa(r1) ? FDa(r2) is shown. An incremental search algorithm for FFDs based on this property is then presented. Experimental results showing the behavior of the search algorithm are discussed.


2014 ◽  
Vol 926-930 ◽  
pp. 2280-2283
Author(s):  
Qiong Ren

With the increasing of input data size, process cost will be very long, for the explosive growth of the Internet data even reached the point of single machine can handle. This article mainly introduces the architecture of the concept of cloud computing and, the mainstream of the analysis of the current data mining algorithms, based on cloud computing to develop the data mining system, providing the operation feasibility of data mining in cloud computing platform, having strong guiding significance.


Data mining can be considered to be an important aspects of information industry. Data mining has found a wide applicability in almost every field which deals with data. Out of the various techniques employed for data mining, Classification is a very commonly used tool for knowledge discovery. Various alternatives methods are available which can be used to create a classification model, out of which the most common and apprehensible one is KNN. In spite of KNN having a number of shortcomings and limitations in it, these can be overcome by with the help of alterations which can be made to the basic KNN algorithm. Due to its wide applicability, kNN has been the focus of extensive research and as a result, many alternatives have been performed with wide range of success in performance improvement. A major hardship being faced by the data mining applications is the large number of dimensions which render most of the data mining algorithms inefficient. The problem can be solved to some extent by using dimensionality reduction methods like PCA. Further improvements in the efficiency of the classification based mining algorithms can be achieved by using optimization methods. Meta-heuristic algorithms inspired by natural phenomenon like particle swarm optimization can be used very effectively for the purpose.


2019 ◽  
Vol 8 (3) ◽  
pp. 4846-4853

In the age of data generation known as Big Data, where data is produced in enormous amount, managing it has become a big challenge and along with this drawing information from the gathered data is equally important and challenging. Inferring relationships and predicting patterns from theses structured and unstructured data is now an area of research for researchers. And the data mining techniques have evolved as a tool for generating results and deducing conclusions. These mining algorithms find their applicability in almost every domain likewise understanding market segment, fraud detection, trend analysis, healthcare sector, education sector and many more. Looking at the wide range of applicability, in this paper, a brief overview of data mining algorithms is discussed. This discussion comprises of different data mining algorithms, their mathematical modelling, their evaluation methods, and their limitations. To support the fact a case study is conducted on a cardiovascular disease dataset and the measures of these mining techniques are compared.


2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

© Springer International Publishing AG 2017. k-means is one of the fundamental and most well-known algorithms in data mining. It has been widely used in clustering tasks, but suffers from a number of limitations on large or complex datasets. Genetic Programming (GP) has been used to improve performance of data mining algorithms by performing feature construction—the process of combining multiple attributes (features) of a dataset together to produce more powerful constructed features. In this paper, we propose novel representations for using GP to perform feature construction to improve the clustering performance of the k-means algorithm. Our experiments show significant performance improvement compared to k-means across a variety of difficult datasets. Several GP programs are also analysed to provide insight into how feature construction is able to improve clustering performance.


Author(s):  
Ali Bayır ◽  
Sevinç Gülseçen ◽  
Gökhan Türkmen

Political elections are influenced by a number of factors such as political tendencies, voters' perceptions, and preferences. The results of a political election could also be based on specific attributes of candidates: age, gender, occupancy, education, etc. Although it is very difficult to understand all the factors which could have influenced the outcome of the election, many of the attributes mentioned above could be included in a data set, and by using current data mining techniques, undiscovered patterns can be revealed. Despite unpredictability of human behaviors and/or choices involved, data mining techniques still could help in predicting the election outcomes. In this study, the results of the survey prepared by KONDA Research and Consultancy Company before 2011 elections in Turkey were used as raw data. This study may help in understanding how data mining methods and techniques could be used in political sciences research. The study may also reveal whether voting tendencies in elections could be a factor for the outcome of the election.


2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

© Springer International Publishing AG 2017. k-means is one of the fundamental and most well-known algorithms in data mining. It has been widely used in clustering tasks, but suffers from a number of limitations on large or complex datasets. Genetic Programming (GP) has been used to improve performance of data mining algorithms by performing feature construction—the process of combining multiple attributes (features) of a dataset together to produce more powerful constructed features. In this paper, we propose novel representations for using GP to perform feature construction to improve the clustering performance of the k-means algorithm. Our experiments show significant performance improvement compared to k-means across a variety of difficult datasets. Several GP programs are also analysed to provide insight into how feature construction is able to improve clustering performance.


2019 ◽  
Vol 14 (1) ◽  
pp. 21-26 ◽  
Author(s):  
Viswam Subeesh ◽  
Eswaran Maheswari ◽  
Hemendra Singh ◽  
Thomas Elsa Beulah ◽  
Ann Mary Swaroop

Background: The signal is defined as “reported information on a possible causal relationship between an adverse event and a drug, of which the relationship is unknown or incompletely documented previously”. Objective: To detect novel adverse events of iloperidone by disproportionality analysis in FDA database of Adverse Event Reporting System (FAERS) using Data Mining Algorithms (DMAs). Methodology: The US FAERS database consists of 1028 iloperidone associated Drug Event Combinations (DECs) which were reported from 2010 Q1 to 2016 Q3. We consider DECs for disproportionality analysis only if a minimum of ten reports are present in database for the given adverse event and which were not detected earlier (in clinical trials). Two data mining algorithms, namely, Reporting Odds Ratio (ROR) and Information Component (IC) were applied retrospectively in the aforementioned time period. A value of ROR-1.96SE>1 and IC- 2SD>0 were considered as the threshold for positive signal. Results: The mean age of the patients of iloperidone associated events was found to be 44years [95% CI: 36-51], nevertheless age was not mentioned in twenty-one reports. The data mining algorithms exhibited positive signal for akathisia (ROR-1.96SE=43.15, IC-2SD=2.99), dyskinesia (21.24, 3.06), peripheral oedema (6.67,1.08), priapism (425.7,9.09) and sexual dysfunction (26.6-1.5) upon analysis as those were well above the pre-set threshold. Conclusion: Iloperidone associated five potential signals were generated by data mining in the FDA AERS database. The result requires an integration of further clinical surveillance for the quantification and validation of possible risks for the adverse events reported of iloperidone.


Sign in / Sign up

Export Citation Format

Share Document