Estimation of Distribution Algorithms for Feature Subset Selection in Large Dimensionality Domains

Data Mining ◽  
2011 ◽  
pp. 97-116 ◽  
Author(s):  
Inaki Inza ◽  
Pedro Larranaga ◽  
Basilio Sierra

Feature Subset Selection (FSS) is a well-known task in the Machine Learning, Data Mining, Pattern Recognition, and Text Learning paradigms. Genetic Algorithms (GAs) are possibly the most commonly used algorithms for Feature Subset Selection tasks. Although the FSS literature contains many papers, few of them tackle the task of FSS in domains with more than 50 features. In this chapter we present a novel search heuristic paradigm, called Estimation of Distribution Algorithms (EDAs), as an alternative to GAs, to perform a population-based and randomized search in datasets of large dimensionality. The EDA paradigm avoids the use of genetic crossover and mutation operators to evolve the populations. In the absence of these operators, the evolution is guaranteed by the factorization of the probability distribution of the best solutions found in a generation of the search and the subsequent simulation of this distribution to obtain a new pool of solutions. In this chapter we present four different probabilistic models to perform this factorization. In a comparison with two types of GAs on natural and artificial datasets of large dimensionality, EDA-based approaches obtain encouraging results with regard to accuracy, and they required fewer evaluations than the genetic approaches.
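The loop described in this abstract (select the best solutions, estimate their probability distribution, sample that distribution to obtain the next pool) can be illustrated with a minimal sketch. It assumes the simplest possible factorization, independent per-feature marginals, which is not necessarily one of the four models the chapter presents. The names `umda` and `toy_fitness` are hypothetical, and `toy_fitness` is only a stand-in for the accuracy that a real feature-selection run would measure.

```python
import random


def umda(fitness, n_bits, pop_size=60, n_select=30, generations=40, seed=0):
    """Univariate EDA sketch: each bit says whether a feature is kept."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start with no preference for any feature
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Simulate the current distribution to obtain a pool of solutions.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        population.sort(key=fitness, reverse=True)
        top_fit = fitness(population[0])
        if top_fit > best_fit:
            best, best_fit = population[0][:], top_fit
        # Estimate the distribution of the best solutions. The model is
        # assumed to factorize into independent marginals, clamped so that
        # no bit becomes permanently fixed.
        selected = population[:n_select]
        probs = [
            min(0.95, max(0.05, sum(ind[i] for ind in selected) / n_select))
            for i in range(n_bits)
        ]
    return best, best_fit


def toy_fitness(mask):
    """Hypothetical score: the first 5 features help, the rest hurt."""
    return sum(mask[:5]) - 0.5 * sum(mask[5:])
```

Calling `umda(toy_fitness, 20)` returns the best feature mask found and its score. No crossover or mutation operator appears anywhere; variation comes only from sampling the estimated marginals.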

2015 ◽  
Vol 157 ◽  
pp. 46-60 ◽  
Author(s):  
Iñigo Mendialdua ◽  
Andoni Arruti ◽  
Ekaitz Jauregi ◽  
Elena Lazkano ◽  
Basilio Sierra

Author(s):  
ROSA BLANCO ◽  
PEDRO LARRAÑAGA ◽  
IÑAKI INZA ◽  
BASILIO SIERRA

Although cancer classification has improved considerably, a general method that classifies known types of cancer has not yet been developed. In this work, we propose the use of supervised classification techniques, coupled with feature subset selection algorithms, to automatically perform this classification in gene expression datasets. Due to the large number of features in gene expression datasets, the search for a highly accurate combination of features is done by means of the new Estimation of Distribution Algorithms paradigm. In order to assess the accuracy level of the proposed approach, the naïve-Bayes classification algorithm is employed in a wrapper form. Promising results are achieved, in addition to a considerable reduction in the number of genes. By stating the optimal selection of genes as a search task, the approach makes an automatic and robust choice of the genes finally selected, in contrast to previous works on the same types of problems.


SPE Journal ◽  
2013 ◽  
Vol 18 (03) ◽  
pp. 508-517 ◽  
Author(s):  
Asaad Abdollahzadeh ◽  
Alan Reynolds ◽  
Mike Christie ◽  
David Corne ◽  
Glyn Williams ◽  
...  

Summary: The topic of automatically history-matched reservoir models has seen much research activity in recent years. History matching is an example of an inverse problem, and there is significant active research on inverse problems in many other scientific and engineering areas. While many techniques from other fields, such as genetic algorithms, evolutionary strategies, differential evolution, particle swarm optimization, and the ensemble Kalman filter, have been tried in the oil industry, more recent and effective ideas have yet to be tested. One of these relatively untested ideas is a class of algorithms known as estimation of distribution algorithms (EDAs). EDAs are population-based algorithms that use probability models to estimate the probability distribution of promising solutions, and then to generate new candidate solutions. EDAs have been shown to be very efficient in very complex high-dimensional problems. An example of a state-of-the-art EDA is the Bayesian optimization algorithm (BOA), which is a multivariate EDA employing Bayesian networks for modeling the relationships between good solutions. The use of a Bayesian network leads to relatively fast convergence as well as high diversity in the matched models. Given the relatively limited number of reservoir simulations used in history matching, EDA-BOA offers the promise of high-quality history matches with a fast convergence rate. In this paper, we introduce EDAs and describe BOA in detail. We show results of the EDA-BOA algorithm on two history-matching problems. First, we tune the algorithm and demonstrate convergence speed and search diversity on the PUNQ-S3 synthetic case. Second, we apply the algorithm to a real North Sea turbidite field with multiple wells. In both examples, we show improvements in performance over traditional population-based algorithms.


2010 ◽  
Vol 19 (01) ◽  
pp. 1-18 ◽  
Author(s):  
ELIAS P. DUARTE ◽  
AURORA T. R. POZO ◽  
BOGDAN T. NASSU

As faults are unavoidable in large-scale multiprocessor systems, it is important to be able to determine which units of the system are working and which are faulty. System-level diagnosis is a long-standing realistic approach to detect faults in multiprocessor systems. Diagnosis is based on the results of tests executed on the system units. In this work we evaluate the performance of evolutionary algorithms applied to the diagnosis problem. Experimental results are presented for both the traditional genetic algorithm (GA) and specialized versions of the GA. We then propose and evaluate specialized versions of Estimation of Distribution Algorithms (EDAs) for system-level diagnosis: the compact GA and Population-Based Incremental Learning, both with and without negative examples. The evaluation was performed using four metrics: the average number of generations needed to find the solution, the average fitness after up to 500 generations, the percentage of tests that reached the optimal solution, and the average time until the solution was found. An analysis of experimental results shows that more sophisticated algorithms converge faster to the optimal solution.
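For readers unfamiliar with the compact GA named in this abstract, the following is a minimal sketch of the generic algorithm, not the specialized diagnosis version the authors evaluate. It replaces the population with a probability vector and nudges that vector toward the winner of each pairwise comparison. The function name `compact_ga`, the parameter `virtual_pop`, and the step cap `max_steps` are assumptions made for illustration.

```python
import random


def compact_ga(fitness, n_bits, virtual_pop=50, max_steps=20000, seed=0):
    """Generic compact GA sketch: a probability vector stands in for the
    population and is updated from one pairwise comparison per step."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits
    step = 1.0 / virtual_pop  # update size of a simulated population

    def sample():
        return [1 if rng.random() < p else 0 for p in probs]

    for _ in range(max_steps):
        a, b = sample(), sample()
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        for i in range(n_bits):
            # Only bits where the two candidates disagree carry information.
            if winner[i] != loser[i]:
                delta = step if winner[i] == 1 else -step
                probs[i] = min(1.0, max(0.0, probs[i] + delta))
        # Stop once every position has converged to 0 or 1.
        if all(p in (0.0, 1.0) for p in probs):
            break
    return [1 if p >= 0.5 else 0 for p in probs]
```

As a usage example, `compact_ga(sum, 16)` runs the sketch on the OneMax toy problem, where the fitness of a bit string is simply its number of ones. A diagnosis application would instead need a fitness function that scores a candidate fault set against the observed test results, which is outside the scope of this sketch.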


2010 ◽  
Vol 18 (4) ◽  
pp. 547-579 ◽  
Author(s):  
Chung-Yao Chuang ◽  
Ying-ping Chen

The probabilistic model building performed by estimation of distribution algorithms (EDAs) enables these methods to use advanced techniques of statistics and machine learning for automatic discovery of problem structures. However, in some situations, it may not be possible to completely and accurately identify the whole problem structure by probabilistic modeling due to certain inherent properties of the given problem. In this work, we illustrate one possible cause of such situations with problems consisting of structures with unequal fitness contributions. Based on the illustrative example, we introduce a notion that the estimated probabilistic models should be inspected to reveal the effective search directions and further propose a general approach which utilizes a reserved set of solutions to examine the built model for likely inaccurate fragments. Furthermore, the proposed approach is implemented on the extended compact genetic algorithm (ECGA) and experiments are performed on several sets of additively separable problems with different scaling setups. The results indicate that the proposed method can significantly assist ECGA to handle problems comprising structures of disparate fitness contributions and therefore may potentially help EDAs in general to overcome those situations in which the entire problem structure cannot be recognized properly due to the temporal delay of emergence of some promising partial solutions.

