High-dimensional Unbalanced Binary Classification by Genetic Programming with Multi-criterion Fitness Evaluation and Selection

2021 ◽  
pp. 1-26
Author(s):  
Wenbin Pei ◽  
Bing Xue ◽  
Lin Shang ◽  
Mengjie Zhang

Abstract: High-dimensional unbalanced classification is challenging because of the joint effects of high dimensionality and class imbalance. Genetic programming (GP) has potential benefits for high-dimensional classification due to its built-in capability to select informative features. However, when data are not evenly distributed, GP tends to develop biased classifiers that achieve high accuracy on the majority class but low accuracy on the minority class. Unfortunately, the minority class is often at least as important as the majority class. It is therefore important to investigate how GP can be effectively utilized for high-dimensional unbalanced classification. In this paper, to address the performance bias issue of GP, a new two-criterion fitness function is developed, which considers two criteria: the approximation of the area under the curve (AUC) and the classification clarity (i.e. how well a program can separate the two classes). The values obtained on the two criteria are combined in pairs, instead of being summed together. Furthermore, this paper designs a three-criterion tournament selection to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. The experimental results show that the proposed method achieves better classification performance than the other compared methods.
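The two criteria named in the abstract can be illustrated with a minimal sketch. It assumes the AUC approximation is the Wilcoxon-Mann-Whitney statistic (a common choice, not confirmed by the abstract) and uses a hypothetical `clarity` score, the gap between class means divided by the summed spreads, as a stand-in for the paper's clarity measure. The function names and the returned pair are illustrative only.

```python
from statistics import mean, pstdev


def wmw_auc(minority_outputs, majority_outputs):
    """Wilcoxon-Mann-Whitney estimate of AUC.

    Fraction of (minority, majority) output pairs that the program ranks
    correctly; ties count as half a correct pair.
    """
    wins = 0.0
    for p in minority_outputs:
        for n in majority_outputs:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(minority_outputs) * len(majority_outputs))


def clarity(minority_outputs, majority_outputs):
    """Hypothetical separation score (an assumption, not the paper's formula)."""
    gap = abs(mean(minority_outputs) - mean(majority_outputs))
    spread = pstdev(minority_outputs) + pstdev(majority_outputs)
    return gap / (spread + 1e-12)  # small constant avoids division by zero


def fitness_pair(minority_outputs, majority_outputs):
    """Return the two criteria as a pair rather than a weighted sum."""
    return (wmw_auc(minority_outputs, majority_outputs),
            clarity(minority_outputs, majority_outputs))
```

Keeping the criteria as a pair leaves the comparison rule to the selection step, so no manual weight between the two values has to be chosen here.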

2021 ◽  
Author(s):  
Wenbin Pei

<p><b>Class imbalance and high dimensionality have been acknowledged as two tough issues in classification. Classifiers learned from unbalanced data are often biased towards the majority class and thereby perform poorly on the minority class. Unfortunately, the minority class is often the class of interest in many real-world applications, such as medical diagnosis and fault detection. High dimensionality often makes it more difficult to handle the class imbalance issue. To date, most existing works attempt to address a single issue without considering the other. These works cannot be effectively applied to challenging classification tasks that suffer from both issues.</b></p>
<p>Genetic programming (GP) is one of the most popular techniques from evolutionary computation, and it has been widely applied to classification tasks. The built-in feature selection ability of GP makes it very powerful for classification with high-dimensional data. However, if the class imbalance issue is not well addressed, the constructed GP classifiers are often biased towards the majority class. Accordingly, this thesis aims to address the joint effects of class imbalance and high dimensionality by developing new GP-based classification approaches, with the goal of improving classification performance.</p>
<p>To effectively and efficiently address the performance bias issue of GP, this thesis develops a fitness function that considers two criteria, namely the approximation of the area under the curve (AUC) and classification clarity (i.e. how well a program can separate the two classes). To further improve efficiency, a new program reuse mechanism is designed to reuse previous effective GP individuals. According to experimental results, GP with the new fitness function and the program reuse mechanism achieves good performance and significantly reduces training time. However, this method treats the two criteria equally, which is not always reasonable.</p>
<p>To avoid manually weighting the two criteria in the fitness evaluation process, we propose a novel two-criterion fitness evaluation method, where the values obtained on the two criteria are combined in pairs, instead of being summed together. Then, a three-criterion tournament selection is designed to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. Experimental results show that the proposed GP method achieves better classification performance than the compared methods.</p>
<p>Cost-sensitive learning is a popular approach to addressing the problem of class imbalance for many classification algorithms in machine learning. However, cost-sensitive algorithms depend on cost matrices that are usually designed manually. Unfortunately, it is often not easy for humans, even experts, to accurately specify misclassification costs for different mistakes, because domain knowledge about the actual situation is lacking or incomplete in many complex tasks. As a result, these cost-sensitive algorithms cannot be directly applied. This thesis develops new GP-based approaches to developing cost-sensitive classifiers without requiring cost matrices from humans. The newly developed cost-sensitive GP methods are able to construct classifiers and learn cost values or intervals automatically and simultaneously. The experimental results show that the new cost-sensitive GP methods outperform the compared methods for high-dimensional unbalanced classification in almost all comparisons.</p>
<p>Cost-sensitive GP classifiers treat the minority class as being more important than the majority class, but this may cause an accuracy decrease in the overlapping areas where the prior probabilities of the two classes are about the same. In this thesis, we propose a neighborhood method to detect overlapping areas, and then use GP to develop cost-sensitive classifiers that employ different classification strategies to classify instances from the overlapping areas or the non-overlapping areas.</p>
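The abstract does not specify how the neighborhood method works. As a rough illustration of the general idea only, the sketch below flags an instance as lying in an overlapping area when at least a given fraction of its k nearest neighbours belongs to the other class. The function name `overlapping_mask`, the parameters `k` and `threshold`, and the Euclidean distance are all assumptions made for this example.

```python
import math


def overlapping_mask(points, labels, k=3, threshold=0.5):
    """Flag instances whose neighbourhood is class-mixed (illustrative rule).

    points: list of equal-length numeric tuples; labels: list of class labels.
    Returns a list of booleans, True where the instance is treated as
    belonging to an overlapping area.
    """
    flags = []
    for i, p in enumerate(points):
        # Indices of the k nearest other instances by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        other = sum(labels[j] != labels[i] for j in neighbours) / len(neighbours)
        flags.append(other >= threshold)
    return flags
```

A classifier could then apply one decision rule to flagged instances and another to the rest, which is the kind of split strategy the abstract describes.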


2021 ◽  
Author(s):  
Alan David Kinzett

<p>In tree-based genetic programming (GP) there is a tendency for the program trees to increase in size from one generation to the next. If this increase in program size is not accompanied by an improvement in fitness, then this unproductive increase is known as bloat. It is standard practice to place some form of control on program size. This can be done by limiting the number of nodes or the depth of the program trees, by adding a component to the fitness function that rewards smaller programs (parsimony pressure), or by simplifying individual programs using algebraic methods. This thesis proposes a novel program simplification method, called numerical simplification, that uses only the range of values the nodes take during fitness evaluation. The effect of online program simplification, both algebraic and numerical, on program size and resource usage is examined. This thesis also examines the distribution of program fragments within a genetic programming population and how this is changed by using simplification. It is shown that both simplification approaches result in reductions in average program size, memory used and computation time, and that numerical simplification performs at least as well as algebraic simplification and in some cases outperforms it. These reductions in program size and in the resources required to process the GP run come without any significant reduction in accuracy. It is also shown that although the two online simplification methods destroy some existing program fragments, they generate new fragments during evolution, which compensates for any negative effects from the disruption of existing fragments. Finally, it is shown that, after the first few generations, the rate at which new fragments are created, the rate at which fragments are lost from the population, and the number of distinct fragments in the population remain within a very narrow range of values for the remainder of the run.</p>
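The idea of simplifying a program from the observed range of node values can be sketched as follows. This is a minimal illustration under assumptions, not the thesis's algorithm: each node records the minimum and maximum output it produced during evaluation, and any operator subtree whose range is narrower than a tolerance `tol` is replaced by a constant. The `Node` class, `simplify` function and `tol` parameter are hypothetical names.

```python
import operator


class Node:
    """Expression-tree node that records the range of its outputs."""

    def __init__(self, op=None, children=(), feature=None, value=0.0):
        self.op = op                    # binary function, or None for a leaf
        self.children = list(children)
        self.feature = feature          # input index for a feature leaf
        self.value = value              # constant for a constant leaf
        self.lo = float("inf")
        self.hi = float("-inf")

    def evaluate(self, x):
        if self.op is not None:
            out = self.op(*(child.evaluate(x) for child in self.children))
        elif self.feature is not None:
            out = x[self.feature]
        else:
            out = self.value
        # Track the range of values seen during fitness evaluation.
        self.lo = min(self.lo, out)
        self.hi = max(self.hi, out)
        return out

    def size(self):
        return 1 + sum(child.size() for child in self.children)


def simplify(node, tol=1e-9):
    """Replace operator subtrees with a near-constant output by a constant."""
    observed = node.lo <= node.hi       # False if the node was never evaluated
    if node.op is not None and observed and node.hi - node.lo <= tol:
        return Node(value=(node.lo + node.hi) / 2.0)
    node.children = [simplify(child, tol) for child in node.children]
    return node


# Example program: (x0 - x0) + x1, where the left subtree is always 0.
tree = Node(operator.add, [
    Node(operator.sub, [Node(feature=0), Node(feature=0)]),
    Node(feature=1),
])
for row in [(1.0, 2.0), (3.0, 5.0)]:
    tree.evaluate(row)
size_before = tree.size()
tree = simplify(tree)
size_after = tree.size()
```

Because the decision rests only on values seen during evaluation, a subtree that happens to be constant on the training data is collapsed even when no algebraic rule would remove it; the replacement is only guaranteed to be equivalent on inputs like those already seen.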


2020 ◽  
Author(s):  
Q Ul Ain ◽  
Bing Xue ◽  
Harith Al-Sahaf ◽  
Mengjie Zhang

© 2019 IEEE. The occurrence of malignant melanoma has increased enormously over the past decades. Accurate detection and classification require not only discriminative features but also a properly designed model that combines these features effectively. In this study, the multi-tree representation of genetic programming (GP) has been utilised to effectively combine different types of features and evolve a classification model for the task of melanoma detection. Local binary patterns have been used to extract pixel-level informative features. To incorporate the properties of the ABCD (asymmetrical property, border shape, color variation and geometrical characteristics) rule of dermoscopy, various features have been used to include local and global information about the skin lesions. To meet the requirements of the proposed multi-tree GP representation, genetic operators such as crossover and mutation are designed accordingly. Moreover, a new weighted fitness function is designed to evolve better GP individuals whose multiple trees influence each other's performance during evolution, in order to obtain overall performance gains. The performance of the new method is evaluated on two benchmark skin image datasets and compared with six widely used classification algorithms and the single-tree GP method. The experimental results show that the proposed method significantly outperforms all of these classification methods.
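For readers unfamiliar with local binary patterns, the basic 3x3 operator can be sketched as below. The neighbour ordering (clockwise from the top-left pixel) and the `>=` comparison are one common convention and are assumptions here; the abstract does not state which LBP variant the study uses.

```python
def lbp_code(patch):
    """Basic 8-neighbour local binary pattern code of a 3x3 patch.

    patch: 3x3 nested list of grey levels. Each neighbour that is at least
    as bright as the centre pixel sets one bit of the resulting 8-bit code.
    """
    center = patch[1][1]
    # Neighbour positions, clockwise starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code
```

A histogram of these codes over an image region is the usual way such pixel-level patterns are turned into a feature vector.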


Author(s):  
Imbaby I. Mahmoud ◽  
May Salama ◽  
Asmaa Abd El Tawab Abd El Hamid

The aim of this chapter is to investigate the hardware (H/W) implementation of Genetic Algorithm (GA) based robot motion path planning. The potential benefit of a H/W implementation of a genetic algorithm is that it allows massive parallelism, which is well suited to random number generation, crossover, mutation and fitness evaluation. The selection and reproduction operations are essentially problem independent and involve basic string manipulation tasks. The fitness evaluation task, however, is problem dependent and proves to be a major difficulty in H/W implementation. Another difficulty is that a design can only be used for the individual problem that its fitness function represents. Therefore, in this work the genetic operators are implemented in H/W, while the fitness evaluation module is implemented in software (S/W). This mixed hardware/software approach addresses both generality and acceleration. Moreover, a simple H/W implementation for fitness evaluation of the robot motion path planning problem is discussed.
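The problem-independent string operations mentioned above can be illustrated in software. The sketch below shows single-point crossover and bit-flip mutation on bit strings; it is not the chapter's hardware design, and the function names and the list-of-bits encoding are assumptions made for the example.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length bit lists at a random cut point."""
    point = rng.randrange(1, len(parent_a))  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def bit_flip_mutation(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in bits]


rng = random.Random(0)  # fixed seed so the example is repeatable
child_a, child_b = single_point_crossover([0] * 8, [1] * 8, rng)
mutated = bit_flip_mutation(child_a, 0.1, rng)
```

Neither function looks at what the bits mean, which is why such operators can be fixed in hardware while the fitness function stays in software.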


2012 ◽  
Vol 9 (1) ◽  
Author(s):  
Rok Blagus ◽  
Lara Lusa

The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples, and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare it with multi-class DLDA, using simulated and real data. Our results show that class imbalance has a significant effect on the classification results: the classification is biased towards the majority class, as in two-class problems, and the problem is magnified when the number of variables is large. The amount of bias also depends, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Variable selection also plays an important role in the class-imbalance problem, and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the classifiers least sensitive to class imbalance, and its use is also recommended for multi-class problems. Whenever possible, experiments should be planned using balanced data in order to avoid the class-imbalance problem.
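The one-versus-one scheme can be sketched as a voting rule over the binary sub-problems. The sketch assumes the max-wins rule, in which each pair of classes casts one vote and the class with the most votes is predicted; the dictionary layout of `pairwise_prob` and the tie handling (lowest class index wins) are assumptions made for this example.

```python
from itertools import combinations


def one_vs_one_vote(pairwise_prob, n_classes):
    """Combine binary sub-problem outputs by max-wins voting.

    pairwise_prob[(i, j)], with i < j, is the estimated probability of
    class i in the binary problem that contains only classes i and j.
    Returns the predicted class and the vote count of every class.
    """
    wins = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        if pairwise_prob[(i, j)] >= 0.5:
            wins[i] += 1
        else:
            wins[j] += 1
    predicted = max(range(n_classes), key=wins.__getitem__)
    return predicted, wins


# Three-class example: class 0 beats both others, class 2 beats class 1.
predicted, wins = one_vs_one_vote({(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.4}, 3)
```

Since each vote comes from a binary classifier, any bias towards the majority class in the sub-problems carries through to the multi-class prediction.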

