High-dimensional Unbalanced Binary Classification by Genetic Programming with Multi-criterion Fitness Evaluation and Selection

Abstract: High-dimensional unbalanced classification is challenging because of the joint effects of high dimensionality and class imbalance. Genetic programming (GP) has potential benefits for high-dimensional classification due to its built-in capability to select informative features. However, when the data are not evenly distributed across classes, GP tends to develop biased classifiers that achieve high accuracy on the majority class but low accuracy on the minority class. Unfortunately, the minority class is often at least as important as the majority class. It is therefore important to investigate how GP can be used effectively for high-dimensional unbalanced classification. In this paper, to address the performance bias issue of GP, a new two-criterion fitness function is developed. It considers two criteria: an approximation of the area under the curve (AUC) and classification clarity (i.e. how well a program can separate the two classes). The values obtained on the two criteria are combined in pairs, instead of being summed. Furthermore, this paper designs a three-criterion tournament selection to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. The experimental results show that the proposed method achieves better classification performance than the compared methods.
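
The two-criterion idea in the abstract above can be illustrated with a small sketch. It is not the authors' method: the abstract does not define either criterion precisely. The sketch assumes the Wilcoxon-Mann-Whitney statistic as the AUC approximation, a hypothetical `clarity` proxy (gap between class mean outputs, scaled by the output range), and a plain lexicographic comparison of the resulting pair. It also assumes that a program outputs larger values for the minority class. All function names are illustrative.

```python
from typing import List, Sequence, Tuple


def wmw_auc(min_outputs: Sequence[float], maj_outputs: Sequence[float]) -> float:
    """Wilcoxon-Mann-Whitney estimate of AUC; ties count as half a win."""
    wins = 0.0
    for p in min_outputs:
        for q in maj_outputs:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(min_outputs) * len(maj_outputs))


def clarity(min_outputs: Sequence[float], maj_outputs: Sequence[float]) -> float:
    """Hypothetical separation proxy (an assumption, not the paper's definition):
    difference between class mean outputs divided by the overall output range."""
    all_outputs = list(min_outputs) + list(maj_outputs)
    spread = max(all_outputs) - min(all_outputs)
    if spread == 0:
        return 0.0
    mean_min = sum(min_outputs) / len(min_outputs)
    mean_maj = sum(maj_outputs) / len(maj_outputs)
    return (mean_min - mean_maj) / spread


def fitness_pair(min_outputs: Sequence[float],
                 maj_outputs: Sequence[float]) -> Tuple[float, float]:
    """Keep the two criteria as a pair instead of summing them."""
    return (wmw_auc(min_outputs, maj_outputs), clarity(min_outputs, maj_outputs))


def tournament_winner(pairs: List[Tuple[float, float]]) -> int:
    """Index of the best candidate under lexicographic comparison of the pairs
    (AUC first, clarity as tie-breaker). This ordering is an assumption."""
    return max(range(len(pairs)), key=lambda i: pairs[i])
```

Keeping the criteria as a pair avoids choosing weights for a sum; the lexicographic ordering used here is only one of several ways such pairs could be compared.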

Genetic Programming for Binary Classification with High-dimensional Unbalanced Data

10.26686/wgtn.16862482.v1 ◽

2021 ◽

Author(s):

Wenbin Pei

Keyword(s):

Fitness Function ◽

Class Imbalance ◽

Classification Performance ◽

Experimental Results ◽

High Dimensionality ◽

High Dimensional ◽

Unbalanced Data ◽

Minority Class ◽

Fitness Evaluation ◽

Classification Tasks

Class imbalance and high dimensionality have been acknowledged as two tough issues in classification. Classifiers learned from unbalanced data are often biased towards the majority class and therefore perform poorly on the minority class. Unfortunately, the minority class is often the class of interest in many real-world applications, such as medical diagnosis and fault detection. High dimensionality often makes the class imbalance issue more difficult to handle. To date, most existing works address only one of the two issues without considering the other, so they cannot be applied effectively to challenging classification tasks that suffer from both. Genetic programming (GP) is one of the most popular techniques from evolutionary computation and has been widely applied to classification tasks. The built-in feature selection ability of GP makes it very powerful for classification with high-dimensional data. However, if the class imbalance issue is not well addressed, the constructed GP classifiers are often biased towards the majority class. Accordingly, this thesis aims to address the joint effects of class imbalance and high dimensionality by developing new GP-based classification approaches, with the goal of improving classification performance. To effectively and efficiently address the performance bias issue of GP, this thesis develops a fitness function that considers two criteria, namely an approximation of the area under the curve (AUC) and classification clarity (i.e. how well a program can separate the two classes). To further improve efficiency, a new program reuse mechanism is designed to reuse previously effective GP individuals. According to experimental results, GP with the new fitness function and the program reuse mechanism achieves good performance and significantly reduces training time. However, this method treats the two criteria equally, which is not always reasonable.
To avoid manually weighting the two criteria in the fitness evaluation process, we propose a novel two-criterion fitness evaluation method, where the values obtained on the two criteria are combined in pairs, instead of being summed. Then, a three-criterion tournament selection is designed to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. Experimental results show that the proposed GP method achieves better classification performance than the compared methods. Cost-sensitive learning is a popular approach to addressing the problem of class imbalance for many classification algorithms in machine learning. However, cost-sensitive algorithms depend on cost matrices that are usually designed manually. Unfortunately, it is often not easy for humans, even experts, to accurately specify misclassification costs for different mistakes, because domain knowledge about the actual situation is lacking or incomplete in many complex tasks. As a result, these cost-sensitive algorithms cannot be directly applied. This thesis develops new GP-based approaches to developing cost-sensitive classifiers without requiring cost matrices from humans. The newly developed cost-sensitive GP methods are able to construct classifiers and learn cost values or intervals automatically and simultaneously. The experimental results show that the new cost-sensitive GP methods outperform the compared methods for high-dimensional unbalanced classification in almost all comparisons. Cost-sensitive GP classifiers treat the minority class as being more important than the majority class, but this may cause an accuracy decrease in the overlapping areas where the prior probabilities of the two classes are about the same.
In this thesis, we propose a neighborhood method to detect overlapping areas, and then use GP to develop cost-sensitive classifiers that employ different classification strategies for instances from the overlapping areas and from the non-overlapping areas.
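
To make concrete why the choice of misclassification costs matters, the following sketch shows the standard minimum-expected-cost decision rule for two classes. It is a generic illustration of how a cost matrix is used, not the thesis's GP method (which learns the costs rather than taking them as input). The function names and the assumption that correct predictions cost nothing are illustrative.

```python
def expected_cost_decision(p_minority: float, cost_fn: float, cost_fp: float) -> int:
    """Return 1 (minority) if predicting the minority class has the lower expected cost.

    p_minority: estimated probability that the instance belongs to the minority class.
    cost_fn:    cost of missing a minority instance (false negative).
    cost_fp:    cost of flagging a majority instance (false positive).
    Correct predictions are assumed to cost nothing.
    """
    cost_if_predict_majority = p_minority * cost_fn
    cost_if_predict_minority = (1.0 - p_minority) * cost_fp
    return 1 if cost_if_predict_minority < cost_if_predict_majority else 0


def cost_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability above which the minority class is predicted under the rule above."""
    return cost_fp / (cost_fp + cost_fn)
```

With equal costs the threshold is 0.5; raising `cost_fn` relative to `cost_fp` lowers it, so a manually mis-specified cost ratio shifts every decision near the class boundary.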

Reuse of program trees in genetic programming with a new fitness function in high-dimensional unbalanced classification

Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO '19) ◽

10.1145/3319619.3321958 ◽

2019 ◽

Cited By ~ 1

Author(s):

Wenbin Pei ◽

Bing Xue ◽

Lin Shang ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Fitness Function ◽

High Dimensional ◽

Unbalanced Classification

A Cost-sensitive Genetic Programming Approach for High-dimensional Unbalanced Classification

2019 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci44817.2019.9003041 ◽

2019 ◽

Cited By ~ 1

Author(s):

Wenbin Pei ◽

Bing Xue ◽

Mengjie Zhang ◽

Lin Shang

Keyword(s):

Genetic Programming ◽

High Dimensional ◽

Programming Approach ◽

Unbalanced Classification

Numerical Simplification and its Effect on Fragment Distributions in Genetic Programming

10.26686/wgtn.16992592 ◽

2021 ◽

Author(s):

Alan David Kinzett

Keyword(s):

Genetic Programming ◽

Narrow Range ◽

Fitness Function ◽

Computation Time ◽

Resource Usage ◽

Online Program ◽

Negative Effects ◽

Fitness Evaluation ◽

Algebraic Simplification ◽

Range Of Values

In tree-based genetic programming (GP) there is a tendency for the program trees to increase in size from one generation to the next. If this increase in program size is not accompanied by an improvement in fitness, the unproductive growth is known as bloat. It is standard practice to place some form of control on program size. This can be done by limiting the number of nodes or the depth of the program trees, by adding a component to the fitness function that rewards smaller programs (parsimony pressure), or by simplifying individual programs using algebraic methods. This thesis proposes a novel program simplification method called numerical simplification that uses only the range of values the nodes take during fitness evaluation. The effect of online program simplification, both algebraic and numerical, on program size and resource usage is examined. This thesis also examines the distribution of program fragments within a genetic programming population and how this is changed by using simplification. It is shown that both simplification approaches result in reductions in average program size, memory used and computation time, and that numerical simplification performs at least as well as algebraic simplification and in some cases outperforms it. This reduction in program size and in the resources required to process the GP run comes without any significant reduction in accuracy. It is also shown that although the two online simplification methods destroy some existing program fragments, they generate new fragments during evolution, which compensates for any negative effects from the disruption of existing fragments. It is also shown that, after the first few generations, the rate at which new fragments are created, the rate at which fragments are lost from the population, and the number of distinct fragments in the population remain within a very narrow range of values for the remainder of the run.
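
The range-based idea described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: trees are nested tuples, a subtree is replaced by a constant when its observed output range over the fitness cases is within a hypothetical tolerance `tol`, and the replacement constant is the midpoint of that range. For simplicity the sketch re-evaluates each subtree, whereas the abstract describes using values already observed during fitness evaluation.

```python
import operator
from typing import Dict, List, Tuple, Union

# A tree is a constant, a variable name, or a tuple (operator symbol, left, right).
Tree = Union[float, str, Tuple]
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree: Tree, case: Dict[str, float]) -> float:
    """Evaluate a tree on one fitness case (a mapping from variable name to value)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, case), evaluate(right, case))
    if isinstance(tree, str):
        return case[tree]
    return float(tree)


def simplify(tree: Tree, cases: List[Dict[str, float]], tol: float = 1e-6) -> Tree:
    """Replace any subtree whose output range over `cases` is at most `tol`
    with a single constant (the midpoint of the observed range)."""
    if not isinstance(tree, tuple):
        return tree  # constants and variables are already minimal
    values = [evaluate(tree, case) for case in cases]
    low, high = min(values), max(values)
    if high - low <= tol:
        return (low + high) / 2.0
    op, left, right = tree
    return (op, simplify(left, cases, tol), simplify(right, cases, tol))
```

For example, `simplify(("+", "x", ("-", "y", "y")), cases)` collapses the `y - y` subtree to a constant without any algebraic rule for subtraction, because that subtree takes a single value on every fitness case.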

Multi-tree Genetic Programming with A New Fitness Function for Melanoma Detection

10.26686/wgtn.12616751.v1 ◽

2020 ◽

Author(s):

Q Ul Ain ◽

Bing Xue ◽

Harith Al-Sahaf ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Fitness Function ◽

Local Binary Patterns ◽

Skin Lesions ◽

Color Variation ◽

Classification Model ◽

Genetic Operators ◽

Melanoma Detection ◽

Multiple Trees ◽

Performance Gains

© 2019 IEEE. The incidence of malignant melanoma has increased enormously over the past decades. Accurate detection and classification require not only discriminative features but also a properly designed model that combines these features effectively. In this study, the multi-tree representation of genetic programming (GP) is used to effectively combine different types of features and evolve a classification model for the task of melanoma detection. Local binary patterns are used to extract pixel-level informative features. To incorporate the properties of the ABCD rule of dermoscopy (asymmetrical property, border shape, color variation and geometrical characteristics), various features are used to capture local and global information about the skin lesions. To meet the requirements of the proposed multi-tree GP representation, genetic operators such as crossover and mutation are designed accordingly. Moreover, a new weighted fitness function is designed to evolve better GP individuals whose multiple trees influence each other's performance during evolution, in order to achieve overall performance gains. The performance of the new method is evaluated on two benchmark skin image datasets and compared with six widely used classification algorithms and the single-tree GP method. The experimental results show that the proposed method significantly outperforms all of these classification methods.
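
For readers unfamiliar with local binary patterns, the basic 3x3 operator can be sketched as below. This is the textbook form, not necessarily the variant or parameters used in the paper. The neighbour ordering (clockwise from the top-left pixel) and the `>=` comparison are assumed conventions, and `lbp_code` is an illustrative name.

```python
from typing import Sequence

# Neighbour offsets (row, column), clockwise from the top-left pixel.
# The starting point and direction are assumed conventions; others are in use.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: Sequence[Sequence[int]], row: int, col: int) -> int:
    """8-bit local binary pattern of the pixel at (row, col).

    Each neighbour contributes one bit: 1 if it is at least as bright as the
    centre pixel, 0 otherwise. The pixel must not lie on the image border.
    """
    center = image[row][col]
    code = 0
    for bit, (d_row, d_col) in enumerate(OFFSETS):
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code
```

A histogram of these codes over an image region is the usual way to turn the per-pixel patterns into a feature vector.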

Hardware Implementation of a Genetic Algorithm for Motion Path Planning

Field-Programmable Gate Array (FPGA) Technologies for High Performance Instrumentation - Advances in Computer and Electrical Engineering ◽

10.4018/978-1-5225-0299-9.ch010 ◽

2016 ◽

pp. 250-275

Author(s):

Imbaby I. Mahmoud ◽

May Salama ◽

Asmaa Abd El Tawab Abd El Hamid

Keyword(s):

Genetic Algorithm ◽

Path Planning ◽

Fitness Function ◽

Random Number Generation ◽

Motion Path ◽

Genetic Operators ◽

Fitness Evaluation ◽

Evaluation Task ◽

Evaluation Module ◽

The Individual

The aim of this chapter is to investigate the hardware (H/W) implementation of Genetic Algorithm (GA) based robot motion path planning. The potential benefit of a H/W implementation of a genetic algorithm is that it allows massive parallelism, which suits random number generation, crossover, mutation and fitness evaluation. The operations of selection and reproduction are basically problem independent and involve basic string manipulation tasks. The fitness evaluation task, however, is problem dependent and proves a major difficulty in H/W implementation. Another difficulty is that a design can only be used for the individual problem its fitness function represents. Therefore, in this work the genetic operators are implemented in H/W, while the fitness evaluation module is implemented in software (S/W). This mixed hardware/software approach addresses both generality and acceleration. Moreover, a simple H/W implementation for fitness evaluation of the robot motion path planning problem is discussed.
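
The "basic string manipulation" nature of the problem-independent operators can be seen in a software sketch. This is a generic illustration on bit strings, not the chapter's hardware design. The function names, the string encoding, and the use of an explicit crossover point are assumptions made for clarity.

```python
import random
from typing import Tuple


def single_point_crossover(a: str, b: str, point: int) -> Tuple[str, str]:
    """Swap the tails of two equal-length bit strings after position `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip_mutation(bits: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in bits
    )


# Usage: neither operator needs to know what the bits encode.
rng = random.Random(0)
child_a, child_b = single_point_crossover("11111", "00000", 2)
mutated = bit_flip_mutation(child_a, 0.1, rng)
```

Neither function refers to the path-planning problem, which is why such operators can be fixed in hardware while the fitness evaluation stays problem specific.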

Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism

Soft Computing ◽

10.1007/s00500-020-05056-7 ◽

2020 ◽

Vol 24 (23) ◽

pp. 18021-18038

Author(s):

Wenbin Pei ◽

Bing Xue ◽

Lin Shang ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Fitness Function ◽

High Dimensional ◽

Imbalanced Classification

A genetic programming method for classifier construction and cost learning in high-dimensional unbalanced classification

Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion ◽

10.1145/3377929.3389955 ◽

2020 ◽

Author(s):

Wenbin Pei ◽

Bing Xue ◽

Lin Shang ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Programming Method ◽

High Dimensional ◽

Unbalanced Classification

Impact of class-imbalance on multi-class high-dimensional class prediction

Advances in Methodology and Statistics ◽

10.51936/grxm1445 ◽

2012 ◽

Vol 9 (1) ◽

Author(s):

Rok Blagus ◽

Lara Lusa

Keyword(s):

Sample Size ◽

Binary Classification ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

High Dimensional ◽

Class Imbalance Problem ◽

Class Prediction ◽

Linear Discriminant ◽

Imbalance Problem

The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples, and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare it with multi-class DLDA, using simulated and real data. Our results show that class imbalance has a significant effect on the classification results: the classification is biased towards the majority class, as in two-class problems, and the problem is magnified when the number of variables is large. The amount of bias also depends jointly on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Variable selection also plays an important role in the class-imbalance problem, and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the classifiers least sensitive to class imbalance, and its use is also recommended for multi-class problems. Whenever possible, experiments should be planned using balanced data in order to avoid the class-imbalance problem.
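
A minimal sketch of DLDA, the base classifier named in the abstract above, is given here for orientation. It follows the common textbook form: per-class means, a pooled within-class variance per variable, and assignment to the class with the smallest variance-scaled squared distance. It assumes equal class priors and non-zero pooled variances, and it is not taken from the paper. `fit_dlda` and `predict_dlda` are illustrative names.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Model = Tuple[Dict[Hashable, List[float]], List[float]]


def fit_dlda(X: Sequence[Sequence[float]], y: Sequence[Hashable]) -> Model:
    """Estimate class means and the pooled within-class variance of each variable."""
    classes = sorted(set(y))
    n_features = len(X[0])
    means: Dict[Hashable, List[float]] = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
    dof = len(X) - len(classes)  # degrees of freedom of the pooled estimate
    pooled = [
        sum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y)) / dof
        for j in range(n_features)
    ]
    return means, pooled


def predict_dlda(model: Model, x: Sequence[float]) -> Hashable:
    """Assign x to the class whose mean is nearest in variance-scaled distance.

    Assumes equal priors and that every pooled variance is non-zero.
    """
    means, pooled = model

    def scaled_distance(c: Hashable) -> float:
        return sum((x[j] - means[c][j]) ** 2 / pooled[j] for j in range(len(x)))

    return min(means, key=scaled_distance)
```

Because the covariance matrix is restricted to its diagonal, only one variance per variable has to be estimated, which keeps the classifier usable when the number of variables far exceeds the number of samples.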

Genetic programming for development of cost-sensitive classifiers for binary high-dimensional unbalanced classification

Applied Soft Computing ◽

10.1016/j.asoc.2020.106989 ◽

2021 ◽

Vol 101 ◽

pp. 106989

Author(s):

Wenbin Pei ◽

Bing Xue ◽

Lin Shang ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

High Dimensional ◽

Unbalanced Classification
