Genetic Programming for Classification with Unbalanced Data

2021 ◽  
Author(s):  
◽  
Urvesh Bhowan

<p>In classification, machine learning algorithms can suffer a performance bias when data sets are unbalanced. Binary data sets are unbalanced when one class is represented by only a small number of training examples (called the minority class), while the other class makes up the rest (majority class). In this scenario, the induced classifiers typically have high accuracy on the majority class but poor accuracy on the minority class. As the minority class typically represents the main class-of-interest in many real-world problems, accurately classifying examples from this class can be at least as important as, and in some cases more important than, accurately classifying examples from the majority class. Genetic Programming (GP) is a promising machine learning technique based on the principles of Darwinian evolution to automatically evolve computer programs to solve problems. While GP has shown much success in evolving reliable and accurate classifiers for typical classification tasks with balanced data, GP, like many other learning algorithms, can evolve biased classifiers when data is unbalanced. This is because traditional training criteria in the GP fitness function, such as the overall success rate, can be influenced by the larger number of examples from the majority class. This thesis proposes a GP approach to classification with unbalanced data. The goal is to develop new internal cost-adjustment techniques in GP to improve classification performance on both the minority class and the majority class. By focusing on internal cost adjustment within GP rather than the traditional data-balancing techniques, the unbalanced data can be used directly or "as is" in the learning process. This removes any dependence on a sampling algorithm to first artificially re-balance the input data prior to the learning process.
This thesis shows that by developing a number of new methods in GP, genetic program classifiers with good classification ability on the minority and the majority classes can be evolved. This thesis evaluates these methods on a range of binary benchmark classification tasks with unbalanced data. This thesis demonstrates that, unlike tasks with multiple balanced classes, where some dynamic (non-static) classification strategies perform significantly better than the simple static classification strategy, static and dynamic strategies show no significant difference in the performance of evolved GP classifiers on these binary tasks. For this reason, the rest of the thesis uses the static classification strategy. This thesis proposes several new fitness functions in GP to perform cost adjustment between the minority and the majority classes, allowing the unbalanced data sets to be used directly in the learning process without sampling. Using the Area under the Receiver Operating Characteristic (ROC) curve (also known as the AUC) to measure how well a classifier performs on the minority and majority classes, these new fitness functions find genetic program classifiers that achieve high AUC on the tasks, with fast GP training times. These GP methods outperform two popular learning algorithms, namely Naive Bayes and Support Vector Machines, on the tasks, particularly when the level of class imbalance is large, where both algorithms show biased classification performance. This thesis also proposes a multi-objective GP (MOGP) approach which treats the accuracies of the minority and majority classes separately in the learning process. The MOGP approach evolves a good set of trade-off solutions (a Pareto front) in a single run that perform as well as, and in some cases better than, multiple runs of canonical single-objective GP (SGP).
In SGP, individual genetic program solutions capture the performance trade-off between the two objectives (minority and majority class accuracy) using an ROC curve, whereas in MOGP, this requirement is delegated to multiple genetic program solutions along the Pareto front. This thesis also shows how multiple Pareto front classifiers can be combined into an ensemble where individual members vote on the class label. Two ensemble diversity measures are developed in the fitness functions which treat the diversity on both the minority and the majority classes as equally important; otherwise, these measures risk being biased toward the majority class. The evolved ensembles outperform their individual members on the tasks due to good cooperation between members. This thesis further improves ensemble performance by developing a GP approach to ensemble selection, to quickly find small groups of individuals that cooperate very well together in the ensemble. The pruned ensembles use far fewer individuals to achieve performance that is as good as that of larger (unpruned) ensembles, particularly on tasks with high levels of class imbalance, thereby reducing the total time to evaluate the ensemble.</p>
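To illustrate why the overall success rate can mislead on unbalanced data, and what a Pareto front over the two class accuracies looks like, here is a minimal Python sketch. The function names (`overall_accuracy`, `balanced_fitness`, `pareto_front`) and the equal weighting of the two class accuracies are assumptions made for illustration; they are not the fitness functions or the MOGP algorithm developed in the thesis.

```python
from typing import List, Sequence, Tuple


def class_accuracies(y_true: Sequence[int], y_pred: Sequence[int],
                     minority: int = 1) -> Tuple[float, float]:
    """Return (minority accuracy, majority accuracy) for binary labels."""
    min_total = sum(1 for t in y_true if t == minority)
    maj_total = len(y_true) - min_total
    min_hits = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == t)
    maj_hits = sum(1 for t, p in zip(y_true, y_pred) if t != minority and p == t)
    return min_hits / min_total, maj_hits / maj_total


def overall_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all examples classified correctly (the overall success rate)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_fitness(y_true: Sequence[int], y_pred: Sequence[int],
                     minority: int = 1) -> float:
    """Assumed illustrative fitness: the mean of the two class accuracies."""
    min_acc, maj_acc = class_accuracies(y_true, y_pred, minority)
    return 0.5 * (min_acc + maj_acc)


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep the points that no other point dominates (both objectives maximised)."""
    def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        # a is at least as good on both objectives and differs from b,
        # so it is strictly better on at least one objective.
        return a[0] >= b[0] and a[1] >= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]


# A classifier that always predicts the majority class on a 95:5 data set.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(overall_accuracy(y_true, y_pred))   # high despite missing every minority example
print(balanced_fitness(y_true, y_pred))   # exposes the bias

# (minority accuracy, majority accuracy) pairs of four hypothetical classifiers.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
print(pareto_front(candidates))
```

In this sketch the always-majority classifier scores well on overall accuracy but poorly once each class counts equally, and the Pareto filter discards only the candidate that another candidate beats on both class accuracies.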


2013 ◽  
Vol 22 (02) ◽  
pp. 1350008 ◽  
Author(s):  
ATLÁNTIDA I. SÁNCHEZ ◽  
EDUARDO F. MORALES ◽  
JESUS A. GONZALEZ

Imbalanced class distributions are common in many real-world applications. Because many classifiers perform worse on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority class regions. They are also more robust to noisy examples, because the number of new instances generated per cluster is proportional to the cluster's size. The clusters are generated automatically. Our new methods do not need tuning parameters, and they can deal with both numerical and nominal attributes. The two methods were tested with twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers, and results were evaluated with recall, precision, F-measure, and AUC, which are more suitable for class-imbalanced datasets. We performed ANOVA and paired t-tests to show that the proposed methods are competitive and in many cases significantly better than the other oversampling methods used in the comparison.


Author(s):  
Tatyana Biloborodova ◽  
Inna Skarga-Bandurova ◽  
Mark Koverga

A methodology for eliminating class imbalance in image data sets is presented. The proposed methodology includes the stages of image fragment extraction, fragment augmentation, feature extraction, and duplication of minority objects, and is based on reinforcement learning technology. The degree-of-imbalance indicator was used to measure the imbalance of the data set. An experiment was performed using a set of images of the faces of patients with skin rashes, annotated according to the severity of acne. The main steps of implementing the methodology are considered. The classification results showed the feasibility of applying the proposed methodology. The accuracy of classification on test data was 85%, which is 5% higher than the result obtained without the use of the proposed methodology. Key words: class imbalance, unbalanced data set, image fragment extraction, augmentation.


Author(s):  
Alessio Bernardo ◽  
Emanuele Della Valle

The world is constantly changing, and so is the massive amount of data it produces. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm, aiming at oversampling the minority class using a new version of Smote and Borderline-Smote inspired by Data Sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performance is better than the state of the art. Moreover, we analyze the time/memory consumption and the concept drift recovery speed.
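For readers unfamiliar with the oversampling idea that VFC-SMOTE builds on, the following is a minimal sketch of classic batch SMOTE-style interpolation in Python. It is not VFC-SMOTE: it does no streaming, sketching, or drift handling, and the function name `smote`, the parameter `k`, and the toy points are assumptions chosen for illustration.

```python
import math
import random
from typing import List

Point = List[float]


def smote(minority: List[Point], n_new: int, k: int = 3, seed: int = 0) -> List[Point]:
    """Create n_new synthetic minority points by interpolating between neighbours.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic: List[Point] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Four hypothetical minority examples at the corners of the unit square.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority_points, n_new=5):
    print(point)
```

Because every synthetic point is a convex combination of two existing minority points, it never falls outside the region spanned by the minority class, although it can still land in an area that majority examples also occupy.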


2021 ◽  
Author(s):  
◽  
Wenbin Pei

<p><b>Class imbalance and high dimensionality have been acknowledged as two tough issues in classification. Learning from unbalanced data, the constructed classifiers are often biased towards the majority class, and thereby perform poorly on the minority class. Unfortunately, the minority class is often the class of interest in many real-world applications, such as medical diagnosis and fault detection. High dimensionality often makes it more difficult to handle the class imbalance issue. To date, most existing works attempt to address one single issue, without consideration of solving the other. These works could not be effectively applied to some challenging classification tasks that suffer from both of the two issues.</b></p> <p>Genetic programming (GP) is one of the most popular techniques from evolutionary computation, which has been widely applied to classification tasks. The built-in feature selection ability of GP makes it very powerful for use in classification with high-dimensional data. However, if the class imbalance issue is not well addressed, the constructed GP classifiers are often biased towards the majority class. Accordingly, this thesis aims to address the joint effects of class imbalance and high dimensionality by developing new GP based classification approaches, with the goal of improving classification performance.</p> <p>To effectively and efficiently address the performance bias issue of GP, this thesis develops a fitness function that considers two criteria, namely the approximation of area under the curve (AUC) and classification clarity (i.e. how well a program can separate the two classes). To further improve the efficiency, a new program reuse mechanism is designed to reuse previous effective GP individuals. According to experimental results, GP with the new fitness function and the program reuse mechanism achieves good performance and significantly saves training time. 
However, this method treats the two criteria equally, which is not always reasonable.</p> <p>To avoid manually weighing the two criteria in the fitness evaluation process, we propose a novel two-criterion fitness evaluation method, where the obtained values on the two criteria are combined in pairs, instead of summing them together. Then, a three-criterion tournament selection is designed to effectively identify and select good programs to be used by genetic operators for generating better offspring during the evolutionary learning process. Experimental results show that the proposed GP method achieves better classification performance than compared methods.</p> <p>Cost-sensitive learning is a popular approach to addressing the problem of class imbalance for many classification algorithms in machine learning. However, cost-sensitive algorithms are dependent on cost matrices that are usually designed manually. Unfortunately, it is often not easy for humans, even experts, to accurately specify misclassification costs for different mistakes due to the lack or incompleteness of domain knowledge related to actual situations in many complex tasks. As a result, these cost-sensitive algorithms cannot be directly applied. This thesis develops new GP based approaches to developing cost-sensitive classifiers without requiring cost matrices from humans. The newly developed cost-sensitive GP methods are able to construct classifiers and learn cost values or intervals automatically and simultaneously. The experimental results show that the new cost-sensitive GP methods outperform compared methods for high-dimensional unbalanced classification in almost all comparisons.</p> <p>Cost-sensitive GP classifiers treat the minority class as being more important than the majority class, but this may cause an accuracy decrease in the overlapping areas where the prior probabilities of the two classes are about the same. 
In the thesis, we propose a neighborhood method to detect overlapping areas, and then use GP to develop cost-sensitive classifiers that employ different classification strategies to classify instances from the overlapping areas or the non-overlapping areas.</p>
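As background for the AUC-based fitness criterion mentioned above, the sketch below computes the AUC exactly through the pairwise Wilcoxon-Mann-Whitney statistic: the fraction of (minority, majority) example pairs in which the minority example receives the higher program output, with ties counted as one half. The thesis's own AUC approximation is not specified in this abstract, so the function name `auc_wmw` and the sample scores are assumptions for illustration only.

```python
from typing import Sequence


def auc_wmw(minority_scores: Sequence[float], majority_scores: Sequence[float]) -> float:
    """AUC as the Wilcoxon-Mann-Whitney statistic over all cross-class pairs.

    Higher scores are assumed to indicate the minority (positive) class.
    """
    wins = 0.0
    for s_min in minority_scores:
        for s_maj in majority_scores:
            if s_min > s_maj:
                wins += 1.0      # pair ranked correctly
            elif s_min == s_maj:
                wins += 0.5      # tie counts as half
    return wins / (len(minority_scores) * len(majority_scores))


# Hypothetical program outputs for two minority and three majority examples.
print(auc_wmw([0.9, 0.8], [0.1, 0.2, 0.3]))  # perfect separation
print(auc_wmw([0.2, 0.6], [0.4, 0.4]))       # half of the pairs ranked correctly
```

The pairwise form costs time proportional to the product of the two class sizes, which is one reason an approximation can be attractive inside a fitness function that is evaluated for every individual in every generation.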


2017 ◽  
Vol 42 (2) ◽  
pp. 149-176 ◽  
Author(s):  
Szymon Wojciechowski ◽  
Szymon Wilk

In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors, 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.


Author(s):  
Liangxiao Jiang ◽  
Chen Qiu ◽  
Chaoqun Li

In many real-world applications, the class distribution of instances is often imbalanced and the costs of misclassification differ. Thus, class-imbalanced cost-sensitive learning has attracted much attention from researchers. Sampling is a widely used technique for dealing with the class-imbalance problem; it alters the class distribution of instances so that the minority class is well represented in the training data. In this paper, we propose a novel Minority Cloning Technique (MCT) for class-imbalanced cost-sensitive learning. MCT alters the class distribution of training data by cloning each minority class instance according to the similarity between it and the mode of the minority class. The experimental results on a large number of UCI datasets show that MCT performs much better than Minority Oversampling with Replacement Technique (MORT) and Synthetic Minority Oversampling TEchnique (SMOTE) in terms of the total misclassification costs of the built classifiers.
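The evaluation measure named above, total misclassification cost, and the simplest baseline, oversampling with replacement, can be sketched in a few lines of Python. This is not MCT: the abstract does not give its similarity measure or cloning rule. The function names and the cost values (5 for a missed minority example, 1 for a false alarm) are hypothetical and chosen only to make the example concrete.

```python
import random
from typing import List, Sequence


def total_misclassification_cost(y_true: Sequence[int], y_pred: Sequence[int],
                                 cost_fn: float = 5.0, cost_fp: float = 1.0,
                                 minority: int = 1) -> float:
    """Sum the cost of every error, with separate costs per error type.

    cost_fn: assumed cost of predicting majority for a minority example.
    cost_fp: assumed cost of predicting minority for a majority example.
    """
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == minority and p != minority:
            cost += cost_fn
        elif t != minority and p == minority:
            cost += cost_fp
    return cost


def oversample_with_replacement(minority: List, target_size: int, seed: int = 0) -> List:
    """Grow the minority set to target_size by drawing random duplicates."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=max(0, target_size - len(minority)))
    return list(minority) + extra


y_true = [1, 1, 0, 0, 0]
y_pred = [0, 1, 1, 0, 0]
print(total_misclassification_cost(y_true, y_pred))  # one miss plus one false alarm

print(len(oversample_with_replacement(["a", "b", "c"], target_size=10)))
```

Under unequal costs, two classifiers with the same error count can differ widely in total cost, which is why the comparison in the abstract is stated in terms of cost rather than accuracy.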


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Le Wang ◽  
Meng Han ◽  
Xiaojuan Li ◽  
Ni Zhang ◽  
Haodong Cheng

Author(s):  
Sayan Surya Shaw ◽  
Shameem Ahmed ◽  
Samir Malakar ◽  
Laura Garcia-Hernandez ◽  
Ajith Abraham ◽  
...  

Many real-life datasets are imbalanced in nature, which implies that the number of samples present in one class (minority class) is far smaller than the number of samples found in the other class (majority class). Hence, if a standard classifier is trained directly on these datasets, it often overlooks the minority class samples while estimating the class-separating hyperplane(s) and, as a result, misclassifies them. To solve this problem, over the years, many researchers have followed different approaches. However, the selection of the true representative samples from the majority class is still considered an open research problem. A better solution to this problem would be helpful in many applications such as fraud detection, disease prediction and text classification. Recent studies also show that the problem requires analyzing not only the disproportion between classes but also other difficulties rooted in the nature of the data; it therefore calls for a more flexible, self-adaptable, computationally efficient and real-time method for selecting majority class samples without losing much important data. Keeping this fact in mind, we have proposed a hybrid model constituting Particle Swarm Optimization (PSO), a popular swarm intelligence-based meta-heuristic algorithm, and Ring Theory (RT)-based Evolutionary Algorithm (RTEA), a recently proposed physics-based meta-heuristic algorithm. We have named the algorithm RT-based PSO, or RTPSO for short. RTPSO can select the most representative samples from the majority class as it takes advantage of the efficient exploration and exploitation phases of its parent algorithms to strengthen the search process. We have used the AdaBoost classifier to observe the final classification results of our model. The effectiveness of our proposed method has been evaluated on 15 standard real-life datasets with low to extreme imbalance ratios.
The performance of RTPSO has been compared with PSO, RTEA and other standard undersampling methods. The obtained results demonstrate the superiority of RTPSO over the state-of-the-art class imbalance problem-solvers considered here for comparison. The source code of this work is available at https://github.com/Sayansurya/RTPSO_Class_imbalance.
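To make the notions of imbalance ratio and majority-class undersampling concrete, here is a minimal Python sketch of the plain random baseline. It is not RTPSO, which searches for representative majority samples with a meta-heuristic; the function names `imbalance_ratio` and `random_undersample` and the toy data are assumptions for illustration.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def imbalance_ratio(labels: Sequence[int]) -> float:
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(X: Sequence, y: Sequence[int],
                       seed: int = 0) -> Tuple[List, List[int]]:
    """Randomly drop samples from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    keep: List[int] = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        if len(idx) > n_min:
            idx = rng.sample(idx, n_min)  # random subset of the larger class
        keep.extend(idx)
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data set: nine majority samples (label 0) and three minority samples (label 1).
X = list(range(12))
y = [0] * 9 + [1] * 3
print(imbalance_ratio(y))

X_bal, y_bal = random_undersample(X, y)
print(imbalance_ratio(y_bal), Counter(y_bal))
```

Random selection ignores which majority samples carry useful information, which is the gap that guided selection methods such as the one in this abstract aim to close.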

