Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy

2009 ◽  
Vol 17 (3) ◽  
pp. 275-306 ◽  
Author(s):  
Salvador García ◽  
Francisco Herrera

Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed to address this problem, such as modifying the learning methods or applying a preprocessing stage. Within preprocessing focused on balancing data, two tendencies exist: reducing the set of examples (undersampling) or replicating minority class examples (oversampling). Undersampling with imbalanced datasets can be regarded as a prototype selection procedure whose purpose is to balance the dataset so as to achieve a high classification rate while avoiding bias toward the majority class. Evolutionary algorithms have been used for classical prototype selection with good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods, called evolutionary undersampling, that take the nature of the problem into consideration and use different fitness functions to obtain a good trade-off between class-distribution balance and performance. The study includes a taxonomy of the approaches and an overall comparison between our models and state-of-the-art undersampling methods. The results have been contrasted using nonparametric statistical procedures and show that evolutionary undersampling outperforms the nonevolutionary models as the degree of imbalance increases.
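
As an illustration of the idea, here is a minimal sketch of evolutionary undersampling: a genetic search over a bit mask selecting majority-class examples, scored with a 1-NN wrapper. The additive fitness combining balanced accuracy with closeness to a 1:1 class ratio, and all weights, are our assumptions, not the paper's exact fitness functions.

```python
# Minimal evolutionary-undersampling sketch (assumed fitness, 1-NN wrapper).
# Labels: 0 = majority class, 1 = minority class.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

def eus_fitness(mask, X_maj, X_min, X_val, y_val):
    if mask.sum() == 0:                       # empty selection is invalid
        return -np.inf
    X_tr = np.vstack([X_maj[mask], X_min])
    y_tr = np.r_[np.zeros(mask.sum()), np.ones(len(X_min))]
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    acc = balanced_accuracy_score(y_val, clf.predict(X_val))
    balance = 1.0 - abs(mask.sum() - len(X_min)) / len(X_maj)  # 1:1 closeness
    return acc + 0.2 * balance                # weighting is illustrative

def evolutionary_undersample(X_maj, X_min, X_val, y_val,
                             pop_size=20, generations=40, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, len(X_maj))) < 0.5        # bit-mask chromosomes
    for _ in range(generations):
        fits = np.array([eus_fitness(m, X_maj, X_min, X_val, y_val) for m in pop])
        parents = pop[np.argsort(fits)[-pop_size // 2:]]  # truncation selection
        children = parents.copy()
        children ^= rng.random(children.shape) < 0.05     # bit-flip mutation
        pop = np.vstack([parents, children])
    fits = np.array([eus_fitness(m, X_maj, X_min, X_val, y_val) for m in pop])
    return pop[fits.argmax()]                 # boolean mask over the majority class
```

The returned mask yields the undersampled majority subset `X_maj[mask]` that, together with all minority examples, forms the balanced training set.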

Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 54
Author(s):  
Chen Fu ◽  
Jianhua Yang

The problem of classification with imbalanced datasets is frequently encountered in practical applications. The data to be classified are skewed, i.e., the samples of one class (the minority class) are far fewer than those of the other classes (the majority classes). When dealing with imbalanced datasets, most classifiers share a common limitation: they tend to achieve better classification performance on the majority classes than on the minority class. To alleviate this limitation, a fuzzy rule-based modeling approach using information granules is proposed in this study. Information granules, as entities derived and abstracted from data, can describe and capture the characteristics (distribution and structure) of data from both the majority and minority classes. Since the geometric characteristics of information granules depend on the distance measures used in the granulation process, the main idea of this study is to construct information granules on each class of imbalanced data using Minkowski distance measures and then to establish classification models using "If-Then" rules. Experimental results on synthetic and publicly available datasets show that the proposed Minkowski-distance-based method can produce information granules with a range of geometric shapes and construct granular models with satisfactory classification performance on imbalanced datasets.
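
To make the geometric intuition concrete, the sketch below builds one (center, radius) granule per class under a Minkowski distance of order p and classifies by nearest granule boundary; the median prototype, the 0.9 coverage quantile, and the rule form are our assumptions, not the paper's exact construction.

```python
# Illustrative class-wise information granules under a Minkowski distance.
import numpy as np

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)

def build_granule(X_class, p, coverage=0.9):
    center = np.median(X_class, axis=0)        # robust class prototype
    radius = np.quantile(minkowski(X_class, center, p), coverage)
    return center, radius                      # granule covers ~90% of the class

def granular_predict(x, granules, p):
    # "If x is nearest the boundary of granule c, Then class c"
    score = {c: minkowski(x, ctr, p) - r for c, (ctr, r) in granules.items()}
    return min(score, key=score.get)
```

Varying p changes the granule geometry: p = 1 gives diamond-shaped granules, p = 2 gives spheres, and large p approaches axis-aligned boxes, which matches the "series of geometric shapes" the abstract describes.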


This chapter addresses the challenges, opportunities, and scope for formulating and designing new procedures for the imbalanced classification problem, which poses a difficulty for predictive modeling: many of the AI, machine learning, and deep learning algorithms widely used for classification are designed under the assumption of an equal number of examples per class. This assumption leads to poor performance, especially on the minority class, which is typically the most sensitive to classification errors and the most important in imbalanced classification. The chapter offers insights into learning with unequal class distributions in training datasets. Since most real-world classification tasks involve imbalanced distributions, they require specialized techniques to build more sophisticated models with fewer errors and improved performance.


2017 ◽  
Vol 14 (3) ◽  
pp. 579-595 ◽  
Author(s):  
Lu Cao ◽  
Hong Shen

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As an enhanced variant of the support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. However, TWSVM assumes a relative balance in the training sample distribution to achieve high classification accuracy on the whole dataset, and it is therefore not effective for imbalanced data classification problems. In this paper, we propose to combine a resampling technique, which uses both oversampling and undersampling to balance the training data, with TWSVM to handle imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-the-art methods.
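
A sketch of the resampling stage is shown below, assuming SMOTE oversampling followed by random undersampling; scikit-learn's SVC stands in for TWSVM, which has no standard scikit-learn implementation, and the sampling ratios are illustrative.

```python
# Resample-then-classify sketch (SVC stands in for the paper's TWSVM).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC

resample_then_svm = Pipeline([
    ("over", SMOTE(sampling_strategy=0.5, random_state=0)),                # minority up to 50% of majority
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),  # then down to 1:1
    ("clf", SVC(kernel="rbf")),
])
# Usage: resample_then_svm.fit(X_train, y_train); resample_then_svm.predict(X_test)
```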


2020 ◽  
Vol 10 (5) ◽  
pp. 1684
Author(s):  
Huajuan Duan ◽  
Yongqing Wei ◽  
Peiyu Liu ◽  
Hongxia Yin

Imbalanced classification is one of the most important problems in machine learning and data mining, and it arises in many real datasets. In the past, basic classifiers such as SVM and KNN have been applied to imbalanced datasets, in which the number of samples in one class is much larger than in another, but the classification results are not ideal. Several data preprocessing methods have been proposed to reduce the imbalance ratio of datasets and are combined with basic classifiers to obtain better performance. To improve overall classification accuracy, we propose a novel classifier ensemble framework based on K-means and a resampling technique (EKR). First, we divide the data samples of the majority class into several sub-clusters using K-means, where the value of k is determined by the average silhouette coefficient. We then adjust the number of samples in each sub-cluster to match that of the minority class through resampling, and combine each adjusted sub-cluster with the minority class to form several balanced subsets. A base classifier is trained on each balanced subset separately, and the base classifiers are finally integrated into a strong ensemble classifier. Extensive experimental results on 16 imbalanced datasets demonstrate the effectiveness and feasibility of the proposed algorithm in terms of multiple evaluation criteria, and EKR achieves better performance than several classical imbalanced classification algorithms that use different data preprocessing methods.
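
A condensed sketch of the EKR pipeline follows, under stated assumptions: decision trees as base classifiers, a small search range for k, and plain majority voting (the paper's exact choices may differ).

```python
# EKR-style sketch: cluster the majority class, build balanced subsets,
# train one base classifier per subset, vote. Labels: 0 = sub-cluster, 1 = minority.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier

def pick_k(X, k_range=range(2, 8), seed=0):
    # choose k by the average silhouette coefficient
    best_k, best_s = 2, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        s = silhouette_score(X, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k

def ekr_fit(X_maj, X_min, seed=0):
    rng = np.random.default_rng(seed)
    k = pick_k(X_maj, seed=seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_maj)
    members = []
    for c in range(k):
        sub = X_maj[labels == c]
        if len(sub) == 0:
            continue
        # resample the sub-cluster to exactly the minority-class size
        idx = rng.choice(len(sub), size=len(X_min), replace=len(sub) < len(X_min))
        X_bal = np.vstack([sub[idx], X_min])
        y_bal = np.r_[np.zeros(len(X_min)), np.ones(len(X_min))]
        members.append(DecisionTreeClassifier(random_state=seed).fit(X_bal, y_bal))
    return members

def ekr_predict(members, X):
    votes = np.mean([m.predict(X) for m in members], axis=0)
    return (votes >= 0.5).astype(int)          # majority vote across base classifiers
```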


Mathematics ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 156
Author(s):  
Darío Ramos-López ◽  
Ana D. Maldonado

Multi-class classification on imbalanced datasets is a challenging problem. In these cases, common validation metrics (such as accuracy or recall) are often not suitable. In many of these problems, often real-world problems related to health, some classification errors may be tolerated, whereas others are to be avoided completely. Therefore, a cost-sensitive variable selection procedure for building a Bayesian network classifier is proposed. It employs a flexible validation metric (a cost/loss function) that encodes the impact of the different classification errors, so the model is learned to optimize the a priori specified cost function. The proposed approach was applied to forecasting an air quality index from current levels of air pollutants and climatic variables using a highly imbalanced dataset. For this problem, the method yielded better results on the less frequent class states than standard validation metrics did. The possibility of fine-tuning the objective validation function can improve prediction quality on imbalanced data or when asymmetric misclassification costs have to be considered.
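
As a sketch of the kind of metric involved, the snippet below scores predictions by expected cost under a user-specified cost matrix; the 3-class matrix is purely illustrative, and the scorer would drive the variable selection loop.

```python
# Cost-sensitive validation metric sketch (illustrative cost matrix).
import numpy as np
from sklearn.metrics import confusion_matrix

# cost[i, j] = cost of predicting class j when the true class is i
COST = np.array([[0.0,  1.0, 5.0],    # e.g. missing a severe episode is costly
                 [1.0,  0.0, 2.0],
                 [10.0, 3.0, 0.0]])

def expected_cost(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=range(len(COST)))
    return float((cm * COST).sum() / cm.sum())
# Use expected_cost (lower is better) as the scorer when greedily
# adding or removing variables from the classifier.
```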


2021 ◽  
Vol 25 (3) ◽  
pp. 541-554
Author(s):  
Zicheng Wang ◽  
Yanrui Sun

The oversampling ratio N and the number of nearest neighbors k of the minority class are key hyperparameters of the synthetic minority oversampling technique (SMOTE) for reconstructing the class distribution of a dataset, and no optimal default values exist for them. It is therefore necessary to examine how the output dataset affects classification performance when SMOTE is run with different hyperparameter combinations. In this paper, we propose a hyperparameter optimization algorithm for imbalanced data that iteratively searches for reasonable values of N and k for SMOTE so as to build a balanced, high-quality dataset. As a result, a model with outstanding performance and strong generalization ability is trained, effectively addressing imbalanced classification. The proposed algorithm is a hybridization of a simulated annealing mechanism (SA) and particle swarm optimization (PSO). During the optimization, Cohen's Kappa is used to construct the fitness function, and AdaRBFNN, a new classifier, is formed by combining multiple trained RBF neural networks via the AdaBoost algorithm. The Kappa of each generation is calculated from the classification results to evaluate the quality of each candidate solution. Experiments are conducted on seven groups of KEEL datasets. The results show that the proposed algorithm delivers excellent performance and can significantly improve the classification accuracy of the minority class.
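
A simplified sketch of the search follows: a plain simulated-annealing loop over SMOTE's (ratio, k) scored by Cohen's Kappa, with a RandomForest standing in for the paper's SA/PSO hybrid and its AdaRBFNN ensemble.

```python
# SMOTE hyperparameter tuning by Kappa: simplified simulated-annealing sketch.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

def kappa_of(ratio, k, X_tr, y_tr, X_val, y_val, seed=0):
    try:
        X_res, y_res = SMOTE(sampling_strategy=ratio, k_neighbors=k,
                             random_state=seed).fit_resample(X_tr, y_tr)
    except ValueError:                        # k_neighbors can exceed minority size
        return -1.0
    clf = RandomForestClassifier(random_state=seed).fit(X_res, y_res)
    return cohen_kappa_score(y_val, clf.predict(X_val))

def anneal(X_tr, y_tr, X_val, y_val, steps=50, T0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    ratio, k = 0.5, 5                                   # initial candidate
    cur = best = kappa_of(ratio, k, X_tr, y_tr, X_val, y_val)
    best_params = (ratio, k)
    for t in range(steps):
        T = T0 * (1 - t / steps)                        # linear cooling schedule
        r2 = float(np.clip(ratio + rng.normal(0, 0.1), 0.1, 1.0))
        k2 = int(np.clip(k + rng.integers(-2, 3), 1, 15))
        cand = kappa_of(r2, k2, X_tr, y_tr, X_val, y_val)
        if cand > cur or rng.random() < np.exp((cand - cur) / max(T, 1e-9)):
            ratio, k, cur = r2, k2, cand                # accept the move
            if cur > best:
                best, best_params = cur, (ratio, k)
    return best_params, best
```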


2020 ◽  
Vol 24 (6) ◽  
pp. 1239-1255
Author(s):  
Xu Wu ◽  
Youlong Yang ◽  
Lingyu Ren

Class imbalance is a frequent problem in real-world datasets, where one class contains few instances while another contains many. It is notably difficult to develop an effective model with traditional data mining and machine learning algorithms without data preprocessing techniques that balance the dataset. Oversampling is often used as a preprocessing method for imbalanced datasets. In particular, synthetic oversampling techniques balance the number of training instances between the majority and minority classes by generating additional artificial minority-class instances. However, current oversampling techniques consider only the imbalance in quantity and pay no attention to whether the distribution is balanced. This paper therefore proposes an entropy-difference and kernel-based SMOTE (EDKS), which measures the imbalance of a dataset's distribution via entropy difference and overcomes SMOTE's limitation on nonlinear problems by oversampling in the feature space of a support vector machine classifier. First, EDKS maps the input data into a feature space to increase the separability of the data. It then calculates the entropy difference in the kernel space, determines the majority and minority classes, and finds the sparse regions of the minority class. Moreover, the method balances the data distribution by synthesizing new instances and evaluating their retention capability. The algorithm can effectively distinguish datasets with the same imbalance ratio but different distributions. An experimental study evaluates and compares the performance of our method against state-of-the-art algorithms and demonstrates that the proposed approach is competitive on multiple benchmark imbalanced datasets.
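
The overall shape of such a pipeline can be sketched as below, where an approximate kernel feature map (Nystroem) followed by SMOTE stands in for the paper's exact kernel-space synthesis and its entropy-difference computation, which are not reproduced here.

```python
# Kernel-space oversampling sketch (stand-in for the EDKS pipeline's shape).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

edks_like = Pipeline([
    ("phi", Nystroem(kernel="rbf", n_components=100, random_state=0)),  # feature map
    ("smote", SMOTE(random_state=0)),   # synthesize minority points in feature space
    ("clf", LinearSVC()),               # linear classifier in the mapped space
])
# Usage: edks_like.fit(X_train, y_train); edks_like.predict(X_test)
```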


Author(s):  
Kersten Schuster ◽  
Philip Trettner ◽  
Leif Kobbelt

We present a numerical optimization method to find highly efficient (sparse) approximations for convolutional image filters. Using a modified parallel tempering approach, we solve a constrained optimization that maximizes approximation quality while strictly staying within a user-prescribed performance budget. The results are multi-pass filters where each pass computes a weighted sum of bilinearly interpolated sparse image samples, exploiting hardware acceleration on the GPU. We systematically decompose the target filter into a series of sparse convolutions, trying to find good trade-offs between approximation quality and performance. Since our sparse filters are linear and translation-invariant, they do not exhibit the aliasing and temporal coherence issues that often appear in filters working on image pyramids. We show several applications, ranging from simple Gaussian or box blurs to the emulation of sophisticated Bokeh effects with user-provided masks. Our filters achieve high performance as well as high quality, often providing significant speed-up at acceptable quality even for separable filters. The optimized filters can be baked into shaders and used as a drop-in replacement for filtering tasks in image processing or rendering pipelines.
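
A toy sketch of a single sparse pass on the CPU: the output is a weighted sum of a few shifted image samples. The tap offsets and weights below are placeholders (the paper's optimizer would produce them), and integer np.roll shifts stand in for the GPU's bilinearly interpolated fractional offsets.

```python
# One pass of a sparse filter: weighted sum of a few shifted samples.
import numpy as np

def sparse_pass(img, taps):
    """img: 2-D float array; taps: list of (dy, dx, weight)."""
    out = np.zeros_like(img, dtype=float)
    for dy, dx, w in taps:
        # integer shifts here; the GPU version samples at fractional
        # offsets with hardware bilinear interpolation
        out += w * np.roll(img, shift=(dy, dx), axis=(0, 1))
    return out

# placeholder 4-tap pass roughly resembling a small blur
taps = [(0, 0, 0.4), (1, 0, 0.2), (0, 1, 0.2), (-1, -1, 0.2)]
```

Multiple such passes chained together give the multi-pass decomposition the abstract describes, with the optimizer choosing tap positions and weights per pass.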


2021 ◽  
Vol 25 (5) ◽  
pp. 1169-1185
Author(s):  
Deniu He ◽  
Hong Yu ◽  
Guoyin Wang ◽  
Jie Li

This paper considers the problem of initializing active learning, in particular in an imbalanced data scenario, which is referred to as the class-imbalance active learning cold-start problem. The proposed method is a two-stage clustering-based active learning cold-start (ALCS) approach. In the first stage, to separate the instances of the minority class from those of the majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, so the data are grouped into multiple clusters. In the second stage, the initial training instances are selected from each cluster through an adaptive mechanism for determining candidate representative instances and a clusters-cyclic instance querying mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on active learning.
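
A schematic sketch of the second stage appears below: after clustering, the most central remaining candidate of each cluster is queried in round-robin order until the label budget is spent. K-means and the distance-to-center centrality criterion are simplified stand-ins for the paper's multi-center clustering and adaptive representative-determination mechanisms.

```python
# Clusters-cyclic cold-start query sketch (simplified stand-in for ALCS stage 2).
import numpy as np
from sklearn.cluster import KMeans

def cold_start_queries(X, n_clusters, budget, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # rank each cluster's members from most to least central
    order = {}
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        order[c] = idx[np.argsort(d)]
    picked, rnd = [], 0
    max_rounds = max(len(v) for v in order.values())
    while len(picked) < budget and rnd < max_rounds:   # clusters-cyclic querying
        for c in range(n_clusters):
            if rnd < len(order[c]) and len(picked) < budget:
                picked.append(int(order[c][rnd]))
        rnd += 1
    return picked    # indices of instances to label first
```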


2013 ◽  
Vol 443 ◽  
pp. 741-745
Author(s):  
Hu Li ◽  
Peng Zou ◽  
Wei Hong Han ◽  
Rong Ze Xia

Much real-world data is imbalanced, i.e., one category contains significantly more samples than the others. Traditional classification methods treat all categories equally and are therefore often ineffective. Based on a comprehensive analysis of existing research, we propose a new clustering-based method for imbalanced data classification. The method first clusters both the majority class and the minority class. The clustered minority class is then oversampled with SMOTE, while the clustered majority class is randomly undersampled. Through clustering, the proposed method avoids losing useful information during resampling. Experiments on several UCI datasets show that the proposed method can effectively improve classification results on imbalanced data.
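
A compact sketch of the cluster-then-resample idea, with illustrative cluster counts and target sizes (the paper applies SMOTE per minority cluster; it is applied globally here for brevity):

```python
# Cluster-aware resampling sketch. Labels: 0 = majority, 1 = minority.
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def cluster_resample(X_maj, X_min, n_clusters=3, seed=0):
    rng = np.random.default_rng(seed)
    target = (len(X_maj) + len(X_min)) // 2            # per-class target size
    # undersample each majority cluster proportionally to preserve structure
    maj_labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X_maj)
    keep = []
    for c in range(n_clusters):
        idx = np.where(maj_labels == c)[0]
        n_keep = max(1, int(len(idx) / len(X_maj) * target))
        keep.append(rng.choice(idx, size=min(n_keep, len(idx)), replace=False))
    X_maj_ds = X_maj[np.concatenate(keep)]
    # oversample the minority class up to the reduced majority size
    y = np.r_[np.zeros(len(X_maj_ds)), np.ones(len(X_min))]
    X_all = np.vstack([X_maj_ds, X_min])
    return SMOTE(sampling_strategy=1.0, random_state=seed).fit_resample(X_all, y)
```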

