A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data

Author(s):  
A. Guzmán-Ponce ◽  
R. M. Valdovinos ◽  
J. S. Sánchez
Open Physics ◽  
2019 ◽  
Vol 17 (1) ◽  
pp. 975-983

Author(s):  
Jianhua Zhao ◽  
Ning Liu

Abstract: In practical applications, imbalanced data sets often contain only a small number of labeled samples. To improve classification performance on this kind of problem, this paper proposes a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification (S2MAID), which combines semi-supervised learning, over-sampling, under-sampling and ensemble learning. First, an under-sampling algorithm, UD-density, is provided to select samples with high information content from the majority class for semi-supervised learning. Second, a safe supervised learning method is used to label the unlabeled samples and expand the labeled set. Third, an over-sampling algorithm, SMOTE-density, is provided to turn the imbalanced data set into a balanced one. Fourth, an ensemble technique is used to build a strong classifier. Finally, experiments are carried out on imbalanced data containing only a few labeled samples, simulating the semi-supervised learning process. The experimental results show that the proposed S2MAID achieves better classification performance.
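
UD-density and SMOTE-density are the paper's own algorithms and are not detailed in the abstract; the following is a minimal sketch of the overall mixed-sampling pipeline, assuming standard stand-in components from scikit-learn and imbalanced-learn (RandomUnderSampler in place of UD-density, SelfTrainingClassifier in place of the safe labeling step, SMOTE in place of SMOTE-density, and BalancedBaggingClassifier as the ensemble). The synthetic data and all parameters are illustrative.

```python
# Minimal sketch of the mixed-sampling pipeline, using stand-in components:
# RandomUnderSampler replaces UD-density, SelfTrainingClassifier replaces the
# safe labeling step, SMOTE replaces SMOTE-density, and BalancedBaggingClassifier
# supplies the ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedBaggingClassifier

# Imbalanced data where most labels are hidden (-1 marks an unlabeled sample).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
y_semi = y.copy()
y_semi[rng.random(len(y)) < 0.9] = -1

# Step 1: under-sample the labeled majority class (stand-in for UD-density).
labeled = y_semi != -1
X_lab, y_lab = RandomUnderSampler(random_state=0).fit_resample(
    X[labeled], y_semi[labeled])

# Step 2: label the unlabeled samples and expand the labeled set
# (stand-in for the safe labeling step).
self_training = SelfTrainingClassifier(DecisionTreeClassifier(random_state=0))
self_training.fit(np.vstack([X_lab, X[~labeled]]),
                  np.concatenate([y_lab, np.full((~labeled).sum(), -1)]))
y_expanded = y_semi.copy()
y_expanded[~labeled] = self_training.predict(X[~labeled])

# Step 3: over-sample the minority class to balance the expanded set
# (stand-in for SMOTE-density).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y_expanded)

# Step 4: train an ensemble on the balanced data to obtain a strong classifier.
ensemble = BalancedBaggingClassifier(n_estimators=10, random_state=0)
ensemble.fit(X_bal, y_bal)
print("resubstitution accuracy:", ensemble.score(X, y))
```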


Author(s):  
Luo Ruisen ◽  
Dian Songyi ◽  
Wang Chen ◽  
Cheng Peng ◽  
Tang Zuodong ◽  
...  

Author(s):  
Mohammad Amini ◽  
Jalal Rezaeenour ◽  
Esmaeil Hadavandi

The aim of direct marketing is to find the customers who are most likely to respond to campaign messages. To detect which customers are most valuable, response modeling classifies customers as respondents or non-respondents using their purchase history or other behavioral characteristics. Data mining techniques, including effective classification methods, can be used to predict responsive customers. However, the imbalanced data inherent in response modeling makes response prediction difficult: the prediction models become biased towards non-respondent customers. Another problem is that single models cannot provide the desired accuracy because of their internal limitations. In this paper, we propose an ensemble classification method that removes the imbalance in the data using a combination of clustering and under-sampling, and combines the predictions of multiple classifiers to achieve better results. Using data from a bank's marketing campaigns, this ensemble method is implemented on different classification techniques and the results are evaluated; we also evaluate its performance against two alternative ensembles. The experimental results demonstrate that the proposed method improves the performance of response models for bank direct marketing by raising prediction accuracy and increasing the response rate.
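
The abstract does not spell out the exact clustering and combination scheme, so the sketch below shows one common way such a cluster-based under-sampling ensemble can be assembled with scikit-learn: k-means clusters the non-respondent (majority) class, equal-sized samples drawn from the clusters are paired with the respondents, one classifier is trained per balanced subset, and the members are combined by averaging predicted probabilities. The synthetic data, the choice of logistic regression, and all parameters are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's exact method): cluster the non-respondent
# majority class, under-sample from the clusters to match the respondent class,
# train one classifier per balanced subset, and average the predictions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for campaign data: 1 = respondent (rare), 0 = non-respondent.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.9, 0.1], random_state=1)
X_min, X_maj = X[y == 1], X[y == 0]

# Cluster the majority class so every balanced subset covers its structure.
n_clusters = 5
clusters = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=1).fit_predict(X_maj)

rng = np.random.default_rng(1)
models = []
for _ in range(5):
    # Draw roughly equal numbers of majority samples from each cluster so the
    # subset is about the same size as the minority class.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(clusters == c),
                   size=min(int((clusters == c).sum()), len(X_min) // n_clusters),
                   replace=False)
        for c in range(n_clusters)
    ])
    X_bal = np.vstack([X_maj[idx], X_min])
    y_bal = np.concatenate([np.zeros(len(idx)), np.ones(len(X_min))])
    models.append(LogisticRegression(max_iter=1000).fit(X_bal, y_bal))

# Combine the ensemble members by averaging predicted response probabilities.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print("ensemble AUC:", roc_auc_score(y, proba))
```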


Author(s):  
Bahram Sadeghi Bigham ◽  
Rozita Jamili Oskouei

2014 ◽  
Vol 513-517 ◽  
pp. 2510-2513 ◽  
Author(s):  
Xu Ying Liu

Real-world applications now involve large volumes of data, which poses two great challenges to class-imbalance learning: the sheer number of majority-class examples and severe class-imbalance. Previous studies on class-imbalance learning mainly focused on relatively small or moderate class-imbalance. In this paper we conduct an empirical study to explore the difference between learning with small or moderate class-imbalance and learning with severe class-imbalance. The experimental results show that: (1) Traditional methods cannot handle severe class-imbalance effectively. (2) AUC, G-mean and F-measure can be very inconsistent under severe class-imbalance, which seldom happens when class-imbalance is moderate; moreover, G-mean is not appropriate for severe class-imbalance learning because it is not sensitive to changes in the imbalance ratio. (3) When AUC and G-mean are the evaluation metrics, EasyEnsemble is the best method, followed by BalanceCascade and under-sampling. (4) Stopping a little short of full balance (i.e., not sampling the majority class all the way down to the minority size) works better for under-sampling under severe class-imbalance, and it is important to handle false positives when designing methods for severe class-imbalance.
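
As a small illustration of the metric comparison described above (not a reproduction of the study's setup), the sketch below trains a plain decision tree and imbalanced-learn's EasyEnsembleClassifier on a synthetic, severely imbalanced (roughly 1:100) data set and reports AUC, G-mean and F-measure for each. BalanceCascade has been removed from current imbalanced-learn releases and is therefore omitted; the data and parameters are illustrative.

```python
# Compare a traditional classifier with EasyEnsemble on a roughly 1:100 data set
# and report AUC, G-mean and F-measure for each.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.99, 0.01], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

models = {
    "decision tree (traditional)": DecisionTreeClassifier(random_state=2),
    "EasyEnsemble": EasyEnsembleClassifier(n_estimators=10, random_state=2),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_score = model.predict_proba(X_te)[:, 1]
    print(f"{name}: "
          f"AUC={roc_auc_score(y_te, y_score):.3f}  "
          f"G-mean={geometric_mean_score(y_te, y_pred):.3f}  "
          f"F-measure={f1_score(y_te, y_pred):.3f}")
```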

