scholarly journals CoGBUS- Center of Gravity based under Sampling Method for Imbalanced Data Classification

2019 ◽  
Vol 8 (2) ◽  
pp. 2463-2468

Learning of class imbalanced data becomes a challenging issue in the machine learning community as all classification algorithms are designed to work for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques- undersampling and oversampling are more flexible and versatile. This paper introduces a new concept for undersampling based on Center of Gravity principle which helps to reduce the excess instances of majority class. This work is suited for binary class problems. The proposed technique –CoGBUS- overcomes the class imbalance problem and brings best results in the study. We take F-Score, GMean and ROC for the performance evaluation of the method.

Author(s):  
Himani Tiwari

Abstract: Class Imbalance problem is one of the most challenging problems faced by the machine learning community. As we refer the imbalance to various instances in class of being relatively low as compare to other data. A number of over - sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the issue of class imbalance and attempts to examine various balancing methods for dealing with this problem. In order to illustrate the differences, an experiment is conducted using multiple simulated data sets for comparing the performance of these oversampling methods on different classifiers based on various evaluation criteria. In addition, the effect of different parameters, such as number of features and imbalance ratio, on the classifier performance is also evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performances, Evaluationmetrices


2018 ◽  
Vol 2018 ◽  
pp. 1-13 ◽  
Author(s):  
Jianhong Yan ◽  
Suqing Han

Learning with imbalanced data sets is considered as one of the key topics in machine learning community. Stacking ensemble is an efficient algorithm for normal balance data sets. However, stacking ensemble was seldom applied in imbalance data. In this paper, we proposed a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on 2-layer learning models. The first step is Level 0 model generalization including data preprocessing and base model training. The second step is Level 1 model generalization involving cost-sensitive classifier and logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalance data learning methods. In the cost-sensitive algorithm, cost matrix is combined with both data characters and algorithms. In the RECSG method, ensemble algorithm is combined with imbalance data techniques. According to the experiment results obtained with 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed the better classification performances than other ensemble and single algorithms. The proposed method is especially more efficient when the performance of base classifier is low. All these demonstrated that the proposed method could be applied in the class imbalance problem.


2018 ◽  
Vol 27 (06) ◽  
pp. 1850025 ◽  
Author(s):  
Huaping Guo ◽  
Jun Zhou ◽  
Chang-an Wu ◽  
Wei She

Class-imbalance is very common in real world. However, conventional advanced methods do not work well on imbalanced data due to imbalanced class distribution. This paper proposes a simple but effective Hybrid-based Ensemble (HE) to deal with two-class imbalanced problem. HE learns a hybrid ensemble using the following two stages: (1) learning several projection matrixes from the rebalanced data obtained by under-sampling the original training set and constructing new training sets by projecting the original training set to different spaces defined by the matrixes, and (2) undersampling several subsets from each new training set and training a model on each subset. Here, feature projection aims to improve the diversity between ensemble members and under-sampling technique is to improve generalization ability of individual members on minority class. Experimental results show that, compared with other state-of-the-art methods, HE shows significantly better performance on measures of AUC, G-mean, F-measure and recall.


Author(s):  
Bo Huang ◽  
Yimin Zhu ◽  
Zhongzhen Wang ◽  
Zhijun Fang

The class-imbalance learning is one of the most significant research topics in the data mining and machine learning. Imbalance problem means that one of the classes has much more samples than that of other classes. To deal with the issues of low classification accuracy and high time complexity, this paper proposes an novel imbalance data classification algorithm based on clustering and SVM. The algorithm suggests under-sampling in majority samples based on the distribution characteristics of minority samples. First, specific clusters are detected by cluster analysis on the minority. Second, a cluster boundary strategy is proposed to eliminate the bad influence of noise samples. To structure a balanced dataset for imbalance data, this paper proposes three principles of under-sampling on majority samples according to the characteristic of samples in the cluster. Finally, the optimal classification model from the linear combination of hybrid-kernel SVM is obtained. The experiments based on datasets in UCI and KEEL database show that our algorithm effectively decreases the interference of noise samples. Compared with the SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension, but also improves the precision of the minor classes under the different labeled sample rates generally.


2020 ◽  
Vol 10 (22) ◽  
pp. 8059
Author(s):  
Haonan Tong ◽  
Shihai Wang ◽  
Guangling Li

Imbalanced data are a major factor for degrading the performance of software defect models. Software defect dataset is imbalanced in nature, i.e., the number of non-defect-prone modules is far more than that of defect-prone ones, which results in the bias of classifiers on the majority class samples. In this paper, we propose a novel credibility-based imbalance boosting (CIB) method in order to address the class-imbalance problem in software defect proneness prediction. The method measures the credibility of synthetic samples based on their distribution by introducing a credit factor to every synthetic sample, and proposes a weight updating scheme to make the base classifiers focus on synthetic samples with high credibility and real samples. Experiments are performed on 11 NASA datasets and nine PROMISE datasets by comparing CIB with MAHAKIL, AdaC2, AdaBoost, SMOTE, RUS, No sampling method in terms of four performance measures, i.e., area under the curve (AUC), F1, AGF, and Matthews correlation coefficient (MCC). Wilcoxon sign-ranked test and Cliff’s δ are separately used to perform statistical test and calculate effect size. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect-prone prediction as compared with previous methods.


Author(s):  
Shaojian Qiu ◽  
Lu Lu ◽  
Siyu Jiang ◽  
Yang Guo

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from the researchers of intelligent software engineering. Most existing SDP methods are performed under a within-project setting. However, there usually is little to no within-project training data to learn an available supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses labeled data of source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on performance of the CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, especially for assessing the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods by extensive experiments on 31 open-source projects derived from five datasets. Through analyzing a total of 37504 results, we found that in most cases, the IEL method that combined under-sampling and bagging approaches will be more effective than the other investigated methods.


2022 ◽  
Vol 16 (3) ◽  
pp. 1-37
Author(s):  
Robert A. Sowah ◽  
Bernard Kuditchar ◽  
Godfrey A. Mills ◽  
Amevi Acakpovi ◽  
Raphael A. Twum ◽  
...  

Class imbalance problem is prevalent in many real-world domains. It has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset with a high degree of skewness to the negative class. This phenomenon causes classification algorithms to perform woefully when predicting positive classes with new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that improves the overall performance of classification algorithms for solving the class imbalance problem significantly. The proposed method called the Hybrid Cluster-Based Undersampling Technique (HCBST) uses a combination of the cluster undersampling technique to under-sample the majority instances and an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, to oversample the minority instances to solve the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets from the National Aeronautics and Space Administration Metric Data Program data repository and University of California Irvine Machine Learning data repository with varying degrees of imbalance. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Tests results revealed that for the same datasets, the HCBST performed better with average performances of 0.73, 0.67, and 0.35 in terms of performance measures of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential of improving the performance of the class imbalance problem, which by extension, will improve on the various applications that rely on the concept for a solution.


Sign in / Sign up

Export Citation Format

Share Document