Improvising Balancing Methods for Classifying Imbalanced Data

Author(s): Himani Tiwari

Abstract: The class imbalance problem is one of the most challenging problems faced by the machine learning community. The imbalance refers to the number of instances in one class being relatively low compared with the other classes. A number of over-sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines various balancing methods for dealing with this problem. To illustrate the differences, an experiment is conducted using multiple simulated data sets to compare the performance of these over-sampling methods on different classifiers based on various evaluation criteria. In addition, the effect of different parameters, such as the number of features and the imbalance ratio, on classifier performance is also evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performance, Evaluation metrics
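As a rough illustration of the kind of comparison this abstract describes (not the authors' actual experimental setup), the sketch below uses scikit-learn and imbalanced-learn to compare no resampling, random over-sampling, and SMOTE on two classifiers over a simulated imbalanced data set; the sample size, number of features, and imbalance ratio are arbitrary assumptions.

```python
# Hedged sketch: compare over-sampling methods on different classifiers.
# Requires scikit-learn and imbalanced-learn; dataset parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)  # roughly 9:1 imbalance ratio (assumed)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

samplers = {"none": None,
            "random_over": RandomOverSampler(random_state=42),
            "smote": SMOTE(random_state=42)}
classifiers = {"logreg": LogisticRegression(max_iter=1000),
               "tree": DecisionTreeClassifier(random_state=42)}

for s_name, sampler in samplers.items():
    Xs, ys = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    for c_name, clf in classifiers.items():
        clf.fit(Xs, ys)
        proba = clf.predict_proba(X_te)[:, 1]
        print(s_name, c_name,
              "F1=%.3f" % f1_score(y_te, clf.predict(X_te)),
              "AUC=%.3f" % roc_auc_score(y_te, proba))
```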

2019, Vol 8 (2), pp. 2463-2468

Learning from class-imbalanced data is a challenging issue in the machine learning community, as most classification algorithms are designed to work on balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the most flexible and versatile. This paper introduces a new undersampling concept based on the centre-of-gravity principle, which helps reduce the excess instances of the majority class. The work is suited to binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and yields the best results in the study. F-score, G-mean, and ROC are used for the performance evaluation of the method.
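The abstract does not spell out how the centre of gravity is used to select majority instances, so the sketch below assumes one plausible reading: compute the centroid of the majority class and keep only the majority instances nearest to it, as many as there are minority instances. This is an illustrative assumption, not the published CoGBUS rule.

```python
# Hedged sketch of centroid ("centre of gravity")-based undersampling for a
# binary problem. The selection rule (keep majority points nearest the centroid)
# is an assumption; the paper's CoGBUS rule may differ.
import numpy as np

def cog_undersample(X, y, majority_label=0):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    centroid = X[maj_idx].mean(axis=0)                 # centre of gravity of the majority class
    dists = np.linalg.norm(X[maj_idx] - centroid, axis=1)
    keep = maj_idx[np.argsort(dists)[:len(min_idx)]]   # retain as many as the minority size
    sel = np.concatenate([keep, min_idx])
    return X[sel], y[sel]

# Usage: X_bal, y_bal = cog_undersample(X_train, y_train, majority_label=0)
```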


2018, Vol 2018, pp. 1-13
Author(s): Jianhong Yan, Suqing Han

Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking ensembles are efficient algorithms for normally balanced data sets, but stacking has seldom been applied to imbalanced data. In this paper, we propose a novel Re-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a two-layer learning model. The first step is Level-0 model generalization, including data preprocessing and base model training. The second step is Level-1 model generalization, involving a cost-sensitive classifier and a logistic regression algorithm. In the learning phase, preprocessing techniques are embedded in the imbalanced-data learning methods. In the cost-sensitive algorithm, the cost matrix is combined with both the data characteristics and the algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced-data techniques. According to experimental results obtained on 17 public imbalanced data sets and various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed better classification performance than other ensemble and single algorithms. The proposed method is especially effective when the performance of the base classifier is low. These results demonstrate that the proposed method can be applied to the class imbalance problem.
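A minimal sketch of the general two-layer idea (resampled base learners at Level 0, a cost-sensitive logistic regression combiner at Level 1) is given below. It uses scikit-learn's StackingClassifier with imbalanced-learn pipelines; class_weight="balanced" merely stands in for the paper's cost matrix, so this is an analogue of RECSG, not the algorithm itself.

```python
# Hedged sketch of a two-layer "resample + cost-sensitive" stack, loosely
# analogous to the RECSG idea. class_weight stands in for a cost matrix; the
# actual RECSG preprocessing and cost assignment are more involved.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Level-0: each base model is preceded by SMOTE resampling.
level0 = [
    ("smote_tree", Pipeline([("smote", SMOTE(random_state=0)),
                             ("clf", DecisionTreeClassifier(random_state=0))])),
    ("smote_svc", Pipeline([("smote", SMOTE(random_state=0)),
                            ("clf", SVC(probability=True, class_weight="balanced"))])),
]

# Level-1: a cost-sensitive logistic regression combines the Level-0 probabilities.
stack = StackingClassifier(
    estimators=level0,
    final_estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)

# Usage: stack.fit(X_train, y_train); y_pred = stack.predict(X_test)
```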


2018, Vol 27 (06), pp. 1850025
Author(s): Huaping Guo, Jun Zhou, Chang-an Wu, Wei She

Class imbalance is very common in the real world, yet conventional methods do not work well on imbalanced data because of the skewed class distribution. This paper proposes a simple but effective Hybrid-based Ensemble (HE) to deal with the two-class imbalance problem. HE learns a hybrid ensemble in two stages: (1) it learns several projection matrices from the rebalanced data obtained by under-sampling the original training set and constructs new training sets by projecting the original training set into the different spaces defined by these matrices, and (2) it under-samples several subsets from each new training set and trains a model on each subset. Here, feature projection aims to improve the diversity between ensemble members, while the under-sampling technique improves the generalization ability of individual members on the minority class. Experimental results show that, compared with other state-of-the-art methods, HE achieves significantly better performance on the AUC, G-mean, F-measure, and recall measures.
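The following sketch illustrates the two-stage structure described above under simplifying assumptions: Gaussian random projections stand in for the projection matrices that HE learns from rebalanced data, decision trees are the (arbitrary) base models, and plain majority voting aggregates the members.

```python
# Hedged sketch of the "project, then under-sample and bag" idea. HE learns its
# projection matrices from rebalanced data; random projections are used here
# purely as a stand-in, and majority voting aggregates the ensemble members.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

class SimpleHybridEnsemble:
    def __init__(self, n_projections=3, n_subsets=5, random_state=0):
        self.n_projections, self.n_subsets = n_projections, n_subsets
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        self.members_ = []
        for _ in range(self.n_projections):
            proj = GaussianRandomProjection(n_components=X.shape[1],
                                            random_state=rng.randint(1 << 30)).fit(X)
            Xp = proj.transform(X)                       # new training set in projected space
            for _ in range(self.n_subsets):
                rus = RandomUnderSampler(random_state=rng.randint(1 << 30))
                Xs, ys = rus.fit_resample(Xp, y)          # balanced subset of the projected data
                clf = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
                self.members_.append((proj, clf))
        return self

    def predict(self, X):
        votes = np.array([clf.predict(proj.transform(X)) for proj, clf in self.members_])
        return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote, assumes 0/1 labels
```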


2020, Vol 10 (22), pp. 8059
Author(s): Haonan Tong, Shihai Wang, Guangling Li

Imbalanced data are a major factor degrading the performance of software defect models. Software defect datasets are imbalanced in nature, i.e., the number of non-defect-prone modules is far larger than that of defect-prone ones, which biases classifiers toward the majority-class samples. In this paper, we propose a novel credibility-based imbalance boosting (CIB) method to address the class-imbalance problem in software defect proneness prediction. The method measures the credibility of synthetic samples based on their distribution by introducing a credit factor for every synthetic sample, and it proposes a weight-updating scheme that makes the base classifiers focus on synthetic samples with high credibility and on real samples. Experiments are performed on 11 NASA datasets and nine PROMISE datasets, comparing CIB with MAHAKIL, AdaC2, AdaBoost, SMOTE, RUS, and no sampling in terms of four performance measures: area under the curve (AUC), F1, AGF, and the Matthews correlation coefficient (MCC). The Wilcoxon signed-rank test and Cliff's δ are used to perform statistical testing and to calculate effect sizes, respectively. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect proneness prediction than previous methods.
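A heavily simplified sketch of the credibility idea is shown below: synthetic SMOTE samples receive a credit factor based on their distance to the nearest real minority sample, and the credits are passed to a boosting learner as initial sample weights. Both the credit formula and the reliance on AdaBoost's built-in weight update are assumptions; CIB defines its own credit factor and weight-updating scheme.

```python
# Hedged sketch of credibility-weighted boosting. The credit factor here
# (inverse distance to the nearest real minority sample) and its use as an
# initial sample weight are simplifications, not CIB's actual formulas.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

def fit_credibility_boost(X, y, minority_label=1, random_state=0):
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X, y)
    n_real = len(X)                        # imblearn appends synthetic rows after the originals
    X_syn = X_res[n_real:]
    real_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=1).fit(real_min)
    d, _ = nn.kneighbors(X_syn)
    credit = 1.0 / (1.0 + d.ravel())       # closer to real minority data -> higher credibility
    weights = np.concatenate([np.ones(n_real), credit])
    weights /= weights.sum()
    clf = AdaBoostClassifier(random_state=random_state)
    clf.fit(X_res, y_res, sample_weight=weights)
    return clf
```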


Author(s): Shaojian Qiu, Lu Lu, Siyu Jiang, Yang Guo

Machine-learning-based software defect prediction (SDP) methods are receiving great attention from researchers in intelligent software engineering. Most existing SDP methods are applied in a within-project setting. However, there is usually little to no within-project training data from which to learn a supervised prediction model for a new SDP task. Therefore, cross-project defect prediction (CPDP), which uses labeled data from source projects to learn a defect predictor for a target project, was proposed as a practical SDP solution. In real CPDP tasks, the class imbalance problem is ubiquitous and has a great impact on the performance of CPDP models. Unlike previous studies that focus on subsampling and individual methods, this study investigated 15 imbalanced learning methods for CPDP tasks, with particular attention to assessing the effectiveness of imbalanced ensemble learning (IEL) methods. We evaluated the 15 methods through extensive experiments on 31 open-source projects drawn from five datasets. After analyzing a total of 37,504 results, we found that in most cases the IEL methods that combine under-sampling with bagging are more effective than the other investigated methods.
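The combination the study found most effective, under-sampling plus bagging, corresponds closely to imbalanced-learn's BalancedBaggingClassifier. The sketch below shows how such a model might be trained on labeled source-project data and evaluated on a target project; the data split and metric choice are illustrative, not the study's protocol.

```python
# Hedged sketch of cross-project defect prediction with an under-sampling +
# bagging ensemble. The source/target feature matrices are placeholders, not
# the study's datasets.
from sklearn.metrics import roc_auc_score
from imblearn.ensemble import BalancedBaggingClassifier

def cross_project_predict(X_source, y_source, X_target, y_target):
    # Train on labeled source projects; each bagging member sees an
    # under-sampled (balanced) bootstrap of the source data.
    model = BalancedBaggingClassifier(n_estimators=50, random_state=0)
    model.fit(X_source, y_source)
    # Evaluate on the held-out target project.
    scores = model.predict_proba(X_target)[:, 1]
    return model.predict(X_target), roc_auc_score(y_target, scores)
```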


2016, Vol 2016, pp. 1-9
Author(s): Zhenbing Liu, Chunyang Gao, Huihua Yang, Qijia He

Sparse representation has been used successfully in pattern recognition and machine learning. However, most existing sparse representation based classification (SRC) methods aim to achieve the highest classification accuracy, assuming the same loss for every type of misclassification. This assumption may not hold in many practical applications, as different types of misclassification can lead to different losses; moreover, many real-world data sets have imbalanced class distributions. To address these problems, we propose a cost-sensitive sparse representation based classification (CSSRC) method for the class-imbalance problem using probabilistic modeling. Unlike traditional SRC methods, we predict the class label of a test sample by minimizing the expected misclassification loss, which is obtained by computing the posterior probabilities. Experimental results on UCI databases validate the efficacy of the proposed approach in terms of average misclassification cost, positive-class misclassification rate, and negative-class misclassification rate. In addition, we sampled test and training sets with different imbalance ratios and used F-measure, G-mean, classification accuracy, and running time to evaluate the performance of the proposed method. The experiments show that our proposed method performs competitively compared with SRC, CSSVM, and CS4VM.
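The decision rule described above (choose the label that minimizes the expected misclassification loss computed from posterior probabilities) is standard Bayes-risk minimization and can be sketched independently of the sparse-representation model that produces the posteriors. The cost-matrix values below are illustrative assumptions.

```python
# Hedged sketch of the cost-sensitive decision rule: given posterior class
# probabilities (from any probabilistic model; the paper derives them from a
# sparse representation), pick the label with minimum expected misclassification
# cost. The cost-matrix values are illustrative only.
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i
cost = np.array([[0.0, 1.0],    # true negative (majority) class
                 [5.0, 0.0]])   # true positive (minority) class: costly to miss

def cost_sensitive_predict(posteriors, cost_matrix):
    """posteriors: (n_samples, n_classes) array of P(class | x)."""
    expected_cost = posteriors @ cost_matrix    # entry [n, j] = sum_i P(i|x_n) * cost[i, j]
    return expected_cost.argmin(axis=1)

# Usage with any classifier exposing predict_proba:
#   y_pred = cost_sensitive_predict(clf.predict_proba(X_test), cost)
```

With a symmetric 0-1 cost matrix this rule reduces to picking the most probable class, which is exactly the accuracy-oriented behaviour the abstract contrasts against.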


2019, Vol 24 (2), pp. 104-110
Author(s): Duygu Sinanc Terzi, Seref Sagiroglu

Abstract: The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design modifies the existing dataset to increase classification success. In the study, DIBID was applied to public datasets under two strategies. The first strategy demonstrates the success of the model on data sets with different imbalance ratios, and the second compares the model with other imbalanced big-data solutions in the literature. According to the results, DIBID outperformed the other imbalanced big-data solutions in the literature and increased area-under-the-curve values by between 10% and 24% in the case study.
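DIBID itself is a distributed MapReduce design, which is not reproduced here; the single-machine sketch below only illustrates the underlying cluster-based resampling idea: cluster the majority class with k-means and draw a per-cluster quota of samples so the retained majority points preserve the original cluster structure.

```python
# Hedged sketch of cluster-based undersampling (a single-machine analogue of the
# idea behind DIBID; the actual design is a distributed MapReduce job).
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, majority_label=0, n_clusters=10, random_state=0):
    rng = np.random.RandomState(random_state)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    target = len(min_idx)                               # balance down to the minority size
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[maj_idx])
    keep = []
    for c in range(n_clusters):
        members = maj_idx[labels == c]
        # Each cluster contributes a quota proportional to its share of the majority class.
        quota = max(1, int(round(target * len(members) / len(maj_idx))))
        keep.extend(rng.choice(members, size=min(quota, len(members)), replace=False))
    sel = np.concatenate([np.array(keep, dtype=int), min_idx])
    return X[sel], y[sel]
```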


2019, Vol 490 (4), pp. 5424-5439
Author(s): Ping Guo, Fuqing Duan, Pei Wang, Yao Yao, Qian Yin, ...

ABSTRACT: Discovering pulsars is a significant and meaningful research topic in the field of radio astronomy. With the advent of new astronomical instruments, the volume and rate of data acquisition have grown exponentially. This development necessitates a focus on artificial intelligence (AI) technologies that can mine large astronomical data sets. Automatic pulsar candidate identification (APCI) can be considered the task of determining potential candidates for further investigation and eliminating the noise of radio-frequency interference and other non-pulsar signals. As reported in the existing literature, AI techniques, especially convolutional neural network (CNN)-based techniques, have been adopted for APCI. However, it is challenging to enhance the performance of CNN-based pulsar identification because only an extremely limited number of real pulsar samples exist, which results in a severe class imbalance problem. To address this problem, we propose a framework that combines a deep convolutional generative adversarial network (DCGAN) with a support vector machine (SVM). The DCGAN is used as a sample generation and feature learning model, and the SVM is adopted as the classifier for predicting the label of a candidate at the inference stage. The proposed framework is a novel technique that not only alleviates the class imbalance problem but also learns discriminative feature representations of pulsar candidates instead of computing hand-crafted features in the pre-processing steps. The proposed method can enhance the accuracy of APCI, and computer experiments performed on two pulsar data sets verify the effectiveness and efficiency of the proposed method.
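A compact sketch of the classification stage only is given below: a small convolutional network shaped like a DCGAN discriminator (PyTorch assumed, with an assumed 1x64x64 input) supplies feature vectors that are fed to an SVM. The adversarial training that the paper uses to learn these weights and to generate additional minority-class samples is omitted, so this is a structural illustration rather than the proposed method.

```python
# Hedged sketch of the "discriminator features + SVM" stage only. The network
# mimics a small DCGAN discriminator; in the paper its weights come from
# adversarial training, which is omitted here, and the DCGAN generator that
# synthesizes minority-class candidates is likewise not shown.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class DiscriminatorFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                  # assumes 1x64x64 candidate images
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.features(x)                         # 128-dim feature vector per candidate

def extract_and_classify(images, labels):
    """images: float tensor (N, 1, 64, 64); labels: array of 0/1 pulsar flags."""
    net = DiscriminatorFeatures().eval()                # assume weights loaded from DCGAN training
    with torch.no_grad():
        feats = net(images).numpy()
    svm = SVC(kernel="rbf", class_weight="balanced").fit(feats, labels)
    return svm
```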

