SMOTEMultiBoost: Leveraging the SMOTE with MultiBoost to Confront the Class Imbalance in Supervised Learning

Class imbalance problem is being manifoldly confronted by researchers due to the increasing amount of complicated data. Common classification algorithms are impoverished to perform effectively on imbalanced datasets. Larger class cases typically outbalance smaller class cases in class imbalance learning. Common classification algorithms raise larger class performance owing to class imbalance in data and overall improvement in accuracy as their goal while lowering performance on smaller class. Furthermore, these algorithms deal false positive and false negative in an even way and regard equal cost of misclassifying cases. Meanwhile, different ensemble solutions have been proposed over the years for class imbalance learning but these approaches hamper the performance of larger class as emphasizing on the small class cases. The intuition of this overall degraded outcome would be the low diversity in ensemble solutions and overfitting or underfitting in data resampling techniques. To overcome these problems, we suggest a hybrid ensemble method by leveraging MultiBoost ensemble and Synthetic Minority Over-sampling TEchnique (SMOTE). Our suggested solution leverage the effectiveness of its elements. Therefore, it improves the outcome of the smaller class by reinforcing its space and limiting error in prediction. The proposed method shows improved performance as compare to numerous other algorithms and techniques in experiments.

Download Full-text

HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488280 ◽

2022 ◽

Vol 16 (3) ◽

pp. 1-37

Author(s):

Robert A. Sowah ◽

Bernard Kuditchar ◽

Godfrey A. Mills ◽

Amevi Acakpovi ◽

Raphael A. Twum ◽

...

Keyword(s):

Geometric Mean ◽

Class Imbalance ◽

Sampling Technique ◽

Data Repository ◽

Support Vector ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

High Degree ◽

Hybrid Sampling

Class imbalance problem is prevalent in many real-world domains. It has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset with a high degree of skewness to the negative class. This phenomenon causes classification algorithms to perform woefully when predicting positive classes with new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that improves the overall performance of classification algorithms for solving the class imbalance problem significantly. The proposed method called the Hybrid Cluster-Based Undersampling Technique (HCBST) uses a combination of the cluster undersampling technique to under-sample the majority instances and an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, to oversample the minority instances to solve the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets from the National Aeronautics and Space Administration Metric Data Program data repository and University of California Irvine Machine Learning data repository with varying degrees of imbalance. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Tests results revealed that for the same datasets, the HCBST performed better with average performances of 0.73, 0.67, and 0.35 in terms of performance measures of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential of improving the performance of the class imbalance problem, which by extension, will improve on the various applications that rely on the concept for a solution.

Download Full-text

Comparing the Behavior of Oversampling and Undersampling Approach of Class Imbalance Learning by Combining Class Imbalance Problem with Noise

Advances in Intelligent Systems and Computing - ICT Based Innovations ◽

10.1007/978-981-10-6602-3_3 ◽

2017 ◽

pp. 23-30 ◽

Cited By ~ 11

Author(s):

Prabhjot Kaur ◽

Anjana Gosain

Keyword(s):

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Imbalance Learning ◽

Class Imbalance Learning

Download Full-text

Class Imbalance Learning to Heterogeneous Cross Software Projects Defect Prediction

International Journal of Software Innovation ◽

10.4018/ijsi.292021 ◽

2022 ◽

Vol 10 (1) ◽

pp. 0-0

Keyword(s):

Research Work ◽

Class Imbalance ◽

Training Dataset ◽

Software Projects ◽

Class Imbalance Problem ◽

Software Application ◽

Imbalance Problem ◽

Under Sampling ◽

Imbalance Learning ◽

Class Imbalance Learning

Heterogeneous CPDP (HCPDP) attempts to forecast defects in a software application having insufficient previous defect data. Nonetheless, with a Class Imbalance Problem (CIP) perspective, one should have a clear view of data distribution in the training dataset otherwise the trained model would lead to biased classification results. Class Imbalance Learning (CIL) is the method of achieving an equilibrium ratio between two classes in imbalanced datasets. There are a range of effective solutions to manage CIP such as resampling techniques like Over-Sampling (OS) & Under-Sampling (US) methods. The proposed research work employs Synthetic Minority Oversampling TEchnique (SMOTE) and Random Under Sampling (RUS) technique to handle CIP. In addition to this, the paper proposes a novel four-phase HCPDP model and contrasts the efficiency of basic HCPDP model with CIP and after handling CIP using SMOTE & RUS with three prediction pairs. Results show that training performance with SMOTE is substantially improved but RUS displays variations in relation to HCPDP for all three prediction pairs.

Download Full-text

A New Diversity Technique for Imbalance Learning Ensembles

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.11251 ◽

2018 ◽

Vol 7 (2.14) ◽

pp. 478 ◽

Cited By ~ 2

Author(s):

Hartono . ◽

Opim Salim Sitompul ◽

Erna Budhiarti Nababan ◽

Tulus . ◽

Dahlan Abdullah ◽

...

Keyword(s):

Hybrid Approach ◽

Class Imbalance ◽

Machine Learning Techniques ◽

Classifier Ensembles ◽

Classification Problems ◽

Class Imbalance Problem ◽

Weighting Method ◽

Imbalance Problem ◽

Learning Ensembles ◽

Imbalance Learning

Data mining and machine learning techniques designed to solve classification problems require balanced class distribution. However, in reality sometimes the classification of datasets indicates the existence of a class represented by a large number of instances whereas there are classes with far fewer instances. This problem is known as the class imbalance problem. Classifier Ensembles is a method often used in overcoming class imbalance problems. Data Diversity is one of the cornerstones of ensembles. An ideal ensemble system should have accurrate individual classifiers and if there is an error it is expected to occur on different objects or instances. This research will present the results of overview and experimental study using Hybrid Approach Redefinition (HAR) Method in handling class imbalance and at the same time expected to get better data diversity. This research will be conducted using 6 datasets with different imbalanced ratios and will be compared with SMOTEBoost which is one of the Re-Weighting method which is often used in handling class imbalance. This study shows that the data diversity is related to performance in the imbalance learning ensembles and the proposed methods can obtain better data diversity.

Download Full-text

CoGBUS- Center of Gravity based under Sampling Method for Imbalanced Data Classification

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2077.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 2463-2468

Keyword(s):

Learning Community ◽

Sampling Method ◽

Class Imbalance ◽

Imbalanced Data ◽

Center Of Gravity ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Imbalanced Data Classification ◽

Under Sampling

Learning of class imbalanced data becomes a challenging issue in the machine learning community as all classification algorithms are designed to work for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques- undersampling and oversampling are more flexible and versatile. This paper introduces a new concept for undersampling based on Center of Gravity principle which helps to reduce the excess instances of majority class. This work is suited for binary class problems. The proposed technique –CoGBUS- overcomes the class imbalance problem and brings best results in the study. We take F-Score, GMean and ROC for the performance evaluation of the method.

Download Full-text

Manifold Distance-Based Over-Sampling Technique for Class Imbalance Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.330110071 ◽

2019 ◽

Vol 33 ◽

pp. 10071-10072

Author(s):

Lingkai Yang ◽

Yinan Guo ◽

Jian Cheng

Keyword(s):

Classification Accuracy ◽

Euclidean Distance ◽

Class Imbalance ◽

Sampling Technique ◽

Original Data ◽

Data Space ◽

Space Experiments ◽

Dataset Size ◽

Imbalance Learning ◽

Class Imbalance Learning

Over-sampling technology for handling the class imbalanced problem generates more minority samples to balance the dataset size of different classes. However, sampling in original data space is ineffective as the data in different classes is overlapped or disjunct. Based on this, a new minority sample is presented in terms of the manifold distance rather than Euclidean distance. The overlapped majority and minority samples apt to distribute in fully disjunct subspaces from the view of manifold learning. Moreover, it can avoid generating samples between the minority data locating far away in manifold space. Experiments on 23 UCI datasets show that the proposed method has the better classification accuracy.

Download Full-text

RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem

Information Sciences ◽

10.1016/j.ins.2020.07.014 ◽

2021 ◽

Vol 542 ◽

pp. 92-111

Author(s):

Paria Soltanzadeh ◽

Mahdi Hashemzadeh

Keyword(s):

Class Imbalance ◽

Sampling Technique ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

Class Imbalanced Learning Menggunakan Algoritma Synthetic Minority Over-sampling Technique – Nominal (SMOTE-N) pada Dataset Tuberculosis Anak

Jurnal Buana Informatika ◽

10.24002/jbi.v10i2.2441 ◽

2019 ◽

Vol 10 (2) ◽

pp. 134

Author(s):

Yulia Ery Kurniawati

Keyword(s):

Naive Bayes ◽

Class Imbalance ◽

Sampling Technique ◽

Naïve Bayes ◽

Bayes Classifier ◽

Imbalanced Learning ◽

Naïve Bayes Classifier ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Data Level

Class Imbalance Learning (CIL) merupakan proses pembelajaran untuk representasi data dan ekstraksi informasi dengan distribusi data yang buruk untuk mendukung pembuatan keputusan yang efektif dalam proses pengambilan keputusan. SMOTE-N adalah salah satu pendekatan data-level dalam CIL mengunakan metode over-sampling. SMOTE-N menghasilkan instance sintesis untuk menyeimbangkan jumlah instance pada kelas minoritasnya. Penelitian ini mengaplikasikan SMOTE-N pada dataset Tuberculosis Anak (TB Anak) yang memiliki ketidakseimbangan kelas. Metode over-sampling dipilih untuk menghindari kehilangan informasi yang penting dikarenakan dataset TB Anak memiliki jumlah instance yang sedikit. Naïve Bayes Classifier digunakan untuk mengevaluasi model dari dataset yang sudah seimbang. Hasilnya menunjukkan bahwa SMOTE-N dapat meningkatkan kinerja pada CIL.

Download Full-text

A Method for Class-Imbalance Learning in Android Malware Detection

Electronics ◽

10.3390/electronics10243124 ◽

2021 ◽

Vol 10 (24) ◽

pp. 3124

Author(s):

Jun Guan ◽

Xu Jiang ◽

Baolei Mao

Keyword(s):

Machine Learning ◽

Malware Detection ◽

Computational Cost ◽

Class Imbalance ◽

Sampling Technique ◽

Minority Class ◽

Android Malware ◽

Android Malware Detection ◽

Imbalance Learning ◽

Class Imbalance Learning

More and more Android application developers are adopting many different methods against reverse engineering, such as adding a shell, resulting in certain features that cannot be obtained through decompilation, which causes a serious sample imbalance in Android malware detection based on machine learning. Hence, the researchers have focused on how to solve class-imbalance to improve the performance of Android malware detection. However, the disadvantages of the existing class-imbalance learning are mainly the loss of valuable samples and the computational cost. In this paper, we propose a method of Class-Imbalance Learning (CIL), which first selects representative features, uses the clustering K-Means algorithm and under-sampling to retain the important samples of the majority class while reducing the number of samples of the majority class. After that, we use the Synthetic Minority Over-Sampling Technique (SMOTE) algorithm to generate minority class samples for data balance, and finally use the Random Forest (RF) algorithm to build a malware detection model. The result of experiments indicates that CIL effectively improves the performance of Android malware detection based on machine learning, especially for class imbalance. Compared with existing class-imbalance learning methods, CIL is also effective for the Machine Learning Repository from the University of California, Irvine (UCI) and has better performance in some data sets.

Download Full-text

Class Imbalance Learning

10.34048/2017.1.f1 ◽

2017 ◽

Author(s):

Sudarsun Santhiappan ◽

Balaraman Ravindran

Keyword(s):

Machine Learning ◽

Real World ◽

Class Imbalance ◽

Classification Problem ◽

Classification Algorithms ◽

Challenges And Opportunities ◽

Data Points ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Real World Problems

Data classiﬁcation task assigns labels to data points using a model that is learned from a collection of pre-labeled data points. The Class Imbalance Learning (CIL) problem is concerned with the performance of classiﬁcation algorithms in the presence of under-represented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced datasets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data effciently into information and knowledge representation. It is important to study CIL because it is rare to ﬁnd a classiﬁcation problem in real world scenarios that follows balanced class distributions. In this article, we have presented how machine learning has become the integral part of modern lifestyle and how some of the real world problems are modeled as CIL problems. We have also provided a detailed survey on the fundamentals and solutions to class imbalance learning. We conclude the survey by presenting some of the challenges and opportunities with class imbalance learning.

Download Full-text