Combining Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) and Hybrid Sampling in Handling Multi-Class Imbalance and Overlapping

Author(s):  
Hartono Hartono ◽  
Erianto Ongko

Class imbalance is one of the main problems in classification: the number of samples in the majority class far exceeds the number in the minority class. The problem is considerably harder to handle in multi-class datasets than in two-class datasets, and it becomes more complicated still when the classes overlap. One method that has proven reliable here is Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI), a hybrid approach that combines sampling with classifier ensembles. In terms of diversity among classifiers, hybrid approaches that combine sampling and classifier ensembles give better results, and HAR-MI delivers excellent results in handling multi-class imbalance. HAR-MI uses SMOTE to increase the number of samples in the minority class. SMOTE has a weakness, however: on an extremely imbalanced dataset with a large number of attributes it tends to over-fit. The Hybrid Sampling method was proposed to overcome this over-fitting. Combining HAR-MI with Hybrid Sampling increases the number of samples in the minority class while reducing the number of noise samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results are compared with those obtained using Neighbourhood-based undersampling. Overlapping and classifier performance are measured using Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results showed that HAR-MI with Hybrid Sampling gave better results in terms of Augmented R-Value, Precision, Recall, and F-Value.
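The SMOTE step mentioned in this abstract can be illustrated with a minimal sketch. It shows only the generic SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the HAR-MI or MOSHS procedures, whose details the abstract does not give. The function name `smote_sketch`, the parameter `k`, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by SMOTE-style interpolation.

    Illustrative only: assumes at least two minority points and
    purely numeric features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Pick one of the k nearest minority neighbours at random.
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already occupied by the minority class. That is also why very sparse, high-dimensional minority classes can end up over-fitted.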

2018 ◽  
Vol 7 (2.14) ◽  
pp. 478 ◽  
Author(s):  
Hartono ◽  
Opim Salim Sitompul ◽  
Erna Budhiarti Nababan ◽  
Tulus ◽  
Dahlan Abdullah ◽  
...  

Data mining and machine learning techniques designed to solve classification problems require a balanced class distribution. In practice, however, datasets often contain one class represented by a large number of instances alongside classes with far fewer instances. This is known as the class imbalance problem. Classifier ensembles are often used to overcome class imbalance problems, and data diversity is one of the cornerstones of ensembles. An ideal ensemble system should have accurate individual classifiers, and where errors occur they should fall on different objects or instances. This research presents an overview and an experimental study of the Hybrid Approach Redefinition (HAR) method for handling class imbalance while also aiming for better data diversity. The study uses 6 datasets with different imbalance ratios and compares HAR with SMOTEBoost, a re-weighting method that is often used to handle class imbalance. The study shows that data diversity is related to performance in imbalance learning ensembles and that the proposed method can obtain better data diversity.


Author(s):  
Hartono Hartono ◽  
Erianto Ongko ◽  
Yeni Risyani

In classification tasks affected by class imbalance, the uneven distribution of instances causes poor performance, and overlapping between classes degrades performance further. This paper proposes a method that combines feature selection with the hybrid approach redefinition (HAR) method to handle class imbalance and overlapping in multi-class imbalanced data. HAR is a hybrid ensemble method for handling the class imbalance problem. The main contribution of this work is a new method that can overcome both class imbalance and overlapping in the multi-class imbalance problem. The method is intended to give better results in terms of classifier performance and degree of overlap in multi-class problems. This is achieved by improving the ensemble learning algorithm and the preprocessing technique in HAR using minimizing overlapping selection under SMOTE (MOSS). MOSS is a very popular feature selection method for handling overlapping. To validate the proposed method, this research uses augmented R-Value, Mean AUC, Mean F-Measure, Mean G-Mean, and Mean Precision. The model is evaluated against the hybrid method (MBP+CGE), a popular method for handling class imbalance and overlapping in multi-class imbalanced data. The proposed method is found to be superior in classifier performance, as indicated by better Mean AUC, F-Measure, G-Mean, and Precision.


2020 ◽  
Vol 8 (2) ◽  
pp. 42-51
Author(s):  
Gaetano Zazzaro

The class imbalance problem is widespread in Data Mining and can reduce the general performance of a classification model. Many techniques have been proposed to overcome it, making it possible to train a model able to handle rare events. The methodology presented in this paper, called Controlled Over-Sampling Method (COSM), includes a controller model able to reject new synthetic elements for which there is no certainty of belonging to the minority class. It combines the common Machine Learning holdout method with an oversampling algorithm, for example the classic SMOTE algorithm. The proposal explained and designed here serves as a guideline for applying oversampling algorithms, and also as a brief overview of techniques for overcoming the unbalanced class problem in Data Mining.
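The idea of a controller that rejects doubtful synthetic samples can be sketched as follows. This is not the COSM implementation: the abstract does not say which model acts as the controller or how the holdout split is used, so a 1-nearest-neighbour rule over the original labelled data stands in for it here. The helper names `nearest_label` and `filter_synthetic`, and all data values, are hypothetical.

```python
import math


def nearest_label(point, labelled):
    """1-nearest-neighbour prediction; a stand-in for the controller model."""
    return min(labelled, key=lambda item: math.dist(point, item[0]))[1]


def filter_synthetic(candidates, labelled, minority_label):
    """Keep only the candidates the controller assigns to the minority class."""
    return [
        c for c in candidates
        if nearest_label(c, labelled) == minority_label
    ]


# Toy data as (features, label); label 1 is the minority class.
labelled = [
    ((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0),
    ((1.0, 1.0), 1), ((1.1, 0.9), 1),
]
# Hypothetical oversampler output: one candidate near the minority
# cluster and one that drifted into the majority region.
candidates = [(1.05, 0.95), (0.15, 0.2)]
accepted = filter_synthetic(candidates, labelled, minority_label=1)
```

Only the accepted candidates would be added to the training set, so the oversampler can be swapped freely while the controller guards against synthetic points that fall in majority territory.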


Author(s):  
Sayan Surya Shaw ◽  
Shameem Ahmed ◽  
Samir Malakar ◽  
Laura Garcia-Hernandez ◽  
Ajith Abraham ◽  
...  

Many real-life datasets are imbalanced in nature: the number of samples in one class (the minority class) is far smaller than the number in the other (the majority class). If such a dataset is fitted directly to a standard classifier, the classifier often overlooks the minority class samples while estimating the class-separating hyperplane(s) and consequently misclassifies them. Over the years, researchers have followed many different approaches to this problem. However, selecting truly representative samples from the majority class is still considered an open research problem. A better solution would help in many applications, such as fraud detection, disease prediction and text classification. Recent studies also show that it is necessary to analyze not only the disproportion between classes but also other difficulties rooted in the nature of the data, which calls for a more flexible, self-adaptable, computationally efficient and real-time method for selecting majority class samples without losing much important data. With this in mind, we propose a hybrid model combining Particle Swarm Optimization (PSO), a popular swarm intelligence-based meta-heuristic algorithm, and the Ring Theory (RT)-based Evolutionary Algorithm (RTEA), a recently proposed physics-based meta-heuristic algorithm. We name the algorithm RT-based PSO, or RTPSO for short. RTPSO can select the most representative samples from the majority class because it takes advantage of the efficient exploration and exploitation phases of its parent algorithms to strengthen the search process. We use an AdaBoost classifier to observe the final classification results of our model. The effectiveness of the proposed method has been evaluated on 15 standard real-life datasets with low to extreme imbalance ratios. The performance of RTPSO has been compared with PSO, RTEA and other standard undersampling methods. The results demonstrate the superiority of RTPSO over the state-of-the-art class imbalance problem-solvers considered here for comparison. The source code of this work is available at https://github.com/Sayansurya/RTPSO_Class_imbalance.


Author(s):  
Ruchika Malhotra ◽  
Kusum Lata

To facilitate software maintenance and reduce its cost, numerous machine learning (ML) techniques have been studied for predicting the maintainability of software modules or classes. The research community has put considerable effort into developing software maintainability prediction (SMP) models that relate software metrics to the maintainability of modules or classes. When the software classes demanding high maintainability effort (HME) are fewer than the low maintainability effort (LME) classes, the training datasets for SMP models become imbalanced. An imbalanced class distribution in SMP datasets is a problem for many ML techniques because minority class instances are either misclassified or discarded as noise. Recent work in predictive modeling has established that ensemble techniques can boost the performance of ML techniques by combining their predictions. Ensembles by themselves do little to solve the class-imbalance problem. However, combining ensemble techniques with techniques for handling class imbalance (e.g., data resampling) has led to several proposals in the literature. This paper evaluates the performance of ensembles for class imbalance in the domain of SMP. Ensembles for the class-imbalance problem (ECIP) are modified ensembles that pre-process the imbalanced data with data resampling before the learning process. This study experimentally compares the performance of several ECIP using the performance metrics Balance and g-Mean over eight Apache software datasets. The results indicate that, for imbalanced datasets, ECIP improve the performance of SMP models compared with classic ensembles.
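The two metrics named in this abstract can be computed from a binary confusion matrix. The sketch below assumes the definition of Balance that is common in software-engineering prediction studies (the normalized distance from the ideal point pd = 1, pf = 0); the abstract does not spell out its formula, so this is an assumption. The function name and the example counts are hypothetical.

```python
import math


def balance_and_gmean(tp, fn, fp, tn):
    """Return (Balance, g-Mean) for a binary confusion matrix.

    pd is the recall on the positive (minority) class and pf is the
    false positive rate on the negative (majority) class.
    """
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    # Balance: 1 minus the normalized Euclidean distance from the
    # ideal point (pf = 0, pd = 1).
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    # g-Mean: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(pd * (1 - pf))
    return balance, g_mean


# Hypothetical counts: 10 high-effort classes, 90 low-effort classes.
balance, g_mean = balance_and_gmean(tp=8, fn=2, fp=10, tn=80)
```

Both metrics depend only on per-class rates, so a model that ignores the minority class scores poorly on them even when its plain accuracy looks high.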


2019 ◽  
Vol 8 (2) ◽  
pp. 2463-2468

Learning from class-imbalanced data is a challenging issue in the machine learning community because classification algorithms are generally designed to work on balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the more flexible and versatile. This paper introduces a new concept for undersampling based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited to binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and gives the best results in the study. We use F-Score, GMean and ROC to evaluate the performance of the method.


2022 ◽  
Vol 16 (3) ◽  
pp. 1-37
Author(s):  
Robert A. Sowah ◽  
Bernard Kuditchar ◽  
Godfrey A. Mills ◽  
Amevi Acakpovi ◽  
Raphael A. Twum ◽  
...  

The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset that is highly skewed towards the negative class. This skew causes classification algorithms to perform poorly when predicting the positive class for new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques for dealing with the class imbalance problem. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on the class imbalance problem. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), combines a cluster undersampling technique to under-sample the majority instances with an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination to oversample the minority instances, solving the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets with varying degrees of imbalance from the National Aeronautics and Space Administration Metric Data Program data repository and the University of California Irvine Machine Learning data repository. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, for the same datasets, the HCBST performed better, with average performances of 0.73, 0.67, and 0.35 in terms of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential to improve performance on the class imbalance problem, which by extension will improve the various applications that rely on it for a solution.
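The Matthews Correlation Coefficient reported above can be computed from a binary confusion matrix as in the sketch below. It uses the standard MCC formula; returning 0.0 when the denominator is zero is a common convention rather than something the abstract states. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def mcc_binary(tp, fn, fp, tn):
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: undefined cases (an empty row or column) map to 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# A classifier that always predicts the negative (majority) class on
# 10 positives and 90 negatives reaches 90% accuracy but no correlation.
always_negative = mcc_binary(tp=0, fn=10, fp=0, tn=90)

# A classifier that recovers most positives at the cost of some
# false alarms.
resampled = mcc_binary(tp=8, fn=2, fp=10, tn=80)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is a common choice for reporting results on skewed datasets.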

