Hybridization of ring theory-based evolutionary algorithm and particle swarm optimization to solve class imbalance problem

Author(s):  
Sayan Surya Shaw
Shameem Ahmed
Samir Malakar
Laura Garcia-Hernandez
Ajith Abraham
...

Many real-life datasets are imbalanced in nature, meaning that the number of samples in one class (the minority class) is far smaller than the number of samples in the other class (the majority class). Hence, if we fit such datasets directly to a standard classifier for training, the classifier often overlooks the minority class samples while estimating the class-separating hyperplane(s) and, as a result, misclassifies them. Over the years, many researchers have followed different approaches to solve this problem. However, the selection of truly representative samples from the majority class is still considered an open research problem. A better solution to this problem would be helpful in many applications such as fraud detection, disease prediction and text classification. Recent studies also show that it is not enough to analyze the disproportion between classes; other difficulties rooted in the nature of the data must be addressed as well, which calls for a more flexible, self-adaptive, computationally efficient and real-time method for selecting majority class samples without losing much important information. Keeping this in mind, we propose a hybrid model combining Particle Swarm Optimization (PSO), a popular swarm intelligence-based meta-heuristic algorithm, and the Ring Theory (RT)-based Evolutionary Algorithm (RTEA), a recently proposed meta-heuristic algorithm. We name the resulting algorithm RT-based PSO, or RTPSO for short. RTPSO can select the most representative samples from the majority class, as it takes advantage of the efficient exploration and exploitation phases of its parent algorithms to strengthen the search process. We use the AdaBoost classifier to obtain the final classification results of our model. The effectiveness of the proposed method has been evaluated on 15 standard real-life datasets with low to extreme imbalance ratios. The performance of RTPSO has been compared with PSO, RTEA and other standard undersampling methods. The obtained results demonstrate the superiority of RTPSO over the state-of-the-art class imbalance problem-solvers considered here for comparison. The source code of this work is available at https://github.com/Sayansurya/RTPSO_Class_imbalance.
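Since the abstract does not spell out the algorithmic details, the following is only a rough, hypothetical sketch of the general idea: a binary swarm searches over subsets of the majority class, and each candidate subset is scored by training an AdaBoost classifier. It omits the ring-theory operators of RTEA entirely, and every name in it is invented for illustration, so it should not be read as the authors' RTPSO.

```python
# Illustrative sketch only, NOT the authors' RTPSO: RTEA's ring-theory operators are
# omitted, and swarm_undersample and its parameters are hypothetical names.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def swarm_undersample(X, y, n_particles=10, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    maj = labels[counts.argmax()]
    maj_idx = np.where(y == maj)[0]
    min_idx = np.where(y != maj)[0]

    def fitness(mask):
        # Keep all minority samples plus the majority samples selected by the mask.
        keep = np.concatenate([min_idx, maj_idx[mask.astype(bool)]])
        if np.unique(y[keep]).size < 2 or keep.size < 10:
            return 0.0
        Xtr, Xva, ytr, yva = train_test_split(
            X[keep], y[keep], test_size=0.3, stratify=y[keep], random_state=0)
        clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)
        return balanced_accuracy_score(yva, clf.predict(Xva))

    # Each particle is a 0/1 mask over the majority-class samples (1 = keep).
    particles = (rng.random((n_particles, maj_idx.size)) < 0.5).astype(float)
    velocities = rng.normal(0.0, 1.0, particles.shape)
    scores = np.array([fitness(p) for p in particles])
    pbest, pbest_scores = particles.copy(), scores.copy()
    gbest, gbest_score = particles[scores.argmax()].copy(), scores.max()

    for _ in range(n_iters):
        r1, r2 = rng.random(particles.shape), rng.random(particles.shape)
        velocities = (0.7 * velocities
                      + 1.5 * r1 * (pbest - particles)
                      + 1.5 * r2 * (gbest - particles))
        # Binary PSO update: a sigmoid transfer function maps velocities to keep-probabilities.
        particles = (rng.random(particles.shape)
                     < 1.0 / (1.0 + np.exp(-velocities))).astype(float)
        scores = np.array([fitness(p) for p in particles])
        better = scores > pbest_scores
        pbest[better], pbest_scores[better] = particles[better], scores[better]
        if scores.max() > gbest_score:
            gbest, gbest_score = particles[scores.argmax()].copy(), scores.max()

    keep = np.concatenate([min_idx, maj_idx[gbest.astype(bool)]])
    return X[keep], y[keep]
```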

Author(s):  
Ruchika Malhotra
Kusum Lata

To facilitate software maintenance and reduce maintenance cost, numerous machine learning (ML) techniques have been studied for predicting the maintainability of software modules or classes. The research community has put considerable effort into developing software maintainability prediction (SMP) models that relate software metrics to the maintainability of modules or classes. When the classes demanding high maintainability effort (HME) are far fewer than the low maintainability effort (LME) classes, the result is an imbalanced dataset for training the SMP models. The imbalanced class distribution in SMP datasets poses a dilemma for various ML techniques because, with an imbalanced dataset, minority class instances are either misclassified by the ML techniques or discarded as noise. Recent developments in predictive modeling have established that ensemble techniques can boost the performance of ML techniques by collating their predictions. Ensembles by themselves, however, do little to solve the class-imbalance problem. Combining ensemble techniques with techniques for handling class imbalance (e.g., data resampling) has therefore led to several proposals in the literature. This paper evaluates the performance of ensembles for class imbalance in the domain of SMP. Ensembles for the class-imbalance problem (ECIP) are modifications of ensembles that pre-process the imbalanced data using data resampling before the learning process. This study experimentally compares the performance of several ECIP using the performance metrics Balance and g-Mean over eight Apache software datasets. The results of the study indicate that, for imbalanced datasets, ECIP improve the performance of SMP models compared to classic ensembles.
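As a concrete, hypothetical illustration of the ECIP idea described above (resampling the imbalanced data before each base learner is trained), here is a minimal balanced-bagging sketch built on scikit-learn; it is not any of the specific ensemble variants evaluated in the paper, and the class name is invented.

```python
# Minimal sketch of an "ensemble for class imbalance": every base tree is trained on a
# balanced resample (all minority instances plus an equal-sized random draw from the
# majority class). Assumes integer-encoded class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BalancedBagging:
    def __init__(self, n_estimators=25, seed=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        min_idx = np.where(y == labels[counts.argmin()])[0]
        maj_idx = np.where(y == labels[counts.argmax()])[0]
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Resample before learning: balance the classes for this base learner.
            sample = np.concatenate(
                [min_idx, self.rng.choice(maj_idx, size=min_idx.size, replace=False)])
            self.estimators_.append(DecisionTreeClassifier().fit(X[sample], y[sample]))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.estimators_]).astype(int)
        # Majority vote across the resampled base learners.
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```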


2019, Vol. 8 (4), pp. 2594-2602

Generating automated sentiment from audience feedback has become a pressing need. Manually going through every movie review is tedious, so we attempt to predict the polarity of a movie from its reviews using machine learning models. The IMDB movie reviews dataset is used for training and testing. In this study we also address the real-life problems of class imbalance and train-test splitting and propose solutions for both. The class imbalance problem affects a large number of predictive applications, such as cancer detection and fraudulent transaction detection in banks; hence this study attempts to provide a solution to it. The undersampling method is used in this study to improve accuracy on the imbalanced class. Feature extraction methods such as Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) are used to generate features from the reviews. Logistic regression and SVM classifiers are used to measure accuracy. Along with accuracy, the confusion matrix is also computed to show the effect of class imbalance on accuracy.
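A minimal sketch of the kind of pipeline described (random undersampling, TF-IDF features, logistic regression, confusion matrix), written with scikit-learn; the `reviews` and `labels` arguments are placeholders for the IMDB data, the function names are invented, and a Bag-of-Words variant would simply swap in `CountVectorizer`. This is not the study's actual code.

```python
# Hypothetical sketch: undersample, extract TF-IDF features, train a classifier,
# and report both accuracy and the confusion matrix.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def undersample(texts, labels, seed=0):
    """Randomly drop majority-class texts until both classes have equal size."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate(
        [rng.choice(np.where(labels == c)[0], size=n_min, replace=False) for c in classes])
    return [texts[i] for i in keep], labels[keep]

def evaluate(reviews, labels):
    """`reviews`: raw review strings, `labels`: 0/1 polarity (placeholders for IMDB data)."""
    X_bal, y_bal = undersample(reviews, labels)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=42)
    vec = TfidfVectorizer(stop_words="english", max_features=20000)
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_tr), y_tr)
    pred = clf.predict(vec.transform(X_te))
    # The confusion matrix exposes per-class errors that a single accuracy number hides.
    return accuracy_score(y_te, pred), confusion_matrix(y_te, pred)
```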


Author(s):  
Hartono Hartono
Erianto Ongko

Class imbalance is one of the main problems in classification because the number of samples in the majority class far exceeds the number of samples in the minority class. The class imbalance problem in a multi-class dataset is much more difficult to handle than in a two-class dataset, and it becomes even more complicated when accompanied by overlapping. One method that has proven reliable in dealing with this problem is Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI), a hybrid approach that combines sampling and classifier ensembles. In terms of diversity among classifiers, hybrid approaches that combine sampling and classifier ensembles give better results, and HAR-MI delivers excellent results in handling multi-class imbalance. The HAR-MI method uses SMOTE to increase the number of samples in the minority class. However, SMOTE has a weakness: on an extremely imbalanced dataset with a large number of attributes it can lead to over-fitting. To overcome this over-fitting problem, the Hybrid Sampling method is proposed. Combining HAR-MI with Hybrid Sampling increases the number of samples in the minority class and at the same time reduces the number of noisy samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results obtained are compared with those of neighbourhood-based undersampling. Overlapping and classifier performance are measured using the Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results show that HAR-MI with Hybrid Sampling gives better results in terms of Augmented R-Value, Precision, Recall, and F-Value.
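The MOSHS and Different Contribution Sampling steps are not reproduced here; as a rough analogue of the hybrid-sampling idea (grow the minority class while cleaning noisy samples), the snippet below uses the `imbalanced-learn` library's SMOTEENN, which combines SMOTE oversampling with Edited Nearest Neighbours cleaning. It assumes that package is installed and is not the HAR-MI pipeline itself.

```python
# Rough analogue of "hybrid sampling", not the HAR-MI/MOSHS pipeline:
# oversample the minority class with SMOTE, then clean noisy samples with ENN.
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=5, random_state=0)
print("before:", Counter(y))
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```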


Classification is a major task in Machine Learning generally, and it becomes especially difficult when tackling the class imbalance problem. A dataset is said to be imbalanced if the class we are interested in falls into the minority class and appears scanty compared to the majority class; the minority class is also known as the positive class, while the majority class is also known as the negative class. Class imbalance has been a major bottleneck for Machine Learning scientists, as it often leads to choosing the wrong model for a given purpose. This survey is intended to help researchers choose the right model and the best strategies for handling imbalanced datasets when tackling machine learning problems. Proper handling of an imbalanced dataset can lead to accurate and reliable results. Handling class-imbalanced data in a conventional manner, especially when the level of imbalance is high, may lead to the accuracy paradox (appearing to achieve, say, 99% accuracy during evaluation simply because the class distribution is highly skewed); hence an imbalanced class distribution requires special consideration. For this purpose we deal extensively with approaches for handling and solving the imbalanced class problem in machine learning, namely the data sampling approach, the cost-sensitive learning approach and the ensemble approach.
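For a quick feel of the accuracy paradox mentioned above, the toy example below (not from the survey itself) shows a trivial classifier that always predicts the negative class: it scores 99% accuracy on a 1%-positive dataset yet never detects a single positive instance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)  # 1% positive (minority) class
y_pred = np.zeros_like(y_true)           # always predict the negative class
print(accuracy_score(y_true, y_pred))    # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))      # 0.0  -- every positive instance is missed
```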


2017, Vol. 26 (03), pp. 1750009
Author(s):  
Dionisios N. Sotiropoulos
George A. Tsihrintzis

This paper focuses on a special category of machine learning problems arising when the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced “self”/“non-self” discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task, since the vast majority of molecular patterns pertain to the “non-self” space. Our work investigates the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its ability to deal with extremely skewed datasets in comparison with two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset in which a small fraction of positive samples had to be recognized against a vast volume of negative samples. The results indicate that the bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality to address the class imbalance problem.
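The AIS-based classifier itself is not specified in the abstract, so the sketch below only mirrors the style of comparison reported: measuring how well off-the-shelf SVM and MLP baselines recover a rare positive class on a heavily skewed dataset (synthetic here, not the music-related data used by the authors).

```python
# Hypothetical evaluation harness: compare minority-class recall of SVM and MLP
# baselines on a skewed synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03],
                           n_informative=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
for name, clf in [("SVM", SVC()), ("MLP", MLPClassifier(max_iter=500))]:
    clf.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, clf.predict(X_te), digits=3))  # note minority recall
```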


2021, Vol. 11 (14), pp. 6310
Author(s):  
Ismael Lin
Octavio Loyola-González
Raúl Monroy
Miguel Angel Medina-Pérez

The use of imbalanced databases is a recurrent problem in real-world applications such as medical diagnosis, fraud detection, and pattern recognition. In class imbalance problems, classifiers are commonly biased towards the class with more objects (the majority class) and ignore the class with fewer objects (the minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards pattern-based and fuzzy approaches due to their favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers include classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions based on the analysis of the reviewed papers and the trends in the state of the art.


2021, Vol. 7 (1), pp. 63
Author(s):  
Prasetyo Wibowo
Chastine Fatichah

Class imbalance occurs when the distribution of samples between the majority and minority classes is unequal. The degree of imbalance may vary from mild to severe. High class imbalance can affect overall classification accuracy, since the model is most likely to predict the class to which most of the data belong; such a model gives biased results, and its predictions for the minority class contribute little to the outcome. Oversampling is one way to deal with high class imbalance, yet only a few techniques are commonly used to address it. This study provides an in-depth performance analysis of oversampling techniques for the high class imbalance problem. Adding an oversampling step balances the data in each class so that modeling yields unbiased evaluation results. We compare the performance of Random Oversampling (ROS), ADASYN, SMOTE, and Borderline-SMOTE. Each oversampling technique is combined with machine learning methods, namely Random Forest, Logistic Regression, and k-Nearest Neighbor (KNN). The test results show that Random Forest with Borderline-SMOTE performs best among the oversampling techniques, with an accuracy of 0.9997, precision of 0.9474, recall of 0.8571, F1-score of 0.9000, ROC-AUC of 0.9388, and PR-AUC of 0.8581.
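A compact sketch of the kind of grid the study compares (four oversamplers crossed with three classifiers), using the `imbalanced-learn` oversampling implementations; the synthetic dataset, hyper-parameters, and the single F1 metric here are placeholders rather than the authors' exact setup.

```python
# Oversample only the training fold, then evaluate each classifier on untouched test data.
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"ROS": RandomOverSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0),
            "ADASYN": ADASYN(random_state=0),
            "Borderline-SMOTE": BorderlineSMOTE(random_state=0)}
models = {"RF": RandomForestClassifier(random_state=0),
          "LR": LogisticRegression(max_iter=1000),
          "KNN": KNeighborsClassifier()}

for s_name, sampler in samplers.items():
    X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)  # rebalance the training data only
    for m_name, model in models.items():
        pred = model.fit(X_bal, y_bal).predict(X_te)
        print(f"{s_name:>16} + {m_name:<3}  F1 = {f1_score(y_te, pred):.3f}")
```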


2018, Vol. 27 (06), pp. 1850025
Author(s):  
Huaping Guo
Jun Zhou
Chang-an Wu
Wei She

Class-imbalance is very common in the real world, yet conventional methods do not work well on imbalanced data because of the skewed class distribution. This paper proposes a simple but effective Hybrid-based Ensemble (HE) to deal with two-class imbalance problems. HE learns a hybrid ensemble in two stages: (1) learning several projection matrices from rebalanced data obtained by under-sampling the original training set, and constructing new training sets by projecting the original training set into the different spaces defined by these matrices; and (2) under-sampling several subsets from each new training set and training a model on each subset. Here, feature projection aims to improve the diversity between ensemble members, while the under-sampling technique aims to improve the generalization ability of individual members on the minority class. Experimental results show that, compared with other state-of-the-art methods, HE performs significantly better on the measures of AUC, G-mean, F-measure and recall.
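A hedged sketch of the two-stage structure described above: the projection here is a plain random Gaussian matrix, whereas HE learns its projection matrices from rebalanced data, so this should be read as the overall shape of the method rather than its exact form; the function names are illustrative only.

```python
# Stage 1: project the data into several new feature spaces (for diversity);
# Stage 2: train base learners on balanced undersamples in each space (for minority
# generalization). Assumes integer-encoded class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_hybrid_ensemble(X, y, n_projections=5, n_subsets=5, seed=0):
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    min_idx = np.where(y == labels[counts.argmin()])[0]
    maj_idx = np.where(y == labels[counts.argmax()])[0]
    ensemble = []
    for _ in range(n_projections):
        P = rng.normal(size=(X.shape[1], X.shape[1]))  # random projection matrix
        X_proj = X @ P
        for _ in range(n_subsets):
            # Balanced undersample of the majority class in the projected space.
            sample = np.concatenate(
                [min_idx, rng.choice(maj_idx, size=min_idx.size, replace=False)])
            ensemble.append((P, DecisionTreeClassifier().fit(X_proj[sample], y[sample])))
    return ensemble

def predict_hybrid_ensemble(ensemble, X):
    votes = np.stack([model.predict(X @ P) for P, model in ensemble]).astype(int)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)  # majority vote
```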

