Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Author(s):  
Show-Jane Yen ◽  
Yue-Shi Lee
2021 ◽  
Vol 10 (5) ◽  
pp. 2789-2795
Author(s):  
Seyyed Mohammad Javadi Moghaddam ◽  
Asadollah Noroozi

Data classification performance suffers when the class distribution is imbalanced: classifiers become biased toward the majority class, which holds most of the instances. A popular remedy is to balance the dataset using over- and under-sampling methods. This paper presents a novel pre-processing technique that applies both over- and under-sampling to an imbalanced dataset. The proposed method uses the SMOTE algorithm to enlarge the minority class, and then applies a cluster-based approach to reduce the majority class, taking the new size of the minority class into account. Experimental results on 10 imbalanced datasets show that the suggested algorithm performs better than previous approaches.
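The combined over/under-sampling idea can be sketched as follows. This is a hypothetical minimal illustration, not the authors' exact algorithm: a SMOTE-style interpolation grows the minority class, and k-means centroids then replace the majority class, with the cluster count chosen from the new minority size.

```python
# Hypothetical sketch of combined over- and under-sampling: SMOTE-style
# interpolation grows the minority class, then k-means centroids shrink
# the majority class to the minority's new size.
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new):
    """Create n_new synthetic points by interpolating each base sample
    toward a randomly chosen one of its 3 nearest neighbours."""
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)                # never pick self
    nn = np.argsort(d, axis=1)[:, :3]          # 3 nearest neighbours
    idx = rng.integers(0, len(X_min), n_new)   # base sample per synthetic point
    nbr = nn[idx, rng.integers(0, 3, n_new)]   # random neighbour of that sample
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[idx] + gap * (X_min[nbr] - X_min[idx])

def kmeans_centroids(X_maj, k, iters=20):
    """Tiny k-means; the k centroids replace the majority class."""
    C = X_maj[rng.choice(len(X_maj), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(np.linalg.norm(X_maj[:, None] - C[None, :], axis=2), axis=1)
        C = np.array([X_maj[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return C

X_min = rng.normal(0, 1, (20, 2))     # 20 minority samples (toy data)
X_maj = rng.normal(3, 1, (200, 2))    # 200 majority samples

X_min_new = np.vstack([X_min, smote_like(X_min, 80)])  # grow minority to 100
X_maj_new = kmeans_centroids(X_maj, len(X_min_new))    # shrink majority to 100

print(len(X_min_new), len(X_maj_new))  # balanced: 100 100
```

In practice the imbalanced-learn library provides ready-made `SMOTE` and `ClusterCentroids` resamplers implementing the same two ideas.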


2018 ◽  
Vol 10 (11) ◽  
pp. 1689 ◽  
Author(s):  
Min Ji ◽  
Lanfa Liu ◽  
Manfred Buchroithner

Earthquakes are among the most devastating natural disasters threatening human life. It is vital to retrieve the building damage status for planning rescue and reconstruction after an earthquake. When completely collapsed buildings are far fewer than intact or less-affected buildings (e.g., the 2010 Haiti earthquake), it is difficult for a classifier to learn the minority-class samples because of the imbalanced learning problem. In this study, a convolutional neural network (CNN) was used to identify collapsed buildings from post-event satellite imagery with the proposed workflow. Producer accuracy (PA), user accuracy (UA), overall accuracy (OA), and Kappa were used as evaluation metrics. To overcome the imbalance problem, random over-sampling, random under-sampling, and cost-sensitive methods were tested on the selected test A and test B regions. The results demonstrate that building collapse information can be retrieved from post-event imagery. SqueezeNet performed well in classifying collapsed and non-collapsed buildings, achieving an average OA of 78.6% for the two test regions. After the balancing steps, the average Kappa value improved from 41.6% to 44.8% with the cost-sensitive approach. Moreover, the cost-sensitive method discriminated collapsed buildings better, with a PA of 51.2% for test A and 61.1% for test B. A suitable balancing method should therefore be considered when an imbalanced dataset is used to retrieve the distribution of collapsed buildings.
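The cost-sensitive alternative to resampling can be sketched in a few lines. The labels and counts below are invented toy values; the weighting rule shown (inverse class frequency, n / (k · count)) is a common convention, e.g. scikit-learn's "balanced" class weights, not necessarily the exact scheme used in this study.

```python
# Hedged sketch of the cost-sensitive idea: weight each class's loss
# inversely to its frequency, so rare collapsed buildings count more
# per sample than the abundant intact ones.
from collections import Counter

labels = ["intact"] * 900 + ["collapsed"] * 100   # imbalanced toy labels
counts = Counter(labels)
n, k = len(labels), len(counts)

# "balanced" weights: n / (k * count_c), as in scikit-learn's convention
weights = {c: n / (k * counts[c]) for c in counts}
print(weights)   # intact ~0.556, collapsed 5.0
```

These weights are then passed to the training loss (most deep-learning frameworks accept per-class weights in their cross-entropy losses), so no samples are duplicated or discarded.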


2018 ◽  
Vol 7 (1.8) ◽  
pp. 113 ◽  
Author(s):  
G Shobana ◽  
Bhanu Prakash Battula

Some real applications reveal difficulties in learning classifiers from imbalanced data. Although several methods for improving classifiers have been introduced, identifying the conditions for the effective use of a particular method is still an open research issue. It is also worth considering the nature of imbalanced data, the characteristics of the minority class distribution, and their influence on classification performance. However, current studies on the difficulty factors of imbalanced data have been carried out mainly with synthetic datasets, and their conclusions do not transfer easily to real-world problems, partly because the methods for identifying these factors are not sufficiently developed. In this paper, we propose a novel approach, Under-Sampling Utilizing Diversified Distribution (USDD), for addressing class imbalance in real datasets by considering mechanisms for identifying and removing borderline, rare, and outlier sub-groups using k-means. USDD uses a dedicated procedure to identify these types of examples, based on analyzing the class distribution in the local neighborhood of the considered example using a k-nearest-neighbor approach. The experimental results suggest that the proposed USDD approach performs better than the compared approaches in terms of AUC, precision, recall, and f-measure.
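The local-neighborhood analysis USDD relies on can be sketched as follows. The category names and the k=5 thresholds follow the commonly used safe/borderline/rare/outlier taxonomy of example types; the exact thresholds used by USDD are an assumption here.

```python
# Hedged sketch of neighbourhood-based example typing: label a sample
# safe / borderline / rare / outlier by how many of its k=5 nearest
# neighbours share its class (thresholds are the common convention).
import math

def knn_type(point, label, data, k=5):
    by_dist = sorted(data, key=lambda p: math.dist(point, p[0]))
    same = sum(1 for coords, lab in by_dist[1:k + 1] if lab == label)  # skip self
    return ("safe" if same >= 4 else
            "borderline" if same >= 2 else
            "rare" if same == 1 else "outlier")

# Toy data: a tight cluster of class 'a' and one distant 'b' point.
data = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((0.0, 0.1), "a"),
        ((0.1, 0.1), "a"), ((0.2, 0.0), "a"), ((0.2, 0.1), "a"),
        ((5.0, 5.0), "b")]

print(knn_type((0.0, 0.0), "a", data))  # safe: all 5 neighbours are 'a'
print(knn_type((5.0, 5.0), "b", data))  # outlier: no neighbour is 'b'
```

Examples typed as borderline, rare, or outlier within the majority class are then candidates for removal during under-sampling.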


2014 ◽  
Vol 556-562 ◽  
pp. 4040-4044
Author(s):  
Chen Guang Zhang ◽  
Yan Zhang ◽  
Xia Huan Zhang

In real application areas, the dataset used may be highly imbalanced: the number of instances for some classes is much higher than for the others. When learning from a highly imbalanced dataset, the classifier tends to adapt to the majority class, which may give it high predictive accuracy on the majority class but poor accuracy on the minority class. To solve this problem, we put forward a novel graph-based semi-supervised learning method for imbalanced datasets, called GSMID. GSMID characterizes the class equilibrium constraint as the smoothness of class labels. It derives the optimal assignment of class membership to unlabeled samples by maximizing the correlations of classes while remaining as smooth as possible on the instance graph. Experiments comparing GSMID to SVM and other graph-based semi-supervised learning methods on several real-world datasets show that GSMID can effectively improve classification accuracy on imbalanced datasets, especially when the data is highly skewed.
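The smoothness-on-a-graph idea can be illustrated with plain label propagation, a standard graph-based semi-supervised baseline (this is not GSMID itself, and it omits GSMID's class-equilibrium constraint): labels diffuse over an affinity graph until neighboring instances agree.

```python
# Minimal label-propagation sketch (a standard graph SSL baseline, not
# GSMID): labels diffuse over an RBF affinity graph until smooth, with
# labelled points clamped each iteration.
import numpy as np

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])  # two 1-D clusters
y = np.array([0, -1, -1, 1, -1, -1])                      # -1 = unlabeled

W = np.exp(-(X - X.T) ** 2 / 0.05)            # RBF affinity graph
np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)          # row-normalised transition matrix

F = np.zeros((len(y), 2))                     # soft label matrix
F[y >= 0, y[y >= 0]] = 1                      # one-hot for labelled points
for _ in range(50):
    F = P @ F                                 # propagate labels along edges
    F[y >= 0] = 0
    F[y >= 0, y[y >= 0]] = 1                  # clamp labelled points

print(F.argmax(axis=1))                       # [0 0 0 1 1 1]
```

Each cluster inherits the label of its single labeled member, which is exactly the smoothness behavior a class-imbalance-aware method like GSMID must then correct when one class dominates the graph.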


Classification is a supervised learning task of assigning items to groups on the basis of class labels; algorithms are trained on labeled datasets to accomplish it. The dataset plays an important role in this process. If the instances of one class (the majority class) greatly outnumber those of another (the minority class), so that a classifier struggles to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets lead to biased prediction and misclassification in the real world: models trained on them may show very high accuracy during training, yet, being unfamiliar with minority-class instances, fail to predict the minority class. This paper presents a survey of the techniques proposed by researchers for handling imbalanced data, together with a comparison of the techniques based on f-measure.
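The f-measure used for the comparison is the harmonic mean of precision and recall, usually computed for the minority class; the counts below are illustrative values.

```python
# F-measure (F1) for the minority class: the harmonic mean of precision
# and recall, computed from true positives, false positives, false negatives.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)   # of predicted minority, how many are right
    recall = tp / (tp + fn)      # of actual minority, how many are found
    return 2 * precision * recall / (precision + recall)

print(f_measure(40, 10, 20))   # precision 0.8, recall ~0.667 -> F1 ~0.727
```

Unlike plain accuracy, this score stays low when the minority class is missed, which is why it is the preferred yardstick for imbalanced-data techniques.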


2002 ◽  
Vol 16 ◽  
pp. 321-357 ◽  
Author(s):  
N. V. Chawla ◽  
K. W. Bowyer ◽  
L. O. Hall ◽  
W. P. Kegelmeyer

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
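The AUC used for evaluation here has a convenient pairwise interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A quick sketch with made-up scores (this computes AUC directly from that rank statistic, without tracing the full ROC curve):

```python
# AUC as a rank statistic: the fraction of (positive, negative) pairs
# where the positive outscores the negative (ties count half).
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))  # 8 of 9 pairs -> 0.888...
```

Because it aggregates over all score thresholds, AUC is insensitive to class priors, which makes it a natural metric when comparing sampling strategies on imbalanced data.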

