Effective Rate of Minority Class Over-Sampling for Maximizing the Imbalanced Dataset Model Performance

2021 ◽  
pp. 9-20
Author(s):  
Forhad An Naim ◽  
Ummae Hamida Hannan ◽  
Md. Humayun Kabir
2014 ◽  
Vol 556-562 ◽  
pp. 4040-4044
Author(s):  
Chen Guang Zhang ◽  
Yan Zhang ◽  
Xia Huan Zhang

In real applications, datasets may be highly imbalanced: some classes contain far more instances than others. When learning from a highly imbalanced dataset, a classifier tends to adapt to the majority class, which may yield high predictive accuracy on the majority class but poor accuracy on the minority class. To address this problem, we propose a novel graph-based semi-supervised learning method for imbalanced datasets, called GSMID. GSMID characterizes the class equilibrium constraint as the smoothness of class labels. It derives the optimal assignment of class memberships to unlabeled samples by maximizing the correlations between classes while keeping the assignment as smooth as possible on the instance graph. Experiments comparing GSMID to SVM and other graph-based semi-supervised learning methods on several real-world datasets show that GSMID can effectively improve classification accuracy on imbalanced datasets, especially when the data are highly skewed.
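GSMID itself is not publicly available, but the core idea the abstract describes (class labels spreading smoothly along an instance graph) can be sketched with a minimal pure-Python label-propagation loop. The dataset, `rbf` kernel width, and iteration count below are illustrative assumptions, not the paper's method.

```python
import math

# Toy graph-based semi-supervised labeling: labels of the few labeled
# points spread along graph edges so the final labeling is "smooth"
# over the instance graph. labels: +1 / -1 for labeled points, 0 for unlabeled.

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def propagate(points, labels, gamma=1.0, iters=50):
    n = len(points)
    # Edge weights: similarity between instances (no self-loops).
    W = [[rbf(points[i], points[j], gamma) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    f = list(labels)
    for _ in range(iters):
        new_f = []
        for i in range(n):
            if labels[i] != 0:                    # clamp labeled points
                new_f.append(labels[i])
            else:                                 # weighted average of neighbours
                s = sum(W[i][j] * f[j] for j in range(n))
                d = sum(W[i])
                new_f.append(s / d if d else 0.0)
        f = new_f
    return [1 if v >= 0 else -1 for v in f]

# Two well-separated clusters, one labeled point in each:
points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
labels = [1, 0, 0, -1, 0, 0]
print(propagate(points, labels))  # unlabeled points inherit nearby labels
```

This is the plain smoothness objective; GSMID's contribution is adding the class-equilibrium constraint on top of it so minority labels are not drowned out during propagation.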


Classification is a supervised learning task that categorizes items into groups on the basis of class labels; algorithms are trained on labeled datasets to accomplish it, so the dataset plays an important role. If the instances of one class (the majority class) far outnumber the instances of another class (the minority class), making it hard for a classifier to learn the characteristics of the minority class, the dataset is termed imbalanced. Such datasets cause biased prediction and misclassification in the real world: a model trained on them may show very high accuracy during training but, being unfamiliar with minority-class instances, fails to predict the minority class and thus performs poorly. This paper presents a survey of techniques proposed for handling imbalanced data, together with a comparison of the techniques based on F-measure.
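The "high accuracy but poor minority prediction" failure mode described above is easy to demonstrate with a hypothetical 95/5 split and a classifier that always predicts the majority class:

```python
# Hypothetical dataset: 95 majority-class samples (0), 5 minority (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100                      # degenerate "always majority" classifier

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = tp / (tp + fn)

print(accuracy)         # 0.95 -- looks excellent
print(minority_recall)  # 0.0  -- the minority class is never detected
```

This is why the surveyed works compare techniques on F-measure rather than raw accuracy: F-measure collapses to zero when minority recall does.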


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Tri-Cong Pham ◽  
Chi-Mai Luong ◽  
Van-Dung Hoang ◽  
Antoine Doucet

Melanoma, one of the most dangerous types of skin cancer, has a very high mortality rate; early detection and resection are two keys to a successful cure. Recent research has used artificial intelligence to classify melanoma and nevus and to compare these algorithms' assessments to those of dermatologists. However, training neural networks on an imbalanced dataset leads to imbalanced performance: specificity is very high but sensitivity is very low. This study proposes a method for improving melanoma prediction on an imbalanced dataset by reconstructing an appropriate CNN architecture and optimizing the training algorithms. The contributions involve three key features: a custom loss function, custom mini-batch logic, and reformed fully connected layers. In the experiment, the training dataset is kept up to date and includes 17,302 images of melanoma and nevus, the largest such dataset to date. The model's performance is compared to that of 157 dermatologists from 12 university hospitals in Germany on the same dataset. The experimental results show that our proposed approach outperforms all 157 dermatologists and achieves higher performance than the state-of-the-art approach, with an area under the curve of 94.4%, sensitivity of 85.0%, and specificity of 95.0%. Moreover, using the best threshold gives the most balanced measures compared to other studies (sensitivity of 90.0% and specificity of 93.8%) and is promising for application to medical diagnosis. To foster further research and allow for replicability, we made the source code and data splits of all our experiments publicly available.
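One common form of the "custom loss function" mentioned above is a class-weighted cross-entropy, which penalises a missed melanoma more heavily than a missed nevus. The sketch below uses illustrative weights, not the paper's actual loss; it only shows the weighting mechanism.

```python
import math

# Class-weighted binary cross-entropy: positives (melanoma, t = 1) are
# up-weighted so that false negatives dominate the loss.
def weighted_bce(y_true, y_prob, w_pos=10.0, w_neg=1.0):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, 1e-7), 1 - 1e-7)          # avoid log(0)
        total += -(w_pos * t * math.log(p) + w_neg * (1 - t) * math.log(1 - p))
    return total / len(y_true)

# A confidently missed melanoma (true 1, predicted 0.1) is penalised
# 10x more than a missed nevus at the same confidence:
print(weighted_bce([1], [0.1]))   # ~23.03
print(weighted_bce([0], [0.9]))   # ~2.30
```

In a deep-learning framework the same weighting would be passed to the framework's loss function; the arithmetic is identical.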


Author(s):  
Bo Zang ◽  
Ruochen Huang ◽  
Lei Wang ◽  
Jianxin Chen ◽  
Feng Tian ◽  
...  

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
German Cuaya-Simbro ◽  
Alberto-I. Perez-Sanpablo ◽  
Eduardo-F. Morales ◽  
Ivett Quiñones Uriostegui ◽  
Lidia Nuñez-Carrera

Falls are a multifactorial cause of injuries in older people, and subjects with osteoporosis are particularly vulnerable to them. We study the performance of different computational methods in identifying people with osteoporosis who experience a fall by analysing balance parameters. Balance parameters from eyes-open and eyes-closed posturographic studies, together with prospective registration of falls, were obtained from a sample of 126 community-dwelling older women with osteoporosis (age 74.3 ± 6.3) using the World Health Organization questionnaire for the study of falls during a follow-up of 2.5 years. We analyzed the fall-prediction performance of every developed model and validated the relevance of the selected parameter sets. The principal findings of this research were that (1) models built using oversampling methods with either the IBk (KNN) or Random Forest classifier can be considered good options for a predictive clinical test, and (2) the feature selection for minority class (FSMC) method selected previously unnoticed balance parameters, which implies that intelligent computing methods can extract useful information from attributes that experts would otherwise disregard. Finally, the results suggest that the Random Forest classifier with oversampling achieved the best overall performance, independent of the set of variables used, in terms of sensitivity (>0.71), specificity (>0.18), positive predictive value (PPV > 0.74), and negative predictive value (NPV > 0.66). The IBk classifier built with oversampled data, considering information from both eyes-open and eyes-closed studies and using all variables, achieved sensitivity > 0.81, specificity > 0.19, PPV = 0.97, and NPV = 0.66.
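The four clinical measures reported above all derive from the same confusion matrix. A minimal sketch, with illustrative counts rather than the study's data:

```python
# Sensitivity, specificity, PPV and NPV from confusion-matrix counts.
# tp/fn = fallers correctly/incorrectly flagged; tn/fp = non-fallers.
def clinical_measures(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # recall on fallers
    specificity = tn / (tn + fp)   # recall on non-fallers
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return sensitivity, specificity, ppv, npv

# Illustrative counts only (not the study's results):
sens, spec, ppv, npv = clinical_measures(tp=80, fn=20, tn=30, fp=70)
print(sens, spec, ppv, npv)
```

Note how a model can score high sensitivity and PPV while specificity stays low, which is exactly the trade-off pattern visible in the reported numbers.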


2021 ◽  
Vol 10 (5) ◽  
pp. 2789-2795
Author(s):  
Seyyed Mohammad Javadi Moghaddam ◽  
Asadollah Noroozi

Data classification performance suffers when the data distribution is imbalanced: classifiers tend toward the majority class, which contains most of the instances. One popular approach is to balance the dataset using over- and under-sampling methods. This paper presents a novel pre-processing technique that performs both over- and under-sampling on an imbalanced dataset. The proposed method uses the SMOTE algorithm to enlarge the minority class, and a cluster-based approach to shrink the majority class, taking into account the new size of the minority class. Experimental results on 10 imbalanced datasets show that the suggested algorithm performs better than previous approaches.
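The core SMOTE step referenced above places each synthetic minority sample on the line segment between a minority sample and one of its minority-class neighbours: x_new = x + u * (x_neighbor - x) with u drawn uniformly from (0, 1). A minimal sketch of that interpolation (neighbour search and the paper's cluster-based undersampling are omitted):

```python
import random

# One SMOTE interpolation step: synthesise a minority sample between
# a minority point x and one of its minority-class neighbours.
def smote_sample(x, neighbor, rng=random):
    u = rng.random()                      # u ~ Uniform(0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
x, nb = [1.0, 2.0], [3.0, 4.0]
new = smote_sample(x, nb)
# The synthetic point lies between x and nb in every coordinate:
print(all(min(a, b) <= v <= max(a, b) for a, b, v in zip(x, nb, new)))  # True
```

Because synthetic points interpolate rather than duplicate, SMOTE widens the minority region instead of just repeating existing samples.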


Author(s):  
Hung Ba Nguyen ◽  
Van-Nam Huynh ◽  

Imbalanced datasets are a crucial problem in many real-world applications. Classifiers trained on them tend to overfit toward the majority class, which severely affects classifier accuracy and ultimately incurs a large cost of misclassifying the minority class, especially in credit-granting decisions where the minority class consists of bad loan applications. By comparing the industry standard with well-known machine learning and ensemble models under imbalance-treatment approaches, this study shows the potential performance of these models relative to the industry standard in credit scoring. More importantly, diverse performance measurements reveal different weaknesses in various aspects of a scoring model. Employing class-balancing strategies can mitigate classifier errors, and both homogeneous and heterogeneous ensemble approaches yield the most significant improvement in credit scoring.
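The asymmetric cost the abstract alludes to can be made concrete with an expected-cost sketch. The cost figures below are illustrative assumptions, not values from the study; they only show why accuracy is a misleading measure when approving a bad loan costs far more than rejecting a good one.

```python
# Asymmetric misclassification cost in credit scoring (label 1 = bad loan).
COST_FN = 50.0   # bad loan approved: principal may be lost
COST_FP = 5.0    # good loan rejected: interest income forgone

def expected_cost(y_true, y_pred):
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return fn * COST_FN + fp * COST_FP

y_true = [0] * 90 + [1] * 10
always_approve = [0] * 100          # 90% accurate, yet misses every bad loan
print(expected_cost(y_true, always_approve))  # 500.0
```

Under such a cost structure, a class-balanced model with lower raw accuracy can still be far cheaper to operate, which is the motivation for the balancing strategies the study evaluates.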


Author(s):  
Shahzad Ashraf ◽  
Sehrish Saleem ◽  
Tauqeer Ahmed ◽  
Zeeshan Aslam ◽  
Durr Muhammad

An imbalanced dataset commonly contains at least one class that is greatly outnumbered by the others. A machine learning algorithm (classifier) trained on an imbalanced dataset predicts the majority (frequently occurring) class more often than the minority (rarely occurring) classes. Training on an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques to reduce class imbalance can enhance classifier performance. In this study, we consider an imbalanced dataset from an educational context. We first examine the shortcomings of classifying an imbalanced dataset, then apply data-level algorithms for class balancing and compare the performance of classifiers. Performance is measured using the underlying information in the confusion matrix: accuracy, precision, recall, and F-measure. The results show that classification on an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that both undersampling and oversampling are effective for balancing datasets, but the latter dominates.
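The two data-level strategies compared above can be sketched in a few lines on a toy label list: random oversampling duplicates minority samples until the classes match, while random undersampling discards majority samples. Both reach balance, but at different final dataset sizes.

```python
import random

# Data-level class balancing on a toy 5-vs-95 label list.
def oversample(minority, majority, rng):
    # Duplicate randomly chosen minority samples up to the majority size.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra + majority

def undersample(minority, majority, rng):
    # Keep only as many majority samples as there are minority samples.
    kept = rng.sample(majority, len(minority))
    return minority + kept

rng = random.Random(0)
minority = ["pos"] * 5
majority = ["neg"] * 95
over = oversample(minority, majority, rng)
under = undersample(minority, majority, rng)
print(over.count("pos"), over.count("neg"))    # 95 95
print(under.count("pos"), under.count("neg"))  # 5 5
```

The size difference is one reason oversampling tends to dominate in studies like this one: undersampling balances the classes by throwing away 90% of the data, while oversampling keeps every original sample.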

