Resampling Methods versus Cost Functions for Training an MLP in the Class Imbalance Context

Author(s):  
R. Alejo ◽  
P. Toribio ◽  
J. M. Sotoca ◽  
R. M. Valdovinos ◽  
E. Gasca


Author(s):  
Nur Farhana Hordri ◽  
Siti Sophiayati ◽  
Nurulhuda Firdaus ◽  
Siti Mariyam

2015 ◽  
Vol 744-746 ◽  
pp. 1985-1989 ◽  
Author(s):  
Miao Hua Li ◽  
Shu Yan Chen

Traffic data are highly skewed, with traffic incidents being rare in the real world, while most existing automatic incident detection (AID) algorithms suffer from an inability to detect incidents under imbalanced traffic data conditions. This paper develops feasible AID algorithms based on resampling methods to handle imbalanced traffic data. To obtain the optimal sampling method for incident detection, we compare the detection performance of different AID algorithms based on various resampling methods. Detection performance is evaluated with common criteria, including classification rate, detection rate, false alarm rate, mean time to detection, and an integrated performance index. The I-880 dataset is then used in experiments to verify the proposed algorithms. The experimental results indicate that the proposed resampling-based AID algorithms achieve better performance by handling the imbalanced traffic data problem. Moreover, under-sampling is more competitive than over-sampling for traffic incident detection.
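As a minimal sketch of the general idea (not the paper's exact pipeline), the following Python snippet under-samples an imbalanced incident-detection training set with imbalanced-learn and reports detection rate and false alarm rate; the loop-detector features and class ratio are synthetic stand-ins.

# Minimal sketch: random under-sampling of an imbalanced incident-detection
# training set. Features and incident ratio are invented for illustration.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# Synthetic stand-in for loop-detector features (e.g. occupancy, speed, volume);
# roughly 2% of samples are incidents, mimicking the skew of real traffic data.
X = rng.normal(size=(5000, 3))
y = (rng.random(5000) < 0.02).astype(int)

# Balance the training set by randomly discarding non-incident samples.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, y)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X_bal, y_bal)

tn, fp, fn, tp = confusion_matrix(y, clf.predict(X)).ravel()
print("detection rate:", tp / (tp + fn))      # fraction of incidents detected
print("false alarm rate:", fp / (fp + tn))    # fraction of normal samples flagged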


2021 ◽  
Vol 40 (5) ◽  
pp. 10073-10086
Author(s):  
Zhicheng Pang ◽  
Hong Li ◽  
Chiyu Wang ◽  
Jiawen Shi ◽  
Jiale Zhou

In practice, class imbalance is prevalent in sentiment classification tasks and is harmful to classifiers. Recently, over-sampling strategies based on data augmentation techniques have attracted researchers' attention. They generate new samples by rewriting the original samples. Nevertheless, the samples to be rewritten are usually selected at random, which means that useless samples may be selected and added to the training set. Based on this observation, we propose a novel balancing strategy for text sentiment classification. Our approach is built on word replacement and proceeds in two stages, which not only balance the class distribution of the training set but also correct noisy data. In the first stage, we perform word replacement on specific samples rather than random samples to obtain new samples. In the second stage, guided by noise detection, we revise the sentiment of noisy samples. Toward this aim, we propose an improved term-weighting scheme called TF-IGM-CW for imbalanced text datasets, which helps to extract the target rewritten samples and feature words. We conduct experiments on four public sentiment datasets. Results suggest that our method outperforms several other resampling methods and can easily be integrated with various classification algorithms.
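For illustration only, a generic word-replacement augmenter for the minority class might look like the Python sketch below; the synonym table and texts are invented, and the paper's TF-IGM-CW weighting for selecting which samples and words to rewrite is not reproduced here.

# Illustrative word-replacement oversampling for a minority sentiment class.
# The synonym table is a toy placeholder; the paper instead selects target
# samples and feature words via its TF-IGM-CW term weighting.
import random

SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"], "movie": ["film"]}

def rewrite(text, n_changes=1, rng=random.Random(0)):
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n_changes, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

minority = ["the movie was good", "bad acting but good plot"]
augmented = minority + [rewrite(t) for t in minority]   # new samples join the originals
print(augmented)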


2020 ◽  
Author(s):  
Vladimir Golkov ◽  
Alexander Becker ◽  
Daniel T. Plop ◽  
Daniel Čuturilo ◽  
Neda Davoudi ◽  
...  

Computer-aided drug discovery is an essential component of modern drug development. Therein, deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features. Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets. In this work we argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance, its ability to compromise over different decision thresholds, certain freedom to influence the relative weights in this compromise, fidelity to typical benchmarking measures, and equivalence to positive/unlabeled learning. We also propose new training schemes (coherent mini-batch arrangement, and usage of out-of-batch samples) for cost functions based on the ROC, as well as a cost function based on the logAUC metric that facilitates early enrichment (i.e. improves performance at high decision thresholds, as often desired when synthesizing predicted hit compounds). We demonstrate that these approaches outperform standard deep learning approaches on a series of PubChem high-throughput screening datasets that represent realistic and diverse drug discovery campaigns on major drug target families.
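As a rough illustration (not the authors' exact cost function or mini-batch arrangement scheme), a differentiable pairwise surrogate for ROC-AUC can be written in a few lines of PyTorch: every positive score in a batch is compared against every negative score, and a sigmoid relaxes the 0/1 ranking indicator.

# Minimal sketch of a pairwise soft-AUC surrogate loss; hyperparameters and
# batch handling are assumptions, not the paper's proposed scheme.
import torch

def soft_auc_loss(scores, labels, temperature=1.0):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Pairwise differences: loss is high when a negative outranks a positive.
    diffs = pos.unsqueeze(1) - neg.unsqueeze(0)
    return torch.sigmoid(-diffs / temperature).mean()

scores = torch.tensor([2.1, 0.3, -0.5, 1.2], requires_grad=True)
labels = torch.tensor([1, 0, 0, 1])
loss = soft_auc_loss(scores, labels)
loss.backward()   # gradients flow to the scores, so a network can be trained on it
print(loss.item())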


2018 ◽  
Vol 147 (12) ◽  
pp. 161-170
Author(s):  
Víctor D. de la Cruz-Galarza ◽  
Yenny Villuendas-Rey ◽  
Cornelio Yáñez-Márquez

Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 228
Author(s):  
Ahmad B. Hassanat ◽  
Ahmad S. Tarawneh ◽  
Samer Subhi Abed ◽  
Ghada Awad Altarawneh ◽  
Malek Alrashidi ◽  
...  

Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. To address both concerns, we present a linear-time resampling method based on random data partitioning and a majority voting rule, in which an imbalanced dataset is partitioned into a number of small class-balanced subdatasets. A separate classifier is then trained on each subdataset, and the final classification result is established by applying the majority voting rule to the outputs of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results obtained by the classifiers on data generated by the proposed method were comparable to those of most of the resampling methods tested, with the exception of SMOTEFUNA, an oversampling method that increases the probability of overfitting. The proposed method produced results comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
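A minimal Python sketch of the partition-and-vote idea described above follows; the exact partitioning rule, base classifier, and data are assumptions made for illustration.

# Sketch: split the majority class into chunks the size of the minority class,
# train one classifier per balanced subdataset, and combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def partition_vote_fit_predict(X, y, X_test, rng=np.random.default_rng(0)):
    pos_idx = np.flatnonzero(y == 1)                    # minority class
    neg_idx = rng.permutation(np.flatnonzero(y == 0))   # shuffled majority class
    n_parts = max(1, len(neg_idx) // len(pos_idx))
    votes = []
    for chunk in np.array_split(neg_idx, n_parts):
        idx = np.concatenate([pos_idx, chunk])          # one class-balanced subdataset
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        votes.append(clf.predict(X_test))
    # Majority vote across the sub-models.
    return (np.mean(votes, axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.1).astype(int)                # ~10% minority class
print(partition_vote_fit_predict(X, y, X[:10]).tolist())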


2020 ◽  
Vol 19 ◽  

In the real world, the class imbalance problem is a common issue in which a classifier gives more importance to the majority class and less importance to the minority class. Under class imbalance, metrics such as error rate or predictive accuracy are not suitable for evaluating classifier performance. One family of methods for handling imbalanced data is resampling. In this paper, three resampling methods, over-sampling, under-sampling, and hybrid sampling, are applied with different approaches to the class imbalance of two different financial datasets, in order to study the impact of class imbalance ratios on the performance measures of nine classification algorithms. Aiming at better classification performance, the algorithms Bayes Net, Naive Bayes, J48, Random Forest, Meta Attribute Selected Classifier, Meta Classification via Regression, Meta LogitBoost, Logistic Regression, and Decision Tree are evaluated on the multiclass imbalanced data of two Canadian banks using the performance measures Precision, Recall, ROC Area, and Kappa Statistic in the WEKA software. The outcomes of these measurements are compared across the three resampling methods. The results provide a clear picture of the overall impact of class imbalance on the classification datasets, and they indicate that the proposed resampling methods can also be used for class imbalance problems.
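As a hedged illustration in Python rather than WEKA, the sketch below compares over-, under-, and hybrid resampling on a synthetic binary imbalanced dataset using the same metrics the abstract lists (Precision, Recall, ROC Area, Kappa); it is not the banks' data or the paper's exact multiclass setup.

# Compare three resampling strategies with one classifier and four metrics.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=0)  # ~5% minority
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("over", SMOTE(random_state=0)),
                      ("under", RandomUnderSampler(random_state=0)),
                      ("hybrid", SMOTEENN(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    pred = clf.predict(X_te)
    print(name,
          precision_score(y_te, pred),
          recall_score(y_te, pred),
          roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
          cohen_kappa_score(y_te, pred))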


2020 ◽  
Vol 64 (4) ◽  
pp. 40412-1-40412-11
Author(s):  
Kexin Bai ◽  
Qiang Li ◽  
Ching-Hsin Wang

To address the issues of the relatively small size of brain tumor image datasets, severe class imbalance, and low precision in existing segmentation algorithms for brain tumor images, this study proposes a two-stage segmentation algorithm integrating convolutional neural networks (CNNs) and conventional methods. Four modalities of the original magnetic resonance images were first preprocessed separately. Next, preliminary segmentation was performed using an improved U-Net CNN containing deep supervision, residual structures, dense connection structures, and dense skip connections. The authors adopted a multiclass Dice loss function to deal with class imbalance and successfully prevented overfitting using data augmentation. The preliminary segmentation results subsequently served as a priori knowledge for a continuous maximum flow algorithm for fine segmentation of target edges. Experiments revealed that the mean Dice similarity coefficients of the proposed algorithm in whole tumor, tumor core, and enhancing tumor segmentation were 0.9072, 0.8578, and 0.7837, respectively. The proposed algorithm presents higher accuracy and better stability in comparison with some of the more advanced segmentation algorithms for brain tumor images.
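A generic multiclass soft Dice loss of the kind the abstract describes can be sketched in PyTorch as follows; this is a standard formulation, not the authors' exact implementation, and the tensor shapes are assumed.

# Generic multiclass soft Dice loss for segmentation under class imbalance.
import torch

def multiclass_dice_loss(logits, target, eps=1e-6):
    # logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels
    probs = torch.softmax(logits, dim=1)
    one_hot = torch.nn.functional.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                   # sum over batch and spatial dims
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice_per_class = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice_per_class.mean()                   # average the per-class Dice scores

logits = torch.randn(2, 4, 64, 64, requires_grad=True)  # e.g. 4 tumor/background classes
target = torch.randint(0, 4, (2, 64, 64))
loss = multiclass_dice_loss(logits, target)
loss.backward()
print(loss.item())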


2019 ◽  
Vol 12 (10) ◽  
Author(s):  
Swati Narwane ◽  
Sudhir Sawarkar
