Resampling Methods versus Cost Functions for Training an MLP in the Class Imbalance Context

Author(s):  
R. Alejo ◽  
P. Toribio ◽  
J. M. Sotoca ◽  
R. M. Valdovinos ◽  
E. Gasca


Author(s):  
Nur Farhana Hordri ◽  
Siti Sophiayati ◽  
Nurulhuda Firdaus ◽  
Siti Mariyam

2015 ◽  
Vol 744-746 ◽  
pp. 1985-1989 ◽  
Author(s):  
Miao Hua Li ◽  
Shu Yan Chen

Traffic data are highly skewed, with traffic incidents being rare in the real world, while most existing automatic incident detection (AID) algorithms suffer from an inability to detect incidents under imbalanced traffic data conditions. This paper develops feasible AID algorithms based on resampling methods to handle imbalanced traffic data. To obtain the optimal sampling method for incident detection, we compare the detection performance of different AID algorithms based on various resampling methods. Detection performance is evaluated with common criteria, including classification rate, detection rate, false alarm rate, mean time to detection, and an integrated performance index. The I-880 dataset is then used in experiments to verify the proposed algorithms. The experimental results indicate that the proposed resampling-based AID algorithms achieve better performance by handling the imbalanced traffic data problem. Moreover, under-sampling is more competitive than over-sampling for traffic incident detection.
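As a minimal sketch of the general idea (not the paper's exact pipeline), the following Python snippet under-samples an imbalanced incident-detection training set with imbalanced-learn and reports detection rate and false alarm rate; the loop-detector features and class ratio are synthetic stand-ins.

# Minimal sketch: random under-sampling of an imbalanced incident-detection
# training set. Features and incident ratio are invented for illustration.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# Synthetic stand-in for loop-detector features (e.g. occupancy, speed, volume);
# roughly 2% of samples are incidents, mimicking the skew of real traffic data.
X = rng.normal(size=(5000, 3))
y = (rng.random(5000) < 0.02).astype(int)

# Balance the training set by randomly discarding non-incident samples.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, y)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0).fit(X_bal, y_bal)

tn, fp, fn, tp = confusion_matrix(y, clf.predict(X)).ravel()
print("detection rate:", tp / (tp + fn))      # fraction of incidents detected
print("false alarm rate:", fp / (fp + tn))    # fraction of normal samples flagged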


2021 ◽  
Vol 40 (5) ◽  
pp. 10073-10086
Author(s):  
Zhicheng Pang ◽  
Hong Li ◽  
Chiyu Wang ◽  
Jiawen Shi ◽  
Jiale Zhou

In practice, class imbalance is prevalent in sentiment classification tasks and is harmful to classifiers. Recently, over-sampling strategies based on data augmentation techniques have attracted researchers' attention. They generate new samples by rewriting the original samples. Nevertheless, the samples to be rewritten are usually selected at random, which means that useless samples may be selected and added to the training set. Based on this observation, we propose a novel balancing strategy for text sentiment classification. Our approach is built on word replacement and proceeds in two stages, which not only balance the class distribution of the training set but also correct noisy data. In the first stage, we perform word replacement on specific samples rather than random samples to obtain new samples. In the second stage, guided by noise detection, we revise the sentiment of noisy samples. Toward this aim, we propose an improved term-weighting scheme called TF-IGM-CW for imbalanced text datasets, which helps to extract the target rewritten samples and feature words. We conduct experiments on four public sentiment datasets. Results suggest that our method outperforms several other resampling methods and can easily be integrated with various classification algorithms.
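For illustration only, a generic word-replacement augmenter for the minority class might look like the Python sketch below; the synonym table and texts are invented, and the paper's TF-IGM-CW weighting for selecting which samples and words to rewrite is not reproduced here.

# Illustrative word-replacement oversampling for a minority sentiment class.
# The synonym table is a toy placeholder; the paper instead selects target
# samples and feature words via its TF-IGM-CW term weighting.
import random

SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"], "movie": ["film"]}

def rewrite(text, n_changes=1, rng=random.Random(0)):
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n_changes, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

minority = ["the movie was good", "bad acting but good plot"]
augmented = minority + [rewrite(t) for t in minority]   # new samples join the originals
print(augmented)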


2020 ◽  
Author(s):  
Vladimir Golkov ◽  
Alexander Becker ◽  
Daniel T. Plop ◽  
Daniel Čuturilo ◽  
Neda Davoudi ◽  
...  

Computer-aided drug discovery is an essential component of modern drug development. Therein, deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features. Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets. In this work we argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance, its ability to compromise over different decision thresholds, certain freedom to influence the relative weights in this compromise, fidelity to typical benchmarking measures, and equivalence to positive/unlabeled learning. We also propose new training schemes (coherent mini-batch arrangement, and usage of out-of-batch samples) for cost functions based on the ROC, as well as a cost function based on the logAUC metric that facilitates early enrichment (i.e. improves performance at high decision thresholds, as often desired when synthesizing predicted hit compounds). We demonstrate that these approaches outperform standard deep learning approaches on a series of PubChem high-throughput screening datasets that represent realistic and diverse drug discovery campaigns on major drug target families.
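As a rough illustration (not the authors' exact cost function or mini-batch arrangement scheme), a differentiable pairwise surrogate for ROC-AUC can be written in a few lines of PyTorch: every positive score in a batch is compared against every negative score, and a sigmoid relaxes the 0/1 ranking indicator.

# Minimal sketch of a pairwise soft-AUC surrogate loss; hyperparameters and
# batch handling are assumptions, not the paper's proposed scheme.
import torch

def soft_auc_loss(scores, labels, temperature=1.0):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Pairwise differences: loss is high when a negative outranks a positive.
    diffs = pos.unsqueeze(1) - neg.unsqueeze(0)
    return torch.sigmoid(-diffs / temperature).mean()

scores = torch.tensor([2.1, 0.3, -0.5, 1.2], requires_grad=True)
labels = torch.tensor([1, 0, 0, 1])
loss = soft_auc_loss(scores, labels)
loss.backward()   # gradients flow to the scores, so a network can be trained on it
print(loss.item())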


2018 ◽  
Vol 147 (12) ◽  
pp. 161-170
Author(s):  
Víctor D. de la Cruz-Galarza ◽  
Yenny Villuendas-Rey ◽  
Cornelio Yáñez-Márquez

Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 228
Author(s):  
Ahmad B. Hassanat ◽  
Ahmad S. Tarawneh ◽  
Samer Subhi Abed ◽  
Ghada Awad Altarawneh ◽  
Malek Alrashidi ◽  
...  

Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. To address both concerns, we present a linear-time resampling method based on random data partitioning and a majority voting rule, in which an imbalanced dataset is partitioned into a number of small class-balanced subdatasets. A separate classifier is then trained on each subdataset, and the final classification result is established by applying the majority voting rule to the outputs of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results obtained by the classifiers on data generated by the proposed method were comparable to those of most of the resampling methods tested, with the exception of SMOTEFUNA, an oversampling method that increases the probability of overfitting. The proposed method produced results comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
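A minimal Python sketch of the partition-and-vote idea described above follows; the exact partitioning rule, base classifier, and data are assumptions made for illustration.

# Sketch: split the majority class into chunks the size of the minority class,
# train one classifier per balanced subdataset, and combine by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def partition_vote_fit_predict(X, y, X_test, rng=np.random.default_rng(0)):
    pos_idx = np.flatnonzero(y == 1)                    # minority class
    neg_idx = rng.permutation(np.flatnonzero(y == 0))   # shuffled majority class
    n_parts = max(1, len(neg_idx) // len(pos_idx))
    votes = []
    for chunk in np.array_split(neg_idx, n_parts):
        idx = np.concatenate([pos_idx, chunk])          # one class-balanced subdataset
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        votes.append(clf.predict(X_test))
    # Majority vote across the sub-models.
    return (np.mean(votes, axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.1).astype(int)                # ~10% minority class
print(partition_vote_fit_predict(X, y, X[:10]).tolist())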


2020 ◽  
Vol 19 ◽  

In the real world, the class imbalance problem is a common issue in which a classifier gives more importance to the majority class and less importance to the minority class. Under class imbalance, metrics such as error rate or predictive accuracy are not suitable for evaluating classifier performance. One family of methods for handling imbalanced data is resampling. In this paper, three resampling methods, over-sampling, under-sampling, and hybrid sampling, are applied with different approaches to the class imbalance of two different financial datasets, in order to study the impact of class imbalance ratios on the performance measures of nine classification algorithms. Aiming at better classification performance, the algorithms Bayes Net, Naive Bayes, J48, Random Forest, Meta Attribute Selected Classifier, Meta Classification via Regression, Meta LogitBoost, Logistic Regression, and Decision Tree are evaluated on the multiclass imbalanced data of two Canadian banks using the performance measures Precision, Recall, ROC Area, and Kappa Statistic in the WEKA software. The outcomes of these measurements are compared across the three resampling methods. The results provide a clear picture of the overall impact of class imbalance on the classification datasets, and they indicate that the proposed resampling methods can also be used for class imbalance problems.
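As a hedged illustration in Python rather than WEKA, the sketch below compares over-, under-, and hybrid resampling on a synthetic binary imbalanced dataset using the same metrics the abstract lists (Precision, Recall, ROC Area, Kappa); it is not the banks' data or the paper's exact multiclass setup.

# Compare three resampling strategies with one classifier and four metrics.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=0)  # ~5% minority
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("over", SMOTE(random_state=0)),
                      ("under", RandomUnderSampler(random_state=0)),
                      ("hybrid", SMOTEENN(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    pred = clf.predict(X_te)
    print(name,
          precision_score(y_te, pred),
          recall_score(y_te, pred),
          roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
          cohen_kappa_score(y_te, pred))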


2020 ◽  
Vol 64 (4) ◽  
pp. 40412-1-40412-11
Author(s):  
Kexin Bai ◽  
Qiang Li ◽  
Ching-Hsin Wang

To address the issues of the relatively small size of brain tumor image datasets, severe class imbalance, and low precision in existing segmentation algorithms for brain tumor images, this study proposes a two-stage segmentation algorithm integrating convolutional neural networks (CNNs) and conventional methods. Four modalities of the original magnetic resonance images were first preprocessed separately. Next, preliminary segmentation was performed using an improved U-Net CNN containing deep supervision, residual structures, dense connection structures, and dense skip connections. The authors adopted a multiclass Dice loss function to deal with class imbalance and successfully prevented overfitting using data augmentation. The preliminary segmentation results subsequently served as a priori knowledge for a continuous maximum flow algorithm for fine segmentation of target edges. Experiments revealed that the mean Dice similarity coefficients of the proposed algorithm in whole tumor, tumor core, and enhancing tumor segmentation were 0.9072, 0.8578, and 0.7837, respectively. The proposed algorithm presents higher accuracy and better stability in comparison with some of the more advanced segmentation algorithms for brain tumor images.
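A generic multiclass soft Dice loss of the kind the abstract describes can be sketched in PyTorch as follows; this is a standard formulation, not the authors' exact implementation, and the tensor shapes are assumed.

# Generic multiclass soft Dice loss for segmentation under class imbalance.
import torch

def multiclass_dice_loss(logits, target, eps=1e-6):
    # logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels
    probs = torch.softmax(logits, dim=1)
    one_hot = torch.nn.functional.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                   # sum over batch and spatial dims
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice_per_class = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice_per_class.mean()                   # average the per-class Dice scores

logits = torch.randn(2, 4, 64, 64, requires_grad=True)  # e.g. 4 tumor/background classes
target = torch.randint(0, 4, (2, 64, 64))
loss = multiclass_dice_loss(logits, target)
loss.backward()
print(loss.item())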


2019 ◽  
Vol 12 (10) ◽  
Author(s):  
Swati Narwane ◽  
Sudhir Sawarkar
