Addressing the Big Data Multi-class Imbalance Problem with Oversampling and Deep Learning Neural Networks

The class imbalance problem has been a hot topic in the machine learning community in recent years, and it remains relevant in the era of big data and deep learning. Much work has been done to deal with the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. More sophisticated sampling methods have also been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and these have been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has mostly been addressed by adapting traditional techniques, while intelligent approaches have been largely ignored. This work therefore analyzes the capabilities and possibilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to the cleaning strategies. The study is carried out on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. It analyzes the effectiveness of a hybrid approach in which the dataset is first processed with SMOTE and an Artificial Neural Network (ANN) is trained on those data; the ANN output is then processed with ENN to eliminate output noise, and the ANN is trained again on the resulting dataset. The results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the classifier’s nature needs to be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
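
The hybrid approach described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nearest`, `smote`, `enn`, `hybrid`, `train_stand_in`) and the toy data are hypothetical, the SMOTE and ENN routines are simplified pure-Python versions, and a softmax over centroid distances stands in for the ANN so that the example stays self-contained.

```python
import math
import random
from collections import Counter

def nearest(i, points, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def smote(X, y, k=3, seed=0):
    """Simplified SMOTE: grow every class to the majority size by
    interpolating between a sample and one of its same-class neighbours."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_new, y_new = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            i = rng.randrange(len(members))
            j = rng.choice(nearest(i, members, k)) if len(members) > 1 else i
            gap = rng.random()
            X_new.append([a + gap * (b - a)
                          for a, b in zip(members[i], members[j])])
            y_new.append(label)
    return X_new, y_new

def enn(Z, y, k=3):
    """Edited nearest neighbour: return indices of samples whose label
    matches the majority label of their k nearest neighbours in space Z."""
    keep = []
    for i in range(len(Z)):
        votes = Counter(y[j] for j in nearest(i, Z, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return keep

def train_stand_in(X, y):
    """ASSUMPTION: placeholder for the ANN. Returns a function that maps a
    sample to softmax-like class scores based on distance to class centroids."""
    labels = sorted(set(y))
    centroids = []
    for lab in labels:
        pts = [x for x, l in zip(X, y) if l == lab]
        centroids.append([sum(col) / len(pts) for col in zip(*pts)])
    def model(x):
        scores = [math.exp(-math.dist(x, c)) for c in centroids]
        total = sum(scores)
        return [s / total for s in scores]
    return model

def hybrid(X, y, train, k=3, seed=0):
    """SMOTE -> train -> ENN on the model outputs -> retrain."""
    X_bal, y_bal = smote(X, y, k, seed)
    model = train(X_bal, y_bal)
    outputs = [model(x) for x in X_bal]      # cleaning happens in output space
    keep = enn(outputs, y_bal, k)
    X_clean = [X_bal[i] for i in keep]
    y_clean = [y_bal[i] for i in keep]
    return train(X_clean, y_clean), X_clean, y_clean

# Toy three-class imbalanced data (hypothetical): 6, 3 and 2 samples.
X = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.6], [0.7, 0.7], [0.3, 0.3], [0.6, 0.1],
     [5.0, 5.0], [5.4, 5.2], [5.1, 5.6],
     [0.0, 5.0], [0.4, 5.3]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

X_bal, y_bal = smote(X, y)
model, X_clean, y_clean = hybrid(X, y, train_stand_in)
```

The point of the sketch is the ordering in `hybrid`: ENN receives the classifier's output vectors rather than the input features, which is the distinction the abstract draws between cleaning in output space and cleaning in input feature space.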

A systematic study of the class imbalance problem: Automatically identifying empty camera trap images using convolutional neural networks

Ecological Informatics ◽ 10.1016/j.ecoinf.2021.101350 ◽ 2021 ◽ pp. 101350

Author(s): Deng-Qi Yang ◽ Tao Li ◽ Meng-Tao Liu ◽ Xiao-Wei Li ◽ Ben-Hui Chen

Keyword(s): Neural Networks ◽ Convolutional Neural Networks ◽ Systematic Study ◽ Class Imbalance ◽ Camera Trap ◽ Class Imbalance Problem ◽ Imbalance Problem

Convolutional neural networks based focal loss for class imbalance problem: a case study of canine red blood cells morphology classification

Journal of Ambient Intelligence and Humanized Computing ◽ 10.1007/s12652-020-01773-x ◽ 2020 ◽ Cited By ~ 6

Author(s): Kitsuchart Pasupa ◽ Supawit Vatathanavaro ◽ Suchat Tungjitnob

Keyword(s): Neural Networks ◽ Red Blood Cells ◽ Convolutional Neural Networks ◽ Blood Cells ◽ Class Imbalance ◽ Class Imbalance Problem ◽ Imbalance Problem

Optimization Model of K-Means Clustering Using Artificial Neural Networks to Handle Class Imbalance Problem

IOP Conference Series Materials Science and Engineering ◽ 10.1088/1757-899x/288/1/012075 ◽ 2018 ◽ Vol 288 ◽ pp. 012075 ◽ Cited By ~ 18

Author(s): Hartono ◽ O S Sitompul ◽ Tulus ◽ E B Nababan

Keyword(s): Neural Networks ◽ Artificial Neural Networks ◽ Optimization Model ◽ Class Imbalance ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ Artificial Neural

An Empirical Study for the Multi-class Imbalance Problem with Neural Networks

Lecture Notes in Computer Science - Progress in Pattern Recognition, Image Analysis and Applications ◽ 10.1007/978-3-540-85920-8_59 ◽ 2008 ◽ pp. 479-486 ◽ Cited By ~ 8

Author(s): R. Alejo ◽ J. M. Sotoca ◽ G. A. Casañ

Keyword(s): Neural Networks ◽ Empirical Study ◽ Class Imbalance ◽ Class Imbalance Problem ◽ Imbalance Problem

A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Applied Computer Systems ◽ 10.2478/acss-2019-0013 ◽ 2019 ◽ Vol 24 (2) ◽ pp. 104-110

Author(s): Duygu Sinanc Terzi ◽ Seref Sagiroglu

Keyword(s): Big Data ◽ Class Imbalance ◽ Area Under The Curve ◽ Data Sets ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ The Common ◽ Public Datasets ◽ Distributed Cluster

The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims to modify the existing dataset to increase classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy is designed to show the success of the model on datasets with different imbalance ratios. The second is designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed the other imbalanced big data solutions in the literature and increased area under the curve values by between 10% and 24% in the case study.

An Insight on the Class Imbalance Problem and Its Solutions in Big Data

Large-Scale Data Streaming, Processing, and Blockchain Security - Advances in Information Security, Privacy, and Ethics ◽ 10.4018/978-1-7998-3444-1.ch002 ◽ 2021 ◽ pp. 39-49

Author(s): Khyati Ahlawat ◽ Anuradha Chug ◽ Amit Prakash Singh

Keyword(s): Machine Learning ◽ Big Data ◽ Class Imbalance ◽ Classification Problem ◽ Correct Classification ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ Methods And Techniques ◽ Conventional Machine ◽ Work Done

Expansion of data in volume, variety, or velocity leads to big data. Learning from big data is challenging and beyond the capacity of conventional machine learning methods and techniques. Big data generated in real-time scenarios is generally imbalanced in nature, with an uneven distribution of classes. This adds complexity to learning from big data, since the underrepresented class is the more influential one and its correct classification is more critical than that of the overrepresented class. This chapter addresses the imbalance problem and its solutions in the context of big data, along with a detailed survey of work done in this area. It then presents an experimental view of solving the imbalanced classification problem, followed by a comparative analysis of different methodologies.

Training cost-sensitive neural networks with methods addressing the class imbalance problem

IEEE Transactions on Knowledge and Data Engineering ◽ 10.1109/tkde.2006.17 ◽ 2006 ◽ Vol 18 (1) ◽ pp. 63-77 ◽ Cited By ~ 627

Author(s): Zhi-Hua Zhou ◽ Xu-Ying Liu

Keyword(s): Neural Networks ◽ Class Imbalance ◽ Class Imbalance Problem ◽ Imbalance Problem ◽ Training Cost

Ensemble of Neural Networks to Solve Class Imbalance Problem of Protein Secondary Structure Prediction

International Journal of Artificial Intelligence & Applications ◽ 10.5121/ijaia.2012.3602 ◽ 2012 ◽ Vol 3 (6) ◽ pp. 9-20 ◽ Cited By ~ 3

Author(s): Maryam Alirezaee

Keyword(s): Neural Networks ◽ Secondary Structure ◽ Structure Prediction ◽ Secondary Structure Prediction ◽ Class Imbalance ◽ Protein Secondary Structure ◽ Protein Secondary Structure Prediction ◽ Class Imbalance Problem ◽ Imbalance Problem

A Novel Hybrid Sampling Algorithm for Solving Class Imbalance Problem in Big Data

Advances in Data Science and Adaptive Analysis ◽ 10.1142/s2424922x21500054 ◽ 2021 ◽ pp. 2150005

Author(s): Khyati Ahlawat ◽ Anuradha Chug ◽ Amit Prakash Singh

Keyword(s): Big Data ◽ Class Imbalance ◽ Support Vector ◽ Efficiency Gain ◽ Learning Approaches ◽ K Nearest Neighbor ◽ Class Imbalance Problem ◽ Sampling Algorithm ◽ Imbalance Problem ◽ Hybrid Sampling

An uneven distribution of classes in a dataset biases any standard classifier toward the majority class. Instances of the class of interest, being few in number, are generally ignored, and their correct classification, which is of paramount interest, is often obscured when only overall accuracy is calculated. Conventional machine learning approaches have therefore been refined to address this class imbalance problem. The challenge of imbalanced classes is more prevalent in big data scenarios because of the high data volume. This study examines a sampling solution based on cluster computing for handling class imbalance problems in big data. The proposed hybrid sampling algorithm (HSA) is assessed with three popular classification algorithms, namely support vector machine, decision tree, and k-nearest neighbor, in terms of balanced accuracy and elapsed time. The experimental results are considered promising, with an efficiency gain of 42% in comparison to the traditional sampling solution, the synthetic minority oversampling technique (SMOTE). This work demonstrates the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.
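
The difference between overall accuracy and balanced accuracy mentioned in this abstract can be illustrated with a small worked example. The sketch below assumes the usual definition of balanced accuracy as the mean of per-class recall (the abstract does not spell out its formula); the function names and the 95/5 toy labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, regardless of class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical imbalanced labels: 95 majority samples, 5 minority samples.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

overall = accuracy(y_true, y_pred)            # 95 of 100 correct
balanced = balanced_accuracy(y_true, y_pred)  # recall 1.0 and 0.0, averaged
```

Here the always-majority classifier reaches an overall accuracy of 0.95 while its balanced accuracy is only 0.5, which is why the minority class can be overlooked when overall accuracy alone is reported.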
