scholarly journals Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

2020 ◽  
Vol 10 (4) ◽  
pp. 1276 ◽  
Author(s):  
Eréndira Rendón ◽  
Roberto Alejo ◽  
Carlos Castorena ◽  
Frank J. Isidro-Ortega ◽  
Everardo E. Granda-Gutiérrez

The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the time of big data and deep learning, this problem remains in force. Much work has been performed to deal to the class imbalance problem, the random sampling methods (over and under sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and also they have been combined with cleaning techniques such as Editing Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adaptation of traditional techniques, relatively ignoring intelligent approaches. Thus, the capabilities and possibilities of heuristic sampling methods on deep learning neural networks in big data domain are analyzed in this work, and the cleaning strategies are particularly analyzed. This study is developed on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed, in which the dataset is cleaned by SMOTE followed by the training of an Artificial Neural Network (ANN) with those data, while the neural network output noise is processed with ENN to eliminate output noise; after that, the ANN is trained again with the resultant dataset. Obtained results suggest that best classification outcome is achieved when the cleaning strategies are applied on an ANN output instead of input feature space only. Consequently, the need to consider the classifier’s nature when the classical class imbalance approaches are adapted in deep learning and big data scenarios is clear.

2018 ◽  
Vol 7 (2.14) ◽  
pp. 478 ◽  
Author(s):  
Hartono . ◽  
Opim Salim Sitompul ◽  
Erna Budhiarti Nababan ◽  
Tulus . ◽  
Dahlan Abdullah ◽  
...  

Data mining and machine learning techniques designed to solve classification problems require balanced class distribution. However, in reality sometimes the classification of datasets indicates the existence of a class represented by a large number of instances whereas there are classes with far fewer instances. This problem is known as the class imbalance problem. Classifier Ensembles is a method often used in overcoming class imbalance problems. Data Diversity is one of the cornerstones of ensembles. An ideal ensemble system should have accurrate individual classifiers and if there is an error it is expected to occur on different objects or instances. This research will present the results of overview and experimental study using Hybrid Approach Redefinition (HAR) Method in handling class imbalance and at the same time expected to get better data diversity. This research will be conducted using 6 datasets with different imbalanced ratios and will be compared with SMOTEBoost which is one of the Re-Weighting method which is often used in handling class imbalance. This study shows that the data diversity is related to performance in the imbalance learning ensembles and the proposed methods can obtain better data diversity.  


2019 ◽  
Vol 24 (2) ◽  
pp. 104-110
Author(s):  
Duygu Sinanc Terzi ◽  
Seref Sagiroglu

Abstract The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10 % and 24 % through the case study.


Author(s):  
Khyati Ahlawat ◽  
Anuradha Chug ◽  
Amit Prakash Singh

Expansion of data in the dimensions of volume, variety, or velocity is leading to big data. Learning from this big data is challenging and beyond capacity of conventional machine learning methods and techniques. Generally, big data getting generated from real-time scenarios is imbalance in nature with uneven distribution of classes. This imparts additional complexity in learning from big data since the class that is underrepresented is more influential and its correct classification becomes critical than that of overrepresented class. This chapter addresses the imbalance problem and its solutions in context of big data along with a detailed survey of work done in this area. Subsequently, it also presents an experimental view for solving imbalance classification problem and a comparative analysis between different methodologies afterwards.


Sign in / Sign up

Export Citation Format

Share Document