An Insight on the Class Imbalance Problem and Its Solutions in Big Data

Author(s):  
Khyati Ahlawat ◽  
Anuradha Chug ◽  
Amit Prakash Singh

Expansion of data along the dimensions of volume, variety, and velocity leads to big data. Learning from such big data is challenging and beyond the capacity of conventional machine learning methods and techniques. Big data generated from real-time scenarios is generally imbalanced in nature, with an uneven distribution of classes. This adds complexity to learning from big data, since the underrepresented class is often the more important one and its correct classification is more critical than that of the overrepresented class. This chapter addresses the imbalance problem and its solutions in the context of big data, along with a detailed survey of work done in this area. Subsequently, it presents an experimental view of solving the imbalanced classification problem and a comparative analysis of different methodologies.

Author(s):  
Vanessa Faria De Souza ◽  
Gabriela Perry

This paper presents the results of a literature review carried out with the objective of identifying prevalent research goals and challenges in the prediction of student behavior in MOOCs using machine learning. The results allowed recognizing two goals: 1. Student Classification and 2. Dropout Prediction. Regarding the challenges, five items were identified: 1. Incompatibility of Virtual Learning Environments (VLEs), 2. Complexity of Data Manipulation, 3. Class Imbalance Problem, 4. Influence of External Factors, and 5. Difficulty in Manipulating Data by Untrained Personnel.


2016 ◽  
Vol 7 (2) ◽  
pp. 43-71 ◽  
Author(s):  
Sangeeta Lal ◽  
Neetu Sardana ◽  
Ashish Sureka

Logging is an important yet tough decision for OSS developers. Machine-learning models are useful in improving several steps of OSS development, including logging. Several recent studies propose machine-learning models to predict logged code constructs. The prediction performance of these models is limited by the class-imbalance problem, since the number of logged code constructs is small compared to non-logged code constructs. No previous study analyzes the class-imbalance problem for logged code construct prediction. The authors first analyze the performance of J48, RF, and SVM classifiers for catch-block and if-block logged code construct prediction on imbalanced datasets. Second, the authors propose LogIm, an ensemble and threshold-based machine-learning model. Third, the authors evaluate the performance of LogIm on three open-source projects. On average, LogIm improves the performance of the baseline classifiers J48, RF, and SVM by 7.38%, 9.24%, and 4.6% for catch-block, and 12.11%, 14.95%, and 19.13% for if-block logging prediction.
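The abstract does not detail LogIm's internals, but its "ensemble and threshold-based" idea can be sketched as follows: average the positive-class scores of several base classifiers, then tune the decision threshold to favor the rare (logged) class. The toy scorers, data, and the F1-based tuning criterion below are illustrative assumptions, not the paper's actual design.

```python
def ensemble_score(x, scorers):
    """Average the positive-class score of several base scorers."""
    return sum(s(x) for s in scorers) / len(scorers)

def tune_threshold(samples, labels, scorers, grid=None):
    """Pick the decision threshold that maximizes F1 on the rare class."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [1 if ensemble_score(x, scorers) >= t else 0 for x in samples]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Hypothetical scorers returning a pseudo-probability that a block is logged.
scorers = [lambda x: min(1.0, x / 10), lambda x: x / 12]
samples = [1, 2, 8, 9, 10, 3]
labels  = [0, 0, 1, 1, 1, 0]
t = tune_threshold(samples, labels, scorers)
```

Lowering the threshold below the default 0.5 is a common way such models trade a few false positives for much better recall on the minority class.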


2019 ◽  
Vol 24 (2) ◽  
pp. 104-110
Author(s):  
Duygu Sinanc Terzi ◽  
Seref Sagiroglu

The class imbalance problem, one of the common data irregularities, causes the development of models that under-represent the minority class. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims to modify the existing dataset to increase classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy was designed to present the success of the model on datasets with different imbalance ratios. The second strategy was designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed the other imbalanced big data solutions in the literature and increased area under the curve values by between 10% and 24% in the case study.
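Cluster-based resampling of the kind DIBID distributes over MapReduce can be illustrated in miniature: cluster the majority class, then undersample evenly from each cluster so that no region of the majority distribution is discarded wholesale. The tiny 1-D k-means, the data, and the per-cluster quota below are assumptions for illustration only; DIBID's actual clustering and MapReduce mechanics are not reproduced here.

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means used only to group majority-class values."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def cluster_undersample(majority, minority, k=3, seed=0):
    """Undersample the majority class evenly across clusters so its size
    roughly matches the minority class (cluster-based, simplified)."""
    rng = random.Random(seed)
    clusters = [c for c in kmeans_1d(majority, k) if c]
    per_cluster = max(1, len(minority) // len(clusters))
    kept = []
    for c in clusters:
        kept.extend(rng.sample(c, min(per_cluster, len(c))))
    return kept

majority = [1, 2, 3, 10, 11, 12, 20, 21, 22, 23]
minority = [100, 101, 102]
balanced_majority = cluster_undersample(majority, minority)
```

Sampling per cluster, rather than globally at random, is what preserves the internal structure of the majority class after reduction.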


Author(s):  
Hartono Hartono ◽  
Opim Salim Sitompul ◽  
Tulus Tulus ◽  
Erna Budhiarti Nababan

Class imbalance occurs when the instances of one class greatly outnumber those of the other classes. This major machine learning problem can degrade predictive accuracy. The Support Vector Machine (SVM) is a robust and precise method for handling the class imbalance problem but is weak under biased data distributions, so the Biased Support Vector Machine (BSVM) became a popular choice for solving the problem. BSVM provides better control of sensitivity yet lacks accuracy compared to the general SVM. This study proposes the integration of BSVM and SMOTEBoost to handle the class imbalance problem. Non-Support Vector (NSV) sets from negative samples and Support Vector (SV) sets from positive samples undergo a Weighted-SMOTE process. The results indicate that the integration of the Biased Support Vector Machine and Weighted-SMOTE achieves better accuracy and sensitivity.
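The core SMOTE step that Weighted-SMOTE builds on can be sketched as follows: each synthetic minority point is an interpolation between a minority sample and one of its k nearest minority neighbours. This is plain SMOTE on hypothetical 2-D points; the per-sample weighting that the paper adds is not shown.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate synthetic minority points by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not a),
                            key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
new_points = smote_like(minority, n_new=5)
```

Because the new points lie on segments between existing minority samples, they densify the minority region instead of merely duplicating it, which is what distinguishes SMOTE from random oversampling.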


Energies ◽  
2021 ◽  
Vol 15 (1) ◽  
pp. 212
Author(s):  
Ajit Kumar ◽  
Neetesh Saxena ◽  
Souhwan Jung ◽  
Bong Jun Choi

Critical infrastructures have recently been integrated with digital controls to support intelligent decision making. Although this integration provides various benefits and improvements, it also exposes the system to new cyberattacks. In particular, the injection of false data and commands into communication is one of the most common and fatal cyberattacks in critical infrastructures. Hence, in this paper, we investigate the effectiveness of machine-learning algorithms in detecting False Data Injection Attacks (FDIAs). In particular, we focus on two of the most widely used critical infrastructures, namely power systems and water treatment plants. This study focuses on tackling two key technical issues: (1) finding the set of best features under a different combination of techniques and (2) resolving the class imbalance problem using oversampling methods. We evaluate the performance of each algorithm in terms of time complexity and detection accuracy to meet the time-critical requirements of critical infrastructures. Moreover, we address the inherent skewed distribution problem and the data imbalance problem commonly found in many critical infrastructure datasets. Our results show that the considered minority oversampling techniques can improve the Area Under Curve (AUC) of GradientBoosting, AdaBoost, and kNN by 10–12%.
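The two ingredients the study combines, minority oversampling and AUC-based evaluation, can be sketched in a few lines. The random duplication shown here is the simplest oversampler (the study also uses synthetic techniques), and the rank-based AUC definition is standard; the toy data is an assumption.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority samples at random until the classes balance."""
    rng = random.Random(seed)
    pos = [x for x, l in zip(X, y) if l == 1]
    neg = [x for x, l in zip(X, y) if l == 0]
    small, big, s_lab, b_lab = ((pos, neg, 1, 0) if len(pos) < len(neg)
                                else (neg, pos, 0, 1))
    small = small + [rng.choice(small) for _ in range(len(big) - len(small))]
    return small + big, [s_lab] * len(small) + [b_lab] * len(big)

def auc(scores_pos, scores_neg):
    """AUC = P(random positive outranks random negative), ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

X = list(range(10))
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
```

AUC is the natural headline metric here because, unlike plain accuracy, it is insensitive to the class proportions that oversampling changes.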


Classification is a major challenge in machine learning generally, and specifically when tackling the class imbalance problem. A dataset is said to be imbalanced if the class we are interested in falls into the minority class and appears scanty compared to the majority class; the minority class is also known as the positive class, while the majority class is also known as the negative class. Class imbalance has been a major bottleneck for machine learning scientists, as it often leads to using the wrong model for a given purpose; this survey will help researchers choose the right model and the best strategies for handling imbalanced datasets in the course of tackling machine learning problems. Proper handling of a class-imbalanced dataset can lead to accurate and good results. Handling class-imbalanced data in a conventional manner, especially when the level of imbalance is high, may lead to the accuracy paradox (reporting, say, 99% accuracy during evaluation simply because the class distribution is highly imbalanced). Hence, an imbalanced class distribution requires special consideration, and for this purpose we deal extensively with approaches to handling and solving the imbalanced class problem in machine learning: the data sampling approach, the cost-sensitive learning approach, and the ensemble approach.
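The accuracy paradox mentioned above is easy to reproduce with a toy 99:1 dataset: a classifier that only ever predicts the majority class scores 99% accuracy while never detecting a single minority instance.

```python
# A majority-only classifier on a 99:1 dataset: high accuracy, zero recall.
y_true = [0] * 99 + [1]   # 99 negatives, 1 positive
y_pred = [0] * 100        # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)  # recall on the minority (positive) class

print(accuracy)  # 0.99
print(recall)    # 0.0
```

This is exactly why imbalanced problems are evaluated with recall, F1, AUC, or balanced accuracy rather than raw accuracy.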


Author(s):  
Khyati Ahlawat ◽  
Anuradha Chug ◽  
Amit Prakash Singh

The uneven distribution of classes in a dataset biases any standard classifier toward the majority class. The instances of the significant class, being deficient in number, are generally ignored, and their correct classification, which is of paramount interest, is often overlooked when calculating overall accuracy. Therefore, conventional machine learning approaches are rigorously refined to address this class imbalance problem. The challenge of imbalanced classes is more prevalent in big data scenarios due to their high volume. This study deals with a sampling solution based on cluster computing for handling class imbalance problems in the case of big data. The newly proposed hybrid sampling algorithm (HSA) is assessed using three popular classification algorithms, namely support vector machine, decision tree, and k-nearest neighbor, based on balanced accuracy and elapsed time. The results obtained from the experiment are promising, with an efficiency gain of 42% in comparison to the traditional sampling solution, the synthetic minority oversampling technique (SMOTE). This work demonstrates the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.
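Balanced accuracy, the first of the two evaluation metrics used above, is simply the mean of the per-class recalls, so a majority-only classifier cannot inflate it. A minimal implementation on assumed toy labels:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; insensitive to class proportions."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Majority-only predictions on a 9:1 dataset: plain accuracy would be
# 0.9, but balanced accuracy exposes the missed minority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A score of 0.5 on a binary problem means chance-level performance, regardless of how skewed the class distribution is.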


2020 ◽  
Vol 10 (4) ◽  
pp. 1276 ◽  
Author(s):  
Eréndira Rendón ◽  
Roberto Alejo ◽  
Carlos Castorena ◽  
Frank J. Isidro-Ortega ◽  
Everardo E. Granda-Gutiérrez

The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has been performed to deal with the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and they have also been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has notably been addressed by adapting traditional techniques, while intelligent approaches have been relatively ignored. Thus, this work analyzes the capabilities and possibilities of heuristic sampling methods on deep learning neural networks in the big data domain, with particular attention to cleaning strategies. The study is developed on big, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach is analyzed on these datasets: the dataset is balanced with SMOTE, an Artificial Neural Network (ANN) is trained with those data, the network's output is processed with ENN to eliminate output noise, and the ANN is then trained again with the resulting dataset. The results obtained suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space alone. Consequently, the classifier's nature clearly needs to be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
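The ENN cleaning step used in the pipeline above can be sketched as follows: a sample is dropped whenever its label disagrees with the majority label of its k nearest neighbours. This is the classic Edited Nearest Neighbor rule applied to assumed 2-D toy points; in the paper it is applied to the network's output space, which is not reproduced here.

```python
import math
from collections import Counter

def enn_clean(X, y, k=3):
    """Edited Nearest Neighbor: drop any sample whose label disagrees
    with the majority label of its k nearest neighbours."""
    kept_X, kept_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(xi, X[j]))[:k]
        majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority == yi:
            kept_X.append(xi)
            kept_y.append(yi)
    return kept_X, kept_y

# A lone positive deep inside the negative region gets edited out.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0)]
y = [0, 0, 0, 1, 1, 1]
Xc, yc = enn_clean(X, y)
```

Unlike SMOTE, which adds points, ENN removes noisy or borderline ones, which is why the two are complementary in SMOTE+ENN.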


Symmetry ◽  
2019 ◽  
Vol 11 (12) ◽  
pp. 1504 ◽  
Author(s):  
Solomon H. Ebenuwa ◽  
Mhd Saeed Sharif ◽  
Ameer Al-Nemrat ◽  
Ali H. Al-Bayatti ◽  
Nasser Alalwan ◽  
...  

Imbalanced classes in multi-class datasets are one of the most salient hindrances to accurate and dependable predictive modeling. In predictions there are always majority and minority classes, and in most cases it is difficult to capture the members belonging to the minority classes. This anomaly is traceable to the design of predictive algorithms, because most algorithms do not factor the unequal numbers of classes into their designs and implementations. The accuracy of most modeling processes is subject to the ever-present consequences of imbalanced classes. This paper employs the variance ranking technique to deal with the real-world class imbalance problem. We augmented this technique using one-versus-all re-coding of the multi-class datasets. The proof-of-concept experimentation shows that our technique performs better when compared with previous work on capturing small-class members in multi-class datasets.
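The one-versus-all re-coding step is mechanical: each multi-class label vector is turned into a binary vector per target class. The toy labels below are illustrative; note how each resulting binary problem is itself highly imbalanced, which is exactly the setting the paper's variance ranking technique targets.

```python
def one_versus_all(labels, target):
    """Recode a multi-class label list into a binary problem:
    1 for the target class, 0 for everything else."""
    return [1 if l == target else 0 for l in labels]

labels = ["cat", "dog", "bird", "dog", "cat", "dog"]
binary = one_versus_all(labels, "bird")
print(binary)  # [0, 0, 1, 0, 0, 0]
```

Repeating this once per class yields as many binary classifiers as there are classes; at prediction time the class whose classifier is most confident wins.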


2017 ◽  
Vol 26 (03) ◽  
pp. 1750009 ◽  
Author(s):  
Dionisios N. Sotiropoulos ◽  
George A. Tsihrintzis

This paper focuses on a special category of machine learning problems arising in cases where the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through the utilization of an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced "self" / "non-self" discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task, since the vast majority of molecular patterns pertain to the "non-self" space. Our work focuses on investigating the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its relative ability to deal with extremely skewed datasets when compared against two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset where a small fraction of positive samples was to be recognized against a vast volume of negative samples. The results obtained indicate that the bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is competently close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality to address the class imbalance problem.
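The self/non-self discrimination the paper draws on is often illustrated with the classic negative-selection scheme: random detectors are kept only if they do not match any abundant "self" sample, and a pattern is flagged "non-self" when some detector covers it. The 1-D points, radius, and detector count below are assumptions for illustration; the paper's actual AIS algorithm is not specified in this abstract.

```python
import random

def train_detectors(self_set, n_detectors, radius, lo=0.0, hi=10.0, seed=0):
    """Negative selection: keep random detector points that do NOT fall
    within `radius` of any 'self' (majority) sample."""
    rng = random.Random(seed)
    detectors = []
    while len(detectors) < n_detectors:
        d = rng.uniform(lo, hi)
        if all(abs(d - s) > radius for s in self_set):
            detectors.append(d)
    return detectors

def is_non_self(x, detectors, radius):
    """A pattern is flagged 'non-self' if any detector covers it."""
    return any(abs(x - d) <= radius for d in detectors)

self_set = [1.0, 1.5, 2.0, 2.5]   # abundant 'self' (majority) patterns
detectors = train_detectors(self_set, n_detectors=20, radius=0.6)
```

The appeal for imbalanced data is that training only consumes the plentiful majority ("self") samples, while the scarce minority class is detected as whatever the detectors cover.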

