A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Abstract The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10 % and 24 % through the case study.

Download Full-text

Convolutional neural networks based focal loss for class imbalance problem: a case study of canine red blood cells morphology classification

Journal of Ambient Intelligence and Humanized Computing ◽

10.1007/s12652-020-01773-x ◽

2020 ◽

Cited By ~ 6

Author(s):

Kitsuchart Pasupa ◽

Supawit Vatathanavaro ◽

Suchat Tungjitnob

Keyword(s):

Neural Networks ◽

Red Blood Cells ◽

Convolutional Neural Networks ◽

Blood Cells ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction

Applied Sciences ◽

10.3390/app10228059 ◽

2020 ◽

Vol 10 (22) ◽

pp. 8059

Author(s):

Haonan Tong ◽

Shihai Wang ◽

Guangling Li

Keyword(s):

Class Imbalance ◽

Area Under The Curve ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Synthetic Sample ◽

Promising Alternative ◽

Imbalance Problem ◽

Software Defect ◽

Boosting Method ◽

High Credibility

Imbalanced data are a major factor for degrading the performance of software defect models. Software defect dataset is imbalanced in nature, i.e., the number of non-defect-prone modules is far more than that of defect-prone ones, which results in the bias of classifiers on the majority class samples. In this paper, we propose a novel credibility-based imbalance boosting (CIB) method in order to address the class-imbalance problem in software defect proneness prediction. The method measures the credibility of synthetic samples based on their distribution by introducing a credit factor to every synthetic sample, and proposes a weight updating scheme to make the base classifiers focus on synthetic samples with high credibility and real samples. Experiments are performed on 11 NASA datasets and nine PROMISE datasets by comparing CIB with MAHAKIL, AdaC2, AdaBoost, SMOTE, RUS, No sampling method in terms of four performance measures, i.e., area under the curve (AUC), F1, AGF, and Matthews correlation coefficient (MCC). Wilcoxon sign-ranked test and Cliff’s δ are separately used to perform statistical test and calculate effect size. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect-prone prediction as compared with previous methods.

Download Full-text

A learning method for the class imbalance problem with medical data sets

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2010.03.005 ◽

2010 ◽

Vol 40 (5) ◽

pp. 509-518 ◽

Cited By ~ 89

Author(s):

Der-Chiang Li ◽

Chiao-Wen Liu ◽

Susan C. Hu

Keyword(s):

Class Imbalance ◽

Medical Data ◽

Data Sets ◽

Learning Method ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

A Cost-Sensitive Sparse Representation Based Classification for Class-Imbalance Problem

Scientific Programming ◽

10.1155/2016/8035089 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9

Author(s):

Zhenbing Liu ◽

Chunyang Gao ◽

Huihua Yang ◽

Qijia He

Keyword(s):

Sparse Representation ◽

Classification Accuracy ◽

Class Imbalance ◽

Misclassification Rate ◽

Data Sets ◽

Class Imbalance Problem ◽

Misclassification Cost ◽

Practical Applications ◽

Imbalance Problem ◽

Positive Class

Sparse representation has been successfully used in pattern recognition and machine learning. However, most existing sparse representation based classification (SRC) methods are to achieve the highest classification accuracy, assuming the same losses for different misclassifications. This assumption, however, may not hold in many practical applications as different types of misclassification could lead to different losses. In real-world application, much data sets are imbalanced of the class distribution. To address these problems, we propose a cost-sensitive sparse representation based classification (CSSRC) for class-imbalance problem method by using probabilistic modeling. Unlike traditional SRC methods, we predict the class label of test samples by minimizing the misclassification losses, which are obtained via computing the posterior probabilities. Experimental results on the UCI databases validate the efficacy of the proposed approach on average misclassification cost, positive class misclassification rate, and negative class misclassification rate. In addition, we sampled test samples and training samples with different imbalance ratio and use F-measure, G-mean, classification accuracy, and running time to evaluate the performance of the proposed method. The experiments show that our proposed method performs competitively compared to SRC, CSSVM, and CS4VM.

Download Full-text

Pulsar candidate classification using generative adversary networks

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2975 ◽

2019 ◽

Vol 490 (4) ◽

pp. 5424-5439 ◽

Cited By ~ 3

Author(s):

Ping Guo ◽

Fuqing Duan ◽

Pei Wang ◽

Yao Yao ◽

Qian Yin ◽

...

Keyword(s):

Class Imbalance ◽

Feature Learning ◽

Computer Experiments ◽

Support Vector ◽

Data Sets ◽

Class Imbalance Problem ◽

Generative Adversarial Network ◽

Feature Representations ◽

Adversarial Network ◽

Imbalance Problem

ABSTRACT Discovering pulsars is a significant and meaningful research topic in the field of radio astronomy. With the advent of astronomical instruments, the volume and rate of data acquisition have grown exponentially. This development necessitates a focus on artificial intelligence (AI) technologies that can mine large astronomical data sets. Automatic pulsar candidate identification (APCI) can be considered as a task determining potential candidates for further investigation and eliminating the noise of radio-frequency interference and other non-pulsar signals. As reported in the existing literature, AI techniques, especially convolutional neural network (CNN)-based techniques, have been adopted for APCI. However, it is challenging to enhance the performance of CNN-based pulsar identification because only an extremely limited number of real pulsar samples exist, which results in a crucial class imbalance problem. To address these problems, we propose a framework that combines a deep convolution generative adversarial network (DCGAN) with a support vector machine (SVM). The DCGAN is used as a sample generation and feature learning model, and the SVM is adopted as the classifier for predicting the label of a candidate at the inference stage. The proposed framework is a novel technique, which not only can solve the class imbalance problem but also can learn the discriminative feature representations of pulsar candidates instead of computing hand-crafted features in the pre-processing steps. The proposed method can enhance the accuracy of the APCI, and the computer experiments performed on two pulsar data sets verified the effectiveness and efficiency of the proposed method.

Download Full-text

An Insight on the Class Imbalance Problem and Its Solutions in Big Data

Large-Scale Data Streaming, Processing, and Blockchain Security - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-7998-3444-1.ch002 ◽

2021 ◽

pp. 39-49

Author(s):

Khyati Ahlawat ◽

Anuradha Chug ◽

Amit Prakash Singh

Keyword(s):

Machine Learning ◽

Big Data ◽

Class Imbalance ◽

Classification Problem ◽

Correct Classification ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Methods And Techniques ◽

Conventional Machine ◽

Work Done

Expansion of data in the dimensions of volume, variety, or velocity is leading to big data. Learning from this big data is challenging and beyond capacity of conventional machine learning methods and techniques. Generally, big data getting generated from real-time scenarios is imbalance in nature with uneven distribution of classes. This imparts additional complexity in learning from big data since the class that is underrepresented is more influential and its correct classification becomes critical than that of overrepresented class. This chapter addresses the imbalance problem and its solutions in context of big data along with a detailed survey of work done in this area. Subsequently, it also presents an experimental view for solving imbalance classification problem and a comparative analysis between different methodologies afterwards.

Download Full-text

A Novel Oversampling Technique to Solve Class Imbalance Problem: A Case Study of Students’ Grades Evaluation

10.1109/contesa52813.2021.9657151 ◽

2021 ◽

Author(s):

Dilshad Jahin ◽

Israt Jahan Emu ◽

Subrina Akter ◽

Muhammed J.A. Patwary ◽

Mohammad Arif Sobhan Bhuiyan ◽

...

Keyword(s):

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem

Download Full-text

Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method

Mathematical Problems in Engineering ◽

10.1155/2018/5036710 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 8

Author(s):

Jianhong Yan ◽

Suqing Han

Keyword(s):

Learning Community ◽

Class Imbalance ◽

Imbalanced Data ◽

Data Sets ◽

Class Imbalance Problem ◽

Imbalanced Data Sets ◽

Imbalance Data ◽

Imbalance Problem ◽

Stacked Generalization ◽

Model Generalization

Learning with imbalanced data sets is considered as one of the key topics in machine learning community. Stacking ensemble is an efficient algorithm for normal balance data sets. However, stacking ensemble was seldom applied in imbalance data. In this paper, we proposed a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on 2-layer learning models. The first step is Level 0 model generalization including data preprocessing and base model training. The second step is Level 1 model generalization involving cost-sensitive classifier and logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalance data learning methods. In the cost-sensitive algorithm, cost matrix is combined with both data characters and algorithms. In the RECSG method, ensemble algorithm is combined with imbalance data techniques. According to the experiment results obtained with 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed the better classification performances than other ensemble and single algorithms. The proposed method is especially more efficient when the performance of base classifier is low. All these demonstrated that the proposed method could be applied in the class imbalance problem.

Download Full-text

A Novel Hybrid Sampling Algorithm for Solving Class Imbalance Problem in Big Data

Advances in Data Science and Adaptive Analysis ◽

10.1142/s2424922x21500054 ◽

2021 ◽

pp. 2150005

Author(s):

Khyati Ahlawat ◽

Anuradha Chug ◽

Amit Prakash Singh

Keyword(s):

Big Data ◽

Class Imbalance ◽

Support Vector ◽

Efficiency Gain ◽

Learning Approaches ◽

K Nearest Neighbor ◽

Class Imbalance Problem ◽

Sampling Algorithm ◽

Imbalance Problem ◽

Hybrid Sampling

The uneven distribution of classes in any dataset poses a tendency of biasness toward the majority class when analyzed using any standard classifier. The instances of the significant class being deficient in numbers are generally ignored and their correct classification which is of paramount interest is often overlooked in calculating overall accuracy. Therefore, the conventional machine learning approaches are rigorously refined to address this class imbalance problem. This challenge of imbalanced classes is more prevalent in big data scenario due to its high volume. This study deals with acknowledging a sampling solution based on cluster computing in handling class imbalance problems in the case of big data. The newly proposed approach hybrid sampling algorithm (HSA) is assessed using three popular classification algorithms namely, support vector machine, decision tree and k-nearest neighbor based on balanced accuracy and elapsed time. The results obtained from the experiment are considered promising with an efficiency gain of 42% in comparison to the traditional sampling solution synthetic minority oversampling technique (SMOTE). This work proves the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.

Download Full-text