A Novel Oversampling Technique to Solve Class Imbalance Problem: A Case Study of Students’ Grades Evaluation

Abstract The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10 % and 24 % through the case study.

Download Full-text

Applying separately cost-sensitive learning and Fisher's discriminant analysis to address the class imbalance problem: A case study involving a virtual gas pipeline SCADA system

International Journal of Critical Infrastructure Protection ◽

10.1016/j.ijcip.2020.100357 ◽

2020 ◽

Vol 29 ◽

pp. 100357

Author(s):

Abouzar Choubineh ◽

David A. Wood ◽

Zahak Choubineh

Keyword(s):

Discriminant Analysis ◽

Class Imbalance ◽

Gas Pipeline ◽

Class Imbalance Problem ◽

Cost Sensitive Learning ◽

Imbalance Problem ◽

Scada System ◽

Fisher’S Discriminant Analysis

Download Full-text

Variance Ranking for Multi-Classed Imbalanced Datasets: A Case Study of One-Versus-All

Symmetry ◽

10.3390/sym11121504 ◽

2019 ◽

Vol 11 (12) ◽

pp. 1504 ◽

Cited By ~ 1

Author(s):

Solomon H. Ebenuwa ◽

Mhd Saeed Sharif ◽

Ameer Al-Nemrat ◽

Ali H. Al-Bayatti ◽

Nasser Alalwan ◽

...

Keyword(s):

Class Imbalance ◽

Proof Of Concept ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Predictive Algorithms ◽

World Class ◽

Imbalanced Classes ◽

Work Done ◽

Ranking Technique

Imbalanced classes in multi-classed datasets is one of the most salient hindrances to the accuracy and dependable results of predictive modeling. In predictions, there are always majority and minority classes, and in most cases it is difficult to capture the members of item belonging to the minority classes. This anomaly is traceable to the designs of the predictive algorithms because most algorithms do not factor in the unequal numbers of classes into their designs and implementations. The accuracy of most modeling processes is subjective to the ever-present consequences of the imbalanced classes. This paper employs the variance ranking technique to deal with the real-world class imbalance problem. We augmented this technique using one-versus-all re-coding of the multi-classed datasets. The proof-of-concept experimentation shows that our technique performs better when compared with the previous work done on capturing small class members in multi-classed datasets.

Download Full-text

Comparing Oversampling Techniques to Handle the Class Imbalance Problem: A Customer Churn Prediction Case Study

IEEE Access ◽

10.1109/access.2016.2619719 ◽

2016 ◽

Vol 4 ◽

pp. 7940-7957 ◽

Cited By ~ 71

Author(s):

Adnan Amin ◽

Sajid Anwar ◽

Awais Adnan ◽

Muhammad Nawaz ◽

Newton Howard ◽

...

Keyword(s):

Class Imbalance ◽

Churn Prediction ◽

Class Imbalance Problem ◽

Customer Churn ◽

Imbalance Problem ◽

Customer Churn Prediction

Download Full-text

A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction

New Contributions in Information Systems and Technologies - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-319-16486-1_22 ◽

2015 ◽

pp. 215-225 ◽

Cited By ~ 5

Author(s):

Adnan Amin ◽

Faisal Rahim ◽

Imtiaz Ali ◽

Changez Khan ◽

Sajid Anwar

Keyword(s):

Class Imbalance ◽

Churn Prediction ◽

Class Imbalance Problem ◽

Customer Churn ◽

Imbalance Problem ◽

Customer Churn Prediction

Download Full-text

Detection of Myocardial Infarction Using ECG and Multi-Scale Feature Concatenate

Sensors ◽

10.3390/s21051906 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1906

Author(s):

Jia-Zheng Jian ◽

Tzong-Rong Ger ◽

Han-Hua Lai ◽

Chi-Ming Ku ◽

Chiung-An Chen ◽

...

Keyword(s):

Myocardial Infarction ◽

Network Structure ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Multi Scale ◽

Imbalance Problem ◽

Average Accuracy ◽

Significant Difference ◽

Electrocardiogram Ecg

Diverse computer-aided diagnosis systems based on convolutional neural networks were applied to automate the detection of myocardial infarction (MI) found in electrocardiogram (ECG) for early diagnosis and prevention. However, issues, particularly overfitting and underfitting, were not being taken into account. In other words, it is unclear whether the network structure is too simple or complex. Toward this end, the proposed models were developed by starting with the simplest structure: a multi-lead features-concatenate narrow network (N-Net) in which only two convolutional layers were included in each lead branch. Additionally, multi-scale features-concatenate networks (MSN-Net) were also implemented where larger features were being extracted through pooling the signals. The best structure was obtained via tuning both the number of filters in the convolutional layers and the number of inputting signal scales. As a result, the N-Net reached a 95.76% accuracy in the MI detection task, whereas the MSN-Net reached an accuracy of 61.82% in the MI locating task. Both networks give a higher average accuracy and a significant difference of p < 0.001 evaluated by the U test compared with the state-of-the-art. The models are also smaller in size thus are suitable to fit in wearable devices for offline monitoring. In conclusion, testing throughout the simple and complex network structure is indispensable. However, the way of dealing with the class imbalance problem and the quality of the extracted features are yet to be discussed.

Download Full-text

A Novel Focal Phi Loss for Power Line Segmentation with Auxiliary Classifier U-Net

Sensors ◽

10.3390/s21082803 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2803

Author(s):

Rabeea Jaffari ◽

Manzoor Ahmed Hashmani ◽

Constantino Carlos Reyes-Aldasoro

Keyword(s):

Loss Function ◽

Class Imbalance ◽

Power Line ◽

Aerial Images ◽

Class Imbalance Problem ◽

Trade Off ◽

Urban Scenes ◽

Imbalance Problem ◽

A Minor ◽

Evaluation Parameters

The segmentation of power lines (PLs) from aerial images is a crucial task for the safe navigation of unmanned aerial vehicles (UAVs) operating at low altitudes. Despite the advances in deep learning-based approaches for PL segmentation, these models are still vulnerable to the class imbalance present in the data. The PLs occupy only a minimal portion (1–5%) of the aerial images as compared to the background region (95–99%). Generally, this class imbalance problem is addressed via the use of PL-specific detectors in conjunction with the popular class balanced cross entropy (BBCE) loss function. However, these PL-specific detectors do not work outside their application areas and a BBCE loss requires hyperparameter tuning for class-wise weights, which is not trivial. Moreover, the BBCE loss results in low dice scores and precision values and thus, fails to achieve an optimal trade-off between dice scores, model accuracy, and precision–recall values. In this work, we propose a generalized focal loss function based on the Matthews correlation coefficient (MCC) or the Phi coefficient to address the class imbalance problem in PL segmentation while utilizing a generic deep segmentation architecture. We evaluate our loss function by improving the vanilla U-Net model with an additional convolutional auxiliary classifier head (ACU-Net) for better learning and faster model convergence. The evaluation of two PL datasets, namely the Mendeley Power Line Dataset and the Power Line Dataset of Urban Scenes (PLDU), where PLs occupy around 1% and 2% of the aerial images area, respectively, reveal that our proposed loss function outperforms the popular BBCE loss by 16% in PL dice scores on both the datasets, 19% in precision and false detection rate (FDR) values for the Mendeley PL dataset and 15% in precision and FDR values for the PLDU with a minor degradation in the accuracy and recall values. Moreover, our proposed ACU-Net outperforms the baseline vanilla U-Net for the characteristic evaluation parameters in the range of 1–10% for both the PL datasets. Thus, our proposed loss function with ACU-Net achieves an optimal trade-off for the characteristic evaluation parameters without any bells and whistles. Our code is available at Github.

Download Full-text