A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction

Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services offered by a provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when the number of samples in the positive class differs greatly from the number in the negative class. It is one of the major obstacles in CCP because it degrades classification performance. Data sampling techniques (DSTs) help to resolve the CIP to some extent. Methods: In this paper, we examine the effect of DSTs on algorithmic fairness, i.e., we investigate whether the results discriminate between male and female groups, and we compare the results before and after applying DSTs. Three real-world datasets with different imbalance ratios were prepared, and four widely used DSTs were applied to them. Six popular classification techniques were used in the classification process. Both classifier performance and algorithmic fairness were evaluated with established metrics. Results: The Random Forest classifier outperformed the other classifiers on all three datasets, and the SMOTE and ADASYN techniques caused more discrimination in the female group. The rate of unintentional discrimination appears to be higher in the original data of the extremely imbalanced datasets under the following classifiers: Logistic Regression, LightGBM, and XGBoost. Conclusions: Algorithmic fairness has become a broadly studied area in recent years, yet there has been very little systematic study of the effect of DSTs on algorithmic fairness. This study presents important findings to further the use of algorithmic fairness in CCP research.
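The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the abstract does not name its fairness metrics, so the gap in true positive rate between groups (often called the equal-opportunity difference) stands in for them, and plain random oversampling stands in for the DSTs, since SMOTE and ADASYN synthesize new samples rather than duplicating existing ones. The function names, the `gender` and `churn` keys, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, label_key="churn", seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Top up this class with random duplicates (no-op for the largest class).
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def true_positive_rate(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted as 1."""
    hits = [pred for truth, pred in zip(y_true, y_pred) if truth == 1]
    return sum(hits) / len(hits) if hits else float("nan")


def tpr_gap(y_true, y_pred, groups, group_a="female", group_b="male"):
    """TPR(group_a) - TPR(group_b); 0 means both groups are treated alike on this metric."""
    def tpr_for(name):
        idx = [i for i, g in enumerate(groups) if g == name]
        return true_positive_rate([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return tpr_for(group_a) - tpr_for(group_b)


# Toy data (hypothetical): 2 churners and 8 non-churners.
rows = [{"gender": "female" if i % 2 else "male", "churn": 1 if i < 2 else 0}
        for i in range(10)]
balanced = random_oversample(rows)
class_counts = Counter(row["churn"] for row in balanced)

# Toy predictions (hypothetical): one of two female churners is missed.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1]
groups = ["female", "female", "male", "male", "female", "male"]
gap = tpr_gap(y_true, y_pred, groups)  # 0.5 - 1.0 = -0.5
```

In a study of this kind, the gap would be computed once for a classifier trained on the original data and once for a classifier trained on the resampled data, and the two values compared.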

Handling Class Imbalance in Customer Churn Prediction in Telecom Sector Using Sampling Techniques, Bagging and Boosting Trees

2020 10th International Conference on Computer and Knowledge Engineering (ICCKE), 2020. DOI: 10.1109/iccke50421.2020.9303698

Author(s): Sajjad Shumaly, Pedram Neysaryan, Yanhui Guo

Keyword(s): Class Imbalance, Sampling Techniques, Churn Prediction, Customer Churn, Customer Churn Prediction, Telecom Sector

Customer Churn Prediction and Promotion Models in the Telecom Sector: A Case Study

Lecture Notes in Networks and Systems - Intelligent Systems and Applications, 2021, pp. 276-286. DOI: 10.1007/978-3-030-82196-8_21

Author(s): Ulku F. Gursoy, Enes M. Yildiz, M. Ergun Okay, Mehmet S. Aktas

Keyword(s): Churn Prediction, Customer Churn, Customer Churn Prediction, Telecom Sector

Convolutional neural networks based focal loss for class imbalance problem: a case study of canine red blood cells morphology classification

Journal of Ambient Intelligence and Humanized Computing, 2020. DOI: 10.1007/s12652-020-01773-x. Cited by ~6

Author(s): Kitsuchart Pasupa, Supawit Vatathanavaro, Suchat Tungjitnob

Keyword(s): Neural Networks, Red Blood Cells, Convolutional Neural Networks, Blood Cells, Class Imbalance, Class Imbalance Problem, Imbalance Problem

A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Applied Computer Systems, 2019, Vol 24 (2), pp. 104-110. DOI: 10.2478/acss-2019-0013

Author(s): Duygu Sinanc Terzi, Seref Sagiroglu

Keyword(s): Big Data, Class Imbalance, Area Under The Curve, Data Sets, Class Imbalance Problem, Imbalance Problem, The Common, Public Datasets, Distributed Cluster

Abstract: The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims to modify the existing dataset to increase classification success. Within the study, DIBID was implemented on public datasets under two strategies. The first strategy was designed to show the success of the model on datasets with different imbalance ratios. The second strategy was designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed the other imbalanced big data solutions in the literature and increased area under the curve values by between 10% and 24% in the case study.
