Combining weighted SMOTE with ensemble learning for the class-imbalanced prediction of small business credit risk

Author(s):  
Mohammad Zoynul Abedin ◽  
Chi Guotai ◽  
Petr Hajek ◽  
Tong Zhang

Abstract In small business credit risk assessment, the default and nondefault classes are highly imbalanced. To overcome this problem, this study proposes an extended ensemble approach rooted in the weighted synthetic minority oversampling technique (WSMOTE), called WSMOTE-ensemble. The proposed ensemble classifier hybridizes WSMOTE and Bagging with sampling composite mixtures to guarantee the robustness and variability of the generated synthetic instances and thus minimize the class-skew constraints linked to default and nondefault instances. The original small business dataset used in this study comprises 3111 records from a Chinese commercial bank. By implementing a thorough experimental study of extensively skewed data-modeling scenarios, a multilevel experimental setting was established for a rare-event domain. Based on the proper evaluation measures, this study finds that the random forest classifier used in the WSMOTE-ensemble model provides a good trade-off between the performance on the default class and that on the nondefault class. The ensemble solution improved the accuracy of the minority class by 15.16% in comparison with its competitors. This study also shows that sampling methods outperform nonsampling algorithms. With these contributions, this study fills a noteworthy knowledge gap and adds several unique insights regarding the prediction of small business credit risk.
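For illustration, here is a minimal sketch of a weighted-SMOTE resampling step feeding bagged random forests. The neighbor-based seed weighting, parameter values, and helper names are assumptions made for this sketch, not the authors' exact WSMOTE-ensemble; minority samples are assumed to be labeled 1.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def weighted_smote(X_min, X_maj, n_new, k=5, seed=0):
    """Draw seeds with probability proportional to the share of majority
    neighbors, then interpolate toward a random minority neighbor."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(len(X_min)), np.ones(len(X_maj))]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_all).kneighbors(X_min)
    w = is_maj[idx[:, 1:]].mean(axis=1)          # borderline seeds weigh more
    w = w / w.sum() if w.sum() > 0 else np.full(len(X_min), 1 / len(X_min))
    _, idx_min = NearestNeighbors(
        n_neighbors=min(k + 1, len(X_min))).fit(X_min).kneighbors(X_min)
    synth = []
    for s in rng.choice(len(X_min), size=n_new, p=w):
        nbr = X_min[rng.choice(idx_min[s][1:])]  # random minority neighbor
        synth.append(X_min[s] + rng.random() * (nbr - X_min[s]))
    return np.asarray(synth)

# Bagged random forests trained on the rebalanced data (minority labeled 1):
# X_syn = weighted_smote(X[y == 1], X[y == 0], n_new=(y == 0).sum() - (y == 1).sum())
# X_bal, y_bal = np.vstack([X, X_syn]), np.r_[y, np.ones(len(X_syn))]
# clf = BaggingClassifier(RandomForestClassifier(n_estimators=100),
#                         n_estimators=10).fit(X_bal, y_bal)
```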

Author(s):  
Dennis Bams ◽  
Magdalena Pisa ◽  
Christian C. P. Wolff

2009 ◽  
Vol 17 (3) ◽  
pp. 275-306 ◽  
Author(s):  
Salvador García ◽  
Francisco Herrera

Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed to treat this problem, such as modifying methods or applying a preprocessing stage. Within preprocessing focused on balancing data, two tendencies exist: reducing the set of examples (undersampling) or replicating minority class examples (oversampling). Undersampling with imbalanced datasets can be considered a prototype selection procedure whose purpose is to balance the dataset so as to achieve a high classification rate while avoiding bias toward majority class examples. Evolutionary algorithms have been used for classical prototype selection with good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods, called evolutionary undersampling, that take the nature of the problem into consideration and use different fitness functions to achieve a good trade-off between class-distribution balance and performance. The study includes a taxonomy of the approaches and an overall comparison among our models and state-of-the-art undersampling methods. The results have been contrasted using nonparametric statistical procedures and show that evolutionary undersampling outperforms the nonevolutionary models as the degree of imbalance increases.
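A toy sketch of the evolutionary undersampling idea follows: a genetic algorithm evolves a boolean mask over the majority-class examples, and the fitness is the G-mean of a 1-NN classifier trained on the retained subset. The GA operators and the G-mean fitness are simplified stand-ins for the paper's variants; minority samples are assumed labeled 1, majority 0.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def g_mean_fitness(mask, X_maj, X_min, X_val, y_val):
    """G-mean of a 1-NN classifier trained on the retained subset."""
    if mask.sum() == 0:
        return 0.0
    X_tr = np.vstack([X_maj[mask], X_min])
    y_tr = np.r_[np.zeros(mask.sum()), np.ones(len(X_min))]
    pred = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).predict(X_val)
    sens = recall_score(y_val, pred, pos_label=1)
    spec = recall_score(y_val, pred, pos_label=0)
    return np.sqrt(sens * spec)

def evolve(X_maj, X_min, X_val, y_val, pop=20, gens=30, p_mut=0.02):
    # Start near a balanced selection rate, then evolve the masks.
    P = rng.random((pop, len(X_maj))) < len(X_min) / len(X_maj)
    for _ in range(gens):
        f = np.array([g_mean_fitness(m, X_maj, X_min, X_val, y_val) for m in P])
        parents = P[np.argsort(-f)[: pop // 2]]            # truncation selection
        cuts = rng.integers(1, len(X_maj), size=pop // 2)  # one-point crossover
        kids = np.array([np.r_[a[:c], b[c:]] for a, b, c in
                         zip(parents, np.roll(parents, 1, axis=0), cuts)])
        kids ^= rng.random(kids.shape) < p_mut             # bit-flip mutation
        P = np.vstack([parents, kids])
    f = np.array([g_mean_fitness(m, X_maj, X_min, X_val, y_val) for m in P])
    return P[np.argmax(f)]   # boolean mask over the majority-class examples
```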


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Liping Chen ◽  
Jiabao Jiang ◽  
Yong Zhang

The classical classifiers are ineffective in dealing with the classification of imbalanced big datasets. Resampling the datasets to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all data samples are partitioned into different data regions. Then, the samples in the noisy minority region are removed, and the samples in the boundary minority region are selected as oversampling seeds to generate synthetic samples. Finally, a weighted oversampling process is conducted, with synthetic samples generated in the same cluster as their oversampling seed. The weight of each selected minority class sample is computed as the ratio between the proportion of the majority class among the neighbors of that sample and the sum of these proportions over all selected samples. Generating synthetic samples in the same cluster as the oversampling seed guarantees that the new synthetic samples lie inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable with typical sampling methods in terms of F-measure and G-mean.
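The weight computation just described can be sketched directly. In the sketch below, the partition into noise and boundary regions is a simplified neighbor-based stand-in for the paper's data partitioning, and k is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hsdp_weights(X, y, minority=1, k=5):
    """Weight boundary minority seeds by the majority share among their
    k nearest neighbors, normalized over all selected seeds."""
    X_min = X[y == minority]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    maj_share = (y[idx[:, 1:]] != minority).mean(axis=1)  # per-seed proportion
    noise = maj_share == 1.0              # surrounded by majority: treat as noise
    boundary = (maj_share > 0) & ~noise   # mixed neighborhood: oversampling seed
    w = np.zeros(len(X_min))
    if boundary.any():
        w[boundary] = maj_share[boundary] / maj_share[boundary].sum()
    return w, noise, boundary   # w sums to 1 over the boundary seeds
```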


2021 ◽  
Vol 880 (1) ◽  
pp. 012048
Author(s):  
Ajiwasesa Harumeka ◽  
Santi Wulan Purnami ◽  
Santi Puteri Rahayu

Abstract Logistic regression is a popular and powerful classification method. Adding ridge regularization and optimizing with a combination of linear conjugate gradient and IRLS, known as Truncated Regularized Iteratively Re-weighted Least Squares (TR-IRLS), can outperform the Support Vector Machine (SVM) in processing speed, especially on large data, while retaining competitive accuracy. However, neither SVM nor TR-IRLS performs well on unbalanced data. The Fuzzy Support Vector Machine (FSVM) is an SVM extension for unbalanced data that assigns a fuzzy membership to each observation, so that observations in the minority class carry more weight than those in the majority class. Meanwhile, TR-IRLS was developed into Rare Event Weighted Logistic Regression (RE-WLR) by adding weights and bias correction to logistic regression. The weighting in RE-WLR depends on an undersampling scheme, which allows "information loss". FSVM and RE-WLR share a limitation: their weights are based only on class membership (minority or majority). The Entropy-Based Fuzzy Support Vector Machine (EFSVM) addresses this weakness of FSVM by considering the class certainty of each observation's neighborhood. As a result, EFSVM improves SVM performance on unbalanced data, even beating FSVM. For this reason, we apply entropy-based fuzzy (EF) weighting to the TR-IRLS algorithm to classify large, unbalanced data. The proposed method is called Entropy-Based Fuzzy Weighted Logistic Regression (EF-WLR). This research reviews EF-WLR for unbalanced data classification.
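A simplified sketch of entropy-based weighting feeding a weighted logistic regression is given below. scikit-learn's solver stands in for TR-IRLS here, and the membership formula (a linear decay in normalized neighborhood entropy with an assumed beta) condenses the EFSVM grouping scheme into one expression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def entropy_weights(X, y, minority=1, k=7, beta=0.5):
    """Minority samples keep weight 1; majority samples are down-weighted
    in proportion to the entropy of their local class distribution."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    p = (y[idx[:, 1:]] == minority).mean(axis=1)        # local P(minority)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # binary entropy
    H = np.nan_to_num(H)                                # 0 * log 0 -> 0
    w = np.ones(len(X))
    maj = y != minority
    w[maj] = 1.0 - beta * H[maj] / np.log(2)            # uncertain -> lower weight
    return w

# Weighted fit (scikit-learn's solver standing in for TR-IRLS):
# w = entropy_weights(X, y)
# clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```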


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Yongqing Zhang ◽  
Shaojie Qiao ◽  
Rongzhao Lu ◽  
Nan Han ◽  
Dingxiang Liu ◽  
...  

Abstract Background Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, the data imbalance causes the performance on the minority class of positive samples to be underestimated. Therefore, balancing bioinformatics data becomes a very challenging and difficult problem. Results In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied to handle the case in which negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond the Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and treat them as positive samples. In addition, MMPCC uses an incremental searching technique to select optimal pseudo-negative samples and reduce the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and low redundancy with respect to negative ones. Conclusions To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, we clearly observe that MMPCC performs better than other sampling methods in terms of sensitivity, specificity, accuracy, and the Matthews correlation coefficient. This reveals that pseudo-negative samples are particularly helpful in solving the imbalanced dataset problem. Moreover, the gain in sensitivity on the minority samples with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
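In the spirit of MMPCC, here is a hedged sketch of greedy max-relevance min-redundancy selection of pseudo-negative samples using Pearson correlation. The sample-wise correlation, the equal weighting of relevance and redundancy, and the greedy loop are assumptions for this sketch, not the paper's exact incremental search.

```python
import numpy as np

def abs_corr(a, B):
    """|Pearson correlation| between vector a and each row of matrix B."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    B = (B - B.mean(axis=1, keepdims=True)) / (B.std(axis=1, keepdims=True) + 1e-12)
    return np.abs(B @ a) / len(a)

def select_pseudo_negatives(X_pos, X_neg, n_select):
    """Greedily pick negatives that correlate with positives (relevance)
    but not with already-chosen pseudo-negatives (redundancy)."""
    chosen, remaining = [], list(range(len(X_neg)))
    for _ in range(n_select):
        best, best_score = None, -np.inf
        for i in remaining:
            rel = abs_corr(X_neg[i], X_pos).mean()
            red = abs_corr(X_neg[i], X_neg[chosen]).mean() if chosen else 0.0
            if rel - red > best_score:
                best, best_score = i, rel - red
        chosen.append(best)
        remaining.remove(best)
    return np.array(chosen)   # indices of negatives to relabel as positive
```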


2020 ◽  
Vol 34 (08) ◽  
pp. 13396-13401
Author(s):  
Wei Wang ◽  
Christopher Lesner ◽  
Alexander Ran ◽  
Marko Rukonic ◽  
Jason Xue ◽  
...  

Machine learning applied to financial transaction records can predict how likely a small business is to repay a loan. For this purpose, we compared a traditional scorecard credit risk model against various machine learning models and found that XGBoost with monotonic constraints outperformed the scorecard model by 7% in the K-S statistic. To deploy such a machine learning model in production for loan application risk scoring, it must comply with lending industry regulations that require lenders to provide understandable and specific reasons for credit decisions. Thus, we also developed a loan decision explanation technique based on the ideas of WoE and SHAP. Our research was carried out using a historical dataset of tens of thousands of loans and millions of associated financial transactions. The credit risk scoring model based on XGBoost with monotonic constraints and SHAP explanations described in this paper has been deployed by QuickBooks Capital to assess incoming loan applications since July 2019.
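The modeling setup the abstract describes can be sketched with standard APIs. The synthetic data, feature count, constraint signs, and hyperparameters below are placeholders for illustration, not the production QuickBooks Capital model.

```python
import numpy as np
import shap
import xgboost as xgb

# Synthetic placeholder data: 3 hypothetical transaction features, ~10% defaults.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))
y_train = (rng.random(1000) < 0.1).astype(int)

# One sign per feature, in column order: +1 means predicted risk may only
# rise with the feature, -1 only fall, 0 unconstrained. Signs are assumed.
model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    monotone_constraints=(1, -1, 0),
)
model.fit(X_train, y_train)

# SHAP reason codes: per-feature contributions for each scored application.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
```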


Author(s):  
Yue Qiu ◽  
Chuansheng Wang

Simulation is widely used to estimate losses due to default and other credit events in financial portfolios. The accurate measurement of credit risk can be modeled as a rare-event simulation problem. While Monte Carlo simulation is time-consuming for rare events, importance sampling techniques can effectively reduce the simulation time and thus improve simulation efficiency. This chapter proposes a new importance sampling method for estimating rare-event probabilities in simulation models. The optimal importance sampling distributions are derived in terms of expectation in the normal copula model developed in finance. In the normal copula model, dependency is introduced through a set of common factors shared by multiple obligors. The resulting dependence between the defaults of multiple obligors complicates simulation. The simulated results demonstrate the effectiveness of the proposed approach in solving the portfolio credit risk problem.
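As a concrete illustration, the sketch below applies mean-shift importance sampling to the common factor of a one-factor normal copula and reweights by the likelihood ratio exp(-mu*Z + mu^2/2). The shift mu, correlation, and default probability are illustrative choices; the chapter's optimal importance sampling distribution is derived differently.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_obligors, rho, pd_i = 100, 0.3, 0.01   # illustrative portfolio settings
thresh = norm.ppf(pd_i)                  # obligor defaults if latent < thresh
loss_level = 20                          # rare event: more than 20 defaults

def is_estimate(n_paths=10_000, mu=-2.0):
    """Shift the common factor to Z ~ N(mu, 1), driving the default region,
    and correct with the likelihood ratio f(Z)/g(Z) = exp(-mu*Z + mu^2/2)."""
    Z = rng.normal(mu, 1.0, n_paths)
    lr = np.exp(-mu * Z + 0.5 * mu**2)
    eps = rng.normal(size=(n_paths, n_obligors))
    latent = rho * Z[:, None] + np.sqrt(1 - rho**2) * eps
    n_defaults = (latent < thresh).sum(axis=1)
    return np.mean((n_defaults > loss_level) * lr)   # unbiased IS estimator

print(is_estimate())
```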

