Combining weighted SMOTE with ensemble learning for the class-imbalanced prediction of small business credit risk

Author(s):  
Mohammad Zoynul Abedin ◽  
Chi Guotai ◽  
Petr Hajek ◽  
Tong Zhang

Abstract In small business credit risk assessment, the default and nondefault classes are highly imbalanced. To overcome this problem, this study proposes an extended ensemble approach rooted in the weighted synthetic minority oversampling technique (WSMOTE), called WSMOTE-ensemble. The proposed ensemble classifier hybridizes WSMOTE and Bagging with sampling composite mixtures to guarantee the robustness and variability of the generated synthetic instances and thus minimize the class-skew constraints linked to default and nondefault instances. The original small business dataset used in this study comprises 3111 records from a Chinese commercial bank. By implementing a thorough experimental study of extensively skewed data-modeling scenarios, a multilevel experimental setting was established for a rare-event domain. Based on the proper evaluation measures, this study finds that the random forest classifier used in the WSMOTE-ensemble model provides a good trade-off between the performance on the default class and that on the nondefault class. The ensemble solution improved the accuracy of the minority class by 15.16% in comparison with its competitors. This study also shows that sampling methods outperform nonsampling algorithms. With these contributions, this study fills a noteworthy knowledge gap and adds several unique insights regarding the prediction of small business credit risk.
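For illustration, here is a minimal sketch of a weighted-SMOTE resampling step feeding bagged random forests. The neighbor-based seed weighting, parameter values, and helper names are assumptions made for this sketch, not the authors' exact WSMOTE-ensemble; minority samples are assumed to be labeled 1.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def weighted_smote(X_min, X_maj, n_new, k=5, seed=0):
    """Draw seeds with probability proportional to the share of majority
    neighbors, then interpolate toward a random minority neighbor."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.r_[np.zeros(len(X_min)), np.ones(len(X_maj))]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_all).kneighbors(X_min)
    w = is_maj[idx[:, 1:]].mean(axis=1)          # borderline seeds weigh more
    w = w / w.sum() if w.sum() > 0 else np.full(len(X_min), 1 / len(X_min))
    _, idx_min = NearestNeighbors(
        n_neighbors=min(k + 1, len(X_min))).fit(X_min).kneighbors(X_min)
    synth = []
    for s in rng.choice(len(X_min), size=n_new, p=w):
        nbr = X_min[rng.choice(idx_min[s][1:])]  # random minority neighbor
        synth.append(X_min[s] + rng.random() * (nbr - X_min[s]))
    return np.asarray(synth)

# Bagged random forests trained on the rebalanced data (minority labeled 1):
# X_syn = weighted_smote(X[y == 1], X[y == 0], n_new=(y == 0).sum() - (y == 1).sum())
# X_bal, y_bal = np.vstack([X, X_syn]), np.r_[y, np.ones(len(X_syn))]
# clf = BaggingClassifier(RandomForestClassifier(n_estimators=100),
#                         n_estimators=10).fit(X_bal, y_bal)
```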

Author(s):  
Dennis Bams ◽  
Magdalena Pisa ◽  
Christian C. P. Wolff

2009 ◽  
Vol 17 (3) ◽  
pp. 275-306 ◽  
Author(s):  
Salvador García ◽  
Francisco Herrera

Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed to treat this problem, such as modifying methods or applying a preprocessing stage. Within preprocessing focused on balancing data, two tendencies exist: reducing the set of examples (undersampling) or replicating minority class examples (oversampling). Undersampling with imbalanced datasets can be considered a prototype selection procedure whose purpose is to balance the dataset so as to achieve a high classification rate while avoiding bias toward majority class examples. Evolutionary algorithms have been used for classical prototype selection with good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods, called evolutionary undersampling, that take the nature of the problem into consideration and use different fitness functions to achieve a good trade-off between class-distribution balance and performance. The study includes a taxonomy of the approaches and an overall comparison among our models and state-of-the-art undersampling methods. The results have been contrasted using nonparametric statistical procedures and show that evolutionary undersampling outperforms the nonevolutionary models as the degree of imbalance increases.
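A toy sketch of the evolutionary undersampling idea follows: a genetic algorithm evolves a boolean mask over the majority-class examples, and the fitness is the G-mean of a 1-NN classifier trained on the retained subset. The GA operators and the G-mean fitness are simplified stand-ins for the paper's variants; minority samples are assumed labeled 1, majority 0.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def g_mean_fitness(mask, X_maj, X_min, X_val, y_val):
    """G-mean of a 1-NN classifier trained on the retained subset."""
    if mask.sum() == 0:
        return 0.0
    X_tr = np.vstack([X_maj[mask], X_min])
    y_tr = np.r_[np.zeros(mask.sum()), np.ones(len(X_min))]
    pred = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).predict(X_val)
    sens = recall_score(y_val, pred, pos_label=1)
    spec = recall_score(y_val, pred, pos_label=0)
    return np.sqrt(sens * spec)

def evolve(X_maj, X_min, X_val, y_val, pop=20, gens=30, p_mut=0.02):
    # Start near a balanced selection rate, then evolve the masks.
    P = rng.random((pop, len(X_maj))) < len(X_min) / len(X_maj)
    for _ in range(gens):
        f = np.array([g_mean_fitness(m, X_maj, X_min, X_val, y_val) for m in P])
        parents = P[np.argsort(-f)[: pop // 2]]            # truncation selection
        cuts = rng.integers(1, len(X_maj), size=pop // 2)  # one-point crossover
        kids = np.array([np.r_[a[:c], b[c:]] for a, b, c in
                         zip(parents, np.roll(parents, 1, axis=0), cuts)])
        kids ^= rng.random(kids.shape) < p_mut             # bit-flip mutation
        P = np.vstack([parents, kids])
    f = np.array([g_mean_fitness(m, X_maj, X_min, X_val, y_val) for m in P])
    return P[np.argmax(f)]   # boolean mask over the majority-class examples
```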


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Liping Chen ◽  
Jiabao Jiang ◽  
Yong Zhang

The classical classifiers are ineffective in dealing with the classification of imbalanced big datasets. Resampling the datasets to balance the sample distribution before training the classifier is one of the most popular approaches to this problem. An effective and simple hybrid sampling method based on data partition (HSDP) is proposed in this paper. First, all data samples are partitioned into different data regions. Then, the samples in the noisy minority region are removed, and the samples in the boundary minority region are selected as oversampling seeds to generate synthetic samples. Finally, a weighted oversampling process is conducted, with synthetic samples generated in the same cluster as their oversampling seed. The weight of each selected minority class sample is computed as the ratio between the proportion of the majority class among the neighbors of that sample and the sum of these proportions over all selected samples. Generating synthetic samples in the same cluster as the oversampling seed guarantees that the new synthetic samples lie inside the minority class area. Experiments conducted on eight datasets show that the proposed method, HSDP, is better than or comparable with typical sampling methods in terms of F-measure and G-mean.
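The weight computation just described can be sketched directly. In the sketch below, the partition into noise and boundary regions is a simplified neighbor-based stand-in for the paper's data partitioning, and k is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hsdp_weights(X, y, minority=1, k=5):
    """Weight boundary minority seeds by the majority share among their
    k nearest neighbors, normalized over all selected seeds."""
    X_min = X[y == minority]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    maj_share = (y[idx[:, 1:]] != minority).mean(axis=1)  # per-seed proportion
    noise = maj_share == 1.0              # surrounded by majority: treat as noise
    boundary = (maj_share > 0) & ~noise   # mixed neighborhood: oversampling seed
    w = np.zeros(len(X_min))
    if boundary.any():
        w[boundary] = maj_share[boundary] / maj_share[boundary].sum()
    return w, noise, boundary   # w sums to 1 over the boundary seeds
```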


2021 ◽  
Vol 880 (1) ◽  
pp. 012048
Author(s):  
Ajiwasesa Harumeka ◽  
Santi Wulan Purnami ◽  
Santi Puteri Rahayu

Abstract Logistic regression is a popular and powerful classification method. Adding ridge regularization and optimizing with a combination of linear conjugate gradient and IRLS, known as Truncated Regularized Iteratively Re-weighted Least Squares (TR-IRLS), can outperform the Support Vector Machine (SVM) in processing speed, especially on large data, while retaining competitive accuracy. However, neither SVM nor TR-IRLS performs well on unbalanced data. The Fuzzy Support Vector Machine (FSVM) is an SVM extension for unbalanced data that assigns a fuzzy membership to each observation, so that observations in the minority class carry more weight than those in the majority class. Meanwhile, TR-IRLS was developed into Rare Event Weighted Logistic Regression (RE-WLR) by adding weights and bias correction to logistic regression. The weighting in RE-WLR depends on an undersampling scheme, which allows "information loss". FSVM and RE-WLR share a limitation: their weights are based only on class membership (minority or majority). The Entropy-Based Fuzzy Support Vector Machine (EFSVM) addresses this weakness of FSVM by considering the class certainty of each observation's neighborhood. As a result, EFSVM improves SVM performance on unbalanced data, even beating FSVM. For this reason, we apply entropy-based fuzzy (EF) weighting to the TR-IRLS algorithm to classify large, unbalanced data. The proposed method is called Entropy-Based Fuzzy Weighted Logistic Regression (EF-WLR). This research reviews EF-WLR for unbalanced data classification.
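A simplified sketch of entropy-based weighting feeding a weighted logistic regression is given below. scikit-learn's solver stands in for TR-IRLS here, and the membership formula (a linear decay in normalized neighborhood entropy with an assumed beta) condenses the EFSVM grouping scheme into one expression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def entropy_weights(X, y, minority=1, k=7, beta=0.5):
    """Minority samples keep weight 1; majority samples are down-weighted
    in proportion to the entropy of their local class distribution."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    p = (y[idx[:, 1:]] == minority).mean(axis=1)        # local P(minority)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # binary entropy
    H = np.nan_to_num(H)                                # 0 * log 0 -> 0
    w = np.ones(len(X))
    maj = y != minority
    w[maj] = 1.0 - beta * H[maj] / np.log(2)            # uncertain -> lower weight
    return w

# Weighted fit (scikit-learn's solver standing in for TR-IRLS):
# w = entropy_weights(X, y)
# clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```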


2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Yongqing Zhang ◽  
Shaojie Qiao ◽  
Rongzhao Lu ◽  
Nan Han ◽  
Dingxiang Liu ◽  
...  

Abstract Background Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, the data imbalance causes the performance on the minority class of positive samples to be underestimated. Therefore, balancing bioinformatics data becomes a very challenging and difficult problem. Results In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied to handle the case in which negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond the Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and treat them as positive samples. In addition, MMPCC uses an incremental searching technique to select optimal pseudo-negative samples and reduce the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and low redundancy with respect to negative ones. Conclusions To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, we clearly observe that MMPCC performs better than other sampling methods in terms of sensitivity, specificity, accuracy, and the Matthews correlation coefficient. This reveals that pseudo-negative samples are particularly helpful in solving the imbalanced dataset problem. Moreover, the gain in sensitivity on the minority samples with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
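In the spirit of MMPCC, here is a hedged sketch of greedy max-relevance min-redundancy selection of pseudo-negative samples using Pearson correlation. The sample-wise correlation, the equal weighting of relevance and redundancy, and the greedy loop are assumptions for this sketch, not the paper's exact incremental search.

```python
import numpy as np

def abs_corr(a, B):
    """|Pearson correlation| between vector a and each row of matrix B."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    B = (B - B.mean(axis=1, keepdims=True)) / (B.std(axis=1, keepdims=True) + 1e-12)
    return np.abs(B @ a) / len(a)

def select_pseudo_negatives(X_pos, X_neg, n_select):
    """Greedily pick negatives that correlate with positives (relevance)
    but not with already-chosen pseudo-negatives (redundancy)."""
    chosen, remaining = [], list(range(len(X_neg)))
    for _ in range(n_select):
        best, best_score = None, -np.inf
        for i in remaining:
            rel = abs_corr(X_neg[i], X_pos).mean()
            red = abs_corr(X_neg[i], X_neg[chosen]).mean() if chosen else 0.0
            if rel - red > best_score:
                best, best_score = i, rel - red
        chosen.append(best)
        remaining.remove(best)
    return np.array(chosen)   # indices of negatives to relabel as positive
```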


2020 ◽  
Vol 34 (08) ◽  
pp. 13396-13401
Author(s):  
Wei Wang ◽  
Christopher Lesner ◽  
Alexander Ran ◽  
Marko Rukonic ◽  
Jason Xue ◽  
...  

Machine learning applied to financial transaction records can predict how likely a small business is to repay a loan. For this purpose, we compared a traditional scorecard credit risk model against various machine learning models and found that XGBoost with monotonic constraints outperformed the scorecard model by 7% in the K-S statistic. To deploy such a machine learning model in production for loan application risk scoring, it must comply with lending industry regulations that require lenders to provide understandable and specific reasons for credit decisions. Thus, we also developed a loan decision explanation technique based on the ideas of WoE and SHAP. Our research was carried out using a historical dataset of tens of thousands of loans and millions of associated financial transactions. The credit risk scoring model based on XGBoost with monotonic constraints and SHAP explanations described in this paper has been deployed by QuickBooks Capital to assess incoming loan applications since July 2019.
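The modeling setup the abstract describes can be sketched with standard APIs. The synthetic data, feature count, constraint signs, and hyperparameters below are placeholders for illustration, not the production QuickBooks Capital model.

```python
import numpy as np
import shap
import xgboost as xgb

# Synthetic placeholder data: 3 hypothetical transaction features, ~10% defaults.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))
y_train = (rng.random(1000) < 0.1).astype(int)

# One sign per feature, in column order: +1 means predicted risk may only
# rise with the feature, -1 only fall, 0 unconstrained. Signs are assumed.
model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    monotone_constraints=(1, -1, 0),
)
model.fit(X_train, y_train)

# SHAP reason codes: per-feature contributions for each scored application.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
```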


Author(s):  
Yue Qiu ◽  
Chuansheng Wang

Simulation is widely used to estimate losses due to default and other credit events in financial portfolios. The accurate measurement of credit risk can be modeled as a rare-event simulation problem. While Monte Carlo simulation is time-consuming for rare events, importance sampling techniques can effectively reduce the simulation time and thus improve simulation efficiency. This chapter proposes a new importance sampling method for estimating rare-event probabilities in simulation models. The optimal importance sampling distributions are derived in terms of expectation in the normal copula model developed in finance. In the normal copula model, dependency is introduced through a set of common factors shared by multiple obligors. The resulting dependence between the defaults of multiple obligors complicates simulation. The simulated results demonstrate the effectiveness of the proposed approach in solving the portfolio credit risk problem.
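As a concrete illustration, the sketch below applies mean-shift importance sampling to the common factor of a one-factor normal copula and reweights by the likelihood ratio exp(-mu*Z + mu^2/2). The shift mu, correlation, and default probability are illustrative choices; the chapter's optimal importance sampling distribution is derived differently.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_obligors, rho, pd_i = 100, 0.3, 0.01   # illustrative portfolio settings
thresh = norm.ppf(pd_i)                  # obligor defaults if latent < thresh
loss_level = 20                          # rare event: more than 20 defaults

def is_estimate(n_paths=10_000, mu=-2.0):
    """Shift the common factor to Z ~ N(mu, 1), driving the default region,
    and correct with the likelihood ratio f(Z)/g(Z) = exp(-mu*Z + mu^2/2)."""
    Z = rng.normal(mu, 1.0, n_paths)
    lr = np.exp(-mu * Z + 0.5 * mu**2)
    eps = rng.normal(size=(n_paths, n_obligors))
    latent = rho * Z[:, None] + np.sqrt(1 - rho**2) * eps
    n_defaults = (latent < thresh).sum(axis=1)
    return np.mean((n_defaults > loss_level) * lr)   # unbiased IS estimator

print(is_estimate())
```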

