Logistic Regression Model for Loan Prediction: A Machine Learning Approach

Author(s):  
Richa Manglani ◽  
Anuja Bokhare
Author(s):  
Dominic M. Obare ◽  
Gladys G. Njoroge ◽  
Moses M. Muraya

Financial institutions hold large amounts of data on their borrowers, which can be used to predict the probability that a borrower will default on a loan. Models that have been used to predict individual loan defaults include linear discriminant analysis models and extreme value theory models. These models are parametric in nature, since they assume that the response being investigated takes a particular functional form. However, the functional form used to estimate the response may be very different from its actual functional form. The purpose of this research was to analyze individual loan defaults in Kenya using the logistic regression model. The data used in this study were obtained from Equity Bank of Kenya for the period from 2006 to 2016. A random sample of 1000 loan applicants whose loans had been approved by Equity Bank of Kenya during this period was obtained. The data covered credit history, purpose of the loan, loan amount, nature of the savings account, employment status, sex of the applicant, age of the applicant, security used when acquiring the loan, and the area of residence of the applicant (rural or urban). This study employed a quantitative research design, treating individual loan defaults as group characteristics of borrowers. The data were pre-processed by seeding in R and then split into a training dataset and a test dataset. The training data were used to train the logistic regression model using a supervised machine learning approach. R statistical software was used for the analysis of the data. The test dataset was used to cross-validate the developed logistic model, which was later used for prediction of individual loan defaults. This study focused on the analysis of individual loan defaults in Kenya using the logistic regression model in machine learning.
The logistic regression model predicted 303 defaults and 122 non-defaults in the training dataset, with 56 and 69 loans misclassified. The model had an accuracy of 0.7727 on the training data and 0.7333 on the test data, and a precision of 0.8440 and 0.8244 on the training and test data, respectively. The performance of the model on both the training and test data was illustrated using a plot of training errors and test errors against sample size on the same axes. The plot showed that the performance of the model increases with sample size. The study recommended the use of logistic regression in conjunction with a supervised machine learning approach for loan default prediction in financial institutions, and further research on ensemble methods for loan default prediction in order to increase prediction accuracy.
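The training-set figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that the 56 misclassified loans are false positives (non-defaults predicted as defaults) and the 69 are false negatives; the abstract does not say which is which, but this assignment reproduces the reported precision. The variable names are illustrative.

```python
# Confusion-matrix counts reported for the training data.
# Assumption: 56 = false positives, 69 = false negatives
# (the abstract does not state which count is which).
true_pos = 303   # defaults predicted as defaults
true_neg = 122   # non-defaults predicted as non-defaults
false_pos = 56
false_neg = 69

total = true_pos + true_neg + false_pos + false_neg
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)

print(f"n = {total}")
print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
```

Under that assumption the counts sum to 550 loans and give an accuracy of 0.7727 and a precision of 0.8440, matching the values in the abstract.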


2019 ◽  
Author(s):  
Oskar Flygare ◽  
Jesper Enander ◽  
Erik Andersson ◽  
Brjánn Ljótsson ◽  
Volen Z Ivanov ◽  
...  

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test if it is possible to reliably predict remission from BDD in a sample of 88 individuals that had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower in subsequent follow-ups (68%, 66% and 61% correctly classified at 3-, 12- and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.


AITI ◽  
2020 ◽  
Vol 17 (1) ◽  
pp. 42-55
Author(s):  
Radius Tanone ◽  
Arnold B Emmanuel

Bank XYZ is a bank in Kupang City, East Nusa Tenggara Province, with several ATMs placed at merchant locations. These ATMs are used by both customers and non-customers to conduct transactions. Depending on where an ATM is placed, it may not be used optimally for transactions, which wastes machine resources and leads to a condition called Not Operational Transaction (NOP). Given data consisting of several numeric independent variables, it is necessary to classify the dependent variable, NOP. A machine learning approach using the logistic regression method is the solution for this classification. The research steps were collecting data, analyzing it with machine learning in Python, and writing the report. The result obtained with this machine learning approach is a prediction value of 0.507 for the classification. This means that in the future Bank XYZ can classify NOP conditions based on the behavior of customers and non-customers making transactions at Bank XYZ ATMs.
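The abstract says only that logistic regression was run in Python on numeric predictors, with no implementation details. As a rough illustration of the technique, not the authors' code, the sketch below fits a logistic regression by batch gradient descent using only the standard library. The data are synthetic and every name (`fit_logistic`, `predict_proba`, the single feature) is hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Predicted probability of the positive class for one sample.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic, hypothetical data: one numeric feature, label 1 when it is large.
random.seed(0)
X = [[random.uniform(-3, 3)] for _ in range(200)]
y = [1 if x[0] + random.gauss(0, 0.5) > 0 else 0 for x in X]

w, b = fit_logistic(X, y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(f"weight={w[0]:.2f} bias={b:.2f} training accuracy={accuracy:.2f}")
```

In practice a library implementation would normally be preferred; the hand-written loop is used here only to keep the example self-contained.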


2021 ◽  
Vol 8 ◽  
Author(s):  
Robert A. Reed ◽  
Andrei S. Morgan ◽  
Jennifer Zeitlin ◽  
Pierre-Henri Jarreau ◽  
Héloïse Torchin ◽  
...  

Introduction: Preterm babies are a vulnerable population that experience significant short and long-term morbidity. Rehospitalisations constitute an important, potentially modifiable adverse event in this population. Improving the ability of clinicians to identify those patients at the greatest risk of rehospitalisation has the potential to improve outcomes and reduce costs. Machine-learning algorithms can provide potentially advantageous methods of prediction compared to conventional approaches like logistic regression. Objective: To compare two machine-learning methods (least absolute shrinkage and selection operator (LASSO) and random forest) to expert-opinion driven logistic regression modelling for predicting unplanned rehospitalisation within 30 days in a large French cohort of preterm babies. Design, Setting and Participants: This study used data derived exclusively from the population-based prospective cohort study of French preterm babies, EPIPAGE 2. Only those babies discharged home alive and whose parents completed the 1-year survey were eligible for inclusion in our study. All predictive models used a binary outcome, denoting a baby's status for an unplanned rehospitalisation within 30 days of discharge. Predictors included those quantifying clinical, treatment, maternal and socio-demographic factors. The predictive abilities of models constructed using LASSO and random forest algorithms were compared with a traditional logistic regression model. The logistic regression model comprised 10 predictors, selected by expert clinicians, while the LASSO and random forest included 75 predictors. Performance measures were derived using 10-fold cross-validation.
Performance was quantified using area under the receiver operator characteristic curve, sensitivity, specificity, Tjur's coefficient of determination and calibration measures. Results: The rate of 30-day unplanned rehospitalisation in the eligible population used to construct the models was 9.1% (95% CI 8.2–10.1) (350/3,841). The random forest model demonstrated both an improved AUROC (0.65; 95% CI 0.59–0.7; p = 0.03) and specificity vs. logistic regression (AUROC 0.57; 95% CI 0.51–0.62, p = 0.04). The LASSO performed similarly (AUROC 0.59; 95% CI 0.53–0.65; p = 0.68) to logistic regression. Conclusions: Compared to an expert-specified logistic regression model, random forest offered improved prediction of 30-day unplanned rehospitalisation in preterm babies. However, all models offered relatively low levels of predictive ability, regardless of modelling method.
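The reported rehospitalisation rate and its confidence interval can be approximated from the raw counts (350 of 3,841). The abstract does not state which interval method was used; the sketch below assumes a Wilson score interval purely for illustration, and the function name is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

rate = 350 / 3841
low, high = wilson_interval(350, 3841)
print(f"rate = {rate:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

With these counts the interval rounds to 8.2% to 10.1%, in line with the figures quoted above, though other interval methods give very similar bounds at this sample size.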


BMJ Open ◽  
2020 ◽  
Vol 10 (10) ◽  
pp. e040132
Author(s):  
Innocent B Mboya ◽  
Michael J Mahande ◽  
Mohanad Mohammed ◽  
Joseph Obure ◽  
Henry G Mwambi

Objective: We aimed to determine the key predictors of perinatal deaths using machine learning models compared with the logistic regression model. Design: A secondary data analysis using the Kilimanjaro Christian Medical Centre (KCMC) Medical Birth Registry cohort from 2000 to 2015. We assessed the discriminative ability of models using the area under the receiver operating characteristics curve (AUC) and the net benefit using decision curve analysis. Setting: The KCMC is a zonal referral hospital located in Moshi Municipality, Kilimanjaro region, Northern Tanzania. The Medical Birth Registry is within the hospital grounds at the Reproductive and Child Health Centre. Participants: Singleton deliveries (n=42 319) with complete records from 2000 to 2015. Primary outcome measures: Perinatal death (composite of stillbirths and early neonatal deaths). These outcomes were only captured before mothers were discharged from the hospital. Results: The proportion of perinatal deaths was 3.7%. There were no statistically significant differences in the predictive performance of four machine learning models except for bagging, which had a significantly lower performance (AUC 0.76, 95% CI 0.74 to 0.79, p=0.006) compared with the logistic regression model (AUC 0.78, 95% CI 0.76 to 0.81). However, in the decision curve analysis, the machine learning models had a higher net benefit (ie, the correct classification of perinatal deaths considering a trade-off between false-negatives and false-positives) over the logistic regression model across a range of threshold probability values. Conclusions: In this cohort, there was no significant difference in the prediction of perinatal deaths between machine learning and logistic regression models, except for bagging. The machine learning models had a higher net benefit, as their predictive ability for perinatal death was considerably superior to that of the logistic regression model.
The machine learning models, as demonstrated by our study, can be used to improve the prediction of perinatal deaths and triage for women at risk.
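Decision curve analysis, as used in this study, compares models by their net benefit at a chosen threshold probability: true positives per patient minus false positives per patient, weighted by the odds of the threshold. A minimal sketch of that standard formula follows; the outcomes and predicted risks are made-up illustration values, not data from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging everyone whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical outcomes (1 = event) and predicted risks, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.80, 0.10, 0.30, 0.60, 0.05, 0.20, 0.40, 0.15, 0.02, 0.25]

for pt in (0.1, 0.2, 0.3):
    # "Flag everyone" is the usual reference strategy in a decision curve.
    flag_all = net_benefit(y_true, [1.0] * len(y_true), pt)
    model = net_benefit(y_true, y_prob, pt)
    print(f"threshold={pt:.1f} model={model:.3f} flag-all={flag_all:.3f}")
```

A decision curve plots this quantity across a range of thresholds for each model, alongside the flag-everyone and flag-no-one (net benefit 0) reference strategies.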


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
I.-S Kim ◽  
P S Yang ◽  
H T Yu ◽  
T H Kim ◽  
J S Uhm ◽  
...  

Background: To evaluate the ability of machine learning algorithms to predict incident atrial fibrillation (AF) in the general population using health examination items. Methods: We included 483,343 subjects who received national health examinations from the Korean National Health Insurance Service-based National Sample Cohort (NHIS-NSC). We trained a deep neural network (DNN) model, as a deep learning system, and a decision tree (DT) model, as a machine learning system, using clinical variables and health examination items (including age, sex, body mass index, history of heart failure, hypertension or diabetes, baseline creatinine, and smoking and alcohol intake habits) to predict incident AF, using a training dataset of 341,771 subjects constructed from the NHIS-NSC database. The DNN and DT were validated using an independent test dataset of the 141,572 remaining subjects. The c-indices of the DNN and DT for prediction of incident AF were compared with that of a conventional logistic regression model. Results: During 1,874,789 person·years (mean±standard-deviation age 47.7±14.4 years, 49.6% male), 3,282 subjects with incident AF were observed. In the validation dataset, 1,139 subjects with incident AF were observed. The c-indices of the DNN and DT for incident AF prediction were 0.828 [0.819–0.836] and 0.835 [0.825–0.844], and were significantly higher (p<0.01) than that of the conventional logistic regression model (c-index=0.789 [0.784–0.794]). Conclusions: Application of machine learning using simple clinical variables and health examination items was helpful in predicting incident AF in the general population. A prospective study is warranted to develop individualized precision medicine.
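For a binary outcome, the c-index reported here is the fraction of (event, non-event) pairs in which the event case receives the higher predicted score, which equals the area under the ROC curve. The sketch below computes it by direct pair counting. It assumes a plain binary outcome with no censoring, which may differ from how the study handled follow-up time, and the scores are invented for illustration.

```python
def c_index(y_true, scores):
    """Fraction of (event, non-event) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

# Hypothetical labels (1 = incident event) and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
print(f"c-index = {c_index(y_true, scores):.3f}")
```

Pair counting is quadratic in sample size, so for cohorts of this scale a rank-based computation would be used instead; the result is the same.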


2020 ◽  
Author(s):  
Qiqiang Liang ◽  
Qinyu Zhao ◽  
Xin Xu ◽  
Yu Zhou ◽  
Man Huang

Background: The prevention and control of carbapenem-resistant gram-negative bacteria (CR-GNB) is a difficulty and a focus for clinicians in the intensive care unit (ICU). This study constructs a CR-GNB carriage prediction model in order to predict CR-GNB incidence within one week. Methods: The database comprises nearly 10,000 patients. The model is constructed using a multivariate logistic regression model and three machine learning algorithms. We then choose the optimal model and verify its accuracy by predicting daily, and recording, the occurrence of CR-GNB in all patients admitted over 4 months. Results: There are 1385 patients with positive CR-GNB cultures and 1535 negative patients in this study. Forty-five variables show statistically significant differences. We include 17 variables in the multivariate logistic regression model and build three machine learning models using all variables. In terms of accuracy and the area under the receiver operating characteristic (AUROC) curve, the random forest is better than XGBoost and the multivariate logistic regression model, and better than the decision tree model (accuracy: 84% > 82% > 81% > 72%; AUROC: 0.9089 > 0.8947 ≈ 0.8987 > 0.7845). In the 4-month prospective study, 81 cases were predicted to be positive in CR-GNB culture within 7 days, 146 cases were predicted to be negative, 86 cases were positive, and 120 cases were negative, with an overall accuracy of 84% and AUROC of 91.98%. Conclusions: Machine learning prediction models can predict the occurrence of CR-GNB colonization or infection within a one-week period, and can predict in real time and guide medical staff to identify high-risk groups more accurately.


Phishing refers to the mimicking of an original website. To carry out this kind of scam, the communication claims to come from an official representative of a website or another institution with which an individual probably does business (e.g., PayPal, Amazon, UPS, Bank of America). It targets vulnerabilities by means of pop-ups, ads, fake login pages, and so on. Web users are lured by exploiting their trust in order to obtain their sensitive data, such as usernames, passwords, account numbers, or other data, which are then used to open accounts, obtain loans, or purchase goods through e-commerce sites. Up to 5% of users appear to be lured by these attacks, so they can remain quite profitable for scammers, many of whom send a large number of scam e-mails a day. In this system, we offer a solution to this problem by making the user aware of such phishing activities, detecting scam links and URLs using a combination of the most powerful machine learning algorithms. As a result, we conclude our paper with an accuracy of 98.8% and a combination of 26 features. The best algorithm was the logistic regression model.


2021 ◽  
Vol 23 (1) ◽  
Author(s):  
Seulkee Lee ◽  
Seonyoung Kang ◽  
Yeonghee Eun ◽  
Hong-Hee Won ◽  
Hyungjin Kim ◽  
...  

Background: Few studies on rheumatoid arthritis (RA) have generated machine learning models to predict biologic disease-modifying antirheumatic drugs (bDMARDs) responses; however, these studies included insufficient analysis on important features. Moreover, machine learning is yet to be used to predict bDMARD responses in ankylosing spondylitis (AS). Thus, in this study, machine learning was used to predict such responses in RA and AS patients. Methods: Data were retrieved from the Korean College of Rheumatology Biologics therapy (KOBIO) registry. The number of RA and AS patients in the training dataset were 625 and 611, respectively. We prepared independent test datasets that did not participate in any process of generating machine learning models. Baseline clinical characteristics were used as input features. Responders were defined as those who met the ACR 20% improvement response criteria (ACR20) and ASAS 20% improvement response criteria (ASAS20) in RA and AS, respectively, at the first follow-up. Multiple machine learning methods, including random forest (RF-method), were used to generate models to predict bDMARD responses, and we compared them with the logistic regression model. Results: The RF-method model had superior prediction performance to logistic regression model (accuracy: 0.726 [95% confidence interval (CI): 0.725–0.730] vs. 0.689 [0.606–0.717], area under curve (AUC) of the receiver operating characteristic curve (ROC) 0.638 [0.576–0.658] vs. 0.565 [0.493–0.605], F1 score 0.841 [0.837–0.843] vs. 0.803 [0.732–0.828], AUC of the precision-recall curve 0.808 [0.763–0.829] vs. 0.754 [0.714–0.789]) with independent test datasets in patients with RA. However, machine learning and logistic regression exhibited similar prediction performance in AS patients.
Furthermore, the patient self-reporting scales, which are patient global assessment of disease activity (PtGA) in RA and Bath Ankylosing Spondylitis Functional Index (BASFI) in AS, were revealed as the most important features in both diseases. Conclusions: RF-method exhibited superior prediction performance for responses of bDMARDs to a conventional statistical method, i.e., logistic regression, in RA patients. In contrast, despite the comparable size of the dataset, machine learning did not outperform in AS patients. The most important features of both diseases, according to feature importance analysis were patient self-reporting scales.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Xinyun Liu ◽  
Jicheng Jiang ◽  
Lili Wei ◽  
Wenlu Xing ◽  
Hailong Shang ◽  
...  

Background: Machine learning (ML) can include more diverse and more complex variables to construct models. This study aimed to develop models based on ML methods to predict the all-cause mortality in coronary artery disease (CAD) patients with atrial fibrillation (AF). Methods: A total of 2037 CAD patients with AF were included in this study. Three ML methods were used, including the regularization logistic regression, random forest, and support vector machines. The fivefold cross-validation was used to evaluate model performance. The performance was quantified by calculating the area under the curve (AUC) with 95% confidence intervals (CI), sensitivity, specificity, and accuracy. Results: After univariate analysis, 24 variables with statistical differences were included into the models. The AUC of regularization logistic regression model, random forest model, and support vector machines model was 0.732 (95% CI 0.649–0.816), 0.728 (95% CI 0.642–0.813), and 0.712 (95% CI 0.630–0.794), respectively. The regularization logistic regression model presented the highest AUC value (0.732 vs 0.728 vs 0.712), specificity (0.699 vs 0.663 vs 0.668), and accuracy (0.936 vs 0.935 vs 0.935) among the three models. However, no statistical differences were observed in the receiver operating characteristic (ROC) curve of the three models (all P > 0.05). Conclusion: Combining the performance of all aspects of the models, the regularization logistic regression model was recommended to be used in clinical practice.
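Fivefold cross-validation, as mentioned in this abstract, partitions the sample into five folds and uses each fold once as the held-out test set. The sketch below shows one simple way to generate such splits for 2037 samples. The shuffling seed and the function name are assumptions, and the study may have used stratified or repeated folds instead.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(2037, k=5), start=1):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

A model would be fitted on each training split and scored on the matching test split, with the five scores then averaged.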

