Why segmentation matters: a Machine Learning approach for predicting loan defaults in the Peer-to-Peer (P2P) Financial Ecosystem

Peer-to-Peer (P2P) lending is an online lending process that allows individuals to obtain or grant loans without the involvement of traditional financial intermediaries. It has grown quickly in recent years, with some platforms reaching billions of dollars in loan principal in a short amount of time. Since each loan carries a probability of loss due to the borrower's failure to repay, this paper addresses the borrower default prediction problem in the P2P financial ecosystem. The main assumption, which distinguishes this study from the available literature, is that borrowers sharing the same homeownership status display similar risk profiles, so a separate model should be developed per segment. We estimate the Probability of Default (PD) of a borrower by using Logistic Regression (LR) coupled with Weight of Evidence encoding. The feature set is identified via Sequential Feature Selection (SFS). We compare forward against backward SFS in terms of the Area Under the Curve (AUC) and choose the one that maximizes this statistic. Finally, we compare the results of the chosen LR approach against two other popular Machine Learning (ML) techniques: k-Nearest Neighbors (k-NN) and Random Forest (RF).
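
As an illustration of the Weight of Evidence encoding mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the common convention WoE = ln(share of non-defaulters / share of defaulters) per category; the function name `weight_of_evidence`, the `smoothing` parameter, and the toy data are hypothetical choices made here for illustration.

```python
import math
from collections import defaultdict


def weight_of_evidence(categories, defaults, smoothing=0.5):
    """Return {category: WoE} for a binary default flag (1 = default).

    Assumed convention: WoE = ln(share of goods / share of bads).
    `smoothing` is a hypothetical additive term that avoids log(0)
    when a category contains only goods or only bads.
    """
    good = defaultdict(float)
    bad = defaultdict(float)
    for cat, flag in zip(categories, defaults):
        if flag:
            bad[cat] += 1
        else:
            good[cat] += 1

    levels = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(levels)
    total_bad = sum(bad.values()) + smoothing * len(levels)

    woe = {}
    for cat in levels:
        share_good = (good[cat] + smoothing) / total_good
        share_bad = (bad[cat] + smoothing) / total_bad
        woe[cat] = math.log(share_good / share_bad)
    return woe


# Toy data (made up): category "A" has fewer defaults than category "B".
toy_woe = weight_of_evidence(["A", "A", "A", "B", "B", "B"], [0, 0, 1, 1, 1, 0])
```

Under this convention a positive value marks a category with relatively fewer defaulters. In a per-segment setup such as the one the abstract describes, the encoding would presumably be fitted separately within each homeownership segment before the logistic regression is trained.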

IIMLP: integrated information-entropy-based method for LncRNA prediction

BMC Bioinformatics ◽

10.1186/s12859-020-03884-w ◽

2021 ◽

Vol 22 (S3) ◽

Author(s):

Junyi Li ◽

Huinian Li ◽

Xiao Ye ◽

Li Zhang ◽

Qingzhe Xu ◽

...

Keyword(s):

Machine Learning ◽

DNA Sequences ◽

Information Entropy ◽

Area Under The Curve ◽

Prediction Method ◽

Machine Learning Algorithms ◽

Reading Frame ◽

Non Coding RNA ◽

The One ◽

Long Non Coding RNA

Abstract. Background: The prediction of long non-coding RNA (lncRNA) has attracted great attention from researchers, as more and more evidence indicates that various complex human diseases are closely related to lncRNAs. In the era of biomedical big data, in addition to the prediction of lncRNAs by biological experimental methods, many computational methods based on machine learning have been proposed to make better use of the sequence resources of lncRNAs. Results: We developed a lncRNA prediction method by integrating information-entropy-based features and machine learning algorithms. We calculate generalized topological entropy and generate 6 novel features for lncRNA sequences. Using these 6 features together with other features such as the open reading frame, we apply support vector machine, XGBoost and random forest algorithms to distinguish human lncRNAs. We compare our method with one that uses more k-mer features, and the results show that our method achieves a higher area under the curve, up to 99.7905%. Conclusions: We developed an accurate and efficient method that uses novel information entropy features to analyze and classify lncRNAs. Our method can also be extended to research on other functional elements in DNA sequences.
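
To give a flavor of what an entropy-based sequence feature looks like, here is a minimal sketch. It computes plain Shannon entropy over the k-mer distribution of a sequence, which is a simpler stand-in and not the generalized topological entropy used in the paper; the function name `kmer_entropy` and the example sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_entropy(sequence, k=3):
    """Shannon entropy (in bits) of the overlapping k-mer distribution.

    This is a simplified stand-in for an information-entropy feature,
    not the generalized topological entropy described in the abstract.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A repetitive sequence carries no k-mer diversity; a mixed one carries more.
low = kmer_entropy("AAAAAAAA", k=2)
high = kmer_entropy("ACGTTGCA", k=2)
```

A feature like this could be fed to a classifier alongside other descriptors such as open reading frame length, which is the general pattern the abstract describes.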

Which PHQ-9 Items Can Effectively Screen for Suicide? Machine Learning Approaches

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18073339 ◽

2021 ◽

Vol 18 (7) ◽

pp. 3339

Author(s):

Sunhae Kim ◽

Hye-Kyung Lee ◽

Kounseok Lee

Keyword(s):

Machine Learning ◽

Primary Care ◽

Suicidal Ideation ◽

Random Forest ◽

Suicide Ideation ◽

Area Under The Curve ◽

Learning Approaches ◽

K Nearest Neighbors ◽

Predictive Values ◽

Linear Discriminant

(1) Background: The Patient Health Questionnaire-9 (PHQ-9) is a tool that screens patients for depression in primary care settings. In this study, we evaluated the efficacy of the PHQ-9 in evaluating suicidal ideation. (2) Methods: A total of 8760 completed questionnaires collected from college students were analyzed. The PHQ-9 was scored in combination with and evaluated against four categories (PHQ-2, PHQ-8, PHQ-9, and PHQ-10). Suicidal ideation was evaluated using the Mini-International Neuropsychiatric Interview suicidality module. The analyses used suicidal ideation as the dependent variable and three machine learning (ML) algorithms: k-nearest neighbors, linear discriminant analysis (LDA), and random forest. (3) Results: A random forest using the nine items of the PHQ-9 achieved an excellent area under the curve of 0.841, with 94.3% accuracy. The positive and negative predictive values were 84.95% (95% CI = 76.03–91.52) and 95.54% (95% CI = 94.42–96.48), respectively. (4) Conclusion: This study confirmed that ML algorithms using the PHQ-9 in the primary care field are reliably accurate in screening individuals with suicidal ideation.
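
The positive and negative predictive values reported above follow from the confusion matrix of a binary classifier. The sketch below shows the standard definitions (PPV = TP / (TP + FP), NPV = TN / (TN + FN)); the function name `predictive_values` and the toy labels are hypothetical, and no data from the study is reproduced.

```python
def predictive_values(y_true, y_pred):
    """Return (PPV, NPV) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    # Guard against division by zero when a class is never predicted.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Toy example (made up): 2 true positives, 1 false positive,
# 2 true negatives, 1 false negative.
ppv, npv = predictive_values([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1])
```

Both values depend on how common the positive class is in the sample, so they are not directly comparable across populations with different prevalence.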

Gully Erosion Susceptibility Mapping in Highly Complex Terrain Using Machine Learning Models

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10100680 ◽

2021 ◽

Vol 10 (10) ◽

pp. 680

Author(s):

Annan Yang ◽

Chunmei Wang ◽

Guowei Pang ◽

Yongqing Long ◽

Lei Wang ◽

...

Keyword(s):

Machine Learning ◽

Complex Terrain ◽

Large Scale ◽

Area Under The Curve ◽

Gully Erosion ◽

Susceptibility Mapping ◽

Weight Of Evidence ◽

Gradient Boosting ◽

Machine Learning Classification ◽

Extreme Gradient Boosting

Gully erosion is the most severe type of water erosion and is a major land degradation process. The efficiency and interpretability of gully erosion susceptibility mapping (GESM) remain a challenge, especially in complex terrain areas. In this study, a WoE-MLC model, which combines machine learning classification algorithms with the statistical weight of evidence (WoE) model, was used to address this problem in the Loess Plateau. The three machine learning (ML) algorithms used in this research were random forest (RF), gradient boosted decision trees (GBDT), and extreme gradient boosting (XGBoost). The results showed that: (1) GESM was well predicted by both the machine learning regression models and the WoE-MLC models, with area under the curve (AUC) values greater than 0.92 in both cases, and the latter was more computationally efficient and interpretable; (2) the XGBoost algorithm was more efficient in GESM than the other two algorithms, with the strongest generalization ability and the best performance in avoiding overfitting (averaged AUC = 0.947), followed by the RF algorithm (averaged AUC = 0.944) and the GBDT algorithm (averaged AUC = 0.938); and (3) slope gradient, land use, and altitude were the main factors for GESM. This study may provide a possible method for gully erosion susceptibility mapping at large scale.
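
Since several of the abstracts on this page compare models by AUC, a small sketch of how the statistic can be computed may help. It uses the rank interpretation of the ROC AUC: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions for illustration; the quadratic loop is for clarity, not efficiency.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy scores (made up): 3 of the 4 positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which values such as 0.947 versus 0.938 are being compared.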

Research on Default Prediction of Online Lending Borrowers Based on Machine Learning

Service Science and Management ◽

10.12677/ssem.2019.81006 ◽

2019 ◽

Vol 08 (01) ◽

pp. 40-48

Author(s):

相婷王

Keyword(s):

Machine Learning ◽

Default Prediction ◽

Online Lending

The Fintech Phenomenon: Protection of Consumer Privacy Data in Online Lending

Jurnal Kajian Ilmiah ◽

10.31599/jki.v21i2.564 ◽

2021 ◽

Vol 21 (2) ◽

pp. 185-194

Author(s):

Ika Dewi Sartika Saimima ◽

Valentino Gola Patria

Keyword(s):

Financial Services ◽

Legal Theory ◽

Personal Data ◽

Peer To Peer ◽

Legal Norms ◽

Consumer Privacy ◽

P2P Lending ◽

Peer Lending ◽

The One ◽

Online Lending

Abstract. Today's financial technology innovation is accelerating change in the financial sector. These developments are a double-edged sword, however: on the one hand they provide convenience for consumers, while on the other they pose risks to the confidentiality of consumers' personal data. Money lending through Peer to Peer lending (P2P lending) systems often results in consumers receiving threats when they are late in making payments. This paper presents several cases in which consumers experienced personal data theft or received threats directed at relatives or acquaintances, as well as cases of fraud in which money was taken from borrowers or customers without following the regulations issued by the Financial Services Authority (OJK). The research follows a qualitative normative approach, in which the data are interpreted on the basis of legal norms and legal theory that can explain and answer the existing legal problems. Keywords: Consumer Protection, Peer to Peer lending (P2P lending), Private Data Protection

Machine Learning Models for COVID-19 Detection in Brazil Based on Symptoms (Preprint)

10.2196/preprints.27293 ◽

2021 ◽

Author(s):

Íris Viana dos Santos Santana ◽

Andressa C. M. da Silveira ◽

Álvaro Sobrinho ◽

Lenardo Chaves e Silva ◽

Leandro Dias da Silva ◽

...

Keyword(s):

Machine Learning ◽

Early Stage ◽

Area Under The Curve ◽

Supervised Machine Learning ◽

Gradient Boosting ◽

Support Vector ◽

Accuracy Score ◽

K Nearest Neighbors ◽

Runny Nose ◽

Extreme Gradient Boosting

BACKGROUND: Controlling the COVID-19 outbreak in Brazil is considered a challenge of continental proportions due to the high population and urban density, weak implementation and maintenance of social distancing strategies, and limited testing capabilities. OBJECTIVE: To contribute to addressing this challenge, we present the implementation and evaluation of supervised Machine Learning (ML) models to assist COVID-19 detection in Brazil based on early-stage symptoms. METHODS: First, we conducted data preprocessing and applied the Chi-squared test to a Brazilian dataset, mainly composed of early-stage symptoms, to perform statistical analyses. Afterward, we implemented ML models using the Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), Decision Tree (DT), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost) algorithms. We evaluated the ML models using precision, accuracy score, recall, the area under the curve, and the Friedman and Nemenyi tests. Based on the comparison, we grouped the top five ML models and measured feature importance. RESULTS: The MLP model presented the highest mean accuracy score, with more than 97.85%, when compared to GBM (> 97.39%), RF (> 97.36%), DT (> 97.07%), XGBoost (> 97.06%), KNN (> 95.14%), and SVM (> 94.27%). Based on the statistical comparison, we grouped MLP, GBM, DT, RF, and XGBoost as the top five ML models because their evaluation results are statistically indistinguishable. The features that mattered to the ML models during prediction ranged from gender and profession to fever, sore throat, dyspnea, olfactory disorder, cough, runny nose, taste disorder, and headache. CONCLUSIONS: Supervised ML models effectively assist decision making in medical diagnosis and public administration (e.g., testing strategies), based on early-stage symptoms that do not require advanced and expensive exams.

Development and Validation of Spectrophotometric and Liquid Chromatographic Methods for Estimation of Tigecycline in Injections

International Journal of Pharmaceutical Sciences and Nanotechnology ◽

10.37285/ijpsn.2020.13.5.12 ◽

2020 ◽

Vol 13 (5) ◽

pp. 5148-5154

Author(s):

Sagar Suman Panda ◽

Ravi Kumar B.V.V.

Keyword(s):

Area Under The Curve ◽

Reversed Phase ◽

Equal Concentration ◽

Pharmaceutical Dosage Form ◽

Liquid Chromatographic Method ◽

Array Detection ◽

The Difference ◽

The One ◽

Development And Validation ◽

Liquid Chromatographic

Three new analytical methods were optimized and validated for the estimation of tigecycline (TGN) in its injection formulation. A difference UV spectroscopic, an area under the curve (AUC), and an ultrafast liquid chromatographic (UFLC) method were optimized for this purpose. The difference spectrophotometric method relied on the measurement of amplitude when equal concentration solutions of TGN in HCl are scanned against TGN in NaOH as reference. The measurements were done at 340 nm (maxima) and 410 nm (minima). Further, the AUC under both the maxima and minima were measured at 335-345 nm and 405-415 nm, respectively. The liquid chromatographic method utilized a reversed-phase column (150 mm × 4.6 mm, 5 µm) with a mobile phase of methanol: 0.01 M KH2PO4 buffer pH 3.5 (using orthophosphoric acid) in the ratio 80:20 (% v/v). The flow rate was 1.0 mL/min, and diode array detection was done at 349 nm. TGN eluted at 1.656 min. All the methods were validated for linearity, precision, accuracy, stability, and robustness. The developed methods produced validation results within the satisfactory limits of ICH guidance. Further, these methods were applied to estimate the amount of TGN present in commercial lyophilized injection formulations, and the results were compared using the One-Way ANOVA test. Overall, the methods are rapid, simple, and reliable for routine quality control of TGN in the bulk and pharmaceutical dosage form.
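
Note that "area under the curve" in this abstract refers to the area under an absorbance spectrum over a wavelength window, not the ROC AUC used in the machine learning entries. A minimal sketch of such an integration with the trapezoidal rule follows; the function name `spectrum_area`, the sampling grid, and the absorbance values are made up for illustration and are not taken from the paper, whose instrument software may integrate differently.

```python
def spectrum_area(wavelengths_nm, absorbances, lower_nm, upper_nm):
    """Trapezoidal area under an absorbance curve between two wavelengths.

    Only sampled points inside [lower_nm, upper_nm] are used; no
    interpolation is done at the window edges.
    """
    points = sorted(
        (w, a) for w, a in zip(wavelengths_nm, absorbances)
        if lower_nm <= w <= upper_nm
    )
    area = 0.0
    for (w0, a0), (w1, a1) in zip(points, points[1:]):
        area += 0.5 * (a0 + a1) * (w1 - w0)
    return area


# Hypothetical spectrum sampled every 5 nm around a 340 nm maximum.
wavelengths = [330, 335, 340, 345, 350]
absorbances = [0.10, 0.20, 0.40, 0.20, 0.10]
area_335_345 = spectrum_area(wavelengths, absorbances, 335, 345)
```

In an AUC-based assay, an area like this would typically be measured for a series of standard concentrations and used to build a calibration line.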

Use of Machine Learning to Investigate the Quantitative Checklist for Autism in Toddlers (Q-CHAT) towards Early Autism Screening

Diagnostics ◽

10.3390/diagnostics11030574 ◽

2021 ◽

Vol 11 (3) ◽

pp. 574

Author(s):

Gennaro Tartarisco ◽

Giovanni Cicceri ◽

Davide Di Pietro ◽

Elisa Leonardi ◽

Stefania Aiello ◽

...

Keyword(s):

Machine Learning ◽

High Performance ◽

Behavioral Science ◽

Autistic Traits ◽

Classification Performance ◽

Recursive Feature Elimination ◽

Diagnostic Tools ◽

Support Vector ◽

K Nearest Neighbors ◽

Autism Screening

In the past two decades, several screening instruments were developed to detect toddlers who may be autistic both in clinical and unselected samples. Among others, the Quantitative CHecklist for Autism in Toddlers (Q-CHAT) is a quantitative and normally distributed measure of autistic traits that demonstrates good psychometric properties in different settings and cultures. Recently, machine learning (ML) has been applied to behavioral science to improve the classification performance of autism screening and diagnostic tools, but mainly in children, adolescents, and adults. In this study, we used ML to investigate the accuracy and reliability of the Q-CHAT in discriminating young autistic children from those without. Five different ML algorithms (random forest (RF), naïve Bayes (NB), support vector machine (SVM), logistic regression (LR), and K-nearest neighbors (KNN)) were applied to investigate the complete set of Q-CHAT items. Our results showed that ML achieved an overall accuracy of 90%, and the SVM was the most effective, being able to classify autism with 95% accuracy. Furthermore, using the SVM–recursive feature elimination (RFE) approach, we selected a subset of 14 items ensuring 91% accuracy, while 83% accuracy was obtained from the 3 best discriminating items in common to ours and the previously reported Q-CHAT-10. This evidence confirms the high performance and cross-cultural validity of the Q-CHAT, and supports the application of ML to create shorter and faster versions of the instrument, maintaining high classification accuracy, to be used as a quick, easy, and high-performance tool in primary-care settings.

Early Prediction of Seven-Day Mortality in Intensive Care Unit Using a Machine Learning Model: Results from the SPIN-UTI Project

Journal of Clinical Medicine ◽

10.3390/jcm10050992 ◽

2021 ◽

Vol 10 (5) ◽

pp. 992

Author(s):

Martina Barchitta ◽

Andrea Maugeri ◽

Giuliana Favara ◽

Paolo Marco Riela ◽

Giovanni Gallo ◽

...

Keyword(s):

Machine Learning ◽

Intensive Care ◽

Intensive Care Units ◽

Learning Algorithm ◽

Area Under The Curve ◽

Support Vector ◽

ICU Admission ◽

Risk Of Death ◽

SAPS II ◽

SVM Algorithm

Patients in intensive care units (ICUs) are at higher risk of worsened prognosis and mortality. Here, we aimed to evaluate the ability of the Simplified Acute Physiology Score (SAPS II) to predict the risk of 7-day mortality, and to test a machine learning algorithm that combines the SAPS II with additional patient characteristics at ICU admission. We used data from the “Italian Nosocomial Infections Surveillance in Intensive Care Units” network. A Support Vector Machine (SVM) algorithm was used to classify 3782 patients according to sex, patient’s origin, type of ICU admission, non-surgical treatment for acute coronary disease, surgical intervention, SAPS II, presence of invasive devices, trauma, impaired immunity, antibiotic therapy and onset of HAI. The accuracy of SAPS II in distinguishing patients who died from those who did not was 69.3%, with an Area Under the Curve (AUC) of 0.678. Using the SVM algorithm, we instead achieved an accuracy of 83.5% and an AUC of 0.896. Notably, SAPS II was the variable that weighed most in the model, and its removal resulted in an AUC of 0.653 and an accuracy of 68.4%. Overall, these findings suggest that the present SVM model is a useful tool for the early prediction of patients at higher risk of death at ICU admission.

Detecting suicidal risk using MMPI-2 based on machine learning algorithm

Scientific Reports ◽

10.1038/s41598-021-94839-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sunhae Kim ◽

Hye-Kyung Lee ◽

Kounseok Lee

Keyword(s):

Machine Learning ◽

Suicidal Ideation ◽

Random Forest ◽

Minnesota Multiphasic Personality Inventory ◽

Learning Algorithm ◽

Suicidal Risk ◽

K Nearest Neighbors ◽

Large Group ◽

Suicidal Attempts ◽

Scale Scores

Abstract. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) is a widely used tool for early detection of psychological maladjustment and for assessing the level of adaptation of large groups in clinical settings, schools, and corporations. This study aims to evaluate the utility of the MMPI-2 in assessing suicidal risk using the results of the MMPI-2 and a suicidal risk evaluation. A total of 7,824 datasets collected from college students were analyzed. The MMPI-2-Restructured Clinical Scales (MMPI-2-RF) and the response results for each question of the Mini International Neuropsychiatric Interview (MINI) suicidality module were used. For statistical analysis, random forest and K-Nearest Neighbors (KNN) techniques were used with suicidal ideation and suicide attempt as dependent variables and 50 MMPI-2 scale scores as predictors. On applying the random forest method to suicidal ideation and suicide attempts, the accuracy was 92.9% and 95%, respectively, and the Areas Under the Curve (AUCs) were 0.844 and 0.851, respectively. When the KNN method was applied, the accuracy was 91.6% and 94.7%, respectively, and the AUCs were 0.722 and 0.639, respectively. The study confirmed that machine learning using the MMPI-2 for a large group provides reliable accuracy in classifying and predicting the subject's suicidal ideation and past suicide attempts.
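
K-Nearest Neighbors appears as a baseline in several abstracts on this page, so a bare-bones sketch of the idea may be useful. It classifies a query point by majority vote among the `k` closest training points under Euclidean distance. The function name `knn_predict` and the two-feature toy data are assumptions for illustration only; the study above used 50 scale scores as predictors, and nothing here reproduces its data or tuning.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Pair every training point with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up): two well-separated clusters labelled 0 and 1.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
label_near_origin = knn_predict(train_X, train_y, (0.2, 0.1))
label_far = knn_predict(train_X, train_y, (5.5, 5.2))
```

Because the vote uses raw distances, features on very different scales would usually be standardized first; that step is omitted here to keep the sketch short.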
