Credit Risk Model Based on Central Bank Credit Registry Data

Data science and machine-learning techniques help banks optimize enterprise operations, enhance risk analyses and gain competitive advantage. There is a vast amount of research on credit risk, but to our knowledge, none of it uses a credit registry as a data source to model the probability of default for individual clients. The goal of this paper is to evaluate different machine-learning models in order to build an accurate model for credit risk assessment using the real credit registry dataset of the Central Bank of the Republic of North Macedonia. We strongly believe that the model developed in this research will be an additional source of valuable information for commercial banks, because it leverages historical data for the entire population of the country across all commercial banks. In this research, we compare five machine-learning models for classifying credit risk data: logistic regression, decision tree, random forest, support vector machines (SVM) and neural network. We evaluate the five models using different machine-learning metrics, and we propose a model based on credit registry data from the central bank, with a detailed methodology, that can predict credit risk from the credit history of the country's population. Our results show that the best accuracy is achieved by the decision tree applied to imbalanced data, with and without scaling, followed by random forest and logistic regression.
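
Because the abstract reports accuracy on imbalanced data, it may help to see how accuracy behaves when one class dominates. The following is a minimal sketch, not the paper's evaluation: the 95/5 class split is a hypothetical portfolio, and the helper names `accuracy` and `balanced_accuracy` are introduced here for illustration only.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical portfolio (assumption): 95 non-defaulting clients (0), 5 defaults (1).
y_true = [0] * 95 + [1] * 5

# Baseline that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

A baseline that never predicts a default still scores high on plain accuracy here, which is one reason to read accuracy on imbalanced data alongside class-sensitive metrics.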

Drug Classification using Black-box models and Interpretability

International Journal for Research in Applied Science and Engineering Technology ◽ 10.22214/ijraset.2021.38203 ◽ 2021 ◽ Vol 9 (9) ◽ pp. 1518-1529

Author(s): Pooja Thakkar

Keyword(s): Machine Learning ◽ Random Forest ◽ Decision Tree ◽ Learning Models ◽ Drug Classification ◽ Box Models ◽ Machine Learning Model ◽ Black Box Models ◽ Insight Into ◽ Machine Learning Models

Abstract: The focus of this study is on drug classification using machine learning models, and on interpretability using LIME and SHAP to gain a thorough understanding of those models. To do this, the researchers used machine learning models such as random forest, decision tree, and logistic regression to classify drugs. Then, using LIME and SHAP, they assessed whether these models were interpretable, which allowed them to better understand the results. The paper concludes that LIME and SHAP can be used to gain insight into a machine learning model and to determine which attribute is responsible for differences in the outcomes. According to the LIME and SHAP results, the random forest and decision tree models are the best models for drug classification, with Na to K and BP being the most significant features. Keywords: Machine Learning, Black-box models, LIME, SHAP, Decision Tree
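
SHAP attributions are based on Shapley values, and the idea can be sketched without the SHAP library. The example below computes exact Shapley values for one prediction by averaging marginal contributions over all feature orderings. It is an illustration under stated assumptions, not the study's code: `toy_model`, its coefficients, the feature names `na_to_k`, `high_bp` and `age`, and the single reference input are all hypothetical, and real SHAP explainers estimate these values against a background dataset rather than one baseline.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start from the reference input
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Hypothetical scoring function standing in for a trained classifier (assumption).
def toy_model(x):
    return 0.05 * x["na_to_k"] + 0.3 * x["high_bp"] + 0.01 * x["na_to_k"] * x["high_bp"]

instance = {"na_to_k": 20.0, "high_bp": 1.0, "age": 40.0}
baseline = {"na_to_k": 10.0, "high_bp": 0.0, "age": 40.0}

phi = shapley_values(toy_model, instance, baseline)
for name, value in phi.items():
    print(name, round(value, 4))
```

The attributions sum to the difference between the model output for the instance and for the baseline, and a feature the model ignores (here `age`) receives zero. Enumerating all orderings is only feasible for a handful of features.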

Detection of Osteosarcoma on Bone Radiographs Using Convolutional Neural Networks

10.21528/cbic2021-16 ◽ 2021

Author(s): Larissa Asito ◽ Hélcio Pereira ◽ Marcello Nogueira-Barbosa ◽ Renato Tinós

Keyword(s): Machine Learning ◽ Neural Networks ◽ Feature Selection ◽ Random Forest ◽ Decision Tree ◽ Convolutional Neural Networks ◽ Learning Models ◽ Computer Aided ◽ Aided Diagnosis ◽ Machine Learning Models

We propose a computer-aided diagnosis system based on convolutional neural networks (CNNs) for the identification of osteosarcoma on bone radiographs. The CNN should indicate regions of the image that may contain tumors. To indicate these regions, we propose to split the image into windows and classify each window individually with a CNN. Pre-processing techniques, such as window exclusion and labeling, are proposed. Two CNNs are compared in the proposed system. The first is trained from scratch, while the second is a pre-trained CNN (VGG16). The CNNs are compared to four machine learning models that use features extracted from the image windows as inputs: multilayer perceptron (MLP), decision tree, random forest, and MLP with feature selection. In the experiments, the best performance was obtained by the pre-trained CNN.
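
The window-based approach described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the window size, the stride, and `mean_intensity_classifier` (a trivial stand-in for the CNN) are assumptions made for the example.

```python
def split_into_windows(image, size, stride):
    """Yield (row, col, window) for every full size x size window of a 2D image."""
    height, width = len(image), len(image[0])
    for r in range(0, height - size + 1, stride):
        for c in range(0, width - size + 1, stride):
            window = [row[c:c + size] for row in image[r:r + size]]
            yield r, c, window

def flag_windows(image, size, stride, classify):
    """Return the top-left corners of windows the classifier marks as suspicious."""
    return [(r, c) for r, c, w in split_into_windows(image, size, stride) if classify(w)]

# Hypothetical stand-in for the CNN (assumption): flag windows with high mean intensity.
def mean_intensity_classifier(window, threshold=0.5):
    pixels = [p for row in window for p in row]
    return sum(pixels) / len(pixels) > threshold

# Synthetic 8x8 "radiograph" with a bright 4x4 patch in the bottom-right corner.
image = [[0.0] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        image[r][c] = 1.0

flagged = flag_windows(image, size=4, stride=4, classify=mean_intensity_classifier)
print(flagged)
```

A stride smaller than the window size would produce overlapping windows, trading more classifier calls for finer localisation.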

Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽ 10.22219/kinetik.v5i3.1066 ◽ 2020 ◽ pp. 235-242

Author(s): Farrikh Alzami ◽ Erika Devi Udayanti ◽ Dwi Puji Prabowo ◽ Rama Aria Megantara

Keyword(s): Machine Learning ◽ Feature Extraction ◽ Random Forest ◽ Sentiment Analysis ◽ Classification Performance ◽ Document Preparation ◽ Learning Models ◽ Polarity Classification ◽ Negative Sentiment ◽ Machine Learning Models

Sentiment analysis in terms of polarity classification is very important in everyday life: knowing the polarity, people can find out whether a given document carries positive or negative sentiment, which helps them choose and make decisions. Sentiment analysis is usually done manually, so an automatic sentiment classification process is needed. However, studies that discuss which feature extraction methods and learning models suit unstructured sentiment analysis, with Amazon food reviews as the case, are rare. This research explores feature extraction methods such as Bag of Words, TF-IDF, Word2Vec, and a combination of TF-IDF and Word2Vec, with several machine learning models such as Random Forest, SVM, KNN and Naïve Bayes, to find a combination of feature extraction and learning model that can help add variety to the analysis of polarity sentiments. With document preprocessing, such as removing HTML tags, punctuation and special characters and applying Snowball stemming, TF-IDF combined with SVM proves suitable for polarity classification in unstructured sentiment analysis for the Amazon food review case, with a performance of 87.3 percent.
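
The preprocessing and TF-IDF weighting mentioned above can be sketched in plain Python. This is a simplified illustration, not the study's pipeline: it uses the textbook weighting tf * log(N / df), omits Snowball stemming and the SVM classifier, and the three sample reviews are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Strip HTML tags and punctuation, lowercase, and split on whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Invented sample reviews (assumption), including HTML and punctuation debris.
reviews = [
    "<p>The coffee was great!</p>",
    "The tea was terrible.",
    "Great tea, great price.",
]
vectors = tf_idf(reviews)
print(sorted(vectors[0].items(), key=lambda kv: -kv[1]))
```

Terms that occur in only one review, such as "coffee", receive a higher weight than terms shared across reviews, which is what makes the resulting vectors useful as classifier input.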

On Using Decision Tree Coverage Criteria for Testing Machine Learning Models

10.1145/3482909.3482911 ◽ 2021

Author(s): Sebastião Santos ◽ Beatriz Silveira ◽ Vinicius Durelli ◽ Rafael Durelli ◽ Simone Souza ◽ ...

Keyword(s): Machine Learning ◽ Decision Tree ◽ Learning Models ◽ Coverage Criteria ◽ Tree Coverage ◽ Machine Learning Models

Random forest and long short-term memory based machine learning models for classification of ion mobility spectrometry spectra

Chemical, Biological, Radiological, Nuclear, and Explosives (CBRNE) Sensing XXII ◽ 10.1117/12.2585829 ◽ 2021

Author(s): Patrick C. Riley ◽ Samir V. Deshpande ◽ Brian S. Ince ◽ Brian C. Hauck ◽ Kyle P. O'Donnell ◽ ...

Keyword(s): Machine Learning ◽ Random Forest ◽ Ion Mobility ◽ Short Term Memory ◽ Learning Models ◽ Short Term ◽ Term Memory ◽ Long Short Term Memory ◽ Machine Learning Models

Comparison of the Performance of Machine Learning Algorithms in Predicting Heart Disease

Frontiers in Health Informatics ◽ 10.30699/fhi.v10i1.349 ◽ 2021 ◽ Vol 10 (1) ◽ pp. 99

Author(s): Sajad Yousefi

Keyword(s): Machine Learning ◽ Logistic Regression ◽ Heart Disease ◽ Decision Tree ◽ Roc Curve ◽ Machine Learning Algorithms ◽ Supervised Machine Learning ◽ Learning Models ◽ Algorithm Performance ◽ Machine Learning Models

Introduction: Heart disease is often associated with conditions such as arteries clogged by sediment accumulation, which causes chest pain and heart attack. Many people die of heart disease every year. Most countries have a shortage of cardiovascular specialists, and thus a significant percentage of cases are misdiagnosed. Hence, predicting this disease is a serious issue. By applying machine learning models to a multidimensional dataset, this article aims to find the most efficient and accurate machine learning models for disease prediction. Material and Methods: Several algorithms were used to predict heart disease, most prominently the supervised machine learning algorithms Decision Tree, Random Forest and KNN. The algorithms were applied to a dataset of 294 samples taken from the UCI repository. The dataset includes heart disease features. To enhance algorithm performance, these features were analyzed, and feature importance scores and cross validation were considered. Results: The performance of the algorithms was compared using the ROC curve and criteria such as accuracy, precision, sensitivity and F1 score, evaluated for each model. For the Decision Tree algorithm, accuracy and AUC ROC were 83% and 99%, respectively. The Logistic Regression algorithm, with accuracy and AUC ROC of 88% and 91%, respectively, performed better than the other algorithms. Therefore, these techniques can be useful for physicians to predict heart disease in patients and prescribe for them correctly. Conclusion: Machine learning techniques can be used in medicine to analyze data collections related to a disease and to predict it. The area under the ROC curve and the evaluation criteria of a number of machine learning classification algorithms were compared to determine the most appropriate classifier for predicting heart disease. As a result of the evaluation, better performance was observed in both the Decision Tree and Logistic Regression models.
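
The evaluation criteria named in the abstract (accuracy, precision, sensitivity, F1 score and AUC ROC) can be computed as in the sketch below. It is illustrative only: the labels and scores are made-up values, not the study's data, and the 0.5 decision threshold is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity (recall) and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "sensitivity": sensitivity, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities (assumption).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy depends on the chosen threshold while AUC summarises ranking quality across all thresholds, so the two can order models differently, as they do for the Decision Tree and Logistic Regression figures reported above.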

Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water

Frontiers in Environmental Science ◽ 10.3389/fenvs.2021.701288 ◽ 2021 ◽ Vol 9

Author(s): Daniel Lowell Weller ◽ Tanzy M. T. Love ◽ Martin Wiedmann

Keyword(s): Machine Learning ◽ Random Forest ◽ Predictive Models ◽ Training Data ◽ Agricultural Water ◽ Learning Models ◽ Safety Hazards ◽ E Coli ◽ Resampling Method ◽ Machine Learning Models

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
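
The two resampling methods compared in the abstract can be sketched as follows. This is a simplified illustration, not the study's implementation: the two-feature samples are invented, `k=2` is an arbitrary choice, and the `smote` function shows only the core interpolation idea (a synthetic point placed between a minority sample and one of its nearest minority neighbours).

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def smote(minority, n_synthetic, k, rng):
    """Create synthetic points between a minority sample and one of its k nearest minority neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

rng = random.Random(0)
# Invented two-feature samples (assumption): few positives, many negatives.
positives = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
negatives = [(rng.uniform(3, 6), rng.uniform(3, 6)) for _ in range(20)]

oversampled = random_oversample(positives, len(negatives), rng)
synthetic = smote(positives, len(negatives) - len(positives), k=2, rng=rng)
balanced_positives = positives + synthetic
print(len(oversampled), len(balanced_positives))
```

Oversampling repeats existing minority samples exactly, whereas SMOTE adds new points inside the region spanned by the minority class. In either case, resampling is normally applied to the training split only, so that evaluation data keeps the original class balance.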
