Drug Classification using Black-box models and Interpretability

Machine learning continues to make strident advances in the prediction of desired properties concerning drug development. Problematically, the efficacy of machine learning in these arenas is reliant upon highly accurate and abundant data. These two limitations, high accuracy and abundance, are often taken together; however, insight into the dataset accuracy limitation of contemporary machine learning algorithms may yield insight into whether non-bench experimental sources of data may be used to generate useful machine learning models where there is a paucity of experimental data. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for a Naïve Bayes Network algorithm averaging 39% error in the training set required to lose predictivity on the test set. Additionally, a Random Forest tolerated a significant degree of categorical error introduced into the training set with an average error of 29% required to lose predictivity. However, we found the Probabilistic Neural Network algorithm did not tolerate as much categorical error requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.

Download Full-text

Credit Risk Model Based on Central Bank Credit Registry Data

Journal of Risk and Financial Management ◽

10.3390/jrfm14030138 ◽

2021 ◽

Vol 14 (3) ◽

pp. 138

Author(s):

Fisnik Doko ◽

Slobodan Kalajdziski ◽

Igor Mishkovski

Keyword(s):

Machine Learning ◽

Random Forest ◽

Decision Tree ◽

Central Bank ◽

Credit Risk ◽

Commercial Banks ◽

Registry Data ◽

Learning Models ◽

Model Based ◽

Machine Learning Models

Data science and machine-learning techniques help banks to optimize enterprise operations, enhance risk analyses and gain competitive advantage. There is a vast amount of research in credit risk, but to our knowledge, none of them uses credit registry as a data source to model the probability of default for individual clients. The goal of this paper is to evaluate different machine-learning models to create accurate model for credit risk assessment using the data from the real credit registry dataset of the Central Bank of Republic of North Macedonia. We strongly believe that the model developed in this research will be an additional source of valuable information to commercial banks, by leveraging historical data for all the population of the country in all the commercial banks. Thus, in this research, we compare five machine-learning models to classify credit risk data, i.e., logistic regression, decision tree, random forest, support vector machines (SVM) and neural network. We evaluate the five models using different machine-learning metrics, and we propose a model based on credit registry data from the central bank with detailed methodology that can predict the credit risk based on credit history of the population in the country. Our results show that the best accuracy is achieved by using decision tree performing on imbalanced data with and without scaling, followed by random forest and linear regression.

Download Full-text

Machine learning approach for predicting under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey

Genus ◽

10.1186/s41118-020-00106-2 ◽

2020 ◽

Vol 76 (1) ◽

Author(s):

Fikrewold H. Bitew ◽

Samuel H. Nyarko ◽

Lloyd Potter ◽

Corey S. Sparks

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Health Survey ◽

Predictive Power ◽

Demographic And Health Survey ◽

Learning Models ◽

Under Five ◽

Machine Learning Model ◽

Machine Learning Models

Abstract There is a dearth of literature on the use of machine learning models to predict important under-five mortality risks in Ethiopia. In this study, we showed spatial variations of under-five mortality and used machine learning models to predict its important sociodemographic determinants in Ethiopia. The study data were drawn from the 2016 Ethiopian Demographic and Health Survey. We used three machine learning models such as random forests, logistic regression, and K-nearest neighbors as well as one traditional logistic regression model to predict under-five mortality determinants. For each machine learning model, measures of model accuracy and receiver operating characteristic curves were used to evaluate the predictive power of each model. The descriptive results show that there are considerable regional variations in under-five mortality rates in Ethiopia. The under-five mortality prediction ability was found to be between 46.3 and 67.2% for the models considered, with the random forest model (67.2%) showing the best performance. The best predictive model shows that household size, time to the source of water, breastfeeding status, number of births in the preceding 5 years, sex of a child, birth intervals, antenatal care, birth order, type of water source, and mother’s body mass index play an important role in under-five mortality levels in Ethiopia. The random forest machine learning model produces a better predictive power for estimating under-five mortality risk factors and may help to improve policy decision-making in this regard. Childhood survival chances can be improved considerably by using these important factors to inform relevant policies.

Download Full-text

Enhancing Automatic Prediction of Spirality using SpArcFiRe's Spiral Arm Analysis and Random Forests

10.20944/preprints201806.0279.v1 ◽

2018 ◽

Author(s):

Pedro Silva ◽

Leon T. Cao ◽

Wayne B. Hayes

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithm ◽

Box Models ◽

Machine Learning Model ◽

Human Volunteers ◽

Post Hoc ◽

Black Box Models ◽

Spiral Arm

Automated machine classifications of galaxies are necessary because the size of upcoming surveys will overwhelm human volunteers. We improve upon existing machine classification methods by adding the output of SpArcFiRe to the inputs of a machine learning model. We use the human classifications from Galaxy Zoo 1 (GZ1) to train a random forest of decision trees to reproduce the human vote distributions of the Spiral class. We prefer the random forest model over other black box models like neural networks because it allows us to trace post hoc the precise reasoning behind the classification of each galaxy. We find that, across a sample of 470,000 Sloan galaxies that are large enough that details could be seen if they were there, the combination of SpArcFiRe outputs with existing SDSS features provides a better machine classification than either one alone on comparison to Galaxy Zoo 1. We suggest that adding SpArcFiRe outputs as features to any machine learning algorithm will likely improve its performance.

Download Full-text

Contextualized LORE for Fuzzy Attributes

10.3233/faia210164 ◽

2021 ◽

Author(s):

Najlaa Maaroof ◽

Antonio Moreno ◽

Mohammed Jabreel ◽

Aida Valls

Keyword(s):

Machine Learning ◽

Contextual Information ◽

Black Box ◽

New Method ◽

Learning Models ◽

Box Models ◽

Black Boxes ◽

Linguistic Labels ◽

Black Box Models ◽

Machine Learning Models

Despite the broad adoption of Machine Learning models in many domains, they remain mostly black boxes. There is a pressing need to ensure Machine Learning models that are interpretable, so that designers and users can understand the reasons behind their predictions. In this work, we propose a new method called C-LORE-F to explain the decisions of fuzzy-based black box models. This new method uses some contextual information about the attributes as well as the knowledge of the fuzzy sets associated to the linguistic labels of the fuzzy attributes to provide actionable explanations. The experimental results on three datasets reveal the effectiveness of C-LORE-F when compared with the most relevant related works.

Download Full-text

Detection of Osteosarcoma on Bone Radiographs Using Convolutional Neural Networks

10.21528/cbic2021-16 ◽

2021 ◽

Author(s):

Larissa Asito ◽

Hélcio Pereira ◽

Marcello Nogueira-Barbosa ◽

Renato Tinós

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Feature Selection ◽

Random Forest ◽

Decision Tree ◽

Convolutional Neural Networks ◽

Learning Models ◽

Computer Aided ◽

Aided Diagnosis ◽

Machine Learning Models

We propose a computer-aided diagnosis system based on convolutional neural networks (CNNs) for the identification of osteosarcoma on bone radiographs. The CNN should indicate regions of the image that may contain tumors. In order to indicate these regions on the image, we propose to split the image in windows and individually classify them by using a CNN. Techniques for pre-processing, such as window exclusion and labeling, are proposed. Two CNNs are compared in the proposed system. The first one is trained from scratch, while the second one is a pre-trained CNN (VGG16). The CNNs are compared to four machine learning models that use features extracted from the image windows as inputs: multilayer perceptron (MLP), decision tree, random forest, and MLP with feature selection. In the experiments, the best performance was obtained by the pre-trained CNN.

Download Full-text

Error Tolerance of Machine Learning Algorithms across Contemporary Biological Targets

Molecules ◽

10.3390/molecules24112115 ◽

2019 ◽

Vol 24 (11) ◽

pp. 2115 ◽

Cited By ~ 2

Author(s):

Thomas M. Kaiser ◽

Pieter B. Burger

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Probabilistic Neural Network ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Learning Models ◽

Bayes Network ◽

Insight Into ◽

Machine Learning Models

Machine learning continues to make strident advances in the prediction of desired properties concerning drug development. Problematically, the efficacy of machine learning in these arenas is reliant upon highly accurate and abundant data. These two limitations, high accuracy and abundance, are often taken together; however, insight into the dataset accuracy limitation of contemporary machine learning algorithms may yield insight into whether non-bench experimental sources of data may be used to generate useful machine learning models where there is a paucity of experimental data. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for a Naïve Bayes Network algorithm averaging 39% error in the training set required to lose predictivity on the test set. Additionally, a Random Forest tolerated a significant degree of categorical error introduced into the training set with an average error of 29% required to lose predictivity. However, we found the Probabilistic Neural Network algorithm did not tolerate as much categorical error requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.

Download Full-text

SpArcFiRe: Enhancing Spiral Galaxy Recognition using Arm Analysis and Random Forests

10.20944/preprints201806.0279.v2 ◽

2018 ◽

Author(s):

Pedro Silva ◽

Leon T. Cao ◽

Wayne B. Hayes

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithm ◽

Black Box ◽

Box Models ◽

Machine Learning Model ◽

Human Volunteers ◽

Post Hoc ◽

Black Box Models

Automated machine classifications of galaxies are necessary because the size of upcoming surveys will overwhelm human volunteers. We improve upon existing machine classification methods by adding the output of SpArcFiRe to the inputs of a machine learning model. We use the human classifications from Galaxy Zoo 1 (GZ1) to train a random forest of decision trees to reproduce the human vote distributions of the Spiral class. We prefer the random forest model over other black box models like neural networks because it allows us to trace post hoc the precise reasoning behind the classification of each galaxy. We find that, across a sample of 470,000 Sloan galaxies that are large enough that details could be seen if they were there, the combination of SpArcFiRe outputs with existing SDSS features provides a better machine classification than either one alone on comparison to Galaxy Zoo 1. We suggest that adding SpArcFiRe outputs as features to any machine learning algorithm will likely improve its performance.

Download Full-text

Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v5i3.1066 ◽

2020 ◽

pp. 235-242

Author(s):

Farrikh Alzami ◽

Erika Devi Udayanti ◽

Dwi Puji Prabowo ◽

Rama Aria Megantara

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Random Forest ◽

Sentiment Analysis ◽

Classification Performance ◽

Document Preparation ◽

Learning Models ◽

Polarity Classification ◽

Negative Sentiment ◽

Machine Learning Models

Sentiment analysis in terms of polarity classification is very important in everyday life, with the existence of polarity, many people can find out whether the respected document has positive or negative sentiment so that it can help in choosing and making decisions. Sentiment analysis usually done manually. Therefore, an automatic sentiment analysis classification process is needed. However, it is rare to find studies that discuss extraction features and which learning models are suitable for unstructured sentiment analysis types with the Amazon food review case. This research explores some extraction features such as Word Bags, TF-IDF, Word2Vector, as well as a combination of TF-IDF and Word2Vector with several machine learning models such as Random Forest, SVM, KNN and Naïve Bayes to find out a combination of feature extraction and learning models that can help add variety to the analysis of polarity sentiments. By assisting with document preparation such as html tags and punctuation and special characters, using snowball stemming, TF-IDF results obtained with SVM are suitable for obtaining a polarity classification in unstructured sentiment analysis for the case of Amazon food review with a performance result of 87,3 percent.

Download Full-text