shapr: An R-package for explaining machine learning models with dependence-aware Shapley values

Background: Machine learning models have proven to be useful tools for the analysis of genetic data. However, with the availability of a wide variety of such methods, model selection has become increasingly difficult, from both the human and the computational perspective. Results: We present the R package FRESA.CAD Binary Classification Benchmarking, which performs systematic comparisons between a collection of representative machine learning methods for solving binary classification problems on genetic datasets. Conclusions: FRESA.CAD Binary Benchmarking proves to be a useful tool across a variety of binary classification problems involving the analysis of genetic data, showing both quantitative and qualitative advantages over similar packages.

Interpretation of Compound Activity Predictions from Complex Machine Learning Models Using Local Approximations and Shapley Values

Journal of Medicinal Chemistry ◽

10.1021/acs.jmedchem.9b01101 ◽

2019 ◽

Vol 63 (16) ◽

pp. 8761-8777 ◽

Cited By ~ 9

Author(s):

Raquel Rodríguez-Pérez ◽

Jürgen Bajorath

Keyword(s):

Machine Learning ◽

Learning Models ◽

Local Approximations ◽

Shapley Values ◽

Machine Learning Models

Interpreting tree ensemble machine learning models with endoR

10.1101/2022.01.03.474763 ◽

2022 ◽

Author(s):

Albane Ruaud ◽

Niklas A Pfister ◽

Ruth E Ley ◽

Nicholas D Youngblut

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

R Package ◽

Metagenomic Data ◽

Learning Models ◽

Model Interpretation ◽

Ensemble Machine Learning ◽

Fermenting Bacteria ◽

Microbiome Data ◽

Machine Learning Models

Background: Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they yield only limited insights into how microbial taxa or genomic content may be associated. Results: We developed endoR, a method to interpret a fitted tree ensemble model. First, endoR simplifies the fitted model into a decision ensemble, from which it then extracts information on the importance of individual features and their pairwise interactions; it also visualizes these data as an interpretable network. Both the network and the importance scores derived from endoR provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed the performance of endoR on both simulated and real metagenomic data. We found that endoR infers true associations with accuracy comparable to or greater than that of other commonly used approaches, while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we used endoR to gain insights into components of the microbiome that predict the presence of human gut methanogens, as these hydrogen consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria.
Conclusion: Our method accurately captures how tree ensembles use features and interactions between them to predict a response. As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems. An implementation of endoR is available as an open-source R-package on GitHub (https://github.com/leylabmpi/endoR).

Statistical and Machine Learning Models for Classification of Human Wear and Delivery Days in Accelerometry Data

Sensors ◽

10.3390/s21082726 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2726

Author(s):

Ryan Moore ◽

Kristin R. Archer ◽

Leena Choi

Keyword(s):

Machine Learning ◽

R Package ◽

Learning Models ◽

Learning Context ◽

Related Data ◽

Processing Techniques ◽

Study Participants ◽

Monte Carlo Cross Validation ◽

Machine Learning Models

Accelerometers are increasingly being used in biomedical research, but the analysis of accelerometry data is often complicated by both the massive size of the datasets and the collection of unwanted data from the process of delivery to study participants. Current methods for removing delivery data involve arduous manual review of dense datasets. We aimed to develop models for the classification of days in accelerometry data as activity from human wear or the delivery process. These models can be used to automate the cleaning of accelerometry datasets that are adulterated with activity from delivery. We developed statistical and machine learning models for the classification of accelerometry data in a supervised learning context, using a large accelerometry dataset labeled for human activity and delivery. Model performances were assessed and compared using Monte Carlo cross-validation. We found that a hybrid convolutional recurrent neural network performed best in the classification task, with an F1 score of 0.960, but simpler models such as logistic regression and random forest also had excellent performance, with F1 scores of 0.951 and 0.957, respectively. The best-performing models and related data processing techniques are made publicly available in the R package PhysicalActivity.
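As an illustration of the evaluation scheme named in this abstract, the following is a minimal Python sketch of Monte Carlo cross-validation scored with F1. It is not the authors' code: the synthetic one-dimensional data, the threshold classifier, and all function names (`monte_carlo_splits`, `f1_score`, `fit_threshold`, `evaluate`) are assumptions made for the example.

```python
import random


def monte_carlo_splits(n_samples, n_repeats, test_fraction, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    for _ in range(n_repeats):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_threshold(xs, ys):
    """Hypothetical stand-in model: midpoint between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        return 0.5  # fallback when a class is missing from the training split
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def evaluate(xs, ys, n_repeats=50, test_fraction=0.25, seed=0):
    """Mean F1 over repeated random train/test splits."""
    scores = []
    for train, test in monte_carlo_splits(len(xs), n_repeats, test_fraction, seed):
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        preds = [1 if xs[i] > threshold else 0 for i in test]
        scores.append(f1_score([ys[i] for i in test], preds))
    return sum(scores) / len(scores)


# Synthetic data (an assumption): noisy labels around x = 0.5.
_rng = random.Random(42)
xs = [_rng.random() for _ in range(200)]
ys = [1 if x + _rng.gauss(0, 0.1) > 0.5 else 0 for x in xs]
mean_f1 = evaluate(xs, ys)
```

Unlike k-fold cross-validation, the test sets of different repeats may overlap; the number of repeats and the test fraction are chosen independently of each other.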

The Explanation Game: Explaining Machine Learning Models Using Shapley Values

Lecture Notes in Computer Science - Machine Learning and Knowledge Extraction ◽

10.1007/978-3-030-57321-8_2 ◽

2020 ◽

pp. 17-38 ◽

Cited By ~ 1

Author(s):

Luke Merrick ◽

Ankur Taly

Keyword(s):

Machine Learning ◽

Learning Models ◽

Shapley Values ◽

Machine Learning Models

Opening the Black Box: Machine Learning Interpretability and Inference Tools with an Application to Economic Forecasting

Data Science for Economics and Finance ◽

10.1007/978-3-030-66891-4_3 ◽

2021 ◽

pp. 43-63

Author(s):

Marcus Buckmann ◽

Andreas Joseph ◽

Helena Robertson

Keyword(s):

Machine Learning ◽

Linear Models ◽

Black Box ◽

Comparative Case Study ◽

Learning Models ◽

Aggregate Information ◽

Data Generating Process ◽

Shapley Values ◽

Functional Forms ◽

Machine Learning Models

We present a comprehensive comparative case study of the use of machine learning models for macroeconomic forecasting. We find that machine learning models mostly outperform conventional econometric approaches in forecasting changes in US unemployment on a 1-year horizon. To address the black box critique of machine learning models, we apply and compare two variable attribution methods: permutation importance and Shapley values. While the aggregate information derived from both approaches is broadly in line, Shapley values offer several advantages, such as the discovery of unknown functional forms in the data generating process and the ability to perform statistical inference. The latter is achieved by the Shapley regression framework, which allows for the evaluation and communication of machine learning models akin to that of linear models.
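To make the notion of a Shapley value attribution concrete, here is a minimal Python sketch that computes exact Shapley values for a single prediction by enumerating all feature coalitions. It is an illustration only, not the method of any paper listed here: the toy model, the function names, and the choice to replace absent features with a fixed baseline are all assumptions. Dependence-aware approaches estimate the coalition value differently.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction by full enumeration.

    Assumption: a coalition's value is the model output when features in
    the coalition take their values from x and all others take baseline.
    Cost grows as 2^n, so this is only feasible for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model: a linear term, an interaction, and an unused feature.
    return 2.0 * z[0] + z[0] * z[1] + 0.0 * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the interaction term is split equally between features 0 and 1, the unused feature receives zero, and the attributions sum to the difference between the prediction at `x` and the prediction at `baseline`.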

Interpretation of machine learning models using shapley values: application to compound potency and multi-target activity predictions

Journal of Computer-Aided Molecular Design ◽

10.1007/s10822-020-00314-0 ◽

2020 ◽

Vol 34 (10) ◽

pp. 1013-1026 ◽

Cited By ~ 4

Author(s):

Raquel Rodríguez-Pérez ◽

Jürgen Bajorath

Keyword(s):

Machine Learning ◽

Learning Models ◽

Shapley Values ◽

Target Activity ◽

Machine Learning Models

Improving XGBoost with Imagination Sampling

Communications of the Blyth Institute ◽

10.33014/issn.2640-5652.2.1.holloway.1 ◽

2020 ◽

Vol 2 (1) ◽

pp. 3-6

Author(s):

Eric Holloway

Keyword(s):

Machine Learning ◽

General System ◽

Learning Models ◽

Starting Point ◽

Machine Learning Models

Imagination Sampling is the usage of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling for obtaining multibox models. Here, the possibility of importing such models as the starting point for further automatic enhancement is explored.

Development of Machine Learning Models to Predict Student Performance in Computer Literacy Courses

International Review on Computers and Software (IRECOS) ◽

10.15866/irecos.v13i1.16863 ◽

2018 ◽

Vol 13 (1) ◽

pp. 21

Author(s):

George Anderson ◽

Oduronke T. Eyitayo

Keyword(s):

Machine Learning ◽

Student Performance ◽

Computer Literacy ◽

Learning Models ◽

Machine Learning Models
