Interpreting tree ensemble machine learning models with endoR

Background: Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they yield only limited insights into how microbial taxa or genomic content may be associated. Results: We developed endoR, a method to interpret a fitted tree ensemble model. First, endoR simplifies the fitted model into a decision ensemble, from which it then extracts information on the importance of individual features and their pairwise interactions; it also visualizes these data as an interpretable network. Both the network and importance scores derived from endoR provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed the performance of endoR on both simulated and real metagenomic data. We found that endoR infers true associations with accuracy comparable to or higher than that of other commonly used approaches, while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we used endoR to gain insights into components of the microbiome that predict the presence of human gut methanogens, as these hydrogen consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria.
Conclusion: Our method accurately captures how tree ensembles use features and interactions between them to predict a response. As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems. An implementation of endoR is available as an open-source R-package on GitHub (https://github.com/leylabmpi/endoR).
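The general idea of turning a set of decision rules into feature scores and a pairwise interaction network can be illustrated with a toy sketch. This is not endoR's algorithm or API: the rule list, the weights, and the function name `summarize_rules` are all hypothetical, and the scoring (summing rule weights per feature and per co-occurring feature pair) is a simplifying assumption made only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical decision rules: (features used together in the rule, weight).
# The weights are made-up stand-ins for how much each rule contributes.
rules = [
    ({"taxon_A", "taxon_B"}, 0.5),
    ({"taxon_A"}, 0.3),
    ({"taxon_B", "taxon_C"}, 0.2),
]

def summarize_rules(rules):
    """Sum rule weights per feature (node score) and per feature pair (edge score)."""
    feature_score = Counter()
    pair_score = Counter()
    for features, weight in rules:
        for f in features:
            feature_score[f] += weight
        # Features that appear in the same rule are treated as interacting.
        for a, b in combinations(sorted(features), 2):
            pair_score[(a, b)] += weight
    return feature_score, pair_score

feature_score, pair_score = summarize_rules(rules)
for (a, b), w in sorted(pair_score.items()):
    print(f"{a} -- {b}: {w:.2f}")
```

The node and edge scores could then be passed to any graph-drawing tool; for actual analyses, the R package linked above is the reference implementation.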


Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2

Current Medicinal Chemistry ◽

10.2174/2213275912666191102162959 ◽

2020 ◽

Vol 28 (2) ◽

pp. 253-265 ◽

Cited By ~ 3

Author(s):

Gabriela Bitencourt-Ferreira ◽

Amauri Duarte da Silva ◽

Walter Filgueira de Azevedo

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Cyclin Dependent Kinase ◽

Learning Models ◽

Learning Techniques ◽

Machine Learning Models

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


Ensemble machine learning models for aviation incident risk prediction

Decision Support Systems ◽

10.1016/j.dss.2018.10.009 ◽

2019 ◽

Vol 116 ◽

pp. 48-63 ◽

Cited By ~ 21

Author(s):

Xiaoge Zhang ◽

Sankaran Mahadevan

Keyword(s):

Machine Learning ◽

Risk Prediction ◽

Learning Models ◽

Ensemble Machine Learning ◽

Machine Learning Models


PlotMI: visualization of pairwise interactions and positional preferences learned by a deep learning model from sequence data

10.1101/2021.03.14.435285 ◽

2021 ◽

Author(s):

Tuomo Hartonen ◽

Teemu Kivioja ◽

Jussi Taipale

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Sequence Data ◽

Predictive Performance ◽

Learning Model ◽

Biological Research ◽

Learning Approaches ◽

Learning Models ◽

Model Interpretation ◽

Pairwise Interactions

Deep learning models have in recent years gained success in various tasks related to understanding information coded in the DNA sequence. Rapidly developing genome-wide measurement technologies provide large quantities of data ideally suited for modeling using deep learning or other powerful machine learning approaches. Although they offer state-of-the-art predictive performance, the predictions made by deep learning models can be difficult to understand. In virtually all biological research, understanding how a predictive model works is as important as the raw predictive performance. Thus, interpretation of deep learning models is an emerging hot topic, especially in the context of biological research. Here we describe plotMI, a mutual-information-based model interpretation strategy that can intuitively visualize positional preferences and pairwise interactions learned by any machine learning model trained on sequence data with a defined alphabet as input. PlotMI is freely available at https://github.com/hartonen/plotMI.
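The quantity underlying a mutual-information-based view of pairwise interactions can be sketched in a few lines. The code below is not plotMI's implementation; it only computes the mutual information (in bits) between two positions across a list of equal-length sequences. The function name `mutual_information` and the example sequences are assumptions made for illustration, and how the sequences are chosen (for example, by a model's scores) is left open.

```python
import math
from collections import Counter

def mutual_information(seqs, i, j):
    """Mutual information in bits between positions i and j of equal-length sequences."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    i_counts = Counter(s[i] for s in seqs)
    j_counts = Counter(s[j] for s in seqs)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions.
        ratio = p_ab * n * n / (i_counts[a] * j_counts[b])
        mi += p_ab * math.log2(ratio)
    return mi

# Hypothetical sequences: positions 0 and 1 are perfectly coupled.
coupled = ["AT", "AT", "CG", "CG"]
# Hypothetical sequences: positions 0 and 1 vary independently.
independent = ["AA", "AC", "CA", "CC"]

print(mutual_information(coupled, 0, 1))
print(mutual_information(independent, 0, 1))
```

Computing this value for every pair of positions gives a matrix that can be drawn as a heatmap, with high values marking position pairs whose letters co-vary.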


Data-Driven Approach for Predicting and Explaining the Risk of Long-Term Unemployment

E3S Web of Conferences ◽

10.1051/e3sconf/202021401023 ◽

2020 ◽

Vol 214 ◽

pp. 01023

Author(s):

Linan (Frank) Zhao

Keyword(s):

Machine Learning ◽

Age Groups ◽

Learning Models ◽

Public Authorities ◽

Ensemble Machine Learning ◽

European Public ◽

Data Driven Approach ◽

Using Data ◽

Machine Learning Models

Long-term unemployment has significant societal impact and is of particular concern for policymakers with regard to economic growth and public finances. This paper constructs advanced ensemble machine learning models to predict citizens’ risks of becoming long-term unemployed using data collected from European public authorities for employment services. The proposed model achieves 81.2% accuracy in identifying citizens at high risk of long-term unemployment. This paper also examines how to dissect black-box machine learning models by offering explanations at both a local and global level using SHAP, a state-of-the-art model-agnostic approach to explain factors that contribute to long-term unemployment. Lastly, this paper addresses an under-explored question when applying machine learning in the public domain, that is, the inherent bias in model predictions. The results show that popular models such as gradient boosted trees may produce unfair predictions against senior age groups and immigrants. Overall, this paper sheds light on the recent increasing shift for governments to adopt machine learning models to profile and prioritize employment resources to reduce the detrimental effects of long-term unemployment and improve public welfare.
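Local explanations of the kind SHAP produces are based on Shapley values. As a minimal sketch, the code below computes exact Shapley values for a single instance by enumerating all feature subsets, with absent features replaced by a baseline value. It does not use the SHAP library and is not the paper's pipeline; `shapley_values`, `toy_model`, and the input values are hypothetical, and the enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for instance x; absent features take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[k] if k in subset else baseline[k] for k in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [k for k in range(n) if k != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # Hypothetical risk score: features 0 and 1 interact, feature 2 is unused.
    return 2.0 * z[0] + z[1] + z[0] * z[1]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved. Averaging absolute attributions over many instances is one common way to obtain a global view.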


Temporal and Spatial Autocorrelation as Determinants of Regional AOD-PM2.5 Model Performance in the Middle East

Remote Sensing ◽

10.3390/rs13183790 ◽

2021 ◽

Vol 13 (18) ◽

pp. 3790

Author(s):

Khang Chau ◽

Meredith Franklin ◽

Huikyo Lee ◽

Michael Garay ◽

Olga Kalashnikova

Keyword(s):

Machine Learning ◽

Middle East ◽

United Arab Emirates ◽

Atmospheric Correction ◽

Predictive Performance ◽

Variable Importance ◽

Learning Models ◽

Median Test ◽

Temporal And Spatial ◽

Machine Learning Models

Exposure to fine particulate matter (PM2.5) air pollution has been shown in numerous studies to be associated with detrimental health effects. However, the ability to conduct epidemiological assessments can be limited due to challenges in generating reliable PM2.5 estimates, particularly in parts of the world such as the Middle East where measurements are scarce and extreme meteorological events such as sandstorms are frequent. In order to supplement exposure modeling efforts under such conditions, satellite-retrieved aerosol optical depth (AOD) has proven to be useful due to its global coverage. By using AODs from the Multiangle Implementation of Atmospheric Correction (MAIAC) of the MODerate Resolution Imaging Spectroradiometer (MODIS) and the Multiangle Imaging Spectroradiometer (MISR) combined with meteorological and assimilated aerosol information from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), we constructed machine learning models to predict PM2.5 in the area surrounding the Persian Gulf, including Kuwait, Bahrain, and the United Arab Emirates (U.A.E.). Our models showed regional differences in predictive performance, with better results in the U.A.E. (median test R² = 0.66) than in Kuwait (median test R² = 0.51). Variable importance also differed by region: satellite-retrieved AOD variables were more important for predicting PM2.5 in Kuwait than in the U.A.E. Divergent trends in the temporal and spatial autocorrelations of PM2.5 and AOD in the two regions offered possible explanations for differences in predictive performance and variable importance. In a test of model transferability, we found that models trained in one region and applied to another did not predict PM2.5 well, even if the transferred model had better performance. Overall, the results of our study suggest that models developed over large geographic areas could generate PM2.5 estimates with greater uncertainty than could be obtained by taking a regional modeling approach. Furthermore, development of methods to better incorporate spatial and temporal autocorrelations in machine learning models warrants further examination.
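A "median test R²" can be read as the median of the coefficient of determination over several test sets. The sketch below assumes that reading, which the abstract does not spell out; the three splits and their values are invented for illustration and have no connection to the study's data.

```python
from statistics import mean, median

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical (observed, predicted) values for three test splits.
splits = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect predictions
    ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # always predicts the mean
    ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),  # one prediction off by 1
]

scores = [r_squared(obs, pred) for obs, pred in splits]
median_test_r2 = median(scores)
print(scores, median_test_r2)
```

The same function can score a transferred model: fit in one region, predict in another, and compare the resulting R² with that of a model trained locally.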


Efficient Breast Cancer Prediction Using Ensemble Machine Learning Models

2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT) ◽

10.1109/rteict46194.2019.9016968 ◽

2019 ◽

Cited By ~ 1

Author(s):

Naveen ◽

R. K. Sharma ◽

Anil Ramachandran Nair

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Learning Models ◽

Cancer Prediction ◽

Ensemble Machine Learning ◽

Machine Learning Models


chemmodlab: a cheminformatics modeling laboratory R package for fitting and assessing machine learning models

Journal of Cheminformatics ◽

10.1186/s13321-018-0309-4 ◽

2018 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Jeremy R. Ash ◽

Jacqueline M. Hughes-Oliver

Keyword(s):

Machine Learning ◽

R Package ◽

Learning Models ◽

Machine Learning Models


A predictive performance comparison of machine learning models for judicial cases

2017 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci.2017.8285436 ◽

2017 ◽

Cited By ~ 4

Author(s):

Zhenyu Liu ◽

Huanhuan Chen

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

Performance Comparison ◽

Learning Models ◽

Machine Learning Models


Development of Combined Heavy Rain Damage Prediction Models with Machine Learning

Water ◽

10.3390/w11122516 ◽

2019 ◽

Vol 11 (12) ◽

pp. 2516 ◽

Cited By ~ 1

Author(s):

Changhyun Choi ◽

Jeonghwan Kim ◽

Jungwook Kim ◽

Hung Soo Kim

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Prediction Model ◽

Prediction Models ◽

Predictive Performance ◽

Heavy Rain ◽

Learning Models ◽

Damage Prediction ◽

Natural Disaster Management ◽

Machine Learning Models

Adequate forecasting and preparation for heavy rain can minimize life and property damage. Some studies have been conducted on the heavy rain damage prediction model (HDPM); however, most of these models are limited to a linear regression model that simply explains the linear relation between rainfall data and damage. This study develops the combined heavy rain damage prediction model (CHDPM), in which a residual prediction model (RPM) is added to the HDPM. The predictive performance of the CHDPM was found to be 4–14% higher than that of the HDPM. Through this, we confirmed that the predictive performance of the model is improved by combining the RPM, built with machine learning models, to complement the linearity of the HDPM. The results of this study can be used as basic data beneficial for natural disaster management.
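The combination described above, a linear model plus a second model trained on its residuals, can be sketched as follows. This is an illustration of the general residual-stacking pattern, not the study's CHDPM: the one-variable least-squares fit, the 1-nearest-neighbour residual learner, and the rainfall and damage numbers are all assumptions chosen to keep the example small.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return lambda x: a + b * x

def fit_nearest_neighbour(xs, targets):
    """1-nearest-neighbour regressor, standing in for any nonlinear learner."""
    pairs = list(zip(xs, targets))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Hypothetical rainfall (x) and damage (y) values with a nonlinear relation.
rain = [0.0, 1.0, 2.0, 3.0, 4.0]
damage = [0.0, 1.0, 4.0, 9.0, 16.0]

# Step 1: linear base model (plays the role of the HDPM).
base = fit_linear(rain, damage)
# Step 2: residual model trained on what the base model misses (the RPM role).
residuals = [y - base(x) for x, y in zip(rain, damage)]
residual_model = fit_nearest_neighbour(rain, residuals)

def combined(x):
    """Base prediction corrected by the predicted residual."""
    return base(x) + residual_model(x)

print(base(2.4), combined(2.4))
```

Because a nearest-neighbour learner memorizes its training points, any gain from the residual model should be measured on held-out data rather than on the data used for fitting.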


Benchmarking machine learning models for the analysis of genetic data using FRESA.CAD Binary Classification Benchmarking

10.1101/733675 ◽

2019 ◽

Author(s):

Javier de Velasco Oriol ◽

Antonio Martinez-Torteya ◽

Victor Trevino ◽

Israel Alanis ◽

Edgar E. Vallejo ◽

...

Keyword(s):

Machine Learning ◽

Model Selection ◽

Binary Classification ◽

Genetic Data ◽

R Package ◽

Learning Models ◽

Classification Problems ◽

Machine Learning Methods ◽

Computational Perspective ◽

Machine Learning Models

Background: Machine learning models have proven to be useful tools for the analysis of genetic data. However, with the availability of a wide variety of such methods, model selection has become increasingly difficult, both from the human and computational perspective. Results: We present the R package FRESA.CAD Binary Classification Benchmarking that performs systematic comparisons between a collection of representative machine learning methods for solving binary classification problems on genetic datasets. Conclusions: FRESA.CAD Binary Benchmarking proves to be a useful tool over a variety of binary classification problems comprising the analysis of genetic data, showing both quantitative and qualitative advantages over similar packages.
