Interpretable machine learning with an ensemble of gradient boosting machines

2021 ◽  
pp. 106993
Author(s):  
Andrei V. Konstantinov ◽  
Lev V. Utkin
2021 ◽  
Author(s):  
Seong Hwan Kim ◽  
Eun-Tae Jeon ◽  
Sungwook Yu ◽  
Kyungmi Oh ◽  
Chi Kyung Kim ◽  
...  

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726–0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.
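The area under the receiver operating characteristic curve quoted in this abstract can be read as the probability that a randomly chosen patient with the event is scored higher than a randomly chosen patient without it. The following is a minimal sketch of that rank-based definition in plain Python. The labels and scores are made up for illustration, the names `roc_auc`, `toy_labels`, and `toy_scores` are hypothetical, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation.

    Counts, over all positive/negative pairs, how often the positive case
    has the higher score. Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = event, 0 = no event) and made-up model scores.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of cases, which is fine for a toy example. A sort-based rank computation would be the usual choice for registry-sized data.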


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Shreyas J. Honrao ◽  
Xin Yang ◽  
Balachandran Radhakrishnan ◽  
Shigemasa Kuwata ◽  
Hideyuki Komatsu ◽  
...  

Abstract All-solid-state batteries with Li metal anode can address the safety issues surrounding traditional Li-ion batteries as well as the demand for higher energy densities. However, the development of solid electrolytes and protective anode coatings possessing high ionic conductivity and good stability with Li metal has proven to be a challenge. Here, we present our informatics approach to explore the Li compound space for promising electrolytes and anode coatings using high-throughput multi-property screening and interpretable machine learning. To do this, we generate a database of battery-related materials properties by computing $\mathrm{Li}^+$ migration barriers and stability windows for over 15,000 Li-containing compounds from Materials Project. We screen through the database for candidates with good thermodynamic and electrochemical stabilities, and low $\mathrm{Li}^+$ migration barriers, identifying promising new candidates such as $\mathrm{Li}_9\mathrm{S}_3\mathrm{N}$, $\mathrm{LiAlB}_2\mathrm{O}_5$, $\mathrm{LiYO}_2$, $\mathrm{LiSbF}_4$, and $\mathrm{Sr}_4\mathrm{Li}(\mathrm{BN}_2)_3$, among others. We train machine learning models, using ensemble methods, to predict migration barriers and oxidation and reduction potentials of these compounds by engineering input features that ensure accuracy and interpretability. Using only a small number of features, our gradient boosting regression models achieve $R^2$ values of 0.95 and 0.92 on the oxidation and reduction potential prediction tasks, respectively, and 0.86 on the migration barrier prediction task. Finally, we use Shapley additive explanations and permutation feature importance analyses to interpret our machine learning predictions and identify materials properties with the largest impact on predictions in our models. We show that our approach has the potential to enable rapid discovery and design of novel solid electrolytes and anode coatings.
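Two quantities named in this abstract, the coefficient of determination and permutation feature importance, are simple enough to sketch in plain Python. The example below is an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a fitted regressor (not the authors' gradient boosting model), the data are synthetic, and importance is measured here as the mean drop in R² after shuffling one feature column.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean drop in R^2 when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r_squared(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            score = r_squared(y_true, [predict(r) for r in shuffled])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


def toy_model(row):
    # Hypothetical stand-in for a fitted regressor; it ignores the third feature.
    return 3.0 * row[0] + 0.5 * row[1]


# Synthetic data: three features, with targets produced by the toy model itself.
data_rng = random.Random(42)
toy_rows = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
toy_targets = [toy_model(r) for r in toy_rows]
toy_baseline, toy_importances = permutation_importance(toy_model, toy_rows, toy_targets)
```

Because the toy model never reads the third feature, shuffling that column leaves the score unchanged, while the feature with the largest coefficient shows the largest drop.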


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Young Hee Jung ◽  
Hyejoo Lee ◽  
Hee Jin Kim ◽  
Duk L. Na ◽  
Hyun Jeong Han ◽  
...  

Abstract Amyloid-β (Aβ) PET positivity in patients with suspected cerebral amyloid angiopathy (CAA) MRI markers is predictive of a worse cognitive trajectory, and it provides insights into the underlying vascular pathology (CAA vs. hypertensive angiopathy) to facilitate prognostic prediction and appropriate treatment decisions. In this study, we applied two interpretable machine learning algorithms, gradient boosting machine (GBM) and random forest (RF), to predict Aβ PET positivity in patients with CAA MRI markers. In the GBM algorithm, the number of lobar cerebral microbleeds (CMBs), deep CMBs, lacunes, CMBs in dentate nuclei, and age were ranked as the most influential for predicting Aβ positivity. In the RF algorithm, the absence of diabetes was additionally chosen. Cut-off values of the above variables predictive of Aβ positivity were as follows: (1) the number of lobar CMBs > 16.4 (GBM) / 14.3 (RF), (2) no deep CMBs (GBM/RF), (3) the number of lacunes > 7.4 (GBM/RF), (4) age > 74.3 (GBM) / 64 (RF), (5) no CMBs in the dentate nucleus (GBM/RF). The classification performances based on the area under the receiver operating characteristic curve were 0.83 for GBM and 0.80 for RF. Our study demonstrates the utility of interpretable machine learning in the clinical setting by quantifying the relative importance and cut-off values of predictive variables for Aβ positivity in patients with suspected CAA markers.


Author(s):  
Nichola Fountain-Jones ◽  
Christopher Kozakiewicz ◽  
Brenna Forester ◽  
Erin Landguth ◽  
Scott Carver ◽  
...  

We introduce a new R package ‘MrIML’ (Multi-response Interpretable Machine Learning). MrIML provides a powerful and interpretable framework that enables users to harness recent advances in machine learning to map multi-locus genomic relationships, to identify loci of interest for future landscape genetics studies and to gain new insights into adaptation across environmental gradients. Relationships between genetic change and environment are often non-linear, interactive and autocorrelated. Our package helps capture this complexity and offers functions that construct, fit and conduct inference on a wide range of highly flexible models that are routinely used for single-locus landscape genetics studies but are rarely extended to estimate response functions for multiple loci. To demonstrate the package’s broad functionality, we test its ability to recover landscape relationships from simulated genomic data. We also apply the package to two empirical case studies. In the first, we estimate variation in the population-level genetic composition of North American balsam poplar (Populus balsamifera, Salicaceae), and in the second, we recover individual-level landscapes while estimating host drivers of feline immunodeficiency virus genetic spread in bobcats (Lynx rufus). The ability to model thousands of loci collectively and compare models from linear regression to extreme gradient boosting, within the same analytical framework, has the potential to be transformative. The MrIML framework is also extendable and not limited to mapping genetic change; for example, it can be used to quantify the environmental drivers of microbiomes and coinfection dynamics.


2021 ◽  
Vol 11 (1) ◽  
pp. 133-152
Author(s):  
Devesh Singh

Abstract To advance interpretable machine learning (IML), this research proposes local interpretable model-agnostic explanations (LIME) as a new and informative visualization technique for analyzing foreign direct investment (FDI) inflow. The article examines the determinants of FDI inflow in Hungary through IML with a supervised learning method, using the open-source H2O artificial intelligence platform. The author used three ML algorithms, a generalized linear model (GLM), a gradient boosting machine (GBM), and a random forest (RF) classifier, to analyze FDI inflow from 2001 to 2018. The results show that, among the three classifiers, GBM performs best in analyzing FDI inflow determinants. The value of production in a region is the most influential determinant of FDI inflow in Hungarian regions. Explanatory visualizations derived from the analyzed dataset are presented to support decision-making.
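The core idea behind LIME, explaining one prediction by fitting a simple model that is weighted toward points near the instance of interest, can be sketched for a single feature in plain Python. This is an illustration only: it is neither the LIME library nor the H2O workflow used in the article, the real method samples random perturbations over many features, and the names `local_linear_surrogate` and `toy_black_box` are hypothetical.

```python
import math


def local_linear_surrogate(black_box, x0, half_width=1.0, n_points=41, kernel_width=0.5):
    """Fit a proximity-weighted straight line to black_box around x0.

    Returns (slope, intercept) of the local surrogate. Points closer to x0
    receive larger weights through a Gaussian-shaped kernel.
    """
    step = 2.0 * half_width / (n_points - 1)
    xs = [x0 - half_width + k * step for k in range(n_points)]
    ys = [black_box(x) for x in xs]
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]

    # Closed-form weighted least squares for one feature.
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return slope, y_bar - slope * x_bar


def toy_black_box(x):
    # Hypothetical nonlinear model whose behaviour near one point is explained.
    return x ** 2


toy_slope, toy_intercept = local_linear_surrogate(toy_black_box, x0=3.0)
```

For this toy function, the surrogate slope at `x0 = 3.0` matches the true local derivative of 6 up to floating-point error, which is the sense in which the straight line "explains" the black box near that point.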


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Seong-Hwan Kim ◽  
Eun-Tae Jeon ◽  
Sungwook Yu ◽  
Kyungmi Oh ◽  
Chi Kyung Kim ◽  
...  

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multicenter prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanation (SHAP) method to evaluate feature importance. Of the 3,213 stroke patients, the 2,363 who had arrived at the hospital within 24 h of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.772; 95% confidence interval, 0.715–0.829). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the effects of the features on the predictive power of the model were individualized using the SHAP method.
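The Shapley additive explanation method attributes a single prediction to individual features so that the attributions sum to the difference between that prediction and a reference prediction. The sketch below computes exact Shapley values by enumerating every feature coalition, which is feasible only for a handful of features. It is an illustration under stated assumptions: `toy_risk_model`, `instance`, and `background` are hypothetical, "absent" features are replaced by one reference value each (the SHAP library typically averages over a background dataset), and none of this is the authors' code.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions of the other features."""
    players = list(range(n_features))
    phis = []
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi += weight * (value_fn(coalition | {i}) - value_fn(coalition))
        phis.append(phi)
    return phis


def toy_risk_model(x):
    # Hypothetical model with one interaction term between features 0 and 2.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 3.0, 2.0]    # the case being explained (made up)
background = [0.0, 1.0, 0.0]  # reference values used for "absent" features (made up)


def coalition_value(present):
    """Model output when only the features in `present` take the instance's values."""
    mixed = [instance[j] if j in present else background[j] for j in range(3)]
    return toy_risk_model(mixed)


toy_phis = shapley_values(coalition_value, 3)
```

In this toy case the attributions are 2.5, 2.0, and 0.5: the interaction term contributes 1.0 in total and is split evenly between the two features involved, and the three values sum to the gap between the instance's prediction (6.0) and the reference prediction (1.0).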


2020 ◽  
Vol 39 (5) ◽  
pp. 6579-6590
Author(s):  
Sandy Çağlıyor ◽  
Başar Öztayşi ◽  
Selime Sezgin

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performance, and the industry still struggles to predict box office performance in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. From various sources, a dataset of 1559 movies is constructed. First, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting trees, and random forests) are employed, and the impact of each variable group is observed by comparing the performance of the resulting models. Then the number of target classes is increased to five and eight, and the results are compared with models previously developed in the literature.
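Discretizing a continuous target such as attendance into a fixed number of classes can be sketched in a few lines of plain Python. The abstract does not say how the class boundaries were chosen, so the equal-frequency (quantile) binning below is an assumption made for illustration, and the attendance figures and the names `quantile_edges` and `discretize` are hypothetical.

```python
def quantile_edges(values, n_classes):
    """Boundaries that split the sorted values into n_classes near-equal groups."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_classes] for k in range(1, n_classes)]


def discretize(value, edges):
    """Class index = number of boundaries the value has reached or passed."""
    return sum(value >= edge for edge in edges)


# Made-up attendance figures for nine movies.
toy_attendance = [1200, 530000, 45000, 8000, 250000, 90000, 3000, 1500000, 60000]
toy_edges = quantile_edges(toy_attendance, 3)
toy_classes = [discretize(v, toy_edges) for v in toy_attendance]
```

Passing `n_classes=5` or `n_classes=8` to the same helper yields the finer class schemes, which makes it easy to compare models across different target granularities.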

