Building more accurate decision trees with the additive tree

The expansion of machine learning to high-stakes application domains such as medicine, finance, and criminal justice, where making informed decisions requires clear understanding of the model, has increased the interest in interpretable machine learning. The widely used Classification and Regression Trees (CART) have played a major role in health sciences, due to their simple and intuitive explanation of predictions. Ensemble methods like gradient boosting can improve the accuracy of decision trees, but at the expense of the interpretability of the generated model. Additive models, such as those produced by gradient boosting, and full interaction models, such as CART, have been investigated largely in isolation. We show that these models exist along a spectrum, revealing previously unseen connections between these approaches. This paper introduces a rigorous formalization for the additive tree, an empirically validated learning technique for creating a single decision tree, and shows that this method can produce models equivalent to CART or gradient boosted stumps at the extremes by varying a single parameter. Although the additive tree is designed primarily to provide both the model interpretability and predictive performance needed for high-stakes applications like medicine, it also can produce decision trees represented by hybrid models between CART and boosted stumps that can outperform either of these approaches.
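The "gradient boosted stumps" end of the spectrum described above can be illustrated with a minimal sketch. It is not the additive tree algorithm from the paper; it only shows what a purely additive model built from depth-1 trees looks like. The single-feature setup, the squared-error loss, and the names `fit_stump`, `boost_stumps`, `predict` and `mse` are assumptions made for illustration.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (a stump) on one feature by least squares."""
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def boost_stumps(xs, ys, n_rounds=50, learning_rate=0.5):
    """Gradient boosting with squared loss: each stump is fit to the current residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    """The prediction is a plain sum of stump outputs, i.e. an additive model."""
    total = model["base"]
    for t, lv, rv in model["stumps"]:
        total += model["learning_rate"] * (lv if x <= t else rv)
    return total


def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return mean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))
```

A CART tree, by contrast, would keep splitting inside each branch, so every leaf encodes an interaction of all the splits above it. The abstract's single parameter interpolates between these two behaviours.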

Completing the Market

IMF Working Papers ◽

10.5089/9781513524085.001 ◽

2019 ◽

Vol 19 (292) ◽

Author(s):

Nan Hu ◽

Jian Li ◽

Alexis Meyer-Cirkel

Keyword(s):

Machine Learning ◽

Credit Risk ◽

Prediction Accuracy ◽

Ensemble Methods ◽

Predictive Performance ◽

Gradient Boosting ◽

Learning Methods ◽

Machine Learning Methods ◽

Ensemble Machine Learning ◽

Out Of Sample Prediction

We compared the predictive performance of a series of machine learning and traditional methods for monthly CDS spreads, using firms’ accounting-based, market-based and macroeconomic variables over the period 2006 to 2016. We find that ensemble machine learning methods (Bagging, Gradient Boosting and Random Forest) strongly outperform other estimators, and Bagging particularly stands out in terms of accuracy. Traditional credit risk models using OLS techniques have the lowest out-of-sample prediction accuracy. The results suggest that non-linear machine learning methods, especially the ensemble methods, add considerable value to existing credit risk prediction accuracy and enable CDS shadow pricing for companies missing those securities.

Forecasting US movies box office performances in Turkey using machine learning algorithms

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189120 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6579-6590

Author(s):

Sandy Çağlıyor ◽

Başar Öztayşi ◽

Selime Sezgin

Keyword(s):

Machine Learning ◽

Global Economy ◽

Learning Algorithms ◽

Forecast Model ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

High Stakes ◽

Box Office ◽

Industry Forecast ◽

The Impact

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performances and the industry still struggles to predict box office performances in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. From various sources, a dataset of 1559 movies is constructed. Firstly, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree, and random forest) are employed, and the impact of each group is observed by comparing the performance of the models. Then the number of target classes is increased to five and eight, and the results are compared with previously developed models in the literature.

Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10010042 ◽

2021 ◽

Vol 10 (1) ◽

pp. 42

Author(s):

Kieu Anh Nguyen ◽

Walter Chen ◽

Bor-Shiun Lin ◽

Uma Seeboonruang

Keyword(s):

Machine Learning ◽

Soil Erosion ◽

Ensemble Methods ◽

Machine Learning Algorithms ◽

Multivariate Adaptive Regression Splines ◽

Gradient Boosting ◽

Support Vector ◽

Ensemble Machine Learning ◽

Boosting Method ◽

Bagging Method

Although machine learning has been extensively used in various fields, it has only recently been applied to soil erosion pin modeling. To improve upon previous methods of quantifying soil erosion based on erosion pin measurements, this study explored the possible application of ensemble machine learning algorithms to the Shihmen Reservoir watershed in northern Taiwan. Three categories of ensemble methods were considered in this study: (a) bagging, (b) boosting, and (c) stacking. The bagging method in this study refers to bagged multivariate adaptive regression splines (bagged MARS) and random forest (RF), and the boosting method includes Cubist and gradient boosting machine (GBM). Finally, the stacking method is an ensemble method that uses a meta-model to combine the predictions of base models. This study used RF and GBM as the meta-models, and decision tree, linear regression, artificial neural network, and support vector machine as the base models. The dataset used in this study was sampled using stratified random sampling to achieve a 70/30 split for the training and test data, and the process was repeated three times. The performance of six ensemble methods in three categories was analyzed based on the average of three attempts. It was found that GBM performed the best among the ensemble models with the lowest root-mean-square error (RMSE = 1.72 mm/year), the highest Nash-Sutcliffe efficiency (NSE = 0.54), and the highest index of agreement (d = 0.81). This result was confirmed by the spatial comparison of the absolute differences (errors) between model predictions and observations using GBM and RF in the study area. In summary, the results show that as a group, the bagging method and the boosting method performed equally well, and the stacking method was third for the erosion pin dataset considered in this study.
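The three evaluation metrics named in this abstract can be written out as a short sketch. RMSE and the Nash-Sutcliffe efficiency follow their standard definitions. The abstract does not say which index of agreement was used, so the version below assumes Willmott's original formulation of d. The function names are illustrative only.

```python
from math import sqrt


def rmse(obs, pred):
    """Root-mean-square error between observations and predictions."""
    return sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - num / den


def index_of_agreement(obs, pred):
    """Index of agreement d (assumed here to be Willmott's original form)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2 for o, p in zip(obs, pred))
    return 1 - num / den
```

Both NSE and d divide by a quantity that is zero when all observations are identical, so the sketch assumes the observed values vary.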

Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap). We also performed extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.

Mapping of the Canopy Openings in Mixed Beech–Fir Forest at Sentinel-2 Subpixel Level Using UAV and Machine Learning Approach

Remote Sensing ◽

10.3390/rs12233925 ◽

2020 ◽

Vol 12 (23) ◽

pp. 3925

Author(s):

Ivan Pilaš ◽

Mateo Gašparović ◽

Alan Novkinić ◽

Damir Klobučar

Keyword(s):

Machine Learning ◽

Forest Canopy ◽

Vegetation Index ◽

Predictive Performance ◽

Spatial Extent ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Sentinel 2

The presented study demonstrates a bi-sensor approach suitable for rapid and precise up-to-date mapping of forest canopy gaps for the larger spatial extent. The approach makes use of Unmanned Aerial Vehicle (UAV) red, green and blue (RGB) images on smaller areas for highly precise forest canopy mask creation. Sentinel-2 was used as a scaling platform for transferring information from the UAV to a wider spatial extent. Various approaches to an improvement in the predictive performance were examined: (I) the highest R2 of the single satellite index was 0.57, (II) the highest R2 using multiple features obtained from the single-date, S-2 image was 0.624, and (III) the highest R2 on the multitemporal set of S-2 images was 0.697. Satellite indices such as Atmospherically Resistant Vegetation Index (ARVI), Infrared Percentage Vegetation Index (IPVI), Normalized Difference Index (NDI45), Pigment-Specific Simple Ratio Index (PSSRa), Modified Chlorophyll Absorption Ratio Index (MCARI), Color Index (CI), Redness Index (RI), and Normalized Difference Turbidity Index (NDTI) were the dominant predictors in most of the Machine Learning (ML) algorithms. The more complex ML algorithms such as the Support Vector Machines (SVM), Random Forest (RF), Stochastic Gradient Boosting (GBM), Extreme Gradient Boosting (XGBoost), and Catboost that provided the best performance on the training set exhibited weaker generalization capabilities. Therefore, a simpler and more robust Elastic Net (ENET) algorithm was chosen for the final map creation.

Evaluation of machine learning algorithms for classification of primary biological aerosol using a new UV-LIF spectrometer

Atmospheric Measurement Techniques ◽

10.5194/amt-10-695-2017 ◽

2017 ◽

Vol 10 (2) ◽

pp. 695-708 ◽

Cited By ~ 25

Author(s):

Simon Ruske ◽

David O. Topping ◽

Virginia E. Foot ◽

Paul H. Kaye ◽

Warren R. Stanley ◽

...

Keyword(s):

Neural Networks ◽

Decision Trees ◽

Supervised Learning ◽

Ensemble Methods ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Data Set ◽

Shape Information ◽

Accuracy Of Measurements

Abstract. Characterisation of bioaerosols has important implications within environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification. For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations for support vector machines (libsvm and liblinear), Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis), the k-nearest neighbours algorithm and artificial neural networks. The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol. Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67.6 and 91.1 % for the two data sets respectively.
For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82.8 and 98.27 % of the testing data, respectively, across the two data sets. A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained, which could possibly yield better results. We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.

Discovery of novel Li SSE and anode coatings using interpretable machine learning and high-throughput multi-property screening

Scientific Reports ◽

10.1038/s41598-021-94275-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shreyas J. Honrao ◽

Xin Yang ◽

Balachandran Radhakrishnan ◽

Shigemasa Kuwata ◽

Hideyuki Komatsu ◽

...

Keyword(s):

Machine Learning ◽

High Throughput ◽

Solid Electrolytes ◽

Ensemble Methods ◽

Gradient Boosting ◽

Materials Properties ◽

Metal Anode ◽

Safety Issues ◽

Interpretable Machine Learning ◽

Migration Barriers

All-solid-state batteries with Li metal anode can address the safety issues surrounding traditional Li-ion batteries as well as the demand for higher energy densities. However, the development of solid electrolytes and protective anode coatings possessing high ionic conductivity and good stability with Li metal has proven to be a challenge. Here, we present our informatics approach to explore the Li compound space for promising electrolytes and anode coatings using high-throughput multi-property screening and interpretable machine learning. To do this, we generate a database of battery-related materials properties by computing $\mathrm{Li}^+$ migration barriers and stability windows for over 15,000 Li-containing compounds from Materials Project. We screen through the database for candidates with good thermodynamic and electrochemical stabilities, and low $\mathrm{Li}^+$ migration barriers, identifying promising new candidates such as $\mathrm{Li_9S_3N}$, $\mathrm{LiAlB_2O_5}$, $\mathrm{LiYO_2}$, $\mathrm{LiSbF_4}$, and $\mathrm{Sr_4Li(BN_2)_3}$, among others. We train machine learning models, using ensemble methods, to predict migration barriers and oxidation and reduction potentials of these compounds by engineering input features that ensure accuracy and interpretability. Using only a small number of features, our gradient boosting regression models achieve $R^2$ values of 0.95 and 0.92 on the oxidation and reduction potential prediction tasks, respectively, and 0.86 on the migration barrier prediction task. Finally, we use Shapley additive explanations and permutation feature importance analyses to interpret our machine learning predictions and identify materials properties with the largest impact on predictions in our models.
We show that our approach has the potential to enable rapid discovery and design of novel solid electrolytes and anode coatings.
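Permutation feature importance, one of the two interpretation techniques mentioned in this abstract, can be sketched in a few lines. The abstract does not describe the authors' implementation, so everything below is an assumption for illustration: the model is any callable taking a feature tuple, the error measure is mean squared error, and the names `mean_squared_error` and `permutation_importance` are hypothetical.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared difference between model(row) and the target."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Importance of feature j = mean increase in error after shuffling column j."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column, leaving every other feature untouched.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [
                tuple(r[:j]) + (v,) + tuple(r[j + 1:])
                for r, v in zip(rows, column)
            ]
            increases.append(mean_squared_error(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model ignores gets an importance of zero, because shuffling it cannot change any prediction. A feature the model relies on gets a positive score.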

Predicting patient outcomes in psychiatric hospitals with routine data: a machine learning approach

10.21203/rs.2.15371/v1 ◽

2019 ◽

Author(s):

Jan Wolff ◽

Alexander Gary ◽

Daniela Jung ◽

Claus Normann ◽

Klaus Kaier ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Psychiatric Hospital ◽

Hospital Care ◽

Routine Data ◽

Psychiatric Hospitals ◽

Predictive Performance ◽

Gradient Boosting ◽

Stochastic Gradient Boosting ◽

Better Than

Background: A common problem in machine learning applications is availability of data at the point of decision making. The aim of the present study was to use routine data readily available at admission to predict aspects relevant to the organization of psychiatric hospital care. A further aim was to compare the results of machine learning with those obtained through traditional methods and a naive baseline classifier. Methods: The study included consecutively discharged patients between 1st of January 2017 and 31st of December 2018 from nine psychiatric hospitals in Hesse, Germany. We compared the predictive performance achieved by stochastic gradient boosting (GBM) with multiple logistic regression and a naive baseline classifier. We tested the performance of our final models on unseen patients from another calendar year and from different hospitals. Results: The study included 45,388 inpatient episodes. The models’ performance, as measured by the area under the Receiver Operating Characteristic curve, varied strongly between the predicted outcomes, with relatively high performance in the prediction of coercive treatment (area under the curve: 0.83) and 1:1 observations (0.80) and relatively poor performance in the prediction of short length of stay (0.69) and non-response to treatment (0.65). The GBM performed slightly better than logistic regression. Both approaches were substantially better than a naive prediction based solely on basic diagnostic grouping. Conclusion: The present study has shown that administrative routine data can be used to predict aspects relevant to the organisation of psychiatric hospital care. Future research should investigate the predictive performance that is necessary to provide effective assistance in clinical practice for the benefit of both staff and patients.
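The area under the ROC curve reported in this abstract equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition follows. The function name `roc_auc` and the 0/1 label encoding are assumptions for illustration, and the quadratic loop is meant for clarity, not for 45,388 episodes.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

A classifier that assigns the same score to every patient gets an AUC of 0.5 under this definition, which is the reference point against which values such as 0.83 or 0.65 are read.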

The sensitivity of pCO2 reconstructions in the Southern Ocean to sampling scales: a semi-idealized model sampling and reconstruction approach

10.5194/bg-2021-344 ◽

2022 ◽

Author(s):

Laique Merlin Djeutchouang ◽

Nicolette Chang ◽

Luke Gregor ◽

Marcello Vichi ◽

Pedro Manuel Scheel Monteiro

Keyword(s):

Machine Learning ◽

Southern Ocean ◽

Seasonal Cycle ◽

Temporal Frequency ◽

Gradient Boosting ◽

Biogeochemical Model ◽

Clear Understanding ◽

Surface Ocean ◽

Temporal Sampling ◽

Model Domain

Abstract. The Southern Ocean is a complex system yet is sparsely sampled in both space and time. These factors raise questions about the confidence in present sampling strategies and associated machine learning (ML) reconstructions. Previous studies have not yielded a clear understanding of the origin of uncertainties and biases for the reconstructions of the partial pressure of carbon dioxide (pCO2) at the surface ocean (pCO2ocean). Here, we examine these questions by investigating the sensitivity of pCO2ocean reconstruction uncertainties and biases to a series of semi-idealized observing system simulation experiments (OSSEs) that simulate spatio-temporal sampling scales of surface ocean pCO2 in ways that are comparable to ocean CO2 observing platforms (Ship, Waveglider, Carbon-float, Saildrone). These experiments sampled a high spatial resolution (±10 km) coupled physical and biogeochemical model (NEMO-PISCES) within a sub-domain representative of the Sub-Antarctic and Polar Frontal Zones in the Southern Ocean. The reconstructions were done using a two-member ensemble approach that consisted of two machine learning (ML) methods, (1) the feed-forward neural network and (2) the gradient boosting machines. With the baseline observations being from the simulated ships mimicking observations from the Surface Ocean CO2 Atlas (SOCAT), we applied to each of the scale-sampling simulation scenarios the two-member ensemble method ML2, to reconstruct the full sub-domain pCO2ocean and assess the reconstruction skill through a statistical comparison of reconstructed pCO2ocean and model domain mean. The analysis shows that uncertainties and biases for pCO2ocean reconstructions are very sensitive to both the spatial and temporal scales of pCO2 sampling in the model domain. 
The four key findings from our investigation are the following: (1) improving ML-based pCO2 reconstructions in the Southern Ocean requires simultaneous high resolution observations of the meridional and the seasonal cycle (< 3 days) of pCO2ocean; (2) Saildrones stand out as the optimal platforms to simultaneously address these requirements; (3) Wavegliders with hourly/daily resolution in pseudo-mooring mode improve on Carbon-floats (10-day period), which suggests that sampling aliases from the low temporal frequency have a greater negative impact on their uncertainties, biases and reconstruction means; and (4) the present summer seasonal sampling biases in SOCAT data in the Southern Ocean may be behind a significant winter bias in the reconstructed seasonal cycle of pCO2ocean.

Bio, psycho, or social: supervised machine learning to classify discursive framing of depression in online health communities

Quality & Quantity ◽

10.1007/s11135-021-01299-0 ◽

2022 ◽

Author(s):

Renáta Németh ◽

Fanni Máté ◽

Eszter Katona ◽

Márton Rakovics ◽

Domonkos Sik

Keyword(s):

Social Sciences ◽

Machine Learning ◽

Data Science ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Online Health Communities ◽

Health Communities ◽

Open Question

Supervised machine learning on textual data has successful industrial/business applications, but it is an open question whether it can be utilized in social knowledge building outside the scope of hermeneutically more trivial cases. Combining sociology and data science raises several methodological and epistemological questions. In our study the discursive framing of depression is explored in online health communities. Three discursive frameworks are introduced: the bio-medical, psychological, and social framings of depression. Approximately 80,000 posts were collected, and a sample of them was manually classified. Conventional bag-of-words models, Gradient Boosting Machine, word-embedding-based models and a state-of-the-art Transformer-based model with transfer learning, DistilBERT, were applied to extend this classification to the whole database. In our experience, ‘discursive framing’ proves to be a complex and hermeneutically difficult concept, which affects the degree of both inter-annotator agreement and predictive performance. Our finding confirms that the level of inter-annotator disagreement provides a good estimate for the objective difficulty of the classification. By identifying the most important terms, we also interpreted the classification algorithms, which is of great importance in social sciences. We are convinced that machine learning techniques can extend the horizon of qualitative text analysis. Our paper supports a smooth fit of the new techniques into the traditional toolbox of social sciences.
