Partitioning environment and space in site-by-species matrices: a comparison of methods for community ecology and macroecology

Abstract: Community ecologists and macroecologists have long sought to evaluate the importance of environmental conditions and other drivers in determining species composition across sites. Different methods have been used to estimate species-environment relationships while accounting for or partitioning the variation attributed to environment and spatial autocorrelation, but their differences and respective reliability remain poorly known. We compared the performance of four families of statistical methods in estimating the contribution of the environment and space to explaining variation in multi-species occurrence and abundance. These methods included distance-based regression (MRM), constrained ordination (RDA and CCA), generalised linear and additive models (GLM, GAM), and tree-based machine learning (regression trees, boosted regression trees, and random forests). Depending on the method, the spatial model consisted of either Moran’s Eigenvector Maps (MEM; in constrained ordination and GLM), smooth spatial splines (in GAM), or tree-based non-linear modelling of spatial coordinates (in machine learning). We simulated typical ecological data to assess the methods’ performance in (1) fitting environmental and spatial effects, and (2) partitioning the variation explained by the environmental and spatial effects. Differences in fitting performance among the major model types, namely (G)LM, GAM and machine learning, were reflected in the variation partitioning performance of the different methods. Machine learning methods, namely boosted regression trees, performed best overall. GAM performed similarly well, though likelihood optimisation did not converge for some empirical test data. The remaining methods performed worse under most simulated data variations (depending on the type of species data, sample size and coverage, autocorrelation range, and response shape). Our results suggest that tree-based machine learning is a robust approach that can be widely used for variation partitioning. Our recommendations apply to single-species niche models, community ecology, and macroecology studies aiming at disentangling the relative contributions of space vs. environment and other drivers of variation in site-by-species matrices.
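
The partitioning step described in the abstract can be illustrated with the classic subtraction scheme, in which three fits (environment only, space only, both) yield unique and shared fractions. The abstract does not state which scheme or goodness-of-fit measure the authors used, so the sketch below is an assumption; the function name and the example R² values are hypothetical.

```python
def partition_variation(r2_env, r2_space, r2_both):
    """Split explained variation into unique and shared fractions.

    r2_env   : fit of the model with environmental predictors only
    r2_space : fit of the model with spatial predictors only
    r2_both  : fit of the model with both predictor sets
    """
    pure_env = r2_both - r2_space          # explained by environment alone
    pure_space = r2_both - r2_env          # explained by space alone
    shared = r2_env + r2_space - r2_both   # explained jointly (may be negative)
    unexplained = 1.0 - r2_both
    return {
        "pure_env": pure_env,
        "pure_space": pure_space,
        "shared": shared,
        "unexplained": unexplained,
    }


# Hypothetical R^2 values, for illustration only.
fractions = partition_variation(r2_env=0.40, r2_space=0.30, r2_both=0.55)
```

The four fractions always sum to one, which makes a convenient sanity check whichever model family produced the three fits.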

SAT0587 MACHINE-LEARNING DERIVED ALGORITHMS FOR OUTCOMES PREDICTION IN RHEUMATIC DISEASES: APPLICATION TO RADIOGRAPHIC PROGRESSION IN EARLY AXIAL SPONDYLOARTHRITIS

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2020-eular.431 ◽

2020 ◽

Vol 79 (Suppl 1) ◽

pp. 1252.2-1253

Author(s):

R. Garofoli ◽

M. Resche-Rigon ◽

M. Dougados ◽

D. Van der Heijde ◽

C. Roux ◽

...

Keyword(s):

Machine Learning ◽

Radiographic Progression ◽

Generalized Additive Models ◽

Regression Trees ◽

Information Criterion ◽

Additive Models ◽

Super Learner ◽

Additive Regression ◽

Selection Operator ◽

Lasso Method

Background: Axial spondyloarthritis (axSpA) is a chronic rheumatic disease that encompasses various clinical presentations: inflammatory chronic back pain, peripheral manifestations and extra-articular manifestations. The current nomenclature divides axSpA into radiographic (in the presence of radiographic sacroiliitis) and non-radiographic (in the absence of radiographic sacroiliitis, with or without MRI sacroiliitis) forms. Given that the functional burden of the disease appears to be greater in patients with radiographic forms, it seems crucial to be able to predict which patients will be more likely to develop structural damage over time. Predictive factors for radiographic progression in axSpA have been identified through the use of traditional statistical models such as logistic regression. However, these models have some limitations. To overcome these limitations and to improve predictive performance, machine learning (ML) methods have been developed.

Objectives: To compare ML models with traditional models for predicting radiographic progression in patients with early axSpA.

Methods: Study design: prospective French multicentric cohort study (DESIR cohort) with 5 years of follow-up. Patients: all patients included in the cohort, i.e. 708 patients with inflammatory back pain for >3 months but <3 years, highly suggestive of axSpA. Data on the first 5 years of follow-up were used. Statistical analyses: radiographic progression was defined as progression either at the spine (increase of at least 1 point per 2 years in mSASSS score) or at the sacroiliac joint (worsening of at least one grade of the mNY score between 2 visits). Traditional modelling: we first performed a bivariate analysis between our outcome (radiographic progression) and explanatory variables at baseline to select the variables to be included in our models, and then built a logistic regression model (M1). Variable selection for traditional models was performed with 2 different methods: stepwise selection based on the Akaike Information Criterion (stepAIC method; M2) and the Least Absolute Shrinkage and Selection Operator (LASSO) method (M3). We also performed a sensitivity analysis on all patients with a manual backward method (M4) after multiple imputation of missing data. Machine learning modelling: using the “SuperLearner” package in R, we modelled radiographic progression with stepAIC, LASSO, random forest, Discrete Bayesian Additive Regression Trees Samplers (DBARTS), Generalized Additive Models (GAM), multivariate adaptive polynomial spline regression (polymars), Recursive Partitioning And Regression Trees (RPART) and Super Learner. Finally, the accuracy of traditional and ML models was compared based on their 10-fold cross-validated AUC (cv-AUC).

Results: The 10-fold cv-AUCs for the traditional models M2 and M3 were 0.79 and 0.78, respectively. The three best ML models were the GAM, the DBARTS and the Super Learner models, with 10-fold cv-AUCs of 0.77, 0.76 and 0.74, respectively (Table 1).

Table 1. Comparison of 10-fold cross-validated AUC between best traditional and machine learning models.

Best models                                                          Cross-validated AUC
Traditional models
  M2 (stepAIC method)                                                0.79
  M3 (LASSO method)                                                  0.78
Machine learning approach
  SL Discrete Bayesian Additive Regression Trees Samplers (DBARTS)   0.76
  SL Generalized Additive Models (GAM)                               0.77
  Super Learner                                                      0.74

AUC: Area Under the Curve; AIC: Akaike Information Criterion; LASSO: Least Absolute Shrinkage and Selection Operator; SL: SuperLearner. N = 295.

Conclusion: Traditional models predicted radiographic progression better than ML models in this early axSpA population. Further ML algorithms, image-based or using other artificial intelligence methods (e.g. deep learning), might perform better than traditional models in this setting.

Acknowledgments: Thanks to the French National Society of Rheumatology and the DESIR cohort.

Disclosure of Interests: Romain Garofoli: None declared, Matthieu Resche-Rigon: None declared, Maxime Dougados Grant/research support from: AbbVie, Eli Lilly, Merck, Novartis, Pfizer and UCB Pharma, Consultant of: AbbVie, Eli Lilly, Merck, Novartis, Pfizer and UCB Pharma, Speakers bureau: AbbVie, Eli Lilly, Merck, Novartis, Pfizer and UCB Pharma, Désirée van der Heijde Consultant of: AbbVie, Amgen, Astellas, AstraZeneca, BMS, Boehringer Ingelheim, Celgene, Cyxone, Daiichi, Eisai, Eli-Lilly, Galapagos, Gilead Sciences, Inc., Glaxo-Smith-Kline, Janssen, Merck, Novartis, Pfizer, Regeneron, Roche, Sanofi, Takeda, UCB Pharma; Director of Imaging Rheumatology BV, Christian Roux: None declared, Anna Moltó Grant/research support from: Pfizer, UCB, Consultant of: Abbvie, BMS, MSD, Novartis, Pfizer, UCB
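
The comparison metric above, a 10-fold cross-validated AUC, can be sketched in a few lines. This is an illustration and not the authors' R workflow: the stratified fold assignment, the averaging of per-fold AUCs, and the toy scorer are all assumptions, since the abstract does not describe these details.

```python
import random


def auc(labels, scores):
    """Probability that a random positive case outscores a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, k, seed=0):
    """Deal the shuffled cases of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(X, y, fit, predict, k=10, seed=0):
    """Mean AUC over k held-out folds; fit(X, y) -> model, predict(model, X) -> scores."""
    fold_aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores = predict(model, [X[i] for i in held_out])
        fold_aucs.append(auc([y[i] for i in held_out], scores))
    return sum(fold_aucs) / len(fold_aucs)


# Hypothetical toy data: a single predictor that separates the classes perfectly.
X = list(range(40))
y = [1 if x >= 20 else 0 for x in X]


def fit(X_train, y_train):
    return None  # nothing to learn for this toy scorer


def predict(model, X_test):
    return list(X_test)  # the predictor itself serves as the score


cv_auc = cross_validated_auc(X, y, fit, predict, k=10)
```

Any model can be plugged in through the `fit` and `predict` callables, so the same loop would serve a logistic regression and a tree ensemble alike.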

MACHINE LEARNING AND GERONTOLOGY: BOOSTED REGRESSION TREES PREDICT AGE DIFFERENCES IN STRESSOR EXPERIENCE

The Gerontologist ◽

10.1093/geront/gnv195.10 ◽

2015 ◽

Vol 55 (Suppl_2) ◽

pp. 461-462

Keyword(s):

Machine Learning ◽

Age Differences ◽

Regression Trees ◽

Boosted Regression Trees

Comparative performance of generalized additive models and boosted regression trees for statistical modeling of incidental catch of wahoo (Acanthocybium solandri) in the Mexican tuna purse-seine fishery

Ecological Modelling ◽

10.1016/j.ecolmodel.2012.03.006 ◽

2012 ◽

Vol 233 ◽

pp. 20-25 ◽

Cited By ~ 31

Author(s):

Raul O. Martínez-Rincón ◽

Sofía Ortega-García ◽

Juan G. Vaca-Rodríguez

Keyword(s):

Statistical Modeling ◽

Generalized Additive Models ◽

Regression Trees ◽

Additive Models ◽

Boosted Regression Trees ◽

Purse Seine ◽

Comparative Performance ◽

Incidental Catch ◽

Purse Seine Fishery

A Spatial Approach for Modeling Amphibian Road-Kills: Comparison of Regression Techniques

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10050343 ◽

2021 ◽

Vol 10 (5) ◽

pp. 343

Author(s):

Diana Sousa-Guedes ◽

Marc Franch ◽

Neftalí Sillero

Keyword(s):

Geographically Weighted Regression ◽

Environmental Variables ◽

Linear Models ◽

Model Performance ◽

Regression Trees ◽

Additive Models ◽

Boosted Regression Trees ◽

Weighted Regression ◽

Road Kill ◽

Regression Techniques

Road networks are the main source of mortality for many species. Amphibians, which are in global decline, are the most road-killed fauna group, due to their activity patterns and preferred habitats. Many different methodologies, such as logistic regression, have been applied to model the relationship between the environment and road-kill events. Here, we compared the performance of five regression techniques in relating amphibian road-kill frequency to environmental variables. For this, we surveyed three country roads in northern Portugal in search of road-killed amphibians. To explain the presence of road-kills, we selected a set of environmental variables important for the presence of amphibians and the occurrence of road-kills. We compared the performance of five modeling techniques: (i) generalized linear models, (ii) generalized additive models, (iii) random forest, (iv) boosted regression trees, and (v) geographically weighted regression. The boosted regression trees and geographically weighted regression techniques performed best, with a percentage of deviance explained between 61.8% and 76.6% and between 55.3% and 66.7%, respectively. Moreover, geographically weighted regression showed a great advantage over the other techniques, as it allows mapping of local parameter coefficients as well as local model performance (pseudo-R²). The results suggest that geographically weighted regression is a useful tool for road-kill modeling, as well as for visualizing and mapping the spatial variability of the models.
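
The "percentage of deviance explained" reported above compares a model's residual deviance with that of an intercept-only model. The sketch below assumes a Poisson error distribution for road-kill counts, which the abstract does not state; the function names and the example counts are hypothetical.

```python
import math


def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)), with 0 * ln(0) taken as 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total


def percent_deviance_explained(observed, fitted):
    """100 * (1 - residual deviance / null deviance); the null model predicts the mean count."""
    mean_y = sum(observed) / len(observed)
    null_deviance = poisson_deviance(observed, [mean_y] * len(observed))
    residual_deviance = poisson_deviance(observed, fitted)
    return 100.0 * (1.0 - residual_deviance / null_deviance)


# Hypothetical road-kill counts per road segment and fitted values from some model.
counts = [1, 2, 3, 6]
fitted_counts = [1.5, 2.0, 3.5, 5.0]
d2 = percent_deviance_explained(counts, fitted_counts)
```

A model that reproduces the counts exactly scores 100%, and one that only predicts the mean scores 0%, so the figure is read like an R² for count data.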

Intercomparisons of liquid water path based on SEVIRI images and gradient boosting regression trees with in-situ observations and satellite-derived products

10.5194/egusphere-egu2020-18806 ◽

2020 ◽

Author(s):

Miae Kim ◽

Jan Cermak ◽

Hendrik Andersen ◽

Julia Fuchs ◽

Roland Stirnberg

Keyword(s):

Machine Learning ◽

Liquid Water ◽

Climate Models ◽

Regression Trees ◽

Boosted Regression Trees ◽

Gradient Boosting ◽

Liquid Water Path ◽

Water Path ◽

First Results

This contribution presents a technique for the machine-learning-based retrieval of cloud liquid water path. Cloud effects are among the major uncertainties in climate models for estimating and predicting the Earth’s energy budget. The study of cloud processes requires information on cloud physical properties, such as the liquid water path (LWP), which is commonly retrieved from satellite sensors using look-up table approaches. However, the accuracy of LWP varies temporally and spatially, also due to assumptions inherent in any physical retrieval. The aim of this study is to improve the accuracy of LWP and analyze quantitatively the accuracy and its errors. To this end, a statistical LWP retrieval was developed using spectral information from geostationary satellite channels (Meteosat Spinning-Enhanced Visible and Infrared Imager, SEVIRI), and satellite viewing geometry. The machine-learning method chosen is gradient-boosted regression trees (GBRTs), which is an ensemble of decision trees but more effective than traditional tree-based models. This study reports on first results, as well as a comparison between the GBRT-derived LWP estimates and those from the SEVIRI-based products of the Climate Monitoring Satellite Application Facility (CM-SAF, CLAAS-A2), as well as MODIS products. We use case studies for individual in-situ measurement sites in Europe under varying meteorological conditions to determine the factors influencing LWP retrieval quality.
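
To make the idea of gradient-boosted regression trees concrete, the sketch below fits one-split trees (stumps) to the residuals of the running prediction under squared-error loss. It is a deliberately minimal toy on a single hypothetical predictor and not the retrieval used in the study, whose inputs, tree depth and hyperparameters are not given in the abstract.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump on one feature; needs at least two distinct x values."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # excluding the maximum keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_gbrt(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, learning_rate


def predict_gbrt(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mean_squared_error(model, x, y):
    return sum((yi - predict_gbrt(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical one-dimensional training data with a non-linear response.
x_train = list(range(10))
y_train = [xi ** 2 for xi in x_train]
boosted = fit_gbrt(x_train, y_train, n_trees=50, learning_rate=0.1)
baseline = fit_gbrt(x_train, y_train, n_trees=0)
```

The small learning rate means each tree corrects only part of the remaining error, which is why boosting typically needs many shallow trees.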

Machine learning for predictive auto-tuning with boosted regression trees

2012 Innovative Parallel Computing (InPar) ◽

10.1109/inpar.2012.6339587 ◽

2012 ◽

Cited By ~ 32

Author(s):

James Bergstra ◽

Nicolas Pinto ◽

David Cox

Keyword(s):

Machine Learning ◽

Regression Trees ◽

Boosted Regression Trees ◽

Auto Tuning

Performance evaluation of cetacean species distribution models developed using generalized additive models and boosted regression trees

Ecology and Evolution ◽

10.1002/ece3.6316 ◽

2020 ◽

Vol 10 (12) ◽

pp. 5759-5784

Author(s):

Elizabeth A. Becker ◽

James V. Carretta ◽

Karin A. Forney ◽

Jay Barlow ◽

Stephanie Brodie ◽

...

Keyword(s):

Performance Evaluation ◽

Species Distribution ◽

Species Distribution Models ◽

Generalized Additive Models ◽

Regression Trees ◽

Additive Models ◽

Boosted Regression Trees ◽

Distribution Models ◽

Cetacean Species

Spatial prediction of demersal fish diversity in the Baltic Sea: comparison of machine learning and regression-based techniques

ICES Journal of Marine Science ◽

10.1093/icesjms/fsw136 ◽

2016 ◽

Vol 74 (1) ◽

pp. 102-111 ◽

Cited By ~ 12

Author(s):

Szymon Smoliński ◽

Krzysztof Radtke

Keyword(s):

Machine Learning ◽

Random Forest ◽

Spatial Prediction ◽

Demersal Fish ◽

Additive Models ◽

Multivariate Adaptive Regression Splines ◽

Boosted Regression Trees ◽

Support Vector ◽

Fish Diversity ◽

The Baltic

Marine spatial planning (MSP) is considered a valuable tool in the ecosystem-based management of marine areas. Predictive modelling may be applied in the MSP framework to obtain spatially explicit information about biodiversity patterns. The growing number of statistical approaches used for this purpose implies an urgent need for comparisons between different predictive techniques. In this study, we evaluated the performance of selected machine learning and regression-based methods that were applied for modelling fish community indices. We hypothesized that habitat features can influence fish assemblages and investigated the effect of environmental gradients on demersal fish diversity (species richness and Shannon–Weaver Index). We used fish data from the Baltic International Trawl Surveys (2001–2014) and maps of six potential predictors: bottom salinity, depth, seabed slope, growth season bottom temperature, seabed sediments and annual mean bottom current velocity. We compared the performance of six alternative modelling approaches: generalized linear models, generalized additive models, multivariate adaptive regression splines, support vector machines, boosted regression trees and random forests. We applied repeated 10-fold cross-validation, using accuracy as the measure of model quality. Finally, we selected random forest as the best performing algorithm and implemented it for the spatial prediction of fish diversity from the Baltic Proper to the Kattegat. To obtain information on the data reliability and confidence of the developed models, which are essential for MSP, we estimated the uncertainty of predictions as the standard deviation of the predictions obtained from all the trees in the random forest ensemble. We showed how state-of-the-art predictive techniques, based on easily available data and simple Geographic Information System tools, can be used to obtain reliable spatial information about fish diversity. Our comparative work highlighted the potential of machine learning methods to reduce prediction error in the modelling of demersal fish diversity in the framework of MSP.
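
The uncertainty estimate described above, the spread of predictions across the trees of a random forest, reduces to a per-site mean and standard deviation once the individual tree predictions are available. The sketch below assumes those predictions have already been extracted into a nested list; the function name, the example values, and the choice of the population standard deviation are assumptions.

```python
import statistics


def ensemble_mean_and_sd(tree_predictions):
    """tree_predictions[t][s] is the prediction of tree t at site s.

    Returns the per-site ensemble mean and the per-site standard deviation
    across trees, the latter serving as a simple uncertainty measure.
    """
    n_sites = len(tree_predictions[0])
    means, sds = [], []
    for site in range(n_sites):
        values = [tree[site] for tree in tree_predictions]
        means.append(statistics.mean(values))
        sds.append(statistics.pstdev(values))
    return means, sds


# Hypothetical species-richness predictions from three trees at two sites.
per_tree = [
    [1.0, 2.0],
    [3.0, 2.0],
    [5.0, 2.0],
]
site_means, site_sds = ensemble_mean_and_sd(per_tree)
```

Sites where the trees disagree get a large standard deviation and can be flagged as less reliable on the resulting map.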

Aviation Turbulence Forecasting at Upper Levels with Machine Learning Techniques Based on Regression Trees

Journal of Applied Meteorology and Climatology ◽

10.1175/jamc-d-20-0116.1 ◽

2020 ◽

Vol 59 (11) ◽

pp. 1883-1899 ◽

Cited By ~ 1

Author(s):

Domingo Muñoz-Esparza ◽

Robert D. Sharman ◽

Wiebke Deierling

Keyword(s):

Machine Learning ◽

Weather Prediction ◽

Regression Trees ◽

Probability Of Detection ◽

Boosted Regression Trees ◽

Machine Learning Techniques ◽

Model Complexity ◽

False Alarms ◽

Forecast Errors ◽

Simple Regression Model

Abstract: We explore the use of machine learning (ML) techniques, namely, regression trees (RT), for the purpose of aviation turbulence forecasting at upper levels [20–45 kft (~6–14 km) in altitude]. In particular, we develop a series of RT-based algorithms that include random forests (RF) and gradient-boosted regression trees (GBRT) methods. Numerical weather prediction (NWP) model prognostic variables and derived turbulence diagnostics based on 6-h forecasts from the 3-km High-Resolution Rapid Refresh model are used as features to train these data-driven models. Training and evaluation are based on turbulence estimates of eddy dissipation rate (EDR) obtained from automated in situ aircraft reports. Our baseline RF model, consisting of 100 trees with 30 layers of maximum depth, significantly reduces forecast errors for EDR < 0.1 m^(2/3) s^(−1) (which corresponds roughly to null and light turbulence) when compared with a simple regression model, increasing the probability of detection and in turn reducing the number of false alarms. Model complexity reduction via GBRT and feature-relevance analyses is performed, indicating that considerable execution speedups can be achieved while maintaining the model’s predictive skill. Overall, the ML models exhibit enhanced performance in discriminating the EDR forecast among the light, moderate, and severe turbulence categories. In addition, these artificial intelligence techniques significantly simplify the generation of new NWP and grid-spacing specific turbulence forecast products.
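
The probability of detection and the false alarms mentioned above are usually derived from a two-by-two contingency table of forecast versus observed events. The sketch below uses the standard definitions with a hypothetical EDR event threshold; the threshold, the function name, and the sample values are assumptions and are not taken from the paper.

```python
def detection_scores(forecast, observed, threshold):
    """Return (probability of detection, false alarm ratio) for events >= threshold."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        forecast_event = f >= threshold
        observed_event = o >= threshold
        if forecast_event and observed_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif forecast_event:
            false_alarms += 1
    if hits + misses == 0 or hits + false_alarms == 0:
        raise ValueError("scores are undefined without observed and forecast events")
    pod = hits / (hits + misses)                  # share of observed events that were forecast
    far = false_alarms / (hits + false_alarms)    # share of forecast events that did not occur
    return pod, far


# Hypothetical EDR forecasts and in situ observations.
edr_forecast = [0.05, 0.20, 0.30, 0.02, 0.15]
edr_observed = [0.04, 0.25, 0.05, 0.12, 0.30]
pod, far = detection_scores(edr_forecast, edr_observed, threshold=0.1)
```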

Regional Mapping of Groundwater Potential in Ar Rub Al Khali, Arabian Peninsula Using the Classification and Regression Trees Model

Remote Sensing ◽

10.3390/rs13122300 ◽

2021 ◽

Vol 13 (12) ◽

pp. 2300

Author(s):

Samy Elmahdy ◽

Tarig Ali ◽

Mohamed Mohamed

Keyword(s):

Machine Learning ◽

Regional Scale ◽

Regression Trees ◽

Classification And Regression Trees ◽

Groundwater Potential ◽

Machine Learning Algorithms ◽

Conditioning Factors ◽

Potential Mapping ◽

Classification And Regression ◽

Groundwater Potential Mapping

Mapping groundwater potential beneath sand sheets in remote arid and semi-arid regions at a regional scale is a challenge and requires an accurate classifier. The Classification and Regression Trees (CART) model is a robust machine learning classifier used in groundwater potential mapping at a regional scale. Ten essential groundwater conditioning factors (GWCFs) were constructed using remote sensing data. The spatial relationship between these conditioning factors and the observed groundwater well locations was identified and optimized using the chi-square method. A total of 185 groundwater well locations were randomly divided into 129 (70%) for training the model and 56 (30%) for validation. The model was applied for groundwater potential mapping using the optimal parameter values: 186 additive trees, a learning rate of 0.1, and a maximum tree size of five. The validation result demonstrated that the area under the curve (AUC) of the CART was 0.920, which represents a predictive accuracy of 92%. The resulting map demonstrated that the Mondafan, Khujaymah and Wajid Mutaridah depressions and the southern gulf salt basin (SGSB), near the borders of Saudi Arabia, Oman and the United Arab Emirates (UAE), hold fresh fossil groundwater, as indicated by the observed lakes and recovered paleolakes. The proposed model and the new maps enhance regional-scale groundwater potential mapping with machine learning algorithms, which are rarely used in the literature, and the approach can be applied to the Sahara and the Kalahari Desert.
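
The random 70/30 division of the 185 well locations described above can be reproduced in outline as follows. The seed, the function name, and the use of integer well identifiers are assumptions made for illustration; the abstract does not describe how the split was drawn.

```python
import random


def split_wells(well_ids, train_percent=70, seed=0):
    """Shuffle the wells and cut them into training and validation subsets."""
    shuffled = list(well_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100  # integer arithmetic: 185 * 70 // 100 = 129
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers standing in for the 185 well locations.
training_wells, validation_wells = split_wells(range(185))
```

Rounding the training share down gives 129 training and 56 validation wells, matching the counts reported in the abstract.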
