Why to choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence

Author(s):  
Chunrong Mi ◽  
Falk Huettmann ◽  
Yumin Guo ◽  
Xuesong Han ◽  
Lijia Wen

Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. To explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging the predicted probabilities of these four models. Commonly used model performance metrics (area under the ROC curve (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found that Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and is known to perform extremely well in ecological predictions. However, while its use is increasing, its potential is still widely underused in conservation, (spatial) ecological applications and inference. Our results show that it informs ecological and biogeographical theories and is suitable for conservation applications, specifically when the study area is undersampled. This method helps to save model-selection time and effort, and it allows robust and rapid assessments and decisions for efficient conservation.

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e2849 ◽  
Author(s):  
Chunrong Mi ◽  
Falk Huettmann ◽  
Yumin Guo ◽  
Xuesong Han ◽  
Lijia Wen

Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. To explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging the predicted probabilities of these four models. Commonly used model performance metrics (area under the ROC curve (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found that Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and is known to perform extremely well in ecological predictions. However, while its use is increasing, its potential is still widely underused in conservation, (spatial) ecological applications and inference. Our results show that it informs ecological and biogeographical theories and is suitable for conservation applications, specifically when the study area is undersampled. This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation.
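The ensemble averaging and the two accuracy metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 0.5 threshold, and the use of 0/1 labels for absence (or pseudo-absence) and presence are assumptions made here for the example.

```python
from typing import List, Sequence


def ensemble_mean(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-site probabilities across models (one inner sequence per model)."""
    n_models = len(model_probs)
    return [sum(site) / n_models for site in zip(*model_probs)]


def tss(labels: Sequence[int], probs: Sequence[float], threshold: float = 0.5) -> float:
    """True skill statistic: sensitivity + specificity - 1 at the given threshold."""
    tp = fn = tn = fp = 0
    for label, prob in zip(labels, probs):
        predicted_presence = prob >= threshold
        if label == 1 and predicted_presence:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_presence:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """AUC as the chance that a random presence outscores a random absence (ties count 0.5)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical probabilities from two models at four sites.
    combined = ensemble_mean([[0.9, 0.3, 0.1, 0.7], [0.7, 0.5, 0.3, 0.5]])
    labels = [1, 1, 0, 0]
    print(combined, tss(labels, combined), auc(labels, combined))
```

TSS depends on the chosen threshold while AUC does not, which is one reason the two metrics are often reported together.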


2020 ◽  
Vol 10 (2) ◽  
pp. 635 ◽  
Author(s):  
Yingli LV ◽  
Qui-Thao Le ◽  
Hoang-Bac Bui ◽  
Xuan-Nam Bui ◽  
Hoang Nguyen ◽  
...  

In this study, the ilmenite content in beach placer sand was estimated using seven soft computing techniques, namely random forest (RF), artificial neural network (ANN), k-nearest neighbors (kNN), cubist, support vector machine (SVM), stochastic gradient boosting (SGB), and classification and regression tree (CART). A total of 405 beach placer borehole samples were collected from the Southern Suoi Nhum deposit, Binh Thuan province, Vietnam, to test the feasibility of these soft computing techniques in estimating ilmenite content. Heavy mineral analysis indicated that the valuable minerals in the placer sand are zircon, ilmenite, leucoxene, rutile, anatase, and monazite. Five of these minerals, namely rutile, anatase, leucoxene, zircon, and monazite, were used as the input variables to estimate ilmenite content with the above-mentioned soft computing models. Of the whole dataset, 325 samples were used to build the models and the remaining 80 samples were used to verify them. Root-mean-squared error (RMSE), coefficient of determination (R²), a simple ranking method, and residuals analysis were used as the statistical criteria for assessing model performance. The numerical experiments revealed that soft computing techniques are capable of estimating the content of ilmenite with high accuracy. The residuals analysis also indicated that the SGB model was the most suitable for determining the ilmenite content in the context of this research.
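The two numerical criteria mentioned above, RMSE and R², can be computed as in the sketch below. It assumes the common definition R² = 1 - SS_res / SS_tot; the abstract does not say which definition the authors used (some studies report the squared Pearson correlation instead), and the function names and sample values are illustrative only.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-squared error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical held-out measurements and model estimates.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.0, 2.0, 3.0, 6.0]
    print(rmse(observed, predicted), r_squared(observed, predicted))
```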


2021 ◽  
Vol 11 (5) ◽  
pp. 2235
Author(s):  
Haewon Byeon

It is essential to understand voice characteristics in the normal aging process to accurately distinguish presbyphonia from neurological voice disorders. This study aimed to identify the best ensemble-based machine learning classifier for distinguishing hypokinetic dysarthria from presbyphonia using classification and regression tree (CART), random forest, gradient boosting algorithm (GBM), and XGBoost, and compared the prediction performance of the models. The subjects of this study were 76 elderly patients diagnosed with hypokinetic dysarthria and 174 patients with presbyphonia. Prediction models were developed with CART, GBM, XGBoost, and random forest, and the accuracy, sensitivity, and specificity of the developed models were compared to assess their prediction performance. The results showed that random forest had the best prediction performance on the test dataset (accuracy = 0.83, sensitivity = 0.90, specificity = 0.80, and area under the curve (AUC) = 0.85). The main predictors for detecting hypokinetic dysarthria were Cepstral peak prominence (CPP), jitter, shimmer, L/H ratio, L/H ratio_SD, CPP max (dB), CPP min (dB), and CPPF0, in descending order of importance. Among them, CPP was the most important predictor for identifying hypokinetic dysarthria.


2019 ◽  
Vol 11 (1) ◽  
pp. 100 ◽  
Author(s):  
Dyah R. Panuju ◽  
David J. Paull ◽  
Bambang H. Trisasongko

This research aims to detect subtle changes by combining binary change analysis, based on the Iteratively Reweighted Multivariate Alteration Detection (IRMAD) applied to dual polarimetric Advanced Land Observing Satellite (ALOS) backscatter, with post-classification change analysis of augmented data. The accuracy of change detection was iteratively evaluated using thresholds composed of the mean and a range of constant multiples of the standard deviation. Four datasets were examined for post-classification change analysis: the dual polarimetric backscatter as the benchmark, and the backscatter augmented with indices, entropy alpha decomposition and selected texture features. Variable importance was then evaluated to build a best subset model employing seven classifiers, including Bagged Classification and Regression Tree (CAB), Extreme Learning Machine Neural Network (ENN), Bagged Multivariate Adaptive Regression Spline (MAB), Regularised Random Forest (RFG), Original Random Forest (RFO), Support Vector Machine (SVM), and Extreme Gradient Boosting Tree (XGB). The best accuracy was 98.8%, which resulted from thresholding MAD variate-2 with a constant of 1.7. The highest improvement in classification accuracy was obtained by adding the grey level co-occurrence matrix (GLCM) texture. The identification of variable importance (VI) confirmed that selected GLCM textures (mean and variance of HH or HV) were equally superior, while the contributions of indices and decomposition were negligible. The best model produced similar classification accuracy of about 90% for both 2007 and 2010. Tree-based algorithms including RFO, RFG and XGB were more robust than SVM and ENN. Subtle changes indicated by binary change analysis were somewhat hidden in post-classification analysis. Reclassification by combining all important variables and adding five classes to include subtle changes assisted by Google Earth yielded an accuracy of 82%.
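One plausible reading of the thresholding step is sketched below: a pixel is flagged as changed when its value deviates from the mean by more than a constant multiple of the standard deviation. The abstract does not spell out the exact rule, so the two-sided test, the use of the population standard deviation, and the function name are assumptions; only the constant 1.7 comes from the text.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def change_mask(values: Sequence[float], k: float = 1.7) -> List[bool]:
    """Flag values lying more than k standard deviations from the mean.

    Assumed rule: |value - mean| > k * std, with std taken as the
    population standard deviation of the supplied values.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [abs(v - mu) > k * sigma for v in values]


if __name__ == "__main__":
    # Hypothetical MAD variate values for ten pixels; the last one stands out.
    variate = [0.0] * 9 + [10.0]
    print(change_mask(variate))
```

Evaluating accuracy over a range of k values, as the abstract describes, would amount to calling such a function repeatedly and comparing each mask against reference data.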


Water ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 3490
Author(s):  
Noor Hafsa ◽  
Sayeed Rushd ◽  
Mohammed Al-Yaari ◽  
Muhammad Rahman

Applications of machine learning algorithms (MLAs) to modeling the adsorption efficiencies of different heavy metals have been limited by the adsorbate–adsorbent pair and the selection of specific MLAs. In the current study, adsorption efficiencies of fourteen heavy metal–adsorbent (HM-AD) pairs were modeled with a variety of ML models such as support vector regression with polynomial and radial basis function kernels, random forest (RF), stochastic gradient boosting, and Bayesian additive regression tree (BART). The wet experiment-based actual measurements were supplemented with synthetic data samples. The first batch of dry experiments was performed to model the removal efficiency of an HM with a specific AD. The ML modeling was then implemented on the whole dataset to develop a generalized model. A ten-fold cross-validation method was used for the model selection, while the comparative performance of the MLAs was evaluated with statistical metrics comprising Spearman’s rank correlation coefficient, coefficient of determination (R²), mean absolute error, and root-mean-squared error. The regression tree methods, BART and RF, demonstrated the most robust and optimum performance with 0.96 ≤ R² ≤ 0.99. The current study provides a generalized methodology to implement ML in modeling the efficiency of not only a specific adsorption process but also a group of comparable processes involving multiple HM-AD pairs.
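The ten-fold cross-validation used for model selection can be illustrated with a small index splitter. This is a generic sketch rather than the study's procedure: the shuffling, the fixed seed, and the function name are assumptions, and any model fitting would happen inside the loop over the yielded splits.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        # A model would be fitted on `train` and scored on `test` here.
        print(len(train), len(test))
```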


2020 ◽  
Vol 13 (1) ◽  
pp. 10
Author(s):  
Andrea Sulova ◽  
Jamal Jokar Arsanjani

Recent studies have suggested that, due to climate change, the number of wildfires across the globe has been increasing and will continue to grow. The recent massive wildfires that hit Australia during the 2019–2020 summer season raised questions about the extent to which the risk of wildfires can be linked to various climate, environmental, topographical, and social factors, and about how to predict fire occurrences in order to take preventive measures. Hence, the main objective of this study was to develop an automated, cloud-based workflow for generating a training dataset of fire events at a continental level, using freely available remote sensing data at a reasonable computational expense, for use in machine learning models. As a result, a data-driven model was set up on the Google Earth Engine platform, which is publicly accessible and open for further adjustments. The training dataset was applied to different machine learning algorithms, i.e., Random Forest, Naïve Bayes, and Classification and Regression Tree. The findings show that Random Forest outperformed the other algorithms, and hence it was used further to explore the driving factors using variable importance analysis. The study indicates the probability of fire occurrences across Australia and identifies the potential driving factors of Australian wildfires for the 2019–2020 summer season. The methodical approach, the results achieved, and the conclusions drawn can be of great importance to policymakers, environmentalists, and climate change researchers, among others.

