Comparison of Three Supervised Learning Methods for Digital Soil Mapping: Application to a Complex Terrain in the Ecuadorian Andes

A digital soil mapping approach is applied to a complex, mountainous terrain in the Ecuadorian Andes. Relief features are derived from a digital elevation model and used as predictors for the topsoil texture classes sand, silt, and clay. The performance of three statistical learning methods is compared: linear regression, random forest, and stochastic gradient boosting of regression trees. In linear regression, a stepwise backward variable selection procedure is applied and overfitting is controlled by minimizing Mallows’ Cp. For random forest and boosting, the effect of predictor selection and tuning procedures is assessed. One hundred repetitions of a 5-fold cross-validation of the selected modelling procedures are employed for validation, uncertainty assessment, and method comparison. Absolute assessment of model performance is achieved by comparing the prediction error of the selected method with that of the mean. Boosting performs best, providing predictions that are reliably better than the mean. The median reduction of the root mean square error is around 5%. Elevation is the most important predictor. All models clearly distinguish ridges and slopes. The predicted texture patterns are interpreted as the result of catena sequences (eluviation of fine particles on slope shoulders) and landslides (mixing up mineral soil horizons on slopes).
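The validation scheme described in this abstract (repeated k-fold cross-validation, with the model's error compared against that of a mean predictor) can be sketched as follows. This is a minimal illustration, not the authors' code: a one-predictor least-squares line stands in for the boosted regression trees, the data are synthetic, and every function and variable name is hypothetical.

```python
import random
import statistics

def rmse(y_true, y_pred):
    """Root mean square error of paired observations and predictions."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def repeated_cv_rmse_reduction(xs, ys, k=5, repeats=100, seed=0):
    """Relative RMSE reduction of the model versus the training-mean baseline,
    one value per repetition of a k-fold cross-validation."""
    rng = random.Random(seed)
    n = len(xs)
    reductions = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                          # new random partition each repetition
        folds = [idx[i::k] for i in range(k)]
        pred_model = [0.0] * n
        pred_mean = [0.0] * n
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
            mean_y = statistics.fmean(ys[i] for i in train)
            for i in fold:
                pred_model[i] = a + b * xs[i]     # out-of-fold model prediction
                pred_mean[i] = mean_y             # out-of-fold mean baseline
        reductions.append(1.0 - rmse(ys, pred_model) / rmse(ys, pred_mean))
    return reductions

# Synthetic stand-in data (assumed values, not from the study).
data_rng = random.Random(42)
elevation = [data_rng.uniform(1900, 3100) for _ in range(60)]
sand = [60 - 0.01 * e + data_rng.gauss(0, 5) for e in elevation]
reductions = repeated_cv_rmse_reduction(elevation, sand)
print(f"median RMSE reduction vs. mean: {statistics.median(reductions):.1%}")
```

Collecting one reduction value per repetition gives a distribution, so the median and spread can be reported rather than a single, partition-dependent number.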


Learning from Imbalanced Educational Data Using Ensemble Machine Learning Algorithms

Webology ◽

10.14704/web/v18si01/web18053 ◽

2021 ◽

Vol 18 (Special Issue 01) ◽

pp. 183-195

Author(s):

Thingbaijam Lenin ◽

N. Chandrasekaran

Keyword(s):

Machine Learning ◽

Random Forest ◽

Missing Values ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Adaptive Boosting ◽

Stochastic Gradient Boosting ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Student’s Performance

Student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of the course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as the time of admission. The task, however, is challenging, as the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting the performance of students at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are found to be highly imbalanced and also contain missing values. We employ the k-nearest neighbor (knn) data imputation technique to handle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best result is provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
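The gap between accuracy and the imbalance-aware metrics can be made concrete with a small calculation. The sketch below uses the standard confusion-matrix definitions; the counts are hypothetical, chosen only because they happen to reproduce the three reported figures, and are not taken from the paper.

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Accuracy, balanced accuracy and F-score from confusion-matrix counts.

    tp/fn: minority-class cases predicted correctly / missed
    fp/tn: majority-class cases flagged wrongly / predicted correctly
    """
    recall = tp / (tp + fn)                       # sensitivity on the minority class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (recall + specificity) / 2
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f_score": f_score,
    }

# Hypothetical counts (an assumption for illustration): 6 minority-class
# cases out of 98, half of them detected, no false alarms.
m = imbalance_metrics(tp=3, fn=3, fp=0, tn=92)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
# accuracy: 96.94%, balanced_accuracy: 75.00%, f_score: 66.67%
```

With these counts, half of the minority class is missed, yet accuracy stays near 97% because the majority class dominates the total. This is why accuracy alone says little on imbalanced data.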


Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap). We also performed extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.
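The performance measures named in this abstract (RMSE, R2, and the correlation between observed and predicted values) can be computed as in the sketch below. The abstract does not say which definition of R2 was used; the coefficient of determination (1 minus the ratio of residual to total sum of squares) is assumed here, and the toy numbers are invented for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    mo = sum(observed) / len(observed)
    mp = sum(predicted) / len(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Invented toy abundances, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(f"RMSE = {rmse(observed, predicted):.3f}")       # RMSE = 0.500
print(f"R2   = {r_squared(observed, predicted):.3f}")  # R2   = 0.950
print(f"r    = {pearson_r(observed, predicted):.3f}")
```

The squared Pearson correlation and the coefficient of determination coincide only in special cases, so it is worth checking which one a study reports before comparing values across papers.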


Big-data and artificial-intelligence-assisted vault prediction and EVO-ICL size selection for myopia correction

British Journal of Ophthalmology ◽

10.1136/bjophthalmol-2021-319618 ◽

2021 ◽

pp. bjophthalmol-2021-319618

Author(s):

Yang Shen ◽

Lin Wang ◽

Weijun Jian ◽

Jianmin Shang ◽

Xin Wang ◽

...

Keyword(s):

Artificial Intelligence ◽

Big Data ◽

Random Forest ◽

Anterior Segment ◽

Big Data Analytics ◽

Area Under The Curve ◽

Gradient Boosting ◽

Implantable Collamer Lens ◽

Surgical Strategies ◽

The Mean

Aims: To predict the vault and the EVO-implantable collamer lens (ICL) size by artificial intelligence (AI) and big data analytics. Methods: Six thousand two hundred and ninety-seven eyes implanted with an ICL from 3536 patients were included. The vault values were measured by the anterior segment analyzer (Pentacam HR). Permutation importance and impurity-based feature importance are used to assess the importance of the input parameters for the vault. Regression models and classification models are applied to predict the vault. The ICL size is set as the target of the prediction, and the vault and the other input features are set as the new inputs for the ICL size prediction. Data were collected from 2015 to 2020. Random Forest, Gradient Boosting and XGBoost demonstrated satisfactory accuracy and mean area under the curve (AUC) scores in vault prediction and ICL sizing. Results: In the prediction of the vault, the Random Forest has the best results in the regression model (R2=0.315), followed by Gradient Boosting (R2=0.291) and XGBoost (R2=0.285). The maximum classification accuracy is 0.828 in Random Forest, and the mean AUC is 0.765. The Random Forest predicts the ICL size with an accuracy of 82.2%; Gradient Boosting and XGBoost are comparable, with 81.5% and 81.8% accuracy, respectively. Conclusions: Random Forest, Gradient Boosting and XGBoost models are applicable for vault prediction and ICL sizing. AI may assist ophthalmologists in improving ICL surgery safety, designing surgical strategies, and predicting clinical outcomes.
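Permutation importance, one of the two importance measures mentioned in this abstract, can be sketched in a few lines: shuffle one feature column at a time and record how much the prediction error grows. This is a generic illustration with a toy model and made-up features, not the study's pipeline. scikit-learn provides a ready-made version as `sklearn.inspection.permutation_importance`.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn.

    predict: function mapping one feature row (a list) to a prediction
    rows:    list of feature rows, all of the same length
    """
    rng = random.Random(seed)
    baseline = mse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)                   # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            permuted = mse(targets, [predict(r) for r in shuffled])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy setup (hypothetical): the target depends on feature 0 only,
# and the "fitted model" is a fixed function that ignores feature 1.
data_rng = random.Random(1)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(50)]
targets = [3.0 * r[0] for r in rows]
model = lambda r: 3.0 * r[0]

for j, imp in enumerate(permutation_importance(model, rows, targets)):
    print(f"feature {j}: mean MSE increase = {imp:.3f}")
```

Unlike impurity-based importance, which is read off the fitted trees, this measure only needs a prediction function, so it applies equally to any of the model families compared in the study.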


Predicting Fine Particulate Matter (PM2.5) in the Greater London Area: An Ensemble Approach using Machine Learning Methods

Remote Sensing ◽

10.3390/rs12060914 ◽

2020 ◽

Vol 12 (6) ◽

pp. 914 ◽

Cited By ~ 4

Author(s):

Mahdieh Danesh Yazdi ◽

Zheng Kuang ◽

Konstantina Dimakopoulou ◽

Benjamin Barratt ◽

Esra Suel ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Nearest Neighbor ◽

Meteorological Data ◽

Fine Particulate Matter ◽

Gradient Boosting ◽

K Nearest Neighbor ◽

Learning Methods ◽

Machine Learning Methods ◽

Technological Advances

Estimating air pollution exposure has long been a challenge for environmental health researchers. Technological advances and novel machine learning methods have allowed us to increase the geographic range and accuracy of exposure models, making them a valuable tool in conducting health studies and identifying hotspots of pollution. Here, we have created a prediction model for daily PM2.5 levels in the Greater London area from 1st January 2005 to 31st December 2013 using an ensemble machine learning approach incorporating satellite aerosol optical depth (AOD), land use, and meteorological data. The predictions were made on a 1 km × 1 km scale over 3960 grid cells. The ensemble included predictions from three different machine learners: a random forest (RF), a gradient boosting machine (GBM), and a k-nearest neighbor (KNN) approach. Our ensemble model performed very well, with a ten-fold cross-validated R2 of 0.828. Of the three machine learners, the random forest outperformed the GBM and KNN. Our model was particularly adept at predicting day-to-day changes in PM2.5 levels, with an out-of-sample temporal R2 of 0.882. However, its ability to predict spatial variability was weaker, with an R2 of 0.396. We believe this to be due to the smaller spatial variation in pollutant levels in this area.


Information Processing and Overload in Group Conversation: A Graph-Based Prediction Model

Multimodal Technologies and Interaction ◽

10.3390/mti3030046 ◽

2019 ◽

Vol 3 (3) ◽

pp. 46 ◽

Cited By ~ 1

Author(s):

Gabriel Murray

Keyword(s):

Linear Regression ◽

Language Processing ◽

Random Forests ◽

Information Overload ◽

Mean Squared Error ◽

Group Interaction ◽

Training Data ◽

Gradient Boosting ◽

Linguistic Features ◽

The Mean

Based on analyzing verbal and nonverbal features of small group conversations in a task-based scenario, this work focuses on automatic detection of group member perceptions about how well they are making use of available information, and whether they are experiencing information overload. Both the verbal and nonverbal features are derived from graph-based social network representations of the group interaction. For the task of predicting the information use ratings, a predictive model using random forests with verbal and nonverbal features significantly outperforms baselines in which the mean or median values of the training data are predicted, and also significantly outperforms a linear regression baseline. For the task of predicting information overload ratings, the multimodal random forests model again outperforms all other models, including significant improvement over linear regression and gradient boosting models. However, on that task the best model is not significantly better than the mean and median baselines. For both tasks, we analyze performance using the full multimodal feature set versus using only linguistic features or only turn-taking features. While utilizing the full feature set yields the best performance in terms of mean squared error (MSE), there are no statistically significant differences, and using only linguistic features gives comparable performance. We provide a detailed analysis of the individual features that are most useful for each task. Beyond the immediate prediction tasks, our more general goal is to represent conversational interaction in a way that yields a small number of features capturing the group interaction in an easily interpretable manner. The proposed approach is relevant to many other group prediction tasks as well, and is distinct from both classical natural language processing (NLP) and more current deep learning/artificial neural network approaches.


Why choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence

PeerJ ◽

10.7717/peerj.2849 ◽

2017 ◽

Vol 5 ◽

pp. e2849 ◽

Cited By ~ 44

Author(s):

Chunrong Mi ◽

Falk Huettmann ◽

Yumin Guo ◽

Xuesong Han ◽

Lijia Wen

Keyword(s):

Random Forest ◽

Species Distribution ◽

Performance Metrics ◽

Regression Tree ◽

Machine Learning Algorithms ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Supporting Evidence ◽

Boosted Regression Tree ◽

Stochastic Gradient Boosting

Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution and, more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n = 33), White-naped Crane (Grus vipio, n = 40), and Black-necked Crane (Grus nigricollis, n = 75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). In addition, we developed an ensemble forecast by averaging the predicted probabilities of the above four models. Commonly used model performance metrics (area under the ROC curve (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found that Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years and has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled. This method helps to save model-selection time and effort, and allows robust and rapid assessments and decisions for efficient conservation.
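The two ingredients named in this abstract, an ensemble forecast built by averaging the models' predicted probabilities and the true skill statistic (TSS = sensitivity + specificity − 1), can be sketched as below. The probabilities, labels, and the 0.5 presence threshold are all assumptions made for illustration; the abstract does not state how predictions were thresholded.

```python
def ensemble_mean(model_probs):
    """Average predicted probabilities site by site across models.

    model_probs: list of per-model probability lists, all the same length
    """
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]

def true_skill_statistic(labels, probs, threshold=0.5):
    """TSS = sensitivity + specificity - 1 for binary presence (1) / absence (0)."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

# Invented predictions from four models at four test sites.
labels = [1, 1, 0, 0]
model_probs = [
    [0.9, 0.6, 0.4, 0.1],   # stand-in for model 1
    [0.8, 0.4, 0.3, 0.2],   # stand-in for model 2
    [0.7, 0.7, 0.6, 0.1],   # stand-in for model 3
    [0.8, 0.5, 0.3, 0.0],   # stand-in for model 4
]
averaged = ensemble_mean(model_probs)
for i, probs in enumerate(model_probs, start=1):
    print(f"model {i} TSS: {true_skill_statistic(labels, probs):.2f}")
print(f"ensemble TSS: {true_skill_statistic(labels, averaged):.2f}")
```

TSS ranges from −1 to 1, with 0 corresponding to no skill. In this toy case two of the individual models misclassify a site at the 0.5 threshold, while the averaged probabilities classify all four sites correctly.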


Why to choose Random Forest to predict rare species distribution with few samples in large undersampled areas? Three Asian crane species models provide supporting evidence

10.7287/peerj.preprints.2517 ◽

2016 ◽

Author(s):

Chunrong Mi ◽

Falk Huettmann ◽

Yumin Guo ◽

Xuesong Han ◽

Lijia Wen

Keyword(s):

Random Forest ◽

Species Distribution ◽

Performance Metrics ◽

Regression Tree ◽

Machine Learning Algorithms ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Supporting Evidence ◽

Boosted Regression Tree ◽

Stochastic Gradient Boosting

Species distribution models (SDMs) have become an essential tool in ecology, biogeography, evolution, and more recently, in conservation biology. How to generalize species distributions in large undersampled areas, especially with few samples, is a fundamental issue of SDMs. In order to explore this issue, we used the best available presence records for the Hooded Crane (Grus monacha, n=33), White-naped Crane (Grus vipio, n=40), and Black-necked Crane (Grus nigricollis, n=75) in China as three case studies, employing four powerful and commonly used machine learning algorithms to map the breeding distributions of the three species: TreeNet (Stochastic Gradient Boosting, Boosted Regression Tree Model), Random Forest, CART (Classification and Regression Tree) and Maxent (Maximum Entropy Models). Besides, we developed an ensemble forecast by averaging the predicted probabilities of the above four models. Commonly used model performance metrics (area under the ROC curve (AUC) and true skill statistic (TSS)) were employed to evaluate model accuracy. The latest satellite tracking data and compiled literature data were used as two independent testing datasets to confront model predictions. We found that Random Forest demonstrated the best performance for most assessment methods, provided a better model fit to the testing data, and achieved better species range maps for each crane species in undersampled areas. Random Forest has been generally available for more than 20 years, and by now, has been known to perform extremely well in ecological predictions. However, while increasingly on the rise, its potential is still widely underused in conservation, (spatial) ecological applications and for inference. Our results show that it informs ecological and biogeographical theories as well as being suitable for conservation applications, specifically when the study area is undersampled. This method helps to save model-selection time and effort, and it allows robust and rapid assessments and decisions for efficient conservation.


Ensemble machine learning methods for spatio-temporal data analysis of plant and ratoon sugarcane

Intelligent Data Analysis ◽

10.3233/ida-205302 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1291-1322

Author(s):

Sandeep Kumar Singla ◽

Rahul Dev Garg ◽

Om Prakash Dubey

Keyword(s):

Machine Learning ◽

Random Forest ◽

Binary Classification ◽

Temporal Variations ◽

Classification Model ◽

Gradient Boosting ◽

Remotely Sensed Data ◽

Learning Methods ◽

Machine Learning Methods ◽

Classification And Regression

Recent advances in information technology and statistical techniques have enabled sophisticated and reliable analysis based on machine learning methods. A number of machine learning data analysis tools may be exploited for classification and regression problems. These tools and techniques can be used effectively for highly data-intensive operations such as agricultural and meteorological applications, bioinformatics, and stock market analysis based on daily market prices. Machine learning ensemble methods such as Decision Tree (C5.0), Classification and Regression Trees (CART), Gradient Boosting Machine (GBM) and Random Forest (RF) have been investigated in the proposed work. The proposed work demonstrates that temporal variations in the spectral data and the computational efficiency of machine learning methods may be effectively used to discriminate between types of sugarcane. The discrimination has been treated as a binary classification problem to separate ratoon from plantation sugarcane. Variable importance selection based on Mean Decrease in Accuracy (MDA) and Mean Decrease in Gini (MDG) has been used to create the appropriate dataset for the classification. The performance of the binary classification model based on RF is the best for all possible combinations of input images. Feature selection based on the MDA and MDG measures of RF is also important for dimensionality reduction. It has been observed that the RF model performed best, with 97% accuracy, whereas the performance of the GBM method was the lowest. Binary classification based on remotely sensed data can be effectively handled using the random forest method.


Evaluation of digital soil mapping approaches with large sets of environmental covariates

10.5194/soil-2017-14 ◽

2017 ◽

Cited By ~ 4

Author(s):

Madlene Nussbaum ◽

Kay Spiess ◽

Andri Baltensweiler ◽

Urs Grob ◽

Armin Keller ◽

...

Keyword(s):

Ad Hoc ◽

Soil Depth ◽

Soil Mapping ◽

Environmental Data ◽

Digital Soil Mapping ◽

Boosted Regression Trees ◽

Gradient Boosting ◽

Validation Data ◽

Environmental Covariates ◽

Large Sets

Abstract. Spatial assessment of soil functions requires maps of basic soil properties. Unfortunately, these are either missing for many regions or are not available at the desired spatial resolution or down to the required soil depth. Field-based generation of large soil data sets and of conventional soil maps remains costly. Meanwhile, soil legacy data and comprehensive sets of spatial environmental data are available for many regions. Digital soil mapping (DSM) approaches, which relate soil data (responses) to environmental data (covariates), face the challenge of building statistical models from large sets of covariates originating, for example, from airborne imaging spectroscopy or multi-scale terrain analysis. We evaluated six approaches for DSM in three study regions in Switzerland (Berne, Greifensee, ZH forest) by mapping effective soil depth available to plants (SD), pH, soil organic matter (SOM), effective cation exchange capacity (ECEC), clay, silt, gravel content and bulk density for four soil layers (totalling 48 responses). Models were built from 300–500 environmental covariates by selecting linear models by (1) grouped lasso and by an ad hoc stepwise procedure for (2) robust external-drift kriging (EDK). For (3) geoadditive models we selected penalized smoothing spline terms by componentwise gradient boosting (geoGAM). We further used two tree-based methods: (4) boosted regression trees (BRT) and (5) Random Forest (RF). Lastly, we computed (6) weighted model averages (MA) from predictions obtained from methods 1–5. Lasso, georob and geoGAM successfully selected strongly reduced sets of covariates (subsets of 3–6 % of all covariates). Automatically selecting a sparse trend model for EDK was, however, difficult, and the applied ad hoc procedure was computationally inefficient and over-fitted the data. Differences in predictive performance, tested on independent validation data, were mostly small and did not reveal a single best method for the 48 responses. Nevertheless, RF was on average often best among methods 1–5 (28 of 48 responses), but was outcompeted by MA for 14 of these 28 responses. RF tended to over-fit the data. Performance of BRT was slightly worse than RF. GeoGAM performed poorly on some responses and was only best for 7 of 48 responses. Predictive precision of lasso was intermediate. All models generally had small bias. Only the computationally very efficient lasso had slightly larger bias, likely because it tended to under-fit the data. In summary, although differences were small, the frequencies of best and worst performance clearly favoured RF if a single method is applied and MA if multiple prediction models can be developed.


FORECASTING PRICES IN THE RENTAL HOUSING MARKET WITH MACHINE LEARNING METHODS

Bulletin of V. N. Karazin Kharkiv National University Economic Series ◽

10.26565/2311-2379-2020-99-12 ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Regression Models ◽

Data Science ◽

Polynomial Regression ◽

Short Term ◽

Learning Methods ◽

Machine Learning Methods ◽

Pricing Factors

This study examines pricing factors in the short-term rental market. Airbnb was chosen as the object of the study; it is a platform for listing, finding, and renting accommodation around the world. At the beginning of 2021, the company offered 7 million homes in more than 220 countries. Data science methods play a significant role in the company's success. One of the company's key algorithms is its pricing algorithm. Using the "Price Recommendations" feature, a homeowner can see which dates are likely to be booked at the current price and which are not, which helps in forming a favorable offer. The system calculates the recommended cost of housing based on hundreds of parameters, some of which are easy to recognize, but there are less obvious factors that can also affect demand. The paper proposes an algorithm for identifying implicit pricing factors in the short-term rental market using machine learning methods, which includes: 1) data mining and data preparation; 2) building and analysis of linear regression models; 3) building and analysis of nonlinear regression models. The study was based on listings from the Airbnb site for Washington and New York, using scripts developed in Python. The following models are built and analyzed: simple linear regression, multiple linear regression, polynomial regression, decision trees, random forest, and boosting. The results of the study showed that the most important factors are accommodates, cleaning_fee, room_type, and bedrooms. However, based on the model evaluation criteria, the models cannot be used for implementation: the linear models are of low quality, while the random forest, boosting, and tree models are overfitted. Still, the results can be used in conducting business analysis.
