Systematic comparison of five machine-learning models in classification and interpolation of soil particle size fractions using different transformed data

2020 ◽  
Vol 24 (5) ◽  
pp. 2505-2526
Author(s):  
Mo Zhang ◽  
Wenjiao Shi ◽  
Ziwei Xu

Abstract. Soil texture and soil particle size fractions (PSFs) play an increasing role in physical, chemical, and hydrological processes. Many previous studies have used machine-learning and log-ratio transformation methods for soil texture classification and soil PSF interpolation to improve the prediction accuracy. However, few reports have systematically compared their performance with respect to both classification and interpolation. Here, five machine-learning models – K-nearest neighbour (KNN), multilayer perceptron neural network (MLP), random forest (RF), support vector machines (SVM), and extreme gradient boosting (XGB) – combined with the original data and three log-ratio transformation methods – additive log ratio (ALR), centred log ratio (CLR), and isometric log ratio (ILR) – were applied to evaluate soil texture and PSFs using both raw and log-ratio-transformed data from 640 soil samples in the Heihe River basin (HRB) in China. The results demonstrated that the log-ratio transformations decreased the skewness of soil PSF data. For soil texture classification, RF and XGB showed better performance with a higher overall accuracy and kappa coefficient. They were also recommended to evaluate the classification capacity of imbalanced data according to the area under the precision–recall curve (AUPRC). For soil PSF interpolation, RF delivered the best performance among the five machine-learning models with the lowest root-mean-square error (RMSE; sand had an RMSE of 15.09 %, silt was 13.86 %, and clay was 6.31 %), mean absolute error (MAE; sand had an MAE of 10.65 %, silt was 9.99 %, and clay was 5.00 %), Aitchison distance (AD; 0.84), and standardized residual sum of squares (STRESS; 0.61), and the highest Spearman rank correlation coefficient (RCC; sand was 0.69, silt was 0.67, and clay was 0.69). STRESS was improved by using log-ratio methods, especially for CLR and ILR. 
Prediction maps from both direct and indirect classification were similar in the middle and upper reaches of the HRB. However, indirect classification maps using log-ratio-transformed data provided more detailed information in the lower reaches of the HRB. There was a pronounced improvement of 21.3 % in the kappa coefficient when using indirect methods for soil texture classification compared with direct methods. RF was recommended as the best strategy among the five machine-learning models, based on the accuracy evaluation of the soil PSF interpolation and soil texture classification, and ILR was recommended for component-wise machine-learning models without multivariate treatment, considering the constrained nature of compositional data. In addition, XGB was preferred over other models when the trade-off between the accuracy and runtime was considered. Our findings provide a reference for future works with respect to the spatial prediction of soil PSFs and texture using machine-learning models with skewed distributions of soil PSF data over a large area.
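
As an illustration of the three log-ratio transformations named in the abstract above, the following is a minimal sketch for a three-part composition (sand, silt, clay). It is not the authors' code. The function names, the choice of clay as the ALR denominator, and the particular orthonormal basis used for ILR are assumptions made here for illustration, and all parts are assumed to be strictly positive (zeros would need separate treatment before taking logarithms).

```python
import math

def closure(parts):
    """Rescale positive parts so they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

def alr(parts):
    """Additive log ratio: log of each part over the last part (assumed denominator)."""
    x = closure(parts)
    return [math.log(v / x[-1]) for v in x[:-1]]

def alr_inv(coords):
    """Back-transform ALR coordinates to a composition that sums to 1."""
    return closure([math.exp(c) for c in coords] + [1.0])

def clr(parts):
    """Centred log ratio: log of each part over the geometric mean."""
    logs = [math.log(v) for v in closure(parts)]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def ilr3(parts):
    """Isometric log ratio for three parts, using one common orthonormal basis (assumed)."""
    x1, x2, x3 = closure(parts)
    z1 = math.sqrt(1 / 2) * math.log(x1 / x2)
    z2 = math.sqrt(2 / 3) * math.log(math.sqrt(x1 * x2) / x3)
    return [z1, z2]

# Hypothetical sample: 60 % sand, 30 % silt, 10 % clay.
sample = [60.0, 30.0, 10.0]
print(alr(sample), clr(sample), ilr3(sample))
```

CLR coordinates always sum to zero, while ALR and ILR map the three constrained fractions to two unconstrained coordinates. Predictions made in the transformed space can then be back-transformed (as in `alr_inv`) so that the fractions again sum to 100 %.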

2020 ◽  
Author(s):  
Mo Zhang ◽  
Wenjiao Shi

Abstract. Soil texture and soil particle size fractions (PSFs) play an increasing role in physical, chemical and hydrological processes. Many previous studies have used machine-learning and log ratio transformation methods for soil texture classification and soil PSF interpolation to improve the prediction accuracy. However, few reports have systematically compared their performance in both classification and interpolation. Here, a total of 45 evaluation models, generated from five machine-learning models – K-nearest neighbor (KNN), multilayer perceptron neural network (MLP), random forest (RF), support vector machines (SVM) and extreme gradient boosting (XGB) – combined with the original data and three log ratio methods – additive log ratio (ALR), centered log ratio (CLR) and isometric log ratio (ILR) – were applied to evaluate and compare classification and interpolation using 640 soil samples in the Heihe River Basin (HRB) in China. The results demonstrated that log ratio transformation methods decreased the skewness of the distributions of soil PSF data. For soil texture classification, RF and XGB showed better performance in terms of overall accuracy and kappa coefficients; they were also recommended to evaluate the classification capacity of imbalanced data according to the area under the precision-recall curve (AUPRC) analysis. For soil PSF interpolation, RF delivered the best performance among the five machine-learning models with the lowest root mean squared error (RMSE, sand: 15.09 %, silt: 13.86 %, clay: 6.31 %), mean absolute error (MAE, sand: 10.65 %, silt: 9.99 %, clay: 5.00 %), Aitchison distance (AD, 0.84) and standardized residual sum of squares (STRESS, 0.61), and the highest coefficient of determination (R², sand: 53.28 %, silt: 45.77 %, clay: 53.75 %). STRESS was improved using log ratio methods, especially CLR and ILR. In the comparison of direct and indirect classification, prediction maps were similar in the middle and upper reaches and different in the lower reaches of the HRB. 
Moreover, indirect classification maps based on log ratio transformed data contained more detailed information. There was a pronounced improvement of 21.3 % in the kappa coefficient when using indirect methods for soil texture classification compared to direct ones. RF was recommended as the best strategy among these five machine-learning models according to the accuracy evaluation of soil PSF interpolation and soil texture classification, and ILR was recommended for component-wise machine-learning methods without multivariate treatment, considering the constrained nature of compositional data. In addition, XGB was preferred over other models when the trade-off between accuracy and time was considered. Our findings can provide a reference for other research on the spatial prediction of soil PSFs and texture using machine-learning methods with skewed distributions of soil PSF data over a large area.
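
The two compositional accuracy measures mentioned above, Aitchison distance (AD) and STRESS, can be sketched as follows. The abstract does not give their formulas, so this uses definitions common in the compositional-data literature as an assumption: AD as the Euclidean distance between CLR-transformed compositions, and STRESS as the normalized mismatch between pairwise Aitchison distances of observed and predicted samples. All function names are hypothetical, and parts are assumed to be strictly positive.

```python
import math
from itertools import combinations

def clr(parts):
    """Centred log ratio of a strictly positive composition."""
    total = sum(parts)
    logs = [math.log(p / total) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the CLR coordinates of two compositions."""
    return math.dist(clr(x), clr(y))

def mean_aitchison_distance(observed, predicted):
    """Average AD between each observed sample and its prediction."""
    pairs = list(zip(observed, predicted))
    return sum(aitchison_distance(o, p) for o, p in pairs) / len(pairs)

def stress(observed, predicted):
    """Assumed STRESS definition: relative mismatch of pairwise distances.

    Requires at least two distinct observed compositions, otherwise the
    denominator is zero.
    """
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(observed)), 2):
        d_obs = aitchison_distance(observed[i], observed[j])
        d_pred = aitchison_distance(predicted[i], predicted[j])
        num += (d_obs - d_pred) ** 2
        den += d_obs ** 2
    return math.sqrt(num / den)

# Hypothetical observed and predicted (sand, silt, clay) percentages.
obs = [[60.0, 30.0, 10.0], [20.0, 50.0, 30.0], [40.0, 40.0, 20.0]]
pred = [[55.0, 33.0, 12.0], [25.0, 50.0, 25.0], [42.0, 38.0, 20.0]]
print(mean_aitchison_distance(obs, pred), stress(obs, pred))
```

Under these definitions, lower values of both measures indicate predictions that better preserve the relative structure of the compositions.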


2019 ◽  
Author(s):  
Mo Zhang ◽  
Wenjiao Shi

Abstract. Soil texture and soil particle size fractions (PSFs) play an increasing role in physical, chemical and hydrological processes. Digital soil mapping using machine-learning methods has been widely applied in soil science to generate more detailed predictions of qualitative or quantitative outputs than traditional soil-mapping methods. Because soil PSFs are compositional data, interpolation of soil PSFs combined with log ratio approaches was developed to improve the prediction accuracy, and it can also be used to derive soil texture indirectly. However, few reports have systematically analyzed and compared classification and regression, the accuracies of original (untransformed) and log ratio approaches, and the performance of direct and indirect soil texture classification using machine-learning methods. In this study, a total of 45 evaluation models, generated from five different machine-learning models combined with the original data and three log ratio approaches – additive log ratio, centered log ratio and isometric log ratio (ALR, CLR and ILR, respectively) – were used to evaluate and compare the performance of soil texture classification and soil PSF interpolation. The results demonstrated that log ratio approaches made the distributions of the soil sampling data more symmetric and that, with respect to soil texture classification, random forest (RF) and extreme gradient boosting (XGB) showed notably good performance. For soil PSF interpolation, RF delivered the best performance among the five machine-learning models with the lowest root mean squared error (RMSE, sand: 15.09 %, silt: 13.86 %, clay: 6.31 %), mean absolute error (MAE, sand: 10.65 %, silt: 9.99 %, clay: 5.00 %), Aitchison distance (AD, 0.84) and standardized residual sum of squares (STRESS, 0.61), and the highest coefficient of determination (R², sand: 53.28 %, silt: 45.77 %, clay: 53.75 %). STRESS was improved using log ratio approaches, especially CLR and ILR. 
There is a pronounced improvement (21.3 %) in the kappa coefficient using indirect soil texture classification compared to the direct approach. Our systematic comparison helps to elucidate the processing and selection of compositional data in spatial simulation.
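
For readers unfamiliar with the kappa coefficient used to compare direct and indirect classification, the following is a minimal sketch of overall accuracy and Cohen's kappa, the standard chance-corrected agreement measure. The function names and the toy texture labels are assumptions for illustration only and are not taken from the study.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the observed class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement and p_e the agreement expected by chance
    from the class frequencies. Undefined when p_e equals 1 (for example,
    when both label lists contain a single identical class).
    """
    n = len(y_true)
    p_o = overall_accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical observed and predicted soil texture classes.
observed = ["loam", "loam", "sand", "sand"]
predicted = ["loam", "sand", "sand", "sand"]
print(overall_accuracy(observed, predicted), cohen_kappa(observed, predicted))
```

Because kappa discounts agreement that would occur by chance, it can differ substantially from overall accuracy when some texture classes are much more frequent than others.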


2019 ◽  
Vol 58 (02/03) ◽  
pp. 079-085
Author(s):  
Akira A. Nair ◽  
Mihir Velagapudi ◽  
Lakshmana Behara ◽  
Ravitheja Venigandla ◽  
Christine T. Fong ◽  
...  

Abstract. Background: Hyperglycemia, or high blood glucose, during surgery is associated with poor postoperative outcome. Knowing in advance which patients may develop hyperglycemia allows optimal assignment of resources and earlier initiation of a glucose management plan. Objective: To develop predictive models to estimate peak glucose levels in surgical patients and to implement the best performing model as a point-of-care clinical tool to assist the surgical team in optimally managing glucose levels. Methods: Using a large perioperative dataset (6,579 patients) of patient- and surgery-specific parameters, we developed and validated linear regression and machine learning models (random forest, extreme gradient boosting [Xg Boost], classification and regression trees [CART], and neural network) to predict the peak glucose levels during surgery. The model performances were compared in terms of mean absolute percentage error (MAPE), logarithm of the ratio of the predicted to actual value (log ratio), median prediction error, and interquartile error range. The best performing model was implemented as part of a web-based application for optimal decision-making toward glucose management during surgery. Results: The accuracy of the machine learning models was higher (MAPE = 17%, log ratio = 0.029 for Xg Boost) than that of the linear regression model (MAPE = 22%, log ratio = 0.041). The Xg Boost model had the smallest median prediction error (5.4 mg/dL) and the narrowest interquartile error range (−17 to 24 mg/dL) compared with the other models. The best performing model, Xg Boost, was implemented as a web application, Hyper-G, which the perioperative providers can use at the point of care to estimate peak glucose levels during surgery. Conclusions: Machine learning models are able to accurately predict peak glucose levels during surgery. 
Implementation of such a model as a web-based application can facilitate optimal decision-making and advance planning of glucose management strategies.
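
The error measures listed in the Methods above can be sketched as follows. This is an illustrative reading, not the authors' implementation. The abstract does not state the logarithm base or how per-patient log ratios were aggregated, so the natural logarithm and per-case values are assumptions here, as are the function names and the toy glucose values.

```python
import math
import statistics

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * statistics.mean(
        [abs(p - a) / a for a, p in zip(actual, predicted)]
    )

def log_ratios(actual, predicted):
    """Per-case log of predicted over actual (natural log is an assumption)."""
    return [math.log(p / a) for a, p in zip(actual, predicted)]

def error_summary(actual, predicted):
    """Median prediction error and interquartile range of the errors."""
    errors = [p - a for a, p in zip(actual, predicted)]
    q1, _, q3 = statistics.quantiles(errors, n=4)  # needs at least two cases
    return statistics.median(errors), (q1, q3)

# Hypothetical peak glucose values in mg/dL.
actual = [100.0, 200.0, 150.0]
predicted = [110.0, 180.0, 150.0]
print(mape(actual, predicted))
print(log_ratios(actual, predicted))
print(error_summary(actual, predicted))
```

A log ratio of zero means the prediction equals the actual value. Positive values indicate overestimation and negative values underestimation, symmetrically on a multiplicative scale.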


2020 ◽  
Vol 2 (1) ◽  
pp. 3-6
Author(s):  
Eric Holloway

Imagination Sampling is the usage of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling for obtaining multibox models. Here, the possibility of importing such models as the starting point for further automatic enhancement is explored.


2021 ◽  
Author(s):  
Norberto Sánchez-Cruz ◽  
Jose L. Medina-Franco

Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight approved epigenetic drugs for the treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. These data represent a large number of structure-activity relationships that have not been exploited thus far for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different design, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated, showing mean precisions up to 0.952 for the epigenetic target prediction task. Our results indicate that the models reported herein have considerable potential to identify small molecules with epigenetic activity. Therefore, the models were implemented as a freely accessible and easy-to-use web application.

