Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5518 ◽  
Author(s):  
Tomislav Hengl ◽  
Madlene Nussbaum ◽  
Marvin N. Wright ◽  
Gerard B.M. Heuvelink ◽  
Benedikt Gräler

Random forest and similar machine learning techniques are already used to generate spatial predictions, but the spatial locations of points (geography) are often ignored in the modeling process. Spatial autocorrelation, especially if still present in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) in which buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatio-temporal variables. Performance of the RFsp framework is compared with state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain predictions as accurate and unbiased as those of different versions of kriging. Advantages of RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible toward incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears especially attractive for building multivariate spatial prediction models that can be used as “knowledge engines” in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity as calibration data and covariates increase, and the high sensitivity of predictions to input data quality.
The key to the success of the RFsp framework might be the quality of the training data, especially the quality of spatial sampling (to minimize extrapolation problems and any type of bias in the data) and of model validation (to ensure that accuracy is not affected by overfitting). For many datasets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.
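The core RFsp idea, using each point's buffer distances to all observation locations as covariates, can be sketched in Python with scikit-learn (the authors' own implementation is in R); the data below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical observation points with a spatially smooth target variable.
obs_xy = rng.uniform(0, 100, size=(60, 2))
target = np.sin(obs_xy[:, 0] / 20.0) + 0.1 * rng.standard_normal(60)

def buffer_distances(xy, obs):
    """Distance from each query point to every observation location;
    each observation contributes one explanatory-variable column."""
    return np.linalg.norm(xy[:, None, :] - obs[None, :, :], axis=2)

# Train on the n_points x n_points distance matrix.
X_train = buffer_distances(obs_xy, obs_xy)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, target)

# Predict at new locations via their distances to the same observation points.
new_xy = rng.uniform(0, 100, size=(5, 2))
pred = rf.predict(buffer_distances(new_xy, obs_xy))
```

Note the quadratic growth of the feature matrix with the number of observations, which is the computational-intensity drawback the abstract mentions.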


2021 ◽  
Vol 8 (1) ◽  
pp. 33
Author(s):  
Carlos Javier Gamboa-Villafruela ◽  
José Carlos Fernández-Alvarez ◽  
Maykel Márquez-Mijares ◽  
Albenis Pérez-Alarcón ◽  
Alfo José Batista-Leyva

Short-term prediction of precipitation is a difficult spatio-temporal task because meteorological structures are characterized non-uniformly over time. Neural networks such as the convolutional LSTM have shown the ability to handle complex spatio-temporal prediction problems. In this research, we propose a convolutional LSTM neural network (CNN-LSTM) architecture for the immediate prediction of various short-term precipitation events using satellite data. The CNN-LSTM is trained with NASA Global Precipitation Measurement (GPM) precipitation datasets at 30-min intervals. The trained model is used to predict the sixteenth precipitation field from each sequence of fifteen preceding fields, up to a time interval of 180 min. The results show that increasing the number of layers, as well as the amount of data in the training set, improves the quality of the forecast.
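The sequence-to-one framing described above (fifteen 30-min fields in, the sixteenth out) can be illustrated with a small numpy sketch of how training pairs would be built from a stack of precipitation grids (array shapes are illustrative, not the GPM resolution):

```python
import numpy as np

# Hypothetical stack of precipitation grids at 30-min intervals: (time, H, W).
frames = np.random.default_rng(1).random((40, 8, 8))

def make_sequences(frames, n_in=15):
    """Pair each run of n_in consecutive frames with the following frame,
    which is the supervised target the network learns to predict."""
    X = np.stack([frames[i:i + n_in] for i in range(len(frames) - n_in)])
    y = frames[n_in:]
    return X, y

X, y = make_sequences(frames)
# X: (25, 15, 8, 8) input sequences; y: (25, 8, 8) next-frame targets.
```

Longer lead times (up to 180 min, i.e. six 30-min steps) would be obtained by feeding predictions back in as inputs or by targeting frames further ahead.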


Author(s):  
J. Becker ◽  
P. Böhme ◽  
A. Reckert ◽  
S. B. Eickhoff ◽  
B. E. Koop ◽  
...  

As a contribution to the discussion about the possible effects of ethnicity/ancestry on age estimation based on DNA methylation (DNAm) patterns, we directly compared age-associated DNAm in German and Japanese donors in one laboratory under identical conditions. DNAm was analyzed by pyrosequencing for 22 CpG sites (CpGs) in the genes PDE4C, RPA2, ELOVL2, DDO, and EDARADD in buccal mucosa samples from German and Japanese donors (N = 368 and N = 89, respectively). Twenty of these CpGs revealed a very high correlation with age and were subsequently tested for differences between German and Japanese donors aged between 10 and 65 years (N = 287 and N = 83, respectively). ANCOVA was performed by testing the Japanese samples against age- and sex-matched German subsamples (N = 83 each; extracted 500 times from the German total sample). The median p values suggest strong evidence for significant differences (p < 0.05) for at least two CpGs (EDARADD, CpG 2, and PDE4C, CpG 2) and no differences for 11 CpGs (p > 0.3). Age prediction models based on DNAm data from all 20 CpGs from German training data did not reveal relevant differences between the Japanese test samples and German subsamples. Evidently, the high number of included “robust CpGs” prevented relevant effects of the differences in DNAm at two CpGs. Nevertheless, the presented data demonstrate the need for further research on the impact of confounding factors on DNAm in the context of ethnicity/ancestry, to ensure a high quality of age estimation. One approach may be the search for “robust” CpG markers, which requires the targeted investigation of different populations, at best through collaborative research with coordinated research strategies.
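The repeated-subsampling idea (testing the Japanese cohort against 500 size-matched German subsamples) can be sketched roughly as follows; this simplified version compares plain group means on synthetic beta values and omits the age/sex matching and covariate adjustment of a real ANCOVA:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical methylation levels (beta values) at one CpG for two cohorts;
# illustrative numbers only, not the paper's measurements.
german = rng.normal(0.50, 0.05, size=287)
japanese = rng.normal(0.53, 0.05, size=83)

# Draw 500 German subsamples of the Japanese cohort's size and collect the
# mean group difference for each draw, then summarize with the median.
diffs = [japanese.mean() - rng.choice(german, size=83, replace=False).mean()
         for _ in range(500)]
median_diff = float(np.median(diffs))
```

Repeated subsampling of the larger cohort avoids comparing groups of very unequal size, and summarizing by the median damps the influence of unlucky draws.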


ADMET & DMPK ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 29-77 ◽  
Author(s):  
Alex Avdeef

Accurate prediction of the solubility of drugs is still problematic. It was long thought that the shortfalls were due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly for ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurements were discussed, and suggestions were offered for extracting more reliable information from legacy data. Many of those suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving the intrinsic solubility, S0) and by normalizing temperature (transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility have at best hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of the data is still out of reach. Data quality is not the limiting factor in prediction; the statistical machine learning methodologies are probably up to the task. Possibly what is missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly research compounds). Also, new descriptors that can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
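The ionization correction mentioned above can be illustrated for the simplest case of a monoprotic acid, where the Henderson-Hasselbalch relation gives S = S0 * (1 + 10^(pH - pKa)); the function and numbers below are a hypothetical illustration, not taken from the database:

```python
import math

def intrinsic_solubility_acid(log_s_measured, ph, pka):
    """Correct a measured log solubility of a monoprotic acid for ionization:
    S = S0 * (1 + 10**(pH - pKa)), hence
    log S0 = log S - log10(1 + 10**(pH - pKa))."""
    return log_s_measured - math.log10(1.0 + 10.0 ** (ph - pka))

# Example: a weak acid (pKa 4.4) measured at pH 7.4 is roughly 1000-fold
# more soluble than its intrinsic (neutral-form) solubility.
log_s0 = intrinsic_solubility_acid(-2.0, ph=7.4, pka=4.4)   # about -5.0
```

Correcting every legacy measurement to S0 in this way is what makes values taken at different pH comparable within one database.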


Energies ◽  
2020 ◽  
Vol 13 (13) ◽  
pp. 3440
Author(s):  
Arnas Uselis ◽  
Mantas Lukoševičius ◽  
Lukas Stasytis

Convolutional neural networks (CNNs) possess many positive qualities when it comes to spatial raster data. Translation invariance enables CNNs to detect features regardless of their position in the scene. However, in some domains, like the geospatial one, not all locations are exactly equal. In this work, we propose localized convolutional neural networks that enable convolutional architectures to learn local features in addition to global ones. We investigate their instantiations in the form of learnable inputs, local weights, and a more general form. They can be added to any convolutional layer, are easily trained end-to-end, introduce minimal additional complexity, and let CNNs retain most of their benefits to the extent that they are needed. In this work we address spatio-temporal prediction: we test the effectiveness of our methods on a synthetic benchmark dataset and tackle three real-world wind prediction datasets. For one of them, we propose a method to spatially order the unordered data. We compare recent state-of-the-art spatio-temporal prediction models on the same data. Models that use convolutional layers can be, and are, extended with our localizations. In all these cases our extensions improve the results, and thus often the state of the art. We share all the code in a public repository.
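One simple way to let convolutions see location, related in spirit to the learnable-inputs variant (though not the paper's exact formulation), is to append fixed normalized coordinate channels to the input, as in the CoordConv idea:

```python
import numpy as np

def add_coordinate_channels(batch):
    """Append normalized y/x coordinate channels to an image batch of shape
    (N, C, H, W), so subsequent convolutions can learn location-dependent
    features despite their translation invariance."""
    n, _, h, w = batch.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    coords = np.broadcast_to(np.stack([ys, xs]), (n, 2, h, w))
    return np.concatenate([batch, coords], axis=1)

x = np.zeros((4, 3, 16, 16))
x_loc = add_coordinate_channels(x)   # shape (4, 5, 16, 16)
```

The paper's learnable local weights go further than fixed coordinates, but this sketch shows the minimal mechanism by which position enters a convolutional stack.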


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Lewis H. Mervin ◽  
Maria-Anna Trapotsi ◽  
Avid M. Afzal ◽  
Ian P. Barrett ◽  
Andreas Bender ◽  
...  

Measurements of protein–ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance, and they should ideally be factored into modelling and output predictions, such as the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state of the art, we herein present a novel approach to predicting protein–ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in the PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way in the original RF algorithm. For example, when σ ranged between 0.4 and 0.6 log units and ideal probability estimates fell between 0.4 and 0.6, the PRF outperformed the RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed the PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident.
Finally, PRF models trained with putative inactives performed worse than PRF models without them, possibly because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, the PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
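The way measurement noise softens a binary activity label near the threshold can be sketched as follows: assuming Gaussian noise with standard deviation σ on a measured pXC50, the probability that the true value exceeds the activity threshold is a normal CDF (the threshold and σ values below are illustrative, not from the paper):

```python
import math

def active_probability(pxc50, threshold=6.5, sigma=0.5):
    """Probability that the true activity exceeds the binary threshold,
    given a measured pXC50 with assumed Gaussian noise of std sigma
    (standard normal CDF computed via the error function)."""
    z = (pxc50 - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

p_border = active_probability(6.5)   # exactly on the threshold -> 0.5
p_clear = active_probability(8.0)    # well above the threshold -> near 1
```

Labels like these, soft near the boundary and hard far from it, are exactly the inputs a probabilistic classifier can exploit where a plain binary RF cannot.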


Author(s):  
Leye Wang ◽  
Xu Geng ◽  
Xiaojuan Ma ◽  
Feng Liu ◽  
Qiang Yang

Spatio-temporal prediction is a key type of task in urban computing, e.g., traffic flow and air quality prediction. Adequate data is usually a prerequisite, especially when deep learning is adopted. However, the development levels of different cities are unbalanced, and many cities still suffer from data scarcity. To address this problem, we propose a novel cross-city transfer learning method for deep spatio-temporal prediction tasks, called RegionTrans. RegionTrans aims to effectively transfer knowledge from a data-rich source city to a data-scarce target city. More specifically, we first learn an inter-city region matching function to match each target city region to a similar source city region. A neural network is designed to effectively extract region-level representations for spatio-temporal prediction. Finally, an optimization algorithm is proposed to transfer learned features from the source city to the target city via the region matching function. Using citywide crowd flow prediction as a demonstration experiment, we verify the effectiveness of RegionTrans. Results show that RegionTrans can outperform state-of-the-art fine-tuned deep spatio-temporal prediction models, reducing prediction error by up to 10.7%.
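The inter-city region matching step can be sketched, under the assumption that each region is summarized by a feature vector, as a nearest-neighbor search by cosine similarity (the actual matching function in RegionTrans is learned, so this is only an illustration):

```python
import numpy as np

def match_regions(target_feats, source_feats):
    """For each target-city region, return the index of the most similar
    source-city region by cosine similarity of region feature vectors."""
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    return np.argmax(t @ s.T, axis=1)

rng = np.random.default_rng(3)
src = rng.random((50, 16))                                 # 50 source regions
tgt = src[[7, 21, 40]] + 0.01 * rng.standard_normal((3, 16))  # noisy copies
matches = match_regions(tgt, src)
```

Once each target region is paired with a source region, transferred features can be pulled from the matched source regions during optimization.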

