Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

Mapping Intimacies ◽

10.7287/peerj.preprints.26693v1 ◽

2018 ◽

Cited By ~ 4

Author(s):

Tomislav Hengl ◽

Madlene Nussbaum ◽

Marvin N Wright ◽

Gerard B.M. Heuvelink

Keyword(s):

Random Forest ◽

Cross Validation ◽

Prediction Models ◽

High Sensitivity ◽

Spatial Prediction ◽

Proximity Effects ◽

Machine Learning Techniques ◽

Calibration Data ◽

Temporal Prediction ◽

Spatio Temporal

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using 5-fold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates and the high sensitivity of predictions to input data quality. For many data sets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.


Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

PeerJ ◽

10.7717/peerj.5518 ◽

2018 ◽

Vol 6 ◽

pp. e5518 ◽

Cited By ~ 78

Author(s):

Tomislav Hengl ◽

Madlene Nussbaum ◽

Marvin N. Wright ◽

Gerard B.M. Heuvelink ◽

Benedikt Gräler

Keyword(s):

Random Forest ◽

Data Quality ◽

Cross Validation ◽

Prediction Models ◽

Proximity Effects ◽

Training Data ◽

Calibration Data ◽

Temporal Prediction ◽

Spatio Temporal

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as “knowledge engines” in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates and the high sensitivity of predictions to input data quality. The key to the success of the RFsp framework might be the training data quality, especially quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.
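The central idea in the abstract, using buffer distances from observation points as explanatory variables, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `buffer_distance_features`, the use of plain Euclidean distance, and the coordinates are all assumptions made for illustration. The sketch only builds the distance covariates; fitting a random forest on them is left to whatever library is at hand.

```python
import math

def buffer_distance_features(train_xy, query_xy):
    """Build distance covariates (hypothetical helper, not from the paper).

    Returns one row per query point; column j holds the Euclidean
    distance from that query point to training observation j.
    """
    return [
        [math.hypot(qx - tx, qy - ty) for (tx, ty) in train_xy]
        for (qx, qy) in query_xy
    ]

# Made-up observation locations and prediction locations.
train_xy = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
grid_xy = [(0.0, 0.0), (3.0, 0.0)]

# Covariates for model fitting: distances among the observations themselves.
X_train = buffer_distance_features(train_xy, train_xy)

# Covariates for prediction: distances from each new location to every observation.
X_grid = buffer_distance_features(train_xy, grid_xy)
```

Each observation point contributes one column, so the number of covariates grows with the number of observations, which is consistent with the computational cost the abstract lists as a disadvantage.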


Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

10.7287/peerj.preprints.26693v3 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tomislav Hengl ◽

Madlene Nussbaum ◽

Marvin N Wright ◽

Gerard B.M. Heuvelink ◽

Benedikt Gräler

Keyword(s):

Random Forest ◽

Data Quality ◽

Cross Validation ◽

Prediction Models ◽

Proximity Effects ◽

Training Data ◽

Calibration Data ◽

Temporal Prediction ◽

Spatio Temporal

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using 5-fold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates, sensitivity of predictions to input data quality and extrapolation problems. The key to the success of the RFsp framework might be the training data quality, especially quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.


Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

10.7287/peerj.preprints.26693v2 ◽

2018 ◽

Author(s):

Tomislav Hengl ◽

Madlene Nussbaum ◽

Marvin N Wright ◽

Gerard B.M. Heuvelink ◽

Benedikt Gräler

Keyword(s):

Random Forest ◽

Data Quality ◽

Cross Validation ◽

Prediction Models ◽

Proximity Effects ◽

Training Data ◽

Calibration Data ◽

Temporal Prediction ◽

Spatio Temporal

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using 5-fold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as 'knowledge engines' in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates, sensitivity of predictions to input data quality and extrapolation problems. The key to the success of the RFsp framework might be the training data quality, especially quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.


Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

10.7287/peerj.preprints.26693 ◽

2018 ◽

Author(s):

Tomislav Hengl ◽

Madlene Nussbaum ◽

Marvin N Wright ◽

Gerard B.M. Heuvelink ◽

Benedikt Gräler

Keyword(s):

Random Forest ◽

Data Quality ◽

Cross Validation ◽

Prediction Models ◽

Proximity Effects ◽

Training Data ◽

Calibration Data ◽

Temporal Prediction ◽

Spatio Temporal

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using 5-fold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates, sensitivity of predictions to input data quality and extrapolation problems. The key to the success of the RFsp framework might be the training data quality, especially quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with fewer points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.


Prediction of Fatal and Major Injury of Drivers, Cyclists, and Pedestrians in Collisions

PROMET - Traffic&Transportation ◽

10.7307/ptt.v32i1.3134 ◽

2020 ◽

Vol 32 (1) ◽

pp. 39-53

Author(s):

Dalia Shanshal ◽

Ceni Babaoglu ◽

Ayşe Başar

Keyword(s):

Machine Learning ◽

Random Forest ◽

Injury Severity ◽

Predictive Analytics ◽

Machine Learning Techniques ◽

Lasso Regression ◽

Severe Injuries ◽

Factors Affecting ◽

Spatio Temporal ◽

Using Data

Traffic-related deaths and severe injuries may affect every person on the roads, whether driving, cycling or walking. Toronto, the largest city in Canada and the fourth largest in North America, aims to eliminate traffic-related fatalities and serious injuries on city streets. The aim of this study is to build a prediction model using data analytics and machine learning techniques that learn from past patterns, providing additional data-driven decision support for strategic planning. A detailed exploratory analysis is presented, investigating the relationship between the variables and factors affecting collisions in Toronto. A learning-based model is proposed to predict the fatalities and severe injuries in traffic collisions through a comparison of two predictive models: Lasso Regression and Random Forest. Exploratory data analysis results reveal both spatio-temporal and behavioural patterns, such as the prevalence of collisions at intersections and in the spring and summer, and aggressive driving and inattentive behaviour among drivers. The prediction results show that the best predictor of injury severity for drivers, cyclists and pedestrians is Random Forest with an accuracy of 0.80, 0.89, and 0.80, respectively. The proposed methods demonstrate the effectiveness of machine learning application to traffic and collision data, both for exploratory and predictive analytics.


Predicting mortality in hemodialysis patients using machine learning analysis

Clinical Kidney Journal ◽

10.1093/ckj/sfaa126 ◽

2020 ◽

Author(s):

Victoria Garcia-Montemayor ◽

Alejandro Martin-Malo ◽

Carlo Barbieri ◽

Francesco Bellocchio ◽

Sagrario Soriano ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Prediction Models ◽

Initial Period ◽

Area Under The Curve ◽

Mortality Prediction ◽

Machine Learning Techniques ◽

Prediction Of Mortality ◽

Mortality Prediction Models

Abstract. Background: Besides the classic logistic regression analysis, non-parametric methods based on machine learning techniques such as random forest are presently used to generate predictive models. The aim of this study was to evaluate random forest mortality prediction models in haemodialysis patients. Methods: Data were acquired from incident haemodialysis patients between 1995 and 2015. Prediction of mortality at 6 months, 1 year and 2 years of haemodialysis was calculated using random forest and the accuracy was compared with logistic regression. Baseline data were constructed with the information obtained during the initial period of regular haemodialysis. Aiming to increase accuracy concerning baseline information of each patient, the period of time used to collect data was set at 30, 60 and 90 days after the first haemodialysis session. Results: There were 1571 incident haemodialysis patients included. The mean age was 62.3 years and the average Charlson comorbidity index was 5.99. The mortality prediction models obtained by random forest appear to be adequate in terms of accuracy [area under the curve (AUC) 0.68–0.73] and superior to logistic regression models (ΔAUC 0.007–0.046). Results indicate that both random forest and logistic regression develop mortality prediction models using different variables. Conclusions: Random forest is an adequate method, and superior to logistic regression, to generate mortality prediction models in haemodialysis patients.
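The abstract compares models by area under the ROC curve (AUC) and by the difference ΔAUC. As a rough illustration of how such a comparison can be computed, the sketch below uses the pairwise-ranking definition of AUC: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The function name `auc`, the scores, and the labels are invented for illustration and are unrelated to the study's data or code.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels are 1 for the event (e.g. death) and 0 otherwise;
    tied scores count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up outcomes and predicted risks from two hypothetical models.
labels = [1, 0, 1, 0]
model_a = [0.9, 0.8, 0.4, 0.3]
model_b = [0.9, 0.3, 0.8, 0.4]

# Difference in AUC between the two hypothetical models.
delta_auc = auc(model_b, labels) - auc(model_a, labels)
```

The pairwise form is quadratic in sample size, which is fine for a sketch; library implementations typically use a sort-based equivalent.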


Quantification of Continuous Flood Hazard using Random Forrest Classification and Flood Insurance Claims at Large Spatial Scales: A Pilot Study in Southeast Texas

10.5194/nhess-2020-347 ◽

2020 ◽

Author(s):

William Mobley ◽

Antonia Sebastian ◽

Russell Blessing ◽

Wesley E. Highfield ◽

Laura Stearns ◽

...

Keyword(s):

Pilot Study ◽

Random Forest ◽

Spatial Information ◽

Gulf Coast ◽

Flood Hazard ◽

Spatial Scales ◽

High Sensitivity ◽

Machine Learning Techniques ◽

Flood Insurance ◽

Flood Hazards

Abstract. Pre-disaster planning and mitigation necessitates detailed spatial information about flood hazards and their associated risks. In the U.S., the FEMA Special Flood Hazard Area (SFHA) provides important information about areas subject to flooding during the 1 % riverine or coastal event. The binary nature of flood hazard maps obscures the distribution of property risk inside of the SFHA and the residual risk outside of the SFHA, which can undermine mitigation efforts. Machine-learning techniques provide an alternative approach to estimating flood hazards across large spatial scales at low computational expense. This study presents a pilot study for the Texas Gulf Coast Region using Random Forest Classification to predict flood probability across a 30,523 km² area. Using a record of National Flood Insurance Program (NFIP) claims dating back to 1976 and high-resolution geospatial data, we generate a continuous flood hazard map for twelve USGS HUC-8 watersheds. Results indicate that the Random Forest model predicts flooding with a high sensitivity (AUC 0.895), especially compared to the existing FEMA regulatory floodplain. Our model identifies 649,000 structures with at least a 1 % annual chance of flooding, roughly three times more than are currently identified by FEMA as flood prone.


Prediction of soil classes in a complex landscape in Southern Brazil

Pesquisa Agropecuária Brasileira ◽

10.1590/s1678-3921.pab2019.v54.00420 ◽

2019 ◽

Vol 54 ◽

Author(s):

Jean Michel Moura-Bueno ◽

Ricardo Simão Diniz Dalmolin ◽

Taciara Zborowski Horst-Heinen ◽

Luciano Campos Cancian ◽

Ricardo Bergamo Schenato ◽

...

Keyword(s):

Random Forest ◽

Prediction Models ◽

Expert Knowledge ◽

Model Performance ◽

Spatial Prediction ◽

Support Vector ◽

Kappa Index ◽

Digital Elevation ◽

Complex Landscape ◽

Elevation Model

Abstract: The objective of this work was to evaluate the use of covariate selection by expert knowledge on the performance of soil class predictive models in a complex landscape, in order to identify the best predictive model for digital soil mapping in the Southern region of Brazil. A total of 164 points were sampled in the field using the conditioned Latin hypercube, considering the covariates elevation, slope, and aspect. From the digital elevation model, environmental covariates were extracted, composing three sets, made up of: 21 covariates, covariates after the exclusion of the multicollinear ones, and covariates chosen by expert knowledge. Prediction was performed with the following models: decision tree, random forest, multiple logistic regression, and support vector machine. The accuracy of the models was evaluated by the kappa index (K), general accuracy (GA), and class accuracy. The prediction models were sensitive to the disproportionate sampling of soil classes. The best predicted map achieved a GA of 71% and K of 0.59. The use of the covariate set chosen by expert knowledge improves model performance in predicting soil classes in a complex landscape, and random forest is the best model for the spatial prediction of soil classes.
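The two map-level accuracy measures named in the abstract, general accuracy (GA) and the kappa index (K), can both be derived from a confusion matrix. The sketch below assumes K refers to Cohen's kappa, which is the usual meaning in soil mapping but is not spelled out in the abstract; the function name and the 2×2 matrix are invented for illustration and do not reproduce the study's results.

```python
def general_accuracy_and_kappa(confusion):
    """Compute overall accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples of observed class i
    that were predicted as class j.
    """
    n = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Made-up two-class confusion matrix (rows: observed, columns: predicted).
confusion = [[20, 5],
             [10, 15]]

ga, kappa = general_accuracy_and_kappa(confusion)
# For this matrix GA is about 0.70 and kappa about 0.40.
```

Kappa discounts the agreement expected by chance, which is why it sits below GA here and in the abstract (GA of 71% against K of 0.59).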


Improving Earnings Predictions and Abnormal Returns with Machine Learning

Accounting Horizons ◽

10.2308/horizons-19-125 ◽

2021 ◽

Author(s):

Joshua O.S. Hunt ◽

James N. Myers ◽

Linda A. Myers

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Abnormal Returns ◽

Forecast Accuracy ◽

Trading Strategy ◽

Machine Learning Techniques ◽

Binary Outcomes ◽

High Tech ◽

Out Of Sample

Using stepwise logit regression, Ou and Penman (1989) predict the sign of future earnings changes and use these predictions to form a profitable hedge portfolio. Dramatic increases in computing power and recent advances in machine learning allow us to extend Ou and Penman (1989) using a larger dataset, more computer intensive forecasting algorithms, and modern prediction models. We find that stepwise logit continues to provide good out-of-sample predictions and can be used to form a trading strategy that generates small abnormal returns, but a nonparametric machine learning technique (random forest) significantly improves out-of-sample forecast accuracy and trading strategy returns. We also find that the models identify different independent variables as being important for prediction in the High Tech and Manufacturing industries, but this does not lead to better predictions or higher trading strategy returns. Overall, the most profitable strategy is based on earnings predictions from a random forest model using our full sample. Our results confirm the Ou and Penman (1989) finding that financial statement information can be useful for investment decisions, and suggest that recent nonparametric machine learning techniques could be useful in a variety of accounting contexts where predictions of binary outcomes are needed.


Localized Convolutional Neural Networks for Geospatial Wind Forecasting

Energies ◽

10.3390/en13133440 ◽

2020 ◽

Vol 13 (13) ◽

pp. 3440

Author(s):

Arnas Uselis ◽

Mantas Lukoševičius ◽

Lukas Stasytis

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Prediction Models ◽

State Of The Art ◽

Translation Invariance ◽

Public Repository ◽

Temporal Prediction ◽

Prediction Test ◽

Data Translation ◽

Spatio Temporal

Convolutional Neural Networks (CNN) possess many positive qualities when it comes to spatial raster data. Translation invariance enables CNNs to detect features regardless of their position in the scene. However, in some domains, like geospatial, not all locations are exactly equal. In this work, we propose localized convolutional neural networks that enable convolutional architectures to learn local features in addition to the global ones. We investigate their instantiations in the form of learnable inputs, local weights, and a more general form. They can be added to any convolutional layers, easily end-to-end trained, introduce minimal additional complexity, and let CNNs retain most of their benefits to the extent that they are needed. In this work we address spatio-temporal prediction: test the effectiveness of our methods on a synthetic benchmark dataset and tackle three real-world wind prediction datasets. For one of them, we propose a method to spatially order the unordered data. We compare the recent state-of-the-art spatio-temporal prediction models on the same data. Models that use convolutional layers can be and are extended with our localizations. In all these cases our extensions improve the results, and thus often the state-of-the-art. We share all the code at a public repository.
