Two cross-validation techniques to comprehensively characterize global horizontal irradiation regression models: Single data-splitting is insufficient

2019 ◽  
Vol 11 (6) ◽  
pp. 063702
Author(s):  
Keith De Souza

2003 ◽  
Vol 33 (6) ◽  
pp. 976-987 ◽  
Author(s):  
Antal Kozak ◽  
Robert Kozak

A detailed study using seven data sets, two standing tree volume estimating models, and a height–diameter model showed that fit statistics and lack of fit statistics calculated directly from a regression model can be well estimated using simulations of cross validation or double cross validation. These results suggest that cross validation by data splitting and double cross validation provide little, if any, additional information in the process of evaluating regression models.
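The comparison described above, between fit statistics computed directly from the fitted model and those estimated by cross validation, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the synthetic data, the simple straight-line model, and the helper names `fit_line`, `rmse`, and `kfold_rmse` are hypothetical and do not come from the study.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(xs, ys, intercept, slope):
    """Root mean square error of the line on the given points."""
    sq = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sq / len(xs))


def kfold_rmse(xs, ys, k=5, seed=0):
    """RMSE pooled over k held-out folds; each fold is predicted by a
    model fitted on the remaining k - 1 folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out)
    return sqrt(sq_err / len(xs))


# Synthetic data (illustrative only): y = 2 + 0.5 x + Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(1, 41)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

b0, b1 = fit_line(xs, ys)
full_rmse = rmse(xs, ys, b0, b1)       # statistic from the fitted model
cv_rmse = kfold_rmse(xs, ys, k=5)      # statistic from 5-fold CV
print(f"in-sample RMSE: {full_rmse:.3f}, 5-fold CV RMSE: {cv_rmse:.3f}")
```

For a well-specified least-squares model with a reasonable sample size the two numbers are usually close, which is the kind of agreement the abstract reports; the cross-validated value is never smaller than the in-sample one for ordinary least squares.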



2021 ◽  
Vol 03 (01) ◽  
pp. 25-31
Author(s):  
Peter Krammer ◽  
Marcel Kvassay ◽  
Ladislav Hluchý

In this article, building on our previous work, we engage in spatiotemporal modelling of transport demand in the Montreal metropolitan area over a period of six years. We employ classical machine learning and regression models, which predict bike-sharing demand in the form of daily cumulative sums of bike trips for each considered docking station. Hourly estimates of demand are then determined by considering the statistical distribution of demand across individual hours of an average day. To capture seasonal and other regular variation in demand, longer-term distribution characteristics of bike trips, such as their average number falling on each day of the week or month of the year, were also used as input attributes. We initially conjectured that weather would be an important source of irregular variation in bike-sharing demand, and subsequently included several available meteorological variables in our models. We validated our models by Hold-Out and 10-Fold Cross-Validation, with encouraging results.
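One plausible reading of the step from daily totals to hourly estimates is a proportional split using the average share of trips in each hour. The sketch below assumes exactly that; the abstract does not specify the procedure, and the function names `hourly_shares` and `disaggregate` as well as the toy history are hypothetical.

```python
def hourly_shares(hourly_history):
    """Average share of a day's trips falling in each of the 24 hours.

    hourly_history: list of days, each a list of 24 trip counts.
    """
    totals = [sum(day[h] for day in hourly_history) for h in range(24)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def disaggregate(daily_total, shares):
    """Split a predicted daily total into hourly estimates."""
    return [daily_total * s for s in shares]


# Toy history for one station (illustrative only): two observed days
# with a morning peak at 08:00 and an evening peak at 17:00.
day_a = [0] * 24
day_b = [0] * 24
day_a[8], day_a[17] = 30, 50
day_b[8], day_b[17] = 10, 10

shares = hourly_shares([day_a, day_b])
hourly = disaggregate(200.0, shares)   # 200 trips predicted for the day
print(hourly[8], hourly[17])
```

The hourly estimates sum back to the predicted daily total by construction, so any error in the daily model carries through unchanged to the hourly level.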



1988 ◽  
Vol 2 (1) ◽  
pp. 39-48 ◽  
Author(s):  
David W. Osten


2020 ◽  
Author(s):  
Hylke Beck ◽  
Seth Westra ◽  
Eric Wood

We introduce a unique set of global observation-based climatologies of daily precipitation (P) occurrence (related to the lower tail of the P distribution) and peak intensity (related to the upper tail of the P distribution). The climatologies were produced using Random Forest (RF) regression models trained with an unprecedented collection of daily P observations from 93,138 stations worldwide. Five-fold cross-validation was used to evaluate the generalizability of the approach and to quantify uncertainty globally. The RF models were found to provide highly satisfactory performance, yielding cross-validation coefficient of determination (R²) values from 0.74 for the 15-year return-period daily P intensity to 0.86 for the >0.5 mm d⁻¹ daily P occurrence. The performance of the RF models was consistently superior to that of state-of-the-art reanalysis (ERA5) and satellite (IMERG) products. The highest P intensities over land were found along the western equatorial coast of Africa, in India, and along coastal areas of Southeast Asia. Using a 0.5 mm d⁻¹ threshold, P was estimated to occur on 23.2% of days on average over the global land surface (excluding Antarctica). The climatologies including uncertainty estimates will be released as the Precipitation DISTribution (PDIST) dataset via www.gloh2o.org/pdist. We expect the dataset to be useful for numerous purposes, such as the evaluation of climate models, the bias correction of gridded P datasets, and the design of hydraulic structures in poorly gauged regions.



2020 ◽  
Vol 9 (3) ◽  
pp. 164-172
Author(s):  
Changsheng Jiang ◽  
Piaopiao Zhao ◽  
Weihua Li ◽  
Yun Tang ◽  
Guixia Liu

Neurotoxicity is one of the main causes of drug withdrawal, and the biological experimental methods of detecting neurotoxicity are time-consuming and laborious. In addition, the existing computational prediction models of neurotoxicity still have some shortcomings. In response to these shortcomings, we collected a large neurotoxicity data set and used PyBioMed molecular descriptors and eight machine learning algorithms to construct regression prediction models of chemical neurotoxicity. Cross-validation and test set validation of the models showed that the extra-trees regressor model had the best predictive performance for neurotoxicity ($q_{\mathrm{test}}^2$ = 0.784). In addition, we determined the applicability domain of the models by calculating the standard deviation distance and the lever distance of the training set. By calculating the contribution of the molecular descriptors to the models, we also found that some molecular descriptors are closely related to neurotoxicity. Considering the accuracy of the regression models, we recommend using the extra-trees regressor model to predict the chemical autonomic neurotoxicity.





2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
J. S. Wijnands ◽  
K. Shelton ◽  
Y. Kuleshov

Tropical cyclones (TCs) can have a major impact on the coastal communities of Australia and Pacific Island countries. Preparedness is one of the key factors in limiting TC impacts, and the Australian Bureau of Meteorology issues an outlook of TC seasonal activity ahead of the TC season for the Australian Region (AR; 5°S to 40°S, 90°E to 160°E) and the South Pacific Ocean (SPO; 5°S to 40°S, 142.5°E to 120°W). This paper investigates the use of support vector regression models and new explanatory variables to improve the accuracy of seasonal TC predictions. Correlation analysis and subsequent cross-validation of the generated models showed that the Dipole Mode Index (DMI) performs well as an explanatory variable for TC prediction in both AR and SPO, as do Niño4 SST anomalies in AR and Niño1+2 SST anomalies in SPO. For both AR and SPO, the developed model that utilised the combination of Niño1+2 SST anomalies, Niño4 SST anomalies, and DMI had the best forecasting performance. The support vector regression models outperform the current models based on a linear discriminant analysis approach for both regions, improving the standard deviation of errors in cross-validation from 2.87 to 2.27 for AR and from 4.91 to 3.92 for SPO.





2001 ◽  
Vol 13 (5) ◽  
pp. 1103-1118 ◽  
Author(s):  
S. Sundararajan ◽  
S. S. Keerthi

Gaussian processes are powerful regression models specified by parameterized mean and covariance functions. Standard approaches for choosing these parameters (known as hyperparameters) are maximum likelihood and maximum a posteriori. In this article, we propose and investigate predictive approaches based on Geisser's predictive sample reuse (PSR) methodology and the related Stone's cross-validation (CV) methodology. More specifically, we derive results for Geisser's surrogate predictive probability (GPP), Geisser's predictive mean square error (GPE), and the standard CV error and make a comparative study. Within an approximation we arrive at the generalized cross-validation (GCV) and establish its relationship with the GPP and GPE approaches. These approaches are tested on a number of problems. Experimental results show that these approaches are strongly competitive with the existing approaches.
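For orientation, leave-one-out predictive criteria of this kind rest on a well-known closed form that avoids refitting the model n times. The notation below is assumed for illustration and is not taken from the abstract: a zero prior mean, $K$ the $n \times n$ covariance matrix of the training targets (noise included, and depending on the hyperparameters $\theta$), and $\mathbf{y}$ the vector of targets.

```latex
% Leave-one-out predictive mean and variance for training point i
\mu_{-i} = y_i - \frac{[K^{-1}\mathbf{y}]_i}{[K^{-1}]_{ii}},
\qquad
\sigma_{-i}^{2} = \frac{1}{[K^{-1}]_{ii}}

% Leave-one-out mean squared error as a function of the hyperparameters
\mathrm{CV}(\theta) = \frac{1}{n}\sum_{i=1}^{n}
  \left( \frac{[K^{-1}\mathbf{y}]_i}{[K^{-1}]_{ii}} \right)^{2}

% Leave-one-out log predictive probability
L_{\mathrm{LOO}}(\theta) = \sum_{i=1}^{n}
  \left[ -\tfrac{1}{2}\log \sigma_{-i}^{2}
         - \frac{(y_i - \mu_{-i})^{2}}{2\sigma_{-i}^{2}}
         - \tfrac{1}{2}\log 2\pi \right]
```

Under these assumptions, hyperparameters can be chosen by minimizing the squared-error criterion or maximizing the log predictive probability, at the cost of a single matrix inversion per candidate $\theta$.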


