Selection of optimal regression models via cross-validation

1988 ◽  
Vol 2 (1) ◽  
pp. 39-48 ◽  
Author(s):  
David W. Osten


1976 ◽
Vol 1 (3) ◽  
pp. 253-277 ◽  
Author(s):  
Herbert J. Walberg ◽  
Sue Pinzur Rasher

This paper illustrates cut-and-try techniques that point to appropriate transformations of variables and to the selection of sets of variables for an equation that may improve understanding of a social process. The substance of the research reported (the relation of mental test results to state population, cultural, and school resource indexes; Walberg and Rasher, 1974) illustrates typical problems of behavioral data: multicollinearity, outliers, non-normal distributions, and the lack of a consensually validated, explicit theoretical model. Despite these problems, data originally collected for purposes other than the investigator's may yield tentative confirmations or cautions about prior findings and provisional indications for theory or policy; such inferences may be at least partially checked by cross-validation on independent or semi-independent sets of data. After discussing the sequence of analyses and the results, we conclude by mentioning a number of uncertainties and reservations about drawing substantive or policy implications.
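A minimal sketch of the kind of check described above, assuming hypothetical state-level data rather than the paper's indexes: candidate log transformations and variable subsets are compared by ordinary least squares, and each candidate equation is evaluated on an independent hold-out split instead of the data used to select it.

```python
# Hedged sketch (not the authors' code): select a variable subset for a
# log-transformed regression and check it on an independent hold-out split.
# All variable names and data here are hypothetical stand-ins.
import itertools

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 50                                   # e.g. one row per state
X = rng.lognormal(size=(n, 4))           # skewed predictors, as is typical of such indexes
y = 2.0 * np.log(X[:, 0]) - 1.5 * np.log(X[:, 1]) + rng.normal(scale=0.5, size=n)

# Candidate transformation: log of each index to reduce skew and leverage.
X_log = np.log(X)

X_tr, X_te, y_tr, y_te = train_test_split(X_log, y, test_size=0.4, random_state=0)

best = None
for k in range(1, X_log.shape[1] + 1):
    for subset in itertools.combinations(range(X_log.shape[1]), k):
        model = LinearRegression().fit(X_tr[:, subset], y_tr)
        score = r2_score(y_te, model.predict(X_te[:, subset]))  # check on held-out data
        if best is None or score > best[0]:
            best = (score, subset)

print(f"best subset {best[1]} with hold-out R^2 = {best[0]:.3f}")
```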


Author(s):  
Carlos Alberto Huaira Contreras ◽  
Carlos Cristiano Hasenclever Borges ◽  
Camila Borelli Zeller ◽  
Amanda Romanelli

The paper proposes a weighted cross-validation (WCV) algorithm to select the linear regression model with a change-point under a scale mixtures of normal (SMN) distribution that yields the best prediction results. SMN distributions are used to construct regression models that are robust to the influence of outliers on the parameter estimation process. Thus, we relax the usual assumption of normality and consider that the random errors follow an SMN distribution, specifically the Student-t distribution. In addition, we allow the parameters of the regression model to change at a specific but unknown point, called the change-point. In this context, estimates of the model parameters, including the change-point, are obtained via an EM-type (Expectation-Maximization) algorithm. The WCV method is used to select the model that is most robust and offers the smallest prediction error, with the weighting values taken from the E-step of the EM-type algorithm. Finally, numerical examples with simulated and real data (television audience data) illustrate the proposed methodology.
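A hedged, much-simplified sketch of the idea (not the authors' algorithm; the change-point is omitted and the degrees of freedom are held fixed): a linear regression with Student-t errors is fitted by an EM-type algorithm, and the E-step weights, which downweight outlying observations, are reused when averaging the cross-validation prediction errors.

```python
# Simplified weighted cross-validation sketch under Student-t errors.
import numpy as np
from sklearn.model_selection import KFold


def fit_t_regression_em(X, y, nu=4.0, n_iter=100):
    """EM for y = X @ beta + e, e ~ Student-t(nu). Returns beta, sigma2."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta) ** 2)
    for _ in range(n_iter):
        # E-step: expected precision weights, small for outlying residuals.
        d2 = (y - X @ beta) ** 2 / sigma2
        w = (nu + 1.0) / (nu + d2)
        # M-step: weighted least squares and weighted residual variance.
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        sigma2 = np.sum(w * (y - X @ beta) ** 2) / n
    return beta, sigma2


def e_step_weights(X, y, beta, sigma2, nu=4.0):
    d2 = (y - X @ beta) ** 2 / sigma2
    return (nu + 1.0) / (nu + d2)


def weighted_cv_error(X, y, nu=4.0, n_splits=5, seed=0):
    """Squared prediction errors averaged with E-step weights across folds."""
    errs, wts = [], []
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        beta, sigma2 = fit_t_regression_em(X[tr], y[tr], nu)
        wts.append(e_step_weights(X[te], y[te], beta, sigma2, nu))
        errs.append((y[te] - X[te] @ beta) ** 2)
    errs, wts = np.concatenate(errs), np.concatenate(wts)
    return np.sum(wts * errs) / np.sum(wts)


# Toy data with heavy-tailed noise; the t-based fit is less affected by outliers.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = 1.0 + 2.0 * X[:, 1] + rng.standard_t(df=3, size=200)
print(f"weighted CV error: {weighted_cv_error(X, y):.3f}")
```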


2021 ◽  
Vol 213 ◽  
pp. 106676
Author(s):  
Saeed Mohammadiun ◽  
Guangji Hu ◽  
Abdorreza Alavi Gharahbagh ◽  
Reza Mirshahi ◽  
Jianbing Li ◽  
...  

2021 ◽  
Vol 03 (01) ◽  
pp. 25-31
Author(s):  
Peter Krammer ◽  
Marcel Kvassay ◽  
Ladislav Hluchý

In this article, building on our previous work, we engage in spatiotemporal modelling of transport demand in the Montreal metropolitan area over a period of six years. We employ classical machine learning and regression models, which predict bike-sharing demand in the form of daily cumulative sums of bike trips for each considered docking station. Hourly estimates of demand are then derived from the statistical distribution of demand across the individual hours of an average day. To capture seasonal and other regular variation in demand, longer-term distributional characteristics of bike trips, such as their average number on each day of the week, month of the year, etc., were also used as input attributes. We initially conjectured that weather would be an important source of irregular variation in bike-sharing demand, and subsequently included several available meteorological variables in our models. We validated our models by hold-out and 10-fold cross-validation, with encouraging results.
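A minimal sketch of the validation step, assuming hypothetical input features (long-term weekday and monthly averages plus weather) and simulated data in place of the Montreal records: the same regressor is scored both by a hold-out split and by 10-fold cross-validation.

```python
# Hedged sketch of hold-out and 10-fold cross-validation for a demand regressor.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "dow_mean_trips": rng.uniform(20, 120, n),    # long-term average for that weekday
    "month_mean_trips": rng.uniform(20, 120, n),  # long-term average for that month
    "temperature_c": rng.normal(15, 8, n),
    "precip_mm": rng.exponential(2, n),
})
y = (0.6 * df["dow_mean_trips"] + 0.3 * df["month_mean_trips"]
     + 1.5 * df["temperature_c"] - 4.0 * df["precip_mm"]
     + rng.normal(0, 10, n)).clip(lower=0)        # simulated daily trip counts

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Hold-out validation.
X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.2, random_state=0)
holdout_mae = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))

# 10-fold cross-validation on the full data set.
cv_mae = -cross_val_score(model, df, y, cv=KFold(10, shuffle=True, random_state=0),
                          scoring="neg_mean_absolute_error").mean()

print(f"hold-out MAE: {holdout_mae:.1f}, 10-fold CV MAE: {cv_mae:.1f}")
```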


2017 ◽  
Vol 47 (5) ◽  
Author(s):  
Priscila Becker Ferreira ◽  
Paulo Roberto Nogara Rorato ◽  
Fernanda Cristina Breda ◽  
Vanessa Tomazetti Michelotti ◽  
Alexandre Pires Rosa ◽  
...  

ABSTRACT: This study aimed to test different genotypic and residual covariance matrix structures in random regression models to model the egg production of Barred Plymouth Rock and White Plymouth Rock hens aged between 5 and 12 months. In addition, we estimated broad-sense heritability and environmental and genotypic correlations. Six random regression models were evaluated, and for each model, 12 genotypic and residual matrix structures were tested. The random regression model with a linear intercept, an unstructured covariance (UN) for the random-effects matrix, and an unstructured correlation (UNR) for the residual matrix adequately modeled the egg production curve of hens of the two study breeds. Genotypic correlations ranged from 0.15 (between the ages of 5 and 12 months) to 0.99 (between the ages of 10 and 11 months) and decreased as the interval between ages increased. Egg production heritability between 5 and 12 months of age increased with age, varying from 0.15 to 0.51. From the age of 9 months onward, heritability was moderate, with genotypic correlation estimates higher than 0.90 at the ages of 10, 11, and 12 months. Results suggested that selection of hens to improve egg production should commence at the ninth month of age.
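A hedged, much-simplified illustration of the random-regression idea, not the authors' analysis or their covariance-structure comparison: a linear mixed model with a random intercept and slope per (hypothetical) genetic group, an unstructured covariance for those random effects, and an age-specific heritability-like ratio computed from the variance components of simulated data.

```python
# Simplified random-regression sketch with simulated egg-production data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
groups, ages = 40, np.arange(5, 13)               # genetic groups, ages 5-12 months
rows = []
for g in range(groups):
    a0, a1 = rng.normal(0, 3), rng.normal(0, 0.5) # group-specific intercept and slope
    for age in ages:
        eggs = 20 + 1.2 * age + a0 + a1 * age + rng.normal(0, 2)
        rows.append({"group": g, "age": age, "eggs": eggs})
df = pd.DataFrame(rows)

# Random intercept and slope on age, with an unstructured 2x2 covariance.
res = smf.mixedlm("eggs ~ age", df, groups=df["group"], re_formula="~age").fit()

G = np.asarray(res.cov_re)                 # estimated random-effect covariance
resid_var = res.scale                      # residual variance
for age in ages:
    z = np.array([1.0, age])
    var_g = z @ G @ z                      # group-level variance at this age
    ratio = var_g / (var_g + resid_var)    # crude heritability-like ratio
    print(f"age {age}: ratio = {ratio:.2f}")
```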


2002 ◽  
Vol 14 (10) ◽  
pp. 2439-2468 ◽  
Author(s):  
Aki Vehtari ◽  
Jouko Lampinen

In this work, we discuss practical methods for the assessment, comparison, and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate because it describes the uncertainty in the estimate. The distributions of the expected utility estimates can also be used to compare models, for example, by computing the probability of one model having a better expected utility than some other model. We propose an approach using cross-validation predictive densities to obtain expected utility estimates and Bayesian bootstrap to obtain samples from their distributions. We also discuss the probabilistic assumptions made and properties of two practical cross-validation methods, importance sampling and k-fold cross-validation. As illustrative examples, we use multilayer perceptron neural networks and Gaussian processes with Markov chain Monte Carlo sampling in one toy problem and two challenging real-world problems.
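A minimal sketch of one ingredient of the approach, assuming per-observation utilities (e.g. cross-validation log predictive densities) are already available for two models; the utilities below are simulated placeholders, not results from the paper. The Bayesian bootstrap draws Dirichlet weights over observations to obtain samples of each model's expected utility and the probability that one model is better.

```python
# Bayesian bootstrap over cross-validation utilities for model comparison.
import numpy as np

rng = np.random.default_rng(0)
n = 500
u_a = rng.normal(-1.00, 0.8, n)   # CV log-scores of model A (placeholder values)
u_b = rng.normal(-1.05, 0.8, n)   # CV log-scores of model B (placeholder values)

n_draws = 4000
# Dirichlet(1,...,1) weights over observations; the same weights are applied
# to both models so the comparison is paired.
w = rng.dirichlet(np.ones(n), size=n_draws)   # shape (n_draws, n)
eu_a = w @ u_a                                # samples of expected utility, model A
eu_b = w @ u_b                                # samples of expected utility, model B

print(f"E[utility] A: {eu_a.mean():.3f} +- {eu_a.std():.3f}")
print(f"E[utility] B: {eu_b.mean():.3f} +- {eu_b.std():.3f}")
print(f"P(model A has higher expected utility) = {(eu_a > eu_b).mean():.3f}")
```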

