Does cross validation provide additional information in the evaluation of regression models?

2003 ◽  
Vol 33 (6) ◽  
pp. 976-987 ◽  
Author(s):  
Antal Kozak ◽  
Robert Kozak

A detailed study using seven data sets, two standing tree volume estimating models, and a height–diameter model showed that fit statistics and lack of fit statistics calculated directly from a regression model can be well estimated using simulations of cross validation or double cross validation. These results suggest that cross validation by data splitting and double cross validation provide little, if any, additional information in the process of evaluating regression models.
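As a rough illustration of the comparison described in this abstract, the sketch below fits a straight line to synthetic height–diameter data, computes the root mean square error (RMSE) directly from the full fit, and then repeats the calculation by double cross validation (fit on one half, evaluate on the other, then swap). The data, the linear model form, and the function names are assumptions made for illustration; they are not the data sets or models used by the authors.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmse(xs, ys, a, b):
    """Root mean square error of the line on (xs, ys), dividing by n."""
    sq = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return (sq / len(xs)) ** 0.5


def double_cross_validation(xs, ys, seed=0):
    """Split the data in two halves; fit on each half and test on the other."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    parts = [idx[:half], idx[half:]]
    errors = []
    for train, test in (parts, parts[::-1]):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(rmse([xs[i] for i in test], [ys[i] for i in test], a, b))
    return errors


# Synthetic (hypothetical) data: diameter in cm, height in m.
rng = random.Random(42)
diam = [rng.uniform(10, 60) for _ in range(200)]
height = [5 + 0.4 * d + rng.gauss(0, 2) for d in diam]

a, b = fit_line(diam, height)
full_rmse = rmse(diam, height, a, b)
cv_rmse = double_cross_validation(diam, height)
print("RMSE from full fit:", round(full_rmse, 3))
print("RMSE from the two validation halves:", [round(e, 3) for e in cv_rmse])
```

On well-behaved data like this, the two validation errors tend to sit close to the error computed from the full fit, which is the kind of agreement the abstract reports.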

2021 ◽  
Vol 2 (2) ◽  
pp. 40-47
Author(s):  
Sunil Kumar ◽  
Vaibhav Bhatnagar

Machine learning (ML) is one of the active fields and technologies for realizing artificial intelligence (AI). The complexity of ML algorithms makes it hard to predict which algorithm is best. Because many complex algorithms are available, choosing an appropriate method for finding regression trends, and thereby establishing the association between variables, is very difficult. This paper reviews the different types of regression used in machine learning. There are mainly six types of regression model: linear, logistic, polynomial, ridge, Bayesian linear, and lasso. The paper gives an overview of these models and compares their suitability for machine learning. Data analysis requires establishing an association among the many variables in a data set; such association is essential for forecasting and exploring data, and regression analysis is a procedure for establishing it. The paper focuses on the diverse regression analysis models and how they are used in the context of different data sets in machine learning. Selecting the right model for analysis is the most challenging task, and hence these models are considered thoroughly in this study. When these models are applied appropriately and with accurate data, data exploration and forecasting can provide highly accurate results.


1989 ◽  
Vol 19 (2) ◽  
pp. 179-184 ◽  
Author(s):  
David L. Verbyla ◽  
Richard F. Fisher

The conventional approach in site-quality studies has been to develop a multiple regression site index model with soil–site measurements from randomly selected plots. This approach has several weaknesses: (i) a potential prediction bias associated with most stepwise regression procedures; (ii) low precision of soil–site regression models developed in areas with diverse topography and geologic formations; and (iii) poor representation of rare prime sites by random sampling. An alternative approach, aimed at minimizing these problems, is presented. Prediction bias potential (due to overfitting a model with too many predictor variables) can be reduced by using cross validation during model development. Models that accurately predict prime sites can be more useful than imprecise soil–site regression models. This can be accomplished by stratified random sampling from prime and nonprime site areas. Classification-tree analysis was used to develop a model that predicts prime ponderosa pine (Pinus ponderosa Laws.) sites on the basis of vegetation and soil variables. Forest habitat type, percent sand content, and soil pH were model predictor variables. Cross-validation was used to estimate the accuracy of the classification tree as 88%. A multiple regression model developed from randomly selected plots consistently underestimated site index when it was applied to plots randomly selected from prime site areas. The conventional regression model was also misleading because it contained a predictor variable that was not significantly different between prime and nonprime sites.


Author(s):  
Carlos Alberto Huaira Contreras ◽  
Carlos Cristiano Hasenclever Borges ◽  
Camila Borelli Zeller ◽  
Amanda Romanelli

The paper proposes a weighted cross-validation (WCV) algorithm to select the linear regression model with a change-point, under a scale mixtures of normal (SMN) distribution, that yields the best prediction results. SMN distributions are used to construct regression models that are robust to the influence of outliers on the parameter estimation process. Thus, we relax the usual assumption of normality in regression models and assume that the random errors follow an SMN distribution, specifically the Student-t distribution. In addition, we allow the parameters of the regression model to change from a specific, unknown point, called the change-point. In this context, the estimates of the model parameters, which include the change-point, are obtained via an EM-type (Expectation-Maximization) algorithm. The WCV method is used to select the model that is most robust and offers the smallest prediction error, with the weights taken from the E step of the EM-type algorithm. Finally, numerical examples using simulated and real data (television audience data) are presented to illustrate the proposed methodology.


2021 ◽  
Vol 40 (S1) ◽  
Author(s):  
Fatimah Othman ◽  
Rashidah Ambak ◽  
Mohd Azahadi Omar ◽  
Suzana Shahar ◽  
Noor Safiza Mohd Nor ◽  
...  

Background: Monitoring sodium intake through 24-h urine collection is recommended, but the method can be difficult to implement. The objective of this study was to develop and validate an equation that uses spot urine concentration to predict 24-h sodium excretion in the Malaysian population. Methods: This was a Malaysian Community Salt Study (MyCoSS) sub-study, conducted from October 2017 to March 2018. Of the 798 participants in the MyCoSS study who completed 24-h urine collection, 768 also collected a one-time spot urine sample the following morning. They were randomly assigned to two groups to form separate spot urine equations. The final spot urine equation was derived from the entire data set after the stability of the equation had been confirmed by double cross-validation in both study groups. The new spot urine equation was developed using the coefficients from the multiple linear regression. A Bland-Altman plot was used to measure the mean bias and limits of agreement between estimated and measured 24-h urine sodium. The estimate of sodium intake from the new equation was compared with those from other established equations, namely Tanaka and INTERSALT. Results: The new equation showed the least mean bias between measured and predicted sodium, −0.35 (−72.26, 71.56) mg/day, compared to Tanaka, 629.83 (532.19, 727.47) mg/day, and INTERSALT, 360.82 (284.34, 437.29) mg/day. Sodium predicted by the new equation showed greater correlation with measured sodium (r = 0.50) than Tanaka (r = 0.24) and INTERSALT (r = 0.44), P < 0.05. Conclusion: Our newly developed spot urine equation predicts sodium intake with the least mean bias among the Malaysian population when 24-h urine sodium collection is not feasible.
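The Bland-Altman quantities mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes the conventional definition (mean of the differences, plus and minus 1.96 standard deviations of the differences); the abstract does not state which interval its bracketed values represent, and the sodium values below are invented for the example.

```python
from statistics import mean, stdev


def bland_altman(measured, predicted):
    """Return (mean bias, lower limit, upper limit) of agreement.

    Differences are taken as predicted minus measured; the limits use the
    conventional bias +/- 1.96 * sample standard deviation of the differences.
    """
    diffs = [p - m for m, p in zip(measured, predicted)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical 24-h sodium values in mg/day (not data from the study).
measured = [3000, 3500, 2800, 4100]
predicted = [3100, 3400, 2900, 4000]

bias, lower, upper = bland_altman(measured, predicted)
print("mean bias:", bias)
print("limits of agreement:", round(lower, 2), round(upper, 2))
```

A mean bias near zero with wide limits of agreement would indicate an equation that is accurate on average but imprecise for individuals, so both numbers matter when reading such a plot.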


2021 ◽  
Vol 11 (4) ◽  
pp. 1776
Author(s):  
Young Seo Kim ◽  
Han Young Joo ◽  
Jae Wook Kim ◽  
So Yun Jeong ◽  
Joo Hyun Moon

This study identified the meteorological variables that significantly impact the power generation of a solar power plant in Samcheonpo, Korea. To this end, multiple regression models were developed to estimate the power generation of the solar power plant under changing weather conditions. The meteorological data for the regression models were daily data from January 2011 to December 2019. The dependent variable was the daily power generation of the solar power plant in kWh, and the independent variables were the insolation intensity during daylight hours (MJ/m²), daylight time (h), average relative humidity (%), minimum relative humidity (%), and quantity of evaporation (mm). One regression model for the entire data set and 12 monthly regression models for the monthly data were constructed using R, a software environment for data analysis. The 12 monthly regression models estimated solar power generation better than the model fitted to the entire data set. The variables with the highest influence on solar power generation were the insolation intensity during daylight hours and the daylight time.


2005 ◽  
Vol 01 (01) ◽  
pp. 129-145 ◽  
Author(s):  
XIAOBO ZHOU ◽  
XIAODONG WANG ◽  
EDWARD R. DOUGHERTY

In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables (gene expressions) and the small number of experimental conditions. Many gene-selection and classification methods have been proposed; however, most of these treat gene selection and classification separately, and not under the same model. We propose a Bayesian approach to gene selection using the logistic regression model. The Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL) principle are used in constructing the posterior distribution of the chosen genes. The same logistic regression model is then used for cancer classification. Fast implementation issues for these methods are discussed. The proposed methods are tested on several data sets including those arising from hereditary breast cancer, small round blue-cell tumors, lymphoma, and acute leukemia. The experimental results indicate that the proposed methods show high classification accuracies on these data sets. Some robustness and sensitivity properties of the proposed methods are also discussed. Finally, mixing logistic-regression-based gene selection with other classification methods and mixing logistic-regression-based classification with other gene-selection methods are considered.
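For readers unfamiliar with the criteria named in this abstract, the sketch below computes the standard AIC and BIC from a model's maximized log-likelihood. It only illustrates the textbook definitions; it does not reproduce the authors' Bayesian posterior construction or the MDL principle, and the log-likelihood values and gene-subset labels are invented for the example.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical logistic-regression fits on 40 samples:
# label -> (maximized log-likelihood, number of fitted parameters).
n_obs = 40
candidates = {
    "1 gene": (-30.0, 2),
    "3 genes": (-24.0, 4),
    "8 genes": (-22.5, 9),
}

for label, (ll, k) in candidates.items():
    print(label, "AIC =", round(aic(ll, k), 2), "BIC =", round(bic(ll, k, n_obs), 2))

best_by_aic = min(candidates, key=lambda c: aic(*candidates[c]))
best_by_bic = min(candidates, key=lambda c: bic(*candidates[c], n_obs))
print("preferred by AIC:", best_by_aic, "| preferred by BIC:", best_by_bic)
```

Both criteria reward fit and penalize the number of parameters; BIC's penalty grows with the sample size, so it tends to favour smaller gene subsets than AIC.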


2021 ◽  
Vol 03 (01) ◽  
pp. 25-31
Author(s):  
Peter Krammer ◽  
Marcel Kvassay ◽  
Ladislav Hluchý

In this article, building on our previous work, we engage in spatiotemporal modelling of transport demand in the Montreal metropolitan area over the period of six years. We employ classical machine learning and regression models, which predict bike-sharing demand in the form of daily cumulative sums of bike trips for each considered docking station. Hourly estimates of demand are then determined by considering the statistical distribution of demand across individual hours of an average day. In order to capture seasonal and other regular variation of demand, longer-term distribution characteristics of bike trips, such as their average number falling on each day of the week, month of the year, etc., were also used as input attributes. We initially conjectured that weather would be an important source of irregular variation in bike-sharing demand, and subsequently included several available meteorological variables in our models. We validated our models by Hold-Out and 10-Fold Cross-Validation, with encouraging results.
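The two validation schemes named at the end of this abstract can be sketched in a few lines. The example below is an assumption-laden illustration: the daily trip counts are synthetic, the "model" is a trivial baseline that predicts the training mean, and the helper names are hypothetical. It is not the authors' pipeline.

```python
import random


def hold_out_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and split once into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - test_fraction)))
    return idx[:cut], idx[cut:]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def baseline_mae(values, train, test):
    """Mean absolute error of predicting the training mean on the test set."""
    prediction = sum(values[i] for i in train) / len(train)
    return sum(abs(values[i] - prediction) for i in test) / len(test)


# Synthetic (hypothetical) daily trip counts for one docking station.
rng = random.Random(1)
trips = [100 + 20 * (day % 7 < 5) + rng.gauss(0, 10) for day in range(365)]

train, test = hold_out_indices(len(trips))
hold_out_mae = baseline_mae(trips, train, test)

fold_maes = [baseline_mae(trips, tr, te) for tr, te in k_fold_indices(len(trips))]
cv_mae = sum(fold_maes) / len(fold_maes)

print("hold-out MAE:", round(hold_out_mae, 2))
print("10-fold CV MAE:", round(cv_mae, 2))
```

Hold-out gives one error estimate from a single split, while 10-fold cross-validation averages ten estimates so that every observation is used for testing exactly once.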

