Bootstrapping Linear Models

Author(s):  
Russell Cheng

Bootstrap model selection is proposed for the difficult problem of selecting important factors in non-orthogonal linear models when the number of factors, P, is large. In the method, the full model is first fitted to the original data. Then B parametric bootstrap samples are drawn from the fitted model, and the full model is fitted to each. A submodel is obtained from each fitted full model by rejecting those factors found unimportant in the fit. Each distinct selected submodel is then fitted to the original data and its Mallows Cp statistic calculated. A subset of good submodels is then obtained based on the Cp values. As a reliability check, this subset can also be fitted to the bootstrap samples to see how often each submodel is found to be a good fit. Use of the method is illustrated using a real-data sample.
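The steps in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not say how a factor is judged "unimportant", so the sketch assumes a simple |t| threshold (the hypothetical parameter t_crit), assumes normal errors for the parametric bootstrap, and uses the usual definition Cp = RSS_p / s^2 - n + 2p with s^2 taken from the full model. The function names fit_ols and bootstrap_model_selection are invented for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit returning coefficients, RSS, error variance and t-statistics."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    sigma2 = rss / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    return beta, rss, sigma2, t_stats

def bootstrap_model_selection(X, y, B=200, t_crit=2.0, seed=0):
    """Return distinct submodels (tuples of column indices) ranked by Mallows Cp."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape

    # Step 1: fit the full model to the original data.
    beta_hat, _, sigma2_full, _ = fit_ols(X, y)

    # Step 2: draw B parametric bootstrap samples from the fitted full model,
    # refit the full model to each, and keep the factors that look important.
    # ASSUMPTION: normal errors and a |t| > t_crit rule for "important".
    submodels = set()
    for _ in range(B):
        y_star = X @ beta_hat + rng.normal(0.0, np.sqrt(sigma2_full), size=n)
        _, _, _, t_star = fit_ols(X, y_star)
        keep = tuple(int(j) for j in np.flatnonzero(np.abs(t_star) > t_crit))
        if keep:
            submodels.add(keep)

    # Step 3: fit each distinct submodel to the ORIGINAL data and compute Cp.
    cp = {}
    for keep in submodels:
        _, rss_sub, _, _ = fit_ols(X[:, list(keep)], y)
        cp[keep] = rss_sub / sigma2_full - n + 2 * len(keep)

    # Smallest Cp first; a cut-off on Cp would then define the "good" subset.
    return sorted(cp.items(), key=lambda kv: kv[1])
```

Calling bootstrap_model_selection(X, y) on an n-by-P design matrix returns a list of (submodel, Cp) pairs. The reliability check described in the abstract would refit the leading submodels to each bootstrap sample and count how often each fits well; that step is omitted here for brevity.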


2006 ◽  
Vol 45 (01) ◽  
pp. 44-50 ◽  
Author(s):  
N. H. Augustin ◽  
W. Sauerbrei ◽  
N. Holländer

Summary Objectives: We illustrate a recently proposed two-step bootstrap model averaging (bootstrap MA) approach to cope with model selection uncertainty. The predictive performance is investigated in an example and in a simulation study. Results are compared to those derived from other model selection methods. Methods: In the framework of the linear regression model we use the two-step bootstrap MA, which consists of a screening step to eliminate covariates thought to have no influence on the response, and a model-averaging step. We also apply the full model, variable selection using backward elimination based on Akaike’s Information Criterion (AIC) and on the Bayesian Information Criterion (BIC), and the bagging approach. The predictive performance is measured by the mean squared error (MSE) and the coverage of confidence intervals for the true response. Results: We obtained similar results for all approaches in the example. In the simulation, all approaches reduced the MSE in comparison to the full model. The smallest values were obtained for bootstrap MA. Only the bootstrap MA and the full model attained the nominal coverage. The backward elimination procedures led to substantial underestimation and bagging to an overestimation of the true coverage. The screening step of bootstrap MA eliminates most of the unimportant factors. Conclusion: The new bootstrap MA approach shows promising results for predictive performance. It increases practical usefulness by eliminating unimportant factors in the screening step.





Entropy ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. 807
Author(s):  
Xuan Cao ◽  
Kyoungjae Lee

High-dimensional variable selection is an important research topic in modern statistics. While methods using nonlocal priors have been thoroughly studied for variable selection in linear regression, the crucial high-dimensional model selection properties for nonlocal priors in generalized linear models have not been investigated. In this paper, we consider a hierarchical generalized linear regression model with the product moment nonlocal prior over coefficients and examine its properties. Under standard regularity assumptions, we establish strong model selection consistency in a high-dimensional setting, where the number of covariates is allowed to increase at a sub-exponential rate with the sample size. The Laplace approximation is implemented for computing the posterior probabilities and the shotgun stochastic search procedure is suggested for exploring the posterior space. The proposed method is validated through simulation studies and illustrated by a real data example on functional activity analysis in fMRI study for predicting Parkinson’s disease.



2011 ◽  
Vol 16 (3) ◽  
pp. 263
Author(s):  
Luz Marina Moya-Moya ◽  
Milton Januario Rueda-Varón

Objective. To present a methodology based on Kronecker products that facilitates the construction of the variance-covariance matrix for two- and three-way designs with a balanced data structure, together with an application in R that facilitates its calculation and its use in different areas. Materials and methods. We provide a starting point for people interested in using R for the analysis of variance. Results. We use an application written in R that implements a methodology based on Kronecker products, developed by Moya (2003), to build the covariance matrix for designs with a balanced data structure. We also present an application of the method with real data. Conclusions. With this methodology we can accelerate the development and solution of some practical problems. The proposed methodology can be applied to mixed models with fixed or random effects and any number of factors.

Key words: Kronecker products, variance and covariance matrix, balanced designs, linear models, R Gui.
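To make the Kronecker-product idea concrete, the sketch below builds the covariance matrix of a balanced two-way crossed random-effects design. It is an assumption-laden illustration, not the Moya (2003) method or the authors' R application: it uses the standard textbook decomposition V = var_a (I_a ⊗ J_b ⊗ J_n) + var_b (J_a ⊗ I_b ⊗ J_n) + var_ab (I_a ⊗ I_b ⊗ J_n) + var_e (I_a ⊗ I_b ⊗ I_n), where I is an identity matrix, J is a matrix of ones, and observations are ordered with the replicate index varying fastest. Python is used in place of R, and all function and parameter names are hypothetical.

```python
from functools import reduce

import numpy as np

def kron_all(*mats):
    """Kronecker product of several matrices, taken left to right."""
    return reduce(np.kron, mats)

def two_way_covariance(a, b, n, var_a, var_b, var_ab, var_e):
    """Covariance matrix of y_ijk = mu + A_i + B_j + (AB)_ij + e_ijk.

    a, b : number of levels of factors A and B
    n    : replicates per cell (balanced design)
    Observations are ordered by i, then j, then k (k varies fastest).
    """
    def eye(m):
        return np.eye(m)

    def ones(m):
        return np.ones((m, m))

    return (
        var_a * kron_all(eye(a), ones(b), ones(n))     # shared level of A
        + var_b * kron_all(ones(a), eye(b), ones(n))   # shared level of B
        + var_ab * kron_all(eye(a), eye(b), ones(n))   # shared cell (i, j)
        + var_e * kron_all(eye(a), eye(b), eye(n))     # residual error
    )
```

For example, two_way_covariance(2, 3, 2, 4.0, 2.0, 1.0, 0.5) gives a 12-by-12 matrix in which two observations from the same cell have covariance var_a + var_b + var_ab, and two observations that share no factor level have covariance zero. A three-way design follows the same pattern with one more factor in each Kronecker product.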



Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived both under the null hypotheses and under alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and the extension of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and a discussion.



Genetics ◽  
2000 ◽  
Vol 154 (1) ◽  
pp. 381-395
Author(s):  
Pavel Morozov ◽  
Tatyana Sitnikova ◽  
Gary Churchill ◽  
Francisco José Ayala ◽  
Andrey Rzhetsky

Abstract We propose models for describing replacement rate variation in genes and proteins, in which the profile of relative replacement rates along the length of a given sequence is defined as a function of the site number. We consider here two types of functions, one derived from the cosine Fourier series, and the other from discrete wavelet transforms. The number of parameters used for characterizing the substitution rates along the sequences can be flexibly changed and in their most parameter-rich versions, both Fourier and wavelet models become equivalent to the unrestricted-rates model, in which each site of a sequence alignment evolves at a unique rate. When applied to a few real data sets, the new models appeared to fit data better than the discrete gamma model when compared with the Akaike information criterion and the likelihood-ratio test, although the parametric bootstrap version of the Cox test performed for one of the data sets indicated that the difference in likelihoods between the two models is not significant. The new models are applicable to testing biological hypotheses such as the statistical identity of rate variation profiles among homologous protein families. These models are also useful for determining regions in genes and proteins that evolve significantly faster or slower than the sequence average. We illustrate the application of the new method by analyzing human immunoglobulin and Drosophilid alcohol dehydrogenase sequences.







Author(s):  
Fiorella Pia Salvatore ◽  
Alessia Spada ◽  
Francesca Fortunato ◽  
Demetris Vrontis ◽  
Mariantonietta Fiore

The purpose of this paper is to investigate the determinants influencing the costs of cardiovascular disease in the regional health service in Italy’s Apulia region from 2014 to 2016. Data for patients with acute myocardial infarction (AMI), heart failure (HF), and atrial fibrillation (AF) were collected from the hospital discharge registry. Generalized linear models (GLM), and generalized linear mixed models (GLMM) were used to identify the role of random effects in improving the model performance. The study was based on socio-demographic variables and disease-specific variables (diagnosis-related group, hospitalization type, hospital stay, surgery, and economic burden of the hospital discharge form). Firstly, both models indicated an increase in health costs in 2016, and lower spending values for women (p < 0.001) were shown. GLMM indicates a significant increase in health expenditure with increasing age (p < 0.001). Day-hospital has the lowest cost, surgery increases the cost, and AMI is the most expensive pathology, contrary to AF (p < 0.001). Secondly, AIC and BIC assume the lowest values for the GLMM model, indicating the random effects’ relevance in improving the model performance. This study is the first that considers real data to estimate the economic burden of CVD from the regional health service’s perspective. It appears significant for its ability to provide a large set of estimates of the economic burden of CVD, providing information to managers for health management and planning.


