Two seemingly paradoxical results in linear models: the variance inflation factor and the analysis of covariance

2021 ◽  
Vol 9 (1) ◽  
pp. 1-8
Author(s):  
Peng Ding

Abstract A result from a standard linear model course is that the variance of the ordinary least squares (OLS) coefficient of a variable will never decrease when including additional covariates into the regression. The variance inflation factor (VIF) measures the increase of the variance. Another result from a standard linear model or experimental design course is that including additional covariates in a linear model of the outcome on the treatment indicator will never increase the variance of the OLS coefficient of the treatment at least asymptotically. This technique is called the analysis of covariance (ANCOVA), which is often used to improve the efficiency of treatment effect estimation. So we have two paradoxical results: adding covariates never decreases the variance in the first result but never increases the variance in the second result. In fact, these two results are derived under different assumptions. More precisely, the VIF result conditions on the treatment indicators but the ANCOVA result averages over them. Comparing the estimators with and without adjusting for additional covariates in a completely randomized experiment, I show that the former has smaller variance averaging over the treatment indicators, and the latter has smaller variance at the cost of a larger bias conditioning on the treatment indicators. Therefore, there is no real paradox.
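The VIF mentioned in this abstract can be illustrated numerically. The sketch below is not taken from the paper; it assumes the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing covariate j on an intercept and the remaining covariates. The function name `vif` and the use of NumPy are illustrative choices.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    X holds the covariates only (no intercept column); an intercept is
    added to every auxiliary regression.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Auxiliary regression: covariate j on an intercept and the others.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        centered = target - target.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r_squared)
    return out
```

A VIF of 1 means the covariate is uncorrelated with the others in the sample; larger values indicate that, conditional on the covariates, the variance of its OLS coefficient is inflated by that factor relative to the uncorrelated case.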

Author(s):  
Andrea Onofri ◽  
Niccolò Terzaroli ◽  
Luigi Russi

Key message: A new R software procedure for fixed/random diallel models was developed. We simplified the treatment of diallel schemes by considering them as specific cases, with different parameterisations, of a general linear model.
Abstract Diallel experiments are based on a set of possible crosses between some homozygous (inbred) lines. For these experiments, six main diallel models are available in the literature to quantify genetic effects, such as general combining ability (GCA), specific combining ability (SCA), reciprocal (maternal) effects and heterosis. Those models tend to be presented as separate entities, to be fitted by using specialised software. In this manuscript, we reinforce the idea that diallel models are better regarded as specific cases (different parameterisations) of a general linear model and can be fitted with general-purpose software facilities, as used for all other types of linear models. We start from the estimation of fixed genetic effects within the R environment and try to bridge the gap between diallel models, linear models and ordinary least squares (OLS) estimation. First, we review the main diallel models in the literature. Second, we build a set of tools to enable geneticists, plant/animal breeders and students to fit diallel models by using the most widely known R functions for OLS fitting, i.e. the ‘lm()’ function and related methods. We give three examples to show how diallel models can be built by using the typical process of GLMs and fitted, inspected and processed like all other types of linear models in R. Finally, we give a fourth example to show how our tools can also be used to fit random/mixed effect diallel models in the Bayesian framework.


1998 ◽  
Vol 14 (6) ◽  
pp. 701-743 ◽  
Author(s):  
Frank Kleibergen ◽  
Herman K. van Dijk

Diffuse priors lead to pathological posterior behavior when used in Bayesian analyses of simultaneous equation models (SEM's). This results from the local nonidentification of certain parameters in SEM's. When this a priori known feature is not captured appropriately, it results in an a posteriori favoring of certain specific parameter values that is not the consequence of strong data information but of local nonidentification. We show that a proper consistent Bayesian analysis of a SEM explicitly has to consider the reduced form of the SEM as a standard linear model on which nonlinear (reduced rank) restrictions are imposed, which result from a singular value decomposition. The priors/posteriors of the parameters of the SEM are therefore proportional to the priors/posteriors of the parameters of the linear model under the condition that the restrictions hold. This leads to a framework for constructing priors and posteriors for the parameters of SEM's. The framework is used to construct priors and posteriors for one, two, and three structural equation SEM's. These examples together with a theorem, showing that the reduced forms of SEM's accord with sets of reduced rank restrictions on standard linear models, show how Bayesian analyses of generally specified SEM's can be conducted.


Mathematics ◽  
2020 ◽  
Vol 8 (4) ◽  
pp. 605 ◽  
Author(s):  
Román Salmerón Gómez ◽  
Ainara Rodríguez Sánchez ◽  
Catalina García García ◽  
José García Pérez

The raise regression has been proposed as an alternative to ordinary least squares estimation when a model presents collinearity. In order to analyze whether the problem has been mitigated, it is necessary to develop measures to detect collinearity after the application of the raise regression. This paper extends the concept of the variance inflation factor to be applied in a raise regression. The relevance of this extension is that it can be applied to determine the raising factor which allows an optimal application of this technique. The mean square error is also calculated since the raise regression provides a biased estimator. The results are illustrated by two empirical examples where the application of the raise estimator is compared to the application of the ridge and Lasso estimators that are commonly applied to estimate models with multicollinearity as an alternative to ordinary least squares.
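The effect of raising on the VIF can be sketched numerically. The code below is not the authors' implementation. It assumes the usual definition of the raise regression, in which the chosen column x_k is replaced by x_k + λ·e_k, where e_k is the residual from regressing x_k on an intercept and the other covariates and λ ≥ 0 is the raising factor. All function names are hypothetical.

```python
import numpy as np

def _residual_on_others(X, k):
    """Residual of column k regressed on an intercept and the other columns."""
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, k], rcond=None)
    return X[:, k] - others @ coef

def raise_column(X, k, lam):
    """Return a copy of X with column k raised by lam times its residual (assumed definition)."""
    X = np.asarray(X, dtype=float).copy()
    X[:, k] = X[:, k] + lam * _residual_on_others(X, k)
    return X

def vif_of_column(X, k):
    """VIF of column k, computed as centered total sum of squares over residual sum of squares."""
    X = np.asarray(X, dtype=float)
    resid = _residual_on_others(X, k)
    centered = X[:, k] - X[:, k].mean()
    return (centered @ centered) / (resid @ resid)
```

Under this assumed definition, the residual of the raised column is (1 + λ) times the original residual while the fitted part is unchanged, so the VIF of the raised column equals 1 + (VIF − 1)/(1 + λ)², which decreases towards 1 as λ grows.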


2021 ◽  
Author(s):  
Abolfazl Ghanbari ◽  
Behzad Baradaran ◽  
Hamed Ahmadi ◽  
Maryam Ahmadi

Abstract Background: Within six months of the COVID-19 outbreak, 350279 people were infected and 20125 people died of COVID-19 in Iran. There is an urgent need to identify the indicators that most accurately explain the spread of this disease, in order to control and predict it. Methods: We examined the effect of 36 demographic, economic, environmental, health infrastructure, social, and topographic independent variables on the COVID-19 infection and mortality rates using the ordinary least squares (OLS) model in ArcGIS 10.5. Using the criterion adjusted R-squared > 0.7, we selected 20 variables for the COVID-19 infection rate and 16 variables for the mortality rate. The collinearity problem among the selected variables was resolved using the variance inflation factor (VIF). Then, we fitted the OLS and geographically weighted regression (GWR) models in ArcGIS 10.5. Results: A large number of men, a large population, a lack of specialist doctors, a lack of hospitals, a large urban population, a large number of people aged 65 and over, and a high natural mortality rate had the most prominent impact on increasing the COVID-19 infection rate. Also, a lack of ICU beds, a low number of insured people, a lack of subspecialist physicians, and a lack of hospital beds had the most prominent impact on increasing COVID-19 mortality. The variables with VIF above 7.5 were then removed and, finally, a high rate of incoming immigrants and a lack of nurses were identified as two independent variables to predict the COVID-19 infection rate. In addition, a high rate of incoming immigrants and a high number of doctor consultations were recognized as two variables to predict the mortality rate due to COVID-19. The results of the Akaike information criterion (AIC) and adjusted R-squared showed that both models were appropriate for these analyses. Conclusions: Based on our results, there would be a considerable increase in COVID-19 infection in Kerman, Esfahan, and Kermanshah provinces. In addition, there would be a remarkable decrease in COVID-19 infection in Khuzestan, Lorestan, Azarbayjan Shargi, and Tehran provinces. Regarding COVID-19 mortality, there would be a substantial rise in Fars and Khorasan Razavi provinces. Moreover, our analyses predicted a considerable decrease in COVID-19 mortality in Tehran, Ardebil, Zanjan, Gilan, Golestan, Lorestan, Khuzestan, Bushehr, and Hormozgan provinces.
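The VIF screening step described in the Methods and Results can be sketched outside ArcGIS. The code below is an illustration only, not the authors' workflow. It assumes variables are dropped one at a time, starting with the largest VIF, until every remaining VIF is at or below the threshold of 7.5 quoted in the abstract. The function name `prune_by_vif` is hypothetical, and the VIFs are read from the diagonal of the inverse correlation matrix, a standard identity.

```python
import numpy as np

def prune_by_vif(X, names, threshold=7.5):
    """Iteratively drop the column with the largest VIF until all VIFs <= threshold.

    Assumption: one-at-a-time removal; the abstract does not state the order.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.corrcoef(X[:, keep], rowvar=False)
        # Diagonal of the inverse correlation matrix gives the VIFs.
        vifs = np.diag(np.linalg.inv(corr))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        del keep[worst]
    return [names[i] for i in keep]
```

Removing one variable at a time matters because the VIFs of the remaining variables change after each removal; dropping every variable above the threshold in a single pass can discard more than necessary.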


Author(s):  
K. Lakshmi et al.

The primary objective of this research article is to present the mathematical and statistical aspects of linear models and their characteristic properties. The linear model is the most common type of model used in science, although the term has different meanings depending on the context. A linear model is often preferred to other models, such as a quadratic model, because it is easy to interpret. Moreover, most real-life cases have a linear relationship. Modeling such cases with a linear model enables us to determine the relative influence of one or more independent variables on the dependent variable. In this article, an attempt is made to propose specific forms of simple and multiple linear regression models, and the mathematical aspects of linear models are described extensively. Different types of mathematical models are discussed, and methods of fitting transformed models are proposed. Furthermore, a specific form of the linear statistical model is presented, and the crucial assumptions of the general linear model are discussed extensively. Finally, the method of ordinary least squares estimation of the parameters of a linear model is presented.
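The ordinary least squares estimation mentioned at the end of this abstract can be sketched in a few lines. The code is not from the article. It assumes the standard model y = Xβ + ε with an intercept, estimates β by least squares, and estimates the error variance as the residual sum of squares divided by the residual degrees of freedom. The name `ols_fit` is hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Fit y on an intercept plus the columns of X by ordinary least squares.

    Returns (beta, sigma2): the coefficient vector (intercept first) and the
    unbiased estimate of the error variance.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    # lstsq minimises the residual sum of squares ||y - design @ beta||^2.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = len(y) - design.shape[1]
    sigma2 = (resid @ resid) / dof
    return beta, sigma2
```

Passing a one-dimensional `X` gives simple linear regression; passing a two-dimensional array with one column per covariate gives multiple linear regression.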


Author(s):  
V. G. Jemilohun

This study investigates the impact of violating the assumptions of the hierarchical linear model when a level-1 covariate is collinear, under both the correct functional form and an omitted-variable model. This was carried out via Monte Carlo simulation, in which omitted variable bias was introduced. The study considers the effects of multicollinearity when the models are in the correct form and when they are not. A multicollinearity test using the variance inflation factor (VIF) was also carried out to determine whether multicollinearity is present in the data set. The results show that an omitted variable has a tremendous impact on the hierarchical linear model.


2021 ◽  
pp. 139-180
Author(s):  
Justin C. Touchon

Chapter 6 continues exploring the statistics covered by the linear model, namely two-way and three-way ANOVA, linear regression and analysis of covariance (ANCOVA). For each type of model, the chapter describes in detail how to interpret the summary output, including how to interpret and plot interactions. Conducting post-hoc analyses and using the predict() function are also covered. The chapter ends by reinforcing earlier plotting skills in ggplot2 by walking through an example of making a professional-looking figure with multiple non-linear regression curves and confidence intervals.


2006 ◽  
Vol 82 (2) ◽  
pp. 233-239 ◽  
Author(s):  
O. González-Recio ◽  
Y.M. Chang ◽  
D. Gianola ◽  
K. A. Weigel

Abstract Days open data from 113 569 lactation records in 774 Spanish Holstein herds were analysed using standard linear models under two different editing procedures, and with two alternative methodologies that account for censoring: a censored linear model (CLM) and a Weibull survival analysis (SA) model. The first editing procedure excluded from the linear model all censored records for days open (LMnc), and the second defined days open as days from calving to the last known insemination or culling date, treating censored records as complete (LM). Sire variance estimates for days open were 61, 70 and 139 for LMnc, LM and CLM, respectively, and 0·026 for SA on a logarithmic scale. Heritability estimates were 0·05, 0·06 and 0·08 with LMnc, LM and CLM, respectively. Rankings of sires varied between methodologies: sire evaluations from LMnc and LM had rank correlations with evaluations from SA equal to −0·65 and −0·82, respectively, and of 0·71 and 0·87 with evaluations from CLM. The rank correlation between evaluations from SA and CLM was −0·98, suggesting stronger agreement of sire rankings between models that take censoring into account. The SA model had a better predictive ability of daughter fertility at early stages of lactation than the other methods, as measured by chi-squared statistics for predicted pregnancy status at 75, 103, 140, or 200 days post partum in a split data set. The CLM also predicted daughter fertility more accurately than either of the two standard linear models.


2016 ◽  
Vol 4 (1) ◽  
Author(s):  
Shuangzhe Liu ◽  
Tiefeng Ma ◽  
Yonghui Liu

Abstract In this work, we consider the general linear model or its variants with the ordinary least squares, generalised least squares or restricted least squares estimators of the regression coefficients and variance. We propose a newly unified set of definitions for local sensitivity for both situations, one for the estimators of the regression coefficients, and the other for the estimators of the variance. Based on these definitions, we present the estimators’ sensitivity results. We include brief remarks on possible links of these definitions and sensitivity results to local influence and other existing results.


2021 ◽  
Vol 17 (5) ◽  
pp. 636-646
Author(s):  
Shelan Saied Ismaeel ◽  
Habshah Midi ◽  
Muhammed Sani

It is now evident that high leverage points (HLPs) can induce a multicollinearity pattern in the data of a fixed effect panel data model. The observations responsible for this phenomenon are called high leverage collinearity-enhancing observations (HLCEOs). The commonly used within-group ordinary least squares (WOLS) estimator for the parameters of the fixed effect panel data model is easily affected by HLCEOs. In their presence, the WOLS estimates may have large variances, which would lead to erroneous interpretation. Therefore, it is imperative to detect multicollinearity caused by HLCEOs. The classical variance inflation factor (CVIF) is the commonly used diagnostic for detecting multicollinearity in panel data. However, it does not correctly diagnose multicollinearity in the presence of HLCEOs. Hence, in this paper three new robust methods for diagnosing multicollinearity in panel data are proposed, namely RVIF (WGM-FIMGT), RVIF (WGM-DRGP) and RVIF (WMM), and their performance is compared with that of the CVIF. The numerical evidence shows that the CVIF incorrectly diagnoses multicollinearity, whereas the proposed methods correctly diagnose no multicollinearity in the presence of HLCEOs, with RVIF (WGM-FIMGT) being the best method as it has the shortest computational running time.

