Consequences of ignoring clustering in linear regression

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Georgia Ntani ◽  
Hazel Inskip ◽  
Clive Osmond ◽  
David Coggon

Abstract
Background: Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods ignoring clustering are often employed. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression could lead to importantly erroneous conclusions.
Methods: We simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and ordinary least squares (OLS) models and compared effect estimates with the “true” value that had been used in the simulation. We also assessed the relative precision of effect estimates, and explored the extent to which coverage by 95% confidence intervals and Type I error rates were appropriate.
Results: We found that effect estimates from both types of regression model were on average unbiased. However, deviations from the “true” value were greater when the outcome variable was more clustered. For a continuous explanatory variable, they tended also to be greater for the OLS than the RI model, and when the explanatory variable was less clustered. The precision of effect estimates from the OLS model was overestimated when the explanatory variable varied more between than within clusters, and somewhat underestimated when the explanatory variable was less clustered. The cluster-unadjusted model gave poor coverage by 95% confidence intervals and high Type I error rates when the explanatory variable was continuous. With a binary explanatory variable, coverage by 95% confidence intervals and Type I error rates deviated from nominal values when the outcome variable was more clustered, but the direction of the deviation varied with the overall prevalence of the explanatory variable and the extent to which it was clustered.
Conclusions: In this study we identified circumstances in which application of an OLS regression model to clustered data is more likely to mislead statistical inference. The potential for error is greatest when the explanatory variable is continuous and the outcome variable is more clustered (intraclass correlation coefficient ≥ 0.01).
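The inflation of Type I error described above can be reproduced with a small simulation. The sketch below (not the authors' code; the cluster count, cluster size, and ICC value are illustrative assumptions) generates random-intercept data with a cluster-level continuous covariate under a true null effect, fits naive OLS, and reports how often the cluster-ignoring t-test falsely rejects at the 5% level:

```python
import numpy as np

def type1_error_ols_clustered(n_clusters=20, cluster_size=20,
                              icc=0.3, n_sims=300, seed=0):
    """Empirical Type I error of the naive OLS t-test for a
    cluster-level continuous covariate under a true null (beta = 0)."""
    rng = np.random.default_rng(seed)
    n = n_clusters * cluster_size
    sigma_u = np.sqrt(icc)           # between-cluster SD of random intercepts
    sigma_e = np.sqrt(1.0 - icc)     # within-cluster residual SD
    rejections = 0
    for _ in range(n_sims):
        x_cl = rng.normal(size=n_clusters)               # cluster-level covariate
        u = rng.normal(scale=sigma_u, size=n_clusters)   # random intercepts
        x = np.repeat(x_cl, cluster_size)
        y = np.repeat(u, cluster_size) + rng.normal(scale=sigma_e, size=n)
        # OLS slope and its naive (cluster-ignoring) standard error
        xc = x - x.mean()
        beta = xc @ y / (xc @ xc)
        resid = y - y.mean() - beta * xc
        se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
        # two-sided test at the 5% level (normal approximation)
        if abs(beta / se) > 1.96:
            rejections += 1
    return rejections / n_sims
```

With an ICC of 0.3 and the covariate varying only between clusters, the empirical rejection rate is far above the nominal 5%, matching the poor Type I error control reported for the cluster-unadjusted model.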


1996 ◽  
Vol 21 (2) ◽  
pp. 169-178 ◽  
Author(s):  
William T. Coombs ◽  
James Algina

Type I error rates for the Johansen test were estimated using simulated data for a variety of conditions. The design of the experiment was a 2 × 2 × 2 × 3 × 9 × 3 factorial. The factors were (a) type of distribution, (b) number of dependent variables, (c) number of groups, (d) ratio of the smallest sample size to the number of dependent variables, (e) sample size ratios, and (f) degree of heteroscedasticity. The results indicate that Type I error rates for the Johansen test depend heavily on the number of groups and the ratio of the smallest sample size to the number of dependent variables. Type I error rates depend to a lesser extent on the distribution types used in the study. Based on the results, sample size guidelines are presented.
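The six-factor design crosses 648 simulation conditions in total. A quick enumeration illustrates the layout (the level labels below are placeholders — only the factor counts 2, 2, 2, 3, 9, and 3 come from the abstract):

```python
from itertools import product

# Hypothetical level labels for the six factors of the 2x2x2x3x9x3 design;
# only the number of levels per factor is taken from the abstract.
distributions = ["dist_1", "dist_2"]                   # (a) 2 levels
n_dependent = ["p_1", "p_2"]                           # (b) 2 levels
n_groups = ["k_1", "k_2"]                              # (c) 2 levels
nmin_to_p_ratios = ["r_1", "r_2", "r_3"]               # (d) 3 levels
sample_size_ratios = [f"s_{i}" for i in range(1, 10)]  # (e) 9 levels
heteroscedasticity = ["h_1", "h_2", "h_3"]             # (f) 3 levels

conditions = list(product(distributions, n_dependent, n_groups,
                          nmin_to_p_ratios, sample_size_ratios,
                          heteroscedasticity))
print(len(conditions))  # 2 * 2 * 2 * 3 * 9 * 3 = 648 crossed conditions
```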


2017 ◽  
Vol 43 (1) ◽  
pp. 115-131 ◽  
Author(s):  
Marc J. Lanovaz ◽  
Patrick Cardinal ◽  
Mary Francis

Although visual inspection remains common in the analysis of single-case designs, the lack of agreement between raters is an issue that may seriously compromise its validity. Thus, the purpose of our study was to develop and examine the properties of a simple structured criterion to supplement the visual analysis of alternating-treatment designs. To this end, we generated simulated data sets with varying numbers of points, numbers of conditions, effect sizes, and autocorrelations, and then measured the Type I error rates and power produced by the visual structured criterion (VSC) and by permutation analyses. We also validated the results for Type I error rates using nonsimulated data. Overall, our results indicate that using the VSC as a supplement for the analysis of systematically alternating-treatment designs with at least five points per condition generally provides adequate control over Type I error rates and sufficient power to detect most behavior changes.
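A permutation analysis of the kind used as a comparator here can be sketched as a generic two-condition mean-difference test (a minimal sketch, not the specific procedure from the study):

```python
import numpy as np

def permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in condition means,
    by randomly reassigning observations to the two conditions."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(np.mean(a) - np.mean(b))
    na = len(a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of points to conditions
        diff = abs(pooled[:na].mean() - pooled[na:].mean())
        if diff >= observed:
            count += 1
    # add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)
```

For a systematically alternating design an exact test would enumerate only the admissible alternation orders rather than all reassignments; the random-shuffle version above is the simpler Monte Carlo approximation.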


2019 ◽  
Vol 14 (2) ◽  
pp. 399-425 ◽  
Author(s):  
Haolun Shi ◽  
Guosheng Yin

2014 ◽  
Vol 38 (2) ◽  
pp. 109-112 ◽  
Author(s):  
Daniel Furtado Ferreira

Sisvar is a statistical analysis system widely used by the scientific community to produce statistical analyses and scientific conclusions. Its wide adoption is due to its accuracy, precision, simplicity and robustness. Among its many analysis options, one that is not so widely used is multiple comparison via bootstrap approaches. This paper aims to review this subject and to show some advantages of using Sisvar to perform such analyses to compare treatment means. Tests like Dunnett, Tukey, Student-Newman-Keuls and Scott-Knott, performed alternatively by bootstrap methods, show greater power and better control of experimentwise Type I error rates under non-normal, asymmetric, platykurtic or leptokurtic distributions.
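The resampling idea behind such bootstrap procedures can be sketched with a max-statistic test over all pairwise comparisons, which controls the experimentwise error rate by calibrating against the largest statistic under a resampled null (a generic sketch, not Sisvar's algorithm):

```python
import numpy as np

def bootstrap_pairwise(groups, n_boot=500, alpha=0.05, seed=0):
    """Max-|t| bootstrap over all pairwise mean comparisons.
    Groups are centered to impose the global null, then resampled
    with replacement to estimate the critical value of the largest
    pairwise |t| statistic."""
    rng = np.random.default_rng(seed)
    k = len(groups)
    centered = [g - g.mean() for g in groups]  # enforce the global null

    def max_abs_t(samples):
        stats = []
        for i in range(k):
            for j in range(i + 1, k):
                a, b = samples[i], samples[j]
                se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
                stats.append(abs(a.mean() - b.mean()) / se)
        return max(stats)

    observed = max_abs_t(groups)
    boots = [max_abs_t([rng.choice(c, size=len(c), replace=True)
                        for c in centered]) for _ in range(n_boot)]
    crit = float(np.quantile(boots, 1 - alpha))
    return observed, crit, observed > crit
```

Because the critical value comes from the resampled data rather than a normal-theory distribution, this approach adapts to asymmetric, platykurtic or leptokurtic error distributions, which is the advantage the abstract attributes to the bootstrap variants of these tests.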

