Inference on the marginal distribution of clustered data with informative cluster size

Abstract

Background: Clustered data arise in research when patients are clustered within larger units. Generalised Estimating Equations (GEE) and Generalised Linear Mixed Models (GLMM) can be used to provide marginal and cluster-specific inference and predictions, respectively.

Methods: Confounding by Cluster (CBC) and Informative Cluster Size (ICS) are two complications that may arise when modelling clustered data. CBC can arise when the distribution of a predictor variable (termed 'exposure') varies between clusters, causing confounding of the exposure-outcome relationship. ICS means that the cluster size, conditional on covariates, is not independent of the outcome. In both situations, standard GEE and GLMM may provide biased or misleading inference, and modifications have been proposed. However, both CBC and ICS are routinely overlooked in the context of risk prediction, and their impact on the predictive ability of the models has been little explored. We study the effect of CBC and ICS on the predictive ability of risk models for binary outcomes when GEE and GLMM are used. We examine whether two simple approaches to handling CBC and ICS, which involve adjusting for the cluster mean of the exposure and the cluster size, respectively, can improve the accuracy of predictions.

Results: Both CBC and ICS can be viewed as violations of the assumptions of the standard GLMM: the random effects are correlated with the exposure under CBC and with the cluster size under ICS. Based on these principles, we simulated data subject to CBC/ICS. The simulation studies suggested that the predictive ability of models derived from standard GLMM and GEE that ignore CBC/ICS was affected. Marginal predictions were found to be miscalibrated. Adjusting for the cluster mean of the exposure or the cluster size improved the calibration, discrimination and overall predictive accuracy of marginal predictions by explaining part of the between-cluster variability. The presence of CBC/ICS did not affect the accuracy of conditional predictions. We illustrate these concepts using real data from a multicentre study with potential CBC.

Conclusion: Ignoring CBC and ICS when developing prediction models for clustered data can affect the accuracy of marginal predictions. Adjusting for the cluster mean of the exposure or the cluster size can improve the predictive accuracy of marginal predictions.
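
The two adjustments named in the abstract (the cluster mean of the exposure and the cluster size) are cluster-level covariates derived from the data themselves. The following is a minimal sketch of how they might be computed before model fitting. The row layout, the key names `cluster` and `x`, and the function name `add_cluster_covariates` are assumptions made for illustration and do not come from the paper; no model is fitted here.

```python
from collections import defaultdict


def add_cluster_covariates(rows):
    """Return copies of rows with two extra cluster-level covariates.

    Each row is a dict with hypothetical keys "cluster" (cluster label)
    and "x" (the exposure). These names are illustrative assumptions.
    """
    # Collect the exposure values observed in each cluster.
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row["x"])

    augmented = []
    for row in rows:
        xs = by_cluster[row["cluster"]]
        augmented.append({
            **row,
            "x_cluster_mean": sum(xs) / len(xs),  # candidate adjustment for CBC
            "cluster_size": len(xs),              # candidate adjustment for ICS
        })
    return augmented


rows = [
    {"cluster": "A", "x": 1.0},
    {"cluster": "A", "x": 3.0},
    {"cluster": "B", "x": 10.0},
]
augmented = add_cluster_covariates(rows)
```

The augmented rows could then be passed to whatever GEE or GLMM routine is in use, with `x_cluster_mean` or `cluster_size` entered as an additional covariate.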

Review of methods for handling confounding by cluster and informative cluster size in clustered data

Statistics in Medicine ◽ 10.1002/sim.6277 ◽ 2014 ◽ Vol 33 (30) ◽ pp. 5371-5387 ◽ Cited By ~ 29

Author(s): Shaun Seaman ◽ Menelaos Pavlou ◽ Andrew Copas

Keyword(s): Cluster Size ◽ Clustered Data ◽ Informative Cluster Size

Variance estimation in tests of clustered categorical data with informative cluster size

Statistical Methods in Medical Research ◽ 10.1177/0962280220928572 ◽ 2020 ◽ Vol 29 (11) ◽ pp. 3396-3408 ◽ Cited By ~ 1

Author(s): Mary Gregg ◽ Somnath Datta ◽ Doug Lorenz

Keyword(s): Categorical Data ◽ Cluster Size ◽ Goodness Of Fit ◽ Spinal Cord Injuries ◽ Variance Estimation ◽ Clustered Data ◽ Outcome Variable ◽ Data Set ◽ Variance Estimators ◽ Informative Cluster Size

In the analysis of clustered data, inverse cluster size weighting has been shown to be resistant to the potentially biasing effects of informative cluster size, where the number of observations within a cluster is associated with the outcome variable of interest. The method of inverse cluster size reweighting has been implemented to establish clustered data analogues of common tests for independent data, but the method has yet to be extended to tests of categorical data. Many variance estimators have been implemented across established cluster-weighted tests, but the potential effects of differing methods on test performance have not previously been explored. Here, we develop cluster-weighted estimators of marginal proportions that remain unbiased under informativeness, and derive analogues of three popular tests for clustered categorical data: the one-sample proportion, goodness-of-fit, and independence chi-square tests. We construct these tests using several variance estimators and show substantial differences in the performance of cluster-weighted tests depending on the variance estimation technique, with variance estimators constructed under the null hypothesis maintaining size closest to nominal. We illustrate the proposed tests through an application to a data set of functional measures from patients with spinal cord injuries participating in a rehabilitation program.
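
The idea of inverse cluster size weighting can be illustrated with a marginal proportion. In the sketch below, each observation receives weight 1/n_i, where n_i is the size of its cluster, so the estimate reduces to the average of the within-cluster proportions. This is a simplified illustration only: the function names and the toy data are assumptions, and the variance estimators and tests developed in the paper are not reproduced.

```python
def pooled_proportion(clusters):
    """Unweighted proportion: every observation counts equally."""
    total = sum(len(c) for c in clusters)
    return sum(sum(c) for c in clusters) / total


def cluster_weighted_proportion(clusters):
    """Inverse cluster size weighted proportion.

    Weighting each observation by 1/n_i is the same as averaging the
    within-cluster proportions, so every cluster counts equally.
    """
    return sum(sum(c) / len(c) for c in clusters) / len(clusters)


# Toy data (hypothetical): the large cluster has only successes,
# the small cluster has only failures, so size is informative.
clusters = [[1, 1, 1, 1], [0]]
pooled = pooled_proportion(clusters)              # 4 successes out of 5 observations
weighted = cluster_weighted_proportion(clusters)  # average of 1.0 and 0.0
```

In this toy example the pooled estimate is dominated by the large cluster, while the weighted estimate treats both clusters equally, which is the sense in which the target is a typical observation from a typical cluster.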

Within-cluster resampling for multilevel models under informative cluster size

Biometrika ◽ 10.1093/biomet/asz035 ◽ 2019 ◽ Vol 106 (4) ◽ pp. 965-972

Author(s): D Lee ◽ J K Kim ◽ C J Skinner

Keyword(s): Maximum Likelihood ◽ Cluster Size ◽ Multilevel Model ◽ Multilevel Models ◽ Fixed Number ◽ Regression Coefficients ◽ Likelihood Estimator ◽ Correct Model ◽ Resampling Method ◽ Informative Cluster Size

Summary: A within-cluster resampling method is proposed for fitting a multilevel model in the presence of informative cluster size. Our method is based on the idea of removing the information in the cluster sizes by drawing bootstrap samples which contain a fixed number of observations from each cluster. We then estimate the parameters by maximizing an average, over the bootstrap samples, of a suitable composite loglikelihood. The consistency of the proposed estimator is shown and does not require that the correct model for cluster size is specified. We give an estimator of the covariance matrix of the proposed estimator, and a test for the noninformativeness of the cluster sizes. A simulation study shows, as in Neuhaus & McCulloch (2011), that the standard maximum likelihood estimator exhibits little bias for some regression coefficients. However, for those parameters which exhibit nonnegligible bias, the proposed method is successful in correcting for this bias.
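
The resampling step described above, drawing a fixed number of observations from each cluster so that cluster size no longer carries information, can be sketched as follows. This is a deliberately simplified stand-in: it averages a plain sample mean over the resamples instead of maximizing an averaged composite loglikelihood for a multilevel model as the paper does. The function name, the parameters `m`, `n_resamples` and `seed`, and the toy data are all assumptions made for illustration.

```python
import random


def within_cluster_resample_mean(clusters, m=1, n_resamples=200, seed=0):
    """Average a simple mean over within-cluster resamples.

    In each resample, m observations are drawn with replacement from
    every cluster, so all clusters contribute equally regardless of size.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = []
        for cluster in clusters:
            # Fixed number of draws per cluster, with replacement.
            sample.extend(rng.choices(cluster, k=m))
        estimates.append(sum(sample) / len(sample))
    return sum(estimates) / len(estimates)


# Toy data (hypothetical): outcome is constant within each cluster.
clusters = [[1, 1, 1, 1], [0]]
estimate = within_cluster_resample_mean(clusters, m=1)
```

Because the outcome is constant within each toy cluster, every resample contains one 1 and one 0, so the resampled estimate does not depend on the cluster sizes.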

Marginal Analyses of Clustered Data When Cluster Size Is Informative

Biometrics ◽ 10.1111/1541-0420.00005 ◽ 2003 ◽ Vol 59 (1) ◽ pp. 36-42 ◽ Cited By ~ 121

Author(s): John M. Williamson ◽ Somnath Datta ◽ Glen A. Satten

Keyword(s): Cluster Size ◽ Clustered Data

Regression analysis of clustered interval-censored failure time data with linear transformation models in the presence of informative cluster size

Journal of Nonparametric Statistics ◽ 10.1080/10485252.2018.1469755 ◽ 2018 ◽ Vol 30 (3) ◽ pp. 703-715

Author(s): Hui Zhao ◽ Chenchen Ma ◽ Junlong Li ◽ Jianguo Sun

Keyword(s): Regression Analysis ◽ Linear Transformation ◽ Cluster Size ◽ Failure Time ◽ Failure Time Data ◽ Time Data ◽ Transformation Models ◽ Linear Transformation Models ◽ Interval Censored ◽ Informative Cluster Size

Inferring marginal association with paired and unpaired clustered data

Statistical Methods in Medical Research ◽ 10.1177/0962280216669184 ◽ 2016 ◽ Vol 27 (6) ◽ pp. 1806-1817 ◽ Cited By ~ 2

Author(s): Douglas J Lorenz ◽ Steven Levy ◽ Somnath Datta

Keyword(s): Marginal Distribution ◽ Clustered Data ◽ Dental Fluorosis ◽ Correlation Coefficients ◽ Continuous Variable ◽ Early Age ◽ Marginal Analysis ◽ Bivariate Correlation ◽ Marginal Association ◽ Cluster Level

In the marginal analysis of clustered data, where the marginal distribution of interest is that of a typical observation within a typical cluster, analysis by reweighting has been introduced as a useful tool for estimating parameters of these marginal distributions. Such reweighting methods have foundation in within-cluster resampling schemes that marginalize potential informativeness due to cluster size or within-cluster covariate distribution, to which reweighting methods are asymptotically equivalent. In this paper, we introduce a reweighting scheme for the marginal analysis of clustered data that generalizes prior reweighting methods, with a particular application to measuring bivariate correlation in unpaired clustered data, in which observations of two random variables are not naturally paired at the within-cluster level. We develop unpaired clustered data analogs of well-known product moment correlation coefficients (Pearson, Spearman, phi), as well as the polyserial coefficient for measuring correlation between one discrete and one continuous variable. We evaluate the performance of these coefficients via a simulation study and demonstrate their use by finding no statistically significant association between dental caries at an early age and dental fluorosis at age 13 using a large dental dataset.
