Variance estimation in tests of clustered categorical data with informative cluster size

In the analysis of clustered data, inverse cluster size weighting has been shown to be resistant to the potentially biasing effects of informative cluster size, where the number of observations within a cluster is associated with the outcome variable of interest. The method of inverse cluster size reweighting has been implemented to establish clustered data analogues of common tests for independent data, but the method has yet to be extended to tests of categorical data. Many variance estimators have been implemented across established cluster-weighted tests, but potential effects of differing methods on test performance has not previously been explored. Here, we develop cluster-weighted estimators of marginal proportions that remain unbiased under informativeness, and derive analogues of three popular tests for clustered categorical data, the one-sample proportion, goodness of fit, and independence chi square tests. We construct these tests using several variance estimators and show substantial differences in the performance of cluster-weighted tests based on variance estimation technique, with variance estimators constructed under the null hypothesis maintaining size closest to nominal. We illustrate the proposed tests through an application to a data set of functional measures from patients with spinal cord injuries participating in a rehabilitation program.

Download Full-text

Risk prediction in multicentre studies when there is confounding by cluster or informative cluster size

BMC Medical Research Methodology ◽

10.1186/s12874-021-01321-x ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Menelaos Pavlou ◽

Gareth Ambler ◽

Rumana Z. Omar

Keyword(s):

Risk Prediction ◽

Cluster Size ◽

Linear Models ◽

Prediction Models ◽

Predictive Accuracy ◽

Clustered Data ◽

Predictor Variable ◽

Simulated Data ◽

Predictive Ability ◽

Informative Cluster Size

Abstract Background Clustered data arise in research when patients are clustered within larger units. Generalised Estimating Equations (GEE) and Generalised Linear Models (GLMM) can be used to provide marginal and cluster-specific inference and predictions, respectively. Methods Confounding by Cluster (CBC) and Informative cluster size (ICS) are two complications that may arise when modelling clustered data. CBC can arise when the distribution of a predictor variable (termed ‘exposure’), varies between clusters causing confounding of the exposure-outcome relationship. ICS means that the cluster size conditional on covariates is not independent of the outcome. In both situations, standard GEE and GLMM may provide biased or misleading inference, and modifications have been proposed. However, both CBC and ICS are routinely overlooked in the context of risk prediction, and their impact on the predictive ability of the models has been little explored. We study the effect of CBC and ICS on the predictive ability of risk models for binary outcomes when GEE and GLMM are used. We examine whether two simple approaches to handle CBC and ICS, which involve adjusting for the cluster mean of the exposure and the cluster size, respectively, can improve the accuracy of predictions. Results Both CBC and ICS can be viewed as violations of the assumptions in the standard GLMM; the random effects are correlated with exposure for CBC and cluster size for ICS. Based on these principles, we simulated data subject to CBC/ICS. The simulation studies suggested that the predictive ability of models derived from using standard GLMM and GEE ignoring CBC/ICS was affected. Marginal predictions were found to be mis-calibrated. Adjusting for the cluster-mean of the exposure or the cluster size improved calibration, discrimination and the overall predictive accuracy of marginal predictions, by explaining part of the between cluster variability. The presence of CBC/ICS did not affect the accuracy of conditional predictions. We illustrate these concepts using real data from a multicentre study with potential CBC. Conclusion Ignoring CBC and ICS when developing prediction models for clustered data can affect the accuracy of marginal predictions. Adjusting for the cluster mean of the exposure or the cluster size can improve the predictive accuracy of marginal predictions.

Download Full-text

Causal inference for recurrent event data using pseudo-observations

Biostatistics ◽

10.1093/biostatistics/kxaa020 ◽

2020 ◽

Author(s):

Chien-Lin Su ◽

Robert W Platt ◽

Jean-François Plante

Keyword(s):

Goodness Of Fit ◽

Variance Estimation ◽

Recurrent Event ◽

Event Data ◽

Robust Estimators ◽

Recurrent Event Data ◽

Finite Sample ◽

Data Set ◽

Asymptotically Normal ◽

Doubly Robust

Summary Recurrent event data are commonly encountered in observational studies where each subject may experience a particular event repeatedly over time. In this article, we aim to compare cumulative rate functions (CRFs) of two groups when treatment assignment may depend on the unbalanced distribution of confounders. Several estimators based on pseudo-observations are proposed to adjust for the confounding effects, namely inverse probability of treatment weighting estimator, regression model-based estimators, and doubly robust estimators. The proposed marginal regression estimator and doubly robust estimators based on pseudo-observations are shown to be consistent and asymptotically normal. A bootstrap approach is proposed for the variance estimation of the proposed estimators. Model diagnostic plots of residuals are presented to assess the goodness-of-fit for the proposed regression models. A family of adjusted two-sample pseudo-score tests is proposed to compare two CRFs. Simulation studies are conducted to assess finite sample performance of the proposed method. The proposed technique is demonstrated through an application to a hospital readmission data set.

Download Full-text

Review of methods for handling confounding by cluster and informative cluster size in clustered data

Statistics in Medicine ◽

10.1002/sim.6277 ◽

2014 ◽

Vol 33 (30) ◽

pp. 5371-5387 ◽

Cited By ~ 29

Author(s):

Shaun Seaman ◽

Menelaos Pavlou ◽

Andrew Copas

Keyword(s):

Cluster Size ◽

Clustered Data ◽

Informative Cluster Size

Download Full-text

Optimal two-stage sampling for mean estimation in multilevel populations when cluster size is informative

Statistical Methods in Medical Research ◽

10.1177/0962280220952833 ◽

2020 ◽

pp. 096228022095283

Author(s):

Francesco Innocenti ◽

Math JJM Candel ◽

Frans ES Tan ◽

Gerard JP van Breukelen

Keyword(s):

Cluster Size ◽

Simple Random Sampling ◽

Equal Probability ◽

Outcome Variable ◽

Two Stage ◽

Mean Estimation ◽

Informative Cluster Size ◽

The Mean ◽

Two Stages ◽

Number Of Individuals

To estimate the mean of a quantitative variable in a hierarchical population, it is logistically convenient to sample in two stages (two-stage sampling), i.e. selecting first clusters, and then individuals from the sampled clusters. Allowing cluster size to vary in the population and to be related to the mean of the outcome variable of interest (informative cluster size), the following competing sampling designs are considered: sampling clusters with probability proportional to cluster size, and then the same number of individuals per cluster; drawing clusters with equal probability, and then the same percentage of individuals per cluster; and selecting clusters with equal probability, and then the same number of individuals per cluster. For each design, optimal sample sizes are derived under a budget constraint. The three optimal two-stage sampling designs are compared, in terms of efficiency, with each other and with simple random sampling of individuals. Sampling clusters with probability proportional to size is recommended. To overcome the dependency of the optimal design on unknown nuisance parameters, maximin designs are derived. The results are illustrated, assuming probability proportional to size sampling of clusters, with the planning of a hypothetical survey to compare adolescent alcohol consumption between France and Italy.

Download Full-text

Inference on the marginal distribution of clustered data with informative cluster size

Statistical Papers ◽

10.1007/s00362-013-0504-3 ◽

2013 ◽

Vol 55 (1) ◽

pp. 71-92 ◽

Cited By ~ 12

Author(s):

Jaakko Nevalainen ◽

Somnath Datta ◽

Hannu Oja

Keyword(s):

Cluster Size ◽

Marginal Distribution ◽

Clustered Data ◽

Informative Cluster Size

Download Full-text

Simple Index to Assess the Calibration Quality of Safety Performance Functions Based on Multiple Goodness-of-Fit Metrics

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/03611981211008896 ◽

2021 ◽

pp. 036119812110088

Author(s):

Raul E. Avelar ◽

Karen Dixon ◽

Boniphace Kutela ◽

Sam Klump ◽

Beth Wemple ◽

...

Keyword(s):

Goodness Of Fit ◽

Synthetic Data ◽

Calibration Procedure ◽

Safety Performance ◽

Absolute Deviation ◽

Data Set ◽

Safety Database ◽

Simple Index ◽

Safety Performance Functions

The calibration of safety performance functions (SPFs) is a mechanism included in the Highway Safety Manual (HSM) to adjust SPFs in the HSM for use in intended jurisdictions. Critically, the quality of the calibration procedure must be assessed before using the calibrated SPFs. Multiple resources to aid practitioners in calibrating SPFs have been developed in the years following the publication of the HSM 1st edition. Similarly, the literature suggests multiple ways to assess the goodness-of-fit (GOF) of a calibrated SPF to a data set from a given jurisdiction. This paper uses the calibration results of multiple intersection SPFs to a large Mississippi safety database to examine the relations between multiple GOF metrics. The goal is to develop a sensible single index that leverages the joint information from multiple GOF metrics to assess overall quality of calibration. A factor analysis applied to the calibration results revealed three underlying factors explaining 76% of the variability in the data. From these results, the authors developed an index and performed a sensitivity analysis. The key metrics were found to be, in descending order: the deviation of the cumulative residual (CURE) plot from the 95% confidence area, the mean absolute deviation, the modified R-squared, and the value of the calibration factor. This paper also presents comparisons between the index and alternative scoring strategies, as well as an effort to verify the results using synthetic data. The developed index is recommended to comprehensively assess the quality of the calibrated intersection SPFs.

Download Full-text

Dark energy survey internal consistency tests of the joint cosmological probes analysis with posterior predictive distributions

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stab526 ◽

2021 ◽

Vol 503 (2) ◽

pp. 2688-2705

Author(s):

C Doux ◽

E Baxter ◽

P Lemos ◽

C Chang ◽

A Alarcon ◽

...

Keyword(s):

Dark Energy ◽

Internal Consistency ◽

Goodness Of Fit ◽

Small Scale ◽

P Value ◽

Posterior Predictive Distribution ◽

Data Set ◽

Dark Energy Survey ◽

Data Vector ◽

Energy Survey

ABSTRACT Beyond ΛCDM, physics or systematic errors may cause subsets of a cosmological data set to appear inconsistent when analysed assuming ΛCDM. We present an application of internal consistency tests to measurements from the Dark Energy Survey Year 1 (DES Y1) joint probes analysis. Our analysis relies on computing the posterior predictive distribution (PPD) for these data under the assumption of ΛCDM. We find that the DES Y1 data have an acceptable goodness of fit to ΛCDM, with a probability of finding a worse fit by random chance of p = 0.046. Using numerical PPD tests, supplemented by graphical checks, we show that most of the data vector appears completely consistent with expectations, although we observe a small tension between large- and small-scale measurements. A small part (roughly 1.5 per cent) of the data vector shows an unusually large departure from expectations; excluding this part of the data has negligible impact on cosmological constraints, but does significantly improve the p-value to 0.10. The methodology developed here will be applied to test the consistency of DES Year 3 joint probes data sets.

Download Full-text

A Support Based Initialization Algorithm for Categorical Data Clustering

Journal of Information Technology Research ◽

10.4018/jitr.2018040104 ◽

2018 ◽

Vol 11 (2) ◽

pp. 53-67

Author(s):

Ajay Kumar ◽

Shishir Kumar

Keyword(s):

Categorical Data ◽

Selection Process ◽

Numerical Data ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Data Object ◽

Data Points ◽

Wu Method ◽

Selection Algorithms

Several initial center selection algorithms are proposed in the literature for numerical data, but the values of the categorical data are unordered so, these methods are not applicable to a categorical data set. This article investigates the initial center selection process for the categorical data and after that present a new support based initial center selection algorithm. The proposed algorithm measures the weight of unique data points of an attribute with the help of support and then integrates these weights along the rows, to get the support of every row. Further, a data object having the largest support is chosen as an initial center followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.

Download Full-text

Keep it simple - A case study of model development in the context of the Dynamic Stocks and Flows (DSF) task

Journal of Artificial General Intelligence ◽

10.2478/v10229-011-0008-2 ◽

2010 ◽

Vol 2 (2) ◽

pp. 38-51 ◽

Cited By ~ 1

Author(s):

Marc Halbrügge

Keyword(s):

Goodness Of Fit ◽

Model Development ◽

Cognitive Model ◽

Training Data ◽

Sequence Matching ◽

Data Set ◽

Depth Analysis ◽

Stocks And Flows ◽

Matching Techniques

Keep it simple - A case study of model development in the context of the Dynamic Stocks and Flows (DSF) taskThis paper describes the creation of a cognitive model submitted to the ‘Dynamic Stocks and Flows’ (DSF) modeling challenge. This challenge aims at comparing computational cognitive models for human behavior during an open ended control task. Participants in the modeling competition were provided with a simulation environment and training data for benchmarking their models while the actual specification of the competition task was withheld. To meet this challenge, the cognitive model described here was designed and optimized for generalizability. Only two simple assumptions about human problem solving were used to explain the empirical findings of the training data. In-depth analysis of the data set prior to the development of the model led to the dismissal of correlations or other parametric statistics as goodness-of-fit indicators. A new statistical measurement based on rank orders and sequence matching techniques is being proposed instead. This measurement, when being applied to the human sample, also identifies clusters of subjects that use different strategies for the task. The acceptability of the fits achieved by the model is verified using permutation tests.

Download Full-text

Models for Tree Taper Form: The Gompertz and Vasicek Diffusion Processes Framework

Symmetry ◽

10.3390/sym12010080 ◽

2020 ◽

Vol 12 (1) ◽

pp. 80 ◽

Cited By ~ 3

Author(s):

Martynas Narmontas ◽

Petras Rupšys ◽

Edmundas Petrauskas

Keyword(s):

Goodness Of Fit ◽

Diffusion Processes ◽

Parameters Estimation ◽

Data Set ◽

Mixed Effect ◽

Fit Statistics ◽

Stem Taper ◽

Longitudinal Measurements ◽

High Level ◽

Taper Models

In this work, we employ stochastic differential equations (SDEs) to model tree stem taper. SDE stem taper models have some theoretical advantages over the commonly employed regression-based stem taper modeling techniques, as SDE models have both simple analytic forms and a high level of accuracy. We perform fixed- and mixed-effect parameters estimation for the stem taper models by developing an approximated maximum likelihood procedure and using a data set of longitudinal measurements from 319 mountain pine trees. The symmetric Vasicek- and asymmetric Gompertz-type diffusion processes used adequately describe stem taper evolution. The proposed SDE stem taper models are compared to four regression stem taper equations and four volume equations. Overall, the best goodness-of-fit statistics are produced by the mixed-effect parameters SDEs stem taper models. All results are obtained in the Maple computer algebra system.

Download Full-text