Understanding between-cluster variation in prevalence and limits for how much variation is plausible

In clinical trials and observational studies of clustered binary data, understanding between-cluster variation is essential: in sample size and power calculations of cluster randomised trials, for example, the intra-cluster correlation coefficient is often specified. However, quantifications of between-cluster variation can be unintuitive, and an intra-cluster correlation coefficient as low as 0.04 may correspond to surprisingly large between-cluster differences. We suggest that understanding is improved through visualising the implied distribution of true cluster prevalences – possibly by assuming they follow a beta distribution – or by calculating their standard deviation, which is more readily interpretable than the intra-cluster correlation coefficient. Even so, the bounded nature of binary data complicates the interpretation of variances as primary measures of uncertainty, and entropy offers an attractive alternative. Appealing to maximum entropy theory, we propose the following rule of thumb: that plausible intra-cluster correlation coefficients and standard deviations of true cluster prevalences are both bounded above by the overall prevalence, its complement, and one third. We also provide corresponding bounds for the coefficient of variation, and for a different standard deviation and intra-cluster correlation defined on the log odds scale. Using previously published data, we observe the quantities defined on the log odds scale to be more transportable between studies with different outcomes with different prevalences than the intra-cluster correlation and coefficient of variation. The latter increase and decrease, respectively, as prevalence increases from 0% to 50%, and the same is true for our bounds. Our work will help clinical trialists better understand between-cluster variation and avoid specifying implausibly high values for the intra-cluster correlation in sample size and power calculations.
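The relationship between the intra-cluster correlation coefficient, the standard deviation of true cluster prevalences, and the beta distribution mentioned above can be sketched numerically. The sketch below assumes the common definition ICC = Var(cluster prevalence) / (π(1 − π)), which may not match the authors' exact notation. The function names and the example prevalence of 0.30 are illustrative and do not come from the paper.

```python
import math


def prevalence_sd(icc, prevalence):
    """SD of true cluster prevalences, assuming ICC = variance / (pi * (1 - pi))."""
    return math.sqrt(icc * prevalence * (1.0 - prevalence))


def beta_parameters(icc, prevalence):
    """Parameters (a, b) of a beta distribution with the given mean and ICC.

    Uses the beta-binomial identity ICC = 1 / (a + b + 1).
    """
    scale = (1.0 - icc) / icc  # equals a + b
    return prevalence * scale, (1.0 - prevalence) * scale


def rule_of_thumb_bound(prevalence):
    """Upper bound from the abstract: prevalence, its complement, and one third."""
    return min(prevalence, 1.0 - prevalence, 1.0 / 3.0)


# Illustrative values: the ICC of 0.04 is quoted in the abstract,
# the prevalence of 0.30 is an assumption made for this example.
icc, prevalence = 0.04, 0.30
sd = prevalence_sd(icc, prevalence)
a, b = beta_parameters(icc, prevalence)
bound = rule_of_thumb_bound(prevalence)

print(f"SD of cluster prevalences: {sd:.4f}")
print(f"Implied beta distribution: Beta({a:.1f}, {b:.1f})")
print(f"Rule-of-thumb upper bound: {bound:.3f}")
```

Under these assumptions an ICC of 0.04 at 30% prevalence implies a standard deviation of about nine percentage points between clusters, which illustrates why a numerically small ICC can still describe substantial variation.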

Sample Size and Power Calculations with Correlated Binary Data

Controlled Clinical Trials ◽

10.1016/s0197-2456(01)00131-3 ◽

2001 ◽

Vol 22 (3) ◽

pp. 211-227 ◽

Cited By ~ 56

Author(s):

Wei Pan

Keyword(s):

Sample Size ◽

Binary Data ◽

Correlated Binary Data ◽

Power Calculations

Reproducibility of Retinal Nerve Fiber Layer Measurements with Manual and Automated Centration in Healthy Subjects Using Spectralis Spectral-Domain Optical Coherence Tomography

ISRN Ophthalmology ◽

10.5402/2012/860819 ◽

2012 ◽

Vol 2012 ◽

pp. 1-6 ◽

Cited By ~ 2

Author(s):

Alex P. Lange ◽

Reza Sadjadi ◽

Fiona Costello ◽

Ivo Guber ◽

Anthony L. Traboulsee

Keyword(s):

Standard Deviation ◽

Nerve Fiber ◽

Correlation Coefficient ◽

Retinal Nerve Fiber Layer ◽

Coefficient Of Variation ◽

Healthy Subjects ◽

Fiber Layer ◽

High Reproducibility ◽

Nerve Fiber Layer ◽

Sd Oct

Objective. The aim of this study was to test the reproducibility of the Heidelberg Spectralis SD-OCT and to determine whether the software's retest function for follow-up exams is superior to manual centration. Design. Prospective, cross-sectional study. Participants. 20 healthy subjects. Methods. All subjects underwent SD-OCT testing to determine retinal nerve fiber layer (RNFL) measurements sequentially on two different days and with two different centration techniques. Within-subject standard deviation, coefficient of variation, and intraclass correlation coefficient were used to assess reproducibility. Results. RNFL measurements showed high reproducibility, low within-subject standard deviation (1.3), low coefficient of variation (0.63%), and high intraclass correlation coefficient (0.98 (95% CI 0.97–0.99)) in the automated centration and manual centration groups for average RNFL thickness. Quadrants showed slightly higher variability in the manual group compared to the automated group (within-subject standard deviation 2.5–5.3 versus 1.1–2.4, respectively). Conclusions. SD-OCT provides high-resolution RNFL measurements with high reproducibility and low variability. The retest function allows for easier recentration for longitudinal examinations with similar results in average RNFL, but less variability in quadrant RNFL. The high reproducibility and low variability of SD-OCT are promising and should be further evaluated in longitudinal studies of RNFL.
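As a rough illustration of two of the reproducibility measures named in this abstract, the sketch below computes a within-subject standard deviation and the corresponding coefficient of variation from repeated measurements. It assumes every subject has the same number of repeats, so that the within-subject variance is the plain average of the per-subject variances. The measurement values are invented for the example and are not data from the study.

```python
import math
import statistics


def within_subject_sd(measurements):
    """Square root of the mean per-subject variance (equal repeats per subject)."""
    variances = [statistics.variance(subject) for subject in measurements]
    return math.sqrt(statistics.mean(variances))


def within_subject_cv_percent(measurements):
    """Within-subject SD as a percentage of the grand mean."""
    grand_mean = statistics.mean(v for subject in measurements for v in subject)
    return 100.0 * within_subject_sd(measurements) / grand_mean


# Hypothetical thickness readings: one tuple per subject, two visits each.
readings = [(100, 102), (95, 95), (110, 108)]

sw = within_subject_sd(readings)
cv = within_subject_cv_percent(readings)
print(f"Within-subject SD: {sw:.3f}")
print(f"Coefficient of variation: {cv:.2f}%")
```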

Soft Statistics with Respect to Utility and Application to Human Trafficking

New Mathematics and Natural Computation ◽

10.1142/s1793005717400117 ◽

2017 ◽

Vol 13 (03) ◽

pp. 289-310 ◽

Cited By ~ 3

Author(s):

Santanu Acharjee ◽

John N. Mordeson

Keyword(s):

Quantitative Analysis ◽

Standard Deviation ◽

Human Trafficking ◽

Correlation Coefficient ◽

Set Theory ◽

Coefficient Of Variation ◽

Utility Theory ◽

Choice Behavior ◽

Correlation Coefficients ◽

The World

It is well known that statistics deals with quantitative analysis. Thus, there is a lack of approach to do quantitative analysis in the presence of qualitative attributes. Soft set theory has the freedom to deal with attributes, [0, 1], etc. along with quantity. Thus, we introduce some fundamental ideas of soft statistics. Here, soft mean, soft standard deviation, soft coefficient of variation, soft correlation coefficient are introduced and some theorems are proved with respect to utility. Utility theory provides an analysis of choice behavior. As an application of our notions and results, we find soft correlation coefficients between vulnerability and government responses of various regions across the world. The data from “The Global Slavery Index 2016” are considered for application purposes.

Marginal modeling in community randomized trials with rare events: Utilization of the negative binomial regression model

Clinical Trials ◽

10.1177/17407745211063479 ◽

2022 ◽

pp. 174077452110634

Author(s):

Philip M Westgate ◽

Debbie M Cheng ◽

Daniel J Feaster ◽

Soledad Fernández ◽

Abigail B Shoben ◽

...

Keyword(s):

Correlation Coefficient ◽

Coefficient Of Variation ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Cluster Randomized Trial ◽

Regression Parameter ◽

Binomial Regression ◽

Cluster Randomized ◽

Cluster Correlation ◽

Overdispersion Parameter

Background/aims This work is motivated by the HEALing Communities Study, which is a post-test only cluster randomized trial in which communities are randomized to two different trial arms. The primary interest is in reducing opioid overdose fatalities, which will be collected as a count outcome at the community level. Communities range in size from thousands to over one million residents, and fatalities are expected to be rare. Traditional marginal modeling approaches in the cluster randomized trial literature include the use of generalized estimating equations with an exchangeable correlation structure when utilizing subject-level data, or analogously quasi-likelihood based on an over-dispersed binomial variance when utilizing community-level data. These approaches account for and estimate the intra-cluster correlation coefficient, which should be provided in the results from a cluster randomized trial. Alternatively, the coefficient of variation or R coefficient could be reported. In this article, we show that negative binomial regression can also be utilized when communities are large and events are rare. The objectives of this article are (1) to show that the negative binomial regression approach targets the same marginal regression parameter(s) as an over-dispersed binomial model and to explain why the estimates may differ; (2) to derive formulas relating the negative binomial overdispersion parameter k with the intra-cluster correlation coefficient, coefficient of variation, and R coefficient; and (3) analyze pre-intervention data from the HEALing Communities Study to demonstrate and contrast models and to show how to report the intra-cluster correlation coefficient, coefficient of variation, and R coefficient when utilizing negative binomial regression. Methods Negative binomial and over-dispersed binomial regression modeling are contrasted in terms of model setup, regression parameter estimation, and formulation of the overdispersion parameter. 
Three specific models are used to illustrate concepts and address the third objective. Results The negative binomial regression approach targets the same marginal regression parameter(s) as an over-dispersed binomial model, although estimates may differ. Practical differences arise in regard to how overdispersion, and hence the intra-cluster correlation coefficient, is modeled. The negative binomial overdispersion parameter is approximately equal to the ratio of the intra-cluster correlation coefficient and marginal probability, the square of the coefficient of variation, and the R coefficient minus 1. As a result, estimates corresponding to all four of these different types of overdispersion parameterizations can be reported when utilizing negative binomial regression. Conclusion Negative binomial regression provides a valid, practical, alternative approach to the analysis of count data, and corresponding reporting of overdispersion parameters, from community randomized trials in which communities are large and events are rare.
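The approximate relationships stated in the results can be turned into a small conversion helper. This is only a sketch of the three approximations as worded in the abstract (k ≈ ICC / marginal probability, k ≈ CV², k ≈ R − 1). The function name and the example values of k and the marginal probability are hypothetical and are not taken from the HEALing Communities Study.

```python
import math


def overdispersion_summaries(k, marginal_probability):
    """Convert a negative binomial overdispersion parameter k into the
    other overdispersion measures, using the approximations in the abstract."""
    return {
        "icc": k * marginal_probability,  # from k ~ ICC / marginal probability
        "cv": math.sqrt(k),               # from k ~ CV squared
        "r_coefficient": 1.0 + k,         # from k ~ R - 1
    }


# Hypothetical inputs for a rare event in large communities.
summaries = overdispersion_summaries(k=0.25, marginal_probability=0.0002)
for name, value in summaries.items():
    print(f"{name}: {value:.6f}")
```

The example shows how a rare outcome can combine a very small intra-cluster correlation coefficient with a sizeable coefficient of variation, since the former scales with the marginal probability and the latter does not.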

Sample sizes for cluster-randomised trials with continuous outcomes: Accounting for uncertainty in a single intra-cluster correlation estimate

Statistical Methods in Medical Research ◽

10.1177/09622802211037073 ◽

2021 ◽

pp. 096228022110370

Author(s):

Jen Lewis ◽

Steven A Julious

Keyword(s):

Sample Size ◽

Correlation Coefficient ◽

Randomised Trial ◽

Sample Size Calculation ◽

Coefficient Estimate ◽

Randomised Trials ◽

Main Trial ◽

Cluster Randomised Trials ◽

Cluster Randomised ◽

Cluster Correlation

Sample size calculations for cluster-randomised trials require inclusion of an inflation factor taking into account the intra-cluster correlation coefficient. Often, estimates of the intra-cluster correlation coefficient are taken from pilot trials, which are known to have uncertainty about their estimation. Given that the value of the intra-cluster correlation coefficient has a considerable influence on the calculated sample size for a main trial, the uncertainty in the estimate can have a large impact on the ultimate sample size and consequently, the power of a main trial. As such, it is important to account for the uncertainty in the estimate of the intra-cluster correlation coefficient. While a commonly adopted approach is to utilise the upper confidence limit in the sample size calculation, this is a largely inefficient method which can result in overpowered main trials. In this paper, we present a method of estimating the sample size for a main cluster-randomised trial with a continuous outcome, using numerical methods to account for the uncertainty in the intra-cluster correlation coefficient estimate. Despite limitations with this initial study, the findings and recommendations in this paper can help to improve sample size estimations for cluster randomised controlled trials by accounting for uncertainty in the estimate of the intra-cluster correlation coefficient. We recommend this approach be applied to all trials where there is uncertainty in the intra-cluster correlation coefficient estimate, in conjunction with additional sources of information to guide the estimation of the intra-cluster correlation coefficient.
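To make the role of the inflation factor concrete, the sketch below uses the standard design effect for equal cluster sizes, 1 + (m − 1) × ICC, and compares a point estimate of the intra-cluster correlation coefficient with an upper confidence limit. This is not the numerical method proposed in the paper. The individually randomised sample size, the cluster size, and both ICC values are assumptions chosen for illustration.

```python
import math


def design_effect(cluster_size, icc):
    """Standard inflation factor for equal cluster sizes: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def clusters_per_arm(n_individual, cluster_size, icc):
    """Clusters needed per arm after inflating an individually randomised
    per-arm sample size by the design effect."""
    inflated = n_individual * design_effect(cluster_size, icc)
    return math.ceil(inflated / cluster_size)


# Hypothetical planning values: 128 per arm without clustering, 20 per cluster.
point_estimate = clusters_per_arm(n_individual=128, cluster_size=20, icc=0.05)
upper_limit = clusters_per_arm(n_individual=128, cluster_size=20, icc=0.12)

print(f"Clusters per arm at ICC 0.05: {point_estimate}")
print(f"Clusters per arm at ICC 0.12: {upper_limit}")
```

With these assumed inputs, planning on the upper limit rather than the point estimate raises the requirement from 13 to 21 clusters per arm, which is the kind of overpowering the abstract describes.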

Sample size estimation to substantiate freedom from disease for clustered binary data with a specific risk profile

Epidemiology and Infection ◽

10.1017/s0950268812001938 ◽

2012 ◽

Vol 141 (6) ◽

pp. 1318-1327

Author(s):

P. KOSTOULAS ◽

S. S. NIELSEN ◽

W. J. BROWNE ◽

L. LEONTIDES

Keyword(s):

Sample Size ◽

Binary Data ◽

Simulation Method ◽

Risk Profile ◽

Sample Size Estimation ◽

Size Estimation ◽

Critical Control Points ◽

Freedom From Disease ◽

Size Estimates ◽

Cluster Correlation

Summary. Disease cases are often clustered within herds or, more generally, groups that share common characteristics. Sample size formulae must adjust for the within-cluster correlation of the primary sampling units. Traditionally, the intra-cluster correlation coefficient (ICC), which is an average measure of the data heterogeneity, has been used to modify formulae for individual sample size estimation. However, subgroups of animals sharing common characteristics may exhibit markedly less or more heterogeneity. Hence, sample size estimates based on the ICC may not achieve the desired precision and power when applied to these groups. We propose the use of the variance partition coefficient (VPC), which measures the clustering of infection/disease for individuals with a common risk profile. Sample size estimates are obtained separately for those groups that exhibit markedly different heterogeneity, thus optimizing resource allocation. A VPC-based predictive simulation method for sample size estimation to substantiate freedom from disease is presented. To illustrate the benefits of the proposed approach, we give two examples: the analysis of data from a risk factor study on Mycobacterium avium subsp. paratuberculosis infection in Danish dairy cattle, and a study on critical control points for Salmonella cross-contamination of pork in Greek slaughterhouses.

Association of intracluster correlation measures with outcome prevalence for binary outcomes in cluster randomised trials

Statistical Methods in Medical Research ◽

10.1177/09622802211026004 ◽

2021 ◽

pp. 096228022110260

Author(s):

Ariane M Mbekwe Yepnang ◽

Agnès Caille ◽

Sandra M Eldridge ◽

Bruno Giraudeau

Keyword(s):

Sample Size ◽

Correlation Coefficient ◽

Binary Data ◽

Tetrachoric Correlation ◽

Binary Outcome ◽

Intracluster Correlation ◽

Randomised Trials ◽

Cluster Randomised Trials ◽

Cluster Randomised ◽

Correlation Measures

In cluster randomised trials, a measure of intracluster correlation such as the intraclass correlation coefficient (ICC) should be reported for each primary outcome. Providing intracluster correlation estimates may help in calculating sample size of future cluster randomised trials and also in interpreting the results of the trial from which they are derived. For a binary outcome, the ICC is known to be associated with its prevalence, which raises at least two issues. First, it questions the use of ICC estimates obtained on a binary outcome in a trial for sample size calculations in a subsequent trial in which the same binary outcome is expected to have a different prevalence. Second, it challenges the interpretation of ICC estimates because they do not solely depend on clustering level. Other intracluster correlation measures proposed for clustered binary data settings include the variance partition coefficient, the median odds ratio and the tetrachoric correlation coefficient. Under certain assumptions, the theoretical maximum possible value for an ICC associated with a binary outcome can be derived, and we proposed the relative deviation of an ICC estimate to this maximum value as another measure of the intracluster correlation. We conducted a simulation study to explore the dependence of these intracluster correlation measures on outcome prevalence and found that all are associated with prevalence. Even if all depend on prevalence, the tetrachoric correlation coefficient computed with Kirk’s approach was less dependent on the outcome prevalence than the other measures when the intracluster correlation was about 0.05. We also observed that for lower values, such as 0.01, the analysis of variance estimator of the ICC is preferred.
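Two of the alternative measures listed in this abstract have widely used closed forms under a random-intercept logistic model, and the sketch below computes them from the between-cluster variance on the log odds scale. It assumes the latent-variable form of the variance partition coefficient, σ² / (σ² + π²/3), and the usual median odds ratio formula, exp(√(2σ²) × Φ⁻¹(0.75)). These may differ from the exact estimators used in the simulation study, and the variance of 0.2 is an illustrative assumption.

```python
import math
from statistics import NormalDist


def latent_vpc(sigma2):
    """Variance partition coefficient on the latent logistic scale."""
    return sigma2 / (sigma2 + math.pi ** 2 / 3.0)


def median_odds_ratio(sigma2):
    """Median odds ratio between two randomly chosen clusters."""
    z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of the standard normal
    return math.exp(math.sqrt(2.0 * sigma2) * z75)


# Hypothetical between-cluster variance on the log odds scale.
sigma2 = 0.2
vpc = latent_vpc(sigma2)
mor = median_odds_ratio(sigma2)
print(f"Latent-scale VPC: {vpc:.4f}")
print(f"Median odds ratio: {mor:.3f}")
```

Both quantities depend only on the log odds variance in this formulation, whereas the intraclass correlation coefficient on the proportion scale also depends on the outcome prevalence.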

Conservative confidence intervals for the intraclass correlation coefficient for clustered binary data

Journal of Applied Statistics ◽

10.1080/02664763.2021.1910939 ◽

2021 ◽

pp. 1-15

Author(s):

Guogen Shan

Keyword(s):

Correlation Coefficient ◽

Confidence Intervals ◽

Intraclass Correlation Coefficient ◽

Binary Data ◽

Intraclass Correlation ◽

Clustered Binary Data

Intra-cluster correlations from the CLustered OUtcome Dataset bank to inform the design of longitudinal cluster trials

Clinical Trials ◽

10.1177/17407745211020852 ◽

2021 ◽

pp. 174077452110208

Author(s):

Elizabeth Korevaar ◽

Jessica Kasza ◽

Monica Taljaard ◽

Karla Hemming ◽

Terry Haines ◽

...

Keyword(s):

Sample Size ◽

Discrete Time ◽

Correlation Coefficients ◽

Time Decay ◽

Randomised Trials ◽

Cluster Randomised Trials ◽

Cluster Randomised ◽

Sample Size Calculations ◽

Correlation Structures ◽

Cluster Correlation

Background: Sample size calculations for longitudinal cluster randomised trials, such as crossover and stepped-wedge trials, require estimates of the assumed correlation structure. This includes both within-period intra-cluster correlations, which importantly differ from conventional intra-cluster correlations by their dependence on period, and also cluster autocorrelation coefficients to model correlation decay. There are limited resources to inform these estimates. In this article, we provide a repository of correlation estimates from a bank of real-world clustered datasets. These are provided under several assumed correlation structures, namely exchangeable, block-exchangeable and discrete-time decay correlation structures. Methods: Longitudinal studies with clustered outcomes were collected to form the CLustered OUtcome Dataset bank. Forty-four available continuous outcomes from 29 datasets were obtained and analysed using each correlation structure. Patterns of within-period intra-cluster correlation coefficient and cluster autocorrelation coefficients were explored by study characteristics. Results: The median within-period intra-cluster correlation coefficient for the discrete-time decay model was 0.05 (interquartile range: 0.02–0.09) with a median cluster autocorrelation of 0.73 (interquartile range: 0.19–0.91). The within-period intra-cluster correlation coefficients were similar for the exchangeable, block-exchangeable and discrete-time decay correlation structures. Within-period intra-cluster correlation coefficients and cluster autocorrelations were found to vary with the number of participants per cluster-period, the period-length, type of cluster (primary care, secondary care, community or school) and country income status (high-income country or low- and middle-income country). 
The within-period intra-cluster correlation coefficients tended to decrease with increasing period-length and slightly decrease with increasing cluster-period sizes, while the cluster autocorrelations tended to move closer to 1 with increasing cluster-period size. Using the CLustered OUtcome Dataset bank, an RShiny app has been developed for determining plausible values of correlation coefficients for use in sample size calculations. Discussion: This study provides a repository of intra-cluster correlations and cluster autocorrelations for longitudinal cluster trials. This can help inform sample size calculations for future longitudinal cluster randomised trials.
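The three correlation structures named in this abstract can be compared side by side. The sketch below builds, for each structure, the matrix of correlations between two different participants from the same cluster measured in periods s and t. It assumes the usual parameterisation in which the between-period correlation is the within-period intra-cluster correlation multiplied by the cluster autocorrelation (block-exchangeable) or by the cluster autocorrelation raised to the power |s − t| (discrete-time decay). The function name is hypothetical. The values 0.05 and 0.73 are the medians reported for the discrete-time decay model and are reused for all three structures purely for illustration.

```python
def correlation_matrix(structure, periods, icc, cac=1.0):
    """Correlation between two participants in the same cluster,
    indexed by the periods in which each was measured."""
    matrix = []
    for s in range(periods):
        row = []
        for t in range(periods):
            if structure == "exchangeable":
                value = icc  # same correlation regardless of period
            elif structure == "block-exchangeable":
                value = icc if s == t else icc * cac  # one drop between periods
            elif structure == "discrete-time decay":
                value = icc * cac ** abs(s - t)  # decays with period separation
            else:
                raise ValueError(f"unknown structure: {structure}")
            row.append(value)
        matrix.append(row)
    return matrix


exchangeable = correlation_matrix("exchangeable", periods=3, icc=0.05)
block = correlation_matrix("block-exchangeable", periods=3, icc=0.05, cac=0.73)
decay = correlation_matrix("discrete-time decay", periods=3, icc=0.05, cac=0.73)

for name, matrix in [("exchangeable", exchangeable),
                     ("block-exchangeable", block),
                     ("discrete-time decay", decay)]:
    print(name)
    for row in matrix:
        print("  " + "  ".join(f"{value:.4f}" for value in row))
```

The structures agree on the diagonal (the within-period correlation) and differ only in how quickly the correlation falls as measurements move further apart in time.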

Energy Expenditure in Playground Games in Primary School Children Measured by Accelerometer and Heart Rate Monitors

International Journal of Sport Nutrition and Exercise Metabolism ◽

10.1123/ijsnem.2016-0122 ◽

2017 ◽

Vol 27 (5) ◽

pp. 467-474 ◽

Cited By ~ 7

Author(s):

Jorge Cañete García-Prieto ◽

Vicente Martinez-Vizcaino ◽

Antonio García-Hermoso ◽

Mairena Sánchez-López ◽

Natalia Arias-Palencia ◽

...

Keyword(s):

Physical Activity ◽

Heart Rate ◽

Energy Expenditure ◽

Primary School ◽

Correlation Coefficient ◽

School Children ◽

Indirect Calorimetry ◽

Coefficient Of Variation ◽

Primary School Children ◽

Heart Rate Monitors

The aim of this study was to examine the energy expenditure (EE) measured using indirect calorimetry (IC) during playground games and to assess the validity of heart rate (HR) and accelerometry counts as indirect indicators of EE in children's physical activity games. Participants were 32 primary school children (9.9 ± 0.6 years old, BMI 19.8 ± 4.9 kg · m⁻² and VO2max 37.6 ± 7.2 ml · kg⁻¹ · min⁻¹). IC, accelerometry and HR data were simultaneously collected for each child during a 90 min session of 30 playground games. Thirty-eight sessions were recorded in 32 different children. Each game was recorded on at least three occasions in three other children. The intersubject coefficient of variation within a game was 27% for IC, 37% for accelerometry and 13% for HR. The overall mean EE in the games was 4.2 ± 1.4 kcal · min⁻¹ per game, totaling 375 ± 122 kcal per 90 min session. The correlation coefficient between indirect calorimetry and accelerometer counts was 0.48 (p = .026) for endurance games and 0.21 (p = .574) for strength games. The correlation coefficient between indirect calorimetry and HR was 0.71 (p = .032) for endurance games and 0.48 (p = .026) for strength games. Our data indicate that both accelerometer and HR monitors are useful devices for estimating EE during endurance games, but only HR monitor estimates are accurate for endurance games.
