Clustered Binary Data Research Articles

Generalized estimating equations are commonly used to fit logistic regression models to clustered binary data from cluster randomized trials. A commonly used correlation structure assumes that the intracluster correlation coefficient does not vary by treatment arm or other covariates, but the consequences of this assumption are understudied. We aim to evaluate the effect of allowing variation of the intracluster correlation coefficient by treatment or other covariates on the efficiency of analysis and show how to account for such variation in sample size calculations. We develop formulae for the asymptotic variance of the estimated difference in outcome between treatment arms obtained when the true exchangeable correlation structure depends on the treatment arm and the working correlation structure used in the generalized estimating equations analysis is: (i) correctly specified, (ii) independent, or (iii) exchangeable with no dependence on treatment arm. These formulae require a known distribution of cluster sizes; we also develop simplifications for the case when cluster sizes do not vary and approximations that can be used when the first two moments of the cluster size distribution are known. We then extend the results to settings with adjustment for a second binary cluster-level covariate. We provide formulae to calculate the required sample size for cluster randomized trials using these variances. We show that the asymptotic variance of the estimated difference in outcome between treatment arms using these three working correlation structures is the same if all clusters have the same size, and this asymptotic variance is approximately the same when intracluster correlation coefficient values are small. 
We illustrate these results using data from a recent cluster randomized trial for infectious disease prevention in which the clusters are groups of households of modest size (mean 9.6 individuals), with intracluster correlation coefficient values of 0.078 in the control arm and 0.057 in an intervention arm. In this application, we found a negligible difference between the variances calculated using structures (i) and (iii) and only a small increase (typically ) for the independent correlation structure (ii), and hence minimal effect on power or sample size requirements. The impact may be larger in other applications if there is greater variation in the intracluster correlation coefficient between treatment arms or with an additional covariate. The common approach of fitting generalized estimating equations with an exchangeable working correlation structure with a common intracluster correlation coefficient across arms likely does not substantially reduce the power or efficiency of the analysis in the setting of a large number of small or modest-sized clusters, even if the intracluster correlation coefficient varies by treatment arm. Our formulae, however, allow formal evaluation of this and may identify situations in which variation in the intracluster correlation coefficient by treatment arm or another binary covariate may have a more substantial impact on power and hence sample size requirements.
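
As a rough illustration of how arm-specific intracluster correlation coefficients enter a sample size calculation, the sketch below uses the standard design effect 1 + (m - 1) * ICC on the risk-difference scale with equal cluster sizes. It is not the asymptotic variance formulae developed in the article (which concern generalized estimating equations for logistic models and varying cluster sizes). The function names and the outcome prevalences of 0.30 and 0.20 are hypothetical, chosen only for illustration; the ICC values and mean cluster size are those quoted in the abstract, with the mean treated as a common cluster size.

```python
import math
from statistics import NormalDist


def design_effect(cluster_size, icc):
    """Variance inflation 1 + (m - 1) * icc for clusters of equal size m."""
    return 1.0 + (cluster_size - 1.0) * icc


def variance_of_difference(p0, p1, icc0, icc1, clusters_per_arm, cluster_size):
    """Approximate variance of the difference in proportions between two arms,
    allowing a different ICC in each arm (equal cluster sizes assumed)."""
    def arm_variance(p, icc):
        n = clusters_per_arm * cluster_size
        return p * (1.0 - p) * design_effect(cluster_size, icc) / n

    return arm_variance(p0, icc0) + arm_variance(p1, icc1)


def clusters_needed_per_arm(p0, p1, icc0, icc1, cluster_size,
                            alpha=0.05, power=0.80):
    """Smallest number of clusters per arm for a two-sided test at level alpha
    to reach the requested power, under the approximation above."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    # Variance contributed by one cluster per arm; it scales as 1 / clusters.
    unit_variance = variance_of_difference(p0, p1, icc0, icc1, 1, cluster_size)
    return math.ceil(z ** 2 * unit_variance / (p1 - p0) ** 2)


if __name__ == "__main__":
    # ICCs and mean cluster size from the abstract; prevalences are hypothetical.
    arm_specific = clusters_needed_per_arm(0.30, 0.20, 0.078, 0.057, 9.6)
    common_icc = clusters_needed_per_arm(0.30, 0.20, 0.078, 0.078, 9.6)
    print("clusters per arm, arm-specific ICCs:", arm_specific)
    print("clusters per arm, common ICC of 0.078:", common_icc)
```

Comparing the two printed values gives a quick sense of how much a common-ICC assumption changes the required number of clusters under this simplified approximation.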

In clinical trials and observational studies of clustered binary data, understanding between-cluster variation is essential: in sample size and power calculations of cluster randomised trials, for example, the intra-cluster correlation coefficient is often specified. However, quantifications of between-cluster variation can be unintuitive, and an intra-cluster correlation coefficient as low as 0.04 may correspond to surprisingly large between-cluster differences. We suggest that understanding is improved through visualising the implied distribution of true cluster prevalences (possibly by assuming they follow a beta distribution) or by calculating their standard deviation, which is more readily interpretable than the intra-cluster correlation coefficient. Even so, the bounded nature of binary data complicates the interpretation of variances as primary measures of uncertainty, and entropy offers an attractive alternative. Appealing to maximum entropy theory, we propose the following rule of thumb: that plausible intra-cluster correlation coefficients and standard deviations of true cluster prevalences are both bounded above by the overall prevalence, its complement, and one third. We also provide corresponding bounds for the coefficient of variation, and for a different standard deviation and intra-cluster correlation defined on the log odds scale. Using previously published data, we observe the quantities defined on the log odds scale to be more transportable between studies whose outcomes differ in prevalence than the intra-cluster correlation and coefficient of variation. The latter increase and decrease, respectively, as prevalence increases from 0% to 50%, and the same is true for our bounds. Our work will help clinical trialists better understand between-cluster variation and avoid specifying implausibly high values for the intra-cluster correlation in sample size and power calculations.
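
The quantities mentioned in this abstract can be sketched numerically. The example below assumes the usual proportion-scale definition of the intra-cluster correlation coefficient, ICC = Var(true cluster prevalence) / (prevalence * (1 - prevalence)), and a beta distribution for the true cluster prevalences, for which ICC = 1 / (alpha + beta + 1). The function names are hypothetical, and reading the rule of thumb as "no larger than the minimum of the prevalence, its complement, and one third" is an interpretation of the abstract's wording rather than a formula taken from the article.

```python
import math


def prevalence_sd(prevalence, icc):
    """Standard deviation of true cluster prevalences implied by an ICC."""
    return math.sqrt(icc * prevalence * (1.0 - prevalence))


def beta_parameters(prevalence, icc):
    """Shape parameters (alpha, beta) of the beta distribution with the given
    mean and ICC, using ICC = 1 / (alpha + beta + 1)."""
    total = 1.0 / icc - 1.0  # alpha + beta
    return prevalence * total, (1.0 - prevalence) * total


def rule_of_thumb_bound(prevalence):
    """Upper bound read from the abstract: the smallest of the prevalence,
    its complement, and one third (an interpretation, see lead-in)."""
    return min(prevalence, 1.0 - prevalence, 1.0 / 3.0)


def is_plausible(prevalence, icc):
    """True if both the ICC and the implied SD respect the bound."""
    bound = rule_of_thumb_bound(prevalence)
    return icc <= bound and prevalence_sd(prevalence, icc) <= bound


if __name__ == "__main__":
    # An ICC of 0.04 at 50% prevalence implies a cluster-prevalence SD of
    # about 0.1, i.e. true prevalences spread well beyond 40% to 60%.
    print(prevalence_sd(0.5, 0.04))
    print(beta_parameters(0.5, 0.04))
    print(is_plausible(0.02, 0.04))
```

Plotting the beta density with the returned shape parameters is one way to visualise the implied distribution of true cluster prevalences.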

Articles published on Clustered Binary Data

Analyzing Matched 2 × 2 Tables from all Corners

Evaluating the Efficiency of Restricted Pseudo Likelihood Estimation in Balanced and Unbalanced Clustered Binary Data Models

Modeling clustered binary data with nonparametric unobserved heterogeneity: An application to stock crash analysis

Effects of an educational physical activity intervention in young women with newly diagnosed breast cancer: Findings from the Young and Strong Study.

An adjusted scale binomial Beta H-Likelihood estimation method for unbalanced clustered

Power and sample size calculations for cluster randomized trials with binary outcomes when intracluster correlation coefficients vary by treatment arm.

Analysis of GEE with a mixture working correlation matrix for diverging number of covariates

GEECORR: A SAS macro for regression models of correlated binary responses and within-cluster correlation using generalized estimating equations

Association of intracluster correlation measures with outcome prevalence for binary outcomes in cluster randomised trials.

Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm.

Conservative confidence intervals for the intraclass correlation coefficient for clustered binary data

Residual-based tree for clustered binary data

Understanding between-cluster variation in prevalence and limits for how much variation is plausible.

A robust adjustment to McNemar test when the data are clustered

Letter to the Editor: A novel confidence interval for a single proportion in the presence of clustered binary outcome data (SMMR, 2019)

Bootstrap ICC estimators in analysis of small clustered binary data

A novel confidence interval for a single proportion in the presence of clustered binary outcome data.

Estimating marginal proportions and intraclass correlations with clustered binary data.

Classified mixed logistic model prediction

Sensitivity and Reproducibility of Automated Feeding Artery Detection Software during Transarterial Chemoembolization of Hepatocellular Carcinoma
