Dependent binary responses, such as health outcomes in twin pairs or siblings, frequently arise in perinatal epidemiologic research. The resulting data are correlated, and this correlation must be taken into account during analysis to avoid erroneous statistical and biological inferences. We present an analysis of perinatal mortality (fetal deaths plus deaths within the first 28 days) in twins in relation to cluster-varying risk factors (those unique to each fetus within a twin pregnancy, such as birthweight) and cluster-constant risk factors (those identical for both twins within a sibship, such as maternal smoking status). Marginal models (ordinary logistic regression [OLR] and logistic regression using generalized estimating equations [GEE]) and cluster-specific models (conditional and random-intercept logistic regression) are fit and their results contrasted. The United States "matched multiple data" file of twin births (1995-1997), which includes 285,226 twins from 142,613 pregnancies, was used to examine the implications of ignoring clustering for regression inferences. The OLR models yielded variance estimates for cluster-constant covariates that were 7% to 71% smaller than those from GEE-based models. This underestimation was even more pronounced for some cluster-varying covariates, ranging from 21% to 198%. Ignoring the cluster dependency is likely to affect the precision of estimated covariate effects and, consequently, the interpretation of results. With appropriate software now widely available, statistical methods that account for intracluster dependency are easily implemented and necessary.
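The mechanism behind the variance underestimation can be illustrated with a minimal sketch. It is not the authors' analysis and does not use the study data: the pair counts below are hypothetical toy values, and the function name `naive_and_clustered_se` is introduced here only for illustration. The sketch estimates a simple mortality proportion and compares a standard error that treats all twins as independent with one that treats the twin pair as the sampling unit.

```python
import math

# Hypothetical toy data (NOT from the study): number of twin pairs
# with 0, 1, or 2 deaths in the pair.
pair_counts = {0: 80, 1: 10, 2: 10}


def naive_and_clustered_se(pair_counts):
    """Return (proportion, naive SE, pair-level SE) for twin-pair data."""
    n_pairs = sum(pair_counts.values())
    n_twins = 2 * n_pairs
    deaths = sum(k * c for k, c in pair_counts.items())
    p = deaths / n_twins

    # Naive binomial SE: treats every twin as an independent observation.
    naive_se = math.sqrt(p * (1 - p) / n_twins)

    # Pair-level SE: each pair contributes one mean (0, 0.5, or 1), and the
    # variance of the overall mean is the sample variance of those pair
    # means divided by the number of pairs.
    ss = sum(c * (k / 2 - p) ** 2 for k, c in pair_counts.items())
    clustered_se = math.sqrt(ss / (n_pairs - 1) / n_pairs)

    return p, naive_se, clustered_se


p, naive_se, clustered_se = naive_and_clustered_se(pair_counts)
variance_ratio = (clustered_se / naive_se) ** 2
print(f"proportion={p:.3f} naive_se={naive_se:.4f} "
      f"clustered_se={clustered_se:.4f} variance_ratio={variance_ratio:.2f}")
```

In this toy example most deaths occur in concordant pairs, so outcomes within a pair are positively correlated and the pair-level variance is about 1.6 times the naive one. The regression setting in the abstract is more involved, but the direction of the error for cluster-constant quantities is the same.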