Abstract

Background: Clustering of observations is a common phenomenon in epidemiological and clinical research. Previous studies have highlighted the importance of using multilevel analysis to account for such clustering, but in practice, methods ignoring clustering are often employed. We used simulated data to explore the circumstances in which failure to account for clustering in linear regression could lead to importantly erroneous conclusions.

Methods: We simulated data following the random-intercept model specification under different scenarios of clustering of a continuous outcome and a single continuous or binary explanatory variable. We fitted random-intercept (RI) and ordinary least squares (OLS) models and compared effect estimates with the “true” value that had been used in simulation. We also assessed the relative precision of effect estimates, and explored the extent to which coverage by 95% confidence intervals and Type I error rates were appropriate.

Results: We found that effect estimates from both types of regression model were on average unbiased. However, deviations from the “true” value were greater when the outcome variable was more clustered. For a continuous explanatory variable, they tended also to be greater for the OLS than the RI model, and when the explanatory variable was less clustered. The precision of effect estimates from the OLS model was overestimated when the explanatory variable varied more between than within clusters, and was somewhat underestimated when the explanatory variable was less clustered. The cluster-unadjusted model gave poor coverage rates by 95% confidence intervals and high Type I error rates when the explanatory variable was continuous. With a binary explanatory variable, coverage rates by 95% confidence intervals and Type I error rates deviated from nominal values when the outcome variable was more clustered, but the direction of the deviation varied according to the overall prevalence of the explanatory variable, and the extent to which it was clustered.

Conclusions: In this study we identified circumstances in which application of an OLS regression model to clustered data is more likely to mislead statistical inference. The potential for error is greatest when the explanatory variable is continuous, and the outcome variable more clustered (intraclass correlation coefficient is ≥ 0.01).
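
The simulation design described in the Methods can be illustrated with a minimal sketch. It assumes a random-intercept model y_ij = beta0 + beta1*x_ij + u_j + e_ij with a continuous explanatory variable, and fits only the cluster-unadjusted OLS model (an RI fit would need a mixed-model library and is left out). The function names, parameter values, and number of replications are hypothetical choices for illustration, not the scenarios used in the study.

```python
import math
import random
import statistics


def simulate_dataset(rng, n_clusters=20, n_per_cluster=20, beta0=1.0, beta1=0.5,
                     sd_u=1.0, sd_e=1.0, sd_x_between=2.0, sd_x_within=1.0):
    """Draw one dataset from y_ij = beta0 + beta1*x_ij + u_j + e_ij."""
    xs, ys = [], []
    for _ in range(n_clusters):
        u_j = rng.gauss(0.0, sd_u)               # random intercept shared by the cluster
        x_mean_j = rng.gauss(0.0, sd_x_between)  # cluster-specific mean of x
        for _ in range(n_per_cluster):
            x = rng.gauss(x_mean_j, sd_x_within)
            xs.append(x)
            ys.append(beta0 + beta1 * x + u_j + rng.gauss(0.0, sd_e))
    return xs, ys


def ols_fit(xs, ys):
    """Return the OLS slope and its naive (cluster-unadjusted) standard error."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return slope, se


def run_simulation(n_reps=200, seed=2021, **kwargs):
    """Repeat the simulation and summarise the OLS slope across replications."""
    rng = random.Random(seed)
    fits = [ols_fit(*simulate_dataset(rng, **kwargs)) for _ in range(n_reps)]
    slopes = [slope for slope, _ in fits]
    naive_ses = [se for _, se in fits]
    return {
        "mean_slope": statistics.fmean(slopes),       # compare with the true beta1
        "empirical_sd": statistics.stdev(slopes),     # actual spread of the estimates
        "mean_naive_se": statistics.fmean(naive_ses), # spread claimed by OLS
    }
```

Calling `run_simulation()` with these defaults (outcome ICC of 0.5, x varying more between than within clusters) lets one compare `mean_naive_se` with `empirical_sd`: a naive standard error well below the empirical spread corresponds to the overestimated precision that the Results describe for this kind of scenario.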

Highlights

Source: Ntani et al. BMC Med Res Methodol (2021) 21:139

  • Clinical and epidemiological research often uses some form of regression analysis to explore the relationship of an outcome variable to one or more explanatory variables. [...] The extent to which a variable is “clustered” can be quantified by the intra-class correlation coefficient (ICC), which is defined as the ratio of its variance between clusters to its total variance [1]. Clustering has implications for statistical inference from regression analysis if the outcome variable is clustered after the effects of all measured explanatory variables are taken into account.

  • With a continuous explanatory variable, divergence from the “true” value tended to be greater for the ordinary least squares (OLS) than for the RI model, especially for higher ICC and for lower dispersion of the mean value of x_ij across the clusters within a sample (Supplementary Table 1).

  • With a binary explanatory variable, divergence from the nominal value was again greatest for high ICCs, but there was no strong relationship to dispersion of the mean prevalence of x_ij across clusters, and average divergence differed less between the two models.


Summary

Introduction

Clinical and epidemiological research often uses some form of regression analysis to explore the relationship of an outcome variable to one or more explanatory variables. [...] The extent to which a variable is “clustered” can be quantified by the intra-class correlation coefficient (ICC), which is defined as the ratio of its variance between clusters to its total variance (both between and within clusters) [1]. Clustering has implications for statistical inference from regression analysis if the outcome variable is clustered after the effects of all measured explanatory variables are taken into account. If allowance is not made for such clustering as part of the analysis, parameter estimates and/or their precision may be biased. [...] Even if the distribution of noise exposures in each city were similar, so that the regression coefficient was unbiased, its precision (the inverse of its variance) would be underestimated, since variance would be inflated by failure to allow for the differences between clusters (at the intercept) (Fig. 1C). We used simulated data to explore the circumstances in which failure to account for clustering in linear regression could lead to importantly erroneous conclusions.
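
To make the ICC definition above concrete, the following sketch estimates it from clustered data as the share of total variance that lies between clusters. It uses the one-way ANOVA estimator for equally sized clusters, which is one common choice and not necessarily the estimator used by the authors. The helper names `anova_icc` and `simulate_clusters` are hypothetical.

```python
import random
import statistics


def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equally sized clusters.

    `clusters` is a list of lists, one inner list of observations per cluster.
    """
    k = len(clusters)
    n = len(clusters[0])
    if any(len(c) != n for c in clusters):
        raise ValueError("this sketch assumes equally sized clusters")
    grand_mean = statistics.fmean([v for c in clusters for v in c])
    means = [statistics.fmean(c) for c in clusters]
    # Mean squares between and within clusters.
    msb = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    msw = sum((v - m) ** 2 for c, m in zip(clusters, means) for v in c) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)


def simulate_clusters(k, n, sd_between, sd_within, seed=0):
    """Simulate k clusters of size n with a shared cluster effect plus noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(k):
        u = rng.gauss(0.0, sd_between)  # cluster-level effect
        data.append([u + rng.gauss(0.0, sd_within) for _ in range(n)])
    return data
```

With `sd_between=1` and `sd_within=3` the ICC used to generate the data is 1 / (1 + 9) = 0.1, so `anova_icc(simulate_clusters(500, 20, 1.0, 3.0))` should land near that value. Because it is a moment-based estimate, it can be slightly negative in small samples when the true ICC is close to zero.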
