Evaluation of approaches for multiple imputation of three-level data

Rushani Wijesuriya, Katherine J Lee, Margarita Moreno-Betancur, John B Carlin

doi:10.1186/s12874-020-01079-8

Open Access

Abstract

Background: Three-level data arising from repeated measures on individuals who are clustered within larger units are common in health research studies. Missing data are prominent in such longitudinal studies and multiple imputation (MI) is a popular approach for handling missing data. Extensions of joint modelling and fully conditional specification MI approaches based on multilevel models have been developed for imputing three-level data. Alternatively, it is possible to extend single- and two-level MI methods to impute three-level data using dummy indicators and/or by analysing repeated measures in wide format. However, most implementations, evaluations and applications of these approaches focus on the context of incomplete two-level data. It is currently unclear which approach is preferable for imputing three-level data.

Methods: In this study, we investigated the performance of various MI methods for imputing three-level incomplete data when the target analysis model is a three-level random effects model with a random intercept for each level. The MI methods were evaluated via simulations and illustrated using empirical data, based on a case study from the Childhood to Adolescence Transition Study, a longitudinal cohort collecting repeated measures on students who were clustered within schools. In our simulations we considered a number of different scenarios covering a range of different missing data mechanisms, missing data proportions and strengths of level-2 and level-3 intra-cluster correlations.

Results: We found that all of the approaches considered produced valid inferences about both the regression coefficient corresponding to the exposure of interest and the variance components under the various scenarios within the simulation study. In the case study, all approaches led to similar results.

Conclusion: Researchers may use extensions to the single- and two-level approaches, or the three-level approaches, to adequately handle incomplete three-level data.
The two-level MI approaches with the dummy indicator (DI) extension, or the MI approaches based on three-level models, will be required in certain circumstances, such as when longitudinal data are measured at irregular time intervals. However, the single- and two-level approaches with the DI extension should be used with caution, as the DI approach has been shown to produce biased parameter estimates in certain scenarios.
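
To make the data layouts behind the single-level extension more concrete, the following minimal sketch reshapes repeated measures from long to wide format and adds dummy indicators for the higher-level clusters. It is an illustration only, not the authors' code: the field names (`school`, `student`, `wave`, `score`), the helper functions and all values are invented, and the imputation step itself is not shown.

```python
# Toy long-format records: one row per student per wave. All names and
# values are invented for illustration; None marks a missing measurement.
long_rows = [
    {"school": "A", "student": 1, "wave": 1, "score": 0.2},
    {"school": "A", "student": 1, "wave": 2, "score": None},
    {"school": "A", "student": 2, "wave": 1, "score": -0.5},
    {"school": "A", "student": 2, "wave": 2, "score": 0.1},
    {"school": "B", "student": 3, "wave": 1, "score": None},
    {"school": "B", "student": 3, "wave": 2, "score": 1.3},
]


def to_wide(rows, waves):
    """Return one record per student, with one score column per wave."""
    wide = {}
    for r in rows:
        rec = wide.setdefault(
            r["student"],
            {"school": r["school"], "student": r["student"],
             **{f"score_w{w}": None for w in waves}},
        )
        rec[f"score_w{r['wave']}"] = r["score"]
    return list(wide.values())


def add_school_dummies(rows):
    """Add 0/1 indicators for every school except the first (reference)."""
    schools = sorted({r["school"] for r in rows})
    out = []
    for r in rows:
        new = dict(r)
        for s in schools[1:]:
            new[f"school_{s}"] = 1 if r["school"] == s else 0
        out.append(new)
    return out


# Wide format absorbs the repeated-measures level; the dummy indicators
# stand in for the school level, so a single-level imputation model could
# in principle be applied to these records.
wide_rows = add_school_dummies(to_wide(long_rows, waves=[1, 2]))
for row in wide_rows:
    print(row)
```

The wide layout needs a common set of measurement occasions for every individual, which is one way to see why it does not suit data measured at irregular time intervals.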

Highlights

Three-level data arising from repeated measures on individuals who are clustered within larger units are common in health research studies
We begin with a brief description of the case study and the research question that motivated our study, which aimed to estimate the effect of early depressive symptoms on academic performance. This is followed by a description of the multiple imputation (MI) approaches we have identified for imputing incomplete three-level data resulting from longitudinal repeated measures across individuals clustered within larger clusters
Simulation study: The comparative performance of the MI approaches was very similar across the MCAR, MAR-CATS and MAR-inflated scenarios (MCAR: missing completely at random; MAR: missing at random; CATS: Childhood to Adolescence Transition Study), so we focus on the results from the MAR-CATS scenario
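
The distinction between the MCAR and MAR mechanisms named in the last highlight can be sketched in a few lines. This is a generic illustration under assumed settings, not the simulation design of the study: the covariate `x`, the 30% MCAR rate and the logistic coefficients are all invented, and the MAR-inflated scenario is not reproduced.

```python
import math
import random

random.seed(2020)  # arbitrary seed so the sketch is reproducible

n = 20000
# Hypothetical fully observed covariate (e.g. a standardised baseline measure).
x = [random.gauss(0.0, 1.0) for _ in range(n)]


def mcar_mask(size, p):
    """MCAR: every value is missing with the same probability p."""
    return [random.random() < p for _ in range(size)]


def mar_mask(observed, intercept, slope):
    """MAR: the missingness probability depends only on an observed variable."""
    probs = [1.0 / (1.0 + math.exp(-(intercept + slope * v))) for v in observed]
    return [random.random() < p for p in probs]


def rate(mask, keep):
    """Proportion of missing values among the records flagged in `keep`."""
    selected = [m for m, k in zip(mask, keep) if k]
    return sum(selected) / len(selected)


mcar = mcar_mask(n, p=0.3)
mar = mar_mask(x, intercept=-1.0, slope=1.5)

high = [v > 0 for v in x]
low = [not h for h in high]
print("MCAR missing rate, x > 0 vs x <= 0:", rate(mcar, high), rate(mcar, low))
print("MAR missing rate,  x > 0 vs x <= 0:", rate(mar, high), rate(mar, low))
```

Under MCAR the two missing rates should be close to each other, whereas under MAR the rate should be clearly higher where `x` is large, because missingness is driven by the observed covariate.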

Summary

Introduction

Three-level data arising from repeated measures on individuals who are clustered within larger units are common in health research studies. Clustered data have a natural hierarchical structure in which lower-level units are nested within higher-level units, and the hierarchy can have multiple levels [1]. Here we focus on three-level data that arise when repeated measures are taken on individuals who are themselves clustered within larger units such as schools [2]. One example is the Childhood to Adolescence Transition Study (CATS), a longitudinal study of a cohort of young people recruited just before puberty from schools in Victoria, Australia, and followed up at multiple waves, with data collected on a range of mental health outcomes [3]. When the analysis model is a linear mixed model (LMM), ignoring the multilevel structure during the imputation stage may bias estimates of the regression coefficients and their standard errors, especially when the proportion of missing data is large, and may severely bias estimates of the variance components [9, 11, 12, 13].
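
For orientation, a three-level LMM with a random intercept at each level can be written generically as below. The notation is an assumption made for this sketch rather than the paper's own: $i$ indexes schools, $j$ students within schools, $k$ repeated measures within students, and $x_{ijk}$ stands for a single exposure, with any further covariates omitted.

```latex
\begin{aligned}
y_{ijk} &= \beta_0 + \beta_1 x_{ijk} + u_i + v_{ij} + \varepsilon_{ijk}, \\
u_i &\sim N(0, \sigma_u^2), \qquad
v_{ij} \sim N(0, \sigma_v^2), \qquad
\varepsilon_{ijk} \sim N(0, \sigma_\varepsilon^2)
\end{aligned}
```

In this form $\beta_1$ is the regression coefficient for the exposure, and $\sigma_u^2$, $\sigma_v^2$ and $\sigma_\varepsilon^2$ are the variance components at the school, student and repeated-measure levels. An imputation model that omits $u_i$ or $v_{ij}$ ignores part of this structure, which is the situation the preceding paragraph warns about.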

Methods

Results

Discussion

Conclusion