Abstract
Background: Missing data in a large scale survey presents major challenges. We focus on performing multiple imputation by chained equations when data contain multiple incomplete multi-item scales. Recent authors have proposed imputing such data at the level of the individual item, but this can lead to infeasibly large imputation models.
Methods: We use data gathered from a large multinational survey, where analysis uses separate logistic regression models in each of nine country-specific data sets. In these data, applying multiple imputation by chained equations to the individual scale items is computationally infeasible. We propose an adaptation of multiple imputation by chained equations which imputes the individual scale items but reduces the number of variables in the imputation models by replacing most scale items with scale summary scores. We evaluate the feasibility of the proposed approach and compare it with a complete case analysis. We perform a simulation study to compare the proposed method with alternative approaches: we do this in a simplified setting to allow comparison with the full imputation model.
Results: For the case study, the proposed approach reduces the size of the prediction models from 134 predictors to a maximum of 72 and makes multiple imputation by chained equations computationally feasible. Distributions of imputed data are seen to be consistent with observed data. Results from the regression analysis with multiple imputation are similar to, but more precise than, results for complete case analysis; for the same regression models a 39% reduction in the standard error is observed. The simulation shows that our proposed method can perform comparably against the alternatives.
Conclusions: By substantially reducing imputation model sizes, our adaptation makes multiple imputation feasible for large scale survey data with multiple multi-item scales. For the data considered, analysis of the multiply imputed data shows greater power and efficiency than complete case analysis. The adaptation of multiple imputation makes better use of available data and can yield substantively different results from simpler techniques.
Electronic supplementary material: The online version of this article (doi:10.1186/s13104-016-1853-5) contains supplementary material, which is available to authorized users.
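The reduction described in the Methods can be illustrated with a small sketch. It rests on one assumed reading of the abstract: when an item is imputed, the other items of its own scale stay in the model individually, while every other scale is represented by a single summary score. The function name `predictors_for_item`, the `<scale>_score` naming, and the toy scales and covariates are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def predictors_for_item(target: str,
                        scales: Dict[str, List[str]],
                        covariates: List[str]) -> List[str]:
    """Return predictor names for imputing one scale item.

    Assumption (not confirmed by the source): items from the target's own
    scale are kept individually, and each other scale contributes only a
    summary score named "<scale>_score".
    """
    # Find the scale that contains the target item.
    own_scale = next(name for name, items in scales.items() if target in items)
    # Keep the remaining items of that scale as individual predictors.
    predictors = [item for item in scales[own_scale] if item != target]
    # Replace every other scale with a single summary score.
    predictors += [f"{name}_score" for name in scales if name != own_scale]
    # Non-scale covariates are always included.
    predictors += covariates
    return predictors


# Hypothetical toy layout: 3 scales of 4 items each, plus 2 covariates.
scales = {
    "A": ["A1", "A2", "A3", "A4"],
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2", "C3", "C4"],
}
covariates = ["age", "sex"]

# Item-level model: every other item plus the covariates.
full_model_size = sum(len(items) for items in scales.values()) - 1 + len(covariates)
# Reduced model under the assumed scheme.
reduced = predictors_for_item("A2", scales, covariates)
```

In this toy layout the item-level model for `A2` has 13 predictors and the reduced model has 7. In a chained-equations cycle the summary scores would presumably need to be recomputed from the current imputed item values after each update; that step is left out of the sketch.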
Highlights
Missing data in a large scale survey presents major challenges
Missing data is ubiquitous in research, and survey data is prone to incomplete responses
Assumptions must be made about the mechanism of missingness; no analysis with missing data is free of such assumptions
Summary
Missing data in a large scale survey presents major challenges. We focus on performing multiple imputation by chained equations when data contain multiple incomplete multi-item scales. Missing data is ubiquitous in research, and survey data is prone to incomplete responses. Data may be missing completely at random (MCAR), where the probability of missingness depends on neither the observed nor the unobserved data. When data is missing at random (MAR), the probability of missingness does not depend on the unobserved data, conditional on the observed data, but may be related to the observed data. Data may be missing not at random (MNAR), whereby missingness depends on the values of the unobserved data, even conditional on the observed data [1, 2, 3]. It is acknowledged that a gap still exists between the techniques recommended in the methodological literature and those employed in practice; traditional ad hoc techniques such as deletion and single imputation are still applied routinely [3, 5, 6].
Plumpton et al. BMC Res Notes (2016) 9:45
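The three missingness mechanisms defined above can be made concrete with a small simulation. This is a minimal sketch, not part of the article: the variables `x` (always observed) and `y` (subject to missingness), the logistic missingness probabilities, and the sample size are all illustrative assumptions.

```python
import math
import random

random.seed(0)


def logistic(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


n = 20000
# x is always observed; y is the variable that may go missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]


def apply_missingness(prob_missing):
    """Return a copy of y with entries set to None with probability prob_missing(x_i, y_i)."""
    return [None if random.random() < prob_missing(xi, yi) else yi
            for xi, yi in zip(x, y)]


# MCAR: the probability of missingness is constant.
y_mcar = apply_missingness(lambda xi, yi: 0.3)
# MAR: the probability depends only on the observed x.
y_mar = apply_missingness(lambda xi, yi: logistic(xi - 1))
# MNAR: the probability depends on the value of y itself.
y_mnar = apply_missingness(lambda xi, yi: logistic(yi - 1))


def observed_mean(values):
    """Complete-case mean: average over the non-missing entries only."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


full_mean = sum(y) / n
print("full data:", round(full_mean, 3))
print("MCAR:", round(observed_mean(y_mcar), 3))
print("MAR: ", round(observed_mean(y_mar), 3))
print("MNAR:", round(observed_mean(y_mnar), 3))
```

In this setup the complete-case mean of `y` stays close to the full-data mean under MCAR and is pulled downward under both MAR and MNAR, because larger values are more likely to be missing. Under MAR the distortion can in principle be corrected by conditioning on `x`, which is the situation multiple imputation is designed for; under MNAR the observed data alone do not suffice.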