Abstract

Multiple imputation is a technique for handling data sets with missing values. The method fills in the missing values several times, creating several completed data sets for analysis. Each data set is analyzed separately using techniques designed for complete data, and the results are then combined in such a way that the variability due to imputation may be incorporated. Methods of imputing the missing values can vary from fully parametric to nonparametric. In this paper, we compare partially parametric and fully parametric regression-based multiple-imputation methods. The fully parametric method that we consider imputes missing regression outcomes by drawing them from their predictive distribution under the regression model, whereas the partially parametric methods are based on imputing outcomes or residuals for incomplete cases using values drawn from the complete cases. For the partially parametric methods, we suggest a new approach to choosing complete cases from which to draw values. In a Monte Carlo study in the regression setting, we investigate the robustness of the multiple-imputation schemes to misspecification of the underlying model for the data. Sources of model misspecification considered include incorrect modeling of the mean structure as well as incorrect specification of the error distribution with regard to heaviness of the tails and heteroscedasticity. The methods are compared with respect to the bias and efficiency of point estimates and the coverage rates of confidence intervals for the marginal mean and distribution function of the outcome. We find that when the mean structure is specified correctly, all of the methods perform well, even if the error distribution is misspecified. The fully parametric approach, however, produces slightly more efficient estimates of the marginal distribution function of the outcome than do the partially parametric approaches. 
When the mean structure is misspecified, all of the methods still perform well for estimating the marginal mean, although the fully parametric method shows slight increases in bias and variance. For estimating the marginal distribution function, however, the fully parametric method breaks down in several situations, whereas the partially parametric methods maintain their good performance. In an application to AIDS research, in a setting similar to, although slightly more complicated than, that of the Monte Carlo study, we examine how estimates of the distribution of the time from infection with HIV to the onset of AIDS vary with the method used to impute the residual time to AIDS for subjects with right-censored data. The fully parametric and partially parametric techniques produce similar results, suggesting that the model selection used for fully parametric imputation was adequate. Our application provides an example of how multiple imputation can be used to combine information from two cohorts to estimate quantities that cannot be estimated directly from either cohort separately.
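
The impute, analyze, and combine steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the function names are hypothetical, the fully parametric step assumes a normal linear model with a standard noninformative prior, the partially parametric step draws residuals uniformly from all complete cases after a bootstrap refit (the paper's proposed rule for choosing complete cases is not reproduced here), and the combining step uses the standard multiple-imputation formulas for the pooled estimate and its total variance.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit: returns coefficients, residuals, residual variance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    return beta, resid, resid @ resid / dof


def impute_parametric(X_obs, y_obs, X_mis, rng):
    """Fully parametric sketch: draw outcomes from the predictive distribution.

    Assumption: normal linear model with a noninformative prior, so sigma^2
    and beta are first drawn from their posterior, then outcomes are drawn.
    """
    beta_hat, _, s2 = fit_ols(X_obs, y_obs)
    dof = X_obs.shape[0] - X_obs.shape[1]
    sigma2 = s2 * dof / rng.chisquare(dof)
    cov = sigma2 * np.linalg.inv(X_obs.T @ X_obs)
    beta = rng.multivariate_normal(beta_hat, cov)
    noise = rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])
    return X_mis @ beta + noise


def impute_residual_draw(X_obs, y_obs, X_mis, rng):
    """Partially parametric sketch: predicted mean plus a complete-case residual.

    Assumption: complete cases are bootstrapped before refitting to reflect
    parameter uncertainty, and residuals are drawn uniformly from all of them.
    """
    idx = rng.integers(0, len(y_obs), size=len(y_obs))
    beta, resid, _ = fit_ols(X_obs[idx], y_obs[idx])
    return X_mis @ beta + rng.choice(resid, size=X_mis.shape[0], replace=True)


def combine_rubin(estimates, variances):
    """Pool m completed-data estimates; returns (estimate, total variance)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    within = u.mean()          # average completed-data variance
    between = q.var(ddof=1)    # variability due to imputation
    return q.mean(), within + (1.0 + 1.0 / m) * between


def pooled_marginal_mean(X, y, missing, impute, m=5, seed=0):
    """Impute m times, estimate the marginal mean each time, then pool."""
    rng = np.random.default_rng(seed)
    obs = ~missing
    estimates, variances = [], []
    for _ in range(m):
        y_completed = y.copy()
        y_completed[missing] = impute(X[obs], y[obs], X[missing], rng)
        estimates.append(y_completed.mean())
        variances.append(y_completed.var(ddof=1) / len(y_completed))
    return combine_rubin(estimates, variances)
```

Calling `pooled_marginal_mean` once with `impute_parametric` and once with `impute_residual_draw` on the same incomplete data gives two pooled estimates of the marginal mean that can be compared, which mirrors on a very small scale the kind of comparison the Monte Carlo study makes.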
