Abstract

Generally, multiple imputation is the recommended method for handling item nonresponse in surveys. Usually it is applied as a chained equations approach based on parametric models. As Burgette & Reiter (2010) have shown, classification and regression trees (CART) are a good alternative to parametric conditional models, especially when the models are complex, when interactions and nonlinearities have to be handled, and when the number of variables is very large. Large-scale panel studies involve many types of data sets with special data situations. Building on the study of Burgette & Reiter (2010), this thesis further assesses the suitability of CART in combination with multiple imputation and data augmentation in some of these special situations. Unit nonresponse, and panel attrition in particular, is a problem with a high impact on survey quality in the social sciences. The first application aims at imputing missing values by CART to generate a proper data base for the decision whether weighting has to be considered. This decision was based on auxiliary information about respondents and nonrespondents. Both the auxiliary information and the participation status, serving as response indicator, contained missing values that had to be imputed. The described situation originated in a school survey. The schools were asked to transmit auxiliary information about their students without knowing whether those students had participated in the survey. In the end, the auxiliary information and the participation status were to be merged via identification numbers by the survey research institute. Some data were collected and transmitted correctly, some were not. Owing to these errors, four data situations were distinguished and handled in different ways: 1) complete cases, i.e. no missing values in either the participation status or the auxiliary information.
In other words, it was known whether the student had participated, and the auxiliary information was completely observed and correctly merged. 2) The participation status was missing, but the auxiliary information was complete. This happened when the school transmitted a student's auxiliary data completely, but the combination with the survey participation information failed. 3) The participation status was available, but there were missing values in the auxiliary information, and 4) there were missing values in the participation status as well as in the auxiliary information. Situation 1), the complete data situation, was handled by a standard probit analysis. A probit forecast draw was applied in situations 2) and 4); it was based on a Metropolis-Hastings algorithm that used the known maximum number of participants conditional on an auxiliary variable. In practice, the numbers of male and female students who participated in the survey were known, and these numbers served as upper bounds when the auxiliary information was combined with a probable participation status. All missing values in the auxiliary information, i.e. in situations 3) and 4), were imputed by CART: the imputation values were drawn via a Bayesian bootstrap from the final nodes of the classification and regression trees. Together, the imputation and the probit model with the response indicator as dependent variable formed a data augmentation approach. All steps were chained to use as much information as possible for the analysis. The application shows that CART can be flexibly combined with data augmentation, resulting in a Markov chain Monte Carlo method, or more precisely a Gibbs sampler. The analysis of the (meta-)data revealed a selectivity due to nonparticipation that could be explained by the variable sex: female students were more likely to participate than male students.
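The leaf-level draw described above can be sketched as follows. This is a minimal illustration, not code from the thesis: the function name and donor representation are invented, and the CART fit that yields the leaf's donor pool is assumed to have happened beforehand.

```python
import random

def bayesian_bootstrap_draw(donors, rng=random):
    """Draw one imputed value from the observed values ("donors") in a
    CART final node.  The Bayesian bootstrap assigns Dirichlet(1, ..., 1)
    weights to the donors -- realised here via normalised exponential
    draws -- and then samples a single donor value with those weights."""
    gaps = [rng.expovariate(1.0) for _ in donors]
    # A uniform draw on [0, sum(gaps)) picks a donor proportionally
    # to its Dirichlet weight.
    u = rng.random() * sum(gaps)
    cumulative = 0.0
    for value, gap in zip(donors, gaps):
        cumulative += gap
        if u <= cumulative:
            return value
    return donors[-1]
```

Repeating this draw independently for every missing case in a leaf (and for every imputed data set) propagates the sampling uncertainty that a plain random hot-deck draw would understate.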
The CART-based results differed clearly from those of the complete-case analysis ignoring the second-level random effect, as well as from those of the complete-case analysis including the second-level random effect. Surveys based on flexible filtering offer the opportunity to adjust the questionnaire to the respondents' situation. Hence, data quality can be increased and response burden decreased. Filters are therefore often implemented in large-scale surveys, resulting in a complex data structure that has to be taken into account when imputing. The second study of this thesis shows how a data set containing many filters and a high filter depth, which limits the admissible range of values for multiple imputation, can be handled with CART. More specifically, a very large and complex data set contained variables that were used for the analysis of household net income. The variables were distributed over modules, i.e. blocks of questions referring to certain topics, which are partially steered by filters. Additionally, within those modules the survey was steered by filter questions. As a consequence, the number of respondents differed for each variable. It can be assumed that, due to the structure of the survey, missing values were mainly produced by filters or caused intentionally by the respondent, and that only a minor part were missing for other reasons, e.g. because interviewers overlooked them. The second application shows that the described procedure can accommodate this complex data structure, as the draws from CART are flexibly limited by the changing filter structure, which is itself generated by imputed filter-steering values. Given the 213 variables chosen for the household net income imputation, CART, in contrast to other approaches, clearly saves time, as no model has to be specified for each variable to be imputed.
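Restricting the CART draws to filter-admissible values could be sketched as below, under the assumption that each (possibly itself imputed) value of a filter-steering variable maps to a known set of admissible answers. All names here are hypothetical illustrations, not the thesis's implementation.

```python
import random

def filtered_draw(donors, filter_value, admissible, rng=random):
    """Draw an imputation value while respecting the filter structure.

    `admissible` maps each filter value to the set of values a
    respondent on that filter path may take; donors outside that set
    are excluded from the leaf's pool before drawing."""
    allowed = admissible[filter_value]
    pool = [d for d in donors if d in allowed]
    if not pool:
        raise ValueError("no admissible donor for this filter path")
    return rng.choice(pool)
```

Because the filter value may itself be an imputed quantity, the admissible range can change from one chained-equations iteration to the next, which is exactly the flexibility the second application relies on.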
Still, there is a need for feedback concerning the suitability of CART-based imputation. Therefore, as the third application of this thesis, a simulation study was conducted to assess the performance of CART in combination with multiple imputation by chained equations (MICE) on cross-sectional data. Additionally, it was checked whether a change of settings improves the performance for the given data. There were three different data generating functions for Y. The first was a typical linear model with a normally distributed error term. The second included a chi-squared error term. The third included a non-linear (logarithmic) term. The rate of missing values was set to 60%, governed by a missing at random mechanism. Regression parameters, means, quantiles, and correlations were calculated and combined. The quality of the estimation for the before-deletion, complete-case, and imputed data was measured by coverage, i.e. the proportion of 95% confidence intervals for the estimated parameters that contain the true value. Additionally, bias and mean squared error were calculated. Then the settings were changed for the first type of data set, i.e. the ordinary linear model. First, the initialization was changed to a tree-based initialization instead of draws from the unconditional empirical distribution. Second, the number of iterations of the tree-based MI approach was increased from 20 to 50. Third, the number of imputed data sets combined for the confidence intervals was doubled from 15 to 30. CART-based MICE showed a good performance (coverage of 88.8% to 91.8%) for all three data sets. Moreover, changing the CART settings for the partitioning of the simulated data proved not worthwhile. The third application thus also offers insights into the performance and the settings of CART-based MICE; many default settings and peculiarities had to be considered when using it.
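The combination of the imputed data sets into confidence intervals follows Rubin's rules. A minimal sketch, using a normal approximation for the interval rather than the exact degrees-of-freedom correction, and with all names invented for illustration:

```python
import statistics

def pool_rubin(estimates, variances, z=1.96):
    """Pool m completed-data estimates with Rubin's rules.

    Returns the pooled point estimate, the total variance
    T = W + (1 + 1/m) * B, and an approximate 95% interval, where W is
    the mean within-imputation variance and B the between-imputation
    variance of the m point estimates."""
    m = len(estimates)
    qbar = statistics.fmean(estimates)   # pooled point estimate
    w = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance
    t = w + (1 + 1 / m) * b              # total variance
    half = z * t ** 0.5
    return qbar, t, (qbar - half, qbar + half)
```

Coverage in the simulation is then simply the share of replications whose pooled interval contains the true parameter value.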
The results suggest that the default settings and the performance of CART in general lead to sufficient results when applied to cross-sectional data. Regarding the settings, changing the initialization from tree-based draws to draws from the unconditional empirical distribution is recommended for typical survey data, i.e. data with missing values in large parts of the data. The fourth application gives some insights into the performance of CART-based MICE on panel data. To this end, the first simulated data set was extended to panel data containing information from two waves. Four data situations were distinguished: three random effects models with different combinations of time-variant and time-invariant variables, and a fixed effects model. The latter was defined by an intercept correlated with a regressor, the missingness-steering variable X1. CART-based MICE showed a good performance (coverage of 89.0% to 91.4%) for all four data sets. CART chose the variables from the correct wave for each of the four data situations and waves: only first-wave information was used for the imputation of the first-wave variable Yt=1, and only second-wave information for the second-wave variable Yt=2. This is crucial because, in all four data situations, the data for each wave were generated either independently of the other wave or with time-variant variables. This thesis demonstrates that CART can be used as a highly flexible imputation component, which can be recommended, with some constraints, for large-scale panel studies. Missing values in cross-sectional data as well as in panel data can be handled with CART-based MICE. Of course, the accuracy depends on the available explanatory power and correlations, for cross-sectional and panel data alike. The combination of CART with data augmentation and the extension concerning the filtering of the data are both feasible and promising.
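A single conditional-imputation step of such a CART-based MICE cycle can be illustrated with a deliberately tiny one-split "tree" on a binary predictor. A real application grows full trees and cycles steps like this over every incomplete variable for several iterations; the names and the `None`-coding of missing values are illustrative assumptions.

```python
import random

def leaf_draw_impute(x, y, rng=random):
    """One chained-equations step for variable y given binary predictor x.

    A single split on x defines two leaves; each missing y (coded None)
    is replaced by a random draw from the observed y values in its own
    leaf, falling back to the other leaf if its own leaf has no donors."""
    leaves = {0: [], 1: []}
    for xi, yi in zip(x, y):
        if yi is not None:
            leaves[xi].append(yi)
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            donors = leaves[xi] if leaves[xi] else leaves[1 - xi]
            yi = rng.choice(donors)
        completed.append(yi)
    return completed
```

The wave-selection behaviour reported above corresponds, in this picture, to the tree splitting only on same-wave predictors, so that each leaf's donor pool contains values from the correct wave.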
In addition, further research on the performance of CART is highly recommended, for example by extending the current simulation study with variables that change over time based on their own past values, with more waves, or with different data generating processes.
