Abstract

Missing data are a common problem in empirical studies, and a single imputation method is often applied to deal with it. The first contribution of this paper is to analyse the empirical performance of some traditional single imputation methods when they are applied to the estimation of the Gini index, a popular measure of inequality. Various methods for constructing confidence intervals for the Gini index are also empirically evaluated. We consider several empirical measures to analyse the performance of estimators and confidence intervals, allowing us to quantify the magnitude of the non-response bias problem. We find extremely large biases under certain non-response mechanisms, and this problem gets noticeably worse as the proportion of missing data increases. When the correlation coefficient between the target and auxiliary variables is large, the regression imputation method may notably mitigate this bias, yielding appropriate mean square errors. We also find that confidence intervals have poor coverage rates when the probability of data being missing is not uniform, and that the regression imputation method substantially alleviates this problem as the correlation coefficient increases.

Highlights

  • We consider several empirical measures to analyse the performance of estimators and confidence intervals, allowing us to quantify the magnitude of the non-response bias problem

  • For a low Gini index (Figure 1), we observe that Complete Case Analysis (CCA) and the regression imputation method (Reg) yield satisfactory values for Relative Bias (RB) under an MCAR mechanism for the various values of p

  • Reg and Nearest Neighbour Imputation (NNI) perform better than Random Hot Deck (RHD) and CCA in terms of RB
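
The quantities named in the highlights can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`gini`, `gini_cca`, `regression_impute`, `relative_bias`) are hypothetical, the Gini estimator is the common sorted-rank form (the paper may use a different variant), and the regression imputation is a plain least-squares line on a single auxiliary variable x. Missing values of the target variable y are assumed to be encoded as `None`.

```python
def gini(values):
    """Sample Gini index from the sorted-rank formula; values are assumed non-negative."""
    y = sorted(values)
    n = len(y)
    # G = sum_i (2i - n - 1) * y_(i) / (n * sum_i y_(i)), with ranks i = 1..n
    return sum((2 * i - n - 1) * v for i, v in enumerate(y, start=1)) / (n * sum(y))


def gini_cca(y):
    """Complete Case Analysis: drop the missing values (None) and estimate on the rest."""
    return gini([v for v in y if v is not None])


def regression_impute(x, y):
    """Replace each missing y by the prediction of a least-squares line fitted on respondents.

    Assumption: one auxiliary variable x, fully observed. Predictions are not
    clipped, so they can be negative if the fitted line crosses zero.
    """
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = sum(xi for xi, _ in obs) / len(obs)
    mean_y = sum(yi for _, yi in obs) / len(obs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi if yi is not None else intercept + slope * xi for xi, yi in zip(x, y)]


def relative_bias(estimates, true_value):
    """Empirical relative bias: (mean of the estimates - true value) / true value."""
    return (sum(estimates) / len(estimates) - true_value) / true_value


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.2, 1.9, None, 4.1, None]
    print("CCA:", gini_cca(y))
    print("Reg:", gini(regression_impute(x, y)))
```

In a simulation study, `relative_bias` would be applied to the Gini estimates collected over many replicated samples, once per imputation method, with the Gini index of the full population as `true_value`.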

Summary

Introduction

We consider several empirical measures to analyse the performance of estimators and confidence intervals, allowing us to quantify the magnitude of the non-response bias problem. We find that confidence intervals have poor coverage rates when the probability of data being missing is not uniform, and that the regression imputation method substantially improves the handling of this problem as the correlation coefficient increases. Note that it is quite common for individuals to choose not to answer sensitive questions, such as those related to income, wealth, drug use, etc. Such item non-response, where a participating unit leaves particular questions unanswered, differs from unit non-response, where a sampled unit provides no data at all, and this distinction is important when it comes to handling the problem of missing data. According to Rubin’s theory, non-response is viewed as a random process where each unit has a certain probability of being missing. This process is termed a non-response mechanism, and is unknown in real applications.
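
The idea of a non-response mechanism can be sketched in Python as follows. This is an illustrative assumption, not the mechanisms used in the paper: `mcar_probs`, `logistic_probs` and `apply_nonresponse` are hypothetical names, the logistic form for the non-uniform case is one common choice among many, and missing values are encoded as `None`.

```python
import math
import random


def mcar_probs(n, p):
    """Uniform mechanism: every unit is missing with the same probability p."""
    return [1.0 - p] * n


def logistic_probs(x, a, b):
    """Non-uniform mechanism (assumed form): response probability depends on x via a logistic curve."""
    return [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]


def apply_nonresponse(y, response_probs, rng):
    """Keep y_i with probability response_probs[i]; otherwise mark it missing (None)."""
    return [yi if rng.random() < pi else None for yi, pi in zip(y, response_probs)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    x = [rng.gauss(0.0, 1.0) for _ in range(10)]
    y = [math.exp(xi) for xi in x]  # toy positive target, invented for illustration
    print(apply_nonresponse(y, mcar_probs(len(y), 0.3), rng))
    # With b < 0, units with large x (and here large y) respond less often.
    print(apply_nonresponse(y, logistic_probs(x, 0.5, -1.0), rng))
```

Under the uniform mechanism the respondents remain a random subsample, whereas under the logistic mechanism with these parameters large values are under-represented among respondents, which is the kind of situation in which an estimator based only on complete cases can become biased.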

