Abstract

The scientific literature lacks a straightforward answer as to which of the existing methods of missing data imputation is most suitable in terms of simplicity, accuracy and ease of use. This paper explores various methods of data imputation and then proposes a robust method. It uses simulated data sets generated from various distributions. A regression function is fitted to each simulated data set, and the residual standard error of the fitted function is obtained. Data are then randomly deleted from the set of independent variables to create artificial non-response, and suitable methods are used to impute the missing data. The methods considered are mean, regression, hot and cold decking, multiple and median imputation, listwise deletion, the EM algorithm and the nearest neighbour method. The paper investigates the three most common traditional methods of handling missing data to establish the most optimal one. The most suitable method is taken to be the one whose imputed data sample has characteristics that do not vary considerably from those of the original data set before imputation. This variation is measured using the regression intercept and the residual standard error. The R statistical package was used for most of the regression analyses. Microsoft Excel was used to determine the correlation of columns in the hot decking method because it is readily available as a component of the Microsoft package. The results in the data analysis section show intercept and R-squared values that closely mirror those of the original data sets, suggesting that median imputation is the better method among the conventional methods. This finding is important for research, given how often data are missing in scientific studies. Finding and using the median is simple, so most researchers have a ready tool at hand for handling missing data.
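The evaluation procedure described in the abstract can be illustrated with a minimal sketch. The paper itself used R and Microsoft Excel; the Python code below is only an illustration, and the sample size, the exponential predictor, the true coefficients, the 20% missing rate and the function names `fit_ols` and `compare_imputations` are all assumptions made for this example rather than details taken from the paper. It fits a regression to a complete simulated data set, deletes predictor values at random, and refits after listwise deletion, mean imputation and median imputation so that the intercepts and residual standard errors can be compared with those of the original fit.

```python
import numpy as np


def fit_ols(x, y):
    """Fit y = a + b*x by least squares.

    Returns (intercept, slope, residual standard error).
    """
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Residual standard error with n - 2 degrees of freedom.
    rse = np.sqrt(resid @ resid / (len(y) - 2))
    return coef[0], coef[1], rse


def compare_imputations(n=500, missing_rate=0.2, seed=0):
    """Compare regression fits before and after simple imputation.

    All settings here are illustrative assumptions, not the paper's design.
    """
    rng = np.random.default_rng(seed)
    # Assumed simulation: skewed predictor, linear response with normal noise.
    x = rng.exponential(scale=2.0, size=n)
    y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

    # Delete predictor values completely at random (artificial non-response).
    missing = rng.random(n) < missing_rate
    observed = ~missing

    results = {"original": fit_ols(x, y)}
    # Listwise deletion: keep only the complete cases.
    results["listwise"] = fit_ols(x[observed], y[observed])
    # Mean and median imputation: fill every gap with one summary value.
    fills = {"mean": x[observed].mean(), "median": np.median(x[observed])}
    for name, fill in fills.items():
        results[name] = fit_ols(np.where(missing, fill, x), y)
    return results


if __name__ == "__main__":
    for name, (a, b, rse) in compare_imputations().items():
        print(f"{name:9s} intercept={a:.3f} slope={b:.3f} rse={rse:.3f}")
```

Under the paper's criterion, the method whose intercept and residual standard error stay closest to the `original` row would be judged the most suitable. A single simulated data set such as this one is not enough to support that judgement, so the sketch should be read as a template for the comparison, not as a reproduction of the reported results.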

Highlights

  • Research is the driving force behind the development of any nation

  • When part of the data is missing from a survey and the missing data are ignored by using only the available sample, the result may not be representative of the population under study, since some of its characteristics are missing

  • According to [16], listwise deletion is regarded as the most common and easiest method of dealing with missing data; it is also called complete case analysis [11]. This approach therefore leads to a reduction in sample size, which in turn reduces statistical power and calls into question how representative the remaining sample is of the population being studied


Summary

Introduction

Research is the driving force behind the development of any nation. Any endeavour in this area requires that researchers arm themselves with the right tools to obtain accurate and relevant information from the survey being undertaken. Missing data is a big challenge in many areas of research, especially in social research. When part of the data is missing from a survey and the missing data are ignored by using only the available sample, the result may not be representative of the population under study, since some of its characteristics are missing. For a detailed review of these approaches, see [25].

EM Algorithm
List Wise Deletion
Mean Substitution
Regression Imputation
Multiple Imputations
Hot Decking
Median Imputation
Regression for Complete Data Set
Methods
Parameter Estimation
Results
Conclusion
