Abstract

The fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data by filling each missing item with multiple observed values, without resorting to artificially created values. The corresponding R package FHDI (J. Im, I. Cho, and J. K. Kim, “An R package for fractional hot deck imputation,” R J., vol. 10, no. 1, pp. 140–154, 2018) holds generality and efficiency, but it is not adequate for tackling big incomplete data due to its excessive memory requirement and long running time. As a first step toward tackling big incomplete data by leveraging the FHDI, we developed a new parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when the P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of the P-FHDI for datasets with many instances (big-$n$) or high dimensionality (big-$p$) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
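
To make the core idea concrete, below is a minimal, hypothetical Python sketch of fractional hot-deck imputation on a toy vector: each missing item receives M observed donor values, each carrying a fractional weight of 1/M. This is an illustration only; the actual FHDI draws donors from imputation cells and derives fractional weights from estimated cell probabilities.

```python
# Conceptual sketch only; not the authors' implementation.
import numpy as np

def fractional_hot_deck(y, M=3, seed=0):
    """y: 1-D array with np.nan marking missing items.
    Returns (row_index, donor_value, fractional_weight) triples."""
    rng = np.random.default_rng(seed)
    donors = y[~np.isnan(y)]              # observed values form the donor pool
    out = []
    for i in np.flatnonzero(np.isnan(y)):
        picks = rng.choice(donors, size=M, replace=False)  # M donors per item
        out.extend((i, v, 1.0 / M) for v in picks)         # equal fractional weights
    return out

y = np.array([2.1, np.nan, 3.5, 4.0, np.nan, 2.8])
for idx, val, w in fractional_hot_deck(y):
    print(f"row {idx}: donor {val:.1f}, weight {w:.2f}")
```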

Highlights

  • The incomplete data problem has been pandemic in most scientific and engineering domains

  • The report by the American Psychological Association strongly discourages the removal of cases with missing values, which seriously biases sample statistics [5]

  • To transform the FHDI into big data-oriented imputation software, we developed the first version of a parallel fractional hot-deck imputation (P-FHDI) program, which inherits all the advantages of the serial FHDI and overcomes its computational limitations by leveraging algorithm-oriented parallel computing techniques


Summary

INTRODUCTION

The incomplete data problem has been pandemic in most scientific and engineering domains. A relatively better, yet simple, strategy is to replace the missing values with conditional expected values obtained from a probabilistic model of the incomplete data, which are subsequently fed into particular learning models [6]. Theoretical approaches, such as model-based methods [7] or the use of imputation theory, have received great attention. The FHDI slightly outperforms the predictive mean matching (PMM) and fuzzy c-means (FCM) imputations. Although this relative performance of the FHDI depends on the adopted incomplete data, this result, along with a similar prior investigation [2], underpins the positive impact of the P-FHDI on big-data-oriented machine learning and statistical learning.
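
As a minimal illustration of the conditional-expected-value strategy mentioned above, the sketch below fits a bivariate Gaussian to the complete cases and imputes each missing entry with E[x2 | x1]. The Gaussian model and the synthetic data are assumptions made for illustration; they are not the specific method of [6].

```python
# Conditional-mean imputation under an assumed bivariate Gaussian model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X[rng.random(200) < 0.2, 1] = np.nan          # make ~20% of column 2 missing

cc = X[~np.isnan(X[:, 1])]                    # complete cases only
mu = cc.mean(axis=0)
S = np.cov(cc, rowvar=False)

# For a bivariate Gaussian: E[x2 | x1] = mu2 + (S21 / S11) * (x1 - mu1)
miss = np.isnan(X[:, 1])
X[miss, 1] = mu[1] + S[1, 0] / S[0, 0] * (X[miss, 0] - mu[0])
```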

KEY ALGORITHMS OF THE SERIAL FHDI
Cell construction
Estimation of cell probability
Imputation
Variance estimation
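
To hint at what the cell construction step above does, here is a rough sketch that discretizes each variable into k quantile-based categories, so that records sharing the same category vector fall into the same imputation cell. The quantile rule is an assumption for illustration; the collapsing of sparse cells (Table 1 of the paper) and the subsequent steps are omitted.

```python
# Illustrative cell construction via quantile discretization; an assumption,
# not the paper's exact categorization rule.
import numpy as np

def make_cells(Y, k=3):
    """Y: 2-D array with np.nan for missing items.
    Returns an integer category matrix z; 0 marks a missing item."""
    z = np.zeros(Y.shape, dtype=int)
    for j in range(Y.shape[1]):
        col, obs = Y[:, j], ~np.isnan(Y[:, j])
        edges = np.quantile(col[obs], np.linspace(0, 1, k + 1)[1:-1])
        z[obs, j] = np.digitize(col[obs], edges) + 1   # categories 1..k
    return z

Y = np.array([[1.0, 2.0], [np.nan, 2.5], [3.0, np.nan], [4.0, 5.0], [2.0, 1.0]])
print(make_cells(Y))   # rows sharing a category vector fall in the same cell
```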
PARALLEL ALGORITHMS FOR THE FHDI
Parallel cell construction and estimation of cell probability
Parallel variance estimation
Parallel imputation
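
To illustrate the parallel pattern behind replicate-based variance estimation, the conceptual mpi4py sketch below spreads leave-one-out jackknife replicates of a sample mean across MPI ranks. The real P-FHDI is a compiled MPI program, so this shows only the distribution pattern, not its implementation.

```python
# Conceptual mpi4py sketch of parallel jackknife variance estimation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

y = np.arange(1.0, 101.0)            # stand-in for one imputed column
n = len(y)
theta_hat = y.mean()                 # full-sample point estimate

# Each rank evaluates a strided share of the n leave-one-out replicates.
local = [np.delete(y, i).mean() for i in range(rank, n, size)]
local_ss = sum((t - theta_hat) ** 2 for t in local)

# Jackknife variance: v = (n - 1)/n * sum_i (theta_(i) - theta_hat)^2
ss = comm.allreduce(local_ss, op=MPI.SUM)
if rank == 0:
    print("jackknife variance of the mean:", (n - 1) / n * ss)
```

Run with, e.g., mpiexec -n 4 python jackknife_sketch.py; because the replicates are independent, the dominant loop scales with the number of processors.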
COST ANALYSIS AND SCALABILITY
VARIABLE REDUCTION FOR BIG-p DATASETS
FUTURE RESEARCH
CONCLUSION