Abstract
Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free method for handling multivariate missing data: each missing item is filled with multiple observed values, without resorting to artificially created values. The corresponding R package FHDI [J. Im, I. Cho, and J. K. Kim, “An R package for fractional hot deck imputation,” <i>R J.</i>, vol. 10, no. 1, pp. 140–154, 2018] offers generality and efficiency, but it is not adequate for tackling big incomplete data because of its excessive memory requirements and long running times. As a first step toward tackling big incomplete data with the FHDI, we developed a new parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-<inline-formula><tex-math notation="LaTeX">$n$</tex-math></inline-formula>) or high dimensionality (big-<inline-formula><tex-math notation="LaTeX">$p$</tex-math></inline-formula>) and confirms its favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
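To make the core idea concrete, the following is a toy sketch (not the paper's algorithm, and not the FHDI R package's API) of how fractional hot-deck imputation fills one missing entry with several observed donor values, each carrying a fractional weight, so that no artificial value is ever created. The function name and its parameters are hypothetical.

```python
import random

def fractional_hot_deck(values, num_donors=3, seed=0):
    """Toy sketch of fractional hot-deck imputation for one variable.

    Each missing entry (None) is replaced by `num_donors` observed donor
    values drawn from the respondents, each assigned fractional weight
    1/num_donors; observed entries keep weight 1. The real FHDI selects
    donors within imputation cells of jointly observed patterns -- this
    sketch skips that and samples donors at random.
    """
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    result = []
    for v in values:
        if v is None:
            picks = rng.sample(donors, min(num_donors, len(donors)))
            weight = 1.0 / len(picks)
            result.append([(d, weight) for d in picks])  # multiple weighted donors
        else:
            result.append([(v, 1.0)])  # observed value, full weight
    return result
```

A weighted mean over each entry's (value, weight) pairs then reproduces a completed dataset without fabricating any value that was never observed.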
Highlights
The incomplete data problem is pervasive in most scientific and engineering domains
A report by the American Psychological Association strongly discourages the deletion of missing values, which seriously biases sample statistics [5]
To transform the FHDI into big-data-oriented imputation software, we developed a first version of the parallel fractional hot-deck imputation program, which inherits all the advantages of the serial FHDI and overcomes its computational limitations by leveraging algorithm-oriented parallel computing techniques
Summary
The incomplete data problem is pervasive in most scientific and engineering domains. A relatively simple yet better strategy is to replace the missing values with conditional expected values obtained from a probabilistic model of the incomplete data, which is subsequently fed into a particular learning model [6]. Theoretical approaches, such as model-based methods [7] or the use of imputation theory, have received great attention. The FHDI slightly outperforms the PMM and FCM imputations. Although this relative performance of the FHDI depends on the adopted incomplete data, this result, along with a similar prior investigation [2], underpins the positive impact of the P-FHDI on big-data-oriented machine learning and statistical learning.
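The conditional-expected-value strategy mentioned above can be illustrated with a minimal sketch, assuming a single predictor and a least-squares fit on the complete pairs; the function name is hypothetical and this is not the model of [6].

```python
def conditional_mean_impute(x, y):
    """Replace missing y entries (None) with E[y|x] estimated from a
    least-squares line fitted on the fully observed (x, y) pairs --
    a toy instance of model-based conditional-mean imputation."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    # Missing y values are filled with the fitted conditional mean at x.
    return [b if b is not None else mean_y + slope * (a - mean_x)
            for a, b in zip(x, y)]
```

Unlike hot-deck methods, this approach does create artificial values and understates variability, which is one motivation for the fractional hot-deck alternative the paper pursues.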
IEEE Transactions on Knowledge and Data Engineering