Abstract

The fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data by filling each missing item with multiple observed values, without resorting to artificially created values. The corresponding R package FHDI (J. Im, I. Cho, and J. K. Kim, “An R package for fractional hot deck imputation,” R J., vol. 10, no. 1, pp. 140–154, 2018) holds generality and efficiency, but it is not adequate for tackling big incomplete data due to its excessive memory requirement and long running time. As a first step toward tackling big incomplete data by leveraging the FHDI, we developed a new parallel fractional hot-deck imputation program (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when the P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of the P-FHDI for datasets with many instances (big-$n$) or high dimensionality (big-$p$) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
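
To make the core idea concrete, below is a minimal, hypothetical Python sketch of fractional hot-deck imputation on a toy vector: each missing item receives M observed donor values, each carrying a fractional weight of 1/M. This is an illustration only; the actual FHDI draws donors from imputation cells and derives fractional weights from estimated cell probabilities.

```python
# Conceptual sketch only; not the authors' implementation.
import numpy as np

def fractional_hot_deck(y, M=3, seed=0):
    """y: 1-D array with np.nan marking missing items.
    Returns (row_index, donor_value, fractional_weight) triples."""
    rng = np.random.default_rng(seed)
    donors = y[~np.isnan(y)]              # observed values form the donor pool
    out = []
    for i in np.flatnonzero(np.isnan(y)):
        picks = rng.choice(donors, size=M, replace=False)  # M donors per item
        out.extend((i, v, 1.0 / M) for v in picks)         # equal fractional weights
    return out

y = np.array([2.1, np.nan, 3.5, 4.0, np.nan, 2.8])
for idx, val, w in fractional_hot_deck(y):
    print(f"row {idx}: donor {val:.1f}, weight {w:.2f}")
```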

Highlights

  • The incomplete data problem has been pandemic in most scientific and engineering domains

  • The report by the American Psychological Association strongly discourages the removal of cases with missing values, which seriously biases sample statistics [5]

  • To transform the FHDI into big data-oriented imputation software, we developed the first version of a parallel fractional hot-deck imputation (P-FHDI) program, which inherits all the advantages of the serial FHDI and overcomes its computational limitations by leveraging algorithm-oriented parallel computing techniques


Summary

INTRODUCTION

The incomplete data problem has been pandemic in most scientific and engineering domains. A relatively better, yet simple, strategy is to replace the missing values with conditional expected values obtained from a probabilistic model of the incomplete data, which are subsequently fed into particular learning models [6]. Theoretical approaches, such as model-based methods [7] or the use of imputation theory, have received great attention. The FHDI slightly outperforms the predictive mean matching (PMM) and fuzzy c-means (FCM) imputations. Although this relative performance of the FHDI depends on the adopted incomplete data, this result, along with a similar prior investigation [2], underpins the positive impact of the P-FHDI on big-data-oriented machine learning and statistical learning.
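
As a minimal illustration of the conditional-expected-value strategy mentioned above, the sketch below fits a bivariate Gaussian to the complete cases and imputes each missing entry with E[x2 | x1]. The Gaussian model and the synthetic data are assumptions made for illustration; they are not the specific method of [6].

```python
# Conditional-mean imputation under an assumed bivariate Gaussian model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X[rng.random(200) < 0.2, 1] = np.nan          # make ~20% of column 2 missing

cc = X[~np.isnan(X[:, 1])]                    # complete cases only
mu = cc.mean(axis=0)
S = np.cov(cc, rowvar=False)

# For a bivariate Gaussian: E[x2 | x1] = mu2 + (S21 / S11) * (x1 - mu1)
miss = np.isnan(X[:, 1])
X[miss, 1] = mu[1] + S[1, 0] / S[0, 0] * (X[miss, 0] - mu[0])
```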

KEY ALGORITHMS OF THE SERIAL FHDI
Cell construction
Estimation of cell probability
Imputation
Variance estimation
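
To hint at what the cell construction step above does, here is a rough sketch that discretizes each variable into k quantile-based categories, so that records sharing the same category vector fall into the same imputation cell. The quantile rule is an assumption for illustration; the collapsing of sparse cells (Table 1 of the paper) and the subsequent steps are omitted.

```python
# Illustrative cell construction via quantile discretization; an assumption,
# not the paper's exact categorization rule.
import numpy as np

def make_cells(Y, k=3):
    """Y: 2-D array with np.nan for missing items.
    Returns an integer category matrix z; 0 marks a missing item."""
    z = np.zeros(Y.shape, dtype=int)
    for j in range(Y.shape[1]):
        col, obs = Y[:, j], ~np.isnan(Y[:, j])
        edges = np.quantile(col[obs], np.linspace(0, 1, k + 1)[1:-1])
        z[obs, j] = np.digitize(col[obs], edges) + 1   # categories 1..k
    return z

Y = np.array([[1.0, 2.0], [np.nan, 2.5], [3.0, np.nan], [4.0, 5.0], [2.0, 1.0]])
print(make_cells(Y))   # rows sharing a category vector fall in the same cell
```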
PARALLEL ALGORITHMS FOR THE FHDI
Parallel cell construction and estimation of cell probability
Parallel variance estimation
Parallel imputation
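
To illustrate the parallel pattern behind replicate-based variance estimation, the conceptual mpi4py sketch below spreads leave-one-out jackknife replicates of a sample mean across MPI ranks. The real P-FHDI is a compiled MPI program, so this shows only the distribution pattern, not its implementation.

```python
# Conceptual mpi4py sketch of parallel jackknife variance estimation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

y = np.arange(1.0, 101.0)            # stand-in for one imputed column
n = len(y)
theta_hat = y.mean()                 # full-sample point estimate

# Each rank evaluates a strided share of the n leave-one-out replicates.
local = [np.delete(y, i).mean() for i in range(rank, n, size)]
local_ss = sum((t - theta_hat) ** 2 for t in local)

# Jackknife variance: v = (n - 1)/n * sum_i (theta_(i) - theta_hat)^2
ss = comm.allreduce(local_ss, op=MPI.SUM)
if rank == 0:
    print("jackknife variance of the mean:", (n - 1) / n * ss)
```

Run with, e.g., mpiexec -n 4 python jackknife_sketch.py; because the replicates are independent, the dominant loop scales with the number of processors.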
COST ANALYSIS AND SCALABILITY
VARIABLE REDUCTION FOR BIG-p DATASETS
FUTURE RESEARCH
CONCLUSION