Cluster-Based Best Match Scanning for Large-Scale Missing Data Imputation

Weiqing Yu,Ting Zhao,He Liu,Guangyi Liu,Bowen Kan,Wendong Zhu

doi:10.1109/bigcom.2017.48

Abstract

High-quality data are the prerequisite for analyzing and using big data to guarantee the value of the data. Missing values in data is a common yet challenging problem in data analytics and data mining, especially in the era of big data. Amount of missing values directly affects the data quality. Therefore, it is critical to properly recover missing values in the dataset. This paper presents a new imputation algorithm called Cluster-based Best Match Scanning (CBMS) designed for Big Data. It is a modification of k-NN imputation. CBMS focuses on recovering continuous numeric missing values, and aims at balancing computational complexity and accuracy. As an imputation algorithm, it can potentially reduce the time complexity of k-NN from O(n^2*d) to O(n^1.5*d), and also reduce the space/memory usage, while perform no worse than k-NN imputation. On top of that CBMS is highly parallelizable.Simulation of CBMS is conducted on smart meter reading data. Data is manually divided into training set and testing set, and testing accuracy is evaluated by computing the mean absolute deviation. Comparison with linear interpolation and k-NN imputation is made to demonstrate the power and effectiveness of our proposed CBMS algorithm.

Full Text