Abstract

High-quality data are the prerequisite for analyzing and using big data to guarantee the value of the data. Missing values in data is a common yet challenging problem in data analytics and data mining, especially in the era of big data. Amount of missing values directly affects the data quality. Therefore, it is critical to properly recover missing values in the dataset. This paper presents a new imputation algorithm called Cluster-based Best Match Scanning (CBMS) designed for Big Data. It is a modification of k-NN imputation. CBMS focuses on recovering continuous numeric missing values, and aims at balancing computational complexity and accuracy. As an imputation algorithm, it can potentially reduce the time complexity of k-NN from O(n^2*d) to O(n^1.5*d), and also reduce the space/memory usage, while perform no worse than k-NN imputation. On top of that CBMS is highly parallelizable.Simulation of CBMS is conducted on smart meter reading data. Data is manually divided into training set and testing set, and testing accuracy is evaluated by computing the mean absolute deviation. Comparison with linear interpolation and k-NN imputation is made to demonstrate the power and effectiveness of our proposed CBMS algorithm.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.