Abstract

In the era of big data, a significant amount of data is produced in many application areas. However, due to various reasons including sensor failures, communication failures, environmental disruptions, and human errors, missing values occur frequently. These missing values pose a challenge for downstream data mining approaches, so they must be handled at the preprocessing stage of data mining. Several approaches for handling missing data have been proposed in the past. These approaches consider the whole dataset when making a prediction, which makes the imputation cumbersome. This paper proposes a procedure that uses the local similarity structure of the dataset for imputation. K-means clustering combined with weighted KNN imputes the missing values efficiently. The results are compared against imputation by mean substitution and Fuzzy C-Means (FCM). The proposed technique performs better than these other imputation procedures.
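
The abstract names the ingredients (K-means clustering followed by weighted KNN) but not the exact procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that clustering is run on the complete rows only, that an incomplete row is assigned to the nearest centroid using its observed features, and that neighbors are weighted by inverse distance. The function names `kmeans` and `impute` and the parameters `k_clusters` and `k_neighbors` are hypothetical.

```python
import math
import random


def _dist(a, b, dims):
    """Euclidean distance between a and b over the given feature indices."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))


def _assign(rows, centroids, dims):
    """Index of the nearest centroid for every row."""
    return [min(range(len(centroids)), key=lambda c: _dist(r, centroids[c], dims))
            for r in rows]


def kmeans(rows, k, iters=50, seed=0):
    """Plain Lloyd's K-means on complete rows; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    dims = range(len(rows[0]))
    for _ in range(iters):
        labels = _assign(rows, centroids, dims)
        updated = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, _assign(rows, centroids, dims)


def impute(data, k_clusters=2, k_neighbors=3, seed=0):
    """Fill None entries by weighted KNN inside the nearest K-means cluster."""
    complete = [r for r in data if None not in r]
    centroids, labels = kmeans(complete, k_clusters, seed=seed)
    result = []
    for row in data:
        if None not in row:
            result.append(list(row))
            continue
        # Only the observed features can be compared (assumption of this sketch).
        obs = [d for d, v in enumerate(row) if v is not None]
        c = min(range(k_clusters), key=lambda j: _dist(row, centroids[j], obs))
        # Local search: neighbors come from the assigned cluster only,
        # falling back to all complete rows if that cluster is empty.
        pool = [r for r, lab in zip(complete, labels) if lab == c] or complete
        nearest = sorted(pool, key=lambda r: _dist(row, r, obs))[:k_neighbors]
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = [1.0 / (_dist(row, r, obs) + 1e-9) for r in nearest]
        filled = list(row)
        for d, v in enumerate(row):
            if v is None:
                filled[d] = sum(w * r[d] for w, r in zip(weights, nearest)) / sum(weights)
        result.append(filled)
    return result
```

Calling `impute(data)` on a list of equal-length rows, with `None` marking each missing entry, returns a copy with the gaps filled. Because neighbors are searched within a single cluster rather than across the whole dataset, each imputation touches only a local subset of the rows, which is the efficiency argument the abstract makes.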

Highlights

  • Since the age of big data began, data collection from various sources and the resulting volume of data have grown enormously [1]

  • Multivariate datasets are prevalent in several real-world applications, such as electrical system analysis, meteorological or economic strategy planning, security control, and many more

  • Multiple sensors are deployed to produce datasets, and they typically have a single purpose: generating data as activity occurs

Summary

Introduction

Since the age of big data began, data collection from various sources and the resulting volume of data have grown enormously [1]. Multiple sensors are deployed to produce datasets, and they typically have a single purpose: generating data as activity occurs. In a power grid application, several sensors diagnose the state of power transformers and produce data by monitoring the state of gases over time [2]. In the era of IoT, a vast number of sensors are used to record multivariate environmental conditions, for example, air or water pollution [3]. One major issue handled in the preprocessing step is missing values. The raw dataset generated by a sensor network typically includes missing values due to rough working conditions or uncontrolled variables such as adverse weather, infrastructure malfunctions, or unstable signals. The problem of missing data is prevalent in many applications. As a result, the observed data cannot be evaluated because the datasets are incomplete.
