Abstract

Background: The Internet of Things (IoT), earth observation and large scientific experiments are today's sources of extensive amounts of sensor data. Measurements are cheap, so the volume of data is large. A standard approach in such cases is stream mining, which means that each measurement is looked at only once, during real-time processing. This requires the methods to be completely autonomous. In the past, very little attention was given to the most time-consuming part of the data mining process, i.e. data pre-processing. Objectives: In this paper we propose an algorithm for data cleaning that can be applied to real-world streaming big data. Methods/Approach: We use a short-term prediction method based on the Kalman filter to determine admissible intervals for future measurements. The model can be adapted to concept drift and is useful for detecting random additive outliers in a sensor data stream. Results: For datasets with low noise, our method performs better than the method currently in common use in batch processing scenarios. Our results on higher-noise datasets are comparable. Conclusions: We have demonstrated a successful application of the proposed method in real-world scenarios including groundwater level, server load and smart-grid data.
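The idea of an admissible interval derived from a Kalman filter prediction can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a scalar random-walk state model, and the class name `KalmanCleaner`, the noise variances `q` and `r`, and the interval width `k` (in standard deviations) are hypothetical choices made only for illustration. Each measurement is visited once; a value outside the predicted interval is flagged and replaced by the prediction.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class KalmanCleaner:
    """Scalar random-walk Kalman filter used as a one-pass outlier gate.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    q: float = 1e-3            # process noise variance (assumed)
    r: float = 1e-1            # measurement noise variance (assumed)
    k: float = 3.0             # interval half-width in standard deviations (assumed)
    x: Optional[float] = None  # current state estimate
    p: float = 1.0             # variance of the current estimate

    def process(self, z: float) -> Tuple[float, bool]:
        """Return (cleaned value, is_outlier) for a single measurement z."""
        if self.x is None:
            # The first measurement initialises the state and is always accepted.
            self.x = z
            return z, False

        # Predict: a random walk keeps the mean and inflates the variance.
        x_pred = self.x
        p_pred = self.p + self.q

        # Admissible interval: prediction +/- k standard deviations of the innovation.
        s = p_pred + self.r
        half_width = self.k * math.sqrt(s)

        if abs(z - x_pred) > half_width:
            # Outside the interval: flag it, emit the prediction, skip the update.
            self.x, self.p = x_pred, p_pred
            return x_pred, True

        # Inside the interval: standard Kalman measurement update.
        gain = p_pred / s
        self.x = x_pred + gain * (z - x_pred)
        self.p = (1.0 - gain) * p_pred
        return z, False


# Usage: feed measurements one at a time, as a stream would deliver them.
cleaner = KalmanCleaner()
for value in [10.0, 10.1, 9.9, 10.05, 50.0, 10.0]:
    cleaned, is_outlier = cleaner.process(value)
    print(value, cleaned, is_outlier)
```

Because rejected measurements never update the state, this sketch reacts to a genuine level shift only slowly, through the gradual growth of the predicted variance. A drift-aware variant would need an additional adaptation mechanism, which is not shown here.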

Highlights

  • Big Data is a term used for datasets that are too large and too complex to be handled with current methodologies (Fan et al., 2013)

  • Data points are subject to noise; 1% of the data points were considered as candidates for an additive outlier

  • Our algorithm achieves its best performance on a typical stream of sensor data, such as those found in the Internet of Things


Summary

Introduction

Big Data is a term used for datasets that are too large and too complex to be handled with current methodologies (Fan et al., 2013). Translating the data analysis into a streaming on-line process is widely considered a good approach. Stream mining brings another benefit: real-time responsiveness of the system, which many authors have identified as desirable in areas such as reporting (Belfo et al., 2015) and intrusion detection (Al Quhtani, 2017). A standard approach in such cases is stream mining, which means that each measurement is looked at only once, during real-time processing. This requires the methods to be completely autonomous. We have demonstrated a successful application of the proposed method in real-world scenarios including groundwater level, server load and smart-grid data.

