Abstract

Real-time anomaly detection in massive data streams is an important research topic because much of today's data is generated by continuous temporal processes. The research area is broad, covering mathematical, statistical, and information-theoretic methodologies for anomaly detection, and it addresses problems in many domains such as health, education, finance, and government. In this paper, we analyze state-of-the-art techniques and algorithms for anomaly detection in data streams (time series data). After critically surveying how these techniques perform under the challenge of real-time anomaly detection in massive high-velocity streams, we conclude that modeling the normal behavior of the stream is a suitable approach. We evaluate several forecasting models, including Holt-Winters (HW), Taylor's Double Holt-Winters (TDHW), Hierarchical Temporal Memory (HTM), Moving Average (MA), and Autoregressive Integrated Moving Average (ARIMA). The HW and TDHW forecasting models are used to predict the normal behavior of periodic streams and to detect anomalies when the deviations between observed and predicted values exceed predefined measures. In this work, we propose an enhancement of this approach. We also briefly describe the algorithms and categorize them by type of prediction as predictive or non-predictive. We use a Genetic Algorithm (GA) to periodically optimize the HW and TDHW smoothing parameters, together with the two sliding-window parameters that improve Hyndman's MASE measure of deviation and the threshold parameter that defines the no-anomaly confidence interval [1]. We also propose a new optimization function based on input training datasets with annotated anomaly intervals, in order to detect the true anomalies and minimize the number of false ones.
The proposed method is evaluated on the well-known NUMENTA and Yahoo anomaly detection benchmark datasets with annotated anomalies, and on real log data generated by the National Education Information System (NEIS) in Macedonia.
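
The forecasting-based detection idea described above can be illustrated with a minimal sketch. It assumes the textbook additive Holt-Winters recurrences, a simple initialization from the first two periods, and a fixed absolute-deviation threshold; the function names `holt_winters_additive` and `detect_anomalies` are hypothetical, and the sketch does not reproduce the paper's MASE-based deviation measure, TDHW, or the GA optimization.

```python
def holt_winters_additive(series, period, alpha, beta, gamma):
    """Return one-step-ahead forecasts for series[period:] (additive Holt-Winters)."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full periods to initialise")
    first = series[:period]
    second = series[period:2 * period]
    # Assumed initialisation: level = mean of the first period,
    # trend = average per-step change between the first two periods.
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonal = [x - level for x in first]

    forecasts = []
    for t in range(period, len(series)):
        s = seasonal[t % period]
        # Forecast for time t uses only information available up to t - 1.
        forecasts.append(level + trend + s)
        x = series[t]
        prev_level = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (x - level) + (1 - gamma) * s
    return forecasts


def detect_anomalies(series, period, alpha, beta, gamma, threshold):
    """Return indices whose observed value deviates from the forecast by more than threshold."""
    forecasts = holt_winters_additive(series, period, alpha, beta, gamma)
    return [
        t
        for t, f in zip(range(period, len(series)), forecasts)
        if abs(series[t] - f) > threshold
    ]


if __name__ == "__main__":
    data = [10, 20, 30, 20] * 6
    data[14] = 100  # injected spike
    print(detect_anomalies(data, period=4, alpha=0.2, beta=0.1, gamma=0.1, threshold=20.0))
```

In this sketch the threshold is an absolute value chosen by hand; the abstract instead describes a threshold parameter tuned together with the smoothing parameters.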

Highlights

  • Anomaly detection in real-time massive data streams is an important research topic because most of the world's data is generated by continuous temporal processes

  • The optimal parameter values are determined on the training set and verified on the test set

  • Our proposed algorithm, HW with a Genetic Algorithm (GA), which uses GA-optimized parameters (α, β, γ, δ, k, n) and the improved MASE_t(α, β, γ, δ, k, n), is compared with Autoregressive Integrated Moving Average (ARIMA), Moving Average (MA), Hierarchical Temporal Memory (HTM) [5], and HW with smoothing parameters calculated by formula and the default Mean Absolute Scaled Error (MASE)


Summary

Introduction

Anomaly detection in real-time massive data streams (a practically infinite flow of data arriving over time, each item carrying its own timestamp) is an important research topic because most of the world's data is generated by continuous temporal processes. Real-time data processing requires continual input, time-critical processing, and instant output (e.g., an alarm) when an anomaly occurs. Instead of searching for unknown anomalies, we can model the normal behavior of the data stream in advance and compare it with the observed behavior. To maintain detection accuracy, the model of normal behavior must adapt to these challenges: it must be iterative, use only a part of the stream (even before it is permanently stored), and incorporate positive feedback into the learning process (e.g., repeated anomaly labeling in a supervised process). The most intensively developed anomaly detection methods that address these challenges are based on machine learning, neural networks, and predictive and statistical time series forecasting models.
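
The idea of iteratively modeling normal behavior from only a part of the stream can be illustrated with a small sketch. It assumes a moving-average baseline over a fixed-size window and a hand-picked absolute threshold; the class name `MovingAverageDetector` and its parameters are hypothetical and are not taken from the paper.

```python
from collections import deque


class MovingAverageDetector:
    """Streaming detector: flags values far from the mean of the recent window."""

    def __init__(self, window, threshold):
        # Only the last `window` normal values are kept in memory.
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        """Process one observation and return True if it is flagged as anomalous."""
        is_anomaly = False
        if len(self.window) == self.window.maxlen:
            mean = sum(self.window) / len(self.window)
            is_anomaly = abs(value - mean) > self.threshold
        # Flagged values are not added, so they do not distort the baseline.
        if not is_anomaly:
            self.window.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = MovingAverageDetector(window=3, threshold=5.0)
    for x in [10, 11, 9, 10, 50, 10]:
        print(x, detector.update(x))
```

Excluding flagged values from the window is a design choice made for this sketch: it keeps a single spike from shifting the baseline, at the cost of adapting more slowly if the stream genuinely changes level.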
