Anomalies Detection Using Isolation in Concept-Drifting Data Streams

Maurras Ulbricht Togbe,Maroua Bahri,Raja Chiky,Mariam Barry,Yousra Chabchoub,Aliou Boly

doi:10.3390/computers10010013

Abstract

Detecting anomalies in streaming data is an important issue for many application domains, such as cybersecurity, natural disasters, or bank frauds. Different approaches have been designed in order to detect anomalies: statistics-based, isolation-based, clustering-based, etc. In this paper, we present a structured survey of the existing anomaly detection methods for data streams with a deep view on Isolation Forest (iForest). We first provide an implementation of Isolation Forest Anomalies detection in Stream Data (IForestASD), a variant of iForest for data streams. This implementation is built on top of scikit-multiflow (River), which is an open source machine learning framework for data streams containing a single anomaly detection algorithm in data streams, called Streaming half-space trees. We performed experiments on different real and well known data sets in order to compare the performance of our implementation of IForestASD and half-space trees. Moreover, we extended the IForestASD algorithm to handle drifting data by proposing three algorithms that involve two main well known drift detection methods: ADWIN and KSWIN. ADWIN is an adaptive sliding window algorithm for detecting change in a data stream. KSWIN is a more recent method and it refers to the Kolmogorov–Smirnov Windowing method for concept drift detection. More precisely, we extended KSWIN to be able to deal with n-dimensional data streams. We validated and compared all of the proposed methods on both real and synthetic data sets. In particular, we evaluated the F1-score, the execution time, and the memory consumption. The experiments show that our extensions have lower resource consumption than the original version of IForestASD with a similar or better detection efficiency.

Highlights

Data stream mining is a sub-domain of data mining that continuously processes incoming data on the fly
In an unsupervised mode where there is no information or knowledge about the data, providing a u parameter becomes problematic, and IForestASD’s ability to detect drift depends on it. In this extended version, we propose improving IForestASD by proposing two approaches based on a drift detection that can be done at two levels: model drift detection ADWIN (ADaptive WINdowing) or data drift detection Kolmogorov– Smirnov WINdowing (KSWIN) (Kosmolgorov Simirnov WINdows)
Because half-space trees (HST) is always faster than IForestASD (IFA), for both training and testing time, by an order of 400 for the worst case, we reported the ratio between running time of the two models: IFA over HST

Summary

Introduction

Data stream mining is a sub-domain of data mining that continuously processes incoming data on the fly. Data stream mining imposes several challenges, such as the evolving nature of data and its huge size, which is potentially infinite. These challenges require efficient and optimized mining methods, in terms of processing time and memory usage, which are adapted to the stream setting. The stream processing must be performed under the one-pass, aka. The continuous arrival of data streams in rapid, time-varying, and possibly unbounded way may raise new fundamental research problems. As research in the data stream environment keeps progressing, the problem of how to efficiently alert when facing an abnormal behavior, data, or patterns Stream anomaly detection algorithms refer to the methods that are able to extract enough knowledge from the data in order to compute the anomaly scores while dealing with evolving data streams

Objectives

Methods

Results

Discussion

Conclusion