Abstract

The detection of local outliers over high-volume data streams is critical for diverse real-time applications in the real world, where the distributions in different subsets of the data tend to be skewed. However, existing methods are not scalable to large-scale high-volume data streams owing to the high complexity of the re-detection of data updates. In this work, we propose a top-n local outlier detection method based on Kernel Density Estimation (KDE) over large-scale high-volume data streams. First, we define a KDE-based Outlier Factor (KOF) to measure the local outlierness score for the data points. Then, we propose the upper bounds of the KOF and an upper-bound-based pruning strategy to quickly eliminate the majority of the inlier points by leveraging the upper bounds without computing the expensive KOF scores. Moreover, we design an Upper-bound pruning-based top-nKOF detection method (UKOF) over data streams to efficiently address the data updates in a sliding window environment. Furthermore, we propose a Lazy update method of UKOF (LUKOF) for bulk updates in high-speed large-scale data streams to further minimize the computation cost. Our comprehensive experimental study demonstrates that the proposed method outperforms the state-of-the-art methods by up to 3,600 times in speed, while achieving the best performance in detecting local outliers over data streams.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call