Distinct element counting in distributed dynamic data streams

Wenji Chen, Yong Guan

doi:10.1109/infocom.2015.7218625

Abstract

We consider a new type of distinct element counting problem in dynamic data streams, where (1) insertions and deletions of an element can appear not only in the same data stream but also in two or more different streams, (2) a deletion of a distinct element cancels out all previous insertions of that element, and (3) a distinct element can be re-inserted after it has been deleted. Our goal is to count the number of distinct elements that have been inserted but not deleted in a continuous data stream. We also solve this new type of distinct element counting problem in a distributed setting. The problem is motivated by several network monitoring and attack detection applications in which network traffic can be modelled as single or distributed dynamic streams, and the number of distinct elements in those streams, such as unsuccessful TCP connection setup requests, serves as an indicator for detecting network events such as service outages and DDoS attacks. Although tight bounds are known for distinct element counting in insertion-only data streams, no good bounds are known for dynamic data streams, nor for this new type of problem, and none of the existing solutions for distinct element counting can solve it. In this paper, we present the first solution to this problem: a space-bounded data structure combined with a computation-efficient probabilistic data streaming algorithm that estimates the number of distinct elements in single or distributed dynamic data streams. We evaluate the algorithm through both theoretical analysis and experiments on synthetic and real data traces to show its effectiveness.
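To make the three rules of the problem statement concrete, the following is a minimal exact reference sketch of the quantity being estimated. It is not the algorithm proposed in the paper: it stores every live element in a set, so its memory is not bounded, and it is only useful as a ground-truth baseline on small traces. The update format `(timestamp, op, element)`, the function names `merge_streams` and `exact_live_count`, and the assumption that each site's stream is already sorted by timestamp are all hypothetical choices made for illustration.

```python
import heapq
from typing import Hashable, Iterable, Iterator, Tuple

# Hypothetical update format (not from the paper): (timestamp, op, element),
# where op is "+" for an insertion and "-" for a deletion.
Update = Tuple[int, str, Hashable]


def merge_streams(streams: Iterable[Iterable[Update]]) -> Iterator[Update]:
    """Interleave per-site streams by timestamp.

    Assumes each individual stream is already sorted by timestamp.
    """
    return heapq.merge(*streams, key=lambda update: update[0])


def exact_live_count(updates: Iterable[Update]) -> int:
    """Count distinct elements that were inserted and not deleted since."""
    live = set()
    for _, op, element in updates:
        if op == "+":
            # Repeated insertions of the same element count once; an
            # insertion after a deletion makes the element live again.
            live.add(element)
        elif op == "-":
            # A single deletion cancels all earlier insertions.
            live.discard(element)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return len(live)


# Two monitoring sites; the deletion of "10.0.0.1:80" is observed at a
# different site from its insertions, and the element is later re-inserted.
site_a = [(1, "+", "10.0.0.1:80"), (2, "+", "10.0.0.1:80"), (5, "+", "10.0.0.2:80")]
site_b = [(3, "-", "10.0.0.1:80"), (4, "+", "10.0.0.3:80"), (6, "+", "10.0.0.1:80")]

print(exact_live_count(merge_streams([site_a, site_b])))  # 3
```

In the example, the two insertions of `10.0.0.1:80` at the first site are cancelled by the single deletion seen at the second site, and the later re-insertion makes the element count again, giving three live distinct elements.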

Full Text