Guest editor introduction: special section on online analysis and querying of continuous data streams

R Rastogi

doi:10.1109/tkde.2003.1198386

Abstract

In a number of application domains, data arrives continuously in the form of a stream and needs to be processed in an online fashion. For example, in the network installations of large Telecom and Internet service providers, detailed usage information (e.g., Call Detail Records or CDRs, IP traffic statistics due to SNMP/RMON polling, etc.) from different parts of the network needs to be continuously collected and analyzed for interesting trends. Other applications that generate rapid, continuous, and large volumes of stream data include transactions in retail chains, ATM and credit card operations in banks, weather measurements, sensor networks, etc. Further, for many mission-critical tasks such as fraud/anomaly detection in Telecom networks, it is important to be able to answer queries in real time and infer interesting patterns online. As a result, recent years have witnessed an increasing interest in designing single-pass algorithms for querying and mining data streams that examine each element in the stream only once. The large volumes of stream data, real-time response requirements of streaming applications, and architecture of modern computers impose two additional constraints on algorithms for querying streams: 1) the time for processing each stream element must be small, and 2) the amount of memory available to the query processor is limited. Thus, the challenge is to develop algorithms that can summarize data streams in a concise, but reasonably accurate, synopsis that can be stored in the allotted (small) amount of memory and can be used to provide approximate answers to user queries with some guarantees on the approximation error. Given the plethora of streaming applications and the nontrivial computational challenges they pose, the timing for a special issue on the topic could not have been better.
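As a small illustration of the constraints listed above, the sketch below uses the Misra-Gries frequent-items summary, a standard textbook technique that is not one of the methods presented in this special issue. It reads the stream once, keeps at most k counters, and undercounts any element by at most n/(k+1), where n is the stream length. The function name `misra_gries` and the sample stream are illustrative assumptions.

```python
def misra_gries(stream, k):
    """Single-pass frequent-items synopsis that never holds more than k counters.

    Illustrative stand-in only (not an algorithm from the special issue).
    For every element: true_count - n/(k+1) <= estimate <= true_count.
    """
    counters = {}
    for item in stream:                      # each element is examined exactly once
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Memory is full: decrement every counter and drop those that reach zero.
            # This loop costs O(k) in the worst case but is constant time amortized.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical sample stream: "a" occurs 5 times out of n = 9 elements.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
summary = misra_gries(stream, k=2)
print(summary)  # {'a': 3}
```

With k = 2 the error bound is n/(k+1) = 3, so the estimate of 3 for "a" is consistent with its true count of 5. Elements absent from the summary have an implied estimate of 0.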
This special issue of Transactions on Knowledge and Data Engineering presents five papers that propose novel synopsis structures and fundamental algorithmic techniques for analyzing and querying continuous data streams. Of the five papers, four explore the space/accuracy trade-off of stream processing algorithms for important problems like clustering and distinct value estimation, and one addresses issues related to the semantics of query operators on (infinite) streams.
The first paper, by Guha et al., is illustrative of a general class of streaming algorithms based on the principle of divide-and-conquer. Conceptually, the algorithm proposed in the paper partitions the input stream into chunks and computes a succinct summary for each chunk. Then, in subsequent steps, it repeatedly combines chunk summaries from the previous step to compute new summaries until the final desired summary for the stream is obtained. Guha et al. show how this divide-and-conquer approach can be used to compute k centers for a stream, where each intermediate summary is a set of O(k) centers. The end result is a deterministic constant-factor approximation algorithm for clustering data streams.
In the second paper, Cormode et al. exploit properties of p-stable distributions to estimate, with high probability, the number of distinct elements in a stream. Essentially, given a vector of random variables from a p-stable distribution, the Lp norm of a stream can be computed by summing the variables, after weighting each variable with the frequency of the corresponding stream element. Thus, choosing random variables from a p-stable distribution with a small p yields an estimate of the number of distinct values in the stream, because as p approaches 0 the p-th power of the Lp norm approaches the number of elements with nonzero frequency.
Wavelet transforms have been shown to be effective for approximating the frequency distribution of data. In their paper, Gilbert et al. present a randomized "sketch"-based method for estimating, in a streaming environment, the top few Wavelet coefficients with the highest energy.
A key contribution of the paper is a special construction (based on second-order Reed-Muller codes) of the random variables used in sketching, so that Wavelet coefficients (and arbitrary range-sum queries) can be obtained very fast. Furthermore, the random sketch synopses considered in the paper can also be used to estimate join sizes, histograms, quantiles, and frequent elements in a stream.
Interesting questions arise, especially for infinitely long data streams, when we consider the semantics of query operators like joins of two or more streams or group-by operators on a single stream. The paper by Tucker et al. refers to the first category of operators as unbounded stateful operators, which are those that need to maintain state with no upper bound on its size and so may run out of memory. The latter class of operators belongs to the category of blocking operators, which are those that need to read the entire input before emitting a single output and might never produce a result (if the stream is infinite). In order to address the above-mentioned problems posed by unbounded stateful and blocking operators, Tucker et al. enhance the streaming
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 3, MAY/JUNE 2003 513
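The divide-and-conquer scheme described above for the paper by Guha et al. (summarize each chunk, then repeatedly combine summaries) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: the per-chunk clustering step uses the greedy farthest-point heuristic as a stand-in, points are coordinate tuples, and the names `farthest_point_centers`, `summarize`, `stream_cluster`, and `chunk_size` are hypothetical.

```python
import math


def farthest_point_centers(points, k):
    """Pick up to k representatives greedily (farthest-point heuristic).

    Stand-in for the per-chunk clustering subroutine; NOT taken from the paper.
    """
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        # Next center: the point farthest from its nearest already-chosen center.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def summarize(weighted_points, k):
    """Reduce a list of (point, weight) pairs to at most k weighted centers."""
    if not weighted_points:
        return []
    centers = farthest_point_centers([p for p, _ in weighted_points], k)
    weights = [0] * len(centers)
    for p, w in weighted_points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        weights[nearest] += w                # each center absorbs the weight it represents
    return list(zip(centers, weights))


def stream_cluster(stream, k, chunk_size):
    """Single pass over the stream, holding fewer than 2 * chunk_size items at once."""
    if chunk_size <= k:
        raise ValueError("chunk_size must exceed k so that summaries shrink")
    buffer, summaries = [], []
    for point in stream:
        buffer.append((point, 1))
        if len(buffer) == chunk_size:
            summaries.extend(summarize(buffer, k))       # summarize the chunk
            buffer = []
            if len(summaries) >= chunk_size:
                summaries = summarize(summaries, k)      # combine earlier summaries
    return summarize(summaries + buffer, k)              # final summary for the stream


# Hypothetical usage: 90 points interleaved from three well-separated groups.
origins = [(0, 0), (100, 0), (0, 100)]
points = [(ox + i % 3, oy + i % 2) for i in range(30) for (ox, oy) in origins]
print(stream_cluster(points, k=3, chunk_size=10))
```

Each returned pair is a center together with the number of stream points it stands for, so the weights always sum to the stream length. Carrying weights forward is what lets a later combine step treat an earlier summary as a compressed version of its chunk.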
