A General Method for Estimating Correlated Aggregates Over a Data Stream

Srikanta Tirthapura,David P Woodruff

doi:10.1007/s00453-014-9917-1

Abstract

On a stream \({\fancyscript{S}}\) of two dimensional data items \((x,y)\) where \(x\) is an item identifier and \(y\) is a numerical attribute, a correlated aggregate query \(C(\sigma ,AGG,{\fancyscript{S}})\) asks to first apply a selection predicate \(\sigma \) along the \(y\) dimension, followed by an aggregation \(AGG\) along the \(x\) dimension. For selection predicates of the form \((y < c)\) or \((y > c)\), where parameter \(c\) is provided at query time, we present new streaming algorithms and lower bounds for estimating correlated aggregates. Our main result is a general method that reduces the estimation of a correlated aggregate \(AGG\) to the streaming computation of \(AGG\) over an entire stream, for an aggregate that satisfies certain conditions. This results in the first sublinear space algorithms for the correlated estimation of a large family of statistics, including frequency moments. Our experimental validation shows that the memory requirements of these algorithms are significantly smaller than existing linear storage solutions, and that these achieve a fast per-record processing time. We also study the setting when items have weights. In the case when weights can be negative, we give a strong space lower bound which holds even if the algorithm is allowed up to a logarithmic number of passes over the data. We complement this with a small space algorithm which uses a logarithmic number of passes.

Full Text