Abstract
On a stream of two-dimensional data items $(x,y)$, where $x$ is an item identifier and $y$ is a numerical attribute, a correlated aggregate query requires us to first apply a selection predicate along the second ($y$) dimension, followed by an aggregation along the first ($x$) dimension. For selection predicates of the form $(y < c)$ or $(y > c)$, where the parameter $c$ is provided at query time, we present new streaming algorithms and lower bounds for estimating statistics of the resulting substream of elements that satisfy the predicate. We provide the first sublinear-space algorithms for a large family of statistics in this model, including frequency moments. We experimentally validate our algorithms, showing that their memory requirements are significantly smaller than those of existing linear-storage schemes for large datasets, while simultaneously achieving fast per-record processing time. We also study the problem when the items have weights. Allowing negative weights makes it possible to analyze values that occur in the symmetric difference of two datasets. We give a strong space lower bound that holds even if the algorithm is allowed up to a logarithmic number of passes over the data (before the query is presented). We complement this with a small-space algorithm that uses a logarithmic number of passes.
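To make the query model concrete, the following is a minimal sketch of an exact, linear-storage baseline for a correlated frequency moment. It is not the sublinear-space algorithm of this paper. The function name `correlated_frequency_moment`, the predicate direction $y < c$, and the toy stream are assumptions chosen for illustration only.

```python
from collections import Counter


def correlated_frequency_moment(stream, c, k=2):
    """Exact k-th frequency moment of identifiers x among items with y < c.

    Naive baseline: it stores one counter per distinct selected identifier,
    so its space is linear in the number of distinct identifiers.
    """
    # Selection along the y dimension: keep only items satisfying y < c.
    counts = Counter(x for x, y in stream if y < c)
    # Aggregation along the x dimension: sum of f_x ** k over selected ids.
    return sum(f ** k for f in counts.values())


# Hypothetical toy stream of (identifier, attribute) pairs.
stream = [("a", 1), ("b", 5), ("a", 3), ("c", 2), ("a", 9)]

# The threshold c is supplied only at query time.
f2 = correlated_frequency_moment(stream, c=4, k=2)
```

In this toy stream the items with $y < 4$ are `("a", 1)`, `("a", 3)`, and `("c", 2)`, so the selected frequencies are 2 for `a` and 1 for `c`, and the second moment is $2^2 + 1^2 = 5$. Because $c$ is not known while the stream is read, this baseline must retain the stream (or an equivalent structure) to answer arbitrary thresholds, which is the linear cost that a sublinear-space summary aims to avoid.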