Sketching Sampled Data Streams

Florin Rusu, Alin Dobra

doi:10.1109/icde.2009.31

Abstract

Sampling is a universal method for reducing the running time of computations: the computation is performed on a much smaller sample, and the result is then scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams and have proved useful for estimating frequency moments and aggregates over joins. One way to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than over the entire data stream. In this paper we analyze the behavior of the sketch estimator when it is computed over a sample of the stream instead of the entire stream, for the size-of-join and self-join-size problems. Our analysis is developed for a generic sampling process. We instantiate its results for the three major types of sampling (Bernoulli sampling, which is used for load shedding; sampling with replacement, which is used to generate i.i.d. samples from a distribution; and sampling without replacement, which is used by online aggregation engines) and compare these particular results with those of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data, even when the sample size is only $10\%$ or less of the dataset size. This is equivalent to a speed-up factor of at least $10$ when updating the sketch.
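
The idea of sketching a sample instead of the full stream can be illustrated with a minimal Python sketch for the self-join size under Bernoulli sampling. It is an illustration under stated assumptions, not the estimator analyzed in the paper: the class `AGMSSketch`, the function `estimate_self_join_from_sample`, and all parameter names are hypothetical, and the $\pm 1$ values are derived from a cryptographic hash rather than from a 4-wise independent family. The bias correction uses only the binomial moments of the sampled frequencies: if each tuple is kept with probability $p$, then $E[f_i'^2] = p^2 f_i^2 + p(1-p) f_i$, so subtracting $(1-p)$ times the sample size and dividing by $p^2$ gives an unbiased estimate of $\sum_i f_i^2$.

```python
import hashlib
import random


class AGMSSketch:
    """Basic AGMS-style sketch: counter k holds sum_i f_i * xi_k(i), xi in {-1, +1}.

    Assumption: signs come from the bits of a BLAKE2b digest, used here as a
    stand-in for a 4-wise independent +/-1 family.
    """

    MAX_COUNTERS = 512  # one 64-byte digest yields 512 sign bits

    def __init__(self, num_counters, seed=0):
        if not 1 <= num_counters <= self.MAX_COUNTERS:
            raise ValueError("num_counters must be between 1 and 512")
        self.seed = seed
        self.counters = [0] * num_counters

    def _sign_bits(self, key):
        # Hash the key once and reuse the digest bits for every counter.
        data = f"{self.seed}:{key!r}".encode()
        digest = hashlib.blake2b(data, digest_size=64).digest()
        return int.from_bytes(digest, "big")

    def update(self, key, count=1):
        bits = self._sign_bits(key)
        for k in range(len(self.counters)):
            sign = 1 if (bits >> k) & 1 else -1
            self.counters[k] += sign * count

    def self_join_estimate(self):
        # Average of squared counters estimates the sum of squared frequencies.
        return sum(c * c for c in self.counters) / len(self.counters)


def estimate_self_join_from_sample(stream, p, num_counters=256, seed=0):
    """Sketch a Bernoulli(p) sample of `stream` and rescale to the full stream."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    rng = random.Random(seed)
    sketch = AGMSSketch(num_counters, seed)
    sample_size = 0
    for key in stream:
        if rng.random() < p:  # keep each tuple independently with probability p
            sketch.update(key)
            sample_size += 1
    raw = sketch.self_join_estimate()  # estimates sum_i f_i'^2 over the sample
    # Remove the p(1-p) * f_i term in expectation, then undo the p^2 scaling.
    return (raw - (1.0 - p) * sample_size) / (p * p)
```

With `p=0.1`, only about one tuple in ten triggers a sketch update, which is where the speed-up described above comes from; the price is the extra variance introduced by sampling on top of the variance of the sketch itself.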
