Abstract
We present UDDSketch (Uniform DDSketch), a novel sketch for fast and accurate tracking of quantiles in data streams. The sketch is heavily inspired by the recently introduced DDSketch and is based on a novel bucket collapsing procedure that overcomes the intrinsic limits of the corresponding DDSketch procedure. Specifically, the DDSketch bucket collapsing procedure does not allow formal guarantees to be derived on the accuracy of quantile estimation for data that does not follow a sub-exponential distribution. By contrast, UDDSketch is designed so that accuracy guarantees can be given over the full range of quantiles and for arbitrary input distributions. Moreover, our algorithm adaptively exploits the full memory budget in order to guarantee the best possible accuracy over the full range of quantiles. Extensive experimental results on both synthetic and real datasets confirm the validity of our approach.
Highlights
A data stream σ can be thought of as a sequence of n items drawn from a universe U.
The main contributions of this paper are the following: (i) we provide a novel collapsing procedure for the DDSketch algorithm; (ii) we formally provide an error bound, modeling the relationship between accuracy and the space occupied by the sketch for arbitrary input distributions; (iii) we show that our algorithm adaptively exploits the full memory budget in order to guarantee the best possible accuracy over the full range of quantiles; (iv) we extensively compare the behaviour and the accuracy of DDSketch and UDDSketch over a large set of both synthetic and real input datasets; (v) we provide freely available C implementations of both DDSketch and UDDSketch for full reproducibility of results.
We have introduced UDDSketch (Uniform DDSketch), a novel sketch for fast and accurate tracking of quantiles in data streams.
Summary
A data stream σ can be thought of as a sequence of n items drawn from a universe U. DDSketch (Distributed Distribution Sketch) [5] is a recent sketch data structure providing relative accuracy for tracking quantiles in data streams whose underlying distribution is heavy-tailed. This sketch is conceptually very simple and can be implemented either with an unlimited number of buckets or with a fixed maximum number of buckets. The main contributions of this paper are the following: (i) we provide a novel collapsing procedure for the DDSketch algorithm; (ii) we formally provide an error bound, modeling the relationship between accuracy and the space occupied by the sketch for arbitrary input distributions; (iii) we show that our algorithm adaptively exploits the full memory budget in order to guarantee the best possible accuracy over the full range of quantiles; (iv) we extensively compare the behaviour and the accuracy of DDSketch and UDDSketch over a large set of both synthetic and real input datasets; (v) we provide freely available C implementations of both DDSketch and UDDSketch for full reproducibility of results.
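The summary does not spell out the collapsing procedure, so the following is only a minimal illustrative sketch of how a bucket-bounded, relative-accuracy sketch with a uniform collapse might look. It rests on several assumptions that are not stated in the text above: buckets are logarithmic with base γ = (1 + α)/(1 − α), only positive values are handled, and a collapse merges every pair of adjacent buckets so that γ becomes γ² and the error bound α becomes 2α/(1 + α²). The class name `UniformCollapseSketch` and its methods are hypothetical; this is written in Python for brevity and is not the authors' C implementation.

```python
import math
from collections import defaultdict


class UniformCollapseSketch:
    """Illustrative only: hypothetical names, positive values only."""

    def __init__(self, alpha, max_buckets):
        self.alpha = alpha                        # current relative-error bound
        self.gamma = (1 + alpha) / (1 - alpha)    # assumed logarithmic bucket base
        self.max_buckets = max_buckets            # memory budget, in buckets
        self.buckets = defaultdict(int)           # bucket index -> item count
        self.count = 0

    def _index(self, x):
        # Bucket i is assumed to cover the interval (gamma^(i-1), gamma^i].
        return math.ceil(math.log(x) / math.log(self.gamma))

    def add(self, x):
        if x <= 0:
            raise ValueError("this sketch handles positive values only")
        self.buckets[self._index(x)] += 1
        self.count += 1
        # Collapse as many times as needed to respect the bucket budget.
        while len(self.buckets) > self.max_buckets:
            self._uniform_collapse()

    def _uniform_collapse(self):
        # Merge buckets 2j-1 and 2j into bucket j across the whole index range.
        merged = defaultdict(int)
        for i, c in self.buckets.items():
            merged[-(-i // 2)] += c               # integer ceil(i / 2)
        self.buckets = merged
        # The merged buckets are logarithmic with base gamma^2, which
        # corresponds to a larger error bound of 2*alpha / (1 + alpha^2).
        self.gamma = self.gamma ** 2
        self.alpha = 2 * self.alpha / (1 + self.alpha ** 2)

    def quantile(self, q):
        # Walk the buckets in order until the cumulative count passes the rank.
        rank = q * (self.count - 1)
        seen = 0
        for i in sorted(self.buckets):
            seen += self.buckets[i]
            if seen > rank:
                # Representative value of bucket i under the assumed layout.
                return 2 * self.gamma ** i / (self.gamma + 1)
        raise ValueError("sketch is empty")


sketch = UniformCollapseSketch(alpha=0.01, max_buckets=64)
for value in range(1, 1001):
    sketch.add(float(value))
median_estimate = sketch.quantile(0.5)
```

Under these assumptions, every collapse trades accuracy for space uniformly across all buckets, so the current `alpha` applies to every quantile rather than only to those in the part of the range left uncollapsed.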