Determining Exact Quantiles with Randomized Summaries

Ziling Chen, Haoquan Guan, Xiangdong Huang, Chen Wang, Jianmin Wang, Shaoxu Song

doi:10.1145/3639280

Abstract

Quantiles are fundamental statistics in various data science tasks, but they are costly to compute, e.g., by loading the entire dataset into memory for ranking. With limited memory space, as is prevalent in end devices or heavily loaded databases, the data must be scanned in multiple passes. The idea is to gradually shrink the range of the queried quantile until it is small enough to fit in memory for ranking the result. Existing methods use deterministic sketches to determine the exact range of the quantile, known as a deterministic filter, which can be inefficient in range shrinking. In this study, we propose to shrink the ranges more aggressively, using randomized summaries such as the KLL sketch. That is, with high probability the quantile lies in a smaller range, namely a probabilistic filter, determined by the randomized sketch. Specifically, we estimate the expected number of passes for determining the exact quantiles with probabilistic filters, and select a probability that minimizes the expected passes. Analyses show that our exact quantile determination method can terminate in $P$ passes with $1-\delta$ confidence, storing $O(N^{1/P} \log^{\frac{P-1}{2P}}(1/\delta))$ items, close to the lower bound $\Omega(N^{1/P})$ for a fixed $\delta$. The approach has been deployed as a function in Apache IoTDB, an LSM-tree based time-series database. Remarkably, the randomized sketches can be pre-computed for the immutable SSTables in the LSM-tree. Moreover, multiple quantile queries can share the data passes for probabilistic filters in range estimation. Extensive experiments on real and synthetic datasets demonstrate the superiority of our proposal compared to the existing methods with deterministic filters. On average, our method takes 0.48 fewer passes and 18% of the time compared with the state-of-the-art deterministic sketch (GK sketch).
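The multi-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the Apache IoTDB function: a uniform reservoir sample stands in for the KLL sketch, and the names `exact_quantile`, `scan`, `memory`, and `z` are hypothetical. Each pass counts the items below the candidate range and samples the items inside it. The sample then proposes a narrower range that contains the target rank only with high probability (the probabilistic filter). The next pass verifies that range by exact counting and, if the filter failed, moves to the neighbouring side of the last verified range. The sketch assumes `memory` is much larger than `z * z` and that values are mostly distinct; `max_passes` guards against stalls otherwise.

```python
import math
import random


def exact_quantile(scan, n, q, memory, z=4.0, seed=0, max_passes=50):
    """Return (exact q-quantile, passes used) for the n items yielded by scan().

    scan   -- zero-argument callable returning a fresh iterator over the data
    memory -- maximum number of items held at once
    z      -- slack in standard deviations; smaller is more aggressive
              but makes a failed filter (and a repair pass) more likely
    """
    rng = random.Random(seed)
    k = min(n - 1, int(q * n))            # 0-based rank of the target item
    lo, hi = -math.inf, math.inf          # candidate range, closed on both ends
    safe_lo, safe_hi = lo, hi             # last range verified to hold the target
    for passes in range(1, max_passes + 1):
        below, n_in, buf = 0, 0, []
        for x in scan():
            if x < lo:
                below += 1
            elif x <= hi:
                n_in += 1
                if len(buf) < memory:
                    buf.append(x)
                else:                     # reservoir sampling (Algorithm R)
                    j = rng.randrange(n_in)
                    if j < memory:
                        buf[j] = x
        # Exact verification of the range proposed by the previous pass.
        if k < below:                     # filter failed: target is left of lo
            lo, hi = safe_lo, lo
            continue
        if k >= below + n_in:             # filter failed: target is right of hi
            lo, hi = hi, safe_hi
            continue
        safe_lo, safe_hi = lo, hi
        if n_in <= memory:                # buf holds every in-range item
            buf.sort()
            return buf[k - below], passes
        if lo == hi:                      # all in-range items are equal
            return lo, passes
        # Probabilistic filter: the target's position in the sorted sample has
        # standard deviation of at most sqrt(memory) / 2.
        buf.sort()
        pos = (k - below) * memory / n_in
        slack = z * math.sqrt(memory) / 2
        i, j = int(pos - slack), int(pos + slack) + 1
        if i >= 1:                        # otherwise keep the verified bound
            lo = buf[i]
        if j < memory - 1:
            hi = buf[j]
    raise RuntimeError("no result within max_passes")
```

For example, with `data` held in a list, `exact_quantile(lambda: iter(data), len(data), 0.5, memory=500)` returns the exact median together with the number of passes taken. Lowering `z` shrinks the range faster per pass at the cost of more frequent repair passes, which is the trade-off the abstract describes when it selects the probability that minimizes the expected passes.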


Journal: Proceedings of the ACM on Management of Data
Publication Date: Mar 12, 2024
License type: CC BY

