PSATop-k: Approximate range top-k computation on big data

Hongjie Guo, Jianzhong Li, Hong Gao, Kaiqi Zhang

doi:10.1016/j.knosys.2021.107614

Abstract

An approximate top-k query returns a list of k tuples whose scores are approximately the largest with respect to a user-given query. However, existing algorithms cannot process approximate top-k queries effectively on big data, because they either restrict the class of ranking functions or fail to take selection conditions into account. In this paper, a novel algorithm, PSATop-k, which combines partitioning and sampling techniques, is proposed to answer approximate range top-k queries efficiently. PSATop-k supports queries with selection conditions and arbitrary ranking functions. It first determines the sample size that meets the accuracy requirement, then draws enough random tuples to produce the result while accessing only a subset of the partitioned data. Experimental results on real-life and synthetic datasets demonstrate that PSATop-k performs much better than the existing algorithms. Specifically, as the result set size varies from 10 to 80, PSATop-k runs 12.22 to 14.30 times faster on average than TA-based and Coreset-based methods. In the experiments that vary the error bound and the number of tuples, the speedup ratios are 11.43 to 23.34 and 55.86 to 649.06, respectively.
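The two-step idea described above (fix a sample size, then read only as many partitions as needed to collect that many qualifying tuples) can be illustrated with a minimal Python sketch. This is not the authors' PSATop-k implementation. The names `sample_size` and `approx_range_topk`, the parameters `eps` and `delta`, and the sample-size rule are all hypothetical and chosen only for illustration. The sketch also assumes that tuples were assigned to partitions at random, so that reading whole partitions in random order behaves like drawing random tuples.

```python
import heapq
import math
import random


def sample_size(k, eps, delta):
    """Illustrative sample-size rule (an assumption, not the paper's bound).

    With m >= ln(1/delta) / eps uniform samples, the chance that no sample
    lands in the top eps-fraction is at most delta; the factor k is a crude
    heuristic for wanting k such tuples.
    """
    return math.ceil(k * math.log(1.0 / delta) / eps)


def approx_range_topk(partitions, predicate, score, k,
                      eps=0.05, delta=0.05, seed=0):
    """Return up to k high-scoring tuples that satisfy `predicate`.

    partitions: list of lists of tuples (assumed randomly partitioned).
    predicate:  selection condition, tuple -> bool.
    score:      arbitrary ranking function, tuple -> number.
    """
    rng = random.Random(seed)
    m = sample_size(k, eps, delta)

    # Visit partitions in random order and stop once enough qualifying
    # tuples have been collected, so only a subset of the data is read.
    order = list(range(len(partitions)))
    rng.shuffle(order)

    sample = []
    for idx in order:
        sample.extend(t for t in partitions[idx] if predicate(t))
        if len(sample) >= m:
            break

    # Rank only the sampled tuples with the user-supplied scoring function.
    return heapq.nlargest(k, sample, key=score)


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(i, rng.random()) for i in range(10_000)]
    rng.shuffle(data)
    parts = [data[i::20] for i in range(20)]  # 20 random partitions

    result = approx_range_topk(
        parts,
        predicate=lambda t: 1_000 <= t[0] < 9_000,  # range selection
        score=lambda t: t[1],                       # ranking function
        k=10,
    )
    print(result)
```

Because ranking is applied only to the sampled tuples, the ranking function can be any callable, and the selection condition is checked before a tuple enters the sample. If the requested sample size exceeds the number of qualifying tuples, the sketch reads every partition and returns the exact answer.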

Full Text