Abstract

An approximate top-k query returns a list of k tuples whose scores are approximately the largest with respect to a user-given query. However, existing algorithms cannot process approximate top-k queries on big data effectively, because they either restrict the class of ranking functions or fail to take selection conditions into account. This paper proposes PSATop-k, a novel algorithm that combines partitioning and sampling techniques to answer approximate range top-k queries efficiently. PSATop-k supports queries with selection conditions and arbitrary ranking functions. It first determines the sample size that meets the accuracy requirement, then draws enough random tuples to return the result while accessing only a subset of the partitioned data. Experimental results on real-life and synthetic datasets demonstrate that PSATop-k performs much better than existing algorithms. Specifically, as the result set size varies from 10 to 80, PSATop-k runs on average 12.22 to 14.30 times faster than TA-based and Coreset-based methods. The speedup ratios are 11.43 to 23.34 and 55.86 to 649.06 in the experiments on the error bound and the number of tuples, respectively.
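
The general idea of answering a range top-k query from a random sample of partitioned data can be illustrated with a minimal Python sketch. This is not the PSATop-k algorithm itself: the function name `approx_range_topk`, its parameters, and the uniform sampling without replacement are all assumptions made for illustration, and the sample size is taken as an input because the abstract does not state how it is derived from the accuracy requirement.

```python
import bisect
import heapq
import itertools
import random


def approx_range_topk(partitions, predicate, score, k, sample_size, seed=0):
    """Hypothetical sketch: top-k over a uniform random sample of tuples.

    partitions  -- list of lists of tuples (the partitioned data)
    predicate   -- selection condition, tuple -> bool
    score       -- arbitrary ranking function, tuple -> number
    sample_size -- assumed to be chosen elsewhere to meet the accuracy target
    """
    rng = random.Random(seed)
    # Cumulative partition sizes map a global row index to (partition, offset).
    bounds = list(itertools.accumulate(len(p) for p in partitions))
    total = bounds[-1] if bounds else 0
    # Uniform sample of global row indices, without replacement.
    picks = rng.sample(range(total), min(sample_size, total))
    sampled = []
    for g in picks:
        p = bisect.bisect_right(bounds, g)          # partition holding row g
        offset = g - (bounds[p - 1] if p else 0)    # position inside it
        sampled.append(partitions[p][offset])       # only touched partitions are read
    # Apply the selection condition, then keep the k best-scoring tuples.
    qualifying = [t for t in sampled if predicate(t)]
    return heapq.nlargest(k, qualifying, key=score)
```

When `sample_size` is at least the total number of tuples, the sketch degenerates to an exact range top-k; with smaller samples the result is only approximate, and how close it is depends on a sample-size analysis that this sketch does not attempt.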
