Hashed samples

Marios Hadjieleftheriou, Divesh Srivastava, Nick Koudas, Xiaohui Yu

doi:10.14778/1453856.1453883

Abstract

We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work, we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed: uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (in both CPU and I/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically, and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
