An empirical evaluation of set similarity join techniques

Willi Mann,Nikolaus Augsten,Panagiotis Bouros

doi:10.14778/2947618.2947620

Willi Mann, Nikolaus Augsten + Show 1 more

Open Access

https://doi.org/10.14778/2947618.2947620

Copy DOI

Abstract

Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this techniques is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.

Full Text