Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters

Leonardo Andrade Ribeiro, Felipe Ferreira Borges, Diego Oliveira

doi:10.5753/jidm.2021.1969

Open Access

https://doi.org/10.5753/jidm.2021.1969

Abstract

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and thus miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple yet effective data structure, for which we evaluate exact and probabilistic implementations. In this context, we devise a cost model to identify the attribute ordering that minimizes processing time. We also investigate alternative approaches and introduce a new algorithm combining key ideas from previous work. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
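To make the problem statement concrete, the following is a minimal Python sketch of a set similarity join over records with several set-valued attributes. It is not the framework or the data structure proposed in the article: it assumes Jaccard similarity with one threshold per attribute, uses a naive pairwise loop instead of an index, and stands in a simple size check for the "lightweight filter". All function names and the sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_filter(a: set, b: set, threshold: float) -> bool:
    """Cheap necessary condition: Jaccard(a, b) >= t implies min size >= t * max size."""
    small, large = sorted((len(a), len(b)))
    return large == 0 or small >= threshold * large


def multi_attribute_join(records, thresholds):
    """Return index pairs (i, j), i < j, where every attribute meets its threshold.

    records: list of tuples of sets, one set per attribute (assumed layout).
    thresholds: one Jaccard threshold per attribute.
    """
    result = []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        # Filtering pass: drop the pair as soon as any attribute fails the size check.
        if not all(size_filter(x, y, t) for x, y, t in zip(r, s, thresholds)):
            continue
        # Verification pass: compute the actual similarity for each attribute.
        if all(jaccard(x, y) >= t for x, y, t in zip(r, s, thresholds)):
            result.append((i, j))
    return result


# Hypothetical records: (title tokens, author tokens).
records = [
    ({"efficient", "set", "similarity", "join"}, {"ribeiro", "borges"}),
    ({"efficient", "set", "similarity", "joins"}, {"ribeiro", "borges"}),
    ({"graph", "pattern", "matching"}, {"ribeiro", "borges"}),
]
pairs = multi_attribute_join(records, thresholds=(0.5, 0.9))
```

The point of the sketch is only that a predicate on a second attribute gives an extra chance to discard a candidate pair before any full similarity computation. How the article's filter is built, and in which order attributes are checked, is decided by its own data structure and cost model, which this sketch does not reproduce.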

Full Text