SETJoin: a novel top-k similarity join algorithm

Hongya Wang,Lihong Yang,Yingyuan Xiao

doi:10.1007/s00500-020-04807-w

Abstract

As an important operation in data cleaning, near duplicate Web pages detection and data mining, similarity joins have received much attention recently. Existing similarity joins fall into two broad categories—the similarity-threshold-based similarity join and top-k similarity join (TopkJoin). Compared with the traditional one, TopkJoin is more suitable for cases where the similarity threshold is unknown before hand. In this paper, we focus on the performance optimization problem of TopkJoin. Particularly, we observed that the state-of-the-art TopkJoin algorithm has three serious performance issues, i.e., the inappropriate application of hash table, inefficient use of suffix filtering and unnecessary evaluation of excessive unqualified candidates. To resolve these problems, we proposed a novel algorithm, SETJoin, by combining the existing event-driven framework with three simple yet efficient optimization techniques, viz., (1) reducing the cost in hashing by rearranging the orders of the candidate filtering and hash table lookup operations; (2) maximizing the pruning capability of suffix filtering by judiciously choosing the (near) optimal recursion depth; and (3) terminating join operations earlier by setting a much tighter stop condition for iteration. The experimental results show that SETJoin achieves up to 1.26x–3.49x speedup over the state-of-the-art algorithm on several real datasets.

Full Text