Accelerating Progressive Set Similarity Join with the CPU-GPU Architecture

Lining Yu, Derong Shen, Tiezheng Nie, Yue Kou

doi:10.1016/j.bdr.2021.100267

Abstract

Set similarity join (SSJoin) is an important operation for finding pairs of similar sets in a given database, and it plays a core role in data integration, data cleaning, and data mining. Unlike traditional SSJoin methods, progressive SSJoin targets large datasets and aims to find as many similar pairs as possible within a limited running time, reporting partial matching pairs as early as possible during processing. Moreover, many recent studies have shown that GPUs (Graphics Processing Units) can accelerate similarity join operations. This paper explores progressive SSJoin algorithms and accelerates them with the CPU-GPU architecture. We propose two progressive SSJoin methods, PSSJM and PBM. PSSJM uses inverted indexing, while PBM relies on the counting Bloom filter and prefix filtering techniques. In addition, we propose a GPU-based algorithm built on our progressive SSJoin method to accelerate processing. Comprehensive experiments with real-world datasets show that our methods generate better-quality results than the traditional method under limited time, and that the method implemented on the CPU-GPU architecture achieves high speedups over the CPU-based method.
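For readers unfamiliar with the building blocks named above, the following is a minimal sketch of a generic prefix-filtered set similarity join over an inverted index, using Jaccard similarity. It is not PSSJM or PBM, is neither progressive nor GPU-accelerated, and the function names `jaccard` and `prefix_filter_join` are hypothetical names chosen for illustration. It only shows how prefix filtering and an inverted index together avoid comparing every pair of sets.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return (i, j, sim) for all record pairs with Jaccard >= threshold.

    Illustrative sketch only: a generic prefix-filter join,
    not the PSSJM or PBM algorithm from the paper.
    """
    # Global token order: rarest tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records with that token in their prefix
    results = []
    for rid, rec in enumerate(ordered):
        # Two sets with Jaccard >= threshold must share a token within their
        # first |r| - ceil(threshold * |r|) + 1 tokens under the global order.
        # The small epsilon guards against floating-point round-up.
        prefix_len = len(rec) - ceil(threshold * len(rec) - 1e-9) + 1
        candidates = set()
        for tok in rec[:prefix_len]:
            candidates.update(index[tok])  # probe earlier records
            index[tok].append(rid)         # then index this record
        # Verify each candidate with the exact similarity.
        for cid in sorted(candidates):
            sim = jaccard(rec, ordered[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))
    return results
```

Calling `prefix_filter_join(records, 0.5)` on a list of token lists returns the qualifying pairs by record position. A progressive variant would additionally order the work so that the most promising candidates are verified and reported first.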

Full Text