Abstract

Set similarity join (SSJoin) is an important operation for finding pairs of similar sets in a given database, and it plays a core role in data integration, data cleaning, and data mining. Unlike traditional SSJoin methods, progressive SSJoin targets large datasets and aims to find as many similar pairs as possible within a limited running time. It reports partial matching pairs as early as possible during processing rather than only at the end. Moreover, many recent studies have shown that GPUs (Graphics Processing Units) can accelerate similarity join operations. This paper explores progressive SSJoin algorithms and their acceleration on a CPU-GPU architecture. We propose two progressive SSJoin methods, PSSJM and PBM. PSSJM uses inverted indexing, while PBM combines a counting Bloom filter with prefix filtering. In addition, we propose a GPU-based algorithm built on our progressive SSJoin method to accelerate processing. Comprehensive experiments on real-world datasets show that our methods produce higher-quality results than the traditional method under a limited time budget, and that the CPU-GPU implementation achieves high speedups over the CPU-based method.
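
To make the building blocks named above concrete, the following is a minimal sketch of a set similarity join that combines prefix filtering with an inverted index and emits each verified pair as soon as it is found. It is not PSSJM or PBM, whose details the abstract does not give. It assumes Jaccard similarity, and the function names `jaccard` and `progressive_ssjoin` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def progressive_ssjoin(records, threshold):
    """Yield (i, j, similarity) with i < j as soon as each pair is verified.

    Illustrative sketch only: Jaccard similarity is an assumption, and this
    is generic prefix filtering over an inverted index, not the paper's method.
    """
    sets = [set(r) for r in records]
    # Global token order: rare tokens first, so prefixes are selective.
    freq = Counter(tok for s in sets for tok in s)
    index = defaultdict(list)  # token -> ids of records whose prefix holds it

    for i, s in enumerate(sets):
        if not s:
            continue
        ordered = sorted(s, key=lambda t: (freq[t], t))
        # Two sets with Jaccard >= threshold must share a prefix token.
        # The small epsilon guards against float round-up inside ceil.
        required = math.ceil(threshold * len(ordered) - 1e-9)
        prefix_len = len(ordered) - required + 1

        candidates = set()
        for tok in ordered[:prefix_len]:
            candidates.update(index[tok])  # earlier records sharing this token
            index[tok].append(i)

        # Verify candidates and report matches immediately.
        for j in sorted(candidates):
            sim = jaccard(sets[j], s)
            if sim >= threshold:
                yield (j, i, sim)


if __name__ == "__main__":
    data = [["a", "b", "c", "d"], ["a", "b", "c", "e"],
            ["x", "y", "z"], ["a", "b", "c", "d"]]
    for pair in progressive_ssjoin(data, 0.5):
        print(pair)
```

Because the function is a generator, a caller with a fixed time budget can stop consuming results at any point and keep the pairs found so far, which is the behavior that distinguishes a progressive join from a batch join.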
