Abstract

Set similarity join (SSJoin) is an important operation that finds similar pairs of sets in a given database, and it plays a core role in data integration, data cleaning, and data mining. In contrast to traditional SSJoin methods, progressive SSJoin targets large datasets and aims to find as many similar pairs as possible within a limited running time, reporting partial matching results as early as possible during processing. Moreover, recent research has shown that GPUs (Graphics Processing Units) can accelerate similarity operations. This paper explores progressive SSJoin algorithms and their acceleration with GPUs. We propose two progressive SSJoin methods, PSSJM and PBM. PSSJM uses an inverted index, while PBM combines a counting Bloom filter with prefix filtering. In addition, we propose a GPU-based algorithm, built on our progressive method, to accelerate the computation. Comprehensive experiments on real-world datasets show that our methods produce better-quality results than the traditional method under limited time, and that the GPU implementation achieves high speedups over the CPU-based method.
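
For readers unfamiliar with the building blocks named above, the following is a minimal sketch of a conventional (non-progressive) self-join that combines prefix filtering with an inverted index under a Jaccard threshold. It is not PSSJM or PBM, whose details are not given in this abstract. The function names `jaccard` and `prefix_filter_join`, the choice of Jaccard similarity, the rare-token-first ordering, and the assumption that tokens are mutually comparable (for example, strings) are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def prefix_filter_join(records, threshold):
    """Return (i, j, sim) with i < j for all pairs whose Jaccard >= threshold.

    Hypothetical helper for illustration only: a generic prefix-filter
    self-join, not the progressive methods described in the abstract.
    """
    sets = [set(r) for r in records]
    freq = Counter(tok for s in sets for tok in s)
    # One global token order (rare tokens first, ties broken by token value).
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    index = defaultdict(list)  # inverted index: prefix token -> record ids
    results = []
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue
        # Two sets with Jaccard >= threshold must share a token in their
        # prefixes of this length. The epsilon guards against float error
        # that would otherwise shorten the prefix and drop valid pairs.
        min_overlap = ceil(threshold * len(tokens) - 1e-9)
        prefix = tokens[: len(tokens) - min_overlap + 1]

        # Candidate generation: earlier records sharing a prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification: compute the exact similarity for each candidate.
        for cid in candidates:
            sim = jaccard(sets[cid], sets[rid])
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index the current record's prefix for later probes.
        for tok in prefix:
            index[tok].append(rid)
    return results
```

Because each verified pair is appended as soon as it is found, a caller could consume results incrementally, but this sketch makes no attempt to prioritize likely matches early, which is the goal of a progressive method.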
