Abstract

Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research on improving its performance has mostly focused on sequential, CPU-based algorithms, whose main optimizations exploit high threshold values and the underlying data characteristics to derive efficient filters. In this paper, we investigate strategies to accelerate set similarity join using Graphics Processing Units (GPUs). Our approach exploits massive parallelism instead of filtering and, as a result, is much more robust to variations in threshold values and data distributions. Experimental evaluation shows speedups of up to 57x over highly optimized CPU-based algorithms.
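
To make the operation concrete, the following is a minimal sketch of a set similarity join in its simplest all-pairs form. It is not the GPU algorithm investigated in the paper. The abstract does not name a similarity function, so Jaccard similarity is an assumption here, and the names `jaccard` and `naive_similarity_join` are hypothetical. Each pair is evaluated independently of every other pair, which is the kind of work that parallel hardware can distribute across many threads.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure; 0.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_similarity_join(records, threshold):
    """Return index pairs (i, j), i < j, whose Jaccard similarity is >= threshold.

    This is the unfiltered all-pairs baseline: every pair is compared,
    so the cost is quadratic in the number of records.
    """
    sets = [frozenset(r) for r in records]
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    ]


# Illustrative token sets (made up for this example).
records = [
    ["gpu", "set", "similarity", "join"],
    ["set", "similarity", "join", "cpu"],
    ["text", "data", "cleaning"],
]
pairs = naive_similarity_join(records, 0.5)
```

Filter-based CPU algorithms avoid most of these comparisons by pruning pairs that cannot reach the threshold, which is why their benefit depends on the threshold value and the data distribution.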
