Parallelizing filter-and-verification based exact set similarity joins on multicores

Fabian Fier, Johann-Christoph Freytag

doi:10.1016/j.is.2021.101912

Abstract

Set similarity join (SSJ) is a well-studied problem, and many algorithms have been proposed to speed it up. However, its scalability and performance are rarely discussed for modern multicore environments. Existing algorithms either assume single-threaded execution, which leaves the abundant parallelism of modern machines unused, or rely on distributed setups, which may not yield efficient runtimes or speedups proportional to the amount of hardware resources (e.g., CPU cores). In this paper, we focus on a widely used family of SSJ algorithms based on the filter-and-verification paradigm and study the potential for speeding them up on multicore machines. We adapt state-of-the-art SSJ algorithms, including PPJoin and AllPairs. Our experiments on 12 real-world datasets highlight three findings: (1) using exactly as many threads as the hardware provides hyperthreads leads to optimal runtimes in most experiments, (2) hand-crafted data structures do not always lead to better performance, and (3) PPJoin’s position filter is more effective in multithreaded execution than in single-threaded execution.
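
To make the filter-and-verification paradigm concrete, the following is a minimal single-threaded sketch in Python. It is not the authors' implementation. It assumes Jaccard similarity, string tokens, and a self-join over one collection, and it applies an AllPairs-style prefix filter together with a length filter before verifying each candidate pair. The names `jaccard` and `prefix_filter_join` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter, defaultdict

EPS = 1e-9  # guards against floating-point error in threshold * size


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def prefix_filter_join(records, threshold):
    """Self-join: return sorted index pairs (i, j), i < j, with Jaccard >= threshold.

    Assumes 0 < threshold <= 1 and that tokens are mutually comparable.
    """
    # Global token order: rare tokens first, ties broken by the token itself.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    # Process records by ascending size, so indexed records are never larger.
    by_size = sorted(range(len(ordered)), key=lambda i: len(ordered[i]))

    index = defaultdict(list)  # token -> ids of records with that token in their prefix
    result = []
    for i in by_size:
        rec = ordered[i]
        if not rec:
            continue  # empty sets are skipped in this sketch
        # Two sets meeting the threshold must share a token within these prefixes.
        prefix_len = len(rec) - math.ceil(threshold * len(rec) - EPS) + 1
        min_size = threshold * len(rec) - EPS  # length filter bound

        # Filter phase: collect candidates from the inverted index.
        candidates = set()
        for tok in rec[:prefix_len]:
            for j in index[tok]:
                if len(ordered[j]) >= min_size:
                    candidates.add(j)

        # Verification phase: compute the exact similarity of each candidate.
        rec_set = set(rec)
        for j in candidates:
            if jaccard(rec_set, set(ordered[j])) >= threshold:
                result.append((min(i, j), max(i, j)))

        # Index the current record's prefix for later (larger) records.
        for tok in rec[:prefix_len]:
            index[tok].append(i)
    return sorted(result)
```

The sketch omits PPJoin's position filter and any multithreading. One plausible way to parallelize such a loop is to partition the probing records across threads, but that is an assumption here and not a description of the method evaluated in the paper.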

Full Text