C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join

Hangyu Li, Hong Xu, Foryu Ha, Sarana Nutanong, Chenyun Yu

doi:10.1109/tkde.2018.2836464

Abstract

Similarity join of two datasets $P$ and $Q$ is a primitive operation that is useful in many application domains. The operation involves identifying pairs $(p,q)$ in the Cartesian product of $P$ and $Q$ such that $(p,q)$ satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution while reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost estimation frequently turns into a bottleneck in a distributed processing environment, making it challenging to achieve a faster and more efficient similarity join. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) a minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for runtime scheduling of collision counting tasks. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve a 20 percent data reduction and a 50 percent reduction in shuffle time.
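For readers unfamiliar with the underlying primitive, the following is a minimal single-machine sketch of a collision counting LSH similarity join. It is not the C2Net method and involves no MapReduce, combiners, or graph partitioning. The function names, the random-projection hash family, and the parameters `m` (number of hash functions), `w` (bucket width), and `threshold` (minimum collision count) are all illustrative assumptions, not taken from the paper.

```python
import math
import random
from collections import defaultdict


def make_hashes(dim, m, w, seed=0):
    """Draw m random-projection hash functions h(x) = floor((a.x + b) / w).

    Hypothetical helper: each hash is a Gaussian vector a and an offset b.
    """
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(m)
    ]


def bucket(point, a, b, w):
    """Return the bucket id of a point under one hash function."""
    projection = sum(x * y for x, y in zip(a, point))
    return math.floor((projection + b) / w)


def collision_counting_join(P, Q, m=20, w=4.0, threshold=15, seed=0):
    """Return index pairs (i, j) whose points share a bucket in at least
    `threshold` of the m hash functions (illustrative parameters)."""
    hashes = make_hashes(len(P[0]), m, w, seed)
    counts = defaultdict(int)
    for a, b in hashes:
        # Index Q by bucket for this hash function.
        table = defaultdict(list)
        for j, q in enumerate(Q):
            table[bucket(q, a, b, w)].append(j)
        # Every (p, q) pair sharing a bucket counts as one collision.
        for i, p in enumerate(P):
            for j in table.get(bucket(p, a, b, w), ()):
                counts[(i, j)] += 1
    return sorted(pair for pair, c in counts.items() if c >= threshold)


if __name__ == "__main__":
    P = [(0.0, 0.0), (100.0, 100.0)]
    Q = [(0.0, 0.0), (500.0, -300.0)]
    print(collision_counting_join(P, Q))
```

In this sketch, identical points collide under every hash function, so they always pass the threshold, while distant points rarely share a bucket. In a distributed setting the per-hash counting loop is the part that generates shuffle traffic, which is the cost the abstract says C2Net targets.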

Full Text