Abstract
Similarity join of two datasets $P$ and $Q$ is a primitive operation that is useful in many application domains. The operation identifies the pairs $(p,q)$ in the Cartesian product of $P$ and $Q$ that satisfy a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution, reducing the processing cost with a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large, high-dimensional datasets. However, network cost frequently becomes a bottleneck in a distributed processing environment, which makes a fast and efficient similarity join challenging. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net that improves the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) a minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for scheduling collision counting tasks at runtime. Experiments show that, compared with the state of the art, the proposed solution achieves a 20 percent reduction in data and a 50 percent reduction in shuffle time.
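To make the collision counting idea concrete, the following is a minimal single-machine sketch, not the C2Net method itself: it omits MapReduce, bucket replication, and task scheduling entirely. It assumes Euclidean distance and random-projection hashes of the form floor((a·v + b)/w); the function names and the parameters `m`, `w`, and `threshold` are hypothetical choices for illustration, not values taken from the paper.

```python
import math
import random
from collections import defaultdict
from itertools import product


def make_hashes(dim, m, w, seed=0):
    """Draw m random-projection hash functions (a, b); an assumed LSH family."""
    rng = random.Random(seed)
    return [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, w))
            for _ in range(m)]


def bucket_id(v, a, b, w):
    """Bucket of vector v under one hash: floor((a . v + b) / w)."""
    return math.floor((sum(x * y for x, y in zip(a, v)) + b) / w)


def similarity_join(P, Q, radius, m=20, w=4.0, threshold=10, seed=0):
    """Return index pairs (i, j) with dist(P[i], Q[j]) <= radius.

    A pair becomes a candidate only if it collides in at least
    `threshold` of the m hash functions (collision counting).
    Candidates are then verified with the exact distance, so the
    result has no false positives but may miss some true pairs.
    """
    hashes = make_hashes(len(P[0]), m, w, seed)
    collisions = defaultdict(int)
    for a, b in hashes:
        # Group points of P and Q that share a bucket under this hash.
        buckets = defaultdict(lambda: ([], []))
        for i, p in enumerate(P):
            buckets[bucket_id(p, a, b, w)][0].append(i)
        for j, q in enumerate(Q):
            buckets[bucket_id(q, a, b, w)][1].append(j)
        # Every (p, q) pair in the same bucket counts as one collision.
        for ps, qs in buckets.values():
            for i, j in product(ps, qs):
                collisions[(i, j)] += 1
    return sorted(
        (i, j)
        for (i, j), count in collisions.items()
        if count >= threshold and math.dist(P[i], Q[j]) <= radius
    )
```

Calling `similarity_join(P, Q, radius=1.0)` on two lists of equal-length tuples returns the verified pairs. In a distributed setting, the per-bucket collision counts are the intermediate data that would cross the network, which is the cost the abstract says the proposed solution reduces.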