Para-Join: an efficient parallel method for string similarity join

Cairong Yan, Wenjing Guo, Jian Wang, Bin Zhu

doi:10.1504/ijhpcn.2016.10005002

Abstract

In the area of big data, a significant challenge for string similarity join is finding all similar pairs efficiently. In this paper, we propose an efficient parallel method called Para-Join, which first splits the input into small sets according to the joint-frequency vector and the interval-vector of each string, and then joins the pairs for each small set in parallel. The Para-RR and Para-RS algorithms extend the partition-based algorithm and use multi-threading to implement the string similarity join within each set and between two different sets, respectively. We prove that Para-Join both avoids duplicate computation and ensures the completeness of the result. We also put forward an effective pruning strategy to improve performance. Experimental results show that our method achieves high efficiency and significantly outperforms state-of-the-art approaches.
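The split-then-join structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the joint-frequency vector or the interval-vector, so this sketch substitutes a plain length-based grouping with an edit-distance threshold `tau`, and every function name here is hypothetical. It only shows how joins within a set and between two sets can be scheduled as independent tasks so that each candidate pair is examined exactly once.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, product


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def group_by_length(strings):
    """Assumed splitting step: one group per string length.
    (Stands in for the paper's vector-based split, which is not specified here.)"""
    groups = {}
    for s in sorted(set(strings)):
        groups.setdefault(len(s), []).append(s)
    return groups


def join_within(group, tau):
    """Self-join inside one group (the role the abstract assigns to Para-RR)."""
    return [(a, b) for a, b in combinations(group, 2) if edit_distance(a, b) <= tau]


def join_between(left, right, tau):
    """Join across two different groups (the role the abstract assigns to Para-RS)."""
    return [tuple(sorted((a, b))) for a, b in product(left, right)
            if edit_distance(a, b) <= tau]


def similarity_self_join(strings, tau, workers=4):
    groups = group_by_length(strings)
    lengths = sorted(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per group for the within-group joins.
        futures = [pool.submit(join_within, groups[n], tau) for n in lengths]
        # Pruning by length: strings whose lengths differ by more than tau
        # cannot be within edit distance tau, so those group pairs are skipped.
        for i, n in enumerate(lengths):
            for m in lengths[i + 1:]:
                if m - n > tau:
                    break
                futures.append(pool.submit(join_between, groups[n], groups[m], tau))
        pairs = [p for f in futures for p in f.result()]
    return sorted(pairs)


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "apple", "apply", "banana"]
    for pair in similarity_self_join(words, tau=1):
        print(pair)
```

Because every unordered pair belongs to exactly one task (same-length pairs to a within-group task, different-length pairs to one between-group task), the sketch produces no duplicate pairs and misses none that satisfy the threshold. In CPython, threads mainly illustrate the task structure; CPU-bound speedups would likely need a process pool or a different runtime.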

More From: International Journal of High Performance Computing and Networking

Similar Papers

Parallel Black Hole Clustering Based on MapReduce
Chun-Wei Tsai ... Ming-Chao Chiang
01 Oct 2015

Efficient string similarity join in multi-core and distributed systems
Cairong Yan ... Qinglong Zhang
PLOS ONE | VOL. 12
09 Mar 2017

A Fast Spectral Clustering Method Based on Growing Vector Quantization for Large Data Sets
Xiujun Wang ... Baohua Zhao
01 Jan 2013

Performance Modeling for High-Order Finite Difference Methods on the Connection Machine CM-2
Yu-Chung Chang ... Tony F Chan
The International Journal of Supercomputer Applications and High Performance Computing | VOL. 9
01 Mar 1995
