QSJoin: a new string similarity join method based on Q-sample and statistical features

Xiaoxia Wang, Bo Wu, Decai Sun, Puzhao Ji

doi:10.1504/ijart.2019.100429

Abstract

Similarity join is an essential operation in big data analytics tasks such as data integration and data cleaning. In this paper, we propose a new algorithm, called QSJoin, that supports efficient string similarity joins by reducing the shuffle and transmission costs in MapReduce. Our algorithm employs a filter-verify framework. In the filtration phase, a new signature scheme based on q-samples is adopted to decrease the number of generated signatures, and a large number of dissimilar pairs are then discarded with the Standard-Match filter. In the verification phase, a multi-vector filter scheme uses statistical features to eliminate more dissimilar pairs, and the final true pairs are then extracted by verifying the remaining candidate pairs with a length-aware verification method. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches.
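The filter-verify idea described above can be illustrated with a minimal single-machine sketch. It is not the QSJoin implementation: the abstract does not specify the Standard-Match filter, the multi-vector filter, or the MapReduce partitioning, so none of them is reproduced here. The sketch assumes an edit-distance threshold `tau`, uses the classical q-sample count bound (each edit can destroy at most one non-overlapping q-gram, so two strings within distance `tau` must share at least `m - tau` of the `m` samples), and verifies candidates with a plain dynamic-programming edit distance with early termination. All function names and the default parameters are hypothetical.

```python
from itertools import combinations


def q_samples(s, q):
    """Non-overlapping q-grams of s; a trailing fragment shorter than q is dropped."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def q_grams(s, q):
    """Set of all overlapping q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def passes_filter(s, t, q, tau):
    """Cheap necessary conditions for edit_distance(s, t) <= tau."""
    # Length filter: each edit changes the length by at most one.
    if abs(len(s) - len(t)) > tau:
        return False
    # Q-sample count filter: at most tau samples of s can be destroyed,
    # so the rest must occur in t as q-grams.
    samples = q_samples(s, q)
    grams = q_grams(t, q)
    matched = sum(1 for sample in samples if sample in grams)
    return matched >= len(samples) - tau


def edit_distance_within(s, t, tau):
    """Return True if the edit distance between s and t is at most tau."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # match or substitute
        # Row minima never decrease, so the threshold cannot be met any more.
        if min(cur) > tau:
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_join(strings, q=2, tau=1):
    """Self-join: index pairs (i, j), i < j, whose strings are within tau edits."""
    result = []
    for (i, s), (j, t) in combinations(enumerate(strings), 2):
        if passes_filter(s, t, q, tau) and edit_distance_within(s, t, tau):
            result.append((i, j))
    return result


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "banana"]
    print(similarity_join(words, q=2, tau=1))  # [(0, 1), (0, 3), (1, 3)]
```

In this toy run the pair ("kitten", "sitting") is rejected by the q-sample filter alone, while ("sitten", "sitting") survives filtration and is discarded only during verification, which is the division of labour the filter-verify framework relies on. The nested loop over all pairs is for illustration only; a distributed join would instead group strings by shared signatures.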
