Abstract

Similarity join is an essential operation in big data analytics, with applications such as data integration and data cleaning. In this paper, we propose a new algorithm, called QSJoin, that supports efficient string similarity joins by reducing the shuffle cost and transmission cost in MapReduce. Our algorithm employs a filter-verify framework. In the filtering phase, a new signature scheme based on q-samples is adopted to decrease the number of generated signatures, and a large number of dissimilar pairs are then discarded with the Standard-Match filter. In the verification phase, a multi-vector filter scheme is adopted to eliminate more dissimilar pairs using statistical features, and the final true pairs are then extracted by verifying the remaining candidate pairs with a length-aware verification method. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches.
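
To make the filter-verify idea concrete, the following is a minimal single-machine sketch and not the QSJoin algorithm itself. It assumes edit distance as the similarity measure with a threshold `tau` (the abstract does not name the measure), and it replaces the q-sample signatures, Standard-Match filter, and multi-vector filter with a plain length filter. The function names `edit_distance_within` and `similarity_join` are hypothetical.

```python
from itertools import combinations


def edit_distance_within(a: str, b: str, tau: int) -> bool:
    """Return True if the edit distance between a and b is at most tau.

    Assumption: edit distance is the similarity measure; the abstract
    does not specify one.
    """
    # Length filter: the edit distance is at least the length difference.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            )
        # Early exit: the row minimum never decreases in later rows,
        # so the final distance must also exceed tau.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def similarity_join(strings, tau):
    """Naive filter-verify self-join (hypothetical helper, not QSJoin)."""
    result = []
    for a, b in combinations(strings, 2):
        # Filter step: cheaply discard pairs that cannot be similar.
        if abs(len(a) - len(b)) > tau:
            continue
        # Verify step: exact check on the surviving candidate pair.
        if edit_distance_within(a, b, tau):
            result.append((a, b))
    return result
```

A call such as `similarity_join(["kitten", "sitten", "sitting", "apple"], 1)` enumerates all pairs, which is quadratic; signature-based schemes such as the one described above exist precisely to avoid generating most of those pairs in the first place.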
