String similarity join with different similarity thresholds based on novel indexing techniques

Chuitian Rong,Yasin N Silva,Chunqing Li

doi:10.1007/s11704-016-5231-1

Abstract

String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity using a similarity function. String pairs whose similarity exceeds a given threshold are returned as results. Current approaches to the similarity join problem use a single threshold value. Several scenarios, however, require support for multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs can typically tolerate many more typos than shorter ones. We therefore propose a solution for string similarity joins that supports different similarity thresholds in a single operator. To support different thresholds, we devise two novel indexing techniques: partition-based indexing and similarity-aware indexing. To exploit the new indices and improve join performance, we propose new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
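To make the problem statement concrete, the following is a minimal sketch of a similarity join in which the threshold depends on string length. It is a brute-force baseline, not the partition-based or similarity-aware indexing proposed in the paper. Several things here are assumptions made for illustration: edit distance is used as the (dis)similarity measure, so a pair qualifies when its distance is at most the threshold; the rule of one allowed edit per five characters in `length_threshold` is hypothetical; and the stricter of the two strings' thresholds is applied to each pair.

```python
from typing import Callable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_threshold(s: str) -> int:
    """Hypothetical rule: allow one edit per five characters."""
    return len(s) // 5


def similarity_join(
    left: List[str],
    right: List[str],
    threshold: Callable[[str], int] = length_threshold,
) -> List[Tuple[str, str]]:
    """Brute-force join: compare every pair under a per-pair threshold."""
    results = []
    for r in left:
        for s in right:
            # Assumption: use the stricter of the two per-string thresholds.
            tau = min(threshold(r), threshold(s))
            # Length filter: strings whose lengths differ by more than tau
            # cannot be within edit distance tau.
            if abs(len(r) - len(s)) > tau:
                continue
            if edit_distance(r, s) <= tau:
                results.append((r, s))
    return results


if __name__ == "__main__":
    left = ["data", "similarity join", "threshold"]
    right = ["date", "similarity joins", "treshold"]
    for pair in similarity_join(left, right):
        print(pair)
```

Under this hypothetical rule, the short pair "data"/"date" is rejected because four-character strings tolerate no edits, while the longer pairs are accepted with a single typo. A single global threshold could not express both decisions at once.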


Journal: Frontiers of Computer Science
Publication Date: Oct 11, 2016
Citations: 3

Similar Papers

String Similarity Join with Different Thresholds
Chuitian Rong ... Xiangling Zhang
01 Jan 2015

Efficient and Scalable Processing of String Similarity Join
Chuitian Rong ... Wei Lu
IEEE Transactions on Knowledge and Data Engineering | VOL. 25
01 Oct 2013

Hash$$^{ed}$$-Join: Approximate String Similarity Join with Hashing
Peisen Yuan ... Chaofeng Sha
01 Jan 2014

Approximate String Similarity Join using Hashing Techniques under Edit Distance Constraints
Peisen Yuan ... Haoyun Wang
Journal of Software | VOL. 9
10 Jan 2014
