Trie-join

Jiannan Wang, Guoliang Li, Jianhua Feng

doi:10.14778/1920841.1920992

Abstract

A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to maintain under dynamic updates of the data sets. To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We index the strings with a trie structure and use it to find similar string pairs efficiently based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on three real data sets with short strings.
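To make the idea of subtrie pruning concrete, the following is a minimal sketch of a self-join under an edit-distance threshold. It is not the paper's trie-join algorithm: it uses the well-known baseline of walking a trie while carrying one row of the edit-distance table per node, and it skips a subtrie as soon as every entry in that row exceeds the threshold. The names `TrieNode`, `insert`, `search_within`, and `self_join` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


class TrieNode:
    """One trie node; `ids` holds the ids of strings that end here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []


def insert(root: TrieNode, s: str, sid: int) -> None:
    """Insert string `s` with identifier `sid` into the trie."""
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.ids.append(sid)


def search_within(root: TrieNode, query: str, tau: int) -> List[int]:
    """Return ids of indexed strings within edit distance `tau` of `query`."""
    matches: List[int] = []
    # Row for the empty prefix: distance from "" to each prefix of `query`.
    first_row = list(range(len(query) + 1))
    if first_row[-1] <= tau:
        matches.extend(root.ids)
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the edit-distance table by one row for character `ch`.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if row[-1] <= tau:
            matches.extend(node.ids)
        # Subtrie pruning: row minima never decrease with depth, so if no
        # entry is within `tau`, no descendant can produce a match.
        if min(row) <= tau:
            stack.extend((child, c, row) for c, child in node.children.items())
    return matches


def self_join(strings: List[str], tau: int) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose edit distance is at most `tau`."""
    root = TrieNode()
    pairs: Set[Tuple[int, int]] = set()
    for sid, s in enumerate(strings):
        # Probe against strings inserted so far, then index the new string.
        for other in search_within(root, s, tau):
            pairs.add((other, sid))
        insert(root, s, sid)
    return pairs
```

For example, `self_join(["vldb", "pvldb", "sigmod", "sigir", "icde"], 1)` returns `{(0, 1)}`, since only "vldb" and "pvldb" are within edit distance 1. Because each string is probed and then inserted, the same loop also handles strings that arrive later, which loosely mirrors the dynamic-update setting mentioned in the abstract.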

Journal: Proceedings of the VLDB Endowment
Publication Date: Sep 1, 2010
Citations: 123

Similar Papers

Trie-join: a trie-based method for efficient string similarity joins
Jianhua Feng ... Guoliang Li
The VLDB Journal | VOL. 21
04 Oct 2011

A partition-based method for string similarity joins with edit-distance constraints
Guoliang Li ... Jianhua Feng
ACM Transactions on Database Systems | VOL. 38
01 Jun 2013

Efficient String Edit Similarity Join Algorithm
Karam Gouda ... Metwally Rashad
Computing and Informatics | VOL. 36
01 Jan 2017

Pass-join
Guoliang Li ... Dong Deng
Proceedings of the VLDB Endowment | VOL. 5
01 Nov 2011
