Abstract

String similarity join, as an essential operation in applications including data integration and data cleaning, has attracted significant attention in the research community. Previous studies focus on global similarity join. In this paper, we study local similarity join with edit distance constraints, which finds string pairs from two string collections that have similar substrings. We study two kinds of local similarity join problems: checking local similar pairs and locating local similar pairs. We first consider the case where if two strings are locally similar to each other, they must share a common gram of a certain length. We show how to do efficient local similarity verification based on a matching gram pair. We propose two pruning techniques and an incremental method to further improve the efficiency of finding matching gram pairs. Then, we devise a method to locate the longest similar substring pair for two local similar strings. We conducted a comprehensive experimental study to evaluate the efficiency of these techniques.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call