MinIL: A Simple and Small Index for String Similarity Search with Edit Distance

Zhong Yang, Guohui Li, Xianzhi Wang, Xiaofang Zhou, Bolong Zheng

doi:10.1109/icde53745.2022.00047

Abstract

String similarity search is a core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. We study the problem of threshold similarity search with the edit distance: given a set of strings, a threshold $k$, and a query string $q$, we aim to find all strings in the set whose edit distance to $q$ is no larger than $k$. Many methods have been proposed for this problem. However, they consume a large amount of space while achieving only acceptable efficiency, especially for long strings. In this paper, we propose a simple yet small index, called minIL, to eliminate this issue. First, we adopt a minhash family to capture pivot characters and to construct sketch representations of strings. Second, we develop a multi-level inverted index to search the sketches with low space consumption. Finally, we apply a novel learned index technique on top of the index to further improve query efficiency. Extensive experiments on real-world datasets offer insight into the performance of our method and show that it substantially reduces the index size and is capable of outperforming the baseline approaches.
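
To make the problem statement concrete, the following is a minimal sketch of threshold similarity search done by brute force. It is not the minIL index and uses no minhash sketches, inverted index, or learned index; it only illustrates what a correct answer set looks like for a given $q$ and $k$. The function names `edit_distance` and `threshold_search` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as the shorter string
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters turns a[:i] into the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def threshold_search(strings, q, k):
    """Return every string whose edit distance to q is at most k (linear scan)."""
    results = []
    for s in strings:
        # The length difference is a lower bound on the edit distance,
        # so strings that differ in length by more than k cannot qualify.
        if abs(len(s) - len(q)) > k:
            continue
        if edit_distance(s, q) <= k:
            results.append(s)
    return results


if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "bitten", "kit"]
    print(threshold_search(data, "kitten", 1))
```

A linear scan like this needs no index space but computes a quadratic-time distance for nearly every string, which is the cost that index-based methods aim to avoid.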

Full Text