Abstract

Given a query string and a collection of strings, top-k string similarity search finds the k strings in the collection that are most similar to the query string under edit distance. Most existing work uses a filter-and-verify framework, pruning non-candidates with lower bounds on edit distance. The best current implementations need more than 10 seconds to answer a top-40 query on a real eBay dataset. In this paper, we propose a novel lightweight algorithm that answers top-40 eBay queries in around 350 milliseconds. Unlike existing work, our algorithm returns approximate answers, but we show that it returns more than 95% of the exact top-k results.
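
To make the problem statement concrete, the following is a minimal sketch of exact top-k search by brute force: it computes the edit distance from the query to every string and keeps the k smallest. It is not the algorithm proposed in the paper, nor a filter-and-verify method; it only illustrates the exact answer that such methods compute and that an approximate method is measured against. The function names `edit_distance` and `top_k_exact` and the sample strings are assumptions made for illustration.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def top_k_exact(query: str, collection: list[str], k: int) -> list[tuple[int, str]]:
    """Return the k (distance, string) pairs closest to the query.

    Illustrative baseline only: every string is verified, nothing is pruned.
    Ties in distance are broken by lexicographic order of the strings.
    """
    return heapq.nsmallest(k, ((edit_distance(query, s), s) for s in collection))


if __name__ == "__main__":
    titles = ["apply", "ample", "maple", "banana"]  # hypothetical sample data
    print(top_k_exact("apple", titles, 2))
```

Each distance computation costs time proportional to the product of the two string lengths, and this baseline performs one per string in the collection, which is why pruning or approximation becomes necessary on large collections.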
