Abstract

Many applications need to compute the similarity between two records, where each record is a set of elements. Supporting such similarity queries is critical to applications such as document search, data cleaning, error correction, plagiarism detection, and searching molecule databases. In practice, when we consider billions of records, each composed of weighted elements, the weighting information can hardly fit in memory, so existing in-memory query methods cannot be applied directly. Efficiently driving the filter-and-verify framework over a massive number of records while fully accounting for token weights is challenging, especially when external storage is involved. We address the practical problem of searching similar records efficiently using weighted tokens, accelerating existing methods particularly in IO-intensive environments. We present a novel inverted index and show how to probe segmented lengths to skip dissimilar records efficiently. By introducing a certain number of suffix tokens in the record entry on each inverted list, we propose an in-place strategy that improves the pruning power significantly, and we show, both theoretically and empirically, how many in-place tokens suffice to eliminate the cost of verification. The methods are extended to support different similarity metrics. Experiments on large-scale datasets with HDD and SSD show that our in-place structure offers 3X∼9X better performance than existing indexes, and the pruning schemes improve its performance by a further factor of 5X∼27X. In total, the proposed method is 2X∼20X more efficient than the length-based method at the cost of 2X space overhead.
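To make the problem setting concrete, a minimal sketch of the similarity being queried follows: records as mappings from tokens to weights, compared with weighted Jaccard similarity. This is one common weighted set-similarity metric consistent with the setting above; the function and record names are illustrative, not the paper's actual API.

```python
def weighted_jaccard(a: dict, b: dict) -> float:
    """Sum of per-token minimum weights over the union of tokens,
    divided by the sum of per-token maximum weights."""
    tokens = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in tokens)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in tokens)
    return num / den if den else 0.0

# Two toy records with weighted tokens (hypothetical example data).
r1 = {"data": 2.0, "cleaning": 1.0}
r2 = {"data": 1.0, "search": 1.0}
print(weighted_jaccard(r1, r2))  # min-sum 1.0 / max-sum 4.0 = 0.25
```

A filter-and-verify framework avoids evaluating this full formula for most record pairs: cheap filters (such as the length and prefix/suffix bounds described above) prune candidates first, and only the survivors are verified with the exact computation.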

