Efficient similarity queries via lossy compression

Idoia Ochoa,Amir Ingber,Tsachy Weissman

doi:10.1109/allerton.2013.6736618

Abstract

The generation of new databases and the amount of data on existing ones is growing exponentially. As a result, executing queries on large databases is becoming a timely and challenging task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can be performed efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work, how to approach these limits in practice has remained largely unexplored. In this paper, we propose an approach to this task, based in part on existing lossy compression algorithms. Specifically, we consider queries of the form: “which sequences in the database are similar to a given sequence y?”. For the case where similarity between sequences is measured via Hamming distortion, we construct schemes whose performance is close to the fundamental limits. Furthermore, we test our scheme on a sample database of real DNA sequences, and show significant compression while still allowing highly reliable query answers.

Full Text