Document retrieval on repetitive string collections

Travis Gagie, Jouni Sirén, Simon J Puglisi, Juha Kärkkäinen, Gonzalo Navarro, Kalle Karhu, Aleksi Hartikainen

doi:10.1007/s10791-017-9297-7

Abstract

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.

Highlights

Document retrieval on natural language text collections is a routine activity in web and enterprise search engines
We develop two novel ideas, interleaved longest common prefix (LCP) and precomputed document lists, that yield highly compressed indexes solving the problem of document listing, top-k document retrieval, and document counting
We show that this is correct as long as the range-minimum query (RMQ) returns the leftmost minimum in the range and that we recurse first to the left and then to the right of each minimum VILCP[i] found
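The recursion mentioned in the last highlight can be illustrated with a minimal Python sketch of document listing over an interleaved LCP array. It is a sketch under stated assumptions, not the authors' implementation: it uses a plain, uncompressed interleaved LCP array (it does not model the VILCP structure), naive quadratic construction, and a linear-scan RMQ. The names `build_index`, `pattern_range`, and `list_documents`, and the `$` terminator, are hypothetical choices made for illustration. The idea it relies on is that, within the suffix range of a pattern of length m, each document contributes exactly one position whose interleaved LCP value is below m.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build_index(docs, sep="$"):
    """Naively build (da, ilcp, suffixes) for a small collection.

    Assumes sep sorts below every document character and occurs in no document.
    da[i]   = document owning the i-th smallest suffix of the collection
    ilcp[i] = LCP of that suffix with its predecessor in its OWN document's
              suffix order (the per-document LCP arrays, interleaved)
    """
    entries = []  # (suffix string, document id, per-document LCP value)
    for d, doc in enumerate(docs):
        text = doc + sep
        order = sorted(range(len(text)), key=lambda i: text[i:])
        for rank, pos in enumerate(order):
            prev = text[order[rank - 1]:] if rank > 0 else ""
            entries.append((text[pos:], d, lcp(prev, text[pos:])))
    # Interleave in global lexicographic order; ties are broken by document id.
    entries.sort(key=lambda e: (e[0], e[1]))
    suffixes = [e[0] for e in entries]
    da = [e[1] for e in entries]
    ilcp = [e[2] for e in entries]
    return da, ilcp, suffixes


def pattern_range(suffixes, pattern):
    """Inclusive range of sorted suffixes prefixed by pattern (linear scan)."""
    hits = [i for i, s in enumerate(suffixes) if s.startswith(pattern)]
    return (hits[0], hits[-1]) if hits else None


def list_documents(index, pattern):
    """Report each document containing the non-empty pattern exactly once."""
    da, ilcp, suffixes = index
    m = len(pattern)
    found = []

    def recurse(lo, hi):
        if lo > hi:
            return
        # Naive RMQ: min() returns the first, i.e. leftmost, minimum.
        i = min(range(lo, hi + 1), key=ilcp.__getitem__)
        if ilcp[i] >= m:
            return  # no value below m remains in this subrange
        recurse(lo, i - 1)   # first to the left ...
        found.append(da[i])
        recurse(i + 1, hi)   # ... then to the right of the minimum

    rng = pattern_range(suffixes, pattern)
    if rng is not None:
        recurse(*rng)
    return found


docs = ["GATTACA", "TTAGATTA", "CACACA"]
index = build_index(docs)
print(sorted(list_documents(index, "TA")))  # → [0, 1]
```

On the plain array used here, the leftmost-minimum and left-then-right conventions are followed for consistency with the highlight, but correctness does not hinge on them; the sketch says nothing about how they matter for the compressed structure.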

Summary

Introduction

Document retrieval on natural language text collections is a routine activity in web and enterprise search engines. The inverted index has well-known limitations: the text must be easy to parse into terms or words, and queries must be sets of words or sequences of words (phrases). Those limitations are acceptable in most cases when natural language text collections are indexed, and they enable the use of an extremely simple index organization that is efficient and scalable, and that has been the key to the success of Web-scale information retrieval. Those limitations, on the other hand, hamper the use of the inverted index in other kinds of string collections where partitioning the text into words and limiting queries to word sequences is inconvenient, difficult, or meaningless: DNA and protein sequences, source code, music streams, and even some East Asian languages. Document retrieval queries are of interest in those string collections, but the state of the art in alternatives to the inverted index is much less developed (Hon et al. 2013; Navarro 2014).

Results

Discussion

Conclusion