Abstract

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size $N$ over alphabet $[1,\sigma]$ is composed of $D$ copies of a string of size $n$, and $s$ edits are applied on ranges of copies. We introduce the first document listing index with size $\tilde{O}(n+s)$, precisely $O((n \lg\sigma + s \lg^2 N) \lg D)$ bits, and with useful worst-case time guarantees: given a pattern of length $m$, the index reports the $\mathit{ndoc} > 0$ strings where it appears in time $O(m \lg^{1+\epsilon} N \cdot \mathit{ndoc})$, for any constant $\epsilon > 0$ (and tells in time $O(m \lg N)$ if $\mathit{ndoc} = 0$). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text $T[1,N]$ with a grammar of size $r$, uses $O(r \lg N)$ bits and counts the number of occurrences of a pattern $P[1,m]$ in time $O(m^2 + m \lg^{2+\epsilon} r)$, for any constant $\epsilon > 0$. We also give the first index using $O(z \lg(N/z) \lg N)$ bits, where $T$ is parsed by Lempel–Ziv into $z$ phrases, counting occurrences in time $O(m \lg^{2+\epsilon} N)$.

Highlights

  • Document retrieval on general string collections is an area that has recently attracted attention [24]

  • The space we might aim at for indexing is $O(n \lg\sigma + s \lg^2 N)$ bits. Although they perform reasonably well in practice, none of the preceding structures for document listing on repetitive collections [8, 12] offers good worst-case time guarantees combined with space guarantees that are appropriate for repetitive collections, that is, growing with $n+s$ rather than with $N$

  • The index (1) uses $O((n \lg\sigma + s \lg^2 N) \lg D)$ bits of space, and (2) performs document listing in time $O(m^2 + m \lg N \cdot \mathit{ndoc})$, for any constant


Summary

Introduction

Document retrieval on general string collections is an area that has recently attracted attention [24]. It is a natural generalization of the basic Information Retrieval tasks carried out on search engines [1, 4], many of which are also useful on Far East languages, collections of genomes, code repositories, multimedia streams, etc. It also enables phrase queries on natural language texts. We are interested in highly repetitive string collections [23], which are formed by a few distinct documents and a number of near-copies of those. Such collections arise, for example, when sequencing the genomes of thousands of individuals of a few species, when managing versioned collections of documents like Wikipedia, and in versioned software repositories. Document listing on repetitive collections is an underdeveloped area: most succinct indices for string collections are based on statistical compression, and these fail to exploit repetitiveness [19].
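To make the problem statement concrete, the following is a minimal sketch of document listing on a small repetitive collection. It is a naive baseline that scans every document and is not the index introduced in this work. The function names `make_repetitive_collection` and `list_documents` are hypothetical, and the edit model is an assumption made for illustration: each edit is a single-character substitution applied to a contiguous range of copies.

```python
def make_repetitive_collection(base, edits, num_copies):
    """Build num_copies copies of base and apply each edit to a range of copies.

    Assumed edit format (illustrative only):
    (first_copy, last_copy, position, new_char), a substitution
    applied to copies first_copy..last_copy inclusive.
    """
    copies = [list(base) for _ in range(num_copies)]
    for first, last, pos, ch in edits:
        for d in range(first, last + 1):
            copies[d][pos] = ch
    return ["".join(c) for c in copies]


def list_documents(docs, pattern):
    """Return the indices of the documents that contain pattern.

    Naive scan over the whole collection: the cost grows with the total
    collection size, which is what a compressed index aims to avoid.
    """
    return [i for i, doc in enumerate(docs) if pattern in doc]


# D = 4 copies of a base string of length n = 7, with s = 2 edits.
docs = make_repetitive_collection(
    "GATTACA",
    [(1, 2, 3, "C"), (3, 3, 0, "T")],
    4,
)
hits = list_documents(docs, "TCA")
print(docs)
print(hits)
```

The scan touches all $N$ symbols of the collection for every query, whereas the goal stated above is an index whose space grows with $n+s$ and whose query time depends on $m$ and the number of reported documents.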

Our contribution
Related work
Navarro
Listing the different elements in a range
Wavelet trees
Range minimum queries on arrays with runs
Grammar compression
Grammar-based indexing
Our Document Listing Index
Structure
Document listing
Analysis in a Repetitive Scenario
Conclusions
A Proof of Correctness
