Document listing on repetitive collections with guaranteed performance

Gonzalo Navarro

doi:10.1016/j.tcs.2018.11.022

Abstract

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size $N$ over alphabet $[1,\sigma]$ is composed of $D$ copies of a string of size $n$, and $s$ edits are applied on ranges of copies. We introduce the first document listing index with size $\tilde{O}(n+s)$, precisely $O((n \lg\sigma + s \lg^2 N) \lg D)$ bits, and with useful worst-case time guarantees: given a pattern of length $m$, the index reports the $\mathit{ndoc} > 0$ strings where it appears in time $O(m \lg^{1+\epsilon} N \cdot \mathit{ndoc})$, for any constant $\epsilon > 0$ (and tells in time $O(m \lg N)$ if $\mathit{ndoc} = 0$). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text $T[1,N]$ with a grammar of size $r$, uses $O(r \lg N)$ bits and counts the number of occurrences of a pattern $P[1,m]$ in time $O(m^2 + m \lg^{2+\epsilon} r)$, for any constant $\epsilon > 0$. We also give the first index using $O(z \lg(N/z) \lg N)$ bits, where $T$ is parsed by Lempel–Ziv into $z$ phrases, counting occurrences in time $O(m \lg^{2+\epsilon} N)$.
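To make the problem statement concrete, the following is a minimal brute-force sketch of the repetitive-collection model and of the two queries the abstract mentions (document listing and occurrence counting). It is not the index introduced in the paper: it scans every document and so takes time proportional to $N$. The function names and the edit format (substitutions only, applied to a range of copies) are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical edit format for this sketch: (first_copy, last_copy, position, new_char),
# meaning "substitute new_char at `position` in copies first_copy..last_copy"
# (0-based, inclusive). Real edits could also be insertions or deletions.
Edit = Tuple[int, int, int, str]


def build_collection(base: str, num_copies: int, edits: List[Edit]) -> List[str]:
    """Build D = num_copies copies of `base`, then apply each edit to a range of copies."""
    docs = [list(base) for _ in range(num_copies)]
    for first, last, pos, ch in edits:
        for d in range(first, last + 1):
            docs[d][pos] = ch
    return ["".join(doc) for doc in docs]


def list_documents_naive(docs: List[str], pattern: str) -> List[int]:
    """Document listing by brute force: indices of the documents containing `pattern`."""
    return [d for d, doc in enumerate(docs) if pattern in doc]


def count_occurrences_naive(text: str, pattern: str) -> int:
    """Count all (possibly overlapping) occurrences of `pattern` in `text`."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


# One edit (s = 1) applied to copies 1..2 of a base string of size n = 11, with D = 4.
docs = build_collection("abracadabra", 4, [(1, 2, 4, "x")])
print(list_documents_naive(docs, "cad"))   # [0, 3]
print(list_documents_naive(docs, "xad"))   # [1, 2]
print(count_occurrences_naive("abracadabra", "abra"))  # 2
```

A single edit here changes two documents at once, which is why the model measures repetitiveness by $n + s$ rather than by the total size $N$.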

Highlights

Document retrieval on general string collections is an area that has recently attracted attention [24].
The space we might aim at for indexing is $O(n \lg\sigma + s \lg^2 N)$ bits. Although they perform reasonably well in practice, none of the preceding structures for document listing on repetitive collections [8, 12] offers good worst-case time guarantees combined with space guarantees that are appropriate for repetitive collections, that is, growing with $n+s$ rather than with $N$.
1. uses $O((n \lg\sigma + s \lg^2 N) \lg D)$ bits of space, and 2. performs document listing in time $O(m^2 + m \lg N \cdot \mathit{ndoc})$, for any constant
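As a rough illustration of what "growing with n+s rather than with N" means, the sketch below evaluates the growth term of the stated space bound next to the $N \lg\sigma$ bits of the plain collection. Big-O notation hides constant factors, so these numbers are orders of growth only, not index sizes. The function names and the parameter values are hypothetical.

```python
import math


def repetitive_bound_bits(n: int, s: int, sigma: int, N: int, D: int) -> float:
    """Growth term (n lg sigma + s lg^2 N) lg D, with the hidden constant taken as 1."""
    return (n * math.log2(sigma) + s * math.log2(N) ** 2) * math.log2(D)


def plain_collection_bits(N: int, sigma: int) -> float:
    """Bits needed to store the collection verbatim: N lg sigma."""
    return N * math.log2(sigma)


# Hypothetical scenario: D = 1000 near-copies of a base string of n = 10^6 symbols
# over an alphabet of sigma = 4, with s = 10^4 edits; N is approximated as n * D.
n, s, sigma, D = 10**6, 10**4, 4, 1000
N = n * D

bound = repetitive_bound_bits(n, s, sigma, N, D)
plain = plain_collection_bits(N, sigma)
print(f"bound term: {bound:.3e} bits, plain text: {plain:.3e} bits")
print(f"ratio: {bound / plain:.3f}")
```

Under these assumed values the bound term is well below the plain size, and the gap widens as $D$ grows while $n$ and $s$ stay fixed.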

Summary

Introduction

Document retrieval on general string collections is an area that has recently attracted attention [24]. It is a natural generalization of the basic Information Retrieval tasks carried out on search engines [1, 4], many of which are also useful on Far East languages, collections of genomes, code repositories, multimedia streams, etc. It also enables phrase queries on natural language texts. We are interested in highly repetitive string collections [23], which are formed by a few distinct documents and a number of near-copies of those. Such collections arise, for example, when sequencing the genomes of thousands of individuals of a few species, when managing versioned collections of documents like Wikipedia, and in versioned software repositories. Document listing on repetitive collections is an underdeveloped area: most succinct indices for string collections are based on statistical compression, and these fail to exploit repetitiveness [19].

Our contribution

Related work

Listing the different elements in a range

Wavelet trees

Range minimum queries on arrays with runs

Grammar compression

Grammar-based indexing

Our Document Listing Index

Structure

Document listing

Analysis in a Repetitive Scenario

Conclusions

A Proof of Correctness

Journal: Theoretical Computer Science
Publication Date: Nov 30, 2018
Citations: 12
License type: publisher-specific-oa
