Succinct Non-overlapping Indexing

Arnab Ganguly, Rahul Shah, Sharma V. Thankachan

doi:10.1007/s00453-019-00605-5

Abstract

Text indexing is a fundamental problem in computer science. The objective is to preprocess a text $$T$$ so that, given a pattern $$P$$, we can efficiently find all starting positions (or simply, occurrences) of $$P$$ in $$T$$. In some cases, additional restrictions are imposed. We consider two variants: the non-overlapping indexing problem and the range non-overlapping indexing problem. Given a text $$T$$ of $$n$$ characters, the non-overlapping indexing problem is defined as follows: preprocess $$T$$ into a data structure such that, for any pattern $$P$$ of $$|P|$$ characters, we can report a set containing the maximum number of non-overlapping occurrences of $$P$$ in $$T$$. Cohen and Porat (in: Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, Proceedings, 2009) showed that by maintaining a linear-space index in which the suffix tree of $$T$$ is augmented with an $$O(n)$$-word data structure, a query $$P$$ can be answered in optimal time $$O(|P|+nocc)$$, where $$nocc$$ is the number of occurrences reported. We present the following new result. Let $$\mathsf {CSA}$$ (not necessarily a compressed suffix array) be an index of $$T$$ that can compute (i) the suffix range of $$P$$ in $$\mathsf {search}(P)$$ time, and (ii) a suffix array or an inverse suffix array value in $$\mathsf {t}_\mathsf {SA}$$ time. By using $$\mathsf {CSA}$$ alone, we can answer a query $$P$$ in $$\mathsf {search}(P)+\mathsf {sort}(nocc)+O(nocc\cdot \mathsf {t}_\mathsf {SA})$$ time, where $$\mathsf {sort}(k)$$ denotes the time for sorting $$k$$ numbers in $$\{1,2,\dots ,n\}$$. In the range non-overlapping indexing problem, along with the pattern $$P$$, two integers $$a$$ and $$b$$, with $$b \ge a$$, are provided as input. The task is to report a set containing the maximum number of non-overlapping occurrences of $$P$$ that lie within the range $$[a, b]$$.
For any arbitrarily small positive constant $$\epsilon $$, we present an $$O(n \log ^\epsilon n)$$-word index with $$O(|P| + nocc_{a,b})$$ query time, where $$nocc_{a,b}$$ is the number of occurrences reported. Our index improves upon the result of Cohen and Porat [6].
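
To make the two problem statements concrete, the following minimal Python sketch computes a maximum set of non-overlapping occurrences with a plain suffix array and a left-to-right greedy scan. It is a naive baseline, not the index described in the abstract: it extracts and sorts every occurrence of the pattern, which may be far more than the $$nocc$$ occurrences reported. The function names, the 0-based positions, and the reading of "lie within $$[a, b]$$" as "start at or after $$a$$ and end at or before $$b$$" are assumptions made for illustration only.

```python
def build_suffix_array(text):
    """Naive suffix array construction; for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sorted_occurrences(text, sa, pattern):
    """Return all starting positions of pattern in text, in increasing order.

    The suffix range is located by two binary searches over the suffix array,
    comparing the length-|pattern| prefix of each suffix with the pattern.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def non_overlapping_occurrences(text, pattern, a=None, b=None):
    """Greedy maximum set of non-overlapping occurrences (0-based positions).

    If a and b are given, only occurrences that start at or after a and end
    at or before b are considered (an assumed reading of the range variant).
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    sa = build_suffix_array(text)
    chosen = []
    next_free = 0  # smallest start position that does not overlap the last pick
    for p in sorted_occurrences(text, sa, pattern):
        if a is not None and p < a:
            continue
        if b is not None and p + m - 1 > b:
            continue
        if p >= next_free:
            chosen.append(p)
            next_free = p + m
    return chosen


if __name__ == "__main__":
    print(non_overlapping_occurrences("abababa", "aba"))
    print(non_overlapping_occurrences("abababa", "aba", a=1, b=6))
```

Because all occurrences have the same length $$|P|$$, always taking the leftmost occurrence that does not overlap the previous pick yields a set of maximum size, which is why a single pass over the sorted positions suffices in this sketch.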

Full Text