Abstract

Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T that answers the following type of online query efficiently: given a range [α,β], return a shortest substring T[i,j] of T with exactly one occurrence in [α,β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√n log^ε n) query time, where ε > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
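To make the range query concrete, the following is a minimal brute-force sketch in Python. It is not the data structure from the paper; it only spells out what a single query returns. The function name `range_sus_naive` is hypothetical, positions are 1-based as in the abstract, and the sketch assumes that an occurrence is "in [α,β]" when its starting position lies in the range. If the intended definition requires the whole occurrence to lie inside the range, the inner loop bounds would change accordingly.

```python
def range_sus_naive(T, alpha, beta):
    """Return 1-based (i, j) such that T[i,j] is a shortest substring of T
    with exactly one occurrence starting in [alpha, beta].

    Assumption (not confirmed by the abstract): an occurrence counts as
    being "in the range" when its starting position is in [alpha, beta].
    """
    n = len(T)
    # Try lengths in increasing order; the first length that admits a
    # range-unique substring gives a shortest answer.
    for length in range(1, n - alpha + 2):
        counts = {}  # substring -> number of starting positions in range
        first = {}   # substring -> leftmost starting position in range
        last_start = min(beta, n - length + 1)
        for start in range(alpha, last_start + 1):
            s = T[start - 1:start - 1 + length]
            counts[s] = counts.get(s, 0) + 1
            first.setdefault(s, start)
        unique_starts = [first[s] for s, c in counts.items() if c == 1]
        if unique_starts:
            i = min(unique_starts)  # break ties by leftmost position
            return (i, i + length - 1)
    return None  # unreachable for a valid range 1 <= alpha <= beta <= n
```

Each call to this sketch takes time polynomial in n, which is exactly the cost that a preprocessed data structure is meant to avoid.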

Highlights

  • Finding regularities in strings is one of the main topics of combinatorial pattern matching and its applications [1]

  • All of the shortest unique substrings of string T can be computed in O(n) time using the suffix tree data structure [6,7]

  • We introduce a natural generalization of the shortest unique substring problem


Summary

Introduction

Finding regularities in strings is one of the main topics of combinatorial pattern matching and its applications [1]. All of the shortest unique substrings of a string T can be computed in O(n) time using the suffix tree data structure [6,7]. A well-studied query version asks: given a position i of T, return a shortest unique substring of T covering i; the task is to construct a data structure over T that answers such online queries efficiently. Range query data structures have also been considered for strings [23,24,25,26]. In the Range–LCP problem, defined by Amir et al. [23], the task is likewise to construct a data structure over T that answers online range queries efficiently. The state of the art is an O(n)-word data structure supporting O(log^{O(1)} n)-time (polylogarithmic-time) queries [25] (see [26,32]).
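The position-based query mentioned above can be illustrated with a small brute-force sketch in Python. It is for illustration only and is unrelated to the suffix-tree algorithms cited in the text. The names `count_occurrences` and `sus_covering` are hypothetical, and positions are 1-based.

```python
def count_occurrences(T, s):
    """Count occurrences of s in T, including overlapping ones."""
    count = 0
    pos = T.find(s)
    while pos != -1:
        count += 1
        pos = T.find(s, pos + 1)  # advance by one to allow overlaps
    return count


def sus_covering(T, i):
    """Return 1-based (start, end) of a shortest unique substring of T
    that covers position i (start <= i <= end)."""
    n = len(T)
    for length in range(1, n + 1):
        # Starting positions whose window of this length covers i.
        lo = max(1, i - length + 1)
        hi = min(i, n - length + 1)
        for start in range(lo, hi + 1):
            s = T[start - 1:start - 1 + length]
            if count_occurrences(T, s) == 1:
                return (start, start + length - 1)
    return None  # unreachable for 1 <= i <= n: T itself is unique
```

For example, in T = abab no single letter is unique, so the answer for position 2 is the length-2 substring ba at positions 2 to 3.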

Main Problem and Main Results
Paper Organization
Proof of Lemma 1
Final Remarks