Abstract
Let $s = s_1..s_n$ be a text (or sequence) over a finite alphabet $\Sigma$ of size $\sigma$. A fingerprint in $s$ is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set $F$ of all fingerprints of all substrings of $s$ in order to answer certain queries on this set efficiently. A substring $s_i..s_j$ is a maximal location for a fingerprint $f \in F$ (denoted by $\langle i, j \rangle$) if the alphabet of $s_i..s_j$ is $f$ and $s_{i-1}$, $s_{j+1}$, if defined, are not in $f$. The set of maximal locations in $s$ is $L$ (it is easy to see that $|L| \leq n\sigma$). Two maximal locations $\langle i, j \rangle$ and $\langle k, l \rangle$ such that $s_i..s_j = s_k..s_l$ are called copies, and the quotient set of $L$ under the copy relation is denoted by $L_C$. We first present new, efficient exact algorithms and data structures for the following three problems: (1) to compute $F$; (2) given $f$ as a set of distinct characters in $\Sigma$, to decide whether $f$ is a fingerprint in $F$; (3) given $f$, to find all maximal locations of $f$ in $s$. As in papers on succinct data structures, all space complexities in this paper are counted in bits, and all logarithms are base two. Problem 1 is solved either in $O(n + |L_C| \log \sigma)$ worst-case time using $O((n + |L_C| + |F| \log \sigma) \log n)$ bits of space, or in $O(n + |L| \log \sigma)$ randomized expected time using $O((n + |F| \log \sigma) \log n)$ bits of space. Problem 2 is solved either in $O(|f|)$ expected time if only $O(|f| \log n)$ bits of working space are allowed for queries, or in $O(|f| / \epsilon)$ worst-case time if a working space of $O(\sigma^{\epsilon} \log n)$ bits is allowed (with $\epsilon$ a constant satisfying $0 < \epsilon < 1$). These algorithms use a data structure that occupies $|F| (2 \log \sigma + \log_2 e)(1 + o(1))$ bits.
Problem 3 is solved with the same time complexity as Problem 2, plus an additive occ term in each bound, where occ is the number of maximal locations of the given fingerprint. Our solution to this last problem requires a data structure that occupies $O((n + |L_C|) \log n)$ bits of memory. In the second part of the paper we present a novel Monte Carlo approximate construction approach. With it, Problem 1 is solved in $O(n + |L|)$ expected time using $O(|F| \log n)$ bits of space, but the algorithm may return an incorrect result with an extremely small probability that can be bounded in advance.
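To make the definitions of fingerprint, maximal location, and copy concrete, here is a minimal brute-force sketch in Python. It is not one of the algorithms of the paper: it runs in quadratic time and is meant only to illustrate the objects $F$, $L$, and $L_C$ on small inputs. The function names `maximal_locations` and `copy_classes` are hypothetical, chosen for this illustration, and positions are reported 1-based to match the $\langle i, j \rangle$ notation.

```python
def maximal_locations(s):
    """Map each fingerprint (a frozenset of characters) to its maximal
    locations in s, given as 1-based inclusive pairs (i, j).

    Brute force for illustration only: O(n^2) substrings are examined.
    """
    n = len(s)
    table = {}
    for i in range(n):
        seen = set()  # alphabet of s[i..j], grown as j advances
        for j in range(i, n):
            seen.add(s[j])
            # Maximal: the neighbours, if defined, are outside the fingerprint.
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                table.setdefault(frozenset(seen), []).append((i + 1, j + 1))
    return table


def copy_classes(s, table):
    """Group maximal locations whose substrings are equal (the copy relation).

    Returns a dict from substring to its list of maximal locations; the
    number of keys is |L_C|.
    """
    classes = {}
    for locations in table.values():
        for (i, j) in locations:
            classes.setdefault(s[i - 1:j], []).append((i, j))
    return classes


if __name__ == "__main__":
    text = "aba"
    table = maximal_locations(text)
    for fingerprint, locations in table.items():
        print(sorted(fingerprint), locations)
    print(len(copy_classes(text, table)))
```

Every fingerprint has at least one maximal location (any occurrence can be extended left and right until a character outside the fingerprint is met), so the keys of the returned table are exactly $F$. For the text `aba`, the sketch finds three fingerprints and four maximal locations; the two single-character locations $\langle 1, 1 \rangle$ and $\langle 3, 3 \rangle$ are copies, so $|L_C| = 3$.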