Various improvements to text fingerprinting

Djamal Belazzougui, Mathieu Raffinot, Roman Kolpakov

doi:10.1016/j.jda.2013.06.004

Abstract

Let $s = s_1 \ldots s_n$ be a text (or sequence) over a finite alphabet $\Sigma$ of size $\sigma$. A fingerprint in $s$ is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set $F$ of all fingerprints of all substrings of $s$ in order to answer certain questions about this set efficiently. A substring $s_i \ldots s_j$ is a maximal location for a fingerprint $f \in F$ (denoted by $\langle i, j \rangle$) if the alphabet of $s_i \ldots s_j$ is $f$ and $s_{i-1}$, $s_{j+1}$, if defined, are not in $f$. The set of maximal locations in $s$ is $L$ (it is easy to see that $|L| \le n\sigma$). Two maximal locations $\langle i, j \rangle$ and $\langle k, l \rangle$ such that $s_i \ldots s_j = s_k \ldots s_l$ are called copies, and the quotient set of $L$ under the copy relation is denoted by $L_C$.

We first present new efficient exact algorithms and data structures for the following three problems: (1) to compute $F$; (2) given $f$ as a set of distinct characters in $\Sigma$, to decide whether $f$ is a fingerprint in $F$; (3) given $f$, to find all maximal locations of $f$ in $s$. As in papers on succinct data structures, all space complexities in this paper are counted in bits, and all logarithms are base two.

Problem 1 is solved either in $O(n + |L_C| \log \sigma)$ worst-case time using $O((n + |L_C| + |F| \log \sigma) \log n)$ bits of space, or in $O(n + |L| \log \sigma)$ randomized expected time using $O((n + |F| \log \sigma) \log n)$ bits of space. Problem 2 is solved either in $O(|f|)$ expected time if only $O(|f| \log n)$ bits of working space for queries are allowed, or in worst-case $O(|f| / \epsilon)$ time if a working space of $O(\sigma^{\epsilon} \log n)$ bits is allowed (with $\epsilon$ a constant satisfying $0 < \epsilon < 1$). These algorithms use a data structure that occupies $|F| (2 \log \sigma + \log_2 e)(1 + o(1))$ bits.
Problem 3 is solved with the same time complexities as Problem 2, each with an additional $occ$ term, where $occ$ is the number of maximal locations corresponding to the given fingerprint. Our solution to this last problem requires a data structure that occupies $O((n + |L_C|) \log n)$ bits of memory.

In the second part of the paper we present a novel Monte Carlo approximate construction approach. With it, Problem 1 is solved in $O(n + |L|)$ expected time using $O(|F| \log n)$ bits of space, but the algorithm is incorrect with an extremely small probability that can be bounded in advance.
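
To make the definitions above concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper and comes nowhere near the stated bounds: it runs in quadratic time and stores fingerprints in ordinary hash tables. The function names (`maximal_locations`, `copy_classes`) are hypothetical, chosen only for this illustration. Positions are reported 1-based and inclusive, matching the ⟨i, j⟩ notation of the abstract.

```python
from collections import defaultdict


def maximal_locations(s):
    """Map each fingerprint (a frozenset of characters) to its maximal locations.

    Naive O(n^2) enumeration, for illustration only.
    Locations are (i, j) pairs, 1-based and inclusive.
    """
    n = len(s)
    locations = defaultdict(list)
    for i in range(n):
        seen = set()  # alphabet of s[i..j] as j grows
        for j in range(i, n):
            seen.add(s[j])
            # If the left neighbour is already in the alphabet, it stays there
            # for every longer substring, so no maximal location starts at i
            # from this point on.
            if i > 0 and s[i - 1] in seen:
                break
            # The right neighbour must be undefined or outside the alphabet.
            if j + 1 == n or s[j + 1] not in seen:
                locations[frozenset(seen)].append((i + 1, j + 1))
    return dict(locations)


def copy_classes(s, locations):
    """Group maximal locations by the substring they spell (the copy relation)."""
    classes = defaultdict(list)
    for locs in locations.values():
        for i, j in locs:
            classes[s[i - 1:j]].append((i, j))
    return dict(classes)


# Usage on a tiny text.
s = "abab"
L = maximal_locations(s)

# Problem 1: every fingerprint has at least one maximal location,
# so the keys of L are exactly the non-empty fingerprints F.
F = set(L)

# Problem 2: membership query for a candidate character set.
is_fingerprint = frozenset("ab") in F

# Problem 3: all maximal locations of a given fingerprint.
locs_of_a = L.get(frozenset("a"), [])  # [(1, 1), (3, 3)]

# Quotient set L_C: one class per distinct substring among maximal locations.
LC = copy_classes(s, L)
```

For `s = "abab"` this yields three fingerprints ({a}, {b}, {a, b}), five maximal locations, and three copy classes, consistent with the bound |L| ≤ nσ = 8. The early `break` relies on the fact that the alphabet of a substring only grows as it is extended to the right.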

Journal: Journal of Discrete Algorithms
Publication Date: Jun 25, 2013
Citations: 2
License type: publisher-specific-oa

Similar Papers

Data Representation in Big data via succinct data structures
Vinesh Kumar ... Sunil Kumar
GBAMS- Vidushi | VOL. 9
30 Dec 2017

A Faster Query Algorithm for the Text Fingerprinting Problem
Chi-Yuan Chan ... Biing-Feng Wang
08 Oct 2007

Faster Text Fingerprinting
Roman Kolpakov ... Mathieu Raffinot
01 Jan 2008

New Algorithms for Text Fingerprinting
Roman Kolpakov ... Mathieu Raffinot
01 Jan 2006
