New Algorithms for Text Fingerprinting

Roman Kolpakov, Mathieu Raffinot

doi:10.1007/11780441_31

Abstract

Let \(s = s_1 \ldots s_n\) be a text (or sequence) on a finite alphabet \(\Sigma\). A fingerprint in \(s\) is the set of distinct characters contained in one of its substrings. Fingerprinting a text consists of computing the set \({\mathcal{F}}\) of all fingerprints of all its substrings and being able to efficiently answer several questions on this set. A given fingerprint \(f \in {\mathcal{F}}\) is represented by a binary array, \(F\), of size \(|\Sigma|\), named a fingerprint table. A fingerprint \(f \in {\mathcal{F}}\) admits a number of maximal locations \((i,j)\) in \(s\), that is, the alphabet of \(s_i \ldots s_j\) is \(f\) and \(s_{i-1}\), \(s_{j+1}\), if defined, are not in \(f\). The total number of maximal locations is \({\mathcal{L}} \leq n |\Sigma|+1\). We present new algorithms and a new data structure for the three problems: (1) compute \({\mathcal{F}}\); (2) given \(F\), answer if \(F\) represents a fingerprint in \({\mathcal{F}}\); (3) given \(F\), find all maximal locations of \(F\) in \(s\). These problems are respectively solved in \(O(({\mathcal{L}}+ n) \log |\Sigma|)\), \(\Theta(|\Sigma|)\), and \(\Theta(|\Sigma| + K)\) time, where \(K\) is the number of maximal locations of \(F\).

Keywords: Maximal Location, Hash Table, Distinct Character, Naming Algorithm, Edge Label
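To make the definitions above concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it runs in quadratic time and only illustrates what fingerprints, fingerprint tables, and maximal locations are. The function names `maximal_locations` and `table_to_fingerprint`, and the convention that a fingerprint table is a list of 0/1 flags indexed by a sorted alphabet, are assumptions made for this illustration.

```python
def maximal_locations(s):
    """Map each fingerprint (a frozenset of characters) to the list of its
    maximal locations (i, j), 1-based and inclusive, found by brute force."""
    n = len(s)
    locations = {}
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            # Once s[i-1] belongs to the alphabet of s[i..j], no longer
            # substring starting at i can be maximal on the left.
            if i > 0 and s[i - 1] in seen:
                break
            # Maximal on the right: s[j+1] is undefined or not in the alphabet.
            if j == n - 1 or s[j + 1] not in seen:
                locations.setdefault(frozenset(seen), []).append((i + 1, j + 1))
    return locations


def table_to_fingerprint(table, alphabet):
    """Convert a binary fingerprint table (hypothetical layout: one 0/1 flag
    per character of the sorted alphabet) into a set of characters."""
    return frozenset(c for c, bit in zip(alphabet, table) if bit)


if __name__ == "__main__":
    text = "abca"
    alphabet = sorted(set(text))          # ['a', 'b', 'c']
    locs = maximal_locations(text)

    # Problem (1): the set of all fingerprints is the key set of `locs`.
    print(len(locs), "fingerprints")

    # Problems (2) and (3): membership and maximal locations for a table F.
    F = [1, 0, 0]                         # represents the fingerprint {a}
    f = table_to_fingerprint(F, alphabet)
    print(f in locs, locs.get(f, []))
```

For the text `abca`, this sketch finds 7 fingerprints and 8 maximal locations in total, which is within the bound \(n|\Sigma| + 1 = 13\) stated in the abstract.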

Full Text