Property Suffix Array with Applications in Indexing Weighted Sequences

Panagiotis Charalampopoulos,Costas S Iliopoulos,Chang Liu,Solon P Pissis

doi:10.1145/3385898

Abstract

The suffix array is one of the most prevalent data structures for string indexing; it stores the lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the suffix tree is space efficiency. In Property Indexing , we are given a string x of length n and a property Π, i.e., a set of Π- valid intervals over x so that a pattern p occurs in x if and only if x has an occurrence of p that lies entirely within an interval of Π. A suffix-tree-like index over the valid prefixes of suffixes of x can be built in time and space O ( n ). We show here how to directly build a suffix-array-like index, the Property Suffix Array (PSA), in time and space O ( n ). We mainly draw our motivation from weighted (probabilistic) sequences: sequences of probability distributions over a given alphabet. Given a probability threshold 1/ z , we say that a string p of length m matches a weighted sequence x of length n at starting position i if the product of probabilities of the letters of p at positions i , …, i + m − 1 in x is at least 1/ z . Our algorithm for building the PSA can be directly applied to build an O ( nz )-sized suffix-array-like index over x in time and space O ( nz ). Finally, we present extensive experimental results showing that our new indexing data structure is well suited for real-world applications.

Full Text