Crochemore’s Partitioning on Weighted Strings and Applications

Carl Barton, Solon P. Pissis

doi:10.1007/s00453-016-0266-0

Abstract

Given a string on alphabet $$\varSigma$$, the partitioning problem is to compute equivalence classes on the set of positions of the input string. These classes implicitly memorise identical factors of the string and, hence, their efficient computation is essential for a wide range of string processing applications. We study this problem for a weighted string: for every position of the weighted string and every letter of the alphabet, a probability of occurrence of this letter at this position is given. Thus a weighted string may represent many different strings, each with probability of occurrence equal to the product of the probabilities of its letters at consecutive positions. In this article, we present a non-trivial generalisation of Crochemore’s partitioning algorithm (IPL, 1981) that works on weighted strings and requires time $$\mathcal{O}(\upsilon n \log \upsilon n)$$, where $$n$$ is the length of the string, $$\upsilon = \min \{z^2, zn, \sigma^n\}$$, $$\sigma$$ is the size of $$\varSigma$$, and $$1/z$$ is a cumulative weight threshold, defined as the minimal probability of occurrence of factors in the string. Our contributions can be summarised as follows: (a) we design the first algorithm to solve the partitioning problem on weighted strings for arbitrary $$z$$ and $$\sigma$$ in time $$\mathcal{O}(\upsilon n \log \upsilon n)$$ and space $$\mathcal{O}(\upsilon n)$$, improving the state of the art for $$z=\mathcal{O}(1)$$; (b) we improve the state of the art for numerous other string processing problems; and (c) we show further combinatorial insight into the relation between weighted and indeterminate strings, that is, sequences of alphabet subsets without associated occurrence probabilities.
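To make the definition of a weighted string concrete, the following minimal Python sketch computes the occurrence probability of a factor as the product of its letter probabilities and compares it with the threshold $$1/z$$. The example data, the names `weighted`, `occurrence_probability` and `is_valid`, and the choice to treat a factor as valid when its probability is at least $$1/z$$ are illustrative assumptions, not taken from the article; the sketch does not implement the partitioning algorithm itself.

```python
from fractions import Fraction

# Hypothetical weighted string of length 3 over the alphabet {a, b}:
# each position maps a letter to its probability of occurrence there.
weighted = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(1, 4), "b": Fraction(3, 4)},
]


def occurrence_probability(ws, factor, start):
    """Product of the letter probabilities of `factor` placed at `start`."""
    if start < 0 or start + len(factor) > len(ws):
        raise ValueError("factor does not fit in the weighted string")
    prob = Fraction(1)
    for offset, letter in enumerate(factor):
        # A letter that is absent at a position has probability 0 there.
        prob *= ws[start + offset].get(letter, Fraction(0))
    return prob


def is_valid(ws, factor, start, z):
    """Assumed validity test: probability is at least the threshold 1/z."""
    return occurrence_probability(ws, factor, start) >= Fraction(1, z)
```

With $$z = 4$$ in this example, the factor `aab` at position 0 has probability 1/2 · 1 · 3/4 = 3/8 and passes the threshold, while `aaa` has probability 1/8 and does not. Exact fractions are used so that the threshold comparison is not affected by floating-point rounding.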

Full Text