New Perspectives on the Prefix Array

W. F. Smyth, Shu Wang

doi:10.1007/978-3-540-89097-3_14

Abstract

In this paper we consider the prefix array π = π[1..n] of a string x = x[1..n], in which π[1] = n and, for i > 1, π[i] = k iff k is the largest integer such that x[1..k] = x[i..i+k−1]. The prefix array is closely related to the border array β: an integer array β[1..n] such that β[i] = k iff the length of the longest border of x[1..i] is k. Border arrays or their variants are used in many string algorithms, and prefix arrays can be used directly for pattern-matching. It is well known that for regular strings β provides all the information that π does; we show, however, that for indeterminate strings (those containing entries that match a subset of the alphabet) π actually provides more information, in fact still enabling all the borders of every prefix of x to be specified. Since many of the entries of π are expected to be zero, it is natural to represent π in compressed form using integer arrays POS = POS[1..m] and LEN = LEN[1..m], where m is the number of nonzero entries in π, and POS[j] = i, LEN[j] = π[i] iff the j-th nonzero entry in π occurs in position i and takes the value π[i]. The expected value of m is n/σ − 1, where σ is the alphabet size. The straightforward way of computing POS/LEN requires computing π first, and therefore requires O(n) extra space. We describe two Θ(n)-time algorithms, PL1 and PL2, that compute POS/LEN for regular strings using only 8m bytes of storage in addition to the n bytes required for x. PL1 requires about one-third the time of the standard border array algorithm MP on English-language strings; PL2 executes faster than MP on both English-language strings and highly periodic strings on {a,b}. For indeterminate strings, we describe an extension IPL of PL1 that computes POS/LEN in O(n²) worst-case time (though generally much faster), still using only 8m bytes of additional storage. For both regular and indeterminate strings, the compressed form of π can be used for efficient pattern-matching.
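To make the definitions above concrete, the following is a minimal sketch of the straightforward route the abstract mentions: compute the full prefix array first, then compress it into POS/LEN. It is not PL1, PL2, or IPL, and it handles regular strings only. The prefix array entry at position i is taken to be the length of the longest substring starting at i that matches a prefix of the string. The function names prefix_array, compress, and border_array_from_prefix are hypothetical, the linear-time scan is the widely used textbook method rather than anything taken from the paper, and Python lists are 0-indexed, so POS is shifted by one to match the 1-based positions of the text.

```python
def prefix_array(x):
    """Prefix array of a regular string x, as a 0-indexed list.

    pi[i] is the length of the longest substring starting at index i
    that equals a prefix of x; pi[0] is len(x).
    """
    n = len(x)
    pi = [0] * n
    if n == 0:
        return pi
    pi[0] = n
    left = right = 0  # x[left:right] is the rightmost match with a prefix so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known inside the current matched window.
            pi[i] = min(pi[i - left], right - i)
        # Extend the match by direct comparison.
        while i + pi[i] < n and x[pi[i]] == x[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > right:
            left, right = i, i + pi[i]
    return pi


def compress(pi):
    """Compressed form: 1-based positions and values of the nonzero entries."""
    pos = [i + 1 for i, k in enumerate(pi) if k > 0]
    length = [k for k in pi if k > 0]
    return pos, length


def border_array_from_prefix(pi):
    """Border array of a regular string, derived from its prefix array.

    beta[i] is the length of the longest border of x[0..i].
    """
    n = len(pi)
    beta = [0] * n
    for i in range(1, n):
        # A match of length pi[i] at i gives a border of length j + 1
        # for the prefix ending at index i + j, for each j < pi[i].
        for j in range(pi[i] - 1, -1, -1):
            if beta[i + j] > 0:
                break  # a longer border was already recorded from a smaller i
            beta[i + j] = j + 1
    return beta


if __name__ == "__main__":
    pi = prefix_array("abaab")
    print(pi)                            # [5, 0, 1, 2, 0]
    print(compress(pi))                  # ([1, 3, 4], [5, 1, 2])
    print(border_array_from_prefix(pi))  # [0, 0, 1, 1, 2]
```

This sketch stores every nonzero entry, including the trivial first one whose value is always n. It also allocates the full array before compressing, which is exactly the O(n) extra space that the algorithms described in the paper are designed to avoid.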
