A Fast Algorithm for Constructing Suffix Arrays for Fixed-Size Alphabets

Dong K Kim,Junha Jo,Heejin Park

doi:10.1007/978-3-540-24838-5_23

Abstract

The suffix array of a string T is basically a sorted list of all the suffixes of T. Suffix arrays have been fundamental index data structures in computational biology. If we are to search a DNA sequence in a genome sequence, we construct the suffix array for the genome sequence and then search the DNA sequence in the suffix array. In this paper, we consider the construction of the suffix array of T of length n where the size of the alphabet is fixed. It has been well-known that one can construct the suffix array of T in O(n) time by constructing suffix tree of T and traversing the suffix tree. Although this approach takes O(n) time, it is not appropriate for practical use because it uses a lot of spaces and it is complicated to implement. Recently, almost at the same time, several algorithms have been developed to directly construct suffix arrays in O(n) time. However, these algorithms are developed for integer alphabets and thus do not exploit the properties given when the size of the alphabet is fixed. We present a fast algorithm for constructing suffix arrays for the fixed-size alphabet. Our algorithm constructs suffix arrays faster than any other algorithms developed for integer or general alphabets when the size of the alphabet is fixed. For example, we reduced the time required for constructing suffix arrays for DNA sequences by 25%-38%. In addition, we do not sacrifice the space to improve the running time. The space required by our algorithm is almost equal to or even less than those required by previous fast algorithms.

Full Text