Abstract

The suffix array, describing the lexicographical order of suffixes of a given text, and the suffix tree, a path-compressed trie of all suffixes, are the two most fundamental data structures for string processing, with plethora of applications in data compression, bioinformatics, and information retrieval. For a length- text, however, they use bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length- text over an alphabet of size , these structures use only bits. Nowadays, these structures are part of the standard toolbox: modern textbooks spend dozens of pages describing their applications, and they almost completely replaced suffix arrays and suffix trees in space-critical applications. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in time [Hon et al., FOCS 2003], which is factor away from the lower bound of (following from the necessity to read the input). In this paper, we make the first in 20 years improvement in for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit -time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take bits, support SA queries and full suffix tree functionality in time per operation, and can be constructed in time using bits of working space. (For example, if , the construction time is .) We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call