Abstract

The suffix array is arguably one of the most important data structures in sequence analysis and consequently there is a multitude of suffix sorting algorithms. However, to this date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort in improving GSACA ’s non-competitive real-world performance. There is a super-linear algorithm DSH , which relies on the same sorting principle and is faster than DivSufSort , the fastest SACA for over a decade. The purpose of this article is twofold: We analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm, and we show that it can be very elegantly used to compute both the original extended Burrows-Wheeler transform ( eBWT ) and a bijective version of the Burrows-Wheeler transform ( BBWT ) in linear time. We call the algorithm “generic,” since it can be used to compute the regular suffix array and the variants used for the BBWT and eBWT . Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH . Our BBWT -algorithm is faster than or competitive with all other tested BBWT construction implementations on large or repetitive data, and our eBWT -algorithm is faster than all other programs on data that is not extremely repetitive.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call