Abstract

A suffix array is a data structure that, together with the LCP array, allows solving many string processing problems in a very efficient fashion. In this article we introduce eGSA, the first external memory algorithm to construct both generalized suffix and LCP arrays for sets of strings. Our algorithm relies on a combination of buffers, induced sorting and a heap. Performance tests with real DNA sequence sets of size up to 8.5 GB showed that eGSA can indeed be applied to sets of large sequences with efficient running time on a low-cost machine. Compared to the algorithm that most closely resembles eGSA purpose, eSAIS, eGSA reduced the time spent to construct the arrays by a factor of 2.5−4.8.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call