Abstract

Background: Suffix arrays, augmented by additional data structures, allow many string processing problems to be solved efficiently. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory.

Results: In this article we present and analyze eGSA [introduced in "External memory generalized suffix and LCP arrays construction", Proceedings of CPM, pp 201–10, 2013], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, eGSA showed competitive performance when compared to eSAIS and SAscan, which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm.

Conclusions: The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection at no additional cost except for the output time.
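To make the objects named in the abstract concrete, the following is a minimal in-memory sketch of a generalized suffix array, its LCP array and the generalized Burrows-Wheeler transform for a small string collection. It is not eGSA: it sorts suffixes by direct comparison and holds everything in RAM. The function name `build_gsa_lcp_bwt`, the use of `$` as the printed terminator, and the tie-breaking rule (equal suffixes ordered by string index, as if each string had its own terminator smaller than every letter) are assumptions made for illustration.

```python
def build_gsa_lcp_bwt(strings):
    """Naive construction for illustration only (quadratic space, in memory).

    Returns:
      gsa: list of (string_index, suffix_start) pairs in sorted suffix order
      lcp: lcp[k] = length of the common prefix of suffixes k-1 and k (lcp[0] = 0)
      bwt: character preceding each suffix, or '$' for a whole-string suffix
    """
    # Collect every suffix, including the empty one that stands for the
    # terminator. Python orders a prefix before any longer string, which
    # matches a terminator smaller than every letter; ties between equal
    # suffixes of different strings are broken by the string index.
    suffixes = []
    for i, s in enumerate(strings):
        for j in range(len(s) + 1):
            suffixes.append((s[j:], i, j))
    suffixes.sort()

    gsa = [(i, j) for _, i, j in suffixes]

    # LCP between lexicographically adjacent suffixes (terminators excluded).
    lcp = [0]
    for k in range(1, len(suffixes)):
        a, b = suffixes[k - 1][0], suffixes[k][0]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)

    # Generalized BWT: the character just before each suffix in its string.
    bwt = "".join(strings[i][j - 1] if j > 0 else "$" for i, j in gsa)
    return gsa, lcp, bwt


gsa, lcp, bwt = build_gsa_lcp_bwt(["banana", "ana"])
print(gsa)
print(lcp)
print(bwt)
```

The BWT falls out of the suffix order with one extra lookup per entry, which is consistent with the abstract's remark that it can be produced for only the cost of writing it out.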

Highlights

  • Suffix arrays [40] may be used for the solution of string processing problems in several areas, including pattern matching, data compression and information retrieval [24, 39, 47]

  • Kärkkäinen and Kempa [26] presented LCPscan, an external memory algorithm that constructs the longest common prefix (LCP) array given the suffix array as input. Bauer et al [6] proposed the extLCP algorithm to construct both the Burrows–Wheeler transform (BWT) and the LCP array for large collections of equal-sized strings in external memory; later, Cox et al [13] presented an extended version of extLCP that handles strings of different sizes

  • Relative performance: to assess the performance of eGSA, we compared it to eSAIS [11], which is the fastest algorithm to date to compute both suffix and LCP arrays in external memory


Summary

Results

In this article we present and analyze eGSA [introduced in Proceedings of CPM, pp 201–10, 2013], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, eGSA showed competitive performance when compared to eSAIS and SAscan, which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm.
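The role of the heap can be illustrated with a simplified k-way merge: each string's suffix array is sorted on its own, and a heap then repeatedly yields the smallest remaining suffix across all strings. This sketch is an assumption-laden simplification rather than the authors' method. It keeps everything in memory and compares suffix text directly, which is exactly the cost that eGSA is described as avoiding through buffers and induced sorting. The names `suffix_array` and `merge_suffix_arrays` are hypothetical.

```python
import heapq


def suffix_array(s):
    """Naive suffix array of one string; position len(s) is the empty suffix."""
    return sorted(range(len(s) + 1), key=lambda j: s[j:])


def merge_suffix_arrays(strings):
    """Merge per-string suffix arrays into one generalized suffix array.

    Heap entries are (suffix_text, string_index, rank_in_own_array), so equal
    suffixes from different strings come out in string-index order.
    """
    arrays = [suffix_array(s) for s in strings]

    # Seed the heap with the smallest suffix of every string.
    heap = [(strings[i][sa[0]:], i, 0) for i, sa in enumerate(arrays)]
    heapq.heapify(heap)

    merged = []
    while heap:
        _, i, rank = heapq.heappop(heap)
        merged.append((i, arrays[i][rank]))
        # Replace the popped entry with the next suffix of the same string.
        if rank + 1 < len(arrays[i]):
            start = arrays[i][rank + 1]
            heapq.heappush(heap, (strings[i][start:], i, rank + 1))
    return merged


print(merge_suffix_arrays(["banana", "ana"]))
```

The heap never holds more than one entry per string, so its size depends on the number of strings rather than the total text length. In an external memory setting, each per-string array would be read from disk through a buffer instead of being held as a Python list.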
