Suffix rank

Marina Barsky, Jonathan Gabor, Alex Thomo, Mariano P Consens

doi:10.14778/3407790.3407861

Abstract

We investigate the problem of building a suffix array substring index for inputs significantly larger than main memory. This problem is especially important in the context of biological sequence analysis, where biological polymers can be thought of as very large contiguous strings. The objective is to index every substring of these long strings to facilitate efficient queries. We propose a new simple, scalable, and inherently parallelizable algorithm for building a suffix array for out-of-core strings. Our new algorithm, Suffix Rank, scales to arbitrarily large inputs, using disk as a memory extension. It solves the problem in just O(log n) scans over the disk-resident data. We evaluate the practical performance of our new algorithm, and show that for inputs significantly larger than the available amount of RAM, it scales better than other state-of-the-art solutions, such as eSAIS, SAscan, and eGSA.
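To make the notions of a suffix array and an O(log n)-round construction concrete, here is a minimal in-memory sketch in Python. It is not the Suffix Rank algorithm described in this paper and does not use disk at all. It is the textbook prefix-doubling technique, chosen only because it also finishes in O(log n) rounds. The function names build_suffix_array and find_occurrences are hypothetical, introduced for illustration.

```python
def build_suffix_array(text):
    """Textbook prefix doubling (an assumption for illustration, not the
    paper's out-of-core method). Each round sorts suffixes by their first
    2k characters, using the ranks from the previous round."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # round 0: rank by the first character
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Pair of ranks: first k characters, then the next k characters
            # (-1 if the suffix ends before position i + k).
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            # Suffixes with equal keys share a rank; otherwise the rank grows.
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: order is final
            break
        k *= 2
    return sa


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern, found by two binary searches
    over the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


sa = build_suffix_array("banana")
print(sa)                                     # [5, 3, 1, 0, 4, 2]
print(find_occurrences("banana", sa, "ana"))  # [1, 3]
```

Because the suffixes are kept in sorted order, every occurrence of a pattern occupies one contiguous range of the array, which is what makes the index useful for substring queries. The sketch holds the text and all ranks in memory, so it illustrates only the index itself, not the out-of-core setting the paper targets.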

Full Text