Abstract

The generalized suffix tree (GST) is a tree structure that is widely used by string-based applications such as DNA sequence pattern search, data compression and time series analysis. It can efficiently accelerate the string operations like matching approximate strings and finding the longest common substring. In the big data era, the applications processing large-scale strings (e.g., genomic sequences) are common, so it is important to design scalable approaches for constructing massive GSTs. In this paper, we introduce a distributed approach for constructing GSTs on top of Apache Spark, a general-purpose big data processing system. The framework of our approach is based on the Elastic Range Algorithm (ERA), a state-of-the-art GST construction algorithm. Different from the original ERA, our approach optimizes the structure of GST and the subtree construction, which greatly reduces the memory requirement of GST construction and storage. In addition, we propose serval optimization techniques to speed up our approach. Our experimental results show that our approach can index billion-symbol strings within 5 minutes on a 8-worker Spark cluster. Moreover, the optimization techniques get about 5x speedup on the overall indexing time.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call