Abstract
The suffix tree is a fundamental data structure for string processing. It is widely used in text processing, information retrieval, and bioinformatics. As data volumes grow rapidly, constructing the suffix tree for large-scale datasets becomes very time-consuming. A number of MPI-based parallel algorithms have been proposed to address this problem, but they are limited in fault tolerance and scalability on large-scale datasets. Meanwhile, there is growing demand for efficient algorithms that construct suffix trees for large-scale datasets on distributed data-parallel platforms such as Hadoop and Spark. In this paper, we present DGST, an efficient and scalable algorithm for generalized suffix tree construction on distributed data-parallel platforms. DGST consists of two major stages: parallel sub-tree partitioning and parallel sub-tree construction. We first design a novel data partitioning strategy for both stages in the data-parallel paradigm. Then, we propose an efficient sub-tree partitioning algorithm based on parallel frequency counting. To improve load balance and amortize disk I/O costs, we propose an efficient task allocation strategy for sub-tree construction based on Bin-Packing and Number-Partitioning. At the sub-tree construction stage, we further propose a novel data structure, LCP-Range, and a multi-way LCP-Merge sorting algorithm for parallel LCP array construction. Experimental results on Apache Spark show that DGST outperforms the state-of-the-art ERa algorithm, achieving a speedup of approximately 3 times on both the DNA and English text datasets. Furthermore, DGST achieves near-linear data and node scalability.
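To make the two ideas of frequency-based sub-tree partitioning and bin-packing task allocation concrete, the following is a minimal single-machine sketch in Python. It is not DGST's actual algorithm, whose details the abstract does not give. It assumes a simple scheme in which a prefix is extended by one character whenever more than `max_freq` suffixes start with it, and in which the resulting prefixes are grouped into tasks by first-fit decreasing bin packing. The function names, the `max_freq` and `capacity` parameters, and the `$` terminator are all hypothetical choices for illustration. The Number-Partitioning step, LCP-Range, and LCP-Merge are not covered.

```python
from collections import Counter


def partition_prefixes(texts, max_freq, terminator="$"):
    """Return {prefix: number of suffixes starting with that prefix}.

    Assumption: `terminator` does not occur inside any input text.
    A prefix is extended by one character while more than `max_freq`
    suffixes share it; prefixes ending in the terminator are never extended.
    """
    strings = [t + terminator for t in texts]
    final = {}
    pending = {""}  # prefixes that are still too frequent
    length = 0
    while pending:
        length += 1
        counts = Counter()
        # One counting pass over all strings per prefix length.
        for s in strings:
            for i in range(len(s) - length + 1):
                window = s[i:i + length]
                if window[:-1] in pending:
                    counts[window] += 1
        pending = set()
        for prefix, freq in counts.items():
            if freq > max_freq and not prefix.endswith(terminator):
                pending.add(prefix)
            else:
                final[prefix] = freq
    return final


def pack_tasks(prefix_freqs, capacity):
    """Group prefixes into tasks with first-fit decreasing bin packing.

    A prefix whose frequency exceeds `capacity` ends up alone in its own task.
    """
    bins = []  # each entry: [remaining capacity, [prefixes]]
    ordered = sorted(prefix_freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    for prefix, freq in ordered:
        for b in bins:
            if b[0] >= freq:
                b[0] -= freq
                b[1].append(prefix)
                break
        else:
            bins.append([capacity - freq, [prefix]])
    return [b[1] for b in bins]


prefixes = partition_prefixes(["banana"], max_freq=2)
tasks = pack_tasks(prefixes, capacity=3)
```

In a data-parallel setting, the counting pass would presumably run as a map and reduce-by-key over partitions of the input, and each packed task would be built into sub-trees by one worker. That mapping is an assumption here, not something the abstract states.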