Abstract
The suffix tree is a fundamental data structure in stringology, but its space usage, though linear, is an important problem in applications like Bioinformatics. We design and implement a new compressed suffix tree (CST) targeted to highly repetitive texts, such as large genomic collections of the same species. Our first contribution is to enhance the Block Tree, a data structure that captures the repetitiveness of its input sequence, to represent the topology of trees with large repeated subtrees. Our so-called Block-Tree Compressed Topology (BT-CT) data structure augments the Block Tree nodes with data that speeds up tree navigation. Our Block-Tree CST (BT-CST), in turn, uses the BT-CT to compress the topology of the suffix tree, and also replaces the sampling of the suffix array and its inverse with grammar- and/or Block-Tree-based representations of those arrays.Our experimental results show that BT-CTs reach navigation speeds comparable to compact tree representations that are insensitive to repetitiveness, while using 2–10 times less space on the topologies of the suffix trees of repetitive collections. Our BT-CST is slightly larger than previous repetition-aware suffix trees based on grammar-compressed topologies, but outperforms them in time, often by orders of magnitude.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.