Abstract
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.
Highlights
The suffix tree [1] is one of the most powerful bioinformatic tools to answer complex queries on DNA and protein sequences [2,3,4]
The index is based on approximating the longest common subsequence (LCS) of BWT_R and BWT_S, where R is the reference sequence and S is the target sequence, and on storing several structures built from that common subsequence
We find a binary sequence B_R[1, |R|], which marks the common subsequence in R, and a strictly increasing integer sequence Y, which contains the positions of the common subsequence in S
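To make the two sequences in the highlights concrete, the following is a minimal sketch, not the construction used in the paper. It assumes a naive Burrows-Wheeler transform built from sorted rotations and, in place of the approximated LCS the highlights describe, an exact quadratic-time dynamic program that is only practical for toy inputs. The function names `bwt` and `mark_common_subsequence` are hypothetical, and indices are 0-based, whereas the text uses 1-based positions.

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel, assumed to sort before every other character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def mark_common_subsequence(a: str, b: str):
    """Return (bits, positions) describing one exact LCS of a and b.

    bits[i] == 1 iff a[i] belongs to the chosen common subsequence
    (the role of B_R in the text); positions holds the matching
    0-based indices in b, strictly increasing (the role of Y).
    """
    n, m = len(a), len(b)
    # dp[i][j] = length of an LCS of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk forward through the table, recording matched positions.
    bits, positions = [0] * n, []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            bits[i] = 1
            positions.append(j)
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return bits, positions


if __name__ == "__main__":
    reference, target = "GATTACA", "GATCACA"  # made-up toy sequences
    bits, positions = mark_common_subsequence(bwt(reference), bwt(target))
    print(bits, positions)
```

Only the characters of the target's BWT that fall outside the common subsequence need to be stored explicitly in such a scheme, which is why a long common subsequence translates into a small relative structure.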
Summary
The suffix tree [1] is one of the most powerful bioinformatic tools to answer complex queries on DNA and protein sequences [2,3,4]. Rather than compressing the text directly, current compressed suffix trees (CSTs) for repetitive collections [13, 15] apply grammar-based compression to the data structures that simulate the suffix tree. Structures for direct access [31, 32] and even for pattern matching [33] have been developed on top of relative Lempel-Ziv (RLZ). Another approach to compressing a repetitive collection while supporting interesting queries is to build an automaton that accepts the sequences in the collection, and to index its state diagram as a directed acyclic graph (DAG); see, for example, [34,35,36] for recent discussions. This dependency makes these structures vulnerable to even a small change in a single sequence within an otherwise-conserved region, which could hamper their scalability.