Abstract

Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.

Highlights

  • The suffix tree [1] is one of the most powerful bioinformatic tools to answer complex queries on DNA and protein sequences [2,3,4]

  • The index is based on approximating the longest common subsequence (LCS) of BWT_R and BWT_S, where R is the reference sequence and S is the target sequence, and storing several structures based on the common subsequence

  • We find a binary sequence B_R[1, |R|], which marks the common subsequence in R, and a strictly increasing integer sequence Y, which contains the positions of the common subsequence in S
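
The two highlights above can be illustrated with a minimal sketch. It is not the authors' implementation: it builds each BWT naively from sorted rotations, uses an exact dynamic-programming LCS where the paper approximates the LCS, and assumes the marked positions refer to the two BWTs. The function names (`bwt`, `lcs_alignment`, `mark_common`) and the `$` end marker are hypothetical choices made for illustration, and the approach is only practical for very short inputs.

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumption: '$' does not occur in text and sorts before every symbol.
    """
    t = text + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def lcs_alignment(a: str, b: str) -> list[tuple[int, int]]:
    """Exact LCS by dynamic programming (quadratic time and space).

    Returns the matched index pairs (i, j) with a[i] == b[j].
    The paper approximates the LCS; exact DP is a stand-in for tiny inputs.
    """
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def mark_common(bwt_r: str, bwt_s: str) -> tuple[list[int], list[int]]:
    """Return (B_R, Y): a bit per reference position, and the matched
    target positions in strictly increasing order (0-based here)."""
    pairs = lcs_alignment(bwt_r, bwt_s)
    b_r = [0] * len(bwt_r)
    for i, _ in pairs:
        b_r[i] = 1
    y = [j for _, j in pairs]
    return b_r, y


if __name__ == "__main__":
    bwt_r = bwt("GATTACAGATTACA")   # hypothetical reference R
    bwt_s = bwt("GATTACAGATCACA")   # hypothetical target S, one substitution
    b_r, y = mark_common(bwt_r, bwt_s)
    print(bwt_r, bwt_s)
    print(b_r, y)
```

Only the symbols outside the common subsequence would then need to be stored explicitly for the target, which is where the space saving for similar sequences comes from.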


Summary

INTRODUCTION

The suffix tree [1] is one of the most powerful bioinformatic tools to answer complex queries on DNA and protein sequences [2,3,4]. Rather than compressing the text directly, the current CSTs for repetitive collections [13, 15] apply grammar-based compression on the data structures that simulate the suffix tree. Structures for direct access [31, 32] and even for pattern matching [33] have been developed on top of RLZ. Another approach to compressing a repetitive collection while supporting interesting queries is to build an automaton that accepts the sequences in the collection, and index the state diagram as a directed acyclic graph (DAG); see, for example, [34,35,36] for recent discussions. This dependency makes these structures vulnerable to even a small change to an otherwise-conserved region in even one sequence, which could hamper their scalability.

One general CST or many individual CSTs
Our contribution
BACKGROUND
Full-text indexes
Compressed text indexes
Relative Lempel–Ziv
RELATIVE FMI
Basic index
Relative select
Full functionality
Finding a BWT-invariant subsequence
RELATIVE SUFFIX TREE
Relative LCP array
EXPERIMENTS
Indexes and their sizes
Query times
Synthetic collections
Suffix tree operations
Findings
DISCUSSION
