Abstract

Finding repetitive structures in genomes is important for understanding their biological functions. Many modern compressors for genomic sequence data also rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array, along with other auxiliary data structures. Even with the best engineering efforts, their space usage is 19–50 times the text size, which makes them impractical for massive data such as the whole human genome. Our technique is based on the Burrows-Wheeler Transform and wavelet trees. For genomic sequences stored using one byte per base, the space usage of our method is less than twice the sequence size. This space efficiency does not come at the expense of speed. In fact, our method is orders of magnitude faster than the prior methods on massive texts such as the whole human genome, because the prior methods must resort to external memory. For the first time, our method enables an ordinary computer with 8GB of internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours.
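
To make the notion of a maximal repeat concrete, the following is a minimal brute-force sketch, assuming the usual definition: a substring that occurs at least twice and whose every one-character extension, to the left or to the right, occurs fewer times. It is not the method described above, which relies on the Burrows-Wheeler Transform and wavelet trees; it enumerates all substrings and is only suitable for very short inputs. The function name `maximal_repeats` is hypothetical and chosen for illustration.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return {repeat: [start positions]} for all maximal repeats in text.

    Illustrative brute force only (enumerates every substring); this is
    NOT the BWT/wavelet-tree method, just the definition spelled out.
    """
    n = len(text)

    # Record every start position of every substring (overlaps allowed).
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[text[i:j]].append(i)

    result = {}
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat at all
        # Characters adjacent to each occurrence; None marks a text boundary.
        left = {text[s - 1] if s > 0 else None for s in starts}
        right = {text[s + len(w)] if s + len(w) < n else None for s in starts}
        # If all occurrences share the same neighbour on one side, extending
        # w by that character keeps every occurrence, so w is not maximal.
        if len(left) >= 2 and len(right) >= 2:
            result[w] = starts
    return result
```

For example, in `"abcabc"` the only maximal repeat is `"abc"`: shorter repeats such as `"ab"` or `"bc"` can always be extended by one character without losing an occurrence.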
