Time- and space-efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet trees

M. Oguzhan Külekci, Jeffrey Scott Vitter, Bojian Xu

doi:10.1109/bibm.2010.5706641

Abstract

Finding repetitive structures in genomes is important for understanding their biological functions. Many modern genomic sequence compressors also rely heavily on finding repeats in the sequences. The notion of maximal repeats captures all the repeats in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Even with the best engineering efforts, their space usage is 19–50 times the text size, which prohibits their use on massive data such as the whole human genome. Our technique is based on the Burrows-Wheeler Transform and wavelet trees. For genomic sequences stored using one byte per base, the space usage of our method is less than twice the sequence size. This space efficiency does not come at the cost of speed: our method is orders of magnitude faster than the prior methods on massive texts such as the whole human genome, because the prior methods must resort to external memory. For the first time, our method enables a normal computer with 8GB of internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours.
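
To make the notion of a maximal repeat concrete, the following is a minimal brute-force sketch in Python. It is not the method of this paper: it enumerates every substring and so is only usable on toy inputs. It assumes the usual definition, under which a substring is a maximal repeat if it occurs at least twice and extending it by one character to the left or to the right always reduces its number of occurrences. The function name `maximal_repeats` is chosen here for illustration.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return {repeat: start positions} for every maximal repeat of text.

    Brute-force illustration only (quadratically many substrings);
    this is NOT the BWT/wavelet-tree algorithm of the paper.
    """
    n = len(text)

    # Record the start positions of every substring occurrence.
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[text[i:j]].append(i)

    result = {}
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat at all
        # Characters adjacent to each occurrence; None marks a text boundary.
        left = {text[s - 1] if s > 0 else None for s in starts}
        right = {text[s + len(w)] if s + len(w) < n else None for s in starts}
        # More than one distinct neighbour on a side means that no
        # one-character extension on that side keeps all occurrences.
        if len(left) > 1 and len(right) > 1:
            result[w] = starts
    return result


if __name__ == "__main__":
    print(maximal_repeats("banana"))
```

For the text `banana`, the sketch reports `a` (positions 1, 3, 5) and `ana` (positions 1, 3). The substring `an` is a repeat but not a maximal one, because both of its occurrences are followed by `a`, so extending it to `ana` loses no occurrences.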

Full Text