Abstract
The Burrows-Wheeler transform (BWT) of short-read data has unexplored potential utilities, such as for efficient and sensitive variation analysis against multiple reference genome sequences, because it does not depend on any particular reference genome sequence, unlike conventional mapping-based methods. However, since the amount of read data is generally much larger than the size of the reference sequence, computation of the BWT of reads is not easy, and this hampers development of potential applications. For the alleviation of this problem, a new method of computing the BWT of reads in parallel is proposed. The BWT, corresponding to a sorted list of suffixes of reads, is constructed incrementally by successively including longer and longer suffixes. The working data is divided into more than 10,000 "blocks" corresponding to sublists of suffixes with the same prefixes. Thousands of groups of blocks can be processed in parallel while making exclusive writes and concurrent reads into a shared memory. Reads and writes are basically sequential, and the read concurrency is limited to two. Thus, a fine-grained parallelism, referred to as prefix parallelism, is expected to work efficiently. The time complexity for processing n reads of length l is O(nl2). On actual biological DNA sequence data of about 100 Gbp with a read length of 100 bp (base pairs), a tentative implementation of the proposed method took less than an hour on a single-node computer; i.e., it was about three times faster than one of the fastest programs developed so far.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have