Abstract

Read mapping, a foundation of computational biology, has become a bottleneck task as sequencing throughput explodes. In this work, we present an efficient Burrows–Wheeler transform-based aligner for next-generation sequencing (NGS) short reads. First, we propose a difference-aware classification strategy that assigns reads to the computationally cheapest suitable search mode, together with several acceleration techniques: a seed pruning method, based on the maximum coverage interval property, that reduces redundant locating of candidate regions, and a redesigned LF calculation that supports fast queries. Second, we propose a heuristic verification step to determine the best mapping among a large number of flanking sequences. By incorporating low-distortion string embedding, most dissimilar sequences are filtered out cheaply, and the highly similar sequences that remain are well suited to the wavefront alignment algorithm. We provide a full-spectrum benchmark across different read lengths. The results show that our method is 1.3–1.4 times faster than state-of-the-art Burrows–Wheeler transform-based methods (including bowtie2, bwa-MEM, and hisat2) on 101 bp reads and 1.5–13 times faster on 750 bp to 1000 bp reads, with comparable memory usage and accuracy. However, hash-based methods (including Strobealign, Minimap2, and Accel-Align) are significantly faster, in part because Burrows–Wheeler transform-based methods compute in compressed space. The source code is available at https://github.com/Lilu-guo/Effaln.
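For readers unfamiliar with the LF calculation mentioned above, the following is a minimal sketch of the textbook FM-index backward search that Burrows–Wheeler transform-based aligners build on. It is not the redesigned LF calculation of this work, whose details the abstract does not give. The function names `build_fm_index` and `backward_search` are hypothetical, and the index is built naively (full suffix sorting and an uncompressed occurrence table), which is only practical for toy inputs.

```python
def build_fm_index(text):
    """Naively build the suffix array, BWT, C table and Occ table of text + '$'."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix
    # (text[-1] is the sentinel when the suffix starts at position 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, bwt, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of pattern in the text."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        # LF step: map the interval to the suffixes that start with ch
        # followed by the part of the pattern matched so far.
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

As a usage example, indexing the toy reference `ACGTACGTAC` with `build_fm_index` and searching for `ACG` with `backward_search` yields the start positions 0 and 4. Each pattern character costs one LF step on both interval ends, so the speed of the occurrence lookup dominates query time, which is why production aligners replace the dense table used here with sampled, compressed structures.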

