Abstract

Correct and unbiased interpretation of deep sequencing data depends on the complete mapping of all mappable reads to the reference sequence, especially for quantitative RNA-seq applications. Seed-based algorithms are generally slow but robust, while Burrows-Wheeler Transform (BWT)-based algorithms are fast but less robust. To combine both advantages, we developed FANSe2, an algorithm with an iterative mapping strategy based on the statistics of real-world sequencing error distributions, which substantially accelerates mapping without compromising accuracy. Its sensitivity and accuracy are higher than those of the BWT-based algorithms in tests using both prokaryotic and eukaryotic sequencing datasets. The gene identification results of FANSe2 are experimentally validated, whereas the previous algorithms produce false positives and false negatives. FANSe2 showed remarkably better consistency with microarray data than most other algorithms in terms of gene expression quantification. We implemented a scalable and almost maintenance-free parallelization method that can utilize the computational power of multiple office computers, a novel feature not present in any other mainstream algorithm. With three normal office computers, we demonstrated that FANSe2 mapped an RNA-seq dataset generated from an entire Illumina HiSeq 2000 flowcell (8 lanes, 608 M reads) to the masked human genome within 4.1 hours with higher sensitivity than Bowtie/Bowtie2. FANSe2 thus provides robust accuracy, full indel sensitivity, fast speed, versatile compatibility and economical computational utilization, making it a useful and practical tool for deep sequencing applications. FANSe2 is freely available at http://bioinformatics.jnu.edu.cn/software/fanse2/.

Highlights

  • Mapping millions of next-generation sequencing (NGS) reads accurately to reference sequences is the basis of all deep sequencing applications that utilize reference genomes or transcriptomes, including variant analysis, gene expression and isoform analysis

  • Longer seeds decrease the number of exact matches exponentially and greatly accelerate the mapping: a 14-nt seed yields 4^(14-8) = 4096-fold fewer exact matches than an 8-nt seed

  • Novoalign was unable to finish the task in 4 days (Figure 3B). These results showed that FANSe2, as a seed-based algorithm, approaches the speed of Burrows-Wheeler Transformation (BWT)-based algorithms while maintaining similar or higher sensitivity when handling huge datasets
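The seed-length arithmetic in the highlights can be checked with a short sketch. It assumes a random reference in which the four bases are equally likely and independent, which real genomes are not, so the numbers are only a rough guide. The function names and the example genome size are illustrative and do not come from the paper.

```python
def fold_reduction(short_seed: int, long_seed: int) -> int:
    """Factor by which random exact matches drop when the seed grows.

    Each extra nucleotide divides the chance of a random exact match by 4
    (assumption: uniform, independent bases).
    """
    return 4 ** (long_seed - short_seed)


def expected_random_hits(genome_size: int, seed_len: int) -> float:
    """Expected number of exact matches of one random seed in a random genome."""
    positions = genome_size - seed_len + 1  # possible start positions of the seed
    return positions / 4 ** seed_len


if __name__ == "__main__":
    # 8-nt versus 14-nt seeds, as in the highlight above
    print(fold_reduction(8, 14))

    # Hypothetical genome of 3 billion bases, chosen only for illustration
    genome_size = 3_000_000_000
    for k in (8, 14):
        print(k, expected_random_hits(genome_size, k))
```

Under these assumptions, fewer random seed hits means fewer candidate locations to verify per read, which is why a longer seed speeds up a seed-based mapper.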


Introduction

Mapping (aligning) millions of next-generation sequencing (NGS) reads accurately to reference sequences is the basis of all deep sequencing applications that utilize reference genomes or transcriptomes, including variant analysis, gene expression and isoform analysis. Accurately mapping to large genomes is still time-consuming [5,6]. Another type of algorithm, based on the Burrows-Wheeler Transformation (BWT), e.g. Bowtie and BWA, takes advantage of the suffix/prefix trie and reduces the computational complexity, being typically 5-20x faster than seed-based algorithms (reviewed in [2,7]). Such methods can map tens of millions of reads to the human genome within one day on desktop workstations, fueling the rapid growth of NGS applications. In real-world benchmarks, the sensitivity of earlier BWT-based algorithms like Bowtie and SOAP2 (~80%) is still to be improved; when mapping DNA sequencing reads, the sensitivity of the upgraded Bowtie is almost the same as that of the traditional seed-based algorithms while being more than 20x faster [6].
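To make the BWT idea above concrete, here is a minimal sketch of the transform and of exact-match counting by backward search. It is a textbook illustration, not the implementation used by Bowtie, BWA or FANSe2: real tools build compressed indexes with precomputed occurrence tables and also handle mismatches and indels. The function names and the toy reference are hypothetical.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for small inputs)."""
    s = text + "$"  # '$' is a terminator that sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_exact(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    first = {}
    total = 0
    for c in sorted(set(bwt_str)):
        first[c] = total
        total += bwt_str.count(c)

    def occ(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; naive scan, real indexes tabulate this
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open range of sorted rotations still matching
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    reference = "ACGTACGTTACG"  # toy reference, for illustration only
    index = bwt(reference)
    for read in ("ACG", "CGT", "GGG"):
        print(read, count_exact(index, read))
```

Each step of the backward search narrows a range of sorted rotations, so the work per read depends on the read length rather than on the number of seed hits in the genome. This is the property that gives BWT-based mappers their speed.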

