Abstract

To keep pace with the exponentially increasing throughput of Next-Generation Sequencing (NGS), most existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, by leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2 and GEM, and with the GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads of different lengths. Unlike its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Evaluation on real human genome data demonstrates that SOAP3-dp enables more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which allows it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon EC2, NIH Biowulf and Tianhe-1A.

Highlights

  • With the rapid advancement of Next-Generation Sequencing technologies, modern sequencers like Illumina HiSeq 2500 can sequence a human genome into 600 million pairs of reads of 100 bp in length in merely 27 hours

  • A simple approach to extending mismatch alignment to gapped alignment is to first identify candidate regions by exact or mismatch alignment of short substrings of the reads, and then use dynamic programming to perform a detailed alignment of each read to those regions

  • SOAP3-dp has been successfully deployed on Amazon EC2, NIH Biowulf and the Tianhe-1A computing cloud
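
The two-step approach in the second highlight (seed lookup, then dynamic programming on the candidate regions) can be illustrated with a minimal sketch. This is not SOAP3-dp's implementation: the function names, the hash-based k-mer index, the non-overlapping exact seeds, the padding of 5 bases and the linear-gap scores (match +1, mismatch -2, gap -3) are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


def build_index(ref, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def candidate_starts(read, index, k):
    """Step 1: exact-match non-overlapping seeds of the read against the index.

    Each hit implies a putative start of the whole read on the reference.
    """
    starts = set()
    for off in range(0, len(read) - k + 1, k):
        for pos in index.get(read[off:off + k], ()):
            starts.add(pos - off)
    return sorted(starts)


def fit_align(read, region, match=1, mismatch=-2, gap=-3):
    """Step 2: dynamic programming that aligns the whole read to any
    substring of the region (unaligned region ends are free).

    Scores are illustrative assumptions. Returns (score, end), where end is
    the exclusive end of the alignment within the region.
    """
    prev = [0] * (len(region) + 1)  # skipping a region prefix costs nothing
    for i in range(1, len(read) + 1):
        cur = [prev[0] + gap] + [0] * len(region)
        for j in range(1, len(region) + 1):
            s = match if read[i - 1] == region[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,  # match or mismatch
                         prev[j] + gap,    # read base not in the reference
                         cur[j - 1] + gap) # reference base missing from read
        prev = cur
    end = max(range(len(region) + 1), key=lambda j: prev[j])
    return prev[end], end


def seed_and_extend(read, ref, index, k, pad=5):
    """Return (score, exclusive end on ref) of the best gapped alignment,
    or None when no seed hits the reference."""
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(ref), start + len(read) + pad)
        score, end = fit_align(read, ref[lo:hi])
        if best is None or score > best[0]:
            best = (score, lo + end)
    return best


ref = "TTGACCGTAGGCTAACGGATTCCAGTACGATCGGA"
index = build_index(ref, 4)
exact_hit = seed_and_extend("GTAGGCTAACGG", ref, index, 4)    # exact substring
gapped_hit = seed_and_extend("GTAGGCAACGGAT", ref, index, 4)  # one base deleted
```

In the second call the middle seed no longer matches because of the deleted base, but the remaining seeds still locate the region, and the dynamic-programming step recovers the gap. Only the candidate windows are aligned in detail, so the quadratic-cost step runs on a small fraction of the reference.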



Introduction

With the rapid advancement of Next-Generation Sequencing technologies, modern sequencers such as the Illumina HiSeq 2500 can sequence a human genome into 600 million pairs of 100 bp reads (120 gigabases in total) in merely 27 hours. By the end of 2013, sequencing a human genome is projected to cost less than $1,000. Bioinformatics research using sequencing data often starts with aligning the data to a reference genome, followed by various downstream analyses. Alignment is computationally intensive; the 1000 Genomes pilot paper [1], published in 2010, reported that a 1192-processor cluster was used to align the reads with MAQ [2]. Computing resources of this kind are not available to most laboratories, let alone clinical settings. Ultra-fast alignment tools that do not rely on extensive computing resources are needed.
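
As a quick check of the throughput figure quoted above, the total yield follows directly from the read count and read length. The coverage estimate additionally assumes a human genome size of roughly 3.1 gigabases, a figure that is not stated in the text:

```latex
\begin{align*}
\text{yield} &= 600 \times 10^{6}\ \text{pairs} \times 2\ \tfrac{\text{reads}}{\text{pair}} \times 100\ \tfrac{\text{bp}}{\text{read}}
              = 1.2 \times 10^{11}\ \text{bp} = 120\ \text{Gb}, \\
\text{coverage} &\approx \frac{120\ \text{Gb}}{3.1\ \text{Gb}} \approx 39\times .
\end{align*}
```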

