Abstract

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of GPU and an intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA’s speed is rooted at its parallel algorithms to effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the App development of downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.

Highlights

  • With the advance in generation sequencing (NGS) technologies, whole exome sequencing (WES) and whole genome sequencing (WGS) have become compelling tools for clinical diagnosis and genetic risk prediction

  • We have developed BALSA, a lightweight total solution for next generation sequencing (NGS) secondary analysis that takes full advantage of the computational power available on a computing node equipped with a multi-core CPU and a Graphics Processing Unit (GPU) device

  • From raw reads to variants, BALSA finished in 5.49 h, whereas ISAAC finished in 11.92 h, and GATK coupled with BWAaln, BWAmem and SOAP3-dp in 88.00, 48.68 and 46.27 h, respectively

Read more

Summary

Introduction

With the advance in generation sequencing (NGS) technologies, whole exome sequencing (WES) and whole genome sequencing (WGS) have become compelling tools for clinical diagnosis and genetic risk prediction. For a typical WGS sample, SOAP3-dp can shorten the alignment time to 2–4 h, yet the follow-up analyses, which include base-score recalibration, de-duplication, realignment and variant calling procedures, still require tens of hours using a single computing node.

Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call