Abstract

Motivation: Next-generation sequencing (NGS) technologies have become much more efficient, allowing whole human genomes to be sequenced faster and more cheaply than ever before. However, processing the raw sequence reads associated with NGS technologies requires care and sophistication in order to draw compelling inferences about phenotypic consequences of variation in human genomes. It has been shown that different approaches to variant calling from NGS data can lead to different conclusions. Ensuring appropriate accuracy and quality in variant calling can come at a computational cost.

Results: We describe our experience implementing and evaluating a group-based approach to calling variants on large numbers of whole human genomes. We explore the influence of many factors that may impact the accuracy and efficiency of group-based variant calling, including group size, the biogeographical backgrounds of the individuals who have been sequenced, and the computing environment used. We make efficient use of the Gordon supercomputer cluster at the San Diego Supercomputer Center by incorporating job-packing and parallelization considerations into our workflow while calling variants on 437 whole human genomes generated as part of a large association study.

Conclusions: We ultimately find that our workflow resulted in high-quality variant calls in a computationally efficient manner. We argue that studies like ours should motivate further investigations combining hardware-oriented advances in computing systems with algorithmic developments to tackle emerging ‘big data’ problems in biomedical research brought on by the expansion of NGS technologies.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0736-4) contains supplementary material, which is available to authorized users.
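To make the job-packing idea concrete, the sketch below shows one generic way to pack several independent per-sample tasks onto a single compute node by running them through a bounded worker pool. This is a minimal, hypothetical illustration rather than the authors' actual Gordon submission scripts; the node width, threads per task, sample names, and placeholder commands are all assumptions.

```python
# Minimal sketch of "job packing": run several independent per-sample tasks
# concurrently on one node instead of submitting a separate scheduler job for
# each one. The commands are placeholders (sleep); in a real workflow each
# entry would be a per-sample variant-calling command line.
import subprocess
from concurrent.futures import ThreadPoolExecutor

CORES_PER_NODE = 16                          # assumed node width
THREADS_PER_TASK = 4                         # assumed threads used by each task
SLOTS = CORES_PER_NODE // THREADS_PER_TASK   # tasks packed onto one node at a time

# One placeholder task per sample packed onto this node.
samples = ["sample_%03d" % i for i in range(1, 9)]
commands = [["sleep", "1"] for _ in samples]  # substitute real commands here

def run(task):
    name, cmd = task
    # Each task runs in its own subprocess; in practice stdout/stderr would be
    # redirected to per-sample log files.
    return name, subprocess.call(cmd)

with ThreadPoolExecutor(max_workers=SLOTS) as pool:
    for name, status in pool.map(run, zip(samples, commands)):
        print(name, "finished with exit status", status)
```

On a typical batch-scheduled cluster, a script along these lines would be wrapped in a single job requesting a whole node, which keeps the scheduler queue short while keeping all of the node's cores busy.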

Highlights

  • Recent advances in next-generation DNA sequencing (NGS) technologies have increased the efficiency, reliability, and cost-effectiveness of sequencing, leading to ever-expanding amounts of high-quality data [1]

  • We explore the influence of many factors that may impact the accuracy and efficiency of group-based variant calling, including group size, the biogeographical backgrounds of the individuals who have been sequenced, and the computing environment used

  • We argue that studies like ours should motivate further investigations combining hardware-oriented advances in computing systems with algorithmic developments to tackle emerging ‘big data’ problems in biomedical research brought on by the expansion of NGS technologies


Summary

Introduction

Recent advances in next-generation DNA sequencing (NGS) technologies have increased the efficiency, reliability, and cost-effectiveness of sequencing, leading to ever-expanding amounts of high-quality data [1]. We describe an efficient approach for obtaining high-quality variant calls and genotype assignments from a large set of whole human genomes sequenced on an Illumina HiSeq 2500 platform. Group calling leverages reads obtained from more than a single individual’s genome in order to make more confident claims about the presence of a variant allele in any single genome. This strategy can help mitigate false-positive variant assignments, but it has drawbacks, including the need to group individuals with similar genetic backgrounds, since allele frequencies vary and many variants are population specific on a global scale [3]. We showcase our strategy on 437 whole human genomes sequenced to ~35× coverage and describe our implementation and results in detail.
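As an illustration of why pooling samples helps, the toy sketch below computes diploid genotype posteriors at a single biallelic site under a simple Bayesian model, comparing a flat single-sample prior with a Hardy–Weinberg prior whose allele frequency is estimated from the whole group. The read counts, error rate, and sample data are invented for illustration, and the model is a simplification of what production joint callers (e.g., GATK-style pipelines) actually implement; it is not the method described in this paper.

```python
# Toy illustration of group (joint) genotype calling at one biallelic site.
# Each sample contributes (ref_reads, alt_reads). Genotype likelihoods use a
# binomial model with a per-read error rate; the pooled group data provide an
# allele-frequency estimate that becomes a Hardy-Weinberg prior for each
# individual. All counts and parameters here are invented for illustration.
from math import comb

ERR = 0.01  # assumed per-read error rate

def genotype_likelihoods(ref, alt):
    """P(read counts | genotype) for genotypes RR, RA, AA."""
    n = ref + alt
    p_alt = {"RR": ERR, "RA": 0.5, "AA": 1.0 - ERR}  # chance a read shows alt
    return {g: comb(n, alt) * p ** alt * (1 - p) ** ref for g, p in p_alt.items()}

def posteriors(likes, prior):
    unnorm = {g: likes[g] * prior[g] for g in likes}
    z = sum(unnorm.values())
    return {g: round(v / z, 3) for g, v in unnorm.items()}

# Read counts per sample; the last sample is shallow and shows one alt read.
group = [(20, 0), (18, 0), (19, 1), (22, 0), (17, 0), (21, 0), (3, 1)]

# Crude group-wide allele-frequency estimate from the pooled reads.
ref_total = sum(r for r, a in group)
alt_total = sum(a for r, a in group)
f = alt_total / (ref_total + alt_total)

flat_prior = {"RR": 1 / 3, "RA": 1 / 3, "AA": 1 / 3}
group_prior = {"RR": (1 - f) ** 2, "RA": 2 * f * (1 - f), "AA": f ** 2}

likes = genotype_likelihoods(*group[-1])  # the shallow, ambiguous sample
print("single-sample (flat) prior:", posteriors(likes, flat_prior))
print("group-informed HW prior:   ", posteriors(likes, group_prior))
```

With the flat prior, the lone alternate read in the shallow sample pushes the call toward a heterozygote, whereas the group-informed prior, dominated by reference-only samples, downgrades it. This is the sense in which joint calling suppresses false positives, and also why it is sensitive to mismatched population backgrounds when the prior is estimated from a genetically heterogeneous group.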
