Abstract

Background

Recently, many new deep learning–based variant-calling methods such as DeepVariant have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. There is therefore a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark only to distribute and schedule data among loosely coupled applications, or using disk-based storage for the output of intermediate applications, does not exploit the full benefit of Spark's in-memory processing. To exploit this benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between workflow stages, combining the ease of programmability of Python with the high efficiency of Arrow's columnar in-memory data transformations.

Results

We present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.

Conclusions

We show the feasibility and easy scalability of our approach for achieving high performance and efficient resource utilization in variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
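
As a rough illustration of the in-memory, Arrow-based hand-off between workflow stages described above, the sketch below uses PySpark's mapInArrow (available in Spark 3.3 and later) to stream aligned-read batches directly into a downstream Python stage without materializing intermediate files. The column names and the placeholder downstream function are illustrative assumptions, not the implementation used in the paper.

    # Minimal sketch (not the paper's code): keep intermediate data in memory
    # as Arrow record batches instead of writing BAM files between stages.
    # Assumes Spark >= 3.3 (mapInArrow) and illustrative column names.
    import pyarrow as pa
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("arrow-stage-transfer").getOrCreate()

    # Hypothetical table of aligned reads produced by an earlier stage.
    reads = spark.createDataFrame(
        [("chr20", 59990, "read1", 60, "ACGT"),
         ("chr20", 60000, "read2", 60, "TTGA")],
        ["chrom", "pos", "name", "mapq", "seq"],
    )

    def downstream_stage(batches):
        # Each element is a pyarrow.RecordBatch; the data stay columnar and
        # in memory while moving from Spark to downstream Python code.
        for batch in batches:
            table = pa.Table.from_batches([batch])
            # ... hand the Arrow table to a filter or variant caller here ...
            for out in table.to_batches():
                yield out

    processed = reads.mapInArrow(downstream_stage, reads.schema)
    processed.show()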

Highlights

  • Immense improvements in next-generation sequencing (NGS) technologies enable large amounts of high-throughput, cost-effective raw genome data to be produced

  • We compare our results with other state-of-the-art frameworks for both the pre-processing and variant-calling stages, followed by a detailed analysis of scalability, performance, and speed-ups relative to these frameworks

  • We compared DeepVariant and Octopus on Chr20-HG003 Illumina whole-genome sequencing (WGS) reads publicly available from the PrecisionFDA Truth v2 Challenge and found that the accuracy of Octopus was almost identical to that of DeepVariant for both single-nucleotide polymorphism (SNP) and indel variants



Introduction

Immense improvements in next-generation sequencing (NGS) technologies enable large amounts of high-throughput, cost-effective raw genome data to be produced. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Pre-processing of NGS data requires a number of steps: (i) alignment of raw FASTQ data against a reference genome, (ii) chromosome-based coordinate sorting, and (iii) PCR duplicate removal (optional, required only if the data are not PCR-free, or in some datasets for better accuracy). These steps are common to almost every variant-calling workflow. SAMtools [13], Picard [14], Sambamba [15], and samblaster [16] are some of the most widely used tools for indexing, sorting, and duplicate removal in SAM/BAM/CRAM files.
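
As a hedged sketch of how steps (ii) and (iii) might be expressed with Spark built-in functions (this is not the authors' implementation; the column names and the simplified duplicate criterion of identical chromosome, position, and strand are assumptions for illustration):

    # Rough sketch with assumed column names and a simplified duplicate rule:
    # (ii) coordinate sorting and (iii) duplicate marking using Spark built-ins.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

    # Hypothetical aligned reads from step (i); in practice these would be
    # parsed from the aligner's output into a Spark DataFrame.
    aligned = spark.createDataFrame(
        [("chr20", 100, "+", "r1", 60),
         ("chr20", 100, "+", "r2", 37),
         ("chr20", 250, "-", "r3", 60)],
        ["chrom", "pos", "strand", "name", "mapq"],
    )

    # (ii) Chromosome-based coordinate sort.
    coord_sorted = aligned.repartition("chrom").sortWithinPartitions("chrom", "pos")

    # (iii) Mark duplicates: reads sharing chromosome, position, and strand are
    # treated as duplicates; the highest-mapping-quality read stays primary.
    dup_window = Window.partitionBy("chrom", "pos", "strand").orderBy(F.desc("mapq"))
    marked = coord_sorted.withColumn("is_duplicate", F.row_number().over(dup_window) > 1)
    marked.show()

In a production workflow the duplicate key would typically be derived from the unclipped 5' position, orientation, library, and mate information, as tools such as Picard MarkDuplicates do; the single-column key above is only meant to show the window-function pattern.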

