SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark.

Zaid Al-Ars, Hamid Mushtaq, Saiyi Wang

doi:10.3390/genes11010053

Abstract

The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and the associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling in RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA reduces the overall computation time of the pipeline on the same single node by about 4×, to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA further reduces this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.
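The run-time figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only illustrative: the function and constant names are hypothetical, the 5 h baseline is a lower bound (the abstract says "more than 5 h"), and the cluster estimate assumes that the 7.7× factor is measured against the 1.3 h single-node run, which the abstract does not state explicitly.

```python
def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if baseline_hours <= 0 or improved_hours <= 0:
        raise ValueError("run times must be positive")
    return baseline_hours / improved_hours


# Figures quoted in the abstract. The original run is reported only as
# "more than 5 h", so 5.0 is a lower bound, not a measured value.
ORIGINAL_SINGLE_NODE_H = 5.0
SPARKRA_SINGLE_NODE_H = 1.3
CLUSTER_SPEEDUP = 7.7  # 16-node cluster relative to a single node

single_node_speedup = speedup(ORIGINAL_SINGLE_NODE_H, SPARKRA_SINGLE_NODE_H)

# Assumption (not stated in the abstract): the 7.7x factor is relative to
# the 1.3 h single-node SparkRA run on the 20-core machine.
estimated_cluster_h = SPARKRA_SINGLE_NODE_H / CLUSTER_SPEEDUP

print(f"single-node speedup: at least {single_node_speedup:.2f}x")
print(f"estimated cluster run time: {estimated_cluster_h * 60:.0f} min")
```

Under these assumptions the single-node speedup comes out at roughly 3.85×, consistent with the "about 4×" in the abstract, and the cluster run would take on the order of 10 minutes. If the 7.7× figure is instead relative to one eight-core cluster node, the second estimate does not apply.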

Highlights

With the development of next-generation sequencing (NGS) technologies, both DNA-seq and RNA-seq data are becoming increasingly accessible.
We present the scaling-up results on a single compute node, followed by the results on multiple nodes in a cluster.
Compared to Halvade, SparkRA is about 1.32× faster in total, 1.4× faster for Parts 2 and 3, and 1.24× faster for Part 1. These results indicate that SparkRA makes good use of the parallel capabilities of Spark and allows for easy scalability on a single node.

Summary

Introduction

With the development of next-generation sequencing (NGS) technologies, both DNA-seq and RNA-seq data are becoming increasingly accessible. Identifying variants from DNA-seq data has attracted much attention from the research community, which has resulted in the development of a number of tools and computational pipelines to address the problem. One of the most widely used DNA-seq pipelines is GATK best practices [1], which recommends a sequence of tools to process DNA-seq data from raw reads all the way to variant calls. To improve the performance of DNA pipelines and get results faster, a number of solutions have been proposed, either by scaling the pipelines on multiple compute nodes in a cluster or by improving the performance on a single node. In terms of cluster solutions, Churchill [2] adapts

Journal: Genes	Publication Date: Jan 3, 2020
Citations: 7	License type: CC BY 4.0
