Abstract
The rapid proliferation of low-cost RNA-seq data has led to growing interest in RNA analysis techniques for applications ranging from identifying genotype–phenotype relationships to validating discoveries made by other analyses. However, many practical applications in this field are limited by the available computational resources and the long computing time the analysis requires. GATK provides a popular best practices pipeline designed specifically for variant calling on RNA-seq data. Some tools in this pipeline are not optimized to scale efficiently across multiple processors or compute nodes, which limits their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline that efficiently scales the GATK RNA-seq variant calling pipeline across multiple cores in one node or across a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline takes more than 5 h to process a 32 GB dataset. In contrast, SparkRA reduces the overall computation time on the same node by about 4×, down to 1.3 h. On a cluster of 16 nodes (each with eight single-threaded cores), SparkRA further reduces the computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy.
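The speedup figures quoted above follow from simple ratios, which the minimal sketch below works through. It makes several assumptions: 5 h is taken as the baseline even though the text says "more than 5 h", so the single-node ratio is a lower bound; the 7.7× cluster figure is read as relative to one cluster node; and the function names are illustrative only, not part of SparkRA.

```python
def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Ratio of baseline runtime to improved runtime."""
    return baseline_hours / improved_hours


def parallel_efficiency(speedup_factor: float, workers: int) -> float:
    """Fraction of ideal linear scaling achieved across `workers` units."""
    return speedup_factor / workers


# Single node: original pipeline (assumed 5 h, a lower bound) vs. SparkRA (1.3 h).
single_node = speedup(5.0, 1.3)

# Cluster: reported 7.7x on 16 nodes, assumed relative to one cluster node.
cluster_eff = parallel_efficiency(7.7, 16)

print(f"single-node speedup: {single_node:.2f}x")   # roughly 3.85x, i.e. "about 4x"
print(f"cluster efficiency:  {cluster_eff:.2f}")    # roughly 0.48 of linear scaling
```

Under these assumptions, the 16-node run reaches a little under half of ideal linear scaling.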
Highlights
With the development of next-generation sequencing (NGS) technologies, both DNA-seq and RNA-seq data are becoming increasingly accessible
We present the scaling up results on a single compute node, followed by the results on multiple nodes in a cluster
Compared to Halvade, SparkRA is about 1.32× faster in total, 1.4× faster for Parts 2 and 3 and 1.24× faster for Part 1. These results indicate that SparkRA makes good use of the parallel capabilities of Spark and allows for easy scalability on a single node
Summary
With the development of next-generation sequencing (NGS) technologies, both DNA-seq and RNA-seq data are becoming increasingly accessible. Identifying variants from DNA-seq data has attracted much attention from the research community, resulting in a number of tools and computational pipelines that address the problem. One of the most widely used DNA-seq pipelines is GATK best practices [1], which recommends a sequence of tools to process DNA-seq data from raw reads all the way to variant calls. To improve the performance of DNA pipelines and obtain results faster, a number of solutions have been proposed, either by scaling the pipelines across multiple compute nodes in a cluster or by improving performance on a single node. In terms of cluster solutions, Churchill [2] adapts