DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

Michael D Linderman,Frank A Nothaft,Forrest Wallace,Davin Chia

doi:10.1186/s12859-019-3108-7

Open Access

https://doi.org/10.1186/s12859-019-3108-7

Abstract

Background: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.

Results: DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster.

Conclusions: We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.
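The reported speedups can be turned back into approximate XHMM baseline runtimes. The short sketch below is illustrative only: it assumes speedup is defined as XHMM runtime divided by DECA runtime on the same platform, and the function name `implied_baseline_minutes` is hypothetical. The resulting baselines are derived from the rounded figures in the abstract, not values stated in this text.

```python
# DECA runtimes (minutes) and speedups vs. XHMM as reported in the abstract.
reported = {
    "16-core workstation": (9.3, 35.3),
    "Spark cluster, 10 executor cores": (12.7, 18.8),
}


def implied_baseline_minutes(deca_minutes, speedup):
    """Assumes speedup = baseline_time / deca_time, so baseline = deca_time * speedup."""
    return deca_minutes * speedup


for platform, (minutes, speedup) in reported.items():
    baseline = implied_baseline_minutes(minutes, speedup)
    print(f"{platform}: ~{baseline:.0f} min (~{baseline / 60:.1f} h) implied for XHMM")
```

Under that assumption, the implied XHMM runtimes are roughly 328 min on the workstation and roughly 239 min on the cluster, which is consistent with the "hours to days" characterization for a cohort of this size.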

Highlights

XHMM [1] is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing (WES) data, but can require hours to days of computation to complete for larger cohorts
Numerous algorithms have been developed for WES CNV discovery, including the recent CLAMMS [4] algorithm, which was designed for large cohorts

Summary

Introduction

XHMM [1] is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing (WES) data, but can require hours to days of computation to complete for larger cohorts. Substantial execution time and memory footprints require users to obtain correspondingly substantial computational resources and limit opportunities to explore the configuration parameter space to obtain the best possible results. A more scalable implementation would reduce the need for specialized computational resources and make that exploration practical. Our focus was to: 1) improve the performance of this widely used tool for its many users; and 2) report on the process of implementing a complex genome analysis for on-premises and cloud-based distributed computing environments using the ADAM framework and Apache Spark.

Results

Discussion

Conclusion

Journal: BMC Bioinformatics
Publication Date: Oct 11, 2019
Citations: 3
License type: open-access
