Abstract

Next generation sequencing (NGS) has been a great success and is now a standard research method in the life sciences. With this technology, dozens of whole genomes or hundreds of exomes can be sequenced in a rather short time, producing huge amounts of data. Complex bioinformatics analyses are required to turn these data into scientific findings. To run these analyses quickly, automated workflows implemented on high performance computers are the state of the art. While providing sufficient compute power and storage to meet the NGS data challenge, high performance computing (HPC) systems require special care when utilized for high throughput processing. This is especially true if the HPC system is shared by different users. Here, stability, robustness and maintainability are as important for automated workflows as speed and throughput. To achieve all of these aims, dedicated solutions have to be developed. In this paper, we present the tricks and twists that we utilized in the implementation of our exome data processing workflow. It may serve as a guideline for other high throughput data analysis projects using a similar infrastructure. The code implementing our solutions is provided in the supporting information files.

Highlights

  • Next generation sequencing (NGS) has been a great success and is increasingly used as a research method in the life sciences

  • We have developed a fast automated workflow for NGS data analysis by leveraging the power of high performance computing (HPC)

  • We present the dedicated solutions we found during its development to run it in high throughput fashion in a multiuser HPC environment



Introduction

Next generation sequencing (NGS) has been a great success and is increasingly used as a research method in the life sciences. The initial preparation of the fastq input files, including quality check and adapter trimming, is submitted by the masterscript directly, because all other jobs depend on it. After this job has completed, the alignment jobs for SNP/indel calling and SV detection are run in parallel via child processes, because they are independent of one another. This way, the masterscript organizes the workflow as a series of job submissions (either directly or via child processes) and checkpoints. Once all started computations have finished, the masterscript checks the checkpoint files for errors or failures. When the checkpoint files indicate an error or when some of them are missing, the masterscript aborts the workflow and reports an error; otherwise it moves on to the next round of job submissions.
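The authors' actual scripts are provided in the supporting information files; the following minimal Python sketch only illustrates the general pattern described above. The scheduler invocation (qsub -sync y, which on Grid Engine makes a submission block until the job finishes), the job script names, and the checkpoint file layout are assumptions made for illustration, not the paper's implementation.

    #!/usr/bin/env python3
    # Sketch of a masterscript that drives an HPC workflow as a series of
    # job submissions and checkpoints. Scheduler command, script names,
    # and checkpoint paths are hypothetical placeholders.

    import subprocess
    import sys
    from pathlib import Path

    CHECKPOINT_DIR = Path("checkpoints")  # each job writes OK/ERROR here


    def submit_and_wait(job_script: str) -> None:
        """Submit one job directly and block until it has finished.
        On Grid Engine, '-sync y' makes qsub wait for job completion."""
        subprocess.run(["qsub", "-sync", "y", job_script], check=True)


    def check_checkpoints(expected: list[str]) -> None:
        """Abort the workflow if a checkpoint is missing or reports an error."""
        for name in expected:
            cp = CHECKPOINT_DIR / name
            if not cp.exists() or cp.read_text().strip() != "OK":
                sys.exit(f"workflow aborted: checkpoint {name} missing or failed")


    # 1. Prepare the fastq input (quality check, adapter trimming).
    #    Submitted directly, because all other jobs depend on it.
    submit_and_wait("prepare_fastq.sh")
    check_checkpoints(["prepare_fastq"])

    # 2. The alignment jobs for SNP/indel calling and for SV detection are
    #    independent of one another, so they run in parallel as child processes.
    children = [
        subprocess.Popen(["qsub", "-sync", "y", script])
        for script in ("align_snv_indel.sh", "align_sv.sh")
    ]
    for child in children:
        child.wait()

    # 3. Checkpoint: continue only if every started computation succeeded.
    check_checkpoints(["align_snv_indel", "align_sv"])

    # ... further rounds of job submissions and checkpoints would follow.

Keeping all dependency logic in one masterscript and recording each job's outcome in a checkpoint file makes a failed run easy to diagnose and abort cleanly, which serves the stability and maintainability goals stated in the abstract.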


