Abstract

Next generation sequencing (NGS) has been a great success and is now a standard research method in the life sciences. With this technology, dozens of whole genomes or hundreds of exomes can be sequenced in a rather short time, producing huge amounts of data. Complex bioinformatics analyses are required to turn these data into scientific findings. To run these analyses quickly, automated workflows implemented on high performance computers are the state of the art. While providing sufficient compute power and storage to meet the NGS data challenge, high performance computing (HPC) systems require special care when utilized for high throughput processing. This is especially true if the HPC system is shared by different users. Here, stability, robustness and maintainability are as important for automated workflows as speed and throughput. To achieve all of these aims, dedicated solutions have to be developed. In this paper, we present the tricks and twists that we utilized in the implementation of our exome data processing workflow. It may serve as a guideline for other high throughput data analysis projects using a similar infrastructure. The code implementing our solutions is provided in the supporting information files.

Highlights

  • Next generation sequencing (NGS) has been a great success and is increasingly used as a research method in the life sciences

  • We have developed a fast automated workflow for NGS data analysis by leveraging the power of high performance computing (HPC)

  • We present the dedicated solutions we found during its development to run it in high throughput fashion in a multiuser HPC environment



Introduction

Next generation sequencing (NGS) has been a great success and is increasingly used as a research method in the life sciences. The initial preparation of the fastq input files, including quality check and adapter trimming, is submitted by the masterscript directly, because all other jobs depend on it. After this job has completed, the alignment jobs for SNP/indel calling and SV detection are run in parallel via child processes, because they are independent of one another. This way, the masterscript organizes the workflow as a series of job submissions (either directly or via child processes) and checkpoints. Once all started computations have finished, the masterscript checks the checkpoint files for errors or failures. When the checkpoint files indicate an error or when some of them are missing, the masterscript aborts the workflow and reports an error; otherwise it moves on to the next round of job submissions.
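The authors' actual scripts are provided in the supporting information files; the following minimal Python sketch only illustrates the general pattern described above. The scheduler invocation (qsub -sync y, which on Grid Engine makes a submission block until the job finishes), the job script names, and the checkpoint file layout are assumptions made for illustration, not the paper's implementation.

    #!/usr/bin/env python3
    # Sketch of a masterscript that drives an HPC workflow as a series of
    # job submissions and checkpoints. Scheduler command, script names,
    # and checkpoint paths are hypothetical placeholders.

    import subprocess
    import sys
    from pathlib import Path

    CHECKPOINT_DIR = Path("checkpoints")  # each job writes OK/ERROR here


    def submit_and_wait(job_script: str) -> None:
        """Submit one job directly and block until it has finished.
        On Grid Engine, '-sync y' makes qsub wait for job completion."""
        subprocess.run(["qsub", "-sync", "y", job_script], check=True)


    def check_checkpoints(expected: list[str]) -> None:
        """Abort the workflow if a checkpoint is missing or reports an error."""
        for name in expected:
            cp = CHECKPOINT_DIR / name
            if not cp.exists() or cp.read_text().strip() != "OK":
                sys.exit(f"workflow aborted: checkpoint {name} missing or failed")


    # 1. Prepare the fastq input (quality check, adapter trimming).
    #    Submitted directly, because all other jobs depend on it.
    submit_and_wait("prepare_fastq.sh")
    check_checkpoints(["prepare_fastq"])

    # 2. The alignment jobs for SNP/indel calling and for SV detection are
    #    independent of one another, so they run in parallel as child processes.
    children = [
        subprocess.Popen(["qsub", "-sync", "y", script])
        for script in ("align_snv_indel.sh", "align_sv.sh")
    ]
    for child in children:
        child.wait()

    # 3. Checkpoint: continue only if every started computation succeeded.
    check_checkpoints(["align_snv_indel", "align_sv"])

    # ... further rounds of job submissions and checkpoints would follow.

Keeping all dependency logic in one masterscript and recording each job's outcome in a checkpoint file makes a failed run easy to diagnose and abort cleanly, which serves the stability and maintainability goals stated in the abstract.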


