Scalable Workflows and Reproducible Data Analysis for Genomics.

Francesco Strozzi, Joep De Ligt, Pjotr Prins, Ricardo Wurmus, George Githinji, Steffen Möller, Dominique Belhachemi, Roel Janssen, Geert Smant, Michael R Crusoe, Paolo Di Tommaso

doi:10.1007/978-1-4939-9074-0_24

Abstract

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large data volumes, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer. In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow, each of which can run in parallel. We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions cover the majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also come closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.

Highlights

In this chapter, we show how to create a bioinformatics pipeline using four workflow systems: Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow.
We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity.

Summary

Overview

We show how to create a bioinformatics pipeline using four workflow systems: CWL, GWL, Snakemake, and Nextflow. In the case of evolutionary genomics, lengthy computations are often multidimensional. Examples of such expensive calculations are Bayesian analyses, inference based on hidden Markov models, and maximum likelihood analysis, implemented, for example, by MrBayes [1], HMMER [2], and phylogenetic analysis by maximum likelihood (PAML) [3]. One example of legacy software requiring lengthy computation is Ziheng Yang’s CodeML implementation of PAML [3]. To test hundreds of alignments, e.g., different gene families, PAML is invoked hundreds of times in a serial fashion, possibly taking days on a single computer. We use PAML as an example, but the idea holds for any software program that is CPU bound, i.e., one whose execution time is determined by CPU speed. Many legacy programs are CPU bound and do not scale by themselves.
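Because each alignment is processed independently, the serial loop described above can in principle be replaced by concurrent invocations. The following is a minimal Python sketch of that idea, not the chapter's own method: the function names, the file names, and the stand-in command are all hypothetical, and the stand-in merely echoes its argument where a real run would invoke the actual CPU-bound program once per alignment.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(alignment: Path) -> list[str]:
    """Return the command line for one alignment.

    Stand-in (assumption): a tiny Python one-liner that echoes the file name.
    A real run would substitute the invocation of the CPU-bound tool here.
    """
    return [sys.executable, "-c",
            "import sys; print('done', sys.argv[1])", str(alignment)]


def run_one(alignment: Path) -> tuple[str, int, str]:
    """Run the external program on a single alignment and capture its output."""
    result = subprocess.run(build_command(alignment),
                            capture_output=True, text=True)
    return alignment.name, result.returncode, result.stdout.strip()


def run_all(alignments: list[Path], max_workers: int = 4) -> list[tuple[str, int, str]]:
    """Run independent jobs concurrently; results come back in input order."""
    # Threads suffice: each thread only waits on its child process,
    # and the child processes do the CPU-bound work on separate cores.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, alignments))


if __name__ == "__main__":
    # Hypothetical input names, one per gene family.
    files = [Path(f"family_{i}.phy") for i in range(8)]
    for name, code, output in run_all(files):
        print(name, code, output)
```

A sketch like this only spreads jobs across the cores of one machine. It does not record software versions or move work to a cluster or the cloud, which is where the workflow engines and packaging systems named in this chapter come in.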

Parallelization in the Cloud

Parallelization of Applications Using a

GPU Programming

Package Software in a Container

Create a Docker Image with Debian

GNU Guix

Create a Docker Image with Bioconda

A Note on Software Licenses

Example Workflow

Common Workflow Language

Guix Workflow Language

Snakemake

Discussion

Questions


Journal: Methods in molecular biology (Clifton, N.J.)	Publication Date: Jan 1, 2019
Citations: 25	License type: CC BY 4.0
