Abstract
Motivation: Bioinformatic pipelines often use large numbers of components, and deploying them incurs a substantial configuration and maintenance burden that remains a significant barrier to reproducible research. Our aim is to define a new paradigm and best practices for developing, distributing and running pipelines encapsulated in Docker containers (lightweight virtualization), with a focus on next-generation sequencing (NGS) workflows. This approach provides several advantages, namely efficiency, portability, versioning and reproducibility. Using the NGSeasy pipeline, a user can quickly deploy any pipeline version in any environment (e.g. operating systems, workstations, clusters, clouds). While this might also be achieved with a virtual machine (VM), VMs lack portability, have substantial overhead (disk, CPU, RAM), and require allocated resources to be provisioned statically; Docker, to a large extent, solves these issues.

Results: We demonstrate best practices for packaging and executing a multicomponent NGS pipeline using a set of container building blocks that are versioned, modular and reusable. We present a basic "proof of concept" evaluation of a next-generation sequencing pipeline in Docker containers, capable of producing meaningful results that are comparable with public and "best practice" workflows, with little to no impact on standard computing performance.

Availability: Versioned Dockerfiles and container images for each component are published on GitHub and Docker Hub, respectively. The pipeline and containers can be pulled from Docker Hub and executed in any environment that can run the Docker platform and meets the minimum hardware requirements for running an NGS pipeline.
Highlights
Overview of the NGSeasy pipeline
A typical next-generation sequencing (NGS) pipeline for variant calling and discovery involves the following steps, all of which are implemented in the current version of NGSeasy (1.0-r001):
1. Pre-alignment quality control
Summary
Bioinformatic pipelines are frequently composed of large numbers of loosely coupled pieces of software, each tool requiring substantial configuration, maintenance and management of dependencies. To facilitate packaging and reuse of pipelines, management frameworks such as Galaxy[1], Ruffus[2], and Taverna[3] have been developed. Docker containers are a set of processes running in a multi-tenanted Linux host kernel, and so are very lightweight, as there is no underlying machine to emulate. These containers capture the initial investment of effort needed to build and configure them, which greatly facilitates reuse. They can also be extended to modify or incorporate new components, and shared on private or public (Docker Hub) registries.
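
As a rough illustration of how versioned, modular container images might be pulled from a registry and chained into a pipeline, the following Python sketch assembles the corresponding `docker pull` and `docker run` command lines. It is a minimal sketch under stated assumptions, not the NGSeasy implementation: the image names (`example/qc`, `example/aligner`), the tags, the in-container commands and the `/data` mount point are all hypothetical placeholders. Only the standard Docker CLI subcommands `pull` and `run` (with `--rm` and `-v`) are assumed to exist.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One pipeline step packaged as a versioned container image."""
    image: str      # hypothetical repository name, not a real NGSeasy image
    tag: str        # pinned version tag, so a run can be reproduced later
    command: tuple  # hypothetical command executed inside the container

    @property
    def ref(self) -> str:
        # Fully qualified image reference, e.g. "example/qc:1.0"
        return f"{self.image}:{self.tag}"


def pull_command(component):
    """Build the command that fetches the pinned image from the registry."""
    return ["docker", "pull", component.ref]


def run_command(component, host_dir, mount_point="/data"):
    """Build the command that runs one step with the project directory mounted."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{mount_point}",
        component.ref,
        *component.command,
    ]


# Hypothetical two-step pipeline; names, tags and commands are placeholders.
PIPELINE = [
    Component("example/qc", "1.0", ("run-qc", "/data/sample.fastq")),
    Component("example/aligner", "1.0", ("run-align", "/data/sample.fastq")),
]


def plan(pipeline, host_dir):
    """Return the ordered list of Docker commands for the whole pipeline."""
    commands = []
    for component in pipeline:
        commands.append(pull_command(component))
        commands.append(run_command(component, host_dir))
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them, so no Docker daemon is needed.
    for cmd in plan(PIPELINE, "/tmp/ngs_project"):
        print(shlex.join(cmd))
```

Pinning every component to an explicit tag, rather than relying on a floating default, is what lets the same plan be regenerated and rerun in another environment. Swapping one component for a newer version then only changes a single entry in the list.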