Bio-Docklets: virtualization containers for single-step execution of NGS pipelines.

Baekdoo Kim, Enis Afgan, Konstantinos Krampis, Carlos Lijeron, Thahmina Ali

doi:10.1093/gigascience/gix048

Open Access

Journal: GigaScience	Publication Date: Jun 27, 2017
Citations: 12	License type: CC BY 4.0

Affiliation: City University of New York, Johns Hopkins University

Abstract

Processing of next-generation sequencing (NGS) data requires significant technical skills, involving installation, configuration, and execution of bioinformatics data pipelines, in addition to specialized postanalysis visualization and data mining software. To address some of these challenges, developers have leveraged virtualization containers toward seamless deployment of preconfigured bioinformatics software and pipelines on any computational platform. We present an approach for abstracting the complex data operations of multistep bioinformatics pipelines for NGS data analysis. As examples, we have deployed 2 pipelines for RNA sequencing and chromatin immunoprecipitation sequencing, preconfigured within Docker virtualization containers we call Bio-Docklets. Each Bio-Docklet exposes a single data input and output endpoint, and from a user perspective, running the pipelines is as simple as running a single bioinformatics tool. This is achieved using a “meta-script” that automatically starts the Bio-Docklets and controls the pipeline execution through the BioBlend software library and the Galaxy Application Programming Interface. The pipeline output is postprocessed by integration with the Visual Omics Explorer framework, providing interactive data visualizations that users can access through a web browser. Our goal is to enable easy access to NGS data analysis pipelines for nonbioinformatics experts on any computing environment, whether a laboratory workstation, university computer cluster, or a cloud service provider. Beyond end users, Bio-Docklets also enable developers to programmatically deploy and run a large number of pipeline instances for concurrent analysis of multiple datasets.
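
The meta-script idea described in the abstract (start a container, then drive a Galaxy workflow through BioBlend) can be illustrated with a minimal Python sketch. This is not the authors' meta-script: the image name, the port mapping, the workflow name, and the assumption that the workflow has a single input step with index "0" are all hypothetical, and polling the history state is a simplification of how completion would be detected in practice.

```python
import subprocess
import time


def start_docklet(image, port=8080, runner=subprocess.run):
    """Start a container in the background and return its container ID.

    `image` is a hypothetical image name; the container is assumed to
    serve Galaxy on port 80 internally.
    """
    cmd = ["docker", "run", "-d", "-p", f"{port}:80", image]
    result = runner(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def connect(url, api_key):
    """Create a BioBlend client (imported lazily, so the rest runs without it)."""
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def run_pipeline(gi, workflow_name, input_path, poll_seconds=30):
    """Upload one input file, invoke the named workflow, and wait for the history."""
    history = gi.histories.create_history(name=f"run of {workflow_name}")
    upload = gi.tools.upload_file(input_path, history["id"])
    dataset_id = upload["outputs"][0]["id"]

    matches = gi.workflows.get_workflows(name=workflow_name)
    if not matches:
        raise ValueError(f"workflow not found: {workflow_name}")
    workflow_id = matches[0]["id"]

    # Assumption: the workflow has exactly one input step, with index "0".
    inputs = {"0": {"id": dataset_id, "src": "hda"}}
    gi.workflows.invoke_workflow(workflow_id, inputs=inputs, history_id=history["id"])

    # Simplification: poll the history until Galaxy reports "ok" or "error".
    while True:
        state = gi.histories.show_history(history["id"])["state"]
        if state in ("ok", "error"):
            return history["id"], state
        time.sleep(poll_seconds)
```

A caller would chain the three functions: `start_docklet(...)` to launch the container, `connect(...)` once Galaxy is reachable, and `run_pipeline(...)` with a single input file. That mirrors the single input and output endpoint the abstract describes. Running many datasets concurrently would amount to repeating this sequence with a different port per container.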

Highlights

The Galaxy server [6] provides an innovative approach for deployment of command-line software through an online Graphical User Interface (GUI), and it has had a great impact on making next-generation sequencing (NGS) data analysis tools and pipelines accessible to nonbioinformatics experts.
While Galaxy workflow descriptions are standardized in eXtensible Markup Language files, allowing transfer of NGS analysis pipelines across installations at different laboratories, the bioinformatics software used in the pipelines needs to be reinstalled at each location, either manually or through the ToolShed.
A number of other bioinformatics software development projects use Docker virtualization, including BioShaDock [4], which provides a curated repository of prebuilt bioinformatics containers; BioContainers/BioDocker [27], which implements an aggregator and search engine across Docker repositories; bioboxes [5], which defines a standardized interface for running bioinformatics tools preinstalled in containers; and Common Workflow Language (CWL) [28], which allows command-line tools to be connected into portable workflows.

Summary

Objectives

Our goal is to enable easy access to NGS data analysis pipelines for nonbioinformatics experts on any computing environment, whether a laboratory workstation, university computer cluster, or a cloud service provider. To that end, we aim to provide an integrated solution with preconfigured data analysis pipelines that can be deployed across systems ranging from single compute servers used in a laboratory to a cluster or the cloud. This should allow researchers to run multistep data pipelines as easily as running a single bioinformatics tool and to perform advanced genomic data analysis without any prior technical expertise.

Methods

Results

Conclusion