Abstract

As genomic datasets continue to grow, the feasibility of downloading data to a local organization and running analysis on a traditional compute environment is becoming increasingly problematic. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S. Precision Medicine Initiative, and the NIH Big Data to Knowledge Center for Translational Genomics, are using cloud-based infrastructure to both host and perform analysis across large data sets. In PCAWG, over 5,800 whole human genomes were aligned and variant called across 14 cloud and HPC environments; the processed data was then made available on the cloud for further analysis and sharing. If run locally, an operation at this scale would have monopolized a typical academic data centre for many months, and would have presented major challenges for data storage and distribution. However, this scale is increasingly typical for genomics projects and necessitates a rethink of how analytical tools are packaged and moved to the data. For PCAWG, we embraced the use of highly portable Docker images for encapsulating and sharing complex alignment and variant calling workflows across highly variable environments. While successful, this endeavor revealed a limitation in Docker containers, namely the lack of a standardized way to describe and execute the tools encapsulated inside the container. As a result, we created the Dockstore ( https://dockstore.org), a project that brings together Docker images with standardized, machine-readable ways of describing and running the tools contained within. This service greatly improves the sharing and reuse of genomics tools and promotes interoperability with similar projects through emerging web service standards developed by the Global Alliance for Genomics and Health (GA4GH).

Highlights

  • The Dockstore project has its roots in the large-scale ICGC PanCancer Analysis of Whole Genomes (PCAWG; https://dcc. icgc.org/pcawg) cancer genomics project, which necessitated the creation of highly portable and self-contained computational tools[1]

  • We found that the use of cloud Application Program Interfaces (APIs) and scripts to be a cumbersome and error prone way to move algorithms to the data

  • A relatively new lightweight virtualization technology, mitigated these issues by providing a mechanism to encapsulate tools and their dependencies in a highly portable way. This meant PCAWG workflow authors could create and set up their environments within a Docker image, including tools, library dependencies, reference files, and so forth, and copy that image from cloud to cloud for analysis of data in place

Read more

Summary

Introduction

The Dockstore project has its roots in the large-scale ICGC PanCancer Analysis of Whole Genomes (PCAWG; https://dcc. icgc.org/pcawg) cancer genomics project, which necessitated the creation of highly portable and self-contained computational tools[1]. A relatively new lightweight virtualization technology, mitigated these issues by providing a mechanism to encapsulate tools and their dependencies in a highly portable way (https://www.docker.com) This meant PCAWG workflow authors could create and set up their environments within a Docker image, including tools, library dependencies, reference files, and so forth, and copy that image from cloud to cloud for analysis of data in place. In addition to providing a mechanism to bring together Dockerbased tools and their corresponding machine-readable descriptors, the Dockstore provides a compelling and useful web-based interface, an instance of which is hosted at https://dockstore.org This allows it to serve two communities: developers who want to register and share their tools through Dockstore, and users wishing to find genomics tools packaged in Docker and ready to execute in their own systems (Figure 1). Its implementation in Dockstore, and other sites, is a key goal of the standards effort and will allow for federated searches across tool registries that implement the GA4GH API

Methods
Discussion
Thomas FR
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call