Abstract

Next-generation sequencing has generated petabytes of public data with the potential to significantly advance biomedical research. The Cancer Genome Atlas (TCGA) network alone, for example, has produced more than 2.5 petabytes of data. Accessing such large datasets, however, remains logistically difficult for researchers. Downloading the complete TCGA dataset to a local data store can take several weeks or more, and integrated analysis has traditionally required resources available only to the limited number of researchers with access to large institutional compute clusters. In 2015, the National Cancer Institute (NCI) launched three Cancer Genomics Cloud Pilots, including the Seven Bridges Cancer Genomics Cloud (CGC; cancergenomicscloud.org), to democratize access to datasets such as TCGA by colocalizing data and computational resources in the cloud. In 2017, NCI expanded this effort into the development of an NCI Cancer Research Data Commons, in which the CGC and the other Cloud Pilots, now known as Cloud Resources, continue to deliver cloud-based access to petabyte-scale data and analysis resources. The Seven Bridges CGC is a customizable and scalable data access and analysis platform that connects users via the web to extensive public datasets, including multi-omic data from TCGA, the Simons Genome Diversity Project, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative, the International Cancer Genome Consortium (ICGC), the Cancer Cell Line Encyclopedia, The Cancer Imaging Archive (TCIA), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The CGC enables collaborative, reproducible analysis across both public and private cohorts through customizable workspaces, a public toolkit containing more than 300 common analytical tools and workflows, and additional resources, including an open-source Software Development Kit known as Rabix.
Since the launch of the CGC in early 2016, more than 2,500 researchers from more than 150 institutions in 30 countries have used the platform to deploy more than 5,000 applications and perform analyses representing more than 100 years of computation time. To illustrate the potential of the CGC to provide a customizable and scalable research environment, we present a collaborative project that enables unprecedented precision in the detection of gene fusions and splice variants using a novel statistical algorithm called Machete. We describe how this software was refactored to optimize its deployment to the cloud for cost-effective analysis of thousands of samples at scale. We also provide benchmarking results that demonstrate the substantial savings in wall-clock time that can be obtained by processing large datasets on the cloud.

Citation Format: Milos Jordanski, Robert Bierman, Erik Lehnert, Ana Damljanovic, Eric Freeman, Gillian Hsieh, Julia Salzman. The Seven Bridges Cancer Genomics Cloud: Enabling reproducible and cost-effective analysis in the cloud [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 5386.
