Abstract

The advent of next-generation sequencing has accelerated the generation of genomic data and created a need for methodologies to organize, share, and analyze large volumes of data. To date, petabytes of multi-dimensional information from thousands of patients have been collected. Accessing and analyzing this information becomes increasingly challenging as the amount of data grows. This difficulty is exemplified by the data generated by The Cancer Genome Atlas (TCGA) network, which encompasses more than 2.5 petabytes. Historically, downloading the complete TCGA repository has required several weeks even with a highly optimized network connection, and integrated analysis has required access to large institutional compute clusters, which are out of reach for many researchers. The Cancer Cloud Pilot project seeks to address these challenges directly by co-localizing data with the computational resources needed to analyze it, where researchers can access it securely and easily. The project was born out of the recognition that biological research is increasingly computationally intensive and that new approaches are required to support effective data discovery, storage, computation, and collaboration. Funded by the National Cancer Institute, the Cancer Genomics Cloud (CGC) enables researchers to leverage cloud computing to gain actionable insights about cancer biology and human genetics from massive public datasets, including TCGA and the Cancer Cell Line Encyclopedia. Our approach to creating a cancer cloud platform includes collaborative tools, security permissions, and data harmonization, and makes the data easier to query through metadata curation, resource description frameworks, and visual tools. Additionally, we implemented the Common Workflow Language, an emerging standard for describing computational workflows, to support computational reproducibility.
Since its launch in February 2016, more than 1,200 researchers have accessed TCGA data on the CGC and analyzed more than 50,000 samples. In addition to the motivation, inception, and development of the CGC, we will present a case study on the application of unsupervised learning methods to identify individual cell types within tumors using RNA sequencing data from TCGA cohorts. We will demonstrate how these computationally intensive methods benefit from the cloud and how researchers can apply open pipelines to interrogate cancer subtypes and mixed cell populations in TCGA data or in their own data.

Citation Format: Gaurav Kaushik, Yilong Li, Erik Lehnert, Zeynep Onder, Devin Locke, Brandi N. Davis-Dusenbery, Deniz Kural. Enabling petabyte-scale cancer genomics with the NCI Cancer Cloud Pilots [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 2595. doi:10.1158/1538-7445.AM2017-2595

