Abstract

The goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The initial GDC dataset includes genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June 2015 using an OpenStack-based private cloud. By June 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build, GRCh38, the GDC generated a variety of data types, from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).

Highlights

  • The goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine

  • The GRCh38 major human genome assembly was released by the Genome Reference Consortium (GRC) in December 2013 with GenBank assembly accession GCA_000001405.15

  • The rapid decrease in sequencing costs has led to a rapid increase in the resources needed for storage and computation

Summary

Introduction

The goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The GDC [1,2] currently contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET, https://ocg.cancer.gov/programs/target). Each of these projects contains a variety of processed and unprocessed molecular data types, including genomics, epigenomics, proteomics, imaging, clinical, and others. Higher-level data generation pipelines use the GDC-aligned data to derive summary results, such as somatic mutations or gene expression. We discuss the general considerations, implementation details and quality comparisons of the large-scale uniform genomics data analysis that the GDC used to process and harmonize the cancer genomics data that it shares with its users.
