Abstract

Next Generation Sequencing technologies have produced a substantial increase in publicly available genomic data and related clinical/biospecimen information. New models and methods are needed to access, integrate, and search these data effectively. One such effort was made by the Genomic Data Commons (GDC), which defined strict procedures for harmonizing genomic and clinical data of cancer, and created the GDC data portal with its application programming interface (API). In this work, we enhance GDC harmonization by applying a state-of-the-art data model (called Genomic Data Model) made of two components: the genomic data, in Browser Extensible Data (BED) format, and the related metadata, in a tab-delimited key-value format. Furthermore, we extend the GDC genomic data with information extracted from other public genomic databases (e.g., GENCODE, HGNC and miRBase). For metadata, we implemented automatic procedures that extract and normalize them, recognizing and eliminating redundant ones, from both the Clinical/Biospecimen Supplements and the GDC Data Model, which are available from the two GDC sources (i.e., data portal and API). We developed and released the OpenGDC software, which is able to extract, integrate, extend, and standardize genomic and clinical data of The Cancer Genome Atlas (TCGA) from the GDC. Additionally, we created a publicly accessible repository containing these homogenized and enhanced TCGA data (about 1.3 TB in total). Our approach, implemented in the OpenGDC software, provides a step forward to the effective and efficient management of big genomic and clinical data of cancer. The usability of our data model and the utility of our work are demonstrated by applying the GenoMetric Query Language (GMQL) to the transformed TCGA data from the GDC, achieving promising results and facilitating information retrieval and knowledge discovery analyses.
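The two-component data model described above can be illustrated with a minimal sketch. It is not the OpenGDC implementation: it only assumes that each sample is a BED file whose first three columns are the standard chrom, start, and end fields, with any further columns treated as free-form attributes, paired with a tab-delimited key-value metadata file. The names `Region`, `parse_bed`, and `parse_metadata`, as well as the sample values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """One genomic region: required BED fields plus free-form extra columns."""
    chrom: str
    start: int  # BED start is 0-based
    end: int    # BED end is exclusive
    extra: List[str]


def parse_bed(text: str) -> List[Region]:
    """Parse tab-delimited BED text, skipping blank, comment, and header lines."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        cols = line.split("\t")
        regions.append(Region(cols[0], int(cols[1]), int(cols[2]), cols[3:]))
    return regions


def parse_metadata(text: str) -> Dict[str, List[str]]:
    """Parse tab-delimited key-value lines; a key may carry several values."""
    meta: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        meta.setdefault(key, []).append(value)
    return meta


# Hypothetical sample content, for illustration only.
BED_TEXT = "chr1\t100\t200\t+\tGENE_A\t0.5\nchr2\t300\t450\t-\tGENE_B\t1.25\n"
META_TEXT = "sample_type\tPrimary Tumor\ngender\tfemale\n"

regions = parse_bed(BED_TEXT)
metadata = parse_metadata(META_TEXT)

# Because BED intervals are 0-based and half-open, length is end - start.
lengths = [r.end - r.start for r in regions]
```

Keeping regions and metadata in separate structures mirrors the split between the genomic data and the clinical/biospecimen attributes, so a query can first select samples by metadata and then operate on their regions.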

Highlights

  • The large amount of genomic data generated by Next Generation Sequencing (NGS) technologies [1,2], together with the related clinical data, brings significant value for medical research, especially for cancer studies [3]

  • We implemented automatic procedures for converting the original Genomic Data Commons (GDC) genomic data into the free-BED (Browser Extensible Data) format; to index our BED output files, we introduce opengdc_id, an extension of the aliquot Universally Unique Identifier (UUID); the aliquot is the unit of analysis for GDC genomic data and identifies the analyzed portion of a sample

  • We illustrate the FTP repository where we provide the standardized genomic data and metadata obtained by applying OpenGDC to the GDC data of the


Summary

Introduction

The large amount of genomic data generated by Next Generation Sequencing (NGS) technologies [1,2], together with the related clinical data, brings significant value for medical research, especially for cancer studies [3]. Thanks to NGS techniques, different types of experimental data are produced, whose storage and analysis can be very demanding [4,5,6]. Researchers increasingly have to deal with big biological data [7,8], which frequently lack integrated data models and accessible schema representations. Storing, retrieving, integrating, comparing, and analyzing heterogeneous biomedical data becomes a major challenge

Sci. 2020, 10, 6367
