Abstract

The Gabriella Miller Kids First Pediatric Research Program (GMKF), supported by the NIH Common Fund, is a nationwide, multi-year initiative focused on integrating large-scale, clinically annotated genomic data for childhood cancers and structural birth defects. Funded through a GMKF award, the Kids First Data Resource Center (DRC) is tasked with building infrastructure and workflows for data intake, harmonization, integration, and access authorization to empower collaborative discovery across GMKF and other integrated datasets. A key challenge for uniform analysis of, and discovery from, large-scale genomic data is the diversity of genomic processing workflows and methods employed across the sequencing and bioinformatics community. The DRC genomic harmonization team aims to provide “analysis ready” datasets that are “functionally equivalent” across the Kids First datasets and other large-scale genomic data initiatives in order to accelerate the discovery process. Paired with the cloud-based workspace environments of the DRC, such harmonized datasets provide unprecedented opportunities for shared, reproducible discovery by a diverse, collaborative network of researchers. To this end, the DRC's initial pipelines use BWA-MEM alignment to genome build GRCh38, followed by the GATK best practices for germline variant calling and joint genotyping. Common Workflow Language (CWL) is the main workflow specification, and Docker is used to containerize all tools used by the workflows. Our current workflows harmonize data across a number of experimental platforms, including whole genome sequencing (WGS), exome sequencing, and RNA-seq. Data processing is done on CAVATICA, an Amazon Web Services (AWS)-based cloud computing platform associated with the Kids First DRC Portal co-developed by Seven Bridges Genomics, where workflows feature scatter-gather parallelization and AWS resource optimization.
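As a rough illustration of the scatter-gather pattern mentioned above, the following minimal Python sketch splits a genome into intervals, processes each interval in parallel, and merges the results. It is not the DRC's CWL implementation: the contig names and lengths, the chunk size, and the functions `scatter`, `process_interval`, `gather`, and `run` are all hypothetical, and the per-interval step is a stand-in that only counts bases.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contig lengths (not real GRCh38 values), for illustration only.
CONTIGS = {"chrA": 2500, "chrB": 1000}


def scatter(contigs, chunk_size):
    """Split each contig into half-open (name, start, end) intervals."""
    intervals = []
    for name, length in contigs.items():
        for start in range(0, length, chunk_size):
            intervals.append((name, start, min(start + chunk_size, length)))
    return intervals


def process_interval(interval):
    """Stand-in for a per-interval step such as variant calling."""
    name, start, end = interval
    return {"interval": interval, "bases": end - start}


def gather(results):
    """Merge per-interval results into one per-contig summary."""
    totals = {}
    for result in results:
        name = result["interval"][0]
        totals[name] = totals.get(name, 0) + result["bases"]
    return totals


def run(contigs, chunk_size=1000, workers=4):
    """Scatter the contigs, process intervals concurrently, gather the output."""
    intervals = scatter(contigs, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input intervals.
        results = list(pool.map(process_interval, intervals))
    return gather(results)


if __name__ == "__main__":
    print(run(CONTIGS))
```

In a real workflow the per-interval step would be an independent containerized task, so the scatter width can be tuned against instance size and cost.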
Using this framework, the DRC team has harmonized over 10,000 WGS and 1,000 RNA-seq samples across 12 study cohorts within 8 months. The current release includes samples from 40 pediatric brain cancers as well as 8 childhood birth defects, delivering 150 TB of harmonized CRAM and 60 TB of gVCF files. With a highly optimized bioinformatics pipeline powered by an efficient cloud-based execution workflow, the DRC platform processes one genome in about 11 hours at an average compute cost of $15 for whole genome alignment and germline variant calling. Here we present our observed challenges and identified opportunities in the analysis and integration of multi-disease pediatric genomic data on a large scale.

Citation Format: Yuankun Zhu, Miguel Brown, Batsal Devkota, Bailey Farrow, Bogdan Gavrilovic, Allison Heath, Kyle Hernandez, Avi Kelman, Parimala Killada, Meen Chul Kim, Daniel Kolbman, Mateusz Koptyra, Milan Kovacevic, Maarten Leerkes, Alex Lubneuski, Michele Mattioni, Pichai Raman, Adam Resnick, Nikola Skundric, Deanne Taylor, Junjun Zhang, Bo Zhang, Phillip B. Storm. Genomic harmonization of the Data Resource Center for Gabriella Miller Kids First Pediatric Research Program [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 2465.