Abstract

The goal of the International Cancer Genome Consortium (ICGC) is to analyze the cancer genomes of at least 500 tumour samples, with matched controls, from each of 50 different cancer types and subtypes, building a comprehensive catalogue of somatic abnormalities for the benefit of the research community. The amount of data ICGC members will generate is close to that of 50,000 human genome projects and, to date, the consortium has received commitments for 107 projects to study more than 27,000 tumour genomes. The ICGC Data Coordination Center (DCC) is responsible for collecting, curating, aggregating, and disseminating the data generated by the consortium's member projects. Given the size and complexity of the ICGC data, these tasks represent significant scientific and technological challenges that require a performant, robust software infrastructure. Key to this infrastructure is the ability to scale as the data grows. Using state-of-the-art Big Data, bioinformatics and cloud computing technologies, we developed a suite of web-based applications and microservices that enable member projects to submit their data and validate their submissions against the rules defined in the submission specification. Following validation, the data is processed, annotated and loaded into the data portal using a modular Extract-Transform-Load (ETL) pipeline. The submission, ETL and portal systems are built using scalable, distributed technologies such as Hadoop, Spark, MongoDB and ElasticSearch. Spark is used to validate, join, index, and harmonize annotations on submitted variants, while ElasticSearch powers our variant query engine, API and portal displays. Here we present the ICGC Data Portal and describe both the features and capabilities currently accessible to users and the architecture of the underlying infrastructure. The portal provides scientists with powerful and unique tools for exploring and visualizing the millions of variants and annotations available.
These include sophisticated faceted search capabilities that make data exploration fast and easy, a suite of interactive JavaScript components for in-depth analysis and visualization of specific genomic features, embedded genome and pathway browsers, synthetic cohort comparisons and a streaming data download service. The portal integrates a large variety of annotations, such as variant consequences and frequencies, functional impact factors and druggability. The portal also offers cloud-based tools for searching a catalog of raw ICGC data files stored in worldwide repositories and compute clouds. All source code is open to the community under the GPLv3 license.

Citation Format: Junjun Zhang, Bob Tiernay, Dusan Andric, Phuong-My Do, Sid Joshi, Vitalii Slobodianyk, Chang Wang, Shane Wilson, Andy Yang, Vincent Ferretti. The ICGC data portal and its underlying open source software architecture [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 2602. doi:10.1158/1538-7445.AM2017-2602