Abstract
The Multi-Center Mutation Calling in Multiple Cancers (MC3) dataset provides consistent variant calling and filtering across the 10K patients in The Cancer Genome Atlas (TCGA). MC3 was a collaborative science effort, driven by a consortium of researchers across multiple institutions, to form a TCGA capstone project focused on automated cross-tumor-type analysis. The dataset covers 33 different cancer types using an ensemble of 7 advanced mutation-calling algorithms with scoring and artifact filtering, implemented for sharing in a reproducible, portable, standardized workflow. The resulting dataset represents several million core-hours of computational time on over 400 TB of short-read data using the current state-of-the-art variant calling and filtering methods. In the past decade, the precipitous drop in sequencing cost from $10M to $1000 has allowed larger cohorts to be analyzed. In 2016 there were an estimated 1.6M new cancer diagnoses in the United States. Scaling computational systems and genomic analysis to work at this scale requires the coordination of many institutions, many experiments and many computational techniques. Aside from problems of scale, several issues prevent large analyses: 1) deployment of reproducible computing methods in new computing environments, 2) the ability to deploy methods without manual intervention, 3) the biases of single methods and the need for consensus, and 4) the large amount of noise and false positives that arise from germline sequencing, heterogeneous tumor sequencing, technical artifacts, and low variant allele fractions in the observed reads.
The MC3 project dealt with these issues by 1) modeling the pipeline in the Common Workflow Language (CWL) format, with the required software packages deployed using Docker software containerization technology, ensuring that the analysis can be deployed on new computer systems; 2) crafting the entire pipeline using best-practice parameters and applying them consistently across the entire cohort, and analyzing validation data to identify the particular cancer types for which these parameters do not fit; 3) deploying an ensemble of 7 variant calling methods: MuTect, MuSE, Radia, Somatic Sniper, Pindel, Indelocator and Varscan2 (both indel and SNP calling); and 4) applying a robust set of additional filters, as well as meta-calling based on the results of multiple callers, to reduce false-positive rates. Over 20 million variants were produced, and the data generated by this work have formed the basis of the somatic exome variant analysis presented in the other papers from the TCGA PanCanAtlas project. A set of over 3 million high-quality variants from a public release of these data has been made available so that researchers may easily use this open resource.
Citation Format: Kyle Ellrott, Mathew Bailey, Gordon Saksena, Kyle Covington, Cyriac Kandoth, Chip Stewart, Michael McLellan, Heidi Sofia, Carolyn Hutter, Gad Getz, David Wheeler, Li Ding. Multi-Center Mutation Calling in Multiple Cancers: The MC3 Project [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 926.
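The meta-calling idea mentioned in the abstract, in which results from multiple callers are combined to reduce false positives, can be illustrated with a minimal sketch. It is not the MC3 implementation: the function name `consensus_calls`, the `Variant` key layout, the `min_callers` threshold of 2, and the toy coordinates are all assumptions made for illustration, and the abstract does not state how many callers must agree.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed key for a variant: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(
    calls_by_caller: Dict[str, Set[Variant]],
    min_callers: int = 2,  # hypothetical threshold, not taken from the abstract
) -> Dict[Variant, List[str]]:
    """Keep variants reported by at least `min_callers` distinct callers."""
    support: Dict[Variant, List[str]] = defaultdict(list)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].append(caller)
    # Return each retained variant with the sorted names of its supporting callers.
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Toy input with made-up coordinates, for illustration only.
toy_calls = {
    "MuTect": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "MuSE": {("chr1", 100, "A", "T")},
    "Varscan2": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
}
kept = consensus_calls(toy_calls)
```

In this toy input only the chr1 variant is reported by more than one caller, so it is the only one retained. Raising `min_callers` trades sensitivity for fewer false positives, which is the general motivation for consensus across callers.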