Abstract

Analysis of somatic protein-coding mutations, particularly via tumor exome sequencing, has driven discovery of novel cancer genes in recent years. However, ~97% of the human genome is non-coding and not addressable by exome-targeting methods. Tumor whole genome sequencing, by contrast, allows for genome-wide discovery of somatic events and may identify novel non-coding driver mutations. In conjunction, flexible tools employing principled statistical methods are needed for discovery of recurrent somatic driver mutations from tumor whole genomes. Here, we developed a computational toolset that addresses these needs and describe its application to a novel cohort of lung adenocarcinoma tumors generated by the APOLLO (Applied Proteogenomics OrganizationaL Learning and Outcomes) consortium. Our toolset flexibly computes feature-wise and hotspot somatic mutation enrichment statistics in both protein-coding and non-coding (e.g. promoters, enhancers) genomic regions. Additionally, we incorporated an affinity propagation-based clustering procedure that groups test regions by user-specified genomic covariates that can influence local mutation properties. We applied our toolset to several TCGA datasets of tumor exome sequencing. We found strong agreement between our results and those of other tools: among the top 15 genes called at FDR < 0.01 by MutSigCV, we observed a median overlap of 75.7% with genes ranked in the top 15 by our methods. We next applied our toolset to tumor whole genome sequencing data from our APOLLO lung adenocarcinoma cohort. During analysis, we simultaneously controlled for several genomic covariates, including GC/CpG content, replication timing, and proximal mRNA expression. Gene-centric analysis identified several significantly mutated genes, including KRAS, STK11, and TP53 (FDR < 0.005).
We examined matched mRNA-seq data from this cohort and found significant over-expression of KRAS mRNA in samples possessing a non-silent coding mutation in this gene (p < 0.02). We additionally assessed somatic mutation enrichment in promoters, 3' UTRs, non-coding genes, and lung-specific enhancers. From these results, we found significant enrichment of somatic mutations in the body of the long non-coding RNA NEAT1. Among the tumors with somatic NEAT1 mutations, a single sample was hypermutated at this locus and corresponded with the lowest observed RNA expression for this lncRNA in our cohort (TPM of 12.2 vs. overall mean TPM of 84.2). In conclusion, we developed a flexible toolset for interrogating both coding and non-coding landscapes from tumor whole genomes. We applied our methods to a novel cohort of lung adenocarcinoma tumors and identified recurrently mutated genes and non-coding regions, including the NEAT1 lncRNA.

Citation Format: Anthony R. Soltis, Coralie Viollet, Harvey B. Pollard, Christopher A. Moskaluk, Robert F. Browning, Clifton L. Dalgard, Craig D. Shriver, Matthew D. Wilkerson. Flexible discovery of recurrent coding and non-coding mutations in tumor whole genomes [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 2281.
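Illustrative sketch (not part of the published abstract): the abstract reports feature-wise mutation enrichment and FDR thresholds but does not state which statistical test the toolset uses. The Python below shows one common way such a test can be set up, assuming a binomial model against a fixed background mutation rate followed by Benjamini-Hochberg correction. The function names, the region labels, the counts, and the background rate are all hypothetical; in a covariate-aware analysis the background rate would presumably be estimated per cluster of similar regions rather than passed in as a constant.

```python
import math


def binom_sf(k, n, p):
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # Binomial probability mass at i, computed in log space for stability.
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if i > n * p and term <= total * 1e-17:
            break
    return min(total, 1.0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals


def score_regions(regions, n_samples, background_rate):
    """Score regions given as (name, length_bp, n_mutations) tuples.

    Assumption: every base in every sample is an independent trial that
    mutates with probability `background_rate`.
    """
    pvals = [binom_sf(n_mut, length * n_samples, background_rate)
             for _, length, n_mut in regions]
    qvals = benjamini_hochberg(pvals)
    scored = [(name, p, q) for (name, _, _), p, q in zip(regions, pvals, qvals)]
    return sorted(scored, key=lambda row: row[1])


# Toy input: hypothetical regions and counts, not data from the APOLLO cohort.
toy_regions = [("region_A", 2000, 6), ("region_B", 1500, 1), ("region_C", 3000, 0)]
results = score_regions(toy_regions, n_samples=100, background_rate=1e-6)
for name, p, q in results:
    print(f"{name}\tp={p:.3g}\tq={q:.3g}")
```

Regions whose adjusted value falls below a chosen threshold (the abstract uses FDR < 0.01 and FDR < 0.005 in different analyses) would be reported as significantly mutated under this simplified model.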