Abstract

The vast preponderance of somatic mutations in a typical cancer are either extremely rare or have never been recorded in available databases that track somatic mutations. These constitute a hidden genome that contrasts with the relatively small number of frequently occurring mutations, the properties of which have been studied in depth. Here we demonstrate that this hidden genome contains much more accurate information than common mutations for identifying the site of origin of primary cancers in settings where this is unknown. We accomplish this using a projection-based statistical method that achieves highly effective signal condensation by leveraging DNA sequence and epigenetic contexts through a set of meta-features that embody the mutation contexts of rare variants throughout the genome.

Highlights

  • We draw upon a context-based learning approach[16] in which the role of rare and unseen variants can be “learned” through their local genome and epigenome context: a set of quantifiable knowledge units obtained from the associated DNA sequence and epigenome contexts, which we refer to as meta-features. These meta-features include various topological or functional annotations of the genome and epigenome, such as single-base substitution signatures in the trinucleotide context, regional indices mapping the topological position on the chromosome, and features of the epigenome, including chromatin accessibility and histone modification, that we describe in more detail.

  • Our primary finding indicates that critical information for classifying tumor type resides in the noncoding somatic mutations detectable via whole genome sequencing.

  • The sheer number of these mutations, along with their relationship with discriminative epigenetic meta-features translated via local mutation burdens, allows these mutations to collectively carry strong tissue-specific signals.
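The idea of learning rare and unseen variants through their context can be illustrated with a minimal sketch. It is a toy illustration only, not the authors' implementation: the variant tuples, the `meta_features` and `project_tumor` helpers, the feature naming scheme, and the 1 Mb regional bin size are all assumptions made for this example, and epigenomic meta-features are omitted. Each variant is mapped to a small set of meta-features (a single-base substitution in its trinucleotide context and a regional index), and a tumor is summarized by summing those meta-features over all of its variants, so a variant never seen before still contributes through its context.

```python
from collections import Counter

def meta_features(variant, bin_size=1_000_000):
    """Map one variant to hypothetical meta-features.

    variant = (chromosome, position, trinucleotide context, alternate base);
    the middle base of the context is the reference base.
    """
    chrom, pos, context, alt = variant
    ref = context[1]
    return {
        # Single-base substitution in its trinucleotide context, e.g. A[C>T]A
        f"sbs:{context[0]}[{ref}>{alt}]{context[2]}": 1.0,
        # Regional index: which fixed-width bin of the chromosome the variant is in
        f"region:{chrom}:{pos // bin_size}": 1.0,
    }

def project_tumor(variants, bin_size=1_000_000):
    """Sum meta-features over all variants of one tumor (toy projection)."""
    totals = Counter()
    for variant in variants:
        for name, value in meta_features(variant, bin_size).items():
            totals[name] += value
    return dict(totals)

# Made-up variants for one tumor (illustrative values only)
tumor = [
    ("chr1", 1_250_000, "ACA", "T"),
    ("chr1", 1_900_000, "ACA", "T"),
    ("chr7", 55_000_123, "TCG", "A"),
]
profile = project_tumor(tumor)
for name in sorted(profile):
    print(name, profile[name])
```

The resulting profile has one entry per meta-feature rather than one per variant, which is what lets thousands of individually uninformative variants pool into a compact feature vector.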


Summary

Results

1–5 display the overall and cancer-site-specific precision-recall curves, respectively, for these classifiers across all five sequencing platforms. These figures demonstrate the ability of the projected hidden genome classifier to condense information from diverse sources. The key advantage of the proposed approach is that it provides a unifying strategy that integrates all these factors through appropriate meta-feature transformations within a multinomial logistic modeling framework, which in turn permits high predictive accuracy across all DNA sequencing platforms.

We obtained the average normalized total mutational burden in each chromosome arm from the PCAWG whole genome dataset and performed linear regression analyses of these values in a model that included chromosome averages of each epigenomic feature (H3K35me3, H3K4me1, chromatin accessibility) obtained from the ENCODE and Epigenome Roadmap datasets, as well as the average copy number log ratio for the chromosome arm.

Fig. 6 Cancer-site-specific odds ratios of the most effective individual predictors in the hidden genome model applied to the PCAWG whole genome data.

The strong negative correlations of mutational burden with log H3K4me1 are displayed for these sites in Fig. 7c, with plots for the other sites displayed in Supplementary Fig. 12.
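To make the multinomial logistic framework and the odds ratios mentioned above more concrete, here is a minimal sketch under stated assumptions. The cancer sites, feature names, and coefficient values are invented for illustration and are not taken from the paper; the helper names `softmax_probabilities` and `odds_ratio` are hypothetical. The sketch uses a baseline-category parametrization (the baseline site has all coefficients fixed at zero), under which the exponential of a coefficient is the multiplicative change in the odds of that site versus the baseline per unit increase in the predictor.

```python
import math

def softmax_probabilities(features, coefficients, intercepts=None):
    """Class probabilities of a multinomial logistic model (toy version).

    features:     {feature name: value} for one tumor
    coefficients: {site: {feature name: weight}}; a site with an empty
                  dict acts as the baseline category
    """
    intercepts = intercepts or {}
    scores = {}
    for site, weights in coefficients.items():
        scores[site] = intercepts.get(site, 0.0) + sum(
            weights.get(name, 0.0) * value for name, value in features.items()
        )
    # Subtract the maximum score for numerical stability before exponentiating
    max_score = max(scores.values())
    exp_scores = {site: math.exp(s - max_score) for site, s in scores.items()}
    total = sum(exp_scores.values())
    return {site: e / total for site, e in exp_scores.items()}

def odds_ratio(coefficient):
    """Odds ratio versus the baseline site per unit increase in a predictor."""
    return math.exp(coefficient)

# Invented coefficients and features, for illustration only
coefficients = {
    "lung": {"sbs:C>A": 0.8},
    "skin": {"sbs:C>T": 1.2},
    "other": {},  # baseline category
}
features = {"sbs:C>A": 0.0, "sbs:C>T": 2.0}

probs = softmax_probabilities(features, coefficients)
predicted_site = max(probs, key=probs.get)
print(predicted_site, probs)
print("odds ratio for sbs:C>T in skin vs baseline:", odds_ratio(1.2))
```

In this reading, a predictor with an odds ratio well above 1 for a given site pushes predictions toward that site, and one well below 1 pushes away from it.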