Abstract
Recent advances in sequencing technologies have enabled the production of massive amounts of data on somatic mutations from cancer genomes. These data have led to the detection of characteristic patterns of somatic mutations or “mutation signatures” at an unprecedented resolution, with the potential for new insights into the causes and mechanisms of tumorigenesis. Here we present new methods for modelling, identifying and visualizing such mutation signatures. Our methods greatly simplify mutation signature models compared with existing approaches, reducing the number of parameters by orders of magnitude even while increasing the contextual factors (e.g. the number of flanking bases) that are accounted for. This improves both the sensitivity and the robustness of inferred signatures. We also provide a new intuitive way to visualize the signatures, analogous to the use of sequence logos to visualize transcription factor binding sites. We illustrate our new method on somatic mutation data from urothelial carcinoma of the upper urinary tract, and on a larger dataset from 30 diverse cancer types. The results illustrate several important features of our methods, including the ability of our new visualization tool to clearly highlight the key features of each signature, the improved robustness of signature inferences from small sample sizes, and more detailed inference of signature characteristics such as strand biases and sequence context effects at the base two positions 5′ to the mutated site. The overall framework of our work is based on probabilistic models that are closely connected with “mixed-membership models”, which are widely used in population genetic admixture analysis and in machine learning for document clustering. We argue that recognizing these relationships should help improve understanding of mutation signature extraction problems and suggest ways to further improve the statistical methods.
Our methods are implemented in an R package pmsignature (https://github.com/friend1ws/pmsignature) and a web application available at https://friend1ws.shinyapps.io/pmsignature_shiny/.
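To make the claimed parameter reduction concrete, the following minimal sketch counts free parameters per signature under two models. It rests on an assumption not spelled out in the abstract: that the simplified model treats each mutation feature (substitution type, each flanking base, strand) as an independent categorical distribution, whereas the conventional model assigns one probability to every joint combination. The function names are hypothetical and are not part of pmsignature.

```python
def full_model_parameters(flank: int, use_strand: bool) -> int:
    """Free parameters when every joint combination gets its own probability."""
    # 6 substitution types (C>A, C>G, C>T, T>A, T>C, T>G),
    # 4 choices for each of the 2 * flank flanking bases,
    # and optionally 2 transcription strands.
    categories = 6 * 4 ** (2 * flank) * (2 if use_strand else 1)
    return categories - 1  # probabilities sum to one


def independent_model_parameters(flank: int, use_strand: bool) -> int:
    """Free parameters when each feature has its own categorical distribution.

    Assumption for illustration: features are modelled independently.
    """
    substitution = 6 - 1
    flanking = 2 * flank * (4 - 1)
    strand = 1 if use_strand else 0
    return substitution + flanking + strand


if __name__ == "__main__":
    for flank, use_strand in [(1, False), (2, False), (2, True)]:
        print(
            flank,
            use_strand,
            full_model_parameters(flank, use_strand),
            independent_model_parameters(flank, use_strand),
        )
```

Under these assumptions, with ±2 flanking bases and the transcription strand, the joint model needs 3071 free parameters per signature while the independent model needs 18, which is the kind of gap the abstract describes.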
Highlights
Introduction
Classical studies of mutation patterns revealed that C > A mutations are abundant in lung cancers in patients with a smoking history, and that these are caused by benzo(a)pyrene in tobacco smoke [2].
Each mutation signature may be associated with a specific kind of carcinogen, such as tobacco smoke or ultraviolet light.
Identifying mutation signatures has the potential to reveal new carcinogens and to yield new insights into the mechanisms and causes of cancer. In this paper, we introduce new statistical tools for tackling this important problem.
Summary
We describe how the background mutation signature is obtained when the mutation features are the substitution pattern, the ±2 flanking bases, and the transcription strand. Because most of the data used in this paper are exome sequencing data, and because we treat the transcription strand as a mutation feature, we use the exonic regions of the human genome reference sequence to obtain the background mutation signature. Assuming that the alternate bases are equally likely for each central base (C or T), the frequency of each mutation feature is derived directly from the frequencies of the 5-mers and transcription strands. The probability of each mutation feature is then obtained by normalizing these frequencies so that they sum to one.
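The procedure above can be sketched as follows. This is an illustrative reading, not the pmsignature implementation: the function name, the input format (a list of exonic sequences, each paired with the strand of its gene), and the convention of reverse-complementing 5-mers whose central base is A or G (flipping the strand label accordingly) are all assumptions made for this example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Possible alternate bases for each pyrimidine reference base.
ALTERNATIVES = {"C": "AGT", "T": "ACG"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def background_signature(regions, flank=2):
    """Estimate background probabilities of mutation features.

    regions: iterable of (sequence, strand) pairs, strand being '+' or '-'
             (hypothetical input format chosen for this sketch).
    Returns a dict mapping (substitution, 5' flank, 3' flank, strand)
    to a probability; the probabilities sum to one.
    """
    width = 2 * flank + 1
    counts = Counter()
    for seq, strand in regions:
        seq = seq.upper()
        for i in range(len(seq) - width + 1):
            kmer = seq[i:i + width]
            if set(kmer) - set("ACGT"):
                continue  # skip windows containing ambiguous bases such as N
            kmer_strand = strand
            if kmer[flank] in "AG":
                # Represent the site by its pyrimidine (C or T) and flip the strand.
                kmer = reverse_complement(kmer)
                kmer_strand = "-" if strand == "+" else "+"
            counts[(kmer, kmer_strand)] += 1

    # Split each k-mer count equally among the three possible alternate bases.
    freqs = {}
    for (kmer, kmer_strand), n in counts.items():
        ref = kmer[flank]
        for alt in ALTERNATIVES[ref]:
            key = (f"{ref}>{alt}", kmer[:flank], kmer[flank + 1:], kmer_strand)
            freqs[key] = n / 3

    total = sum(freqs.values())
    return {key: value / total for key, value in freqs.items()}
```

Because the alternate bases are taken to be equally likely, the division by three is a constant factor that cancels in the final normalization; it is kept only to mirror the description step by step.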