CScape-somatic: distinguishing driver and passenger point mutations in the cancer genome.

Mark F Rogers,Tom R Gaunt,Colin Campbell,Peter Robinson

doi:10.1093/bioinformatics/btaa242


Open Access

https://doi.org/10.1093/bioinformatics/btaa242


Abstract

Motivation: Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease, and which neutral. Recently, we proposed CScape, a method for discriminating between cancer driver mutations and presumed benign variants. For the neutral class, this method relied on benign germline variants found in the 1000 Genomes Project database. Discrimination could, therefore, be influenced by the distinction of germline versus somatic, rather than neutral versus disease driver. This motivates this article, in which we consider predictive discrimination between recurrent and rare somatic single point mutations based solely on cancer data, and the distinction between these two somatic classes and germline single point mutations.

Results: For somatic point mutations in coding and non-coding regions of the genome, we propose CScape-somatic, an integrative classifier for predictively discriminating between recurrent and rare variants in the human cancer genome. In this study, we use purely cancer genome data and investigate the distinction between minimal occurrence and significantly recurrent somatic single point mutations in the human cancer genome. We show that this type of predictive distinction can give novel insight, and may deliver more meaningful prediction in both coding and non-coding regions of the cancer genome. Tested on somatic mutations, CScape-somatic outperforms alternative methods, reaching 74% balanced accuracy in coding regions and 69% in non-coding regions, whereas even higher accuracy may be achieved using thresholds to isolate high-confidence predictions.

Availability and implementation: Predictions and software are available at http://CScape-somatic.biocompute.org.uk/.

Contact: mark.f.rogers.phd@gmail.com or C.Campbell@bristol.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
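The two evaluation ideas mentioned in the abstract, balanced accuracy and thresholds that isolate high-confidence predictions, can be illustrated with a minimal Python sketch. This is not the CScape-somatic implementation: the function names, the assumption that the classifier outputs a score in [0, 1] with 0.5 as the decision boundary, and the margin parameter `tau` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def confident_subset(
    y_true: Sequence[int], scores: Sequence[float], tau: float
) -> Tuple[List[int], List[int]]:
    """Keep only examples whose score lies at least `tau` from 0.5.

    Assumption (not from the paper): scores are in [0, 1] and a score
    of 0.5 or more is read as a positive prediction.
    """
    kept_true: List[int] = []
    kept_pred: List[int] = []
    for t, s in zip(y_true, scores):
        if abs(s - 0.5) >= tau:
            kept_true.append(t)
            kept_pred.append(1 if s >= 0.5 else 0)
    return kept_true, kept_pred
```

Raising `tau` discards predictions near the decision boundary, so accuracy on the retained subset can rise while the fraction of variants that receive a prediction falls. Any real use would need to report that coverage alongside the accuracy.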

Highlights

Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants (SNVs) in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease, and which neutral
The key difference is that we wish to explore the potential for discriminating between two different classes of somatic variants: highly recurrent SNVs, which we label as positives, and rare SNVs which we label as negatives
These patterns are consistent with other features in the same groups (Supplementary Figure 2), and support our hypothesis that by developing models focused solely on somatic variants, we may begin to tease out differences between cancer drivers and putative passenger variants
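The labelling scheme described in the highlights, with highly recurrent SNVs as positives and rare SNVs as negatives, might be sketched as follows. This is an illustrative assumption rather than the authors' pipeline: the record layout `(sample_id, chrom, pos, ref, alt)` and the cut-offs `min_recurrent` and `max_rare` are hypothetical, and the source does not state the recurrence thresholds actually used.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Variant = Tuple[str, int, str, str]             # (chrom, pos, ref, alt)
Observation = Tuple[str, str, int, str, str]    # (sample_id, chrom, pos, ref, alt)


def label_by_recurrence(
    observations: Iterable[Observation],
    min_recurrent: int = 5,   # hypothetical cut-off for "highly recurrent"
    max_rare: int = 1,        # hypothetical cut-off for "rare"
) -> Dict[Variant, int]:
    """Label each SNV 1 (recurrent) or 0 (rare) by the number of samples carrying it.

    Variants with an intermediate sample count are left unlabelled.
    """
    # Count each variant at most once per sample.
    unique = set(observations)
    counts = Counter((chrom, pos, ref, alt) for _, chrom, pos, ref, alt in unique)

    labels: Dict[Variant, int] = {}
    for variant, n_samples in counts.items():
        if n_samples >= min_recurrent:
            labels[variant] = 1
        elif n_samples <= max_rare:
            labels[variant] = 0
    return labels
```

Leaving the intermediate counts unlabelled keeps a gap between the two classes, which is one plausible way to reduce label noise. Whether the original study does this is not stated in the text above.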

Summary

Introduction

Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants (SNVs) in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease, and which neutral. Predictors have been developed for variants in both coding and non-coding regions of the human genome. In Shihab et al (2015), we developed such a predictor based on pathogenic disease-driver germline variants from the Human. Multiple types of data may be informative, so we used an integrative binary classifier which weighted component data-types according to their relative informativeness (Shihab et al, 2015). In Rogers et al (2017a), we proposed CScape, a classifier for predicting the driver-status of SNVs in the human cancer genome, with a follow-on investigation of biological insights in Darbyshire et al (2019).

Methods

Results

Conclusion