Abstract

For somatic point mutations in coding and non-coding regions of the genome, we propose CScape, an integrative classifier for predicting the likelihood that mutations are cancer drivers. Tested on somatic mutations, CScape tends to outperform alternative methods, reaching 91% balanced accuracy in coding regions and 70% in non-coding regions, while even higher accuracy may be achieved using thresholds to isolate high-confidence predictions. Positive predictions tend to cluster in genomic regions, so we apply a statistical approach to isolate coding and non-coding regions of the cancer genome that appear enriched for high-confidence predicted disease-drivers. Predictions and software are available at http://CScape.biocompute.org.uk/.

Highlights

  • Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants (SNVs) in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease and which are neutral.

  • If we restrict prediction to the highest-confidence instances only, balanced accuracy in leave-one-chromosome-out cross-validation (LOCO-CV) rises to 91.7% for coding regions and 76.1% for non-coding regions, with predictions confined to 17.7% and 14.8% of nucleotide positions across the genome, respectively.

  • Because these high-confidence positive predictions are typically clustered by genomic location, we further introduce a statistical method to find significant contiguous sets of such positive predictions.
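The clustering idea in the last highlight can be illustrated with a minimal Python sketch. It covers only the grouping step: it collects high-confidence positive predictions into contiguous runs along one chromosome. The function name `positive_runs`, the 0.9 threshold, and the `max_gap` parameter are assumptions made for illustration, and the sketch does not implement the statistical significance test, which is not described in this summary.

```python
def positive_runs(positions, scores, tau=0.9, max_gap=1):
    """Group high-confidence positive predictions into contiguous runs.

    positions: genomic coordinates on a single chromosome (assumed unique)
    scores:    predicted probabilities of being a driver, one per position
    tau:       hypothetical confidence threshold for a positive call
    max_gap:   largest coordinate step still treated as contiguous
    Returns a list of (start, end, n_positives) tuples.
    """
    # Keep only positions whose score clears the threshold, in genomic order.
    hits = sorted(p for p, s in zip(positions, scores) if s >= tau)

    runs = []
    for p in hits:
        if runs and p - runs[-1][1] <= max_gap:
            # Extend the current run.
            runs[-1][1] = p
            runs[-1][2] += 1
        else:
            # Start a new run at this position.
            runs.append([p, p, 1])
    return [tuple(r) for r in runs]


if __name__ == "__main__":
    pos = [100, 101, 102, 103, 110, 111, 120]
    sc = [0.95, 0.97, 0.91, 0.20, 0.99, 0.93, 0.50]
    print(positive_runs(pos, sc))
```

Any assessment of whether a run is longer than expected by chance would be applied to the output of a step like this one.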


Summary

Introduction

Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants (SNVs) in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease and which are neutral. Tested on independent data sets, CScape can achieve up to 91% balanced accuracy in coding regions and 70% in non-coding regions. Most methods in this domain attempt to rank mutations to identify the most likely oncogenic examples. If we restrict prediction to the highest-confidence instances only (cautious classification), balanced accuracy in LOCO-CV rises to 91.7% for coding regions and 76.1% for non-coding regions, with predictions confined to 17.7% (coding) and 14.8% (non-coding) of nucleotide positions across the genome. Because these high-confidence positive (oncogenic) predictions are typically clustered by genomic location, we further introduce a statistical method to find significant contiguous sets of such positive predictions. This latter approach highlights a number of genes as potentially oncogenic via somatic point mutation.
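To make the trade-off in cautious classification concrete, the following Python sketch computes balanced accuracy together with the fraction of examples that still receive a prediction. It assumes a symmetric rule: a score at or above a threshold `tau` is called positive, a score at or below `1 - tau` is called negative, and everything in between is left unclassified. The function name, the rule, and the toy data are illustrative assumptions, not taken from the CScape software.

```python
def cautious_balanced_accuracy(scores, labels, tau):
    """Balanced accuracy and coverage when only confident calls are kept.

    scores: predicted probabilities in [0, 1] (higher = more likely driver)
    labels: true classes, 1 for driver and 0 for neutral
    tau:    hypothetical confidence threshold in [0.5, 1]
    Returns (balanced_accuracy, coverage).
    """
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        if s >= tau:
            pred = 1
        elif s <= 1 - tau:
            pred = 0
        else:
            continue  # abstain on low-confidence examples
        if y == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must remain after thresholding")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    coverage = (tp + fn + tn + fp) / len(scores)
    return (sensitivity + specificity) / 2, coverage


if __name__ == "__main__":
    # Toy data for illustration only.
    scores = [0.95, 0.93, 0.60, 0.40, 0.07, 0.05, 0.92, 0.08]
    labels = [1, 1, 0, 1, 0, 0, 0, 1]
    for tau in (0.5, 0.9):
        print(tau, cautious_balanced_accuracy(scores, labels, tau))
```

Raising `tau` typically increases balanced accuracy on the retained examples while lowering coverage, which is the pattern the reported figures describe.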

