Assessment of computational methods for predicting the effects of missense mutations in human cancers

Florian Gnad, Zemin Zhang, Albion Baucom, Kiran Mukhyala, Gerard Manning

doi:10.1186/1471-2164-14-s3-s7

Abstract

Background: Recent advances in sequencing technologies have greatly increased the identification of mutations in cancer genomes. However, it remains a significant challenge to identify cancer-driving mutations, since most observed missense changes are neutral passenger mutations. Various computational methods have been developed to predict the effects of amino acid substitutions on protein function and classify mutations as deleterious or benign. These include approaches that rely on evolutionary conservation, structural constraints, or physicochemical attributes of amino acid substitutions. Here we review existing methods and further examine eight tools: SIFT, PolyPhen2, Condel, CHASM, mCluster, logRE, SNAP, and MutationAssessor, with respect to their coverage, accuracy, availability and dependence on other tools.

Results: Single nucleotide polymorphisms with high minor allele frequencies were used as a negative (neutral) set for testing, and recurrent mutations from the COSMIC database as well as novel recurrent somatic mutations identified in very recent cancer studies were used as positive (non-neutral) sets. Conservation-based methods generally had moderately high accuracy in distinguishing neutral from deleterious mutations, whereas the performance of machine-learning-based predictors with comprehensive feature spaces varied between assessments using different positive sets. MutationAssessor consistently provided the highest accuracies. For certain combinations, metapredictors slightly improved the performance of the included individual methods, but did not outperform MutationAssessor as a stand-alone tool.

Conclusions: Our independent assessment of existing tools reveals various performance disparities. Cancer-trained methods did not improve upon more general predictors. No method or combination of methods exceeds 81% accuracy, indicating that there is still significant room for improvement in driver mutation prediction, and that more sophisticated feature integration may be needed to develop a more robust tool.
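The kind of assessment described above, scoring a predictor against a neutral set and a non-neutral set, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `evaluate`, the boolean input format, and the use of plain (unbalanced) accuracy are all assumptions, since this section does not state how accuracy was computed.

```python
def evaluate(neutral_calls, non_neutral_calls):
    """Summarize a predictor's calls on two labeled test sets.

    Each argument is a list of booleans, one per mutation, where True means
    the predictor called the mutation deleterious. (Hypothetical format.)
    """
    tn = sum(1 for call in neutral_calls if not call)   # neutral, called benign
    fp = len(neutral_calls) - tn                         # neutral, called deleterious
    tp = sum(1 for call in non_neutral_calls if call)    # non-neutral, called deleterious
    fn = len(non_neutral_calls) - tp                     # non-neutral, called benign
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # assumed definition: fraction of correct calls
        "sensitivity": tp / (tp + fn),      # fraction of non-neutral mutations recovered
        "specificity": tn / (tn + fp),      # fraction of neutral mutations recovered
    }


# Toy data: four common SNPs (neutral) and four recurrent mutations (non-neutral).
metrics = evaluate(
    neutral_calls=[False, False, True, False],
    non_neutral_calls=[True, True, False, True],
)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy is useful when the neutral and non-neutral sets differ greatly in size, because plain accuracy is then dominated by the larger set.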

Highlights

Recent advances in sequencing technologies have greatly increased the identification of mutations in cancer genomes
SIFT identifies conserved protein residues based on multiple sequence alignment of homologous proteins, and calculates the probability for each of the 19 amino acid changes to be tolerated relative to the most frequent residue
By calculating SIFT scores for both the mutant and wild-type alleles, it identifies potential gain-of-function mutations where the mutant residue is more similar to those found in homologous proteins
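The idea behind the two highlights above can be illustrated with a deliberately simplified conservation score. This sketch is not the SIFT implementation: it looks at a single alignment column, uses a uniform pseudocount, and applies no sequence weighting, all of which are simplifying assumptions. The function names `tolerance_scores` and `compare_alleles` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tolerance_scores(column, pseudocount=1.0):
    """Score each amino acid at one alignment position.

    `column` holds the residue seen at this position in each homologous
    sequence. Scores are smoothed counts divided by the count of the most
    frequent residue, so the most frequent residue scores 1.0.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)  # skip gaps etc.
    smoothed = {aa: counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS}
    top = max(smoothed.values())
    return {aa: smoothed[aa] / top for aa in AMINO_ACIDS}


def compare_alleles(column, wild_type, mutant):
    """Return (wild-type score, mutant score, mutant_more_typical).

    The flag is True when the mutant residue resembles the homologs more
    than the wild-type residue does, the pattern the text associates with
    potential gain-of-function mutations.
    """
    scores = tolerance_scores(column)
    wt_score, mut_score = scores[wild_type], scores[mutant]
    return wt_score, mut_score, mut_score > wt_score


# Toy column: leucine dominates the homologs, while the reference protein has proline.
column = "LLLLLLLIVP"
scores = tolerance_scores(column)
wt_score, mut_score, flagged = compare_alleles(column, wild_type="P", mutant="L")
print(wt_score, mut_score, flagged)
```

In this toy column a proline-to-leucine change moves the protein toward the residue favored by its homologs, so the mutant scores higher than the wild type and the change is flagged.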

Summary

Introduction

Recent advances in sequencing technologies have greatly increased the identification of mutations in cancer genomes. Various computational methods have been developed to predict the effects of amino acid substitutions on protein function and classify mutations as deleterious or benign. Non-synonymous changes (those that change protein sequences) are the most investigated group of genetic perturbations. These mutations vary greatly in their functional impact, depending on their position and function in the protein and the nature of the replacement amino acid. Several computational methods have been developed to predict the effect of any missense mutation on protein function, using evolutionary sequence comparison, structural constraints, and physicochemical attributes of amino acids. Machine learning methods aim to predict cancer-driving deleterious mutations, based on a wider set of attributes and training with sets of likely cancer mutations. Metapredictors that combine several methods have also been developed [2].
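As a rough illustration of what a metapredictor does, the sketch below combines per-tool scores into one call with a weighted average. It is a generic example and not the combination scheme of any tool named in this paper: the function name `combine_scores`, the tool names, the assumption that every score is already scaled to [0, 1] with higher meaning more deleterious, and the 0.5 threshold are all hypothetical.

```python
def combine_scores(scores, weights=None, threshold=0.5):
    """Combine per-tool deleteriousness scores into a single prediction.

    `scores` maps a tool name to a score in [0, 1] (assumed pre-normalized).
    `weights` optionally maps the same names to non-negative weights, for
    example reflecting each tool's accuracy on a benchmark; equal weights
    are used when it is omitted.
    Returns (combined score, called deleterious).
    """
    if not scores:
        raise ValueError("at least one tool score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(weights[name] * score for name, score in scores.items()) / total_weight
    return combined, combined >= threshold


# Placeholder tool names and scores for one mutation.
tool_scores = {"tool_a": 0.9, "tool_b": 0.6, "tool_c": 0.3}

equal_score, equal_call = combine_scores(tool_scores)
weighted_score, weighted_call = combine_scores(
    tool_scores, weights={"tool_a": 2.0, "tool_b": 1.0, "tool_c": 1.0}
)
print(equal_score, equal_call, weighted_score, weighted_call)
```

Real tools report scores on different scales and with different orientations, so a normalization step would be needed before any such averaging.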

Methods

Results

Conclusion