Abstract

Background: Although synonymous single nucleotide variants (sSNVs) do not alter protein sequences, they have been shown to play an important role in human disease. Distinguishing pathogenic sSNVs from neutral ones is challenging because pathogenic sSNVs tend to have low prevalence. Although many methods have been developed for predicting the functional impact of single nucleotide variants, only a few have been specifically designed for identifying pathogenic sSNVs.

Results: In this work, we describe a computational model, IDSV (Identification of Deleterious Synonymous Variants), which uses a random forest (RF) to detect deleterious sSNVs in human genomes. We systematically investigate a total of 74 multifaceted features across seven categories: splicing, conservation, codon usage, sequence, pre-mRNA folding energy, translation efficiency, and functional region annotation features. Then, to remove redundant and irrelevant features and improve prediction performance, we apply feature selection using the sequential backward selection method. Based on the 10 selected features, an RF classifier is developed to identify deleterious sSNVs. The results on benchmark datasets show that IDSV outperforms other state-of-the-art methods in identifying pathogenic sSNVs.

Conclusions: We have developed an efficient feature-based prediction approach (IDSV) for deleterious sSNVs by using a wide variety of features. From all the features, we identify a compact and useful subset that has important implications for identifying deleterious sSNVs. Our results indicate that, besides splicing and conservation features, a new translation efficiency feature is also informative for identifying deleterious sSNVs. Although the functional region annotation and sequence features are weakly informative on their own, they may help discriminate deleterious sSNVs from benign ones when combined with other features.
The data and source code are available at http://bioinfo.ahu.edu.cn:8080/IDSV.
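
The sequential backward selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the IDSV implementation: the function name `sequential_backward_selection`, the feature names, and the toy scoring function are all hypothetical, and the weights are invented purely for illustration. In a real pipeline, the score would more plausibly be a cross-validated performance measure of a random forest trained on the candidate subset.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def sequential_backward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    n_keep: int,
) -> Tuple[List[str], float]:
    """Greedily drop one feature at a time until n_keep remain.

    At each step, the feature whose removal leaves the best-scoring
    subset is discarded.
    """
    if not 1 <= n_keep <= len(features):
        raise ValueError("n_keep must be between 1 and len(features)")
    selected = list(features)
    while len(selected) > n_keep:
        # Score every subset obtained by removing exactly one feature.
        candidates = [(score(frozenset(selected) - {f}), f) for f in selected]
        _, to_drop = max(candidates, key=lambda c: c[0])
        selected.remove(to_drop)
    return selected, score(frozenset(selected))


# Hypothetical toy data: invented usefulness weights, NOT values from the paper.
TOY_WEIGHTS = {
    "informative_a": 0.30,
    "informative_b": 0.25,
    "informative_c": 0.15,
    "weak_d": 0.01,
    "noise_e": 0.00,
}


def toy_score(subset: FrozenSet[str]) -> float:
    # Stand-in for a cross-validated classifier score:
    # summed usefulness minus a small penalty per retained feature.
    return sum(TOY_WEIGHTS[f] for f in subset) - 0.02 * len(subset)


kept, final_score = sequential_backward_selection(
    list(TOY_WEIGHTS), toy_score, n_keep=3
)
print(kept, round(final_score, 2))
```

With these made-up weights, the two least useful features are removed first, leaving the three informative ones. The greedy search evaluates only one-feature removals at each step, so it is far cheaper than testing every subset but is not guaranteed to find the globally best subset.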

Highlights

  • Although synonymous single nucleotide variants do not alter protein sequences, they have been shown to play an important role in human disease

  • Because both FATHMM-MKL and CADD are designed for predicting all types of pathogenic variants, it is not easy to assess the relative importance of the various features that are specific to synonymous single nucleotide variants (sSNVs)

  • Identifying a set of informative features is critical for boosting performance and can in turn enhance our understanding of the molecular basis of deleterious sSNVs

Summary

Introduction

Although synonymous single nucleotide variants (sSNVs) do not alter protein sequences, they have been shown to play an important role in human disease. Gelfman et al. presented the Transcript-inferred Pathogenicity (TraP) score [13], which can be used to evaluate an sSNV’s ability to cause disease by damaging a gene’s transcripts and protein products. Besides these tools designed to predict functional sSNVs, several general-purpose variant effect predictors implicitly cover the effects of sSNVs. For example, FATHMM-MKL [14] is an integrative approach that predicts the functional consequences of both non-coding and coding sequence variants using various genomic annotations. Several splicing-specific predictors can also be used to evaluate the harmfulness of sSNVs, including SPANR [16], a tool for evaluating how SNVs cause splicing mis-regulation, and MutPred Splice [17], a machine-learning approach for identifying coding-region substitutions that disrupt pre-mRNA splicing.
