Abstract

Each human genome contains more than 500 amino acid substitutions, and bioinformatics tools are indispensable for determining their functional effects. We have developed a feature-based algorithm for the detection of mutations outside conserved functional domains (CFDs) and compared its classification efficacy with that of the most commonly used phylogeny-based tools, PolyPhen-2 and SIFT. The new algorithm is based on the informational spectrum method (ISM), a feature-based technique, and statistical analysis. Our dataset contained neutral polymorphisms and mutations associated with myeloid malignancies from the epigenetic regulators ASXL1, DNMT3A, EZH2, and TET2. PolyPhen-2 and SIFT had significantly lower accuracies than expected in predicting the effects of amino acid substitutions outside CFDs, with especially low sensitivity. By contrast, only the ISM algorithm showed statistically significant classification of these sequences. It outperformed PolyPhen-2 and SIFT by 15% and 13%, respectively. These results suggest that feature-based methods, like ISM, are more suitable for the classification of amino acid substitutions outside CFDs than phylogeny-based tools.

Highlights

  • Next-generation sequencing technologies are revolutionizing genetics by enabling the sequencing of whole genomes and exomes and by increasing our ability to connect different genotypes to specific phenotypes

  • The dataset contains 314 amino acid substitutions (AASs) in the epigenetic regulators ASXL1, EZH2, DNMT3A, and TET2. The 194 disease-associated and somatically acquired polymorphisms are labeled as mutations, while the 120 germline polymorphisms or polymorphisms present in the healthy population are labeled as single nucleotide polymorphisms (SNPs)

  • The most frequent mutations in the dataset are from acute myeloid leukemia (AML) cases (45%), while 12%, 13%, and 7% of mutations are from myelodysplastic syndromes (MDS), myeloproliferative neoplasms (MPN), and MDS/MPN, respectively



Introduction

Next-generation sequencing technologies are revolutionizing genetics by enabling the sequencing of whole genomes and exomes and by increasing our ability to connect different genotypes to specific phenotypes. The first group of methods approaches this issue from an evolutionary perspective and relies on multiple sequence alignments (MSAs) of homologous proteins. Methods such as PANTHER [8], PhD-SNP [9], and SIFT [10] presume that functionally important regions of a protein are conserved throughout evolution and assume a direct connection between the conservation of a residue and the functional effect of the AAS. The second strategy combines scores from MSAs with structural information as well as patterns of physicochemical properties of amino acid substitutions. These methods use machine learning algorithms, such as random forest (MutPred [11]), neural networks (SNAP [12]), or Bayesian classification (PolyPhen-2 [13]). The methods that unravel sequence periodicities encompass two steps: first, the sequence represented in alphabetic code is transformed into a series of numbers by assigning to each amino acid a value of a selected parameter and these series of
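The first of the two steps described for periodicity-based methods, turning an alphabetic protein sequence into a numeric series, might be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not name the parameter, so the sketch assumes the electron-ion interaction potential (EIIP) that is commonly used with ISM, with values as commonly tabulated in the literature. The function names `encode` and `substitute` and the six-residue peptide are hypothetical. The excerpt breaks off before the second step, so the sketch stops at the encoding.

```python
# Assumed lookup table: EIIP value per amino acid (one-letter code), as
# commonly tabulated in the ISM literature. Verify against the paper
# before relying on these numbers.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def encode(sequence, scale=EIIP):
    """Map a one-letter protein sequence to a list of parameter values."""
    values = []
    for residue in sequence.upper():
        if residue not in scale:
            raise ValueError(f"unknown residue {residue!r}")
        values.append(scale[residue])
    return values


def substitute(sequence, position, new_residue):
    """Return the sequence with the residue at a 1-based position replaced."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position out of range")
    return sequence[:position - 1] + new_residue + sequence[position:]


# Hypothetical peptide, not taken from ASXL1, DNMT3A, EZH2, or TET2.
wild_type = "MKRLDG"
mutant = substitute(wild_type, 3, "H")  # an R3H substitution

signal_wt = encode(wild_type)
signal_mut = encode(mutant)

# The two numeric series differ only at the substituted position.
changed = [i + 1 for i, (a, b) in enumerate(zip(signal_wt, signal_mut)) if a != b]
print(mutant, changed)
```

Keeping the scale as a plain dictionary makes it easy to swap in a different physicochemical parameter, which matches the excerpt's wording of "a selected parameter" rather than fixing one choice.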
