Abstract
Protein methylation plays vital roles in many biological processes and has been implicated in various human diseases. Identifying methylation sites is an initial but crucial step toward understanding the mechanisms underlying methylation, both for drug design and for research on methylation-related diseases. High-throughput bioinformatics methods have become essential for predicting these sites. In this study, we developed a novel method that is based only on sequence conservation to predict protein methylation sites. Conservation difference profiles between methylated and non-methylated peptides were constructed using information entropy (IE) over a wider neighboring interval around the methylation sites, so that the full environmental information was incorporated. Then, the distinctive neighboring residues were identified by their information gain (IG) importance scores. The most representative model was constructed with a support vector machine (SVM) for arginine and for lysine methylation separately. These models yielded promising results on both the benchmark dataset and the independent test set. The models were used to screen the entire human proteome, and many previously unknown substrates were identified. These results indicate that our method can serve as a useful supplement to elucidate the mechanism of protein methylation and to facilitate hypothesis-driven experimental design and validation.
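As an illustration of the conservation difference profile described above, the following is a minimal sketch that computes per-position Shannon entropy for two sets of equal-length peptide windows and subtracts one profile from the other. The abstract does not give the exact IE formula or window length, so the standard Shannon definition, the function names, and the toy peptides here are all assumptions made for illustration only.

```python
import math
from collections import Counter


def position_entropy(peptides, pos):
    """Shannon entropy (in bits) of the residues observed at one window position."""
    counts = Counter(p[pos] for p in peptides)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(peptides):
    """Entropy at every position of a set of equal-length peptide windows."""
    length = len(peptides[0])
    return [position_entropy(peptides, i) for i in range(length)]


def difference_profile(positives, negatives):
    """Non-methylated entropy minus methylated entropy, position by position.

    A large positive value means the position is more conserved (lower
    entropy) among methylated windows than among non-methylated ones.
    """
    pos_profile = entropy_profile(positives)
    neg_profile = entropy_profile(negatives)
    return [n - p for p, n in zip(pos_profile, neg_profile)]


# Hypothetical 5-residue windows centered on arginine (made-up data).
methylated = ["GGRGG", "GARGG", "GGRGA", "GGRGG"]
non_methylated = ["ALRKE", "TSRDV", "QPRLN", "MCRHW"]

diff = difference_profile(methylated, non_methylated)
```

In this toy data the central position has zero entropy in both sets, so its difference is zero, while the flanking positions are more conserved in the methylated windows.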
Highlights
Protein methylation plays vital roles in many biological processes and has been implicated in various human diseases
The polypeptide chains that are created by ribosomes undergo a series of “product-forming” steps, such as cutting, folding and posttranslational modification (PTM)
We attempted to identify distinctive positions from the far sides of a longer peptide sequence by combining position ranking via information gain (IG) and stepwise position selection via support vector machine (SVM)
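The position-ranking step mentioned in the last highlight can be sketched as follows. This is a hedged illustration of information gain computed per window position against the methylated/non-methylated label; it is not the authors' implementation, the function names and toy data are hypothetical, and the subsequent stepwise selection with an SVM is not shown.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(peptides, labels, pos):
    """Reduction in label entropy after splitting the samples by the residue at pos."""
    groups = defaultdict(list)
    for peptide, label in zip(peptides, labels):
        groups[peptide[pos]].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_positions(peptides, labels):
    """Window positions ordered from highest to lowest information gain."""
    length = len(peptides[0])
    gains = [(information_gain(peptides, labels, i), i) for i in range(length)]
    # Sort by descending gain; ties are broken by position index.
    return [i for _, i in sorted(gains, key=lambda t: (-t[0], t[1]))]


# Hypothetical 3-residue windows centered on lysine; 1 = methylated, 0 = not.
peptides = ["GKA", "GKC", "AKA", "AKC"]
labels = [1, 1, 0, 0]

ranking = rank_positions(peptides, labels)
```

Here position 0 separates the two classes perfectly and is ranked first, while the invariant central lysine carries no information about the label. A stepwise selection would then add positions in this order and keep those that improve a classifier's cross-validated performance.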
Summary
Protein methylation plays vital roles in many biological processes and has been implicated in various human diseases. The model was used to screen the entire human proteome, and many unknown substrates were identified. These results indicate that our method can serve as a useful supplement to elucidate the mechanism of protein methylation and facilitate hypothesis-driven experimental design and validation. Shi et al. [20] presented a method called PLMLA that incorporated protein sequence information, secondary structure and amino acid properties to predict methyllysine sites. Qiu et al. [24] developed a method called iMethyl-PseAAC by incorporating physicochemical, sequence evolutionary, and structural information into a pseudo amino acid composition analysis. Most of these methods applied an orthogonal encoding scheme to characterize the peptide sequence information, such that each amino acid is always represented by the same 20-bit binary vector, regardless of where it occurs. The source code, datasets and SVM models are freely available at http://cic.scu.edu.cn/bioinformatics/SourceCode_and_SVMmodel.zip
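The orthogonal encoding scheme mentioned above, as used by most of the earlier methods, can be sketched as below. The alphabetical ordering of the 20 standard amino acids and the function names are assumptions for illustration; each cited tool may order the bits differently.

```python
# Assumed residue order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(residue):
    """Return the 20-bit orthogonal (one-hot) vector for a single amino acid."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode_peptide(peptide):
    """Concatenate the per-residue vectors into one flat feature vector."""
    return [bit for residue in peptide for bit in encode_residue(residue)]


features = encode_peptide("GRK")
```

Because the vector for a residue depends only on its identity, a lysine is encoded identically at every window position; position-specific conservation is not reflected in the per-residue code itself.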