Abstract
BackgroundProtein post-translational modification (PTM) is a key issue to investigate the mechanism of protein’s function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins.MethodWe proposed a new multi-classification machine learning pipeline MultiLyGAN to identity seven types of lysine modified sites. Using eight different sequential and five structural construction methods, 1497 valid features were remained after the filtering by Pearson correlation coefficient. To solve the data imbalance problem, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), two influential deep generative methods were leveraged and compared to generate new samples for the types with fewer samples. Finally, random forest algorithm was utilized to predict seven categories.ResultsIn the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. Alternatively, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN.ConclusionsThe CWGAN greatly improved the predictive performance in all experiments. Features derived from CKSAAP, PWM and structure schemes are the most informative and had the greatest contribution to the prediction of PTM.
Highlights
Protein post-translational modification (PTM) is a key issue to investigate the mechanism of protein’s function
The results indicated that Conditional Wasserstein Generative Adversarial Network (CWGAN) better solved the existing data imbalance and stabilized the training error
The CWGAN greatly improved the predictive performance in all experiments
Summary
Protein post-translational modification (PTM) is a key issue to investigate the mechanism of protein’s function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins. As a common occurrence in the body, protein-translational modification (PTM) plays an important role in regulating various physiological processes and functions. In the past few decades, the advancement of proteomics technology and the development of “Big Data” on protein sequences shed light on the substantial study of protein nature. High-throughput biological technology has made tremendous achievements in protein PTM identification and analysis, the conventional approaches require expensive labor but get an unsatisfactory understanding of the relationship between structures and functions. It is of paramount significance to develop reliable and efficient computational methods for predicting and analyzing modifications
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.