Positive-unlabelled learning of glycosylation sites in the human proteome

Fuyi Li, Yang Zhang, Jiangning Song, Kuo-Chen Chou, Trevor Lithgow, Geoffrey I Webb, Anthony W Purcell, Chen Li

doi:10.1186/s12859-019-2700-1

Abstract

Background: As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across the three domains of life (Eukarya, Bacteria and Archaea) demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations at, and in the proximity of, glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation sites can complement laboratory-based methods and greatly benefit experimental efforts to characterize and understand the functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating promising predictive performance. Training a conventional supervised-learning model requires both reliable positive samples and reliable negative samples. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitations of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites).

Results: In this study, we propose a positive unlabelled (PU) learning-based method, PA2DE (V2.0), based on the AlphaMax algorithm, for protein glycosylation site prediction. The predictive performance of the proposed method was evaluated on a range of glycosylation data collected over a ten-year period at three-year intervals. Experiments using both benchmarking and independent tests show that our method outperformed representative supervised-learning algorithms (including support vector machines and random forests) and one-class learners, as well as currently available prediction methods, in terms of F1 score, accuracy and AUC. In addition, we developed an online web server as an implementation of the optimized model (available at http://glycomine.erc.monash.edu/Lab/GlycoMine_PU/) to facilitate community-wide efforts for accurate prediction of protein glycosylation sites.

Conclusion: The proposed PU learning approach achieved a competitive predictive performance compared with currently available methods. This PU learning schema may also be effectively applied to the prediction of other important types of protein PTM sites and functional sites.
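To make the positive-unlabelled setting concrete, the following is a minimal sketch of one well-known PU correction, the Elkan and Noto style "selected completely at random" rescaling. It is not PA2DE or AlphaMax, and nothing here is taken from the authors' implementation. The toy one-dimensional data, the function names (`make_pu_data`, `fit_logistic`, `posterior`) and all parameter values are assumptions chosen for illustration. The idea is to train a classifier to separate labelled positives from unlabelled samples, estimate the labelling frequency `c`, and rescale the classifier output by `1/c`. For brevity, `c` is estimated on the training positives rather than on a held-out set, and the logistic model is only a rough stand-in for a calibrated classifier.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_pu_data(n=1000, prior=0.3, label_freq=0.5, seed=0):
    """Toy 1-D data (assumed, not from the paper).

    y is the hidden true class; s = 1 marks a labelled positive.
    Unlabelled samples (s = 0) mix true negatives and unlabelled positives.
    """
    rng = random.Random(seed)
    xs, ys, ss = [], [], []
    for _ in range(n):
        y = 1 if rng.random() < prior else 0
        x = rng.gauss(2.0 if y else -2.0, 1.0)
        s = 1 if (y == 1 and rng.random() < label_freq) else 0
        xs.append(x)
        ys.append(y)
        ss.append(s)
    return xs, ys, ss


def fit_logistic(xs, ss, lr=0.1, epochs=300):
    """Fit g(x) ~ P(s = 1 | x) with 1-D logistic regression (batch gradient descent)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, s in zip(xs, ss):
            err = sigmoid(w * x + b) - s
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


xs, ys, ss = make_pu_data()
w, b = fit_logistic(xs, ss)

# Labelling frequency c = P(s = 1 | y = 1), estimated as the mean of g(x)
# over the labelled positives.
labelled = [x for x, s in zip(xs, ss) if s == 1]
c_hat = sum(sigmoid(w * x + b) for x in labelled) / len(labelled)


def posterior(x):
    """Rescaled estimate of P(y = 1 | x) = g(x) / c, clipped to at most 1."""
    return min(1.0, sigmoid(w * x + b) / c_hat)


# Class prior estimate: P(y = 1) = P(s = 1) / c.
prior_hat = (sum(ss) / len(ss)) / c_hat

print("estimated labelling frequency c:", round(c_hat, 3))
print("estimated class prior:", round(prior_hat, 3))
print("true positive fraction in toy data:", round(sum(ys) / len(ys), 3))
```

The hidden labels `ys` are used only to report the true positive fraction for comparison. A real PU problem has no access to them, which is why the class prior has to be estimated.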

Highlights

As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function.
Our results suggest that positive unlabelled (PU) learning has two advantages over traditional supervised-learning techniques: 1) PU learning is fast and simple, significantly reduces the effort and time needed to label samples, and can achieve performance competitive with supervised-learning algorithms [36,37,38]; and 2) PU learning is well suited to bioinformatics and computational biology settings, where a sizable portion of previously unidentified samples is likely mislabelled.
The results showed that PA2DE (V2.0) achieved an outstanding predictive performance in terms of F1 score, accuracy (ACC) and area under the curve (AUC).
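For readers unfamiliar with the three measures named above, the sketch below computes them with the standard definitions, using only the Python standard library. The labels, scores and the 0.5 decision threshold are made-up illustration values, not data or settings from the study, and the helper names (`confusion`, `accuracy`, `f1_score`, `auc`) are hypothetical.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (TP + FP + FN + TN)."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    """F1 = 2 * TP / (2 * TP + FP + FN); defined as 0.0 if the denominator is 0."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = glycosylation site, 0 = non-site.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print("ACC:", accuracy(y_true, y_pred))
print("F1: ", f1_score(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```

Note that ACC and F1 depend on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives.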

Summary

Introduction

As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. Owing to its critical role in a wide variety of major pathways, protein glycosylation is associated with a range of human diseases, including diabetes [13,14,15], cancers [16,17,18,19,20], and autoimmune diseases [21,22,23]. In light of these strong associations, and in the current era of precision medicine, there is an urgent need for computational tools that accurately predict glycosylation sites, so that potential candidates can be prioritized for experimental validation and their biological functions elucidated.

Methods

Results

Conclusion