Abstract

Experimental functional annotation cannot keep pace with the rapid growth of protein sequence databases. Sequence-based methods, one class of computational approaches, have been developed to extend functional annotations to these fast-growing databases. We propose a novel sequence-based, hierarchy-aware method named GCL-GO. GCL-GO uses a protein language model to represent sequences and graph contrastive learning to represent GO terms, and then predicts protein functions by combining the two representations. By contrasting the GO graph with the semantic features of GO terms, GCL-GO embeds GO term features accurately while relying less on training data, which gives it generalizability and scalability. We also propose GCL-GO+, which combines a sequence similarity-based method with GCL-GO to improve performance. GCL-GO+ outperforms competing sequence-based methods on both the CAFA3 and TALE datasets. Furthermore, GCL-GO and GCL-GO+ demonstrate potential for functional generalization and scalability, achieving the best performance on new GO terms or on GO terms that are infrequently annotated in the training dataset. Our code is available at https://github.com/kch38896/GCL-GO
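
As a rough illustration of the two-stage idea described above (scoring GO terms from a sequence embedding and GO term embeddings, then blending with a similarity-based predictor), the following minimal sketch may help. It is not the authors' implementation: the dot-product-plus-sigmoid scoring rule, the weighted-average blend, the weight `alpha`, the function names, and the toy vectors and term labels are all assumptions made for illustration. The abstract does not specify how the features are combined.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_terms(protein_vec, term_vecs):
    """Score each GO term against one protein embedding.

    Assumed combination rule: sigmoid of the dot product between the
    protein embedding and each GO term embedding.
    """
    return {
        term: sigmoid(sum(p * t for p, t in zip(protein_vec, vec)))
        for term, vec in term_vecs.items()
    }


def blend(model_scores, similarity_scores, alpha=0.5):
    """Blend model scores with similarity-based scores.

    Assumed rule: a weighted average with hypothetical weight alpha.
    Terms missing from similarity_scores are treated as 0.0.
    """
    return {
        term: alpha * s + (1.0 - alpha) * similarity_scores.get(term, 0.0)
        for term, s in model_scores.items()
    }


# Toy inputs: in practice the protein vector would come from a protein
# language model and the term vectors from a trained GO term encoder.
protein_vec = [0.5, -1.0, 2.0]
term_vecs = {"term_a": [1.0, 0.0, 1.0], "term_b": [0.0, 1.0, -1.0]}

model_scores = score_terms(protein_vec, term_vecs)
similarity_scores = {"term_a": 1.0}  # e.g. transferred from a similar sequence
combined = blend(model_scores, similarity_scores, alpha=0.5)

for term, score in sorted(combined.items()):
    print(f"{term}: {score:.3f}")
```

Because GO terms are scored through their embeddings rather than through a fixed output layer, a scheme of this shape can in principle score a term that was rare or absent in training, provided an embedding for it exists. That is the property the abstract refers to as generalizability.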
