ProtPlat: an efficient pre-training platform for protein classification based on FastText

Yuan Jin, Yang Yang

doi:10.1186/s12859-022-04604-2

Abstract

Background: Over the past decades, the rapid growth of protein sequence data in public databases has driven the development of many machine learning methods that predict physicochemical properties or functions of proteins from amino acid sequence features. However, prediction performance often suffers from a lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing, while pre-training techniques designed specifically for protein sequences remain few.

Results: In this paper, we propose a pre-training platform for representing protein sequences, called ProtPlat, which uses the Pfam database to train a three-layer neural network and then fine-tunes the model on the training data of a specific downstream task. ProtPlat can learn good representations for amino acids and, at the same time, achieve efficient classification. We conduct experiments on three protein classification tasks: the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training effectively enhances model performance and that ProtPlat is competitive with state-of-the-art predictors, especially on small datasets. We implement the ProtPlat platform as a web service (https://compbio.sjtu.edu.cn/protplat) that is accessible to the public.

Conclusions: To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for the pre-training of protein sequences. It features large-scale supervised training on the Pfam database and an efficient learning model, FastText. The experimental results on three downstream classification tasks demonstrate the efficacy of ProtPlat.
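The pre-train-then-fine-tune idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a toy FastText-style model (embedding lookup, mean pooling, linear softmax) written in plain Python, and the class name `TinyFastText`, the hyperparameters, and the toy sequences and labels are all hypothetical stand-ins for Pfam families and a downstream task.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class TinyFastText:
    """Toy three-layer model: embedding lookup -> mean pooling -> softmax."""

    def __init__(self, n_classes, dim=16, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One embedding vector per amino acid (the "vocabulary" in this sketch).
        self.emb = {a: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
                    for a in AMINO_ACIDS}
        self.reset_output(n_classes)

    def reset_output(self, n_classes):
        # Fresh output layer; the learned embeddings are kept for transfer.
        self.out = [[0.0] * self.dim for _ in range(n_classes)]

    def hidden(self, seq):
        # Mean of the embeddings of all recognised residues in the sequence.
        toks = [a for a in seq if a in self.emb]
        h = [0.0] * self.dim
        for a in toks:
            for j, v in enumerate(self.emb[a]):
                h[j] += v / len(toks)
        return toks, h

    def predict_proba(self, seq):
        _, h = self.hidden(seq)
        scores = [sum(w * x for w, x in zip(row, h)) for row in self.out]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def fit(self, data, epochs=200, lr=0.5):
        for _ in range(epochs):
            for seq, label in data:
                toks, h = self.hidden(seq)
                probs = self.predict_proba(seq)
                # Cross-entropy gradient w.r.t. the class scores: p - onehot.
                grad = [p - (1.0 if k == label else 0.0)
                        for k, p in enumerate(probs)]
                # Gradient w.r.t. the pooled vector, taken before updating out.
                gh = [sum(grad[k] * self.out[k][j] for k in range(len(grad)))
                      for j in range(self.dim)]
                for k, g in enumerate(grad):
                    for j in range(self.dim):
                        self.out[k][j] -= lr * g * h[j]
                for a in toks:
                    for j in range(self.dim):
                        self.emb[a][j] -= lr * gh[j] / len(toks)


# Hypothetical "pre-training" set: three made-up families (stand-ins for Pfam).
pretrain = [("LLVVIIAALL", 0), ("AVLIVLAVIL", 0),
            ("KKDDEERRKK", 1), ("DEKRDEKRDE", 1),
            ("SSTTNNQQST", 2), ("NQSTNQSTNQ", 2)]
# Hypothetical small downstream task with two classes.
finetune = [("LLLLAVVI", 0), ("KDEKRRDE", 1)]

model = TinyFastText(n_classes=3)
model.fit(pretrain)           # supervised pre-training
model.reset_output(2)         # new task head, embeddings retained
model.fit(finetune, epochs=100)
```

After fine-tuning, `model.predict_proba("VVLLIIAA")` returns a two-element probability list for the downstream classes. The design point the sketch tries to convey is that only the output layer is re-initialised, so the downstream task starts from embeddings already shaped by the larger labeled set.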

Highlights

Over the past decades, the rapid growth of protein sequence data in public databases has driven the development of many machine learning methods that predict physicochemical properties or functions of proteins from amino acid sequence features
... results in function-related prediction tasks based on protein sequences, such as protein subcellular localization [22,23,24], protein structural characteristics prediction [28, 29], and protein–protein interaction prediction [30, 31]
We evaluate the performance of the platform on three downstream protein classification tasks with different data scales, namely the identification of type III secreted effectors, the prediction of protein subcellular localization, and the recognition of signal peptides

Summary

Introduction

Over the past decades, the rapid growth of protein sequence data in public databases has driven the development of many machine learning methods that predict physicochemical properties or functions of proteins from amino acid sequence features. To accelerate the study of protein function, researchers have developed a variety of machine learning methods based on the known data in large databases [3, 4]. Pre-training for protein sequences, however, differs from pre-training for natural language. For one thing, there are no defined words in amino acid sequences, while the pre-training of embedding vectors mostly relies on language modeling, e.g., word prediction. For another, protein sequences have a much smaller alphabet but are much longer than natural language sentences, which poses new challenges for learning models.
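To make the "no defined words" problem concrete, one common workaround (an assumption here, since this summary does not say how ProtPlat tokenizes sequences) is to treat overlapping k-mers of residues as pseudo-words. The function name `kmer_tokens` and the example sequence are hypothetical.

```python
def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("pseudo-words")."""
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens at all.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokens("MKTAYIAK", k=3)
# tokens == ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']

# With a 20-letter alphabet, the number of possible k-mers grows as 20**k,
# so the choice of k trades vocabulary size against context per token.
vocab_sizes = {k: 20 ** k for k in (1, 2, 3)}
```

With k = 1 the vocabulary is just the 20 amino acids, which reflects the small alphabet noted above; larger k gives richer tokens at the cost of a much larger vocabulary.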



Journal: BMC Bioinformatics
Citation: Jin and Yang, BMC Bioinformatics (2022) 23:66
Publication Date: Feb 11, 2022
Citations: 6
License type: open-access


