Abstract
Protein tertiary structure is indispensable for revealing the biological functions of proteins. De novo prediction of protein tertiary structure depends on protein fold recognition. This study proposes a novel method for predicting protein fold types that takes the primary sequence as input. The proposed method, PFP-RFSM, employs a random forest classifier and a comprehensive feature representation that includes both sequence descriptors and predicted structure descriptors. In particular, we propose a method for generating features based on sequence motifs, and these features are employed in protein fold prediction for the first time. PFP-RFSM and ten representative protein fold predictors are validated on a benchmark dataset consisting of 27 fold types. Experiments demonstrate that PFP-RFSM outperforms all ten of these predictors and improves the success rates by 2%-14%. The results suggest that sequence motifs are effective in the classification and analysis of protein sequences.
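The abstract does not spell out how the motif-based features are generated, so the following is only a minimal sketch of the general idea: scanning a primary sequence for a set of motifs and turning the occurrence counts into a fixed-length feature vector. The motif patterns, the name MOTIFS and the function motif_features are hypothetical and chosen for illustration; they are not the motifs or the procedure used by PFP-RFSM.

```python
import re

# Hypothetical motif patterns, for illustration only.
# They are NOT the motifs used by PFP-RFSM.
MOTIFS = {
    "cys_pair": r"C..C",        # two cysteines separated by two residues
    "gly_x_gly": r"G.G",        # glycine, any residue, glycine
    "lys_arg_run": r"[KR]{2,}", # run of at least two basic residues
}


def motif_features(sequence, motifs=MOTIFS):
    """Return one occurrence count per motif, ordered by motif name."""
    seq = sequence.upper()
    features = []
    for name in sorted(motifs):
        # Wrapping the pattern in a lookahead lets overlapping
        # occurrences be counted (e.g. "GGGG" contains G.G twice).
        pattern = re.compile("(?=(" + motifs[name] + "))")
        features.append(sum(1 for _ in pattern.finditer(seq)))
    return features


# Feature order is alphabetical: cys_pair, gly_x_gly, lys_arg_run.
example = motif_features("GAGCAACKK")
print(example)
```

A vector like this could be concatenated with other sequence and predicted-structure descriptors before being passed to a classifier. Dividing each count by the sequence length is one possible way to reduce length bias, but that choice is also an assumption here.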
Highlights
Protein structures are indispensable for revealing the regularities associated with protein functions, interactions and the cell cycle [1,2,3]
We propose a method for generating features based on sequence motifs, and these features are employed in protein fold prediction for the first time
We first validate the performance of the random forest classifier by comparing it with a variety of machine learning classifiers, including support vector machine (SVM), the Kstar algorithm, Nearest Neighbour (IB1), Naïve Bayes and Multiple Logistic Regression, on the same feature representation
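The comparison protocol in the last highlight (several classifiers evaluated on one shared feature representation) can be sketched as a small cross-validation harness. This is an illustration under stated assumptions, not the study's experimental setup: the data are synthetic, the fold count and seed are arbitrary, and the two classifiers are simple stand-ins. one_nn is a nearest-neighbour rule in the spirit of IB1, and nearest_centroid is a placeholder for any other classifier; a random forest or SVM would be plugged in through the same three-argument interface.

```python
import math
import random
from collections import defaultdict


def one_nn(train_X, train_y, x):
    """Nearest-neighbour rule: copy the label of the closest training row."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def nearest_centroid(train_X, train_y, x):
    """Assign x to the class whose mean feature vector is closest."""
    groups = defaultdict(list)
    for row, label in zip(train_X, train_y):
        groups[label].append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def cross_val_accuracy(classify, X, y, k=5, seed=0):
    """Fraction of samples classified correctly, pooled over k folds.

    The fixed seed gives every classifier exactly the same folds,
    so differences in accuracy come from the classifier alone.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(classify(train_X, train_y, X[i]) == y[i] for i in fold)
    return correct / len(X)


# Synthetic two-class data (an assumption; not the 27-fold benchmark).
rng = random.Random(1)
X = [[rng.gauss(c, 0.1), rng.gauss(c, 0.1)] for c in (0.0, 5.0) for _ in range(20)]
y = ["fold_a"] * 20 + ["fold_b"] * 20

for name, clf in [("1-NN", one_nn), ("nearest centroid", nearest_centroid)]:
    print(name, cross_val_accuracy(clf, X, y))
```

Holding the feature matrix and the folds fixed while swapping only the classifier is what makes the resulting accuracies directly comparable.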
Summary
Protein structures are indispensable for revealing the regularities associated with protein functions, interactions and the cell cycle [1,2,3]. The structures of 31,509,804 protein sequences have not been solved experimentally and need to be studied through computational methods. The wide and widening gap between known protein sequences and known protein structures with annotated biological functions motivates the development of in-silico methods for protein sequence analysis, protein tertiary structure prediction, and protein function annotation. The template-based method is, in essence, an algorithm that identifies templates, i.e., solved protein structures, for a query protein sequence. Both homology modeling [7] and threading [8] are template-based methods and have been successful in protein tertiary structure prediction. SCOP and CATH only classify protein domains with known structures and cannot classify proteins whose tertiary structures are unknown. The first level of the hierarchy of SCOP and CATH is