PFP-RFSM: Protein fold prediction by using random forests and sequence motifs

Junfei Li, Jigang Wu, Ke Chen

doi:10.4236/jbise.2013.612145

Abstract

Protein tertiary structure is indispensable for revealing the biological functions of proteins. De novo prediction of protein tertiary structure depends on protein fold recognition. This study proposes a novel method for predicting protein fold types that takes the primary sequence as input. The proposed method, PFP-RFSM, employs a random forest classifier and a comprehensive feature representation that includes both sequence and predicted structure descriptors. In particular, we propose a method for generating features based on sequence motifs; these features are employed in protein fold prediction for the first time. PFP-RFSM and ten representative protein fold predictors are validated on a benchmark dataset consisting of 27 fold types. Experiments demonstrate that PFP-RFSM outperforms all existing protein fold predictors and improves the success rates by 2%-14%. The results suggest that sequence motifs are effective in the classification and analysis of protein sequences.

Highlights

Protein structures are indispensable for revealing the regularities associated with protein functions, interactions and the cell cycle [1,2,3]
We propose a method for generating features based on sequence motifs; these features are employed in protein fold prediction for the first time
We first validate the performance of the random forest classifier by comparing it, on the same feature representation, with a variety of machine learning classifiers, including support vector machine (SVM), the KStar algorithm, Nearest Neighbour (IB1), Naïve Bayes and Multiple Logistic Regression
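
To illustrate the idea of motif-based features mentioned above, the following is a minimal sketch of how a protein sequence could be turned into a fixed-length feature vector, one entry per motif. It is not the feature-generation procedure of PFP-RFSM, which is not described in this section. The motif patterns, the function names `motif_features` and `motif_counts`, and the choice of regular expressions to represent motifs are all assumptions made for illustration.

```python
import re

# Hypothetical motifs written as regular expressions over amino-acid letters.
# These are illustrative placeholders, not motifs taken from the paper.
EXAMPLE_MOTIFS = ["G.G..G", "C..C", "[ST].[RK]"]


def motif_features(sequence, motifs=EXAMPLE_MOTIFS):
    """Return one 0/1 feature per motif: 1 if the motif occurs in the sequence."""
    return [1 if re.search(m, sequence) else 0 for m in motifs]


def motif_counts(sequence, motifs=EXAMPLE_MOTIFS):
    """Return the number of occurrences of each motif, counting overlaps."""
    # A zero-width lookahead lets overlapping matches be counted separately.
    return [len(re.findall(f"(?={m})", sequence)) for m in motifs]


if __name__ == "__main__":
    seq = "MGAGKSGCAAC"  # made-up sequence, for demonstration only
    print(motif_features(seq))
    print(motif_counts(seq))
```

A vector of this kind could then be concatenated with other sequence and predicted-structure descriptors and passed to any classifier, such as a random forest.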

Summary

Introduction

Protein structures are indispensable for revealing the regularities associated with protein functions, interactions and the cell cycle [1,2,3]. The structures of 31,509,804 protein sequences have not been experimentally solved and must be studied through computational methods. The wide and growing gap between known protein sequences and known protein structures with annotated biological functions motivates the development of in silico methods for protein sequence analysis, protein tertiary structure prediction, and protein function annotation. The template-based method is, in essence, an algorithm that identifies templates, i.e., solved protein structures, for a query protein sequence. Both homology modeling [7] and threading [8] are template-based methods and are successful in protein tertiary structure prediction. SCOP and CATH only classify protein domains with known structures and cannot classify proteins that lack solved tertiary structures. The first level of the hierarchy of SCOP and CATH is

Methods

Results

Conclusion