Improving protein fold recognition by random forest.

Taeho Jo, Jianlin Cheng

doi:10.1186/1471-2105-15-s11-s14

Abstract

Background: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been effectively applied to tackle this problem. In our work, we developed RF-Fold, which uses random forest, one of the most powerful and scalable machine learning classification methods, to recognize protein folds.

Results: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold on the standard Lindahl's benchmark dataset, comprising 976 × 975 target-template protein pairs, through cross-validation. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance at each level of difficulty: the easiest family level, the medium-hard superfamily level, and the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at family, superfamily, and fold levels.

Conclusions: The good performance achieved by RF-Fold demonstrates the random forest's effectiveness for protein fold recognition.
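The top-one and top-five recognition rates quoted in the abstract can be illustrated with a minimal sketch. It assumes each target-template pair already has a classifier score and a correctness flag; the function name, the tuple layout, and the toy data are hypothetical. It also assumes that targets with no correct template at the given level are left out of the denominator, which the abstract does not state.

```python
from collections import defaultdict

def top_k_recognition_rate(pairs, k):
    """Fraction of targets with a correct template among their k best-scored templates.

    pairs: iterable of (target_id, template_id, score, is_correct).
    Assumption: targets without any correct template are excluded.
    """
    by_target = defaultdict(list)
    for target, _template, score, is_correct in pairs:
        by_target[target].append((score, is_correct))

    hits = 0
    evaluated = 0
    for scored in by_target.values():
        if not any(correct for _, correct in scored):
            continue  # no correct template exists for this target
        evaluated += 1
        # Rank templates from highest to lowest score.
        ranked = sorted(scored, key=lambda s: s[0], reverse=True)
        if any(correct for _, correct in ranked[:k]):
            hits += 1
    return hits / evaluated if evaluated else 0.0

# Hypothetical toy scores, not data from the paper.
toy_pairs = [
    ("t1", "a", 0.9, True), ("t1", "b", 0.2, False),
    ("t2", "a", 0.8, False), ("t2", "c", 0.6, True), ("t2", "b", 0.1, False),
    ("t3", "a", 0.5, False), ("t3", "b", 0.4, False),
]
print(top_k_recognition_rate(toy_pairs, 1))  # 0.5
print(top_k_recognition_rate(toy_pairs, 2))  # 1.0
```

In the toy data, t1 is recognized at rank one, t2 only at rank two, and t3 is skipped because none of its templates is correct.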

Highlights

Recognizing the correct structural fold among known template protein structures for a target protein is essential for template-based protein structure modeling.
Because random forest selects a random subset of input features to construct each decision tree, the average prediction of a sufficient number of decision trees is robust against the existence of irrelevant features, which partially contributes to its good accuracy.
Comparison of random forest with a single decision tree: We compared a random forest consisting of 500 decision trees to a single decision tree in terms of the error rate.
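The idea behind the last two highlights can be sketched with a toy ensemble in plain Python. This is not RF-Fold: each "tree" here is a one-split decision stump trained on a bootstrap sample and a random feature subset, and the ensemble predicts by majority vote. The synthetic data (three informative features plus five irrelevant ones), all function names, and the ensemble size of 51 are assumptions chosen for illustration.

```python
import random

def fit_stump(X, y, features):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if row[f] <= t else right) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    return best[1:]

def predict_stump(stump, row):
    f, t, left, right = stump
    return left if row[f] <= t else right

def fit_forest(X, y, n_trees, n_sub, rng):
    """Train n_trees stumps, each on a bootstrap sample and n_sub random features."""
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(d), n_sub)         # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    votes = sum(predict_stump(s, row) for s in forest)
    return 1 if 2 * votes > len(forest) else 0      # majority vote

def error_rate(predict, X, y):
    return sum(predict(r) != label for r, label in zip(X, y)) / len(y)

def make_data(n, rng, n_noise=5):
    """Synthetic data: features 0-2 determine the label, the rest are irrelevant."""
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(3 + n_noise)]
        X.append(row)
        y.append(1 if row[0] + row[1] + row[2] > 1.5 else 0)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(100, rng)
X_test, y_test = make_data(200, rng)
single = fit_stump(X_train, y_train, range(len(X_train[0])))
forest = fit_forest(X_train, y_train, n_trees=51, n_sub=3, rng=rng)
print("single stump error:", error_rate(lambda r: predict_stump(single, r), X_test, y_test))
print("forest error:", error_rate(lambda r: predict_forest(forest, r), X_test, y_test))
```

The printed error rates depend on the random seed and the synthetic data, so they say nothing about the 500-tree comparison reported in the paper.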

Summary

Introduction

Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the number of unique protein structures appears to be limited (e.g., several thousand) according to the structural analysis on all the tertiary protein structures in the Protein Data Bank (PDB) [6], it is possible to identify one correct template structure (fold) for a large portion of target proteins. This is the case if a target protein has significant sequence identity with one of the template proteins with a known tertiary structure. Machine learning methods have been used to tackle the problem effectively by casting fold recognition as a binary classification problem: deciding whether or not a target protein shares the same structural fold with a template protein in a protein structure library [6,7,8].
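The binary classification framing can be made concrete with a small sketch that turns a set of proteins with known fold labels into labeled target-template pairs. The protein IDs, fold labels, and function name are hypothetical, and the per-pair features a real classifier would need are left out.

```python
from itertools import permutations

def build_pairs(folds):
    """Build ordered (target, template, label) pairs; label is 1 if both share a fold.

    folds: dict mapping a protein ID to its fold label (hypothetical input format).
    """
    return [(a, b, int(folds[a] == folds[b])) for a, b in permutations(folds, 2)]

# Hypothetical toy library of four proteins in three folds.
folds = {"p1": "a", "p2": "a", "p3": "b", "p4": "c"}
pairs = build_pairs(folds)
positives = sum(label for _, _, label in pairs)
print(len(pairs), positives)  # 12 2

# Ordered pairs for 976 proteins, each paired with the 975 others.
print(976 * 975)  # 951600
```

Even in this toy library only 2 of 12 pairs are positive, which shows why a pairwise dataset of this kind tends to be imbalanced.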

Journal: BMC Bioinformatics
Publication Date: Oct 21, 2014
Citations: 95
License type: cc-by

Similar Papers

The recognition of multi-class protein folds by adding average chemical shifts of secondary structure elements
Zhenxing Feng ... Muhammad Aqeel Ashraf
Saudi Journal of Biological Sciences | VOL. 23
11 Dec 2015

Computational Methods for Protein Structure Prediction and Fold Recognition
Iwona A Cymerman ... Marcin Pawłowski
01 Jan 2008

Transmembrane protein alignment and fold recognition based on predicted topology.
Han Wang ... Gajendra P.S Raghava
PLoS ONE | VOL. 8
19 Jul 2013

Improving protein fold recognition with hybrid profiles combining sequence and structure evolution.
Yassine Ghouzam ... Alexandre G De Brevern
Bioinformatics | VOL. 31
07 Aug 2015
