Predicting protein residue-residue contacts using random forests and deep networks

Joseph Luttrell, Tong Liu, Zheng Wang, Chaoyang Zhang

doi:10.1186/s12859-019-2627-6


Open Access

https://doi.org/10.1186/s12859-019-2627-6


Abstract

Background: The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins. For example, contact prediction can be used to reduce the computational complexity of predicting the structure of proteins and even to help identify functionally important regions of proteins. These predictions are becoming especially important given the relatively low number of experimentally determined protein structures compared to the amount of available protein sequence data.

Results: We developed and benchmarked a set of machine learning methods for residue-residue contact prediction, including random forests, direct-coupling analysis, support vector machines, and deep networks (stacked denoising autoencoders). These methods predict contacting residue pairs given only the amino acid sequence of a protein. According to our own evaluations, performed at a resolution of ± two residues, the predictors trained with the random forest algorithm were our top-performing methods, with average top 10 prediction accuracy scores of 85.13% (short range), 74.49% (medium range), and 54.49% (long range). Our ensemble models (stacked denoising autoencoders combined with support vector machines) were our best-performing deep network predictors and achieved top 10 prediction accuracy scores of 75.51% (short range), 60.26% (medium range), and 43.85% (long range) under the same evaluation. These tests were performed blindly on targets from the CASP11 dataset, and the results suggest that our models achieve performance comparable to contact predictors developed by groups that participated in CASP11.

Conclusions: Because contact prediction is a challenging problem, it is beneficial to develop and benchmark a variety of prediction methods. Our work has produced useful tools with a simple interface that provide contact predictions to users without requiring a lengthy installation process. In addition, we have released our C++ implementation of the direct-coupling analysis method as a standalone software package. Both this tool and our RFcon web server are freely available to the public at http://dna.cs.miami.edu/RFcon/.

Highlights

The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins.
TP is the number of residue pairs that were correctly predicted to be in contact, FP is the number of residue pairs that were incorrectly predicted to be in contact, and nativePositives is the total number of residue pairs that were in contact in the native structure of the protein being evaluated.
These tests were performed using the target proteins from the CASP11 experiment, which ensures that none of the evaluated prediction methods were trained on proteins that exist in this testing set.
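
The quantities TP, FP, and nativePositives defined above can be combined into the two scores commonly reported for contact predictors. The sketch below assumes the conventional definitions accuracy = TP / (TP + FP) and coverage = TP / nativePositives; it is not taken from the authors' code. The function names and the `tol` parameter are hypothetical. In particular, `tol` encodes one possible reading of "a resolution of ± two residues", in which a predicted pair counts as correct when a native contact lies within `tol` positions of it in both indices.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]  # (i, j) residue indices with i < j


def is_hit(pred: Pair, native: Set[Pair], tol: int = 0) -> bool:
    """Return True if some native contact lies within +/- tol of pred in both indices.

    The tolerance rule is an assumed interpretation, not the paper's stated procedure.
    """
    i, j = pred
    return any(
        (i + di, j + dj) in native
        for di in range(-tol, tol + 1)
        for dj in range(-tol, tol + 1)
    )


def accuracy_and_coverage(
    predicted: Iterable[Pair], native: Set[Pair], tol: int = 0
) -> Tuple[float, float]:
    """Compute (accuracy, coverage) for a list of predicted contacts.

    accuracy = TP / (TP + FP)
    coverage = TP / nativePositives
    With tol > 0, several predictions may match the same native contact,
    so this simple coverage is only an approximation in that setting.
    """
    predicted = list(predicted)
    tp = sum(is_hit(p, native, tol) for p in predicted)  # correctly predicted pairs
    fp = len(predicted) - tp                             # incorrectly predicted pairs
    accuracy = tp / (tp + fp) if predicted else 0.0
    coverage = tp / len(native) if native else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    native_contacts = {(1, 10), (2, 20), (5, 30)}
    top_predictions = [(1, 10), (3, 21), (7, 40)]
    print(accuracy_and_coverage(top_predictions, native_contacts, tol=0))
    print(accuracy_and_coverage(top_predictions, native_contacts, tol=2))
```

To obtain a "top 10" score in this scheme, one would sort the predicted pairs by confidence and pass only the ten highest-ranked pairs as `predicted`.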

Summary

Introduction

The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins. Predicting which residue pairs within a protein are in contact can assist researchers by providing information about the native structure and other physical properties of that protein before they expend valuable resources on physical experiments [2]. This is especially true since evidence suggests that intra-molecular interactions among residues play an important role in determining the overall stability of a protein's native structure [3]. Sequence-based contact prediction research typically relies on machine learning and explores a wide variety of techniques, such as support vector machines (SVMs) [7, 8], neural networks [9], random forests (RF) [10, 11], and convolutional neural networks (CNNs) [12, 13]. While these methods vary in the technical details of their approach, they share the common goal of discovering patterns in protein data that appear when residue pairs are observed to be in contact.
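
To make the notion of a residue-residue contact concrete, the sketch below derives native contacts from 3D coordinates. It assumes the convention widely used in CASP-style evaluations: two residues are in contact when their representative atoms are closer than 8 Å, and pairs are grouped by sequence separation into short (6 to 11), medium (12 to 23), and long (24 or more) ranges. These thresholds and all function names are assumptions made for illustration; this section of the paper does not state them.

```python
import math
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def contacts_from_coords(
    coords: Sequence[Coord], threshold: float = 8.0, min_sep: int = 6
) -> List[Tuple[int, int]]:
    """List residue pairs (i, j), j - i >= min_sep, closer than `threshold` angstroms.

    `coords` holds one representative atom per residue (assumed, e.g. C-beta).
    """
    pairs = []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < threshold:
                pairs.append((i, j))
    return pairs


def range_class(i: int, j: int) -> Optional[str]:
    """Classify a pair by sequence separation (assumed CASP-style ranges)."""
    sep = abs(j - i)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None  # too close in sequence to be evaluated


if __name__ == "__main__":
    # Toy coordinates; min_sep=1 so that the tiny example yields contacts.
    toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(contacts_from_coords(toy, min_sep=1))
    print(range_class(3, 30))
```

A sequence-based predictor never sees these coordinates; they are used only to build the ground-truth labels against which predictions are scored.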

Methods

Results

Discussion

Conclusion