RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins.

Lukasz Kurgan, Rasna R Walia, Vasant Honavar, Drena Dobbs, Katherine Wilkins, Li C Xue, Yasser El-Manzalawy

doi:10.1371/journal.pone.0097725

Abstract

Protein-RNA interactions are central to essential cellular processes such as protein synthesis and regulation of gene expression and play roles in human infectious and genetic diseases. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases and functional implications of such interactions and for developing effective approaches to rational drug design. Sequence-based computational methods offer a viable, cost-effective way to identify putative RNA-binding residues in RNA-binding proteins. Here we report two novel approaches: (i) HomPRIP, a sequence homology-based method for predicting RNA-binding sites in proteins; (ii) RNABindRPlus, a new method that combines predictions from HomPRIP with those from an optimized Support Vector Machine (SVM) classifier trained on a benchmark dataset of 198 RNA-binding proteins. Although highly reliable, HomPRIP cannot make predictions for the unaligned parts of query proteins, and its coverage is limited by the availability of close sequence homologs of the query protein with experimentally determined RNA-binding sites. RNABindRPlus overcomes these limitations. We compared the performance of HomPRIP and RNABindRPlus with that of several state-of-the-art predictors on two test sets, RB44 and RB111. On a subset of proteins for which homologs with experimentally determined interfaces could be reliably identified, HomPRIP outperformed all other methods, achieving an MCC of 0.63 on RB44 and 0.83 on RB111. RNABindRPlus was able to predict RNA-binding residues of all proteins in both test sets, achieving MCCs of 0.55 on RB44 and 0.37 on RB111, and outperforming all other methods, including those that make use of structure-derived features of proteins. More importantly, RNABindRPlus outperforms all other methods for any choice of tradeoff between precision and recall.
An important advantage of both HomPRIP and RNABindRPlus is that they rely on readily available sequence and sequence-derived features of RNA-binding proteins. A webserver implementation of both methods is freely available at http://einstein.cs.iastate.edu/RNABindRPlus/.
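The MCC values quoted in the abstract refer to the Matthews correlation coefficient, a standard summary of binary classification quality. The sketch below shows how such a value can be computed from residue-level confusion counts. The function name and the choice to return 0.0 when the denominator vanishes are assumptions made for illustration; the source does not describe its evaluation code.

```python
import math


def matthews_corrcoef_from_counts(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: report 0.0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

An MCC of 1.0 corresponds to perfect agreement between predicted and actual binding residues, 0.0 to chance-level agreement, and -1.0 to complete disagreement.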

Highlights

Protein-RNA interactions play key roles in many vital cellular processes including translation [1,2], post-transcriptional regulation of gene expression [3,4], RNA splicing [5,6], and viral replication [7,8].
Because of the cost and effort involved in the experimental determination of protein-RNA complex structures [20,21] and RNA-binding sites in proteins [22,23], considerable effort has been directed at developing reliable computational methods for predicting RNA-binding residues in proteins.
Rationale for the homology-based approach: if RNA-binding residues are conserved across homologous proteins, a simple sequence homology-based approach can be used to predict RNA-binding residues in a query protein: (i) identify close sequence homologs of the query protein; (ii) infer the RNA-binding residues of the query protein from the known RNA-binding residues of the homolog(s) aligned with it.
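The homology-based rationale can be illustrated with a minimal sketch of the label-transfer step. This is not the HomPRIP implementation: the function name, the gapped-string alignment representation, and the use of a single homolog are all assumptions made for illustration, and the homolog search and alignment themselves are taken as given.

```python
from typing import List, Optional


def transfer_binding_labels(
    aligned_query: str,
    aligned_homolog: str,
    homolog_labels: List[int],
    query_length: int,
    query_start: int = 0,
    homolog_start: int = 0,
) -> List[Optional[int]]:
    """Copy known binding labels from one homolog onto aligned query residues.

    aligned_query / aligned_homolog: equal-length gapped strings ('-' = gap)
    covering the aligned region only (hypothetical input format).
    homolog_labels: 1 (RNA-binding) or 0 (non-binding) per homolog residue.
    query_start / homolog_start: 0-based offsets of the aligned region.
    Returns one entry per query residue; None marks residues with no
    aligned homolog residue, for which no prediction can be made.
    """
    if len(aligned_query) != len(aligned_homolog):
        raise ValueError("aligned sequences must have equal length")

    predictions: List[Optional[int]] = [None] * query_length
    q, h = query_start, homolog_start
    for q_char, h_char in zip(aligned_query, aligned_homolog):
        if q_char != "-" and h_char != "-":
            # Aligned residue pair: inherit the homolog's label.
            predictions[q] = homolog_labels[h]
        if q_char != "-":
            q += 1
        if h_char != "-":
            h += 1
    return predictions


# Toy example: query "MKRTAY" aligned to homolog "KRGT" over query residues 1-4.
example = transfer_binding_labels("KR-TA", "KRGT-", [1, 1, 0, 0], query_length=6, query_start=1)
```

In the toy example, the first, fifth, and sixth query residues have no aligned homolog residue and stay `None`, which mirrors the limitation noted in the abstract that a purely homology-based method cannot make predictions for unaligned parts of the query.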

Summary

Introduction

Protein-RNA interactions play key roles in many vital cellular processes including translation [1,2], post-transcriptional regulation of gene expression [3,4], RNA splicing [5,6], and viral replication [7,8]. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases, the underlying mechanisms, and functional implications of protein-RNA interactions. Such understanding is essential for the success of efforts aimed at identifying novel therapies for genetic and infectious diseases. Homology-based methods have been shown to outperform other methods whenever close sequence or structural homologs of query proteins (used as templates) can be reliably identified [48,49,53]. Based on their analysis of a dataset of 261 protein-RNA complexes, Spriggs and Jones [54] concluded that RNA-binding residues are more conserved than other surface residues in RNA-binding proteins. We demonstrate that RNABindRPlus substantially outperforms existing sequence-based and structure-based methods. Both HomPRIP and RNABindRPlus have been implemented in a webserver that can be used to reliably predict RNA-binding residues in proteins, even when the structure of the query protein is unavailable.

Results and Discussion

Materials and Methods

Conclusions