Abstract

Function annotation efforts provide a foundation for our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient, de novo predictions of protein function. However, available function annotation exists predominantly for individual proteins rather than for residues, of which only a subset is necessary to convey a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation (e.g., identification of DNA-binding proteins) or for which an excellent global representation can be divined. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which requires only knowledge of the protein’s function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance learning algorithm derived from AdaBoost and benchmarked it on two well-studied protein function prediction tasks: annotating proteins that bind DNA and RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark, it achieves near-perfect classification.
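To make the multiple-instance setting concrete, the sketch below treats each protein as a "bag" of residue feature vectors and labels the bag from its top-scoring residue. This is only an illustration of the standard multiple-instance assumption (a bag is positive if at least one instance is), not the AdaBoost-derived algorithm described above. The feature vectors, the names `score_protein` and `toy_scorer`, the weights, and the zero threshold are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Residue = Sequence[float]   # hypothetical per-residue feature vector
Protein = List[Residue]     # a "bag" of residue instances


def score_protein(
    protein: Protein, score_residue: Callable[[Residue], float]
) -> Tuple[float, int]:
    """Return (bag score, index of the top-scoring residue) using max aggregation."""
    scores = [score_residue(r) for r in protein]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def toy_scorer(residue: Residue) -> float:
    # Hypothetical linear scorer; the weights are chosen only for illustration.
    weights = (1.0, -0.5)
    return sum(w * x for w, x in zip(weights, residue))


# Only protein-level labels are assumed known: one binder, one non-binder.
binder = [(0.1, 0.9), (2.0, 0.2), (0.0, 0.0)]
non_binder = [(0.1, 0.9), (0.0, 0.4)]

binder_score, binder_idx = score_protein(binder, toy_scorer)
non_binder_score, _ = score_protein(non_binder, toy_scorer)

# A bag is called positive when its best residue scores above the threshold (0 here).
print(binder_score > 0, binder_idx, non_binder_score > 0)
```

The index returned alongside the bag score is what lets a protein-level prediction point back to a candidate functional residue, which is the property the abstract emphasizes.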

Highlights

  • Computational tools have become indispensable in guiding, analyzing, and simulating the mechanistic details underlying experimental studies

  • We demonstrate the ability of a multiple-instance learning (MIL) algorithm to accurately predict the function of a protein using its constituent residues with two benchmark nucleic-acid binding datasets: DNA- and RNA-binding proteins

  • Conventional approaches that apply machine learning to function prediction have relied on a global representation of the sequence or structure, or a local representation of a residue’s environment on a target protein

Introduction

Computational tools have become indispensable in guiding, analyzing, and simulating the mechanistic details underlying experimental studies. Recent innovations in high-throughput experiments for function discovery have provided sufficient data to model and understand the characteristics that govern specific function using machine-learning methods. High-throughput sequence and structural genomics projects have continued to outpace corresponding functional discovery projects, producing a deluge of protein data, with only a fraction having some functional annotation. This annotation typically provides an indication of the general function but rarely (and, when available, less reliably) provides mechanistic detail for a particular function. Systems biology research has focused on analyzing and predicting known interactions between proteins, whereas pharmaceutical research requires greater knowledge of the mechanistic details of molecular function. Both efforts would benefit from machine-learning methods that can accurately classify protein function using the limited amount of training data available.
