High‐performance prediction of functional residues in proteins with machine learning and computed input features

Srinivas Somarowthu, Huyuan Yang, Mary Jo Ondrechen, David G.C Hildebrand

doi:10.1002/bip.21589

Abstract

One of the major challenges in genomics is to understand the function of gene products from their 3D structures, and computational methods are needed to predict protein function from structure at high throughput. Methods that identify active sites are important for understanding and annotating the function of proteins. Traditional methods that exploit either sequence similarity or structural similarity can be unreliable and cannot be applied to proteins with novel folds or low homology to other proteins. Here, we present a machine-learning application that combines computed electrostatic, evolutionary, and pocket geometric information for high-performance prediction of catalytic residues. Input features consist of our structure-based theoretical microscopic anomalous titration curve shapes (THEMATICS) electrostatics data, enhanced with sequence-based phylogenetic information from INTREPID and topological pocket information from ConCavity. Our THEMATICS-based input features are augmented with an additional metric, the theoretical buffer range. Each of the three types of input performs well on its own, and integrating them yields significantly better performance than any one of these methods alone. The combined method achieves 86.7%, 92.5%, and 93.8% recall of annotated functional residues at false-positive rates of 5%, 8%, and 10%, respectively.
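
The performance figures above are recall values at fixed false-positive rates. As an illustration of how such a figure can be computed from per-residue prediction scores, here is a minimal sketch. The function name `recall_at_fpr`, the toy scores, and the labels are all hypothetical and are not taken from the paper; the sketch assumes that each residue has a single score (higher meaning more likely functional) and a binary annotation, and it says nothing about how the paper's classifier produces its scores.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Return the highest recall reachable while keeping the
    false-positive rate at or below max_fpr.

    scores: per-residue prediction scores (higher = more likely functional)
    labels: 1 for annotated functional residues, 0 otherwise
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative residue")

    # Rank residues from highest to lowest score and lower the
    # decision threshold one distinct score value at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(ranked):
        # Residues with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        if fp / n_neg > max_fpr:
            break  # this threshold admits too many false positives
        best_recall = tp / n_pos
        i = j
    return best_recall


# Hypothetical toy data: 10 residues, 3 of them annotated as functional.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fpr in (0.05, 0.08, 0.10):
    print(f"recall at {fpr:.0%} FPR: {recall_at_fpr(toy_scores, toy_labels, fpr):.3f}")
```

Reporting recall at several fixed false-positive rates, rather than at one score cutoff, lets methods with differently scaled scores be compared at the same operating points.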

Full Text