Abstract

Background: The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs.

Results: In order to identify novel 3D motifs that may be associated with molecular functions, we employ an unsupervised, two-phase clustering approach that combines k-means and hierarchical clustering with knowledge-informed cluster selection and annotation methods. We applied the approach to approximately 20,000 cysteine-based protein microenvironments (3D regions 7.5 Å in radius) and identified 70 interesting clusters, some of which represent known motifs (e.g., metal binding and phosphatase activity), and some of which are novel, including several zinc binding sites. Detailed annotation results are available online for all 70 clusters at http://feature.stanford.edu/clustering/cys.

Conclusions: The use of microenvironments instead of backbone geometric criteria enables flexible exploration of protein function space, and detection of recurring motifs that are discontinuous in sequence and diverse in structure. Clustering microenvironments may thus help to functionally characterize novel proteins and better understand the protein structure-function relationship.
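The two-phase idea named in the abstract (k-means followed by hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the farthest-first initialization, the average-linkage rule, the Euclidean distance, and the parameters `k_fine` and `n_groups` are all assumptions chosen for illustration, and the knowledge-informed cluster selection and annotation steps are not modeled. Each microenvironment is assumed to be already encoded as a numeric feature vector.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization (an assumption made for reproducibility):
    start at the first point, then repeatedly add the point farthest from
    the centroids chosen so far."""
    chosen = [list(points[0])]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in chosen))
        chosen.append(list(nxt))
    return chosen


def kmeans(points, k, iters=100):
    """Phase 1: plain Lloyd's k-means producing k fine-grained clusters."""
    centroids = farthest_first(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def agglomerate(centroids, n_groups):
    """Phase 2: average-linkage agglomerative clustering of the centroids.
    Returns one group label per centroid."""
    groups = [[i] for i in range(len(centroids))]

    def linkage(g, h):
        total = sum(math.sqrt(sq_dist(centroids[i], centroids[j]))
                    for i in g for j in h)
        return total / (len(g) * len(h))

    while len(groups) > max(1, n_groups):
        # Merge the pair of groups with the smallest average distance.
        a, b = min(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]))
        groups[a] = groups[a] + groups[b]
        del groups[b]
    labels = [0] * len(centroids)
    for group_id, group in enumerate(groups):
        for i in group:
            labels[i] = group_id
    return labels


def two_phase_cluster(points, k_fine, n_groups):
    """Map every point to the hierarchical group of its k-means cluster."""
    centroids, assign = kmeans(points, k_fine)
    centroid_labels = agglomerate(centroids, n_groups)
    return [centroid_labels[a] for a in assign]
```

Running k-means first reduces a large set of vectors to a modest number of centroids, so the quadratic-cost hierarchical step only has to compare centroids rather than every pair of microenvironments. For a data set of the size mentioned in the abstract, a library implementation would be a more practical choice than this pure-Python sketch.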

Highlights

  • We have improved the clustering procedure by better defining the biological context and decreasing feature redundancy, and applied a discriminating cluster selection method to produce more coherent groups, which we annotate using external knowledge from several sources. We demonstrate that this approach, applied to a set of cysteine (CYS) residues from a subset of the Protein Data Bank (PDB), is able to rediscover known functions, distinguish between functional sub-classes, make compelling functional site predictions for individual proteins, and identify novel groups of interesting microenvironments

Introduction

The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs. Protein function and structure are inherently linked, with molecular interactions determined by the shape and energetics of the participating structures. Galvanized by the Protein Structure Initiative, the field of structural genomics has begun to solve protein structures at high throughput [1,2,3]. By solving representative structures throughout protein structure space, researchers can more fully determine the relationship between protein structure and function [4]. Many of the solved structural genomics targets, however, lack annotation regarding the proteins' biological functions.

