Identifying structural domains of proteins using clustering

Howard J Feldman

doi:10.1186/1471-2105-13-286

Abstract

BackgroundProtein structures are comprised of modular elements known as domains. These units are used and re-used over and over in nature, and usually serve some particular function in the structure. Thus it is useful to be able to break up a protein of interest into its component domains, prior to similarity searching for example. Numerous computational methods exist for doing so, but most operate only on a single protein chain and many are limited to making a series of cuts to the sequence, while domains can and do span multiple chains.ResultsThis study presents a novel clustering-based approach to domain identification, which works equally well on individual chains or entire complexes. The method is simple and fast, taking only a few milliseconds to run, and works by clustering either vectors representing secondary structure elements, or buried alpha-carbon positions, using average-linkage clustering. Each resulting cluster corresponds to a domain of the structure. The method is competitive with others, achieving 70% agreement with SCOP on a large non-redundant data set, and 80% on a set more heavily weighted in multi-domain proteins on which both SCOP and CATH agree.ConclusionsIt is encouraging that a basic method such as this performs nearly as well or better than some far more complex approaches. This suggests that protein domains are indeed for the most part simply compact regions of structure with a higher density of buried contacts within themselves than between each other. By representing the structure as a set of points or vectors in space, it allows us to break free of any artificial limitations that other approaches may depend upon.

Highlights

Protein structures are comprised of modular elements known as domains
We found that these agree only 80% of the time on number of domains over 75,500 chains that they have in common (SCOP 1.75 and CATH 3.4.0, data not shown)! Despite these problems, splitting a protein into domains is often desirable
7076 of these chains still existed in the current Protein Data Bank (PDB) and so comprised the training set used in this study for the carbon based algorithm (CA) algorithm

Summary

Introduction

Protein structures are comprised of modular elements known as domains. These units are used and re-used over and over in nature, and usually serve some particular function in the structure. As a result of these different paradigms, there still does not exist a precise definition for a protein domain, nor do experts always agree on the number or location of domains within a given structure. This makes it extremely difficult to come up with a fully automated algorithm, to assign domain boundaries. For example when performing homology modelling, one often seeks a template to model parts of the structure from In this case it makes the most sense to find and use similar domains from known structures, which may provide useful templates when searching for similarity to the entire chain may not. Many different approaches have been used to split proteins into domains, and these can be divided into sequence-based and structure-based approaches

Methods

Results

Discussion

Conclusion