Abstract

Background: Identification of protein structural cores requires isolating sets of proteins that all share the same subset of structural motifs. Given the ever-growing number of available 3D protein structures, standard automatic clustering algorithms must be adapted to identify such sets of proteins efficiently.

Results: A pair of 3D structures is declared similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented as a graph of similarities, in which a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Classifying proteins into structural families can therefore be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may place in the same cluster a subset of 3D structures that do not share a common substructure. To overcome this drawback, we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. This ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three protein structures involved do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and yields a reduced graph in which no ternary constraint is violated. Our approach is thus to build a graph of similarities, then to reduce the graph with the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. This method was used to classify ASTRAL-40 non-redundant protein domains, with significant pairwise similarities identified by Yakusa, a program devised for rapid 3D structure alignment.

Conclusions: We show that filtering similarities with ternary similarity constraints before a standard graph-based clustering process i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph-based clustering algorithms with respect to the reference classification SCOP.
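
The graph reduction step described in the Results can be illustrated with a small sketch. The abstract does not give the exact definition of the ternary similarity or of the modification algorithm, so everything below is an assumption made for illustration: alignments are reduced to sets of matched residue positions, the hypothetical test `triangle_ok` requires that, within each structure of a triangle, its two alignments overlap on at least `min_common` residues, and `reduce_graph` greedily drops the lowest-scoring edge of any violating triangle. It is not the authors' algorithm.

```python
from itertools import combinations

def covered(alignments, x, y):
    """Residue positions of structure x matched in its alignment with y."""
    return alignments[frozenset((x, y))][x]

def triangle_ok(alignments, a, b, c, min_common=3):
    """Hypothetical ternary test (an assumption, not the paper's definition):
    in each structure, the alignments to the two other structures must
    share at least min_common residue positions."""
    for x, y, z in ((a, b, c), (b, a, c), (c, a, b)):
        shared = covered(alignments, x, y) & covered(alignments, x, z)
        if len(shared) < min_common:
            return False
    return True

def reduce_graph(nodes, alignments, scores, min_common=3):
    """Greedy stand-in for the modification algorithm: while some triangle
    violates the ternary test, remove that triangle's weakest edge."""
    edges = set(alignments)
    changed = True
    while changed:
        changed = False
        for a, b, c in combinations(sorted(nodes), 3):
            tri = [frozenset(p) for p in ((a, b), (a, c), (b, c))]
            if all(e in edges for e in tri) and not triangle_ok(
                alignments, a, b, c, min_common
            ):
                edges.remove(min(tri, key=lambda e: scores[e]))
                changed = True
    return edges

# Toy data (invented): B matches A on residues 1-10 but matches C on
# residues 20-30, so A, B and C have no substructure common to all three.
nodes = {"A", "B", "C"}
alignments = {
    frozenset(("A", "B")): {"A": set(range(1, 11)), "B": set(range(1, 11))},
    frozenset(("A", "C")): {"A": set(range(1, 11)), "C": set(range(1, 11))},
    frozenset(("B", "C")): {"B": set(range(20, 31)), "C": set(range(1, 11))},
}
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.5,
}

reduced = reduce_graph(nodes, alignments, scores)
print(sorted(sorted(e) for e in reduced))
```

In this toy case the B-C edge is removed because it is the weakest edge of the only violating triangle. The reduced edge set would then be passed to whichever standard graph clustering algorithm is in use.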

Introduction

Identification of protein structural cores requires isolating sets of proteins that all share the same subset of structural motifs. Given the ever-growing number of available 3D protein structures, standard automatic clustering algorithms must be adapted to identify such sets of proteins efficiently. Since the last version of SCOP (1.75), the total number of non-redundant protein chains in the Protein Data Bank (PDB) has grown by about 21% (from 10417 to 12643), according to the VAST [5] non-redundant set for a BLAST p-value of 10^-7, available at ftp://ftp.ncbi.nih.gov/mmdb/nrtable/. Over the past decade there have been many attempts to develop automatic classification procedures, mainly by applying supervised classification methods that use part of a reference classification as labels for known 3D structures.
