Super paramagnetic clustering of protein sequences

Igor V Tetko, Axel Facius, Hans-Werner Mewes, Andreas Ruepp

doi:10.1186/1471-2105-6-82

Abstract

Background: Detection of sequence homologues represents a challenging task that is important for the discovery of protein families and the reliable application of automatic annotation methods. The presence of domains in protein families of diverse function, together with the inhomogeneity and different sizes of protein families, creates considerable difficulties for the application of published clustering methods.

Results: Our work analyses the Super Paramagnetic Clustering (SPC) algorithm and its extension, global SPC (gSPC). These algorithms cluster input data based on a method that is analogous to the treatment of an inhomogeneous ferromagnet in physics. For the SwissProt and SCOP databases we show that the gSPC improves the specificity and sensitivity of clustering over the original SPC and the Markov Cluster algorithm (TRIBE-MCL) by up to 30%. The three algorithms provided similar results for the MIPS FunCat 1.3 annotation of four bacterial genomes, Bacillus subtilis, Helicobacter pylori, Listeria innocua and Listeria monocytogenes. However, the gSPC covered about 12% more sequences than the other methods. The SPC algorithm was programmed in house using C++ and it is available at . The FunCat annotation is available at .

Conclusion: The gSPC achieved a higher accuracy or covered a larger number of sequences than the TRIBE-MCL algorithm. Thus it is a useful approach for automatic detection of protein families and unsupervised annotation of full genomes.

Highlights

Following our first successful application of Super Paramagnetic Clustering (SPC) to a database of RING-finger domains [15] and our approach to project expression data to known functional modules [16], the present study further investigates the power of SPC to cluster protein sequences of two large databases, SwissProt and SCOP.
We introduce an extension of this algorithm, global SPC or gSPC, which performs step-wise clustering on different levels of connectivity between points and significantly improves performance in the annotation of whole genomes compared with both the original SPC algorithm and TRIBE-MCL.
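
The idea of clustering "on different levels of connectivity between points" can be illustrated with a deliberately simplified sketch. This is not the SPC or gSPC algorithm, which this section does not specify in detail. It is a stand-in that builds connected components from pairwise similarity scores at successively lower thresholds. The function names, the toy similarity scores and the thresholds are all hypothetical.

```python
from collections import defaultdict


def components_at_threshold(n_items, similarities, threshold):
    """Connected components using only pairs with similarity >= threshold.

    Simplified stand-in, NOT the SPC/gSPC method: a pair is either
    connected or not, with no ferromagnet-style simulation involved.
    """
    parent = list(range(n_items))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), score in similarities.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for i in range(n_items):
        groups[find(i)].append(i)
    return sorted(sorted(members) for members in groups.values())


def stepwise_clusters(n_items, similarities, thresholds):
    """Cluster at each connectivity level, from strictest to loosest."""
    return {
        t: components_at_threshold(n_items, similarities, t)
        for t in sorted(thresholds, reverse=True)
    }


# Hypothetical similarity scores between five sequences (indices 0..4).
TOY_SIMILARITIES = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.3, (3, 4): 0.85}
levels = stepwise_clusters(5, TOY_SIMILARITIES, [0.5, 0.2])
for threshold, clusters in levels.items():
    print(threshold, clusters)
```

In this toy example the strict level separates two tight groups, while the looser level merges them through the single weak link between sequences 2 and 3. This is the kind of level-dependent structure that a step-wise procedure can exploit.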

Summary

Introduction

Numerous genome-sequencing projects have caused a rapid growth of the protein databases. There is a strong interest in developing reliable methods for the automatic functional classification of genome sequences that use sequence homology, as a reflection of evolutionary relationships, to predict functional properties. The identification of protein families, defined as sets of proteins with significant sequence similarity whose members have at least related, and often identical, functions, is a very important subtask in achieving this fundamental goal. The fact that proteins with high sequence similarity share a

Methods

Results

Discussion

Conclusion