Evaluation and improvements of clustering algorithms for detecting remote homologous protein families.

Juliana S Bernardes,Fabio RJ Vieira,Gerson Zaverucha,Lygia MM Costa

doi:10.1186/s12859-014-0445-4

Open Access

https://doi.org/10.1186/s12859-014-0445-4

Abstract

Background: An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). Clustering sequences into families is at the heart of most comparative studies dealing with protein evolution, structure, and function. Many methods have been developed for this task, and they perform reasonably well (F-measure above 0.88) when grouping proteins with high sequence identity. However, for highly diverged proteins the performance of these methods can be much lower, mainly because a common evolutionary origin cannot be deduced directly from sequence similarity. To the best of our knowledge, a systematic evaluation of clustering methods over distant homologous proteins is still lacking.

Results: We performed a comparative assessment of four clustering algorithms: Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix), considering several datasets with different levels of sequence similarity. Two types of similarity measures, required as input by the sequence clustering methods, were used to evaluate the performance of the algorithms: the standard measure obtained from sequence–sequence comparisons, and a novel measure based on profile-profile comparisons, used here for the first time.

Conclusions: The results reveal low clustering performance for the highly divergent datasets when the standard measure was used. However, the novel measure based on profile-profile comparisons substantially improved the performance of the four methods, especially when very low sequence identity datasets were evaluated. We also performed a parameter optimization step to determine the best configuration for each clustering method. We found that TransClust clearly outperformed the other methods for most datasets. This work also provides guidelines for the practical application of sequence clustering methods aimed at accurately detecting groups of related protein sequences.

Highlights

An important problem in computational biology is the automatic detection of protein families
We evaluate four state-of-the-art methods: Markov Clustering (MCL) [4], Transitive Clustering (TransClust) [5], Spectral Clustering of Protein Sequences (SCPS) [6] and High-Fidelity clustering of sequences (HiFix) [7]
Our results show that the traditional similarity measure based on sequence–sequence comparisons, which is often used to feed sequence-clustering methods, is not suitable for detecting remote homologous protein families and super-families

Summary

Introduction

An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). A number of clustering methods have been proposed to detect protein families, but to the best of our knowledge, the performance of most of them has been evaluated only on datasets containing homologous sequences with high identity. For highly diverged proteins, members of the same family can be so distant from one another that members of different families seem to be closer to each other. The existing clustering methods yield adequate results for close homologs, but they are likely to fail in identifying distant evolutionary relatedness.
