The distance-profile representation and its application to detection of distantly related protein families

Chin-Jen Ku, Golan Yona

doi:10.1186/1471-2105-6-282

Abstract

Background: Detecting homology between remotely related protein families is an important problem in computational biology since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space, and the computed scores are often noisy and frequently fail to recognize distantly related protein families.

Results: We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families and uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins measured by the statistical difference between their corresponding probability distributions.

Conclusion: Using several performance measures we show that the new tool considerably improves the performance in recognizing distant homologies compared to existing approaches such as PSIBLAST and FUGUE.
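As a rough illustration of the pipeline outlined in the abstract, the sketch below turns a vector of association scores against template structures into a sparse probability distribution and compares two such distributions. The abstract does not say how the vectors are sparsified or which statistical difference is used, so keeping the `keep` largest scores and using the Jensen-Shannon divergence are assumptions made only for this example; the function names are hypothetical as well.

```python
import math
from typing import List, Sequence


def to_sparse_distribution(scores: Sequence[float], keep: int) -> List[float]:
    """Keep the `keep` largest non-negative scores, zero the rest, normalize to sum 1.

    The top-k rule is an assumption for illustration, not the paper's procedure.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = [0.0] * len(scores)
    for i in top:
        kept[i] = float(scores[i])
    total = sum(kept)
    if total <= 0:
        raise ValueError("at least one kept score must be positive")
    return [v / total for v in kept]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (assumed choice of statistical difference)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy association scores of two query proteins against four template structures.
profile_a = to_sparse_distribution([4.0, 1.0, 3.0, 0.0], keep=2)
profile_b = to_sparse_distribution([0.5, 5.0, 0.2, 4.0], keep=2)
distance = js_divergence(profile_a, profile_b)
```

In this toy case the two profiles keep disjoint sets of templates, so the divergence reaches its maximum of 1 bit; profiles that concentrate on the same templates score closer to 0.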

Highlights

Detecting homology between remotely related protein families is an important problem in computational biology since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins.
The last two indices are characterized by the following generic definition: the topX-Y index for a protein p counts the total number of proteins sharing the same Y SCOP denomination among the nX closest sequences of p, where nX is the total number of sequences in the library that have the same X SCOP denomination as p itself.
We study a new method for remote homology detection that utilizes global information on the proximity of entities in the protein space.
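The generic topX-Y definition given in the highlights can be sketched in a few lines. The function name, the dictionary layout of the SCOP labels, and the level names (`fold`, `superfamily`, `family`) are hypothetical choices for this example, and the query is assumed to be excluded from its own neighbor list, a detail the definition leaves open.

```python
from typing import Dict, Sequence


def top_x_y(
    query: str,
    ranked_library: Sequence[str],
    scop: Dict[str, Dict[str, str]],
    x_level: str,
    y_level: str,
) -> int:
    """Count proteins sharing the query's Y label among its nX closest sequences.

    ranked_library: library proteins sorted by increasing distance to the query
                    (query itself excluded; an assumption of this sketch).
    scop:           protein id -> {SCOP level -> label}.
    """
    q = scop[query]
    # nX: number of library sequences with the same X denomination as the query.
    n_x = sum(1 for p in ranked_library if scop[p][x_level] == q[x_level])
    # Among the nX closest sequences, count those with the same Y denomination.
    closest = ranked_library[:n_x]
    return sum(1 for p in closest if scop[p][y_level] == q[y_level])


# Toy library, already ranked by distance to the query "q".
scop_labels = {
    "q": {"fold": "F1", "superfamily": "S1", "family": "A"},
    "a": {"fold": "F1", "superfamily": "S1", "family": "A"},
    "b": {"fold": "F2", "superfamily": "S2", "family": "B"},
    "c": {"fold": "F1", "superfamily": "S1", "family": "C"},
    "d": {"fold": "F1", "superfamily": "S3", "family": "D"},
    "e": {"fold": "F2", "superfamily": "S2", "family": "B"},
}
ranked = ["a", "b", "c", "d", "e"]
fold_fold = top_x_y("q", ranked, scop_labels, "fold", "fold")
```

Here three library proteins share the query's fold, so the three closest sequences (`a`, `b`, `c`) are inspected and two of them have that fold. A perfect ranking would place all three fold members first and give an index equal to nX.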

Summary

Introduction

Detecting homology between remotely related protein families is an important problem in computational biology. Homology establishes the evolutionary relationship among different organisms, and the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space, and the computed scores are often noisy and frequently fail to recognize distantly related protein families. Our ability to detect subtle similarities between proteins depends strongly on the representations we employ for proteins. The essential difference between the representation of a protein as a sequence of amino acids and its representation as a 3D structure traditionally dictated different methodologies, different similarity or distance

Objectives

Results

Discussion

Conclusion