Abstract

Identification of homologous proteins provides a basis for protein annotation. Sequence alignment tools reliably identify homologs that share high sequence similarity. However, identifying homologs that share low sequence similarity remains a challenge. Lowering the cutoff value could enable the identification of diverged homologs, but it also introduces numerous false hits. Methods are continuously being developed to minimize this problem. Estimating the fraction of homologs in a set of protein alignments can help in the assessment and development of such methods, and provides users with an intuitive quantitative assessment of protein alignment results. Herein, we present a computational approach that estimates the number of homologs in a set of protein pairs. The method requires a prevalent and detectable protein feature that is conserved between homologs. By analyzing the prevalence of this feature in a set of pairwise protein alignments, the method can estimate the number of homolog pairs in the set independently of the quality of the alignments. Using the HomoloGene database as a standard of truth, we implemented this approach in a proteome-wide analysis. The results revealed that this approach, which is independent of the alignments themselves, works well for estimating the number of homologous proteins across a wide range of homology values. In summary, the presented method can accompany homology searches and method development, validate search results, and allow tuning of tools and methods.

Highlights

  • Homology detection is a key step in predicting the function of newly discovered proteins

  • We developed a computational method to estimate the number of homologous proteins in a set of protein pairs

  • The estimator requires two protein sets and a protein feature X that is prevalent in both sets and conserved among homologous proteins
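
The idea behind the highlights above can be illustrated with a simple two-component mixture. This is only a minimal sketch under stated assumptions and is not necessarily the estimator used in the paper: it assumes that, among aligned pairs whose first protein carries feature X, a homologous partner carries X with probability `conservation`, while a non-homologous partner carries X at the background prevalence `background_rate` of the second set. The function name, the parameter names, and the mixture formula itself are hypothetical and introduced here for illustration.

```python
def estimate_homolog_fraction(pairs, background_rate, conservation=1.0):
    """Estimate the fraction of homologous pairs from feature co-occurrence.

    pairs: iterable of (first_has_x, second_has_x) booleans, one per aligned pair.
    background_rate: assumed prevalence of feature X in the second protein set.
    conservation: assumed probability that a homolog of an X-carrying protein
        also carries X (1.0 means perfectly conserved).

    Assumed model (not taken from the source):
        observed = h * conservation + (1 - h) * background_rate
    which is solved for h, the homolog fraction.
    """
    # Keep only pairs whose first protein carries the feature.
    partner_flags = [second for first, second in pairs if first]
    if not partner_flags:
        raise ValueError("no pair has a first protein carrying the feature")
    if conservation <= background_rate:
        raise ValueError("conservation must exceed the background rate")

    # Fraction of those pairs in which the partner also carries the feature.
    observed = sum(partner_flags) / len(partner_flags)

    # Invert the mixture and clip to the valid range [0, 1].
    estimate = (observed - background_rate) / (conservation - background_rate)
    return min(1.0, max(0.0, estimate))


# Hypothetical toy data: 8 informative pairs, 6 of which share the feature,
# plus 2 pairs that are ignored because the first protein lacks the feature.
example_pairs = [(True, True)] * 6 + [(True, False)] * 2 + [(False, True)] * 2
example_estimate = estimate_homolog_fraction(example_pairs, background_rate=0.5)
```

Under these assumptions, the estimate depends only on how often the feature co-occurs in the paired proteins and not on alignment scores, which is the property the highlights emphasize. A feature whose background prevalence is close to its conservation rate would make the denominator small and the estimate unstable.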



Introduction

Homology detection is a key step in predicting the function of newly discovered proteins. Different methods for homology detection are currently available, and they can be divided into sequence-based and structure-based methods. Sequence-based methods rely on estimated evolutionary models that aim to reconstruct the evolutionary paths relating the protein sequences. The structure-based approach uses protein structure data and allows searching for similar proteins in a structure classification database using structure alignment methods. Protein structure data are a superior representation of proteins compared with sequence data. Several methods for searching such databases have been developed [4,5,6,7], and they may detect homologous proteins that are unrecoverable in regular sequence-based searches. Although the number of solved protein structures is growing rapidly, it lags behind sequence data. The Protein Data Bank [8] holds structure data for fewer than 40,000 proteins and protein fragments, whereas the Swiss-Prot knowledgebase holds more than 200,000 sequences [9].

