Abstract

Background: Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to the creation and propagation of function assignment errors. Thus, it is important to analyze the quality of sequence-based function assignment thoroughly and systematically using large-scale data.

Results: We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity, measured by sequence identity or by the statistical significance of the alignment, and compared this correlation against that of randomly chosen protein pairs.

Conclusion: Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within-genome versus between-genome comparisons, for the three GO categories. Our study provides a benchmark for estimating the confidence in function assignments based purely on sequence similarity.

Highlights

  • Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation

  • The sequence comparisons within and across the four genomes provide a global view on the relationship between sequence similarity and function similarity

  • For a given sequence similarity interval, the probability is the number of pairs sharing the same function at a certain index level divided by the total number of pairs having any function at that index level
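The ratio described in the last highlight can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each GO index is assumed to be a tuple of integers read from the root down (so the first `level` entries identify the function at that index level), sequence similarity is assumed to be percent identity grouped into fixed-width intervals, and the names `prefixes_at_level` and `shared_function_probability`, as well as the toy data, are hypothetical.

```python
from collections import defaultdict


def prefixes_at_level(indices, level):
    """Truncate each index to `level` entries; indices shallower than `level` are skipped."""
    return {idx[:level] for idx in indices if len(idx) >= level}


def shared_function_probability(pairs, annotations, level, bin_width=10):
    """Fraction of pairs sharing a function at `level`, per identity interval.

    pairs:       iterable of (protein_a, protein_b, percent_identity)
    annotations: dict mapping protein id -> list of index tuples (assumed format)
    Returns a dict mapping the lower bound of each interval to the fraction.
    """
    shared = defaultdict(int)
    total = defaultdict(int)
    for a, b, identity in pairs:
        fa = prefixes_at_level(annotations.get(a, ()), level)
        fb = prefixes_at_level(annotations.get(b, ()), level)
        # Only pairs where both proteins have some function at this level count.
        if not fa or not fb:
            continue
        # Lower bound of the interval; 100% identity falls into the top interval.
        lo = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        total[lo] += 1
        if fa & fb:
            shared[lo] += 1
    return {lo: shared[lo] / total[lo] for lo in sorted(total)}


# Toy data, invented purely for illustration.
annotations = {
    "P1": [(1, 2, 3)],
    "P2": [(1, 2, 4)],
    "P3": [(1, 5)],
    "P4": [(2, 1, 1)],
}
pairs = [
    ("P1", "P2", 85.0),
    ("P1", "P3", 82.0),
    ("P1", "P4", 35.0),
    ("P2", "P3", 31.0),
]

for level in (1, 2, 3):
    print(level, shared_function_probability(pairs, annotations, level))
```

Pairs in which either protein has no annotation at the requested level are left out of the denominator, which matches the wording "total pairs having any functions at the respective index level". Deeper levels therefore tend to rest on fewer pairs.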


Summary

Introduction

Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. It is important to analyze the quality of sequence-based function assignment thoroughly and systematically using large-scale data. Large-scale genome sequencing projects have discovered many new proteins. Annotating a genome involves assigning functions to proteins, in most cases on the basis of sequence similarity. Protein function assignments based on postulated homology, as recognized by sequence identity or a significant expectation value of the alignment, are used routinely in genome analysis. Methods have been developed to predict function by identifying sequence similarity between a protein of unknown function and one or more proteins with experimentally characterized or computationally predicted functions. It is widely recognized that functional annotations should be transferred with caution, as sequence similarity does not guarantee an evolutionary or functional relationship. If a protein is assigned an incorrect function in a database, the error could carry over to other proteins whose functions are inferred by sequence similarity.
