Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins

Nan Song,Jacob M Joseph,George B Davis,Dannie Durand

doi:10.1371/journal.pcbi.1000063


Open Access

https://doi.org/10.1371/journal.pcbi.1000063


Journal: PLoS Computational Biology
Publication Date: May 16, 2008
Citations: 231
License type: CC BY 4.0

Affiliation: Carnegie Mellon University

Abstract

We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. 
In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era.
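
The abstract does not give the formula behind Neighborhood Correlation, so the following is only an illustrative sketch of the general idea. It assumes that the "local structure of the sequence similarity network" can be captured by comparing two sequences' similarity-score profiles against every sequence in the database, and it uses Pearson correlation for that comparison. The function name, the score matrix, and all score values are hypothetical toy data, not the authors' implementation or results.

```python
import math

def neighborhood_correlation(scores, x, y):
    """Pearson correlation between the similarity profiles of x and y.

    `scores[s]` is a list of similarity scores between sequence s and
    every sequence in the database, in a fixed order (assumed layout).
    """
    sx, sy = scores[x], scores[y]
    n = len(sx)
    mean_x = sum(sx) / n
    mean_y = sum(sy) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(sx, sy))
    var_x = sum((a - mean_x) ** 2 for a in sx)
    var_y = sum((b - mean_y) ** 2 for b in sy)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no neighborhood signal
    return cov / math.sqrt(var_x * var_y)

# Toy database order: [A, B, F1, F2, C, G1, G2] (all values invented).
# A and B belong to one family and share the neighbors F1 and F2.
# C belongs to another family but shares a single inserted domain with A.
scores = {
    "A": [100, 80, 70, 60, 40, 0, 0],
    "B": [80, 100, 75, 65, 0, 0, 0],
    "C": [40, 0, 0, 0, 100, 70, 60],
}

nc_ab = neighborhood_correlation(scores, "A", "B")
nc_ac = neighborhood_correlation(scores, "A", "C")
print(f"NC(A, B) = {nc_ab:.3f}")
print(f"NC(A, C) = {nc_ac:.3f}")
```

In this toy setting A and C have a nonzero direct similarity score because of the shared domain, yet their neighborhoods barely overlap, so their profile correlation is far lower than that of A and B. This is the kind of distinction between common ancestry and a shared inserted domain that the abstract describes.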

Highlights

Accurate identification of homologs, sequences that share common ancestry, is essential for accuracy in function prediction and comparative genomics
In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them
Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era

Summary

Introduction

Accurate identification of homologs, sequences that share common ancestry, is essential for accuracy in function prediction and comparative genomics. Pairwise homology detection is an integral component of clustering approaches to protein family classification ([1,16], and work cited therein). All of these applications exploit one or both of the following properties of homologous sequences: genes that share common ancestry tend (1) to have similar structure and function, and (2) to be located in homologous chromosomal regions, making them suitable markers for comparative genomics. Because of their prevalence and importance, it is desirable to incorporate multidomain sequences in such analyses: multidomain proteins represent 40% of the metazoan proteome, with functional roles in signal transduction, cellular adhesion, tissue repair, and immune response [17]. We extend the traditional definition of homology [18] to multidomain sequences that share a common ancestral gene, providing a formalism suitable for modeling multidomain family evolution, design and validation of multidomain homology identification methods, and incorporation of multidomain sequences in genomic analyses.

