Abstract

In the accompanying paper (Nagy, Szláma, Szarka, Trexler, Bányai, Patthy, Reassessing Domain Architecture Evolution of Metazoan Proteins: Major Impact of Gene Prediction Errors) we showed that, for the UniProtKB/TrEMBL, RefSeq, EnsEMBL and NCBI GNOMON predicted protein sequences of Metazoan species, the contribution of erroneous (incomplete, abnormal, mispredicted) sequences to domain architecture (DA) differences of orthologous proteins might be greater than that of true gene rearrangements. Based on these findings, we suggest that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. In this manuscript we examine the impact of confusing paralogous and epaktologous multidomain proteins (i.e., those that are related only through the independent acquisition of the same domain types) on conclusions drawn about DA evolution of multidomain proteins in Metazoa. To estimate the contribution of this type of error we used as reference UniProtKB/Swiss-Prot sequences from protein families with well-characterized evolutionary histories. We used two types of paralogy-group construction procedures and monitored the impact of various parameters on the separation of true paralogs from epaktologs in correctly annotated Swiss-Prot entries of multidomain proteins. Our studies have shown that, although public protein family databases are contaminated with epaktologs, analysis of the structure of sequence similarity networks of multidomain proteins provides an efficient means for the separation of epaktologs and paralogs. We have also demonstrated that contamination of protein families with epaktologs increases the apparent rate of DA change and introduces a bias in DA differences inasmuch as it increases the proportion of terminal over internal DA differences.
We have shown that confusing paralogous and epaktologous multidomain proteins significantly increases the apparent rate of DA change in Metazoa and introduces a positional bias in favor of terminal over internal DA changes. Our findings caution that earlier studies based on analysis of datasets of protein families that were contaminated with epaktologs may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of multidomain proteins is presented in an accompanying paper [1].

Highlights

  • Since formation of multidomain proteins with novel domain architectures (DA) is known to have played a major role in biological innovations of Metazoa [2,3], there is a growing interest in the genome-scale reconstruction of DA evolution with a view to defining the contribution of different genetic mechanisms.

  • Our studies have shown that analysis of the structure of sequence similarity networks of multidomain proteins provides an efficient means for the separation of epaktologs and paralogs.

  • Intra-species comparison of human Swiss-Prot sequences benefits from the fact that the dataset is of high quality and suffers only from the problem that it may be difficult to separate paralogs and epaktologs.


Summary

Introduction

Since formation of multidomain proteins with novel domain architectures (DA) is known to have played a major role in biological innovations of Metazoa [2,3], there is a growing interest in the genome-scale reconstruction of DA evolution with a view to defining the contribution of different genetic mechanisms. In the present work we show that the standard procedures are much less reliable in defining groups of paralogs. This is because, in the case of multidomain proteins, the major subtypes of homology (orthology, paralogy, pseudoparalogy) do not account for all types of relationships that may hold between two homologous multidomain proteins. Panel (b) of Figure 3 illustrates the case where domain shuffling independently inserts the same domain type, s, into terminal positions of orthologs of proteins A and X, followed by tandem duplication of this domain, resulting in proteins A1* and X2* in an extant species with domain architectures s-s-s-s-a-b and x-z-s-s-s-s, respectively. We demonstrated that failure to separate epaktologs and paralogs increases the apparent rate of DA change during protein evolution and distorts the results by introducing a positional bias in favor of terminal over internal DA changes.
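To make the positional bias concrete, the following minimal Python sketch labels each block in which two domain architectures differ as terminal or internal. It is only an illustration and not the procedure used in the paper: the function name `classify_da_differences`, the hyphen-separated DA strings, and the use of a generic sequence alignment from Python's standard `difflib` module are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def classify_da_differences(da1, da2):
    """Label each block where two domain architectures differ.

    Hypothetical helper for illustration only. A difference is called
    'terminal' if it touches the N- or C-terminal end of the aligned
    architectures, and 'internal' otherwise.
    """
    a = da1.split("-")
    b = da2.split("-")
    # autojunk=False keeps repeated domains (e.g. tandem copies) in play.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    labels = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # shared stretch of domains, not a difference
        touches_start = i1 == 0 and j1 == 0
        touches_end = i2 == len(a) and j2 == len(b)
        labels.append("terminal" if touches_start or touches_end else "internal")
    return labels


# The two epaktologs from the Figure 3(b) example share only the s repeats.
epaktolog_labels = classify_da_differences("s-s-s-s-a-b", "x-z-s-s-s-s")

# A made-up pair (not from the paper) that differs by one inserted domain
# between otherwise shared domains.
internal_labels = classify_da_differences("a-b-c", "a-d-b-c")
```

For the two architectures of the Figure 3(b) example, both differences fall at the ends of the aligned architectures, which is the kind of terminal difference that treating epaktologs as paralogs would add to the counts.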

Datasets of Human Swiss-Prot Paralogs
Families with Paralogs and Orthologs Only
Comparison of the DA of Paralogous Human Swiss-Prot Proteins Defined through
Databases
Comparison of the Domain Architectures of Homologous Proteins
Conclusions