Abstract

Background: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2).

Results: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts.

Conclusions: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein’s gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.

Electronic supplementary material: The online version of this article (doi:10.1186/s13059-015-0656-7) contains supplementary material, which is available to authorized users.

Highlights

  • Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families

  • We chose Pfam because it is the largest contributor to the InterPro compendium of protein domain databases (Pfam annotates more than 40 million of the 42 million sequences in UniProt/InterPro; the next most comprehensive annotation source covers about half as many)

  • We focus on domain annotations where 50% or more of the Pfam Hidden Markov model (HMM), which defines the Pfam family, is missing at the domain annotation on the protein
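The 50% criterion in the last highlight can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `DomainHit` record, its field names, and the `is_partial` helper are hypothetical, and the sketch assumes that each annotation reports the first and last HMM positions it covers (1-based, inclusive) along with the model length.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """One domain annotation on a protein (hypothetical record layout)."""
    family: str        # family identifier, e.g. a Pfam accession
    model_length: int  # length of the family HMM
    hmm_start: int     # first model position covered (1-based, inclusive)
    hmm_end: int       # last model position covered (1-based, inclusive)


def covered_positions(hit: DomainHit) -> int:
    """Number of model positions spanned by the annotation."""
    return hit.hmm_end - hit.hmm_start + 1


def is_partial(hit: DomainHit) -> bool:
    """True when 50% or more of the model is missing from the annotation.

    Integer arithmetic avoids floating-point edge cases at exactly 50%:
    missing >= half of the model  <=>  2 * covered <= model_length.
    """
    return 2 * covered_positions(hit) <= hit.model_length


# Made-up hits against a 200-position model (assumed family name).
hits = [
    DomainHit("FamA", 200, 1, 80),    # 80 positions covered, 120 missing
    DomainHit("FamA", 200, 1, 100),   # exactly half missing
    DomainHit("FamA", 200, 11, 190),  # 180 positions covered
]
partials = [h for h in hits if is_partial(h)]
```

With these made-up inputs the first two hits are flagged and the third is not. Note that the abstract phrases the threshold as "shorter than 50%" of the model length, which would exclude the exact-half case; the sketch follows the highlight's "50% or more ... missing" wording.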


Summary

Introduction

Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact units from which larger functional proteins are assembled; however, almost 4% of Pfam PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. The discovery of evolutionarily mobile protein domains in the early 1980s, shortly after the recognition of eukaryotic splicing, revolutionized our understanding of protein structure. While proteins like calmodulin were known to contain repeated domains, the structural implications of modular proteins were not fully appreciated until clearly homologous domains were seen in different sequence contexts. Domains are central to our understanding of the structure, evolution, and functional roles of proteins and protein families. Conserved, structurally compact protein domains are often found in very different sequence contexts, and only by subdividing a protein into its constituent domains can one understand its evolutionary history.
