Abstract

To obtain a comprehensive repertoire of foldable domains within whole proteomes, including orphan domains, we developed a novel procedure called SEG-HCA. Using only the information in a single amino acid sequence, SEG-HCA automatically delineates segments with a high density of hydrophobic clusters, as defined by Hydrophobic Cluster Analysis (HCA). These hydrophobic clusters mainly correspond to regular secondary structures, which together form structured or foldable regions. Genome-wide analyses revealed that SEG-HCA predictions are largely the opposite of those made by disorder predictors, the two approaches addressing distinct structural states. Interestingly, the two sets of predictions nevertheless overlap, notably for small segments of disordered sequences that undergo coupled folding and binding. SEG-HCA thus gives access to these specific domains, which are generally poorly represented in domain databases. Comparison of the whole set of SEG-HCA predictions with the Conserved Domain Database (CDD) also showed that a large proportion of the predicted long segments (length >50 amino acids) are CDD orphans. These orphan sequences may either correspond to highly divergent members of already known families or belong to new families of domains. Their comprehensive description thus opens new avenues for investigating functional and/or structural features that have so far remained undiscovered. Altogether, the data described here provide new insights into protein architecture and organization throughout the three kingdoms of life.

Highlights

  • Domains are the modular building blocks of proteins and correspond to recurring, fundamental units of both protein structure and evolution

  • SEG-HCA H2CD predictions applied to whole proteomes. So far, analysis of Hydrophobic Cluster Analysis (HCA) plots has been manual and limited to small sets of protein sequences

  • The SEG-HCA procedure allows the automation of one aspect of the HCA plot analysis by delineating, from the consideration of a single protein sequence, the positions of segments having a high density in hydrophobic clusters (H2CD segments, Fig. 2)
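The idea of delineating segments with a high density of hydrophobic clusters from a single sequence can be illustrated with a minimal Python sketch. It is not the published SEG-HCA algorithm. It assumes the strong hydrophobic alphabet commonly used in HCA (V, I, L, F, M, Y, W), treats proline as a cluster breaker, and splits clusters at runs of four or more non-hydrophobic residues. The function names and the `max_linker` and `min_length` thresholds are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a simplified stand-in for the idea behind SEG-HCA,
# not the authors' procedure. Alphabet, breaker and thresholds are assumptions.
HYDROPHOBIC = set("VILFMYW")  # assumed HCA strong hydrophobic alphabet
BREAKER = "P"                 # proline assumed to interrupt a cluster


def hydrophobic_clusters(seq, gap=4):
    """Return (start, end) pairs (end exclusive) of hydrophobic clusters.

    A cluster starts and ends on a hydrophobic residue. It is closed by a
    proline or by `gap` or more consecutive non-hydrophobic residues.
    """
    clusters = []
    start = last = None
    for i, aa in enumerate(seq.upper()):
        if aa in HYDROPHOBIC:
            if start is None:
                start = i
            elif i - last - 1 >= gap:
                # Too many non-hydrophobic residues: close and restart.
                clusters.append((start, last + 1))
                start = i
            last = i
        elif aa == BREAKER and start is not None:
            clusters.append((start, last + 1))
            start = last = None
    if start is not None:
        clusters.append((start, last + 1))
    return clusters


def dense_segments(seq, max_linker=10, min_length=30):
    """Merge neighbouring clusters into segments (hypothetical thresholds).

    Clusters separated by at most `max_linker` residues are joined; merged
    segments shorter than `min_length` residues are discarded.
    """
    segments = []
    for start, end in hydrophobic_clusters(seq):
        if segments and start - segments[-1][1] <= max_linker:
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return [(s, e) for s, e in segments if e - s >= min_length]


if __name__ == "__main__":
    # Toy sequence: two cluster-rich blocks separated by a glycine linker.
    toy = "VAAAL" * 8 + "G" * 20 + "VAAAL" * 8
    for s, e in dense_segments(toy):
        print(f"segment {s + 1}-{e} (1-based, inclusive)")
```

Merging on linker length is a deliberately crude proxy for cluster density; a faithful implementation would follow the criteria given in the paper's Methods.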

Introduction

Domains are the modular building blocks of proteins and correspond to recurring, fundamental units of both protein structure and evolution. Information about protein domains is stored in dedicated databases, in the form of profiles or hidden Markov models (HMMs), which are constructed through sequence similarity searches. These profiles and HMMs can be searched to detect the domain composition of proteins, starting from their amino acid sequences [5]. In this way, approximately half of the residues of proteomes can be assigned to well-classified domains, such as those stored in the PfamA classification [2]. The percentage of assigned residues increases when less well-characterized domain databases, such as PfamB, are searched
