Abstract

The protein universe is the set of all proteins found in all organisms. One way to explore it is to consider the domain content of proteins. However, parts of many sequences, and many entire sequences, remain un-annotated even though the number of domain families is converging. The un-annotated part of the protein universe is referred to as the dark proteome and remains poorly characterized. In this study, we quantify the amount of foldable domains within the dark proteome using the hydrophobic cluster analysis methodology. These un-annotated foldable domains were grouped using a combination of remote homology searches and domain annotations, which led us to define different levels of darkness. The dark foldable domains were analyzed to understand what makes them different from domains stored in databases and thus difficult to annotate. The un-annotated domains of the dark proteome display specific features relative to database domains: shorter length, non-canonical content and particular topology of hydrophobic residues, higher propensity for disorder, and higher energy. These features make them hard to relate to known families. Based on these observations, we emphasize that domain annotation methodologies can still be improved to fully capture and decipher the molecular evolution of the protein universe.

Highlights

  • Position-specific scoring matrices (PSSMs) are usually built from a seed alignment of a protein domain, and this seed is then used to find occurrences of the domain in other protein sequences

  • The un-annotated foldable domains were compared to known domains annotated using two distinct methodologies, considering domains with structural homologs detected by mapping of the Protein Data Bank (PDB)[26] as well as protein domain families stored in databases

  • In the remainder of the manuscript, references to sequences of the four groups (PDB regions, gray regions, dark regions and dark proteins) correspond to Hydrophobic Cluster Analysis (HCA) domains of these groups, unless specified otherwise, as the goal of this study is to understand why these foldable domains are not annotated, in contrast to globular domains stored in domain databases


Summary

Introduction

Position-specific scoring matrices (PSSMs) are usually built from a seed alignment of a protein domain, and this seed is then used to find occurrences of the domain in other protein sequences. The detection of un-annotated but foldable domains of the dark proteomes relies on an automatic tool, called SEG-HCA, derived from the Hydrophobic Cluster Analysis (HCA) methodology. It identifies foldable domains comprehensively from the information in a single amino acid sequence alone, without prior knowledge of homologous sequences[14]. The quality of the sequences from this dataset has the advantage of strongly limiting potential bias from artefacts caused by genome annotation, assembly or prediction errors, in addition to providing a balanced set of proteins from different organisms that avoids over-representation of some taxa. This high-quality dataset supports the relevance of the differences observed between un-annotated foldable domains and known domains stored in databases. The less information there is for an un-annotated domain, the greater the difference in hydrophobic amino acid topology relative to known domain families.
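As an illustration of the PSSM-based annotation described above, the following minimal Python sketch builds a log-odds matrix from a small seed alignment and slides it along a target sequence. It is a simplified illustration, not the procedure used by any particular domain database: the seed sequences, the uniform background frequencies, the add-one pseudocount, and the function names `build_pssm` and `scan` are all assumptions made for this example. Production tools additionally apply sequence weighting, substitution-matrix-based pseudocounts and gap handling.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seed_alignment, pseudocount=1.0):
    """Build a log-odds PSSM from an ungapped seed alignment.

    Assumes a uniform background distribution over the 20 amino acids.
    Returns one dict (residue -> log2 odds score) per alignment column.
    """
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("all seed sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in seed_alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(pssm, sequence):
    """Slide the PSSM along a sequence; return (best_start, best_score).

    Residues outside the 20 standard amino acids contribute a score of 0.
    Returns None if the sequence is shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical toy seed alignment and target sequence.
seed = ["LKVAL", "LRVAL", "IKVAI", "LKIAL"]
pssm = build_pssm(seed)
start, score = scan(pssm, "GGSPLKVALTDE")
print(start, round(score, 2))  # the best window starts at index 4 ("LKVAL")
```

A sequence that has diverged far from the seed scores poorly at every position, which is one way to see why short domains with non-canonical hydrophobic content are hard to relate to known families using profile-based searches.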

