Abstract

The protein sequences found in nature represent a tiny fraction of the potential sequences that could be constructed from the 20-amino-acid alphabet. To help define the properties that shaped proteins to stand out from the space of possible alternatives, we conducted a systematic computational and experimental exploration of random (unevolved) sequences in comparison with biological proteins. In our study, combinations of secondary structure, disorder, and aggregation predictions are accompanied by experimental characterization of selected proteins. We found that the overall secondary structure and physicochemical properties of random and biological sequences are very similar. Moreover, random sequences can be well-tolerated by living cells. Contrary to early hypotheses about the toxicity of random and disordered proteins, we found that random sequences with high disorder have low aggregation propensity (unlike random sequences with high structural content) and were particularly well-tolerated. This direct structure content/aggregation propensity dependence differentiates random and biological proteins. Our study indicates that while random sequences can be both structured and disordered, the properties of the latter make them better suited as progenitors (in both in vivo and in vitro settings) for further evolution of complex, soluble, three-dimensional scaffolds that can perform specific biochemical tasks.

Highlights

  • The proteinogenic amino acid alphabet has remained largely unchanged during the past ~3 billion years of astonishing evolutionary diversification

  • Systematic characterization of the folding potential of random sequences has been attempted using tertiary structure prediction algorithms such as Rosetta, but parallel studies questioned the reliability of these algorithms for random sequences unrelated to those found in nature[13,14]

  • We generated an in silico library (10^4 sequences) of 100-amino-acid proteins and evaluated the occurrence of secondary structure using five different prediction algorithms, comparing the properties of random polypeptides with those of natural proteins
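As a rough illustration of what building such an in silico library involves, the minimal sketch below draws fixed-length sequences from the 20-letter amino acid alphabet. The function name `random_library`, the seed, the library size used in the call, and the choice of uniform residue frequencies are all assumptions made for illustration; the source does not state how residues were sampled.

```python
import random

# One-letter codes of the 20 canonical proteinogenic amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_library(n_sequences, length=100, seed=0):
    """Return a list of random protein sequences.

    Assumption: every residue is drawn independently and uniformly from
    the 20-letter alphabet. A study could instead weight residues by
    their natural abundance; that choice is not specified here.
    """
    rng = random.Random(seed)  # fixed seed keeps the library reproducible
    return [
        "".join(rng.choices(AMINO_ACIDS, k=length))
        for _ in range(n_sequences)
    ]


# Library size chosen for illustration.
library = random_library(n_sequences=10_000, length=100)
```

Each sequence in `library` could then be passed to whichever secondary structure, disorder, or aggregation predictors are available; those tools are external and are not modelled in this sketch.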


Summary

Results and Discussion

Frequencies of secondary structure motifs are similar in random sequences and biological proteins. This trend is less pronounced for natural proteins in the Uni dataset (Fig. 6C) and completely absent for the PDB proteins (Fig. 6B); the PDB dataset contains proteins that were successfully expressed and structurally studied, and so represents a biased sample of all extant proteins. To better understand these differences, we performed sequence analysis of each library based on the structural content (ordered, average, and disordered). Random sequences with low structural content may represent advantageous origin points for further evolution into soluble functional proteins, as they are better tolerated in vivo and have lower aggregation scores than random sequences with structural content. This is consistent with recent studies reporting that random sequences are often bioactive and can even increase fitness in vivo, as well as work suggesting that non-coding DNA translation (one of the hypotheses about de novo gene birth) gives rise to highly disordered proteins[28,29]. Our study provides a rationale for this hypothesis on a protein-sequence-space scale.
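The grouping of a library by structural content (ordered, average, and disordered) can be sketched as follows. This is only an illustration under stated assumptions: the per-sequence structure scores are hypothetical toy values standing in for predictor output, and the equal-sized rank split into thirds is an assumed binning rule, not the thresholds used in the study.

```python
from collections import Counter


def bin_by_structure_content(scored):
    """Split (sequence, score) pairs into three rank-based groups.

    Assumption: 'score' is a predicted fraction of structured residues,
    and the groups are simple thirds of the ranked list.
    """
    ranked = sorted(scored, key=lambda pair: pair[1])
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "disordered": [seq for seq, _ in ranked[:cut1]],
        "average": [seq for seq, _ in ranked[cut1:cut2]],
        "ordered": [seq for seq, _ in ranked[cut2:]],
    }


def composition(sequences):
    """Return the relative frequency of each residue across sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: c / total for aa, c in sorted(counts.items())}


# Hypothetical toy input: short sequences with made-up structure scores.
scored = [
    ("AAAA", 0.90), ("GGGG", 0.10), ("PPGS", 0.20),
    ("LLEE", 0.80), ("AGLK", 0.50), ("KDEL", 0.45),
]
groups = bin_by_structure_content(scored)
ordered_composition = composition(groups["ordered"])
```

Comparing `composition` across the three groups is one simple way to see which residues are enriched in the more ordered or more disordered parts of a library.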

