Abstract

The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information relates to structure and function remains enigmatic. In this study, we show that at least part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on the working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs), or “words”. We first confirmed that the English language very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. Compared with natural English and “compressed” English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., “key words”) and vice versa. In fact, the highest availability peak within a given protein sequence often corresponds directly to a sequence motif. The amino acid composition of high-availability sites within motifs differs from that of entire motifs and of all protein sequences, suggesting the possible functional importance of specific SCSs and their constituent amino acids within motifs. We anticipate that our availability-based word decoding approach will be complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
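The rank-frequency analysis described above can be illustrated with a minimal sketch. It is not the authors' procedure: the text here does not define how SCSs are delimited, so the sketch assumes, purely for illustration, that an SCS is an overlapping fixed-length k-mer. The function names (`kmer_counts`, `rank_frequency`, `zipf_exponent`) and the default `k=3` are hypothetical. The exponent is estimated by an ordinary least-squares fit on the log-log rank-frequency points, which is a simple choice rather than a rigorous power-law test.

```python
from collections import Counter
import math


def kmer_counts(sequences, k=3):
    """Count overlapping k-mers (an assumed stand-in for SCSs)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, rank 1 being the most frequent."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(rank_freq):
    """Estimate s in frequency ~ rank**(-s) by least squares in log-log space."""
    if len(rank_freq) < 2:
        raise ValueError("need at least two rank-frequency points")
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; report s = -slope.
    return -sxy / sxx
```

To compare a linear range with and without the tails, the list returned by `rank_frequency` could be sliced before it is passed to `zipf_exponent`; where to cut is a judgment call that this sketch does not address.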

Highlights

  • Anfinsen’s dogma or thermodynamic hypothesis states that the amino acid sequences of proteins are necessary and sufficient to determine their three-dimensional structures and functions that realize kinetically probable and stable free energy minimum states [1]

  • The present study shows that this linguistic approach is likely useful in decoding protein amino acid sequences, focusing on putative important sites, or “key words”, which are defined as high-availability sites

  • We showed that at least a part of sequence information can be extracted by treating amino acid sequences of proteins as a natural language, i.e., current English

Introduction

Anfinsen’s dogma, or the thermodynamic hypothesis, states that the amino acid sequences of proteins are necessary and sufficient to determine their three-dimensional structures and functions that realize kinetically probable and stable free energy minimum states [1]. Information extraction from amino acid sequences is a crucial step in understanding protein molecules. Structural and functional prediction from amino acid sequences largely depends on the intricate use of the accumulated experimental data in the Protein Data Bank (PDB) [2], together with the fundamental use of sequence alignments [3]. A general rule for how protein sequence information relates to three-dimensional structures and functions is largely unknown. Secondary structure predictions based on linguistic rules, i.e., grammar, have been proposed [4,9,10]. These approaches are largely based on formal language theory. There is also an approach based on so-called “literary linguistics”, including stylistics and textual analysis [4].
