Codes of nucleotide sequences

E.N. Trifonov

doi:10.1016/0025-5564(88)90080-6

Abstract

Nucleotide sequences have many properties of a language. This analogy, when developed to its fullest, yields an interesting linguistic description of nucleotide sequences. Several a priori features of this language (called “Gnomic”) are discussed, based on its molecular nature. Gnomic appears to be a multicode language with overlapping degenerate messages, each encoding physically different specific interactions (protein-DNA, protein-RNA, protein-protein, and RNA-RNA). Several codes of the Gnomic language are discussed: the RNA-to-protein translation (triplet) code; the chromatin code, responsible for DNA folding in chromatin; the framing code, which secures correct triplet counting during translation; and, tentatively, the RNA loop code, presumably responsible for the formation of RNA loops with specific recognition sequences. The last code is aperiodic and involves mirror-symmetrical sequence elements, while the other codes are based on sequence periodicities. A general technique for detecting words in continuous (no blanks) texts is discussed, based on the context contrast of internally correlated strings.
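
As a rough illustration of what a code "based on sequence periodicities" could look like in practice, the sketch below measures how often a nucleotide recurs at a fixed distance along a sequence. It is not the method used in the paper: the function names `match_fraction` and `lag_profile`, the choice of a simple match count as the statistic, and the toy sequence are all assumptions made for this example.

```python
def match_fraction(seq: str, lag: int) -> float:
    """Return the fraction of positions i where seq[i] equals seq[i + lag].

    Hypothetical helper: a plain match count, chosen here only for
    illustration; the paper may use a different statistic.
    """
    pairs = len(seq) - lag
    if lag <= 0 or pairs <= 0:
        raise ValueError("lag must be between 1 and len(seq) - 1")
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + lag])
    return matches / pairs


def lag_profile(seq: str, max_lag: int) -> dict[int, float]:
    """Map each lag from 1 to max_lag to its match fraction."""
    return {lag: match_fraction(seq, lag) for lag in range(1, max_lag + 1)}


# Toy input (an assumption, not real data): a perfectly 3-periodic string.
toy_sequence = "GCT" * 20
profile = lag_profile(toy_sequence, 9)

for lag, fraction in profile.items():
    print(f"lag {lag}: {fraction:.2f}")
```

On a strictly 3-periodic toy string the match fraction is 1.0 at lags 3, 6, and 9 and 0.0 elsewhere. In a real sequence, a periodic signal of the kind the abstract describes would be expected to appear only as a modest elevation at the relevant lags above the background, so any such profile would need to be compared against a suitable baseline.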

Full Text