Recently, shifted periodicities 1 modulo 3 and 2 modulo 3 have been identified in protein (coding) genes of both prokaryotes and eukaryotes with autocorrelation functions analysing eight of the 64 trinucleotides (Arquès et al., 1995). This observation suggests that the trinucleotides are associated with frames in protein genes. In order to verify this hypothesis, the distribution of the 64 trinucleotides AAA,...,TTT is studied in both gene populations by using a simple method based on the trinucleotide frequencies per frame. In protein genes, the trinucleotides can be read in three frames: the reading frame 0 established by the ATG start trinucleotide, and frame 1 (resp. 2), which is frame 0 shifted by 1 (resp. 2) nucleotides in the 5′–3′ direction. The occurrence frequencies of the 64 trinucleotides are then computed in the three frames. By classifying each of the 64 trinucleotides in its preferential occurrence frame, i.e. the frame associated with its highest frequency, three subsets of trinucleotides can be identified in the three frames. This approach is applied to the two gene populations. Unexpectedly, the same three subsets of trinucleotides are identified in these two gene populations: T0 = X0 ∪ {AAA,TTT} with X0 = {AAC,AAT,ACC,ATC,ATT,CAG,CTC,CTG,GAA,GAC,GAG,GAT,GCC,GGC,GGT,GTA,GTC,GTT,TAC,TTC} in frame 0, T1 = X1 ∪ {CCC} in frame 1 and T2 = X2 ∪ {GGG} in frame 2, each subset X0, X1 and X2 having 20 trinucleotides. Surprisingly, these three subsets have five important properties: (i) the property of maximal circular code for X0 (resp. X1, X2), allowing the automatic retrieval of frame 0 (resp. 1, 2) in any region of a protein gene model (formed by a series of trinucleotides of X0) without using a start codon; (ii) the DNA complementarity property C (e.g. C(AAC) = GTT): C(T0) = T0, C(T1) = T2 and C(T2) = T1, allowing the two paired reading frames of a DNA double helix simultaneously to code for amino acids; (iii) the circular permutation property P (e.g. P(AAC) = ACA): P(X0) = X1 and P(X1) = X2, implying that the two subsets X1 and X2 can be deduced from X0; (iv) the rarity property, with an occurrence probability of X0 equal to 6 × 10⁻⁸; and (v) the concatenation property, with a high frequency (27.5%) of misplaced trinucleotides in the shifted frames, a maximal length (13 nucleotides) of the minimal window needed to retrieve the frame automatically, and an occurrence of the four types of nucleotides in the three trinucleotide sites, in favour of an evolutionary code. In the Discussion, the identified subsets T0, T1 and T2, when expressed in the three two-letter genetic alphabets purine/pyrimidine, amino/keto and strong/weak interaction, allow us to deduce that the RNY model (R = purine = A or G, Y = pyrimidine = C or T, N = R or Y) (Eigen & Schuster, 1978) is the closest two-letter codon model to the trinucleotides of T0. These three subsets are then related to the genetic code. The trinucleotides of T0 code for 13 amino acids: Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Lys, Phe, Thr, Tyr and Val. Finally, a strong correlation between the usage of the trinucleotides of T0 in protein genes and the amino acid frequencies in proteins is observed: six of the seven amino acids not coded by T0 have, as expected, the lowest frequencies in proteins of both prokaryotes and eukaryotes.
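The per-frame counting and classification method summarised in the abstract can be sketched as follows. This is only a minimal illustration and not the authors' implementation: the function names `frame_frequencies` and `preferential_frames`, the input format (uppercase DNA strings that begin at the start codon) and the tie-breaking rule (the lowest frame wins) are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def frame_frequencies(genes):
    """Return three dicts mapping trinucleotide -> relative frequency,
    one per frame (0, 1, 2). Each gene is assumed to start in frame 0."""
    counts = [Counter(), Counter(), Counter()]
    for gene in genes:
        for frame in range(3):
            # Read non-overlapping trinucleotides starting at offset `frame`.
            for i in range(frame, len(gene) - 2, 3):
                counts[frame][gene[i:i + 3]] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({t: n / total for t, n in c.items()} if total else {})
    return freqs


def preferential_frames(genes):
    """Assign each of the 64 trinucleotides to the frame in which its
    frequency is highest. Trinucleotides never observed are left out;
    ties go to the lowest frame (an assumption of this sketch)."""
    freqs = frame_frequencies(genes)
    subsets = [set(), set(), set()]
    for t in ("".join(p) for p in product("ACGT", repeat=3)):
        per_frame = [freqs[k].get(t, 0.0) for k in range(3)]
        best = max(range(3), key=lambda k: per_frame[k])
        if per_frame[best] > 0:
            subsets[best].add(t)
    return subsets


# Toy input (hypothetical, not real gene data): a GCC repeat.
toy_subsets = preferential_frames(["GCCGCCGCCGCC"])
```

On real data, `genes` would hold the coding sequences of one gene population, and the three returned sets would play the role of T0, T1 and T2.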
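Properties (ii) and (iii) can be checked directly from the set X0 listed in the abstract. The sketch below assumes that C is the reverse complement of a trinucleotide and P is a one-letter left rotation, which is what the two examples C(AAC) = GTT and P(AAC) = ACA indicate. The helper names `C`, `P` and `image` are illustrative choices.

```python
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def C(t):
    """Complementarity map: complement each base, then reverse the order."""
    return t.translate(_COMPLEMENT)[::-1]


def P(t):
    """Circular permutation: move the first letter to the end."""
    return t[1:] + t[0]


def image(f, s):
    """Apply a trinucleotide map f to every element of the set s."""
    return {f(t) for t in s}


# X1 and X2 are deduced from X0 by circular permutation (property iii).
X1 = image(P, X0)
X2 = image(P, X1)

T0 = X0 | {"AAA", "TTT"}
T1 = X1 | {"CCC"}
T2 = X2 | {"GGG"}

# Property (ii): T0 is self-complementary, T1 and T2 are exchanged by C.
complementarity_holds = (
    image(C, T0) == T0 and image(C, T1) == T2 and image(C, T2) == T1
)

# X0, X1 and X2 are pairwise disjoint and cover the 60 trinucleotides
# other than AAA, CCC, GGG and TTT.
covers_sixty = len(X0 | X1 | X2) == 60
```

Applying `P` a third time returns to the starting set, so `image(P, X2) == X0`.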
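The count of 13 amino acids coded by T0 can be reproduced with the standard genetic code. The sketch below builds the standard codon table from its usual compact 64-letter form (bases in the order T, C, A, G); the variable names are illustrative.

```python
T0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
    "AAA", "TTT",
}

# Standard genetic code, one-letter amino acid symbols, '*' = stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Amino acids coded by at least one trinucleotide of T0.
coded = {CODON_TABLE[t] for t in T0}

# The 20 amino acids of the standard code, minus those coded by T0.
all_amino_acids = set(AMINO) - {"*"}
not_coded = all_amino_acids - coded

print(len(coded), "".join(sorted(coded)))
print(len(not_coded), "".join(sorted(not_coded)))
```

The second line of output lists the seven amino acids not coded by T0, in one-letter symbols: Cys, His, Met, Pro, Arg, Ser and Trp.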