It is well known that genomes exhibit codon usage bias, that is, some codons are observed more often than others. Codons in the homo-repeat regions of human proteins are no exception. In this work, we analyzed codon usage bias for all amino acid residues in homo-repeats longer than 4 residues in 3753 human proteins, drawn from the 20447 protein sequences of the reviewed canonical human proteome. We found that almost all homo-repeats in the human proteome, most of which consist of Ala, Glu, Gly, Leu, Pro, or Ser (∼80% of all homo-repeats), show a codon usage bias, i.e., they are mainly encoded by one codon. Moreover, codon usage in homo-repeats is strongly shifted toward GC-rich codons. Homo-repeats of Ala, Glu, Gly, Leu, Pro, and Ser also predominate among the homo-repeats in the PDB, where they are found in both ordered and disordered states. Examining the distribution of splice sites, we found that about 15% of homo-repeats either contain a splice site or lie within 10 nucleotides of one, and Glu and Leu predominate in these homo-repeats. Our data are important for future studies of the functions of homo-repeats, protein-protein interactions, and evolutionary fitness.
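The kind of analysis described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that "longer than 4" means runs of at least 5 identical residues, that each protein comes with an in-frame coding sequence starting at its first codon, and that the function names (`find_homo_repeats`, `codon_usage_in_repeats`, `gc_fraction`) and the toy sequences are hypothetical.

```python
import re
from collections import Counter

# Assumption: a homo-repeat is a run of at least MIN_LEN identical residues.
MIN_LEN = 5
HOMO_REPEAT = re.compile(r"(.)\1{%d,}" % (MIN_LEN - 1))


def find_homo_repeats(protein):
    """Yield (residue, start, end) per homo-repeat; 0-based, end exclusive."""
    for match in HOMO_REPEAT.finditer(protein):
        yield match.group(1), match.start(), match.end()


def codon_usage_in_repeats(protein, cds):
    """Count the codons that encode each residue inside homo-repeats.

    Assumes `cds` is in frame with `protein`, so residue i is encoded
    by cds[3*i : 3*i + 3].
    """
    usage = {}
    for residue, start, end in find_homo_repeats(protein):
        counts = usage.setdefault(residue, Counter())
        for i in range(start, end):
            counts[cds[3 * i:3 * i + 3]] += 1
    return usage


def gc_fraction(codon_counts):
    """Fraction of G or C bases among all counted codon positions."""
    total = 3 * sum(codon_counts.values())
    gc = sum(n * sum(base in "GC" for base in codon)
             for codon, n in codon_counts.items())
    return gc / total if total else 0.0


# Toy example (made-up sequence): one Ala repeat of length 6.
protein = "MAAAAAAK"
cds = "ATG" + "GCC" * 5 + "GCT" + "AAA"
usage = codon_usage_in_repeats(protein, cds)
print(usage["A"].most_common())
print(gc_fraction(usage["A"]))
```

Applied across a proteome, the per-residue counters would show whether one codon dominates each repeat type, and `gc_fraction` gives a simple measure of the shift toward GC-rich codons.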