Abstract

We introduce a gene evolution model that extends the time-continuous stochastic IDIS model (Lèbre and Michel in J. Comput. Biol. Chem. 34:259-267, 2010) to sequence length. This new IDISL (Insertion Deletion Independent of Substitution based on sequence Length) model gives an analytical expression of the residue occurrence probability p(l) at sequence length l depending on stochastically independent processes of substitution, insertion, and deletion. Furthermore, in contrast to all mathematical models in this research field, the substitution, insertion, and deletion parameters of the IDISL model are independent of each other. For any diagonalizable substitution matrix M, the residue occurrence probability p(l) is given as a function of the eigenvalues of M, the eigenvector matrix of M, a vector r of the residue insertion rates, a deletion rate d (unlike our previous IDIS model), and a vector of the initial residue occurrence probability p(l(0)) at sequence length l(0). In another difference from the classical evolution approaches, which mainly focus on sequence alignment, the IDIS class of models allows a mathematical analysis of the behavior of the residue occurrence probability according to either evolution time or sequence length. The length parameter can be associated with any nucleotide region: genes, genomes, introns, repeats, 5' and 3' regions, etc. Three properties of the IDISL model are given in relation to the sequence length l: parameter scale, inverse evolution, and residue equilibrium distribution. Nucleotide occurrence probabilities are given in the particular case of the IDISL-HKY model, i.e. the IDISL model associated with the HKY asymmetric substitution matrix (Hasegawa et al. in J. Mol. Evol. 22:160-174, 1985). An application of the IDISL model is developed for a massive statistical analysis of GC content in all complete bacterial genomes available to date (894 non-anaerobic and anaerobic genomes).
The IDISL-HKY model confirms the increase of the GC content with the genome length for two non-anaerobic taxonomic groups of bacterial genomes. Moreover, the non-linear modelling proposed by the IDISL model outperforms the most recent modelling of GC content in these bacterial genomes (Wang et al. in Biochem. Biophys. Res. Commun. 342:681-684, 2006; Musto et al. in Biochem. Biophys. Res. Commun. 347:1-3, 2006).
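
As a minimal illustration of two ingredients named in the abstract (the eigenvalues and the eigenvector matrix of a diagonalizable substitution matrix), the sketch below builds a rate matrix in the standard textbook HKY85 parameterization and diagonalizes it with NumPy. It does not reproduce the IDISL expression for p(l), which the abstract does not state. The function name `hky_rate_matrix`, the base order A, C, G, T, and the values chosen for `pi` and `kappa` are assumptions made for illustration; the matrix convention used in the paper itself may differ.

```python
import numpy as np

def hky_rate_matrix(pi, kappa):
    """Hypothetical helper: textbook HKY85 rate matrix, base order A, C, G, T.

    Off-diagonal entry (i, j) is kappa * pi[j] for a transition
    (A<->G or C<->T) and pi[j] for a transversion; each diagonal
    entry is set so that its row sums to zero.
    """
    pi = np.asarray(pi, dtype=float)
    purines = {0, 2}  # indices of A and G in the assumed base order
    Q = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            # A transition keeps the base within purines or within pyrimidines.
            is_transition = (i in purines) == (j in purines)
            Q[i, j] = (kappa if is_transition else 1.0) * pi[j]
        Q[i, i] = -Q[i].sum()
    return Q

# Assumed example values, not taken from the paper.
pi = np.array([0.2, 0.3, 0.4, 0.1])  # equilibrium frequencies of A, C, G, T
kappa = 2.0                           # transition/transversion rate ratio

Q = hky_rate_matrix(pi, kappa)

# Eigen-decomposition Q = V diag(eigenvalues) V^{-1}.
eigenvalues, V = np.linalg.eig(Q)
Q_rebuilt = V @ np.diag(eigenvalues) @ np.linalg.inv(V)

# GC content implied by the equilibrium frequencies (C + G).
gc_equilibrium = pi[1] + pi[2]
```

In this parameterization the vector `pi` is the stationary distribution of the substitution process alone, so `pi @ Q` is numerically zero. The IDISL model additionally involves insertion rates r and a deletion rate d, which this sketch leaves out.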
