Approaches for parametrization of Markovian models of molecular evolution for protein-coding sequences

Stefan Zoller

doi:10.3929/ethz-a-010464331

Abstract

Evolution underlies all biological processes. Molecular evolution in protein-coding sequences is most widely described by Markovian models of character substitution. These models are at the core of all bioinformatics applications that deal with sequence data: estimating distances between sequences, building (multiple) sequence alignments and phylogenetic trees, and more. The parametrization of the defining rate matrices is important for ensuring the quality not only of the models, but also of all methods and applications that make use of them. Two main types of Markov models have been used for molecular evolution. On the one hand, there are empirical models, where a rate matrix is estimated once from a large set of data and then kept fixed in all applications. On the other hand, there are parametric models, where a few free parameters are fitted to the data set in question and then define the rate matrix. In the first chapters of my thesis, I present a new method to formulate parameters for a semi-empirical model of molecular evolution that can describe most of the variance found in (standardized) real data. A semi-empirical model starts with a first approximation of the final rate matrix, estimated from a large pool of real sequences. In contrast to empirical models, however, a semi-empirical model still allows certain free parameters to be fitted in every application to capture the peculiarities of the data set in question. I applied my new method to codon data as well as to amino acid data, and both models have been extensively tested on large data sets. Applied to sequence data that match the taxonomic range, models generated with this method outrank all other models in the comparison. Typically, researchers use a single Markov model per data set. However, different parts of the data might show different patterns of evolution, for example different evolutionary rates or different levels of selective pressure.
This has been handled either by cutting the sequence alignments into smaller chunks and fitting a separate instance of a Markov model to each, or by applying more complex models with more free parameters. I would like to
