Improving Phylogenetic Inference with a Semiempirical Amino Acid Substitution Model

S Zoller, A Schneider

doi:10.1093/molbev/mss229

Abstract

Amino acid substitution matrices describe the rates at which amino acids are replaced during evolution. In contrast to nucleotide or codon models, amino acid substitution matrices are in general parameterless and empirically estimated, probably because there is no obvious parametrization for amino acid substitutions. Principal component analysis has previously been used to improve codon substitution models by empirically finding the most relevant parameters. Here, we apply the same method to amino acid substitution matrices, leading to a semiempirical substitution model that can adjust the transition rates to the protein sequences under investigation. Our new model almost invariably achieves the best likelihood values in large-scale comparisons with established amino acid substitution models (JTT, WAG, and LG). For longer alignments in particular, these likelihood gains are considerably larger than what could be expected from simply having more parameters. The application of our model differs from that of mixture models (such as UL2 or UL3), as we optimize one rate matrix per alignment, whereas mixture models apply the variation per alignment site. This makes our model computationally more efficient, while the performance is comparable to that of UL3. Applied to the phylogenetic problem of the origin of placental mammals, our new model and the UL3 mixture model are the only ones of the tested models that cluster Afrotheria and Xenarthra into a clade called Atlantogenata, which would be consistent with recent findings using more sophisticated phylogenetic methods.
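The general idea of a semiempirical rate matrix built from principal components can be illustrated with a minimal sketch. The abstract does not specify the procedure, so everything here is an assumption made for illustration only: the function names, the use of log-transformed exchangeabilities (to keep rates positive), the normalization, and the random stand-in data. It shows how a small number of component coefficients, which would be optimized per alignment, could select one rate matrix from a family spanned by empirically derived components.

```python
import numpy as np

def pca_components(samples, n_components=1):
    """Mean vector and leading principal directions of `samples` (one row per matrix)."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are orthonormal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def rate_matrix(exchangeabilities, freqs):
    """Reversible rate matrix Q with q_ij = s_ij * pi_j, scaled to one substitution per unit time."""
    n = len(freqs)
    s = np.zeros((n, n))
    s[np.triu_indices(n, k=1)] = exchangeabilities  # n*(n-1)/2 values, 190 for n = 20
    s = s + s.T
    q = s * freqs                                   # column j is multiplied by pi_j
    np.fill_diagonal(q, -q.sum(axis=1))             # rows sum to zero
    scale = -np.dot(freqs, np.diag(q))              # expected substitution rate
    return q / scale

def semiempirical_q(mean, components, coeffs, freqs):
    """Hypothetical model: log-exchangeabilities = mean + sum_k coeffs[k] * components[k]."""
    log_ex = mean + np.dot(coeffs, components)
    return rate_matrix(np.exp(log_ex), freqs)

# Random stand-in for a collection of empirically estimated matrices (assumption, not real data).
rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 190))
mean, comps = pca_components(samples, n_components=2)
freqs = np.full(20, 0.05)
q = semiempirical_q(mean, comps, np.array([0.5, -0.3]), freqs)
```

In such a setup, only the entries of `coeffs` would be free parameters during likelihood optimization, so a single matrix is fitted per alignment rather than per site.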

Full Text