Starless bias and parameter-estimation bias in the likelihood-based phylogenetic method.

Xuhua Xia

doi:10.3934/genet.2018.4.212

Abstract

I analyzed various site pattern combinations in a 4-OTU case to identify sources of starless bias and parameter-estimation bias in likelihood-based phylogenetic methods, and report three contributions. First, the likelihood method is counterintuitive in that it may not generate a star tree from sequences that are equidistant from each other. This behaviour, dubbed starless bias, occurs in a 4-OTU tree when there is an excess (i.e., more than expected from a star tree and a substitution model) of conflicting phylogenetic signals supporting the three resolved topologies equally. I identified special site pattern combinations that lead to rejection of a star tree even though the sequences are equidistant from each other. Second, fitting a gamma distribution to model rate heterogeneity over sites is strongly confounded with tree topology, especially in conjunction with the starless bias. I present examples showing dramatic differences in the estimated shape parameter α between a star tree and a resolved tree. The estimate may indicate no rate heterogeneity over sites (α > 10000) when a star tree is imposed, but strong rate heterogeneity over sites (α < 1) when an (incorrect) resolved tree is imposed. This dependence of “rate heterogeneity” on tree topology implies that “rate heterogeneity” is not a sequence-specific feature, and cautions against interpreting a small α to mean that some sites are under strong purifying selection and others are not. Third, because there is no existing (and working) likelihood method for evaluating a star tree with continuous gamma-distributed rates, I have implemented the method for JC69 in a self-contained R script for a four-OTU tree (star or resolved), in addition to another R script assuming a constant rate over sites. These R scripts should be useful for teaching and exploring likelihood methods in phylogenetics.

Highlights

If a set of aligned sequences are equidistant from each other, i.e., the number of various types of substitutions between any two sequences is exactly the same, we intuitively would expect a star tree.
Sample1.fas and Sample2.fas (s003.fas and s004.fas within the Supplementary genetics-05-04-212.zip) have sequences equivalent to site pattern combinations (0, 14, 43, 4, 40) and (1, 11, 41, 5, 42), respectively (Figure 1A). They were simulated under the JC69 model on a star tree using Evolver in PAML [13], to show that likelihood methods will recover a star tree if sequences evolve according to a star tree and a specific substitution model.
The likelihood method recovers the true star tree when sequences evolve under a substitution model.

Summary

Introduction

If a set of aligned sequences are equidistant from each other, i.e., the number of various types of substitutions between any two sequences is exactly the same, we intuitively would expect a star tree. Distance-based methods such as neighbor-joining [1] or FastME [2] will give us a star tree whenever pairwise distances are all equal. The starless bias refers to the inability of a phylogenetic method to generate a star tree from equidistant sequences. It was first alluded to in a study of potential bias in the maximum likelihood method involving missing data and rate heterogeneity over sites [3], but its occurrence is more general than that. I will illustrate this bias here with four sequences, identify the source of the bias, and discuss alternative approaches relevant to the problem.
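To make "equidistant" concrete, here is a minimal Python sketch that computes pairwise JC69 distances, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and checks whether they are all equal. The function names and the toy alignment are assumptions made for illustration only; they are not taken from the paper's R scripts or supplementary files. The toy alignment is built from three conflicting two-two site patterns, one favouring each resolved topology, plus constant sites, so every pair of sequences differs at the same number of sites.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69_distance(a, b):
    """JC69 distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(a, b)
    if p >= 0.75:
        # The JC69 correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def is_equidistant(seqs, tol=1e-12):
    """True if all pairwise JC69 distances agree within tol."""
    dists = [jc69_distance(a, b) for a, b in combinations(seqs, 2)]
    return max(dists) - min(dists) <= tol


# Hypothetical toy alignment (not the paper's data). The first three
# columns are the patterns AACC, ACAC and ACCA read down the four
# sequences, one supporting each resolved 4-OTU topology; the
# remaining seven columns are constant.
toy = [
    "AAAGGGGGGG",
    "ACCGGGGGGG",
    "CACGGGGGGG",
    "CCAGGGGGGG",
]

if __name__ == "__main__":
    for (i, a), (j, b) in combinations(enumerate(toy, start=1), 2):
        print(f"S{i}-S{j}: d = {jc69_distance(a, b):.4f}")
    print("equidistant:", is_equidistant(toy))
```

With all six distances equal, a distance-based method has no information favouring any internal branch. Whether a likelihood method also returns a star tree for such data is the question this paper examines.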

Methods

Results

Conclusion