Abstract

Phylogenies provide the framework for all inferences in comparative biology, so obtaining the right tree is critical. Maximum parsimony (MP) is a non-parametric method that usually performs well, but certain branch-length combinations can create a strong bias – called long-branch attraction – in favor of the wrong tree [1,2]. Maximum likelihood (ML) will, in most cases, recover the true tree if the correct probabilistic model of the evolutionary process is used and enough data are provided. Limited research has been conducted on the performance of ML when the incorrect model is used. Simulation studies have shown that when models that are less complex than the true process are used, ML can become subject to long-branch attraction, although the bias is not as strong as with MP [3–5]. In these studies, the true model has always been available in major software packages, so the take-home message has been that we can be confident in ML results as long as we select the correct model.

Real sequences, however, are subject to selection pressures that might change over time and vary among sites. The diverse evolutionary dynamics that result are not modeled by current ML implementations, which assume an identically distributed evolutionary process for all sequence sites. Of particular concern is heterotachy – when evolutionary rates at specific sites differ among lineages because of changing selective constraints. Heterotachy, which has been shown to occur in numerous genes [6–11], is important because accurate estimates of branch lengths for each site are key to recovering the true tree.

Our recent study [12], reviewed in this issue by Mike Steel [13], showed that some heterotachous conditions can cause ML to become strongly biased in favor of the incorrect tree, even when the best available model is used. This occurs because ML estimates branch lengths as compromises across all sites, which makes them incorrect for every site when heterotachy is present.
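The "compromise" effect can be illustrated with a deliberately minimal sketch. It is not the analysis from our study: it assumes only two sequences, the Jukes-Cantor (JC69) substitution model, and an unlimited number of sites, so that observed proportions equal their expectations. The function names and the example branch lengths are hypothetical and chosen only for illustration. Sites are split into classes that evolve along the same branch with different lengths, and a single homogeneous branch length is then estimated by ML from the pooled data.

```python
import math


def jc69_p_diff(t):
    """Probability that two sequences differ at a site under JC69,
    where t is the expected number of substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def jc69_ml_distance(p_hat):
    """Closed-form ML estimate of t under JC69, given the proportion
    of sites at which the two sequences differ."""
    if not 0.0 <= p_hat < 0.75:
        raise ValueError("p_hat must lie in [0, 0.75) for a finite estimate")
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def compromise_branch_length(lengths, weights):
    """Single branch length that homogeneous ML converges to when site
    classes with the given true lengths are pooled (infinite-data limit).
    Assumption: weights are the class proportions and sum to 1."""
    p_mix = sum(w * jc69_p_diff(t) for t, w in zip(lengths, weights))
    return jc69_ml_distance(p_mix)


if __name__ == "__main__":
    # Hypothetical example: half the sites evolve slowly, half quickly.
    true_lengths = [0.05, 0.75]
    t_hat = compromise_branch_length(true_lengths, [0.5, 0.5])
    print("true lengths per class:", true_lengths)
    print("single ML estimate:", round(t_hat, 4))
```

Under these assumptions the pooled estimate falls strictly between the two true lengths, so it matches neither class of sites. It also falls below their arithmetic mean, because the probability of observing a difference saturates as branch length grows.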
When the incorrect branch lengths are used, the likelihood of the incorrect tree can be greater than that of the true tree. Under some of the conditions we examined, ML is so strongly biased that it is outperformed by MP, which is unaffected by heterotachy. Furthermore, the deficiencies of ML in this case cannot be repaired using a better model, because a model resembling this heterogeneous evolutionary process has not been implemented.

Realism and mixed models

Steel raises several interesting questions and caveats. First, he argues that the conditions we investigated – convergent rate changes in non-sister lineages – are unrealistic. This might well be true, but the patterns of heterotachy in real sequence sets have not been explored adequately to support such a statement empirically. The idea that a site might be released from selection (or subjected to novel constraints) in parallel does not seem outlandish, although it might occur rarely. Consider, for example, a protein whose structure is stabilized by an interaction between the side chains on two helices; the specific sites involved in the interaction might change with time, constraining sites in some lineages that were previously neutral, and releasing formerly constrained sites to evolve more rapidly. If there are a finite number of sites that can participate in this interaction, a site might become part of the interaction in two separate lineages independently. More importantly, several other forms of heterotachy, which are probably more realistic, also cause ML to perform poorly – including sequences in which evolutionary constraints are released in a heterotachous manner in single lineages, and sequences that mix some sites containing a strong signal with others containing pure noise (our unpublished data).

Second, Steel implies that we are too pessimistic in our discussion of the potential that mixed models offer for improving ML.
In our article, we developed an ML model to accommodate heterotachy, in which each site can evolve on a mixture of two different sets of branch lengths; we showed that this technique performs much better than standard ML or parsimony. We are excited by the potential of this approach and are actively pursuing it. Models like this have not yet been implemented in a generally useful framework, however, and their accuracy and robustness under a wide range of conditions have not yet been validated. Indeed, there are non-trivial computational issues that limit the ability of current algorithms to find the optimal parameter values for mixture models; these will have to be solved before the method can be used on anything larger than a toy problem. Furthermore, the method is much more computationally intensive than standard ML, which might render it impractical for the large data sets that are usually required for phylogenetic accuracy. We therefore feel it is appropriate to temper our optimism about this new strategy with caution.

Selecting a good model

An additional concern is model selection: how does one know how many categories should be used in a mixed model? If models are significantly underparameterized, the same errors that occur with homogeneous ML are likely to be reproduced. If many categories are necessary – which can be a result of most sites having unique evolutionary dynamics – then the number of parameters approaches the number of sites, and the data will be
