Abstract

An often-expressed concern regarding the use of maximum likelihood and other model-based methods in phylogenetic analysis is that the assumed models are too simple, violating known aspects of sequence evolution in several ways (e.g., Sanderson and Kim, 2000). For example, we know that sites do not truly evolve independently, that the substitution process is not completely homogeneous across sites and through time, and that simple stochastic models fail to adequately represent the complexity of the nucleotide substitution process. Many advances have been made in an effort to model sequence evolution more realistically, including allowance for unequal base composition (Felsenstein, 1981), complex substitution patterns (e.g., Tavaré, 1986; Yang, 1994), among-site rate variation (Yang, 1993), compensatory base substitution (Schöniger and von Haeseler, 1994), and heterogeneity of the substitution process across the tree (Galtier and Gouy, 1998). Nevertheless, even with these improvements, there is no reason to believe that even the most general and parameter-rich evolutionary models currently available capture all the nuances of the processes that have generated any particular set of sequences. Furthermore, accurate estimation of the parameters of complex evolutionary models can be very difficult, and data sets containing many taxa and sites may be needed for likelihood estimation to consistently yield reliable parameter estimates (Nielsen, 1997; Sullivan et al., 1999). Fortunately, perfect models are not necessarily a prerequisite for reliable statistical inference (e.g., Burnham and Andersen, 1998). Many statistical methods perform well in the face of violation of common distributional assumptions such as normality (e.g., Boneau, 1960; Donaldson, 1968). However, this observation does not guarantee the robustness of maximum likelihood when applied to the phylogenetic estimation problem.
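
As a concrete illustration of one such model violation, the following minimal Python sketch shows what can happen when among-site rate variation is ignored under the Jukes–Cantor (JC) model. It is not the simulation design of any study cited here. The function names `jc_p_diff` and `jc_distance`, the branch length of 1.0, and the gamma shape of 0.5 are all illustrative assumptions. The sketch relies only on the standard JC result that two sequences separated by t expected substitutions per site differ at a site with probability 3/4 (1 - exp(-4t/3)), and on the gamma moment-generating function for rates with mean 1 and shape alpha.

```python
import math

def jc_p_diff(t, alpha=None):
    """Probability that a site differs between two sequences separated by
    branch length t (expected substitutions per site) under Jukes-Cantor.

    alpha=None assumes equal rates across sites. Otherwise site rates are
    assumed to follow a gamma distribution with mean 1 and shape alpha
    (an illustrative parameterization, not taken from the text).
    """
    if alpha is None:
        decay = math.exp(-4.0 * t / 3.0)
    else:
        # E[exp(-4 r t / 3)] for r ~ Gamma(shape=alpha, mean=1)
        decay = (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha)
    return 0.75 * (1.0 - decay)

def jc_distance(p):
    """JC distance estimated from a proportion of differing sites p,
    assuming equal rates across sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical values chosen only for illustration.
true_t = 1.0
p_gamma = jc_p_diff(true_t, alpha=0.5)   # data "generated" with rate variation
est_equal_rates = jc_distance(p_gamma)   # analyzed as if rates were equal

print(f"true distance: {true_t:.3f}")
print(f"equal-rates JC estimate: {est_equal_rates:.3f}")
```

Under these assumed values, the equal-rates estimate falls below the true distance, which is the kind of systematic bias that motivates examining how model violations affect phylogenetic estimation.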
It is therefore important to examine the impact of various model violations on the accuracy of phylogenetic estimation (e.g., Huelsenbeck, 1995). In some cases, biases introduced by violation of a model’s assumptions can actually favor recovery of the true tree rather than an incorrect tree (Yang, 1997; Siddall, 1998). For example, Yang (1997) simulated data under a Jukes–Cantor (JC) model with variable rates across sites (following a gamma distribution) and demonstrated that estimation under the JC model with equal rates could be more efficient than estimation in which among-site rate variation was correctly modeled by a gamma distribution. However, the reasons for the improvement in efficiency should be disconcerting (Bruno and Halpern, 1999; Swofford et al., 2001). These results suggest that application of increasingly general and complex models would sometimes lead to decreased efficiency, despite the fact that the more complex models almost always provide significantly better fit to real data than the simpler models. However, in Yang’s study, the improvement
