Abstract

The effect of missing data on phylogenetic methods is a potentially important issue in our attempts to reconstruct the Tree of Life. If missing data are truly problematic, then it may be unwise to include species in an analysis that lack data for some characters (incomplete taxa) or to include characters that lack data for some species. Given the difficulty of obtaining data from all characters for all taxa (e.g., fossils), missing data might seriously impede efforts to reconstruct a comprehensive phylogeny that includes all species. Fortunately, recent simulations and empirical analyses suggest that missing data cells are not themselves problematic, and that incomplete taxa can be accurately placed as long as the overall number of characters in the analysis is large. However, these studies have so far only been conducted on parsimony, likelihood, and neighbor-joining methods. Although Bayesian phylogenetic methods have become widely used in recent years, the effects of missing data on Bayesian analysis have not been adequately studied. Here, we conduct simulations to test whether Bayesian analyses can accurately place incomplete taxa despite extensive missing data. In agreement with previous studies of other methods, we find that Bayesian analyses can accurately reconstruct the position of highly incomplete taxa (i.e., 95% missing data), as long as the overall number of characters in the analysis is large. These results suggest that highly incomplete taxa can be safely included in many Bayesian phylogenetic analyses.

The impact of missing data is a potentially important issue in phylogenetic analyses, particularly if the goal is to reconstruct a comprehensive Tree of Life that includes both fossil and living taxa. Missing data are often encountered when combining data from two or more different genes, when some of the taxa have sequence data available for one gene but not the other.
If the taxa lacking data for a gene are included in the combined analysis, then the characters associated with this gene are typically coded as missing or unknown (often denoted with a ?). Similarly, missing data are often encountered in analyses that include fossil taxa, when certain taxa must be scored as unknown for certain characters because the relevant features have not been adequately preserved. Concerns about missing data may often determine what characters and taxa will be included in an analysis (Wiens, 2006), even if this is not always stated explicitly by researchers. For example, if missing data are considered to be problematic, then one should only include species that have complete data for all characters or else only include characters that have complete data for all species. Thus, one may have to reduce the number of taxa or characters in an

