Abstract
Non-random distributions of missing data are a general problem for likelihood-based statistical analyses, including those in a phylogenetic context. Extensive non-randomly distributed missing data are particularly problematic in supermatrix analyses that include many terminals and/or loci. It has been widely reported that missing data can lead to loss of resolution, but only very rarely create misleading or otherwise unsupported results in a parsimony context. Yet this does not hold for all parametric-based analyses because of their assumption of homogeneity across characters and lineages, which can lead to both long-branch attraction and long-branch repulsion. Contrived examples were used to demonstrate that non-random distributions of missing data, even without rate heterogeneity among characters and with a well-fitting model, can provide misleading likelihood-based topologies and branch-support values that are radically unstable based on slight modifications to character sampling. The same can occur despite complete absence of parsimony-informative characters. Otherwise unsupported resolution and high branch support for these clades were found to occur frequently in 22 empirical examples derived from a published supermatrix. Partitioning characters based on the distribution of missing data helped to decrease, but did not eliminate, these artifacts. These artifacts were exacerbated by low-quality tree searches, particularly when holding only a single optimal tree that must be fully resolved.