Until recently, it was believed that complex phylogenies might be extremely difficult to reconstruct due to the phenomenal rate of increase in the number of possible phylogenies as the number of taxa increases. However, Hillis (1996) showed through simulation that, for at least one complex phylogeny of angiosperms with 228 taxa, reconstruction was far more accurate than expected, even with relatively modest amounts of DNA sequence data. This led to a flurry of papers on the subject of taxon sampling and phylogenetic reconstruction, with focus quickly shifting from the question of whether complex phylogenies can be reconstructed to whether and how much an existing phylogeny can be improved through increased taxon sampling (Hillis, 1998; Kim, 1998; Poe, 1998; Poe and Swofford, 1999; Pollock and Bruno, 2000; Rannala et al., 1998; Yang, 1998).

Although a statistician might intuitively believe that it is generally better (or at least no worse) to increase the amount of data to resolve a question in statistical inference, the benefits of taxon addition for phylogenetic inference remain controversial. Some researchers have argued that taxon addition can decrease accuracy (Kim, 1996, 1998), while others believe that increased sampling improves accuracy (Graybeal, 1998; Hillis, 1996, 1998; Murphy et al., 2001; Poe, 1998; Pollock and Bruno, 2000; Pollock et al., 2000; Soltis et al., 1999). The reasons that different papers come to apparently contradictory conclusions deserve careful consideration.

An often-cited factor affecting the benefits of taxon addition is the phenomenon of long-branch attraction (LBA). Some phylogenetic methods have a bias toward preferential clustering of long branches, leading to erroneous results when those long branches do not actually represent a monophyletic assemblage (Felsenstein, 1978; Hendy and Penny, 1989).
This phenomenon has been cited in favor of increased taxon sampling, since sampling can be designed to break up long branches (Hillis, 1998). However, increased sampling has also been implicated as a potential cause of LBA because addition of a new long branch may wrongly attract a pre-existing long branch that had previously been inferred correctly (Poe and Swofford, 1999; Rannala et al., 1998). LBA may also explain some simulations that have found problems in phylogeny estimation when sampling outside the taxonomic group of interest (but see Pollock and Bruno [2000] for an alternative explanation). Outside sampling in these simulations tended to add long branches, which tended to attract the longest unbroken branch in the group of interest (Hillis, 1998; Rannala et al., 1998). The degree to which LBA is a problem depends greatly on the method of analysis, and LBA is much less of a problem for maximum likelihood (ML) than for parsimony or distance methods (Bruno and Halpern, 1999).

A recent paper on the subject of taxon addition (Rosenberg and Kumar, 2001) concludes that increased taxon sampling is of little benefit to phylogenetic inference when compared to increasing sequence length. We disagree with their interpretation and believe that their data support the importance of increased taxon sampling. In addition, some of their data were simulated under extreme conditions (i.e., substitution rates that were very high or low, or sequences that were unreasonably short). Large error values and nonlinear relationships at these extremes make it difficult to interpret effects for the majority of the range, and averaging across the entire range is inappropriate. Moreover, we do not believe that Rosenberg and Kumar (2001) used the most appropriate metric to measure the relative effect of taxon addition. Our reanalysis of their simulated data indicates that increased taxon sampling is highly beneficial for phylogenetic inference.