An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines

Michael DeGiorgio, Richard Cronn, John Syring, David B Neale, Andrew J Eckert, Noah A Rosenberg, Aaron Liston

doi:10.1186/1471-2148-14-67

Abstract

Background: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.

Results: Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ∼47 kilobases of sequence at 121 loci. Each “strategy” for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies.

Conclusions: When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.
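
The comparison of strategies by the clades their species trees share can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the nested-tuple tree encoding, the function names `clades` and `shared_clade_matrix`, the strategy labels, and the five placeholder taxa A to E are all assumptions made for illustration.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    Assumed encoding: a leaf is a string, an internal node is a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is present in every tree
    return found


def shared_clade_matrix(trees):
    """Count the clades shared by each pair of named species tree estimates."""
    names = sorted(trees)
    clade_sets = {name: clades(trees[name]) for name in names}
    return {(a, b): len(clade_sets[a] & clade_sets[b])
            for a in names for b in names}


# Hypothetical species tree estimates from three strategies (placeholder taxa).
trees = {
    "strategy_1": ((("A", "B"), "C"), ("D", "E")),
    "strategy_2": ((("A", "B"), ("C", "D")), "E"),
    "strategy_3": (((("A", "B"), "C"), "D"), "E"),
}

matrix = shared_clade_matrix(trees)
print(matrix[("strategy_1", "strategy_2")])  # 1: only {A, B} is shared
print(matrix[("strategy_1", "strategy_3")])  # 2: {A, B} and {A, B, C}
```

A pairwise matrix of this kind is one possible input to hierarchical clustering or principal components analysis, which group strategies that tend to recover the same clades.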

Highlights

As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data.
As was observed with M, strategies based on STAR and Rooted Triple Consensus (RTC) together with maximum likelihood (ML), maximum parsimony (MP), or neighbor joining (NJ) share about as many clades with other such strategies as with the remaining strategies. These results suggest that strategies that are topologically based (i.e., STAR and RTC) tend to infer clades that are supported both by other topologically based strategies and by strategies that are not strictly topologically based, whereas strategies that are not strictly topologically based (i.e., Concatenation, SuperMatrix Rooted Triple (SMRT), and STEAC) tend to infer clades that are not supported by strategies that are strictly topologically based (i.e., STAR, RTC, and Minimize Deep Coalescences (MDC)).
We have found that MDC tends to infer balanced topologies, whereas SMRT and STEAC tend to infer more unbalanced topologies.
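
Tree balance of the kind mentioned above can be quantified in several ways. The sketch below uses the Colless index, which is an assumption here: the text does not say which balance statistic was used. The nested-tuple encoding, the function name `colless_index`, and the placeholder taxa A to H are likewise illustrative only.

```python
def colless_index(tree):
    """Colless imbalance of a rooted, strictly binary tree.

    Sums |leaves in left subtree - leaves in right subtree| over internal nodes.
    Assumed encoding: a leaf is a string, an internal node is a 2-tuple.
    """
    total = 0

    def count_leaves(node):
        nonlocal total
        if isinstance(node, str):
            return 1
        left, right = node  # raises if a node is not binary
        n_left, n_right = count_leaves(left), count_leaves(right)
        total += abs(n_left - n_right)
        return n_left + n_right

    count_leaves(tree)
    return total


# A fully balanced topology on eight placeholder taxa.
balanced = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

# A fully unbalanced ("caterpillar") topology on the same taxa.
caterpillar = "A"
for leaf in "BCDEFGH":
    caterpillar = (caterpillar, leaf)

print(colless_index(balanced))     # 0
print(colless_index(caterpillar))  # 21, the maximum for eight leaves
```

Lower values indicate more balanced topologies, so under this statistic a strategy that tends toward balanced trees would score closer to 0 than one that tends toward caterpillar-like trees.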

Introduction

As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. It has become increasingly common to sequence large numbers of individuals at many loci (e.g., [1,2,3,4]). While these multilocus datasets provide the potential to improve the accuracy of phylogeny inferences over large sets of taxa, topologies of trees inferred at different loci might not match, for a variety of reasons [5].

A third class of approaches can be labeled “single-stage” methods, in which species trees are inferred by simultaneously modeling the evolution of sequences among all sampled loci to output a species tree estimate [17,18,19,20]. These single-stage model-based methods often have desirable statistical properties, but because they typically explore large spaces of possibilities rather than algorithmically constructing estimated trees, they can be computationally intensive and applicable only to smaller datasets.
