Identifying the closest living relative(s) of tetrapods is an important, yet still contested question in vertebrate phylogenetics. Three hypotheses are possible, and ruling out alternatives has proven difficult even with large molecular data sets due to weak phylogenetic signal coupled with nonphylogenetic noise resulting from relatively rapid speciation events that occurred a long time ago (~400 Ma). Here, we revisit the identity of the closest living relative of land vertebrates from a phylogenomic perspective and include new genomic data for all extant lungfish genera. RNA-seq proves to be a great alternative to genomic sequencing, which is currently not technically feasible in lungfishes due to their huge (50-130 Gb) and repetitive genomes. We examined the most important sources of systematic error, namely long-branch attraction (LBA), compositional heterogeneity, and the distribution of missing data, and applied different correction techniques. A multispecies coalescent approach is used to account for deep coalescence that might come from the short and deep internodes separating early sarcopterygian splits. Concatenation methods favored lungfishes as the closest living relatives of tetrapods with strong statistical support. Amino acid profile mixture models can unambiguously resolve this difficult internode thanks to their ability to avoid systematic error. We assessed the performance of different site-heterogeneous models and data partitioning and compared the ability of different strategies designed to overcome LBA, including taxon manipulation, reduction of among-lineage rate heterogeneity, and removal of fast-evolving or compositionally heterogeneous positions. The identification of lungfish as sister group of tetrapods is robust regarding the effects of nonstationary composition and distribution of missing data.
The multispecies coalescent method reconstructed strongly supported topologies that were congruent with concatenation, despite pervasive gene tree heterogeneity. We reject alternative topologies for early sarcopterygian relationships by increasing the signal-to-noise ratio in our alignments. The analytical pipeline outlined here combines probabilistic phylogenomic inference with methods for evaluating data quality and model adequacy and for assessing systematic error, and thus is likely to help resolve similarly difficult internodes in the tree of life. [Coalescence; coelacanth; compositional heterogeneity; gene tree; long-branch attraction; lungfish; missing data; model misspecification; phylogenomic; species tree; systematic error.]