Abstract
BackgroundThe flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. It is widely appreciated that automated alignment protocols can yield inaccuracies, but the relative impact of various sources error on phylogenomic analysis is not yet known. This study employs an updated mammal data set of 5162 coding loci sampled from 90 species to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation. Additionally, a novel coalescent likelihood ratio test is introduced for comparing competing species trees against a given set of gene trees.ResultsThe aligned DNA sequences of 5162 loci from 90 species were trimmed and filtered using trimAL and two filtering protocols. The final dataset contains 4 sets of alignments - before trimming, after trimming, filtered by a recently proposed pipeline, and further filtered by comparing ML gene trees for each locus with the concatenation tree. Our analyses suggest that the average discordance among the coalescent trees is significantly smaller than that among the concatenation trees estimated from the 4 sets of alignments or with different substitution models. There is no significant difference among the divergence times estimated with different substitution models. However, the divergence dates estimated from the alignments after trimming are more recent than those estimated from the alignments before trimming.ConclusionsOur results highlight that alignment uncertainty of the updated mammal data set and the choice of substitution models have little impact on tree topologies yielded by coalescent methods for species tree estimation, whereas they are more influential on the trees made by concatenation. Given the choice of calibration scheme and clock models, divergence time estimates are robust to the choice of substitution models, but removing alignments deemed problematic by trimming algorithms can lead to more recent dates. Although the fossil prior is important in divergence time estimation, Bayesian estimates of divergence times in this data set are driven primarily by the sequence data.
Highlights
The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment
The linear regression analysis suggests that the average bootstrap percentages estimated with the Jukes-Cantor model (JC), Kimura model (K80), and General time reversible model (GTR) models are consistent across 5162 loci (Additional file 1: Figure S2b)
The final dataset containing 4 sets of alignments was used to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation
Summary
The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. There is a growing interest in developing unified models for phylogenomic data analysis that acknowledge a multitude of biological processes One such model, the multispecies coalescent model, integrates nucleotide substitution processes [1] and the coalescence process [2,3,4,5,6] to deliver estimates of phylogenies that are. Most efforts have been devoted to evaluating the effects of alignment uncertainty and substitution models in the context of gene tree estimation; their impacts on species tree inference under the multispecies coalescent have not been fully explored, despite the fact that coalescent species tree estimation appears to be more robust than concatenation methods to taxon sampling, long branch attraction, missing data and other biases [24,25,26], especially but not exclusively when information content of individual alignments is moderate or high [27, 28]
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.