Genomic data has become ubiquitous in phylogenomic studies, including divergence time estimation, but provide new challenges. These challenges include, amongst others, biological gene tree discordance, methodological gene tree estimation error, and computational limitations on performing full Bayesian inference under complex models. In this study, we use a recently published firefly (Coleoptera: Lampyridae) anchored hybrid enrichment dataset (AHE; 436 loci for 88 Lampyridae species and 10 outgroup species) as a case study to explore gene tree estimation error and the robustness of divergence time estimation. First, we explored the amount of model violation using posterior predictive simulations because model violations are likely to bias phylogenetic inferences and produce gene tree estimation error. We specifically focused on missing data (either uniformly distributed or systematically) and the distribution of highly variable and conserved sites (either uniformly distributed or clustered). Our assessment of model adequacy showed that standard phylogenetic substitution models are not adequate for any of the 436 AHE loci. We tested if the model violations and alignment errors resulted indeed in gene tree estimation error by comparing the observed gene tree discordance to simulated gene tree discordance under the multispecies coalescent model. Thus, we show that the inferred gene tree discordance is not only due to biological mechanism but primarily due to inference errors. Lastly, we explored if divergence time estimation is robust despite the observed gene tree estimation error. We selected four subsets of the full AHE dataset, concatenated each subset and performed a Bayesian relaxed clock divergence estimation in RevBayes. The estimated divergence times overlapped for all nodes that are shared between the topologies. Thus, divergence time estimation is robust using any well selected data subset as long as the topology inference is robust.
Read full abstract