Sources of Error and Incongruence in Phylogenomic Analyses

Christoph Bleidorn

doi:10.1007/978-3-319-54064-1_9

Abstract

Phylogenomic analyses can be performed either by analysing gene trees separately and combining them with coalescent or supertree methods, or by using the supermatrix approach. In the latter case, all gene partitions are concatenated into a single dataset before a phylogenetic analysis is conducted. Even though massive amounts of data help to reduce sampling error, several sources of error may bias phylogenomic studies. Especially problematic is systematic error, which arises when the assumptions of the substitution model are violated, for example through compositional heterogeneity, among-lineage rate variation and heterotachy. Several methods to detect and deal with these systematic errors have been developed, and more are in development. Furthermore, large-scale phylogenomic studies sometimes contain large amounts of missing data, which, as studies of real and simulated data show, are generally less problematic. Taxon sampling is another critical issue for phylogenomics, as sparsely sampled analyses might be affected by long-branch attraction artefacts. Genes and taxa should be selected carefully: highly saturated genes and phylogenetically unstable (rogue) taxa should be avoided. Several methods are available to estimate and visualize the information content of genes, as well as the phylogenetic stability of the taxa selected for the analysis. Finally, discordance between gene trees and species trees is not rare, and potential causes are incomplete lineage sorting, hybridization and horizontal gene transfer. Coalescent-based methods for species tree inference, based on separate or binned gene tree analyses, are able to deal with incomplete lineage sorting, whereas network analyses can be used to visualize conflict between gene trees in general.
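As a rough illustration of the supermatrix approach described in the abstract, the following Python sketch concatenates per-gene alignments into a single dataset and pads absent taxa with a missing-data symbol. The function names (`concatenate`, `missing_fraction`), the input layout (a dict of gene name to a dict of taxon to aligned sequence) and the toy sequences are assumptions made for this example; they are not taken from the chapter or from any particular phylogenetics package.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into a supermatrix.

    gene_alignments: dict mapping gene name -> dict of taxon -> aligned sequence
    (hypothetical input layout chosen for this sketch).
    Returns (supermatrix, partitions), where partitions lists the
    1-based, inclusive column range of each gene.
    """
    # Every taxon that occurs in at least one gene alignment.
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        length = lengths.pop()
        for t in taxa:
            # A taxon lacking this gene is padded with missing-data symbols.
            pieces[t].append(aln.get(t, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {t: "".join(parts) for t, parts in pieces.items()}
    return supermatrix, partitions


def missing_fraction(supermatrix, missing_chars="?-"):
    """Fraction of cells in the supermatrix that are missing or gaps."""
    total = sum(len(seq) for seq in supermatrix.values())
    absent = sum(seq.count(c) for seq in supermatrix.values() for c in missing_chars)
    return absent / total if total else 0.0


# Toy data (invented for illustration): taxon2 lacks geneB.
genes = {
    "geneA": {"taxon1": "ACGT", "taxon2": "ACGA", "taxon3": "ACTT"},
    "geneB": {"taxon1": "GGC", "taxon3": "GAC"},
}
matrix, parts = concatenate(genes)
```

The partition list keeps track of gene boundaries so that a partitioned analysis can still assign a separate substitution model to each gene, and `missing_fraction` gives a quick view of how much of the concatenated dataset is missing data.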
