Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets

Béatrice Roure, Hervé Philippe, Denis Baurain

doi:10.1093/molbev/mss208

Open Access

https://doi.org/10.1093/molbev/mss208

Abstract

Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, some with 80% or more missing (or ambiguous) data. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of the data set of Lemmon et al. with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (~50) of highly covered genes. Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to avoid artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
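The progressive introduction of missing data into a complete supermatrix can be illustrated with a small sketch. This is not the authors' protocol: the function names, the `?` missing-data symbol, the toy 4-taxon alignment, and the choice to blank whole (taxon, gene) blocks uniformly at random are all assumptions made for illustration.

```python
import random

MISSING = "?"  # assumed symbol for missing or ambiguous states


def missing_fraction(matrix):
    """Return the fraction of cells in the supermatrix that are missing."""
    total = sum(len(seq) for seq in matrix.values())
    missing = sum(seq.count(MISSING) for seq in matrix.values())
    return missing / total


def mask_gene_blocks(matrix, gene_bounds, target, seed=0):
    """Blank whole (taxon, gene) blocks in random order until at least
    `target` of all cells are missing. The input matrix is not modified."""
    rng = random.Random(seed)
    masked = {taxon: list(seq) for taxon, seq in matrix.items()}
    blocks = [(taxon, start, end)
              for taxon in masked
              for start, end in gene_bounds]
    rng.shuffle(blocks)

    total = sum(len(seq) for seq in masked.values())
    n_missing = sum(seq.count(MISSING) for seq in masked.values())
    for taxon, start, end in blocks:
        if n_missing / total >= target:
            break
        seq = masked[taxon]
        # Count only cells that were not already missing.
        n_missing += sum(1 for state in seq[start:end] if state != MISSING)
        seq[start:end] = MISSING * (end - start)
    return {taxon: "".join(seq) for taxon, seq in masked.items()}


# Hypothetical toy supermatrix: 4 taxa x 3 genes of 4 sites each.
toy = {
    "taxonA": "ACDEFGHIKLMN",
    "taxonB": "ACDEFGHIKLMN",
    "taxonC": "ACDEYGHIKLMN",
    "taxonD": "ACDEFGHVKLMN",
}
genes = [(0, 4), (4, 8), (8, 12)]  # half-open column ranges, one per gene

for target in (0.25, 0.5, 0.75):
    sparse = mask_gene_blocks(toy, genes, target)
    print(target, missing_fraction(sparse))
```

Masking whole gene blocks mimics the patchiness of real supermatrices, where a gene is typically absent for a species as a unit. The sketch does not prevent a taxon from losing every gene, which a real experimental design would likely need to rule out.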


Journal: Molecular Biology and Evolution
Publication Date: Aug 28, 2012
Citations: 325

Similar Papers

What is missing from my missing data plan?
Sharon D Yeatts ... Renée H Martin
Stroke | VOL. 46
07 May 2015

Author response: Comprehensive phylogenetic analysis of the ribonucleotide reductase family reveals an ancestral clade
Audrey A Burnim ... Nozomi Ando
11 Aug 2022

Bayesian and maximum likelihood phylogenetic analyses of protein sequence data under relative branch-length differences and model violation
Jessica C Mar ... Mark A Ragan
BMC Evolutionary Biology | VOL. 5
01 Jan 2004

Use of multiple imputation for missing data: a simulation using epidemiological data
Luciana Neves Nunes ... Jandyra Maria Guimarães Fachel
Cadernos de Saúde Pública | VOL. 25
01 Feb 2009
