Abstract

Background: Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated.

Results: We prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data.

Conclusions: Our result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.

Highlights

  • Most statistical methods for phylogenetic estimation in use today treat a gap within the input sequence alignment as missing data

  • Simulation studies [3][6][7] have shown that highly accurate trees can be estimated from sequences, especially when phylogenies are estimated using statistical methods based on explicit models of sequence evolution, such as the General Time Reversible (GTR) model [8]

  • We focus our discussion on the impact of treating gaps as missing data in phylogenetic analyses based upon maximum likelihood (ML)


Summary

Background

We know a great deal about phylogenetic estimation. We know, for example, that when sequences evolve with only substitutions (but no indels) under models such as the General Time Reversible (GTR) model, accurate estimation of trees (with high probability) is guaranteed, provided that appropriate methods (such as maximum likelihood) are used and the sequences are “long enough” [1][2][3][4][5]. If the mechanism generating the data has a high probability of producing aligned sequences that are monotypic for some parameter values, it will be difficult to reliably infer the underlying phylogenetic tree if the gaps are treated merely as missing data rather than as features of the data that are informative about the path that evolution has taken. For those models of evolution for which monotypic alignments have non-zero probability, ML, treating gaps as missing data, may not be statistically consistent.
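To make "treating gaps as missing data" concrete, the following is a minimal sketch of a per-site likelihood computed with Felsenstein's pruning algorithm. It is an illustration only, not the construction used in the paper's proof. It assumes the Jukes-Cantor (JC69) substitution model rather than GTR to keep the code short, and the tree, branch lengths, and function names (`jc69_prob`, `leaf_vector`, `partial`, `site_likelihood`) are hypothetical. A gapped leaf is given a conditional-likelihood vector of all ones, which is equivalent to summing the likelihood over every nucleotide that leaf could have held.

```python
import math

NUCS = "ACGT"

def jc69_prob(same, t):
    """JC69 transition probability over a branch of length t (assumed model, not GTR)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def leaf_vector(state):
    """Conditional likelihoods at a leaf; a gap ('-') is missing data, so every state gets 1."""
    if state == "-":
        return [1.0] * 4
    return [1.0 if n == state else 0.0 for n in NUCS]

def partial(node, column):
    """Pruning step. A node is a leaf name (str) or a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        return leaf_vector(column[node])
    out = [1.0] * 4
    for child, t in node:
        child_vec = partial(child, column)
        for i in range(4):
            out[i] *= sum(jc69_prob(i == j, t) * child_vec[j] for j in range(4))
    return out

def site_likelihood(tree, column):
    """Likelihood of one alignment column, with uniform base frequencies at the root."""
    return sum(0.25 * x for x in partial(tree, column))

# Hypothetical rooted tree ((a,b),(c,d)) with made-up branch lengths.
tree = [([("a", 0.1), ("b", 0.2)], 0.05), ([("c", 0.3), ("d", 0.1)], 0.05)]

# A column with a gap at leaf c ...
with_gap = site_likelihood(tree, {"a": "A", "b": "A", "c": "-", "d": "A"})
# ... has the same likelihood as summing over every nucleotide c could have had.
summed = sum(site_likelihood(tree, {"a": "A", "b": "A", "c": x, "d": "A"}) for x in NUCS)
# A column consisting only of gaps has likelihood 1 and so carries no signal about the tree.
all_gaps = site_likelihood(tree, {"a": "-", "b": "-", "c": "-", "d": "-"})
```

Under this treatment the pattern of gaps itself never enters the likelihood: only the observed nucleotides do. That is why, when the remaining nucleotides in each column are all identical, the likelihood has little to distinguish one tree topology from another.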

