Abstract

Pairwise evolutionary distances are a model-based summary statistic for a set of molecular sequences. They represent the leaf-to-leaf path lengths of the underlying phylogenetic tree. Estimates of pairwise distances with overlapping paths covary because of shared mutation events. It is desirable to take this covariance structure into account to increase precision in any process that compares or combines distances. This paper introduces a fast estimator for the covariance of two pairwise maximum likelihood distances estimated under general Markov models. The estimator is based on a conjecture (going back to Nei & Jin, 1989) that links the covariance to path lengths. The conjecture is proven here under a simple symmetric substitution model. A simulation shows that the estimator outperforms previously published ones in terms of mean squared error.

Highlights

  • Evaluation of basic components of the branch covariance: the validity of the conjecture in Eq (26), which was derived under the Nr model, and the accuracy of the maximum likelihood (ML) variance (Eq (5)) were tested in a simulation under the GCB model with long sequences (10,000 amino acids)

  • A plot of the Monte Carlo variance of the ML estimate of a pairwise distance δm against the Monte Carlo covariance between the ML estimates of two pairwise distances that share a path of length δm (the dependence case) corroborates the conjecture and suggests that the result is valid in general (Fig 2A)

Introduction

Phylogenetic trees are one of the most important representations of the evolutionary relationship between homologous genomic sequences. The relatedness of such sequences can be summarized by a set of pairwise evolutionary distances representing the leaf-to-leaf path lengths of the underlying tree. Such distances are usually estimated by maximum likelihood (ML) assuming a Markovian model of character substitution (Yang, 2006). A consistent hypothesis of character homology is provided by multiple sequence alignments (MSAs). Alternatively, the sequences can be aligned pairwise, for instance by dynamic programming, to obtain optimal pairwise alignments (OPAs) in time quadratic in the length of the input sequences (Needleman & Wunsch, 1970).
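
The dynamic-programming step behind an optimal pairwise alignment can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name needleman_wunsch and the match, mismatch, and linear gap scores are assumptions chosen for readability, whereas practical protein alignments typically use a substitution matrix and affine gap penalties. The nested loops over both sequences make the quadratic running time visible.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). The scoring parameters are
    illustrative assumptions, not values taken from the paper.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: O(n * m) time and memory.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling needleman_wunsch on two sequences returns the alignment score together with the two gapped strings; the aligned columns are what a pairwise ML distance estimate would then be computed from.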
