Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies

Orion Penner, Peter Grassberger, Maya Paczuski

doi:10.1371/journal.pone.0014373

Abstract

Background: Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics. As a result one relies on crude measures, like the p- or log-det distances, or makes explicit, and often too simplistic, a priori assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI). MI is, in principle, an objective and model-independent similarity measure, but it is not widely used in this context, and no algorithm for extracting MI from a given alignment (without assuming an evolutionary model) is known. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, even though the normalized compression distance based on it has shown promising results.

Results: We describe a simple approach to get robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA, our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. We point out that, because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances.

Conclusions: Several versions of MI-based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.
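
The single-letter Shannon variant mentioned in the abstract can be illustrated with a plug-in MI estimate over the columns of a pairwise alignment. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `single_letter_mi` is hypothetical, gap characters are simply treated as a fifth symbol, and no finite-sample bias correction is applied.

```python
from collections import Counter
from math import log2


def single_letter_mi(aligned_a: str, aligned_b: str) -> float:
    """Plug-in estimate of mutual information, in bits per alignment column.

    Assumption: both strings come from the same pairwise alignment, so they
    have equal length, and '-' (gap) is counted like any other symbol.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aligned_a)
    if n == 0:
        return 0.0

    joint = Counter(zip(aligned_a, aligned_b))  # counts of column pairs (x, y)
    marg_a = Counter(aligned_a)                 # counts of x alone
    marg_b = Counter(aligned_b)                 # counts of y alone

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) = c * n / (count_a[x] * count_b[y])
        mi += (c / n) * log2(c * n / (marg_a[x] * marg_b[y]))
    return mi
```

Identical sequences give the single-letter entropy of the sequence, and statistically independent columns give a value near zero. For short alignments the plug-in estimate tends to be biased upward, so it is best read as a rough illustration of the idea rather than as a calibrated distance.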

Highlights

Sequence alignment achieves many purposes and comes in several different varieties [1]: local versus global, pairwise versus multiple, and DNA/RNA versus proteins.
Because DNA and amino acid sequences are well known to be hard to compress [18,19], one might expect that the compression-based MI estimate I_compr depends strongly on the compression algorithm used.
Note that an imperfect compression algorithm very likely underestimates rather than overestimates mutual information (MI), although we do not know a rigorous theorem to this effect.
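
The alignment-free estimate obtained by "concatenating and zipping" can be sketched as follows. This is an illustration under assumptions, not the procedure used in the paper: the compressor (`lzma` from the Python standard library) is an arbitrary choice, and the names `csize`, `compression_mi`, `ncd`, and `random_dna` are hypothetical. The MI estimate is C(A) + C(B) - C(AB), where C is compressed size in bytes, and the normalized compression distance is (C(AB) - min(C(A), C(B))) / max(C(A), C(B)).

```python
import lzma
import random


def csize(s: str) -> int:
    """Compressed size in bytes; includes the container's fixed overhead."""
    return len(lzma.compress(s.encode("ascii")))


def compression_mi(a: str, b: str) -> int:
    """Compression-based MI estimate: C(A) + C(B) - C(AB), in bytes."""
    return csize(a) + csize(b) - csize(a + b)


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two sequences."""
    ca, cb, cab = csize(a), csize(b), csize(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def random_dna(n: int, seed: int) -> str:
    """Hypothetical helper: a reproducible random DNA string for the demo."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=n))


a = random_dna(5000, seed=1)
b = random_dna(5000, seed=2)  # generated independently of a
print(compression_mi(a, a), compression_mi(a, b))
print(ncd(a, a), ncd(a, b))
```

Swapping `lzma` for another compressor changes the numbers, which is one way to probe the dependence on the compression algorithm that the highlights raise.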

Summary

Introduction

Sequence alignment achieves many purposes and comes in several different varieties [1]: local versus global (and even “glocal” [2]), pairwise versus multiple, and DNA/RNA versus proteins. In a typical scoring scheme, each position at which the two sequences agree is rewarded by a positive score, while each disagreement (“mutation”) and each insertion of a blank (“gap”) is punished by a negative one. In local alignment, one aligns only subsequences against each other and looks for the highest scores between any pairs of subsequences. Existing algorithms use either heuristic scoring schemes based on biological expertise or scores derived from explicit probabilistic models [6]; the heuristic scores cannot be used as objective distance metrics. As a result one relies on crude measures, like the p- or log-det distances, or makes explicit, and often too simplistic, a priori assumptions about sequence evolution. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, even though the normalized compression distance based on it has shown promising results.
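
The scoring scheme described above (a reward for each match, penalties for each mismatch and gap) can be made concrete with a score-only version of the standard Needleman-Wunsch dynamic program for global alignment. This is a minimal sketch: the function name and the parameter values `match=1`, `mismatch=-1`, `gap=-2` are illustrative assumptions, not values taken from the paper, and a linear gap penalty is assumed.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty.

    Keeps only two rows of the dynamic-programming table, so it returns the
    score but not the alignment itself.
    """
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Because the score depends on the chosen reward and penalties, two different parameter sets can rank the same pair of sequences differently, which illustrates why such scores are not objective distances.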

Journal: PLoS ONE
Publication Date: Jan 4, 2011
Citations: 48
License type: CC BY 4.0
