Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores

Eric Maréchal,Olivier Bastien

doi:10.1186/1471-2105-9-332

Abstract

BackgroundConfidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Z-values allow the deduction of an upper bound of the P-value (1/Z-value2) following the TULIP theorem. Simulations of Z-value distribution is known to fit with a Gumbel law. This remarkable property was not demonstrated and had no obvious biological support.ResultsWe built a model of evolution of sequences based on aging, as meant in Reliability Theory, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage (i.e., mutual information in Information Theory) is a decreasing function of time. This quantity is simply measured by a sequence alignment score. In systems aging, the failure rate is related to the systems longevity. The system can be a machine with structured components, or a living entity or population. "Reliability" refers to the ability to operate properly according to a standard. Here, the "reliability" of a sequence refers to the ability to conserve a sufficient functional level at the folded and maturated protein level (positive selection pressure). Homologous sequences were considered as systems 1) having a high redundancy of information reflected by the magnitude of their alignment scores, 2) which components are the amino acids that can independently be damaged by random DNA mutations. From these assumptions, we deduced that information shared at each amino acid position evolved with a constant rate, corresponding to the information hazard rate, and that pairwise sequence alignment scores should follow a Gumbel distribution, which parameters could find some theoretical rationale. In particular, one parameter corresponds to the information hazard rate.ConclusionExtreme value distribution of alignment scores, assessed from high scoring segments pairs following the Karlin-Altschul model, can also be deduced from the Reliability Theory applied to molecular sequences. It reflects the redundancy of information between homologous sequences, under functional conservative pressure. This model also provides a link between concepts of biological sequence analysis and of systems biology.

Highlights

Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data
A score between two amino acids i and j can be interpreted as the mutual information between these two residues
Applications and Conclusion We built a model of evolution of the information shared between an initial molecular sequence and the sequences of its lineage

Summary

Introduction

Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. The Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Simulations of Z-value distribution is known to fit with a Gumbel law This remarkable property was not demonstrated and had no obvious biological support. Popular sequence alignment methods, such as Blast [3] or Smith-Waterman [4] algorithms are used as a starting point for homology searches. All these methods computes a score s(a, b) between two sequences a and b.

Methods

Results

Conclusion