Abstract

Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. Supplementary data are available at Bioinformatics online.

Highlights

  • The main way of analyzing nucleotide and protein sequences is by comparing them to related sequences

  • This study describes the equivalence between the partition function approach and pair hidden Markov models (PHMMs)

  • This study describes the many-to-one relationship between probability models and score parameters for sequence alignment

Summary

Introduction

The main way of analyzing nucleotide and protein sequences is by comparing them to related sequences. This is usually done by defining scores for aligned monomers, insertions, and deletions, and finding alignments with maximal total score. Alignment models typically omit rapid evolution of tandem repeats, neighbor dependence of substitutions, etc., but have proven useful. Another approach is to define alignment probabilities as exponentiated scores [27, 18]: prob(alignment) ∝ exp(alignment score/t), where t is a 'temperature' parameter. This study clarifies the notion of alignment models with balanced length probability, i.e. no bias towards either longer or shorter alignments. It concludes with a discussion of the best way to score alignments and assess their significance, depending on our precise aim. One previous study describes a one-to-many relationship between alignment scores and probabilities [1], but lacks most of the results presented here.
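To make the relation prob(alignment) ∝ exp(alignment score/t) concrete, here is a minimal Python sketch. It assumes a small, fully enumerated set of candidate alignments with made-up scores; the function name `alignment_probabilities` and the values in `toy_scores` are hypothetical and do not come from the study. Real alignment software would sum over all possible alignments with dynamic programming rather than over an explicit list.

```python
import math


def alignment_probabilities(scores, t):
    """Turn alignment scores into probabilities proportional to exp(score / t).

    `scores` is a list of total scores, one per candidate alignment
    (an illustrative assumption: the candidates are enumerated explicitly).
    `t` is the 'temperature' parameter and must be positive.
    """
    if t <= 0:
        raise ValueError("temperature t must be positive")
    scaled = [s / t for s in scores]
    # Subtract the maximum before exponentiating to avoid overflow;
    # this cancels out when dividing by the partition function.
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    partition = sum(weights)  # partition function, up to the constant factor
    return [w / partition for w in weights]


# Hypothetical scores for three candidate alignments of the same sequence pair.
toy_scores = [12.0, 10.0, 7.0]

probs_low_t = alignment_probabilities(toy_scores, t=1.0)
probs_high_t = alignment_probabilities(toy_scores, t=10.0)

print(probs_low_t)
print(probs_high_t)
```

In this toy setting, a lower temperature concentrates probability on the top-scoring alignment, while a higher temperature spreads it more evenly across the candidates.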

Review of score-based alignment
Degrees of freedom
Algorithms for local alignment
Review of alignment probability models
Degrees of freedom in the gapless model
Homogeneous letter probabilities
Uniform length probability
Examples
Linear gap costs
Balanced length probability
Affine gap costs
Limits to degrees of freedom
Non-uniqueness of t
Sum of alignment probabilities
Discussion
Useful probability calculations
Sequences with multiple similar segments
Alignment significance
Aims of sequence comparison