The construction and use of log-odds substitution scores for multiple sequence alignment.

Stephen F Altschul, Yi-Kuo Yu, John C Wootton, Elena Zaslavsky

doi:10.1371/journal.pcbi.1000852

Abstract

Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct “BILD” (“Bayesian Integral Log-odds”) substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.
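As a rough illustration of what a log-odds score for a whole alignment column looks like, the sketch below compares the probability of a column's letters under a Bayesian model of related letters with their probability under an independent background model. It is only a minimal sketch under stated assumptions: it uses a single Dirichlet prior (integrated out in closed form) rather than any particular prior from the paper, and the function name `bild_column_score`, the uniform DNA background, and the prior parameters are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def bild_column_score(column, background, alpha):
    """Log-odds score (in bits) for one ungapped alignment column.

    Assumptions for this sketch (not taken from the paper's text):
    - related letters are modeled with a single Dirichlet prior `alpha`,
      so the marginal column probability has a closed form;
    - unrelated letters are drawn independently from `background`.
    """
    counts = Counter(column)
    n = len(column)
    alpha_sum = sum(alpha.values())

    # Log marginal probability of the ordered letters, with the unknown
    # letter frequencies integrated out against the Dirichlet prior.
    log_q = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + n)
    for letter, c in counts.items():
        log_q += math.lgamma(alpha[letter] + c) - math.lgamma(alpha[letter])

    # Log probability of the same letters under the background model.
    log_p = sum(math.log(background[letter]) for letter in column)

    return (log_q - log_p) / math.log(2)


# Hypothetical DNA example: uniform background, prior centered on it.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
alpha = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

conserved = bild_column_score("AAAA", background, alpha)
diverse = bild_column_score("ACGT", background, alpha)
single = bild_column_score("A", background, alpha)
```

With these illustrative parameters a fully conserved column scores positive and a column of four different letters scores negative, while a single letter scores zero because the prior mean equals the background. A sign change of this kind is what would allow such scores to inform decisions like how wide a block should be or whether a sequence belongs in an alignment.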

Highlights

Protein and DNA sequence alignment is a fundamental tool of computational molecular biology.
We consider an alternative approach that allows log-odds column scores to be derived from any pairwise substitution matrix. Given their form, multiple alignment log-odds scores can be used directly to define the proper extent of multiple alignment blocks, and to derive natural scores for profile-profile comparison. We show that they arise from the perspective of the Minimum Description Length Principle [33], which allows them to be combined naturally with other information-theoretic measures.
As we describe in Text S5, with Tables S1 and S2, BILD scores achieve success on two fronts.

Summary

Introduction

Protein and DNA sequence alignment is a fundamental tool of computational molecular biology. It is used for functional prediction, genome annotation, the discovery of functional elements and motifs, homology-based structure prediction and modeling, phylogenetic reconstruction, and numerous other applications. Most useful local pairwise alignment algorithms allow gaps and explicitly assign them scores [1,2,3,4]. Many local multiple alignment algorithms, by contrast, do not allow gaps, or allow them only implicitly as spacers between distinct ungapped alignment blocks. The alignments recorded in some protein family databases are explicitly constructed from ungapped alignment blocks separated by variable-length spacers [5], and it has been argued that this formalism corresponds well to the observed relationships imposed by protein structure [6]. Short ungapped blocks are also used in the DNA context, for example to represent transcription factor binding sites.

Journal: PLoS Computational Biology
Publication Date: Jul 15, 2010
Citations: 161
License type: CC0 1.0
