Pairwise sequence alignment—it's all about us!

Lisa Mullan

doi:10.1093/bib/bbk008

Abstract

Pairwise alignment is one of the most fundamental tools of bioinformatics and underpins a variety of other, more sophisticated methods of annotation. In its most rigorous form, pairwise alignment uses a method called ‘dynamic programming’, which is highly accurate but also computationally expensive. In order to align anything other than an exact alphabetic match, the algorithm has to know what it is looking for and how to evaluate the worth of what it finds. To this end, ‘comparison matrices’ have been created which define a score for every possible pairing of bases or residues, in effect a running tally of how well the computational alignment is doing. The software searches for the highest score available. The final score is meaningful only in relation to its resulting alignment and cannot be used outside this context. In the case of DNA, comparison values are generated using a simple identity matrix, which assigns one (positive) score to a correct match within the alignment and a different (zero or negative) score to a mismatch. A fuller comparison matrix that allows for ambiguities alters these basic values as potential transitions and transversions are taken into account, but essentially there is very little mathematical difference that can be achieved between one alignment and another similar one. Protein matrices, on the other hand, offer a greater breadth of calculation: not only are there five times as many common amino acid residues as there are DNA bases, but the matrices also incorporate a significant amount of evolutionary information. The most common matrices here are the point accepted mutation (PAM) [1, 2] and BLOSUM [3] comparison tables. The first PAM matrix was created in the late 1970s and relied on noting accepted residue substitutions within protein sequences to produce the PAM 1 table. Subsequent tables in the PAM family have been created by multiplication models based on that first matrix.
The greater the multiplication, the higher the number in the PAM series, the greater the number of accepted mutations involved in the proteins used to create the tables, and thus the greater the evolutionary divergence of those proteins. One of the more common matrices, PAM 250, represents a subset of proteins of approximately 80% divergence. The BLOSUM matrices were created in the early 1990s and relied on the presence of residues within blocks of conserved regions of related proteins to create the matrix. These blocks can be accessed in the BLOCKS [4, 5] database. The BLOSUM matrices also form a family whose members are distinguished by number. These numbers, however, represent the minimum percentage identity of the BLOCKS used to create the matrix. The most common matrix of this set is BLOSUM 62, a default setting for many protein alignment applications; the number indicates that BLOCKS of at least 62% identity were used in its creation. In the case of the BLOSUM matrices, therefore, the higher the number, the smaller the evolutionary divergence between sequences. Once the comparison matrix has been established, the computer can make its own matrix based on the two sequences to be aligned, inserting a ‘score’ for each potential base or residue alignment. However, to allow the computer to score each comparison and select the one match, or run of matches, that gives the highest score would not necessarily yield the best alignment as it ignores biological insertions and
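
The DNA identity matrix and the per-pair score grid described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the article's own implementation: the function names, the scores (+1 for a match, 0 for a mismatch) and the two example sequences are all assumptions chosen for clarity, and no gap handling is attempted.

```python
# Illustrative sketch only. The scores (+1 match, 0 mismatch), the function
# names and the example sequences are assumptions, not values from the article.

DNA_BASES = "ACGT"


def identity_matrix(alphabet, match=1, mismatch=0):
    """Build a comparison matrix with one score for a match and another for a mismatch."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet} for a in alphabet}


def score_grid(seq_a, seq_b, matrix):
    """Fill a len(seq_a) x len(seq_b) grid with the comparison score of every base pair."""
    return [[matrix[a][b] for b in seq_b] for a in seq_a]


def ungapped_score(seq_a, seq_b, matrix):
    """Sum the scores of aligned positions for two equal-length sequences (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    matrix = identity_matrix(DNA_BASES)
    # One row per base of the first sequence, one column per base of the second.
    for row in score_grid("GATT", "GCAT", matrix):
        print(row)
    print(ungapped_score("GATT", "GCAT", matrix))
```

With these assumed scores, aligning GATT against GCAT position by position gives a total of 2 (the first and last bases match). The grid holds the raw material a dynamic programming method would work from; scoring only ungapped positions, as here, is the limited approach the abstract warns about.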
