Algorithms for Sequence Alignment

David Powell

doi:10.4225/03/59c9f6db2a2d2

Abstract

Sequence alignment is an important tool for describing relationships between sequences. Many sequence alignment algorithms exist, differing in efficiency, and in their models of the sequences and of the relationship between sequences. The focus of this thesis is on algorithms for the optimal alignment of two or three sequences of biological data, particularly DNA sequences. The algorithms are discussed with particular emphasis on space and time complexity. A divide-and-conquer method is presented for use with a number of different alignment algorithms. This method may be used to reduce the space complexity of an alignment algorithm with little or no effect on the time complexity. The advantages of this divide-and-conquer method include its simplicity and the ease with which it can be applied to many different alignment algorithms. These advantages are demonstrated by using the divide-and-conquer method in conjunction with several known alignment algorithms. An efficient alignment algorithm is presented for the important problem of optimally aligning three sequences using a linear function for costing gaps in the alignment. For sequences of length n, and a minimum edit cost of d, this new algorithm has a time complexity of O(d + n). The algorithm is further developed by using the aforementioned divide-and-conquer method to improve its space complexity. This combination results in a time- and space-efficient algorithm, while also illustrating the usefulness of the divide-and-conquer method. It is important when aligning sequences to correctly account for any non-randomness that is significant in the sequences. For example, if certain statistical patterns appear throughout sequences from a certain family, it is important to make use of this information when aligning sequences from this family. Common, unsurprising patterns provide less evidence for the relatedness of sequences than more surprising regions provide.
A new algorithm is presented to optimally align two non-random sequences. For a particular sequence model, this new algorithm apportions weight to every part of the alignment according to the importance of that part as determined by the sequence model. This algorithm is then developed further so that it can be used to infer whether two non-random sequences are related.
