Abstract
Multiple sequence alignment (MSA) is a well-known problem in bioinformatics whose main goal is the identification of evolutionary, structural, or functional similarities in a set of three or more related genes or proteins. We present a parallel approach for the global alignment of multiple protein sequences that combines dynamic programming, heuristics, and parallel programming techniques in an iterative process. In the proposed algorithm, the longest common subsequence technique is used to generate a first MSA by aligning identical residues. An iterative process then improves the MSA by applying a number of operators, defined in the present work, in order to produce more accurate alignments. The accuracy of the alignment was evaluated through the application of optimization functions. A number of processes work independently and at the same time, searching for the best MSA of a set of sequences: one process acts as a coordinator, whereas the rest are considered slave processes. The resulting algorithm was called PaMSA, which stands for Parallel MSA. The MSA accuracy and response time of PaMSA were compared against those of Clustal W, T-Coffee, MUSCLE, and Parallel T-Coffee on 40 datasets of protein sequences. When run as a sequential application, PaMSA turned out to be the second fastest when compared against the nonparallel MSA methods tested (Clustal W, T-Coffee, and MUSCLE). However, PaMSA was designed to be executed in parallel. When run as a parallel application, PaMSA presented better response times than Parallel T-Coffee under the conditions tested. Furthermore, the sum-of-pairs scores achieved by PaMSA when aligning groups of sequences with an identity percentage from approximately 70% to 100% were the highest in all cases. PaMSA was implemented on a cluster platform in C++ using the standard Message Passing Interface (MPI) library.
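To make the first step concrete, the following is a minimal sketch of how a longest common subsequence can seed an alignment by placing identical residues in the same column. It is not the PaMSA implementation: it handles only two sequences, and the function name `align_by_lcs` and the choice to put every non-matching residue opposite a gap are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch (hypothetical helper, not taken from PaMSA):
// aligns two sequences so that the residues of one longest common
// subsequence share a column; every other residue faces a gap ('-').
std::pair<std::string, std::string> align_by_lcs(const std::string& a,
                                                 const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    // len[i][j] = length of an LCS of the suffixes a[i..] and b[j..].
    std::vector<std::vector<int>> len(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t j = m; j-- > 0;) {
            len[i][j] = (a[i] == b[j])
                            ? len[i + 1][j + 1] + 1
                            : std::max(len[i + 1][j], len[i][j + 1]);
        }
    }
    // Walk the table from the start, emitting one alignment column per step.
    std::string row_a, row_b;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {                          // identical residues: same column
            row_a.push_back(a[i++]);
            row_b.push_back(b[j++]);
        } else if (len[i + 1][j] >= len[i][j + 1]) { // skip a residue of a
            row_a.push_back(a[i++]);
            row_b.push_back('-');
        } else {                                     // skip a residue of b
            row_a.push_back('-');
            row_b.push_back(b[j++]);
        }
    }
    // Pad whatever is left of either sequence with gaps in the other row.
    while (i < n) { row_a.push_back(a[i++]); row_b.push_back('-'); }
    while (j < m) { row_a.push_back('-'); row_b.push_back(b[j++]); }
    assert(row_a.size() == row_b.size());
    return {row_a, row_b};
}
```

For example, `align_by_lcs("MKLV", "MLV")` yields the rows `MKLV` and `M-LV`. An iterative refinement stage such as the one described above would then start from a seed of this kind.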
Highlights
A fundamental research subarea of bioinformatics is biological sequence alignment and analysis, which focuses on developing algorithms and tools for comparing and finding similarities in nucleic acid (DNA and RNA) and amino acid sequences [1]
The main contribution of the present work is the development of PaMSA (Parallel Multiple Sequence Alignment), a parallel algorithm for the global alignment of multiple protein sequences
We present results obtained from alignments using PaMSA, as well as comparisons made against several methods commonly used for MSA, including MUSCLE and Clustal W
Summary
A fundamental research subarea of bioinformatics is biological sequence alignment and analysis, which focuses on developing algorithms and tools for comparing and finding similarities in nucleic acid (DNA and RNA) and amino acid (protein) sequences [1]. The sequence similarities found are used for identifying evolutionary, structural, or functional similarities among sequences in a set of related genes or proteins [2]. The set of sequences to be aligned is assumed to have an evolutionary relationship. Multiple sequence alignment (MSA) can be defined as the problem of comparing and finding which parts of the sequences are similar and which parts are different in a set of three or more biological sequences. The resulting alignment can be used to infer sequence homology. Homologous sequences are sequences that share a common ancestor and usually share common functions.
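One common way to quantify how similar the aligned sequences are, column by column, is a sum-of-pairs score, the measure the abstract reports for PaMSA. The sketch below shows the general idea only. The scoring values in `pair_score` (+1 for identical residues, -1 for a mismatch, -2 for a residue against a gap, 0 for two gaps) are assumptions chosen for illustration; practical tools typically use a substitution matrix and their own gap penalties, and the optimization functions used in PaMSA are not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scoring scheme, chosen only for illustration.
int pair_score(char x, char y) {
    if (x == '-' && y == '-') return 0;   // gap against gap
    if (x == '-' || y == '-') return -2;  // residue against gap
    return x == y ? 1 : -1;               // identical vs. different residues
}

// Sum-of-pairs score of an alignment given as equal-length gapped rows:
// for every column, add the score of every unordered pair of rows.
int sum_of_pairs(const std::vector<std::string>& rows) {
    if (rows.empty()) return 0;
    const std::size_t cols = rows.front().size();
    for (const std::string& r : rows) {
        assert(r.size() == cols);  // all rows must have the same length
    }
    int total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t i = 0; i < rows.size(); ++i) {
            for (std::size_t j = i + 1; j < rows.size(); ++j) {
                total += pair_score(rows[i][c], rows[j][c]);
            }
        }
    }
    return total;
}
```

With these assumed values, the three-row alignment `MKLV`, `M-LV`, `MKLV` scores 6: each fully identical column contributes 3, and the column containing the gap contributes -3.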