Abstract

Background: The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. In spite of the considerable research effort recently devoted to improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences is still a challenging problem.

Results: We propose a novel and efficient algorithm, called MSAIndelFR, for multiple sequence alignment. It uses information on the predicted locations of IndelFRs (indel flanking regions) and the computed average log-loss values obtained from IndelFR predictors, each of which is designed for a different protein fold. We demonstrate that introducing into the proposed algorithm a new variable gap penalty function, based on the predicted locations of the IndelFRs and the computed average log-loss values, substantially improves the protein alignment accuracy. This is illustrated by evaluating the performance of the algorithm in aligning sequences belonging to the protein folds for which IndelFR predictors already exist, and by using the reference alignments of four popular benchmarks: BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABRE (SABmark 1.65).

Conclusions: We have proposed a novel and efficient algorithm, MSAIndelFR, for multiple protein sequence alignment incorporating a new variable gap penalty function. The performance of the proposed algorithm is shown to be superior to that of the most widely used alignment algorithms (Clustal W2, Clustal Omega, Kalign2, MSAProbs, MAFFT, MUSCLE, ProbCons and Probalign) in terms of both the sum-of-pairs and total column metrics.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0826-3) contains supplementary material, which is available to authorized users.
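The sum-of-pairs and total column metrics named above can be illustrated with a minimal sketch. It assumes the common definitions: sum-of-pairs is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and total column is the fraction of reference columns reproduced exactly. Benchmark scoring tools may restrict scoring to core blocks or treat gapped columns differently, so this is not the evaluation code used in the article. The function names `sp_score` and `tc_score` and the toy alignments are hypothetical.

```python
from itertools import combinations


def residue_columns(alignment):
    """For each row, list the residue index at every column (None for a gap)."""
    rows = []
    for seq in alignment:
        idx, row = 0, []
        for ch in seq:
            if ch == "-":
                row.append(None)
            else:
                row.append(idx)
                idx += 1
        rows.append(row)
    return rows


def aligned_pairs(alignment):
    """Set of (seq_i, residue_i, seq_j, residue_j) pairs sharing a column."""
    rows = residue_columns(alignment)
    pairs = set()
    for c in range(len(alignment[0])):
        for i, j in combinations(range(len(rows)), 2):
            a, b = rows[i][c], rows[j][c]
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def alignment_columns(alignment):
    """Set of columns, each a tuple of per-sequence residue indices."""
    rows = residue_columns(alignment)
    return {tuple(row[c] for row in rows) for c in range(len(alignment[0]))}


def sp_score(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref) if ref else 0.0


def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly by the test alignment."""
    ref = alignment_columns(reference)
    return len(ref & alignment_columns(test)) / len(ref) if ref else 0.0


# Toy example (made-up sequences, not benchmark data).
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
sp = sp_score(test, reference)
tc = tc_score(test, reference)
```

Both scores lie between 0 and 1, and an alignment compared against itself scores 1 on both.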

Highlights

  • The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics

  • Clustal Omega is the latest multiple sequence alignment (MSA) algorithm in the Clustal family, and the main improvements of Clustal Omega over Clustal W2 are as follows: (i) it can align any number of protein sequences, (ii) it allows the use of a profile hidden Markov model, derived from an alignment of protein sequences related to the input sequences, and (iii) it allows the user to choose the number of iterations, in the absence of which it is a progressive algorithm by default

  • For MAFFT, the auto option is used with the maximum number of iterative refinement cycles set to 1000, while the default options are used for all the other algorithms, including the proposed MSAIndelFR


Summary

Introduction

The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. Clustal Omega is the latest MSA algorithm in the Clustal family, and its main improvements over Clustal W2 are as follows: (i) it can align any number of protein sequences, (ii) it allows the use of a profile hidden Markov model, derived from an alignment of protein sequences related to the input sequences, and (iii) it allows the user to choose the number of iterations, in the absence of which it is a progressive algorithm by default. In Kalign, the pairwise distances between all pairs of sequences are estimated using the Muth-Manber string matching algorithm [18], and the guide tree is constructed using UPGMA. The alignment algorithms MAFFT, MUSCLE, ProbCons and Probalign are not fully progressive. In these algorithms, iterative refinement is performed to improve the alignment, and the guide tree is constructed using UPGMA for the iteration.
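As an illustration of the guide-tree step mentioned above, here is a minimal sketch of UPGMA clustering from a precomputed pairwise distance matrix. It is a textbook version and not the implementation used by Kalign, MAFFT, MUSCLE or any other tool named here. The function name `upgma`, the nested-tuple tree representation, and the toy distances are assumptions made for the example.

```python
def upgma(names, dist):
    """Build a guide tree by UPGMA.

    names: list of sequence labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns the tree as nested tuples of labels.
    """
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    # Distances are keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)

    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new = next_id
        next_id += 1
        # Distance to the merged cluster is the size-weighted average,
        # which equals the mean distance over all leaf pairs.
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, new)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del d[(a, b)]
        clusters[new] = ((tree_a, tree_b), size_a + size_b)

    return next(iter(clusters.values()))[0]


# Toy distance matrix (made-up values, not derived from real sequences).
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(labels, matrix)
```

In a progressive aligner, the resulting tree would determine the order in which sequences and profiles are merged; the alignment step itself is outside the scope of this sketch.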

