Abstract

Imposing constraints on multiple sequence alignment (MSA) algorithms can often produce more biologically meaningful alignments. This paper presents CSA-X, a modular program for constrained multiple sequence alignment (CMSA) that accepts constraints in the form of regular expressions. CSA-X is modular in that it can use arbitrary underlying MSA programs to generate alignments. The accuracy of CSA-X is compared across different underlying MSA algorithms and against RE-MuSiC, another CMSA program that also uses regular expressions as constraints. A technique is also developed to test the accuracy of CMSA algorithms with regular expression constraints using the BAliBASE 3.0 benchmark database. For verification, ProbCons and T-Coffee are used as the underlying MSA programs in CSA-X, and the accuracy of the alignments is measured in terms of Q score and TC score. Based on the results presented herein, CSA-X significantly outperforms RE-MuSiC. On average, CSA-X used with constraints that were algorithmically created from the least conserved regions of the correct alignments achieves results that are 17.65% higher for Q score and 23.7% higher for TC score compared to RE-MuSiC. Further, CSA-X with ProbCons (CSA-PC) achieves a higher score in over 97.9% of the cases for Q score and over 96.4% of the cases for TC score. The results also show that well-chosen regular expression constraints, created from accurate knowledge of a less conserved region, can improve alignment accuracy. Statistical significance is measured using the Wilcoxon rank-sum test and the Wilcoxon signed-rank test. An open source implementation of CSA-X is also provided.
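For readers unfamiliar with the two accuracy measures named above, the following is a minimal sketch of how Q score (the fraction of reference residue pairs that the test alignment also aligns) and TC score (the fraction of reference columns reproduced exactly) are commonly computed. It is not taken from CSA-X or from the BAliBASE scoring tools. The function names, the use of `-` as the gap symbol, and the choice to count every reference column, including columns that contain gaps, are assumptions made for illustration.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def column_tuples(alignment):
    """Turn each column into a tuple of residue indices (None for a gap)."""
    counters = [0] * len(alignment)  # next residue index per sequence
    columns = []
    for col in zip(*alignment):
        entry = []
        for row, symbol in enumerate(col):
            if symbol == GAP:
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for col in columns:
        residues = [(row, idx) for row, idx in enumerate(col) if idx is not None]
        pairs.update(combinations(residues, 2))
    return pairs


def q_score(test, reference):
    """Fraction of reference residue pairs also aligned in the test alignment."""
    ref_pairs = aligned_pairs(column_tuples(reference))
    test_pairs = aligned_pairs(column_tuples(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def tc_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test alignment.

    Assumption: all reference columns are counted, gapped or not.
    """
    ref_cols = column_tuples(reference)
    test_cols = set(column_tuples(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols) if ref_cols else 0.0


if __name__ == "__main__":
    reference = ["AC-GT", "ACTGT", "A--GT"]  # toy data, not from BAliBASE
    test = ["ACG-T", "ACTGT", "A-G-T"]
    print(q_score(test, reference), tc_score(test, reference))
```

Both functions take alignments as lists of equal-length strings over the same sequences in the same order. Q score gives partial credit for a column that is only partly correct, while TC score does not, which is why the two measures can diverge on the same alignment.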
