CombAlign: a code for generating a one-to-many sequence alignment from a set of pairwise structure-based sequence alignments.

Carol L Ecale Zhou

doi:10.1186/s13029-015-0039-1

Abstract

Background: In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins. Few codes exist for constructing a one-to-many multiple sequence alignment derived from a set of structure or sequence alignments, and a need was evident for creating such a tool for combining pairwise structure alignments that would allow for insertion of gaps in the reference structure.

Results: This report describes a new Python code, CombAlign, which takes as input a set of pairwise sequence alignments (which may be structure based) and generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA). The use and utility of CombAlign was demonstrated by generating gapped MSSAs using sets of pairwise structure-based sequence alignments between structure models of the matrix protein (VP40) and pre-small/secreted glycoprotein (sGP) of Reston Ebolavirus and the corresponding proteins of several other filoviruses. The gapped MSSAs revealed structure-based residue-residue correspondences, which enabled identification of structurally similar versus differing regions in the Reston proteins compared to each of the other corresponding proteins.

Conclusions: CombAlign is a new Python code that generates a one-to-many, gapped, multiple structure- or sequence-based sequence alignment (MSSA) given a set of pairwise sequence alignments (which may be structure based). CombAlign has utility in assisting the user in distinguishing structurally conserved versus divergent regions on a reference protein structure relative to other closely related proteins. CombAlign was developed in Python 2.6, and the source code is available for download from the GitHub code repository.

Electronic supplementary material: The online version of this article (doi:10.1186/s13029-015-0039-1) contains supplementary material, which is available to authorized users.

Highlights

In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins
CombAlign takes as input a set of pairwise structure-based sequence alignments and generates a one-to-many, gapped, multiple structure-based sequence alignment (MSSA, see Methods) whereby the user can readily identify regions on the reference structure that have residue-residue correspondences with each of the other proteins against which the reference was structurally aligned
Although the intent in developing CombAlign was to construct multiple sequence alignments from structure data, the code is agnostic to the program used to generate the pairwise alignments that serve as input
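
The idea of a one-to-many alignment that allows gaps in the reference can be illustrated with a minimal Python sketch. This is not CombAlign's actual implementation. The function name combine_pairwise and the input layout (a list of (reference, other) pairs of equal-length gapped strings that all share the same ungapped reference sequence) are assumptions made only for illustration. Wherever any pairwise alignment places residues opposite a gap in the reference, the sketch opens a gap of the required width in the reference and pads the remaining sequences.

```python
def combine_pairwise(pairs, gap="-"):
    """Merge (reference, other) aligned string pairs into one gapped alignment.

    Hypothetical helper for illustration; returns a list of rows with the
    gapped reference first, followed by one row per input pair.
    """
    ref_seq = pairs[0][0].replace(gap, "")
    n = len(ref_seq)
    parsed = []
    for ref_aln, other_aln in pairs:
        if len(ref_aln) != len(other_aln):
            raise ValueError("aligned strings in a pair must have equal length")
        if ref_aln.replace(gap, "") != ref_seq:
            raise ValueError("all pairs must share the same reference sequence")
        inserts = [""] * (n + 1)  # residues of 'other' that fall before reference position k
        matched = [gap] * n       # character of 'other' aligned to reference position k
        k = 0
        for r, o in zip(ref_aln, other_aln):
            if r == gap:
                if o != gap:
                    inserts[k] += o
            else:
                matched[k] = o
                k += 1
        parsed.append((inserts, matched))

    # Width of the gap to open in the reference before each position
    # (index n holds insertions after the last reference residue).
    widths = [max(len(ins[k]) for ins, _ in parsed) for k in range(n + 1)]

    rows = [[] for _ in range(len(parsed) + 1)]
    for k in range(n + 1):
        rows[0].append(gap * widths[k])
        for row, (ins, _) in zip(rows[1:], parsed):
            row.append(ins[k].ljust(widths[k], gap))
        if k < n:
            rows[0].append(ref_seq[k])
            for row, (_, matched) in zip(rows[1:], parsed):
                row.append(matched[k])
    return ["".join(row) for row in rows]


example = [("AC-DE", "ACXDE"), ("ACDE", "A-DE"), ("A-CDE", "AYC-E")]
for line in combine_pairwise(example):
    print(line)
```

In this toy input the first and third pairs each insert one residue relative to the reference, so the combined reference row gains two gap columns while every reference residue keeps a single column shared by all rows.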

Summary

Introduction

In order to better define regions of similarity among related protein structures, it is useful to identify the residue-residue correspondences among proteins. Residue-residue correspondences can be readily extracted from pairwise structure-based alignments, yielding correspondences in space. These may differ from the correspondences obtained by aligning proteins at the sequence level, or even from those obtained using standard multiple structure-based alignment programs [1, 2], because such programs may adjust local alignments between any two proteins in order to refine a consensus or define an optimal simultaneous alignment for the set. Few codes exist for constructing a one-to-many structure-based sequence alignment derived from a set of pairwise structure-based sequence alignments, and no open-source code was found that generated an alignment allowing for gaps to be inserted into the reference sequence. The code described here, CombAlign, was applied to help identify structural features that distinguish two proteins of Reston Ebolavirus (a species that is not pathogenic to humans) from the corresponding proteins of several other closely related pathogenic filoviruses.
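
As a rough illustration of how residue-residue correspondences can be read off a single pairwise alignment, consider the following Python sketch. The function name residue_correspondences and the representation of the alignment as two equal-length gapped strings are assumptions made for this example, not part of CombAlign. Positions are 1-based indices into each ungapped sequence, which need not coincide with the residue numbering in a structure file.

```python
def residue_correspondences(ref_aln, other_aln, gap="-"):
    """List (ref_pos, ref_res, other_pos, other_res) for columns with no gap.

    Hypothetical helper for illustration; positions are 1-based indices
    into the ungapped reference and ungapped other sequence.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned strings must have equal length")
    pairs = []
    i = j = 0
    for r, o in zip(ref_aln, other_aln):
        if r != gap:
            i += 1  # advance along the reference sequence
        if o != gap:
            j += 1  # advance along the other sequence
        if r != gap and o != gap:
            pairs.append((i, r, j, o))
    return pairs


for entry in residue_correspondences("AC-DE", "A-XDQ"):
    print(entry)
```

Columns in which either sequence has a gap produce no correspondence, so the returned list covers only the residues that the pairwise alignment actually places opposite one another.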

Methods

Results

Discussion

Conclusion