Abstract

An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could be extended to accommodate other diverse types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence- and structure-based tools and may inform medical discovery, as exemplified by a proposed similarity between a chlamydial ORFan protein and a bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available.

Highlights

  • Determining the evolutionary relatedness of two protein sequences is most successfully performed by amino acid sequence comparison [1,2,3,4,5]

  • We have developed a novel tool that discovers significantly similar trends shared between two numerical data sets

  • Since we are a protein biophysics group, we are most naturally interested in discovering new similarities between proteins, and we have discovered an interesting, statistically significant similarity between a protein unique to Chlamydia and a bacterial pore-forming protein, colicin

Summary

Introduction

Determining the evolutionary relatedness of two protein sequences is most successfully performed by amino acid sequence comparison [1,2,3,4,5]. Similar properties could exist horizontally in a sequence even when recognizable vertical conservation is lost [7]. Even if such similarities are due to analogy rather than homology [8], approaches are needed that can augment sequence-based analysis by matching patterns that may be independent of amino acid conservation at each position. It may be the case that proteins can be meaningfully characterized by other attributes, such as the energetic contributions to stability [19] or the predicted codon translation efficiency along the mRNA [20,21,22]. Such attributes are not accommodated by simple adaptation of current algorithms, largely because the scoring systems for such algorithms are based on positional sequence identity (amino acid substitution matrices) or absolute geometric structural similarity (Euclidean distance).
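To make the idea of comparing numerical profiles rather than residue identities concrete, the following is a minimal sketch and not the algorithm presented in this work. It assumes the Kyte-Doolittle scale as the hydropathy measure, uses Pearson correlation as a stand-in for a "relative shape" measure and mean absolute difference as a stand-in for an "absolute similarity" measure, and estimates significance by shuffling one profile. It compares equal-length, ungapped profiles only; all function names are hypothetical.

```python
import random
from statistics import mean

# Kyte-Doolittle hydropathy values (assumed scale, chosen for illustration).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy(seq):
    """Convert an amino acid string into a list of signed hydropathy values."""
    return [KD[aa] for aa in seq.upper()]

def pearson(x, y):
    """Shape similarity: Pearson correlation of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile has no defined shape; treat as uncorrelated
    return cov / (vx * vy) ** 0.5

def mean_abs_diff(x, y):
    """Absolute similarity: mean absolute difference (0 means identical values)."""
    return mean(abs(a - b) for a, b in zip(x, y))

def empirical_p(x, y, trials=1000, seed=0):
    """Fraction of shuffles of y that correlate with x at least as well as y does."""
    rng = random.Random(seed)
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if pearson(x, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

if __name__ == "__main__":
    a = hydropathy("MKTLLVAGIVLAS")
    b = hydropathy("MRSIIVLGAVIAT")  # different residues, similar hydropathy trend
    print(pearson(a, b), mean_abs_diff(a, b), empirical_p(a, b))
```

The two example sequences are made up for illustration and share few identical residues, so a substitution-matrix score and a profile-based score can disagree about how similar they are. A gapped alignment, as described in the abstract, would additionally need to search over insertions and deletions, which this sketch does not attempt.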

