Abstract

BackgroundThe inference of homology from statistically significant sequence similarity is a central issue in sequence alignments. So far the statistical distribution function underlying the optimal global alignments has not been completely determined.ResultsIn this study, random and real but unrelated sequences prepared in six different ways were selected as reference datasets to obtain their respective statistical distributions of global alignment scores. All alignments were carried out with the Needleman-Wunsch algorithm and optimal scores were fitted to the Gumbel, normal and gamma distributions respectively. The three-parameter gamma distribution performs the best as the theoretical distribution function of global alignment scores, as it agrees perfectly well with the distribution of alignment scores. The normal distribution also agrees well with the score distribution frequencies when the shape parameter of the gamma distribution is sufficiently large, for this is the scenario when the normal distribution can be viewed as an approximation of the gamma distribution.ConclusionWe have shown that the optimal global alignment scores of random protein sequences fit the three-parameter gamma distribution function. This would be useful for the inference of homology between sequences whose relationship is unknown, through the evaluation of gamma distribution significance between sequences.

Highlights

  • The inference of homology from statistically significant sequence similarity is a central issue in sequence alignments

  • Sequence alignment is a central problem in computational molecular biology

  • To understand whether an observed sequence similarity implies a functional or evolutionary link, or is just a chance event is the central problem for the evaluation of the statistical significance of sequence alignment scores

Read more

Summary

Introduction

The inference of homology from statistically significant sequence similarity is a central issue in sequence alignments. Sequence alignment is a central problem in computational molecular biology. Every branch of molecular biology- from database search, phylogeny reconstruction to protein structure prediction- takes sequence alignments as its foundation. The functional and/or structural properties of a new sequence could be predicted from its degree of similarity with some clearly defined and known sequences. If the similarity between two sequences is statistically significant and does not stem from sequence repeats of low complexity, the two sequences are likely to be homologous and may have similar functions and/or structures. To understand whether an observed sequence similarity implies a functional or evolutionary link, or is just a chance event is the central problem for the evaluation of the statistical significance of sequence alignment scores. The basic question to be answered is: what is the probability that a similarity score as great as that observed (page number not for citation purposes)

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call