Abstract

One of the major goals of computational sequence analysis is to find sequence similarities, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Since the degree of similarity is usually assessed by the sequence alignment score, it is necessary to know if a score is high enough to indicate a biologically interesting alignment. A powerful approach to defining score cutoffs is based on the evaluation of the statistical significance of alignments. The statistical significance of an alignment score is frequently assessed by its P-value, which is the probability that this score or a higher one can occur simply by chance, given the probabilistic models for the sequences. In this review we discuss the general role of P-value estimation in sequence analysis, and describe theoretical methods and computational approaches to the estimation of statistical significance for important classes of sequence analysis problems. In particular, we concentrate on the P-value estimation techniques for single sequence studies (both score-based and score-free), global and local pairwise sequence alignments, multiple alignments, sequence-to-profile alignments and alignments built with hidden Markov models. We anticipate that the review will be useful both to researchers working professionally in bioinformatics and to biomedical scientists interested in using contemporary methods of DNA and protein sequence analysis.

Highlights

  • Computational methods of biological sequence analysis have become an indispensable part of the modern scientist’s research arsenal

  • Although many results have been obtained in the assessment of statistical significance in biosequence analysis, many important questions remain open

  • The picture is fuzzy both for pairwise and for multiple alignments, and general efficient methods are absent

Summary

INTRODUCTION

Computational methods of biological sequence analysis have become an indispensable part of the modern scientist’s research arsenal. Waterman and Vingron [16, 18] showed that (9) can be extended to gapped local alignments in the logarithmic region, but not in the linear region. They used the Poisson clumping heuristic to model the distinct HSPs. The basic idea of the approximation is that the number of nonintersecting segment pairs with score greater than or equal to $s$ is approximately Poisson distributed with mean $K_g\, n n' e^{-g s}$, where $n$ and $n'$ are the sequence lengths, and $K_g$ and $g$ are parameters to be estimated from simulations (notice the similarity with the Poisson approximation in the single-sequence case). The formula for the total expected number of ORFs of any length with scores higher than a given threshold can be derived from (21) by summing over $l'$.
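As an illustration of how the Poisson approximation above turns into a P-value, here is a minimal sketch. It assumes only that the count of high-scoring segment pairs is Poisson with the stated mean, so the probability of seeing at least one such pair is one minus the probability of seeing none. The function name `poisson_pvalue`, the argument names `K` and `rate` (standing in for the two fitted parameters), and all numeric values are hypothetical and are not taken from the review.

```python
import math


def poisson_pvalue(score, n, n_prime, K, rate):
    """Approximate P(at least one segment pair scores >= score).

    Assumes the number of such pairs is Poisson distributed with
    mean K * n * n_prime * exp(-rate * score). K and rate are
    placeholders for parameters fitted from simulations.
    """
    mean = K * n * n_prime * math.exp(-rate * score)
    # P(count >= 1) = 1 - exp(-mean); expm1 keeps precision for tiny means.
    return -math.expm1(-mean)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the function.
    for s in (30, 40, 50):
        print(s, poisson_pvalue(s, n=300, n_prime=300, K=0.1, rate=0.3))
```

For small means the P-value is close to the mean itself, which is why the expected count and the P-value are often used interchangeably for highly significant scores.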

CONCLUSION