Statistics of large-scale sequence searching.

R Spang, M Vingron

doi:10.1093/bioinformatics/14.3.279

Abstract

Database search programs such as FASTA, BLAST or a rigorous Smith-Waterman algorithm produce lists of database entries that are assumed to be related to the query. The computation of the statistical significance of similarity scores is well established for single pairs of sequences under purely random models. However, the multi-trial context of a database search poses new problems. The credibility of a given score obtained in a database search decreases with the amount of data that is compared. To improve p-value computation for database search experiments, statistical properties of the databases, such as the distribution of sequence lengths and effects induced by frequently repeated sequence patterns, need to be taken into account. We investigated the SWISS-PROT protein database, Release 31.0, by running extensive simulations of database searches. A discrepancy is observed between the theoretical predictions and the empirical distribution. To correct for this, we evaluate the statistical significance of scores in the context of a database search using a contrasting semi-random model. This model extends purely random models by one additional parameter reflecting individual statistical properties of real databases. We call this parameter the effective size of the database.

Contact: r.spang@dkfz-heidelberg.de; m.vingron@dkfz-heidelberg.de
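
The multi-trial effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes the common extreme-value approximation for a single pairwise comparison, and it treats the effective size as a number of independent trials, which is only one possible reading of that parameter. The function names, the default values of `K` and `lam`, and the independence assumption are all hypothetical choices made for illustration.

```python
import math


def pairwise_p_value(score, m, n, K=0.1, lam=0.3):
    """P-value of one pairwise comparison under an assumed
    extreme-value approximation: p = 1 - exp(-K*m*n*exp(-lam*score)).

    m and n are the two sequence lengths; K and lam are placeholder
    values, not parameters taken from the paper.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-expected_hits)


def database_p_value(p_single, effective_size):
    """Probability that at least one of `effective_size` comparisons
    reaches the score, assuming the comparisons are independent:
    P = 1 - (1 - p_single) ** effective_size.
    """
    if p_single >= 1.0:
        return 1.0
    # Computed in log space to avoid cancellation when p_single is small.
    return -math.expm1(effective_size * math.log1p(-p_single))


if __name__ == "__main__":
    p = pairwise_p_value(score=100, m=300, n=300)
    # The same score becomes less significant as more data is compared.
    for size in (1, 1_000, 50_000):
        print(size, database_p_value(p, size))
```

In this sketch, a single pairwise p-value is inflated as the number of comparisons grows, which mirrors the statement that the credibility of a score decreases with the amount of data compared. Replacing the raw number of database entries with a smaller or larger effective size changes the corrected p-value accordingly.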
