Parameterizing sequence alignment with an explicit evolutionary model.

Elena Rivas,Sean R Eddy

doi:10.1186/s12859-015-0832-5

Abstract

BackgroundInference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence. However, the insertion and deletion penalties in the most widely used methods for inferring homology by sequence alignment, including BLAST and profile hidden Markov models (profile HMMs), are not based on any explicitly time-dependent evolutionary model. Using one fixed score system (BLOSUM62 with some gap open/extend costs, for example) corresponds to making an unrealistic assumption that all sequence relationships have diverged by the same time. Adoption of explicit time-dependent evolutionary models for scoring insertions and deletions in sequence alignments has been hindered by algorithmic complexity and technical difficulty.ResultsWe identify and implement several probabilistic evolutionary models compatible with the affine-cost insertion/deletion model used in standard pairwise sequence alignment. Assuming an affine gap cost imposes important restrictions on the realism of the evolutionary models compatible with it, as single insertion events with geometrically distributed lengths do not result in geometrically distributed insert lengths at finite times. Nevertheless, we identify one evolutionary model compatible with symmetric pair HMMs that are the basis for Smith-Waterman pairwise alignment, and two evolutionary models compatible with standard profile-based alignment.We test different aspects of the performance of these “optimized branch length” models, including alignment accuracy and homology coverage (discrimination of residues in a homologous region from nonhomologous flanking residues). We test on benchmarks of both global homologies (full length sequence homologs) and local homologies (homologous subsequences embedded in nonhomologous sequence).ConclusionsContrary to our expectations, we find that for global homologies a single long branch parameterization suffices both for distant and close homologous relationships. In contrast, we do see an advantage in using explicit evolutionary models for local homologies. Optimal branch parameterization reduces a known artifact called “homologous overextension”, in which local alignments erroneously extend through flanking nonhomologous residues.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-015-0832-5) contains supplementary material, which is available to authorized users.

Highlights

Inference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence
Constraints on evolutionary models compatible with affine gap cost The frequency of insertions and deletions in biological sequences deviates significantly from the geometric length distribution implicit in any affine gap cost model [30, 31]
We have tried to elucidate what kind of evolutionary model is compatible with well-established computationally efficient comparative models, such as pair HMMs [3] or profile HMMs [2, 3, 25, 32]

Summary

Introduction

Inference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence. The insertion and deletion penalties in the most widely used methods for inferring homology by sequence alignment, including BLAST and profile hidden Markov models (profile HMMs), are not based on any explicitly time-dependent evolutionary model. Despite the apparent maturity of both the sequence similarity searching and phylogenetic inference fields, and despite the fact that inference of sequence homology via sequence alignment is itself obviously an evolutionary question, standard sequence comparison methods for homology search such as BLAST [1] or profile HMMs [2] still do not depend on an explicitly divergence-dependent. Imposing an evolutionary model in standard affine pair and profile HMMs would allow us to optimize the parameterization (branch length) to the apparent relatedness of each comparison, instead of assuming that all sequences are at the same evolutionary distance from each other. An optimal scoring system can find the best compromise given the length of the compared sequences and their relatedness

Methods

Results

Conclusion